Window function in PySpark with a join example using 2 DataFrames (inner join)
Input details:
● File has JSON records
● Each record has fields:
  ○ user_id
  ○ card_num
  ○ merchant
  ○ category
  ○ amount
  ○ ts
### Analysis to be done: for each user, find the maximum transaction amount and the category it was spent in
Sample data:
+------+--------+---------+--------+----------+-------+
|amount|card_num| category|merchant|        ts|user_id|
+------+--------+---------+--------+----------+-------+
|   243|   C_108|     food|   M_102|1579532902|  U_104|
|   699|   C_106|cosmetics|   M_103|1581759040|  U_103|
|   228|   C_104| children|   M_110|1584161986|  U_103|
+------+--------+---------+--------+----------+-------+
Solution:
from pyspark.sql.functions import col

# This can be done without a window function: aggregate per user, then join back.
creditCardData = spark.read.json("card_transactions.json")

# Max amount spent per user
useridMaxSpendDF = creditCardData.groupBy("user_id").max("amount")
useridMaxSpendDF = useridMaxSpendDF.withColumnRenamed("max(amount)", "max_amount")
useridMaxSpendDF = useridMaxSpendDF.withColumnRenamed("user_id", "m_user_id")

# Join the per-user maxima back to the full data to recover the category
# (and any other columns) of each max-spend transaction
cond = [creditCardData.user_id == useridMaxSpendDF.m_user_id,
        creditCardData.amount == useridMaxSpendDF.max_amount]
joinedData = creditCardData.join(useridMaxSpendDF, cond, "inner")
joinedData.select(col("amount"), col("category"), col("user_id")).distinct().show()
+------+-------------+-------+
|amount|     category|user_id|
+------+-------------+-------+
|  1000|entertainment|  U_101|
|   997|    groceries|  U_102|
|   977|    groceries|  U_103|
|   977|         food|  U_103|
|   996|         food|  U_104|
|   996|     children|  U_104|
+------+-------------+-------+