Window function in PySpark with a joins example using two DataFrames (inner join)

Input details:

● The file has JSON records.
● Each record has the fields:
  ○ user_id
  ○ card_num
  ○ merchant
  ○ category
  ○ amount
  ○ ts
Analysis to be done: for each user, find the maximum amount spent and the category of that transaction.

Sample data:

+------+--------+---------+--------+----------+-------+
|amount|card_num| category|merchant|        ts|user_id|
+------+--------+---------+--------+----------+-------+
|   243|   C_108|     food|   M_102|1579532902|  U_104|
|   699|   C_106|cosmetics|   M_103|1581759040|  U_103|
|   228|   C_104| children|   M_110|1584161986|  U_103|
+------+--------+---------+--------+----------+-------+
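If you do not have the card_transactions.json file handy, a small DataFrame with the same schema can be built inline to follow along. This is only a convenience sketch mirroring the sample rows above; the actual post works against the JSON file.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("card-transactions").getOrCreate()

# Inline rows copied from the sample output above, for experimentation only.
sample_rows = [
    (243, "C_108", "food",      "M_102", 1579532902, "U_104"),
    (699, "C_106", "cosmetics", "M_103", 1581759040, "U_103"),
    (228, "C_104", "children",  "M_110", 1584161986, "U_103"),
]
columns = ["amount", "card_num", "category", "merchant", "ts", "user_id"]
creditCardData = spark.createDataFrame(sample_rows, columns)
creditCardData.show()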

Solution:

from pyspark.sql.functions import col

# This can be done without using a window function.
creditCardData = spark.read.json("card_transactions.json")

# Maximum amount spent per user.
useridMaxSpendDF = creditCardData.groupBy("user_id").max("amount")
useridMaxSpendDF = useridMaxSpendDF.withColumnRenamed("max(amount)", "max_amount")
useridMaxSpendDF = useridMaxSpendDF.withColumnRenamed("user_id", "m_user_id")

# Join back to the transactions to recover the category of each max-amount transaction.
cond = [creditCardData.user_id == useridMaxSpendDF.m_user_id,
        creditCardData.amount == useridMaxSpendDF.max_amount]
joinedData = creditCardData.join(useridMaxSpendDF, cond, "inner")
joinedData.select(col("amount"), col("category"), col("user_id")).distinct().show()

+------+-------------+-------+
|amount|     category|user_id|
+------+-------------+-------+
|  1000|entertainment|  U_101|
|   997|    groceries|  U_102|
|   977|    groceries|  U_103|
|   977|         food|  U_103|
|   996|         food|  U_104|
|   996|     children|  U_104|
+------+-------------+-------+
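For comparison, the same result can also be produced with a window function, which is what the title refers to. The sketch below assumes the same creditCardData DataFrame: rank transactions within each user by amount using a window partitioned by user_id, then keep only the top-ranked rows.

from pyspark.sql.functions import col, dense_rank
from pyspark.sql.window import Window

# Window per user, ordered by amount descending.
w = Window.partitionBy("user_id").orderBy(col("amount").desc())

# Keep only the transaction(s) with the highest amount for each user.
maxSpendPerUser = (creditCardData
                   .withColumn("rnk", dense_rank().over(w))
                   .filter(col("rnk") == 1)
                   .select("amount", "category", "user_id")
                   .distinct())
maxSpendPerUser.show()

Using dense_rank (rather than row_number) keeps all tied transactions, which matches the join-based output above where U_103 appears twice with 977.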
