Credit Card Data Analysis using PySpark: get the category in which the user has made the maximum expenditure, using a window function

Input details:

● The file contains JSON records.
● Each record has the fields:
  ○ user_id
  ○ card_num
  ○ merchant
  ○ category
  ○ amount
  ○ ts

The analysis to be done on this data is described below.

Sample data:

+------+--------+---------+--------+----------+-------+
|amount|card_num| category|merchant|        ts|user_id|
+------+--------+---------+--------+----------+-------+
|   243|   C_108|     food|   M_102|1579532902|  U_104|
|   699|   C_106|cosmetics|   M_103|1581759040|  U_103|
|   228|   C_104| children|   M_110|1584161986|  U_103|
+------+--------+---------+--------+----------+-------+
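
Each line of the file is a single JSON object (the line-delimited format that spark.read.json expects by default). The first sample row above, for instance, would correspond to a line such as:

{"user_id": "U_104", "card_num": "C_108", "merchant": "M_102", "category": "food", "amount": 243, "ts": 1579532902}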

Application: Get the category in which the user has made the maximum expenditure

Solution:

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

# create a SparkSession (already available as `spark` in the PySpark shell)
spark = SparkSession.builder.appName("card_analysis").getOrCreate()

# read the JSON transaction records into a DataFrame
df = spark.read.json("card_transactions.json")

# maximum single-transaction amount spent by each user
print(df.groupBy("user_id").max("amount").collect())

# rank each user's transactions by amount (largest first) and keep only the top row per user
windowSpec = Window.partitionBy("user_id").orderBy(col("amount").desc())
df.withColumn("row", row_number().over(windowSpec)) \
    .filter(col("row") == 1).drop("row") \
    .show()
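
Note that ordering the window by the raw amount returns each user's single largest transaction together with its category. If instead the category with the highest total spend per user is wanted, the same window pattern can be applied after an aggregation. A minimal sketch, assuming the df and imports from the solution above (the column name total_spend is introduced here only for illustration):

from pyspark.sql.functions import sum as sum_

# total expenditure per user and category
totals = df.groupBy("user_id", "category").agg(sum_("amount").alias("total_spend"))

# keep the category with the highest total spend for each user
byTotal = Window.partitionBy("user_id").orderBy(col("total_spend").desc())
totals.withColumn("row", row_number().over(byTotal)) \
    .filter(col("row") == 1).drop("row") \
    .show()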

Output:

[Row(user_id='U_102', max(amount)=997), Row(user_id='U_104', max(amount)=996), Row(user_id='U_101', max(amount)=1000), Row(user_id='U_103', max(amount)=977)]

+------+--------+-------------+--------+----------+-------+
|amount|card_num|     category|merchant|        ts|user_id|
+------+--------+-------------+--------+----------+-------+
|  1000|   C_101|entertainment|   M_100|1580163399|  U_101|
|   997|   C_103|    groceries|   M_103|1582876481|  U_102|
|   977|   C_104|    groceries|   M_101|1579402924|  U_103|
|   996|   C_108|         food|   M_106|1581391534|  U_104|
+------+--------+-------------+--------+----------+-------+
