Credit Card Data Analysis using PySpark (Get the category in which the user has made the maximum expenditure) using a PySpark window function
Input details:
- The file contains JSON records
- Each record has the fields:
  - user_id
  - card_num
  - merchant
  - category
  - amount
  - ts
Sample data:
+------+--------+---------+--------+----------+-------+
|amount|card_num| category|merchant|        ts|user_id|
+------+--------+---------+--------+----------+-------+
|   243|   C_108|     food|   M_102|1579532902|  U_104|
|   699|   C_106|cosmetics|   M_103|1581759040|  U_103|
|   228|   C_104| children|   M_110|1584161986|  U_103|
+------+--------+---------+--------+----------+-------+
Application: Get the category in which the user has made the maximum expenditure
Solution:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

# 5. Get the category in which the user has made the maximum expenditure.
# This can be done using a window function.
df = spark.read.json("card_transactions.json")

# Maximum amount spent by each user in a single transaction
print(df.groupBy("user_id").max("amount").collect())

# Rank each user's transactions by amount (descending) and keep the top row
windowSpec = Window.partitionBy("user_id").orderBy(col("amount").desc())
df.withColumn("row", row_number().over(windowSpec)) \
  .filter(col("row") == 1).drop("row") \
  .show()
Output:
[Row(user_id='U_102', max(amount)=997), Row(user_id='U_104', max(amount)=996), Row(user_id='U_101', max(amount)=1000), Row(user_id='U_103', max(amount)=977)]
+------+--------+-------------+--------+----------+-------+
|amount|card_num|     category|merchant|        ts|user_id|
+------+--------+-------------+--------+----------+-------+
|  1000|   C_101|entertainment|   M_100|1580163399|  U_101|
|   997|   C_103|    groceries|   M_103|1582876481|  U_102|
|   977|   C_104|    groceries|   M_101|1579402924|  U_103|
|   996|   C_108|         food|   M_106|1581391534|  U_104|
+------+--------+-------------+--------+----------+-------+
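Note that the solution above returns the category of each user's single largest transaction. If "maximum expenditure" is instead meant as the largest *total* spent per category, amounts must first be summed per (user_id, category). A minimal plain-Python sketch of that variant of the logic (the sample values here are made up for illustration):

```python
from collections import defaultdict

# Hypothetical sample transactions: (user_id, category, amount)
txns = [
    ("U_101", "food", 600),
    ("U_101", "food", 500),
    ("U_101", "entertainment", 1000),
    ("U_102", "groceries", 997),
]

# Sum the amounts per (user_id, category)
totals = defaultdict(int)
for user_id, category, amount in txns:
    totals[(user_id, category)] += amount

# For each user, keep the category with the largest total
best = {}
for (user_id, category), total in totals.items():
    if user_id not in best or total > best[user_id][1]:
        best[user_id] = (category, total)

# For U_101, food totals 1100 and beats the single 1000 transaction
print(best)
```

In PySpark the same idea would be a `groupBy("user_id", "category").sum("amount")` followed by the same top-1-per-user selection shown earlier.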