Credit Card Data Analysis using PySpark (Get the category in which the user has made the maximum expenditure) without using PySpark window function

Input details:

● File has JSON records
● Each record has the fields:
  ○ user_id
  ○ card_num
  ○ merchant
  ○ category
  ○ amount
  ○ ts

The analysis below is to be done on this data.

Sample data:

+------+--------+---------+--------+----------+-------+
|amount|card_num| category|merchant|        ts|user_id|
+------+--------+---------+--------+----------+-------+
|   243|   C_108|     food|   M_102|1579532902|  U_104|
|   699|   C_106|cosmetics|   M_103|1581759040|  U_103|
|   228|   C_104| children|   M_110|1584161986|  U_103|
+------+--------+---------+--------+----------+-------+
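
To sanity-check the input before the analysis, the file can be loaded and the inferred schema inspected. This is a minimal sketch, assuming the same card_transactions.json file used in the solution below (the SparkSession setup is shown here only for completeness):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("card_analysis").getOrCreate()

# Spark infers one column per JSON field: amount, card_num, category,
# merchant, ts, user_id
df = spark.read.json("card_transactions.json")

df.printSchema()
df.show(3)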

Application: Get the category in which the user has made the maximum expenditure

Solution:

from pyspark.sql.functions import col

#5. Get the category in which the user has made the maximum expenditure

# One way to do this is with a window function (shown below);
# a non-window alternative is sketched after the output.

from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

df = spark.read.json("card_transactions.json")

# getting the maximum amount spent by each user (largest single transaction)
print(df.groupby('user_id').max('amount').collect())

# rank each user's transactions by amount (highest first) and keep only the
# top row, which carries the category of that maximum spend
windowDept = Window.partitionBy("user_id").orderBy(col("amount").desc())
df.withColumn("row", row_number().over(windowDept)) \
    .filter(col("row") == 1).drop("row") \
    .show()

Output:

[Row(user_id='U_102', max(amount)=997), Row(user_id='U_104', max(amount)=996), Row(user_id='U_101', max(amount)=1000), Row(user_id='U_103', max(amount)=977)]

+------+--------+-------------+--------+----------+-------+
|amount|card_num|     category|merchant|        ts|user_id|
+------+--------+-------------+--------+----------+-------+
|  1000|   C_101|entertainment|   M_100|1580163399|  U_101|
|   997|   C_103|    groceries|   M_103|1582876481|  U_102|
|   977|   C_104|    groceries|   M_101|1579402924|  U_103|
|   996|   C_108|         food|   M_106|1581391534|  U_104|
+------+--------+-------------+--------+----------+-------+
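
Since the title asks for a solution without the PySpark window function, the same per-user maximum rows can also be obtained with a plain groupBy and a join. The sketch below is an alternative under that assumption; it reuses the df loaded above, and note that ties on a user's maximum amount would return more than one row for that user.

from pyspark.sql.functions import max as max_

# maximum transaction amount per user, computed without a window function
max_per_user = df.groupBy("user_id").agg(max_("amount").alias("amount"))

# join back on (user_id, amount) to recover the full row, including the
# category in which that maximum spend was made
df.join(max_per_user, on=["user_id", "amount"], how="inner").show()

If "maximum expenditure" is instead read as the largest total spent per category, the aggregation would first group by user_id and category and sum the amounts before picking each user's top category.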

 
