Credit Card Data Analysis using PySpark (Get the distinct list of categories in which the user has made expenditure) with execution plan
Input details:
#● File has json records
#● Each record has fields:
#○ user_id
#○ card_num
#○ merchant
#○ category
#○ amount
#○ ts
### Below analysis to be done
Sample data:
+------+--------+---------+--------+----------+-------+|amount|card_num| category|merchant| ts|user_id|+------+--------+---------+--------+----------+-------+| 243| C_108| food| M_102|1579532902| U_104|| 699| C_106|cosmetics| M_103|1581759040| U_103|| 228| C_104| children| M_110|1584161986| U_103|
Application: Get the distinct list of categories in which the user has made expenditure
Solution:
from pyspark.sql import functions as F
cardTnDF = spark.read.json(“card_transactions.json”)
#spark.read.json(“card_transactions.json”).show()
#cardTnDF.groupBy(‘user_id’)[‘category’].apply(list)
cardTnDF.groupby(‘user_id’).agg(F.collect_list(‘category’)).collect()
Output:
Row(user_id='U_102', collect_list(category)=['children', 'groceries', 'entertainment', 'children', 'groceries', 'children', 'entertainment', 'entertainment', 'food', 'cosmetics', 'food', 'food', 'children', 'groceries', 'children', 'cosmetics', 'entertainment', 'food', 'entertainment', 'entertainment', 'cosmetics', 'cosmetics', 'entertainment', 'cosmetics', 'groceries', 'entertainment', 'food', 'groceries', 'cosmetics', 'food', 'children', 'cosmetics', 'food', 'groceries', 'entertainment', 'groceries', 'entertainment', 'children', 'groceries', 'groceries', 'groceries', 'cosmetics', 'children', 'cosmetics', 'groceries', 'children', 'entertainment', 'groceries', 'entertainment', 'food', 'groceries', 'children', 'cosmetics', 'cosmetics', 'food', 'groceries', 'cosmetics', 'entertainment', 'cosmetics', 'cosmetics', 'cosmetics', 'cosmetics', 'cosmetics', 'food', 'groceries', 'entertainment', 'entertainment', 'food', 'food', 'groceries', 'groceries', 'entertainment', 'food', 'children', 'groceries', 'cosmetics', 'cosmetics', 'entertainment', 'cosmetics', 'groceries', 'cosmetics', 'cosmetics', 'entertainment', 'groceries', 'food', 'entertainment', 'groceries', 'children', 'cosmetics', 'groceries', 'groceries', 'children', 'children', 'cosmetics', 'children', 'cosmetics', 'cosmetics', 'children', 'children', 'groceries', 'children', 'children', 'children', 'children', 'groceries', 'cosmetics', 'children', 'cosmetics', 'groceries', 'entertainment', 'entertainment', 'cosmetics', 'entertainment', 'children', 'children', 'entertainment', 'food', 'entertainment', 'groceries', 'entertainment', 'cosmetics', 'children', 'groceries', 'cosmetics', 'cosmetics', 'children', 'groceries', 'groceries', 'entertainment', 'entertainment', 'entertainment', 'entertainment', 'cosmetics', 'children', 'children', 'children', 'entertainment', 'entertainment', 'entertainment', 'groceries', 'cosmetics', 'cosmetics', 'entertainment', 'food', 'cosmetics', 'children', 'children', 'entertainment', 'entertainment', 'entertainment', 'food', 'cosmetics', 'food', 'children', 'cosmetics', 'food', 'cosmetics', 'groceries', 'children', 'children', 'groceries', 'food', 'cosmetics', 'children', 'children', 'groceries', 'entertainment', 'groceries', 'entertainment', 'children', 'cosmetics', 'food', 'food', 'entertainment', 'food', 'entertainment', 'children', 'children', 'children', 'groceries', 'cosmetics', 'groceries', 'children', 'food', 'entertainment', 'children', 'cosmetics', 'children', 'cosmetics', 'cosmetics', 'entertainment', 'entertainment', 'entertainment', 'food', 'entertainment', 'groceries', 'entertainment', 'groceries', 'food', 'groceries', 'cosmetics', 'entertainment', 'food', 'cosmetics', 'children', 'food', 'entertainment', 'children', 'entertainment', 'entertainment', 'cosmetics', 'entertainment', 'children', 'children', 'food', 'entertainment', 'children', 'entertainment', 'entertainment', 'cosmetics', 'children', 'cosmetics', 'groceries', 'entertainment', 'children', 'entertainment', 'food', 'cosmetics', 'entertainment', 'cosmetics', 'children', 'entertainment', 'groceries', 'children', 'groceries', 'groceries', 'cosmetics', 'entertainment', 'cosmetics', 'groceries', 'entertainment', 'children', 'groceries', 'cosmetics', 'groceries', 'cosmetics', 'entertainment', 'food', 'entertainment', 'entertainment', 'cosmetics', 'cosmetics', 'food', 'entertainment', 'children']), Row(user_id='U_104', collect_list(category)=['food', 'groceries', 'groceries', 'entertainment', 'groceries', 'food', 'groceries', 'groceries', 'groceries', 'children', 'children', 'cosmetics', 'groceries', 'entertainment', 'cosmetics', 'children', 'food', 'entertainment', 'cosmetics', 'children', 'groceries', 'groceries', 'children', 'entertainment', 'children', 'groceries', 'children', 'cosmetics', 'cosmetics', 'cosmetics', 'children', 'children', 'food', 'food', 'food', 'entertainment', 'food', 'cosmetics', 'entertainment', 'groceries', 'entertainment', 'children', 'food', 'children', 'groceries', 'groceries', 'children', 'groceries', 'food', 'food', 'food', 'entertainment', 'entertainment', 'entertainment', 'children', 'entertainment', 'food', 'cosmetics', 'entertainment', 'children', 'cosmetics', 'cosmetics', 'cosmetics', 'children', 'cosmetics', 'entertainment', 'cosmetics', 'entertainment', 'food', 'entertainment', 'children', 'children', 'food', 'entertainment', 'food', 'groceries', 'food', 'cosmetics', 'groceries', 'entertainment', 'groceries', 'food', 'cosmetics', 'food', 'children', 'food', 'cosmetics', 'food', 'food', 'food', 'groceries', 'children', 'cosmetics', 'groceries', 'children', 'children', 'groceries', 'cosmetics', 'entertainment', 'entertainment', 'food', 'groceries', 'cosmetics', 'children', 'groceries', 'food', 'groceries', 'groceries', 'children', 'children', 'children', 'children', 'children', 'groceries', 'groceries', 'cosmetics', 'groceries', 'food', 'food', 'food', 'food', 'children', 'entertainment', 'cosmetics', 'children', 'cosmetics', 'food', 'entertainment', 'children', 'entertainment', 'cosmetics', 'entertainment', 'children', 'cosmetics', 'food', 'groceries', 'groceries', 'food', 'cosmetics', 'entertainment', 'cosmetics', 'cosmetics', 'cosmetics', 'children', 'cosmetics', 'entertainment', 'food', 'cosmetics', 'cosmetics', 'cosmetics', 'food', 'entertainment', 'entertainment', 'food', 'food', 'food', 'groceries', 'groceries', 'food', 'cosmetics', 'entertainment', 'groceries', 'groceries', 'entertainment', 'groceries', 'children', 'cosmetics', 'cosmetics', 'children', 'entertainment', 'groceries', 'cosmetics', 'cosmetics', 'cosmetics', 'children', 'cosmetics', 'cosmetics', 'groceries', 'cosmetics', 'children', 'entertainment', 'entertainment', 'food', 'children', 'groceries', 'cosmetics', 'groceries', 'entertainment', 'children', 'children', 'food', 'food', 'entertainment', 'cosmetics', 'children', 'groceries', 'entertainment', 'cosmetics', 'food', 'cosmetics', 'groceries', 'children', 'groceries', 'cosmetics', 'food', 'food', 'entertainment', 'groceries', 'cosmetics', 'food', 'children', 'children', 'groceries', 'children', 'cosmetics', 'entertainment', 'children', 'children', 'children', 'food', 'children', 'food', 'cosmetics', 'children', 'food', 'entertainment', 'groceries', 'groceries', 'children', 'groceries', 'groceries']), Row(user_id='U_101', collect_list(category)=['entertainment', 'food', 'cosmetics', 'entertainment', 'food', 'groceries', 'food', 'cosmetics', 'entertainment', 'entertainment', 'cosmetics', 'cosmetics', 'food', 'food', 'groceries', 'cosmetics', 'groceries', 'cosmetics', 'entertainment', 'children', 'children', 'entertainment', 'food', 'children', 'children', 'groceries', 'groceries', 'groceries', 'children', 'entertainment', 'children', 'food', 'entertainment', 'food', 'groceries', 'children', 'cosmetics', 'groceries', 'entertainment', 'food', 'children', 'entertainment', 'groceries', 'food', 'food', 'groceries', 'children', 'groceries', 'groceries', 'children', 'groceries', 'children', 'groceries', 'groceries', 'entertainment', 'children', 'children', 'children', 'entertainment', 'food', 'entertainment', 'food', 'food', 'cosmetics', 'children', 'cosmetics', 'food', 'groceries', 'entertainment', 'entertainment', 'children', 'children', 'entertainment', 'food', 'cosmetics', 'groceries', 'entertainment', 'cosmetics', 'entertainment', 'entertainment', 'groceries', 'entertainment', 'cosmetics', 'children', 'groceries', 'food', 'groceries', 'children', 'food', 'entertainment', 'cosmetics', 'entertainment', 'cosmetics', 'entertainment', 'groceries', 'cosmetics', 'entertainment', 'groceries', 'food', 'groceries', 'children', 'groceries', 'cosmetics', 'groceries', 'groceries', 'groceries', 'entertainment', 'children', 'groceries', 'cosmetics', 'entertainment', 'groceries', 'children', 'entertainment', 'cosmetics', 'food', 'groceries', 'children', 'children', 'groceries', 'entertainment', 'children', 'cosmetics', 'groceries', 'cosmetics', 'food', 'food', 'children', 'children', 'food', 'children', 'cosmetics', 'children', 'cosmetics', 'entertainment', 'cosmetics', 'groceries', 'entertainment', 'children', 'groceries', 'food', 'cosmetics', 'children', 'entertainment', 'cosmetics', 'cosmetics', 'cosmetics', 'entertainment', 'cosmetics', 'food', 'cosmetics', 'groceries', 'entertainment', 'entertainment', 'entertainment', 'food', 'food', 'cosmetics', 'cosmetics', 'entertainment', 'children', 'food', 'entertainment', 'children', 'groceries', 'entertainment', 'children', 'entertainment', 'cosmetics', 'entertainment', 'children', 'cosmetics', 'entertainment', 'children', 'entertainment', 'entertainment', 'children', 'food', 'entertainment', 'groceries', 'children', 'food', 'children', 'children', 'cosmetics', 'children', 'entertainment', 'entertainment', 'groceries', 'children', 'entertainment', 'groceries', 'children', 'entertainment', 'groceries', 'children', 'entertainment', 'entertainment', 'cosmetics', 'entertainment', 'entertainment', 'cosmetics', 'food', 'entertainment', 'groceries', 'food', 'groceries', 'children', 'entertainment', 'children', 'entertainment', 'entertainment', 'children', 'entertainment', 'cosmetics', 'groceries', 'groceries', 'food', 'food', 'children', 'groceries', 'groceries', 'food', 'groceries', 'children', 'food', 'entertainment', 'food', 'children', 'groceries', 'entertainment', 'food', 'cosmetics', 'cosmetics', 'food', 'food', 'cosmetics', 'children', 'cosmetics', 'cosmetics', 'food']), Row(user_id='U_103', collect_list(category)=['cosmetics', 'children', 'food', 'entertainment', 'groceries', 'cosmetics', 'entertainment', 'groceries', 'cosmetics', 'children', 'cosmetics', 'food', 'food', 'food', 'entertainment', 'children', 'food', 'entertainment', 'food', 'food', 'children', 'cosmetics', 'children', 'food', 'children', 'groceries', 'food', 'food', 'children', 'groceries', 'cosmetics', 'entertainment', 'children', 'children', 'entertainment', 'food', 'food', 'entertainment', 'cosmetics', 'cosmetics', 'children', 'groceries', 'cosmetics', 'groceries', 'food', 'cosmetics', 'groceries', 'cosmetics', 'cosmetics', 'cosmetics', 'groceries', 'groceries', 'children', 'food', 'cosmetics', 'groceries', 'entertainment', 'entertainment', 'cosmetics', 'groceries', 'food', 'groceries', 'groceries', 'cosmetics', 'cosmetics', 'children', 'food', 'food', 'food', 'groceries', 'entertainment', 'children', 'cosmetics', 'cosmetics', 'cosmetics', 'children', 'food', 'children', 'cosmetics', 'entertainment', 'food', 'entertainment', 'cosmetics', 'entertainment', 'groceries', 'entertainment', 'groceries', 'groceries', 'children', 'groceries', 'groceries', 'food', 'children', 'entertainment', 'groceries', 'groceries', 'cosmetics', 'groceries', 'cosmetics', 'entertainment', 'entertainment', 'children', 'food', 'cosmetics', 'entertainment', 'groceries', 'food', 'children', 'entertainment', 'groceries', 'children', 'entertainment', 'children', 'children', 'entertainment', 'food', 'groceries', 'entertainment', 'food', 'food', 'children', 'children', 'cosmetics', 'children', 'children', 'entertainment', 'groceries', 'entertainment', 'food', 'cosmetics', 'food', 'children', 'food', 'groceries', 'food', 'food', 'food', 'children', 'food', 'entertainment', 'children', 'groceries', 'groceries', 'groceries', 'food', 'children', 'groceries', 'food', 'cosmetics', 'children', 'groceries', 'groceries', 'food', 'cosmetics', 'groceries', 'food', 'cosmetics', 'food', 'food', 'cosmetics', 'groceries', 'entertainment', 'food', 'groceries', 'groceries', 'cosmetics', 'cosmetics', 'cosmetics', 'entertainment', 'cosmetics', 'children', 'children', 'food', 'food', 'groceries', 'food', 'food', 'groceries', 'groceries', 'food', 'children', 'food', 'entertainment', 'cosmetics', 'cosmetics', 'children', 'food', 'cosmetics', 'entertainment', 'entertainment', 'children', 'cosmetics', 'entertainment', 'groceries', 'groceries', 'cosmetics', 'children', 'groceries', 'cosmetics', 'children', 'groceries', 'entertainment', 'groceries', 'entertainment', 'entertainment', 'entertainment', 'food', 'cosmetics', 'food', 'children', 'food', 'entertainment', 'children', 'cosmetics', 'groceries', 'groceries', 'cosmetics', 'food', 'cosmetics', 'entertainment', 'children', 'cosmetics', 'groceries', 'cosmetics', 'cosmetics', 'children', 'children', 'food', 'entertainment', 'food', 'entertainment', 'entertainment', 'children', 'groceries', 'entertainment', 'entertainment', 'entertainment', 'children', 'food', 'children', 'children', 'groceries', 'children', 'entertainment', 'groceries', 'food', 'entertainment', 'food', 'cosmetics', 'children', 'cosmetics', 'cosmetics', 'food', 'entertainment', 'groceries', 'cosmetics', 'children', 'cosmetics', 'children', 'entertainment', 'children', 'cosmetics', 'groceries', 'cosmetics', 'entertainment', 'food', 'cosmetics', 'entertainment', 'entertainment', 'entertainment', 'food', 'food', 'cosmetics'])]
== Physical Plan ==AdaptiveSparkPlan (8)+- == Final Plan == ObjectHashAggregate (6) +- AQEShuffleRead (5) +- ShuffleQueryStage (4) +- Exchange (3) +- ObjectHashAggregate (2) +- Scan json (1)+- == Initial Plan == ObjectHashAggregate (7) +- Exchange (3) +- ObjectHashAggregate (2) +- Scan json (1)(1) Scan json Output [2]: [category#563, user_id#566]Batched: falseLocation: InMemoryFileIndex [file:/Users/dpq/Practice/card_transactions.json]ReadSchema: struct<category:string,user_id:string>(2) ObjectHashAggregateInput [2]: [category#563, user_id#566]Keys [1]: [user_id#566]Functions [1]: [partial_collect_list(category#563, 0, 0)]Aggregate Attributes [1]: [buf#583]Results [2]: [user_id#566, buf#584](3) ExchangeInput [2]: [user_id#566, buf#584]Arguments: hashpartitioning(user_id#566, 200), ENSURE_REQUIREMENTS, [id=#698](4) ShuffleQueryStageOutput [2]: [user_id#566, buf#584]Arguments: 0(5) AQEShuffleReadInput [2]: [user_id#566, buf#584]Arguments: coalesced(6) ObjectHashAggregateInput [2]: [user_id#566, buf#584]Keys [1]: [user_id#566]Functions [1]: [collect_list(category#563, 0, 0)]Aggregate Attributes [1]: [collect_list(category#563, 0, 0)#579]Results [2]: [user_id#566, collect_list(category#563, 0, 0)#579 AS collect_list(category)#580](7) ObjectHashAggregateInput [2]: [user_id#566, buf#584]Keys [1]: [user_id#566]Functions [1]: [collect_list(category#563, 0, 0)]Aggregate Attributes [1]: [collect_list(category#563, 0, 0)#579]Results [2]: [user_id#566, collect_list(category#563, 0, 0)#579 AS collect_list(category)#580](8) AdaptiveSparkPlanOutput [2]: [user_id#566, collect_list(category)#580]Arguments: isFinalPlan=true