Credit Card Data Analysis using PySpark (Get the distinct list of categories in which the user has made expenditure) with execution plan

Input details:

#● File has json records
#● Each record has fields:
#○ user_id
#○ card_num
#○ merchant
#○ category
#○ amount
#○ ts
### Below analysis to be done

Sample data:

+------+--------+---------+--------+----------+-------+|amount|card_num| category|merchant|        ts|user_id|+------+--------+---------+--------+----------+-------+|   243|   C_108|     food|   M_102|1579532902|  U_104||   699|   C_106|cosmetics|   M_103|1581759040|  U_103||   228|   C_104| children|   M_110|1584161986|  U_103|

Application: Get the distinct list of categories in which the user has made expenditure

Solution:

from pyspark.sql import functions as F

cardTnDF = spark.read.json(“card_transactions.json”)
#spark.read.json(“card_transactions.json”).show()
#cardTnDF.groupBy(‘user_id’)[‘category’].apply(list)
cardTnDF.groupby(‘user_id’).agg(F.collect_list(‘category’)).collect()

Output:
Row(user_id='U_102', collect_list(category)=['children', 'groceries', 'entertainment', 'children', 'groceries', 'children', 'entertainment', 'entertainment', 'food', 'cosmetics', 'food', 'food', 'children', 'groceries', 'children', 'cosmetics', 'entertainment', 'food', 'entertainment', 'entertainment', 'cosmetics', 'cosmetics', 'entertainment', 'cosmetics', 'groceries', 'entertainment', 'food', 'groceries', 'cosmetics', 'food', 'children', 'cosmetics', 'food', 'groceries', 'entertainment', 'groceries', 'entertainment', 'children', 'groceries', 'groceries', 'groceries', 'cosmetics', 'children', 'cosmetics', 'groceries', 'children', 'entertainment', 'groceries', 'entertainment', 'food', 'groceries', 'children', 'cosmetics', 'cosmetics', 'food', 'groceries', 'cosmetics', 'entertainment', 'cosmetics', 'cosmetics', 'cosmetics', 'cosmetics', 'cosmetics', 'food', 'groceries', 'entertainment', 'entertainment', 'food', 'food', 'groceries', 'groceries', 'entertainment', 'food', 'children', 'groceries', 'cosmetics', 'cosmetics', 'entertainment', 'cosmetics', 'groceries', 'cosmetics', 'cosmetics', 'entertainment', 'groceries', 'food', 'entertainment', 'groceries', 'children', 'cosmetics', 'groceries', 'groceries', 'children', 'children', 'cosmetics', 'children', 'cosmetics', 'cosmetics', 'children', 'children', 'groceries', 'children', 'children', 'children', 'children', 'groceries', 'cosmetics', 'children', 'cosmetics', 'groceries', 'entertainment', 'entertainment', 'cosmetics', 'entertainment', 'children', 'children', 'entertainment', 'food', 'entertainment', 'groceries', 'entertainment', 'cosmetics', 'children', 'groceries', 'cosmetics', 'cosmetics', 'children', 'groceries', 'groceries', 'entertainment', 'entertainment', 'entertainment', 'entertainment', 'cosmetics', 'children', 'children', 'children', 'entertainment', 'entertainment', 'entertainment', 'groceries', 'cosmetics', 'cosmetics', 'entertainment', 'food', 'cosmetics', 'children', 'children', 'entertainment', 'entertainment', 'entertainment', 'food', 'cosmetics', 'food', 'children', 'cosmetics', 'food', 'cosmetics', 'groceries', 'children', 'children', 'groceries', 'food', 'cosmetics', 'children', 'children', 'groceries', 'entertainment', 'groceries', 'entertainment', 'children', 'cosmetics', 'food', 'food', 'entertainment', 'food', 'entertainment', 'children', 'children', 'children', 'groceries', 'cosmetics', 'groceries', 'children', 'food', 'entertainment', 'children', 'cosmetics', 'children', 'cosmetics', 'cosmetics', 'entertainment', 'entertainment', 'entertainment', 'food', 'entertainment', 'groceries', 'entertainment', 'groceries', 'food', 'groceries', 'cosmetics', 'entertainment', 'food', 'cosmetics', 'children', 'food', 'entertainment', 'children', 'entertainment', 'entertainment', 'cosmetics', 'entertainment', 'children', 'children', 'food', 'entertainment', 'children', 'entertainment', 'entertainment', 'cosmetics', 'children', 'cosmetics', 'groceries', 'entertainment', 'children', 'entertainment', 'food', 'cosmetics', 'entertainment', 'cosmetics', 'children', 'entertainment', 'groceries', 'children', 'groceries', 'groceries', 'cosmetics', 'entertainment', 'cosmetics', 'groceries', 'entertainment', 'children', 'groceries', 'cosmetics', 'groceries', 'cosmetics', 'entertainment', 'food', 'entertainment', 'entertainment', 'cosmetics', 'cosmetics', 'food', 'entertainment', 'children']), Row(user_id='U_104', collect_list(category)=['food', 'groceries', 'groceries', 'entertainment', 'groceries', 'food', 'groceries', 'groceries', 'groceries', 'children', 'children', 'cosmetics', 'groceries', 'entertainment', 'cosmetics', 'children', 'food', 'entertainment', 'cosmetics', 'children', 'groceries', 'groceries', 'children', 'entertainment', 'children', 'groceries', 'children', 'cosmetics', 'cosmetics', 'cosmetics', 'children', 'children', 'food', 'food', 'food', 'entertainment', 'food', 'cosmetics', 'entertainment', 'groceries', 'entertainment', 'children', 'food', 'children', 'groceries', 'groceries', 'children', 'groceries', 'food', 'food', 'food', 'entertainment', 'entertainment', 'entertainment', 'children', 'entertainment', 'food', 'cosmetics', 'entertainment', 'children', 'cosmetics', 'cosmetics', 'cosmetics', 'children', 'cosmetics', 'entertainment', 'cosmetics', 'entertainment', 'food', 'entertainment', 'children', 'children', 'food', 'entertainment', 'food', 'groceries', 'food', 'cosmetics', 'groceries', 'entertainment', 'groceries', 'food', 'cosmetics', 'food', 'children', 'food', 'cosmetics', 'food', 'food', 'food', 'groceries', 'children', 'cosmetics', 'groceries', 'children', 'children', 'groceries', 'cosmetics', 'entertainment', 'entertainment', 'food', 'groceries', 'cosmetics', 'children', 'groceries', 'food', 'groceries', 'groceries', 'children', 'children', 'children', 'children', 'children', 'groceries', 'groceries', 'cosmetics', 'groceries', 'food', 'food', 'food', 'food', 'children', 'entertainment', 'cosmetics', 'children', 'cosmetics', 'food', 'entertainment', 'children', 'entertainment', 'cosmetics', 'entertainment', 'children', 'cosmetics', 'food', 'groceries', 'groceries', 'food', 'cosmetics', 'entertainment', 'cosmetics', 'cosmetics', 'cosmetics', 'children', 'cosmetics', 'entertainment', 'food', 'cosmetics', 'cosmetics', 'cosmetics', 'food', 'entertainment', 'entertainment', 'food', 'food', 'food', 'groceries', 'groceries', 'food', 'cosmetics', 'entertainment', 'groceries', 'groceries', 'entertainment', 'groceries', 'children', 'cosmetics', 'cosmetics', 'children', 'entertainment', 'groceries', 'cosmetics', 'cosmetics', 'cosmetics', 'children', 'cosmetics', 'cosmetics', 'groceries', 'cosmetics', 'children', 'entertainment', 'entertainment', 'food', 'children', 'groceries', 'cosmetics', 'groceries', 'entertainment', 'children', 'children', 'food', 'food', 'entertainment', 'cosmetics', 'children', 'groceries', 'entertainment', 'cosmetics', 'food', 'cosmetics', 'groceries', 'children', 'groceries', 'cosmetics', 'food', 'food', 'entertainment', 'groceries', 'cosmetics', 'food', 'children', 'children', 'groceries', 'children', 'cosmetics', 'entertainment', 'children', 'children', 'children', 'food', 'children', 'food', 'cosmetics', 'children', 'food', 'entertainment', 'groceries', 'groceries', 'children', 'groceries', 'groceries']), Row(user_id='U_101', collect_list(category)=['entertainment', 'food', 'cosmetics', 'entertainment', 'food', 'groceries', 'food', 'cosmetics', 'entertainment', 'entertainment', 'cosmetics', 'cosmetics', 'food', 'food', 'groceries', 'cosmetics', 'groceries', 'cosmetics', 'entertainment', 'children', 'children', 'entertainment', 'food', 'children', 'children', 'groceries', 'groceries', 'groceries', 'children', 'entertainment', 'children', 'food', 'entertainment', 'food', 'groceries', 'children', 'cosmetics', 'groceries', 'entertainment', 'food', 'children', 'entertainment', 'groceries', 'food', 'food', 'groceries', 'children', 'groceries', 'groceries', 'children', 'groceries', 'children', 'groceries', 'groceries', 'entertainment', 'children', 'children', 'children', 'entertainment', 'food', 'entertainment', 'food', 'food', 'cosmetics', 'children', 'cosmetics', 'food', 'groceries', 'entertainment', 'entertainment', 'children', 'children', 'entertainment', 'food', 'cosmetics', 'groceries', 'entertainment', 'cosmetics', 'entertainment', 'entertainment', 'groceries', 'entertainment', 'cosmetics', 'children', 'groceries', 'food', 'groceries', 'children', 'food', 'entertainment', 'cosmetics', 'entertainment', 'cosmetics', 'entertainment', 'groceries', 'cosmetics', 'entertainment', 'groceries', 'food', 'groceries', 'children', 'groceries', 'cosmetics', 'groceries', 'groceries', 'groceries', 'entertainment', 'children', 'groceries', 'cosmetics', 'entertainment', 'groceries', 'children', 'entertainment', 'cosmetics', 'food', 'groceries', 'children', 'children', 'groceries', 'entertainment', 'children', 'cosmetics', 'groceries', 'cosmetics', 'food', 'food', 'children', 'children', 'food', 'children', 'cosmetics', 'children', 'cosmetics', 'entertainment', 'cosmetics', 'groceries', 'entertainment', 'children', 'groceries', 'food', 'cosmetics', 'children', 'entertainment', 'cosmetics', 'cosmetics', 'cosmetics', 'entertainment', 'cosmetics', 'food', 'cosmetics', 'groceries', 'entertainment', 'entertainment', 'entertainment', 'food', 'food', 'cosmetics', 'cosmetics', 'entertainment', 'children', 'food', 'entertainment', 'children', 'groceries', 'entertainment', 'children', 'entertainment', 'cosmetics', 'entertainment', 'children', 'cosmetics', 'entertainment', 'children', 'entertainment', 'entertainment', 'children', 'food', 'entertainment', 'groceries', 'children', 'food', 'children', 'children', 'cosmetics', 'children', 'entertainment', 'entertainment', 'groceries', 'children', 'entertainment', 'groceries', 'children', 'entertainment', 'groceries', 'children', 'entertainment', 'entertainment', 'cosmetics', 'entertainment', 'entertainment', 'cosmetics', 'food', 'entertainment', 'groceries', 'food', 'groceries', 'children', 'entertainment', 'children', 'entertainment', 'entertainment', 'children', 'entertainment', 'cosmetics', 'groceries', 'groceries', 'food', 'food', 'children', 'groceries', 'groceries', 'food', 'groceries', 'children', 'food', 'entertainment', 'food', 'children', 'groceries', 'entertainment', 'food', 'cosmetics', 'cosmetics', 'food', 'food', 'cosmetics', 'children', 'cosmetics', 'cosmetics', 'food']), Row(user_id='U_103', collect_list(category)=['cosmetics', 'children', 'food', 'entertainment', 'groceries', 'cosmetics', 'entertainment', 'groceries', 'cosmetics', 'children', 'cosmetics', 'food', 'food', 'food', 'entertainment', 'children', 'food', 'entertainment', 'food', 'food', 'children', 'cosmetics', 'children', 'food', 'children', 'groceries', 'food', 'food', 'children', 'groceries', 'cosmetics', 'entertainment', 'children', 'children', 'entertainment', 'food', 'food', 'entertainment', 'cosmetics', 'cosmetics', 'children', 'groceries', 'cosmetics', 'groceries', 'food', 'cosmetics', 'groceries', 'cosmetics', 'cosmetics', 'cosmetics', 'groceries', 'groceries', 'children', 'food', 'cosmetics', 'groceries', 'entertainment', 'entertainment', 'cosmetics', 'groceries', 'food', 'groceries', 'groceries', 'cosmetics', 'cosmetics', 'children', 'food', 'food', 'food', 'groceries', 'entertainment', 'children', 'cosmetics', 'cosmetics', 'cosmetics', 'children', 'food', 'children', 'cosmetics', 'entertainment', 'food', 'entertainment', 'cosmetics', 'entertainment', 'groceries', 'entertainment', 'groceries', 'groceries', 'children', 'groceries', 'groceries', 'food', 'children', 'entertainment', 'groceries', 'groceries', 'cosmetics', 'groceries', 'cosmetics', 'entertainment', 'entertainment', 'children', 'food', 'cosmetics', 'entertainment', 'groceries', 'food', 'children', 'entertainment', 'groceries', 'children', 'entertainment', 'children', 'children', 'entertainment', 'food', 'groceries', 'entertainment', 'food', 'food', 'children', 'children', 'cosmetics', 'children', 'children', 'entertainment', 'groceries', 'entertainment', 'food', 'cosmetics', 'food', 'children', 'food', 'groceries', 'food', 'food', 'food', 'children', 'food', 'entertainment', 'children', 'groceries', 'groceries', 'groceries', 'food', 'children', 'groceries', 'food', 'cosmetics', 'children', 'groceries', 'groceries', 'food', 'cosmetics', 'groceries', 'food', 'cosmetics', 'food', 'food', 'cosmetics', 'groceries', 'entertainment', 'food', 'groceries', 'groceries', 'cosmetics', 'cosmetics', 'cosmetics', 'entertainment', 'cosmetics', 'children', 'children', 'food', 'food', 'groceries', 'food', 'food', 'groceries', 'groceries', 'food', 'children', 'food', 'entertainment', 'cosmetics', 'cosmetics', 'children', 'food', 'cosmetics', 'entertainment', 'entertainment', 'children', 'cosmetics', 'entertainment', 'groceries', 'groceries', 'cosmetics', 'children', 'groceries', 'cosmetics', 'children', 'groceries', 'entertainment', 'groceries', 'entertainment', 'entertainment', 'entertainment', 'food', 'cosmetics', 'food', 'children', 'food', 'entertainment', 'children', 'cosmetics', 'groceries', 'groceries', 'cosmetics', 'food', 'cosmetics', 'entertainment', 'children', 'cosmetics', 'groceries', 'cosmetics', 'cosmetics', 'children', 'children', 'food', 'entertainment', 'food', 'entertainment', 'entertainment', 'children', 'groceries', 'entertainment', 'entertainment', 'entertainment', 'children', 'food', 'children', 'children', 'groceries', 'children', 'entertainment', 'groceries', 'food', 'entertainment', 'food', 'cosmetics', 'children', 'cosmetics', 'cosmetics', 'food', 'entertainment', 'groceries', 'cosmetics', 'children', 'cosmetics', 'children', 'entertainment', 'children', 'cosmetics', 'groceries', 'cosmetics', 'entertainment', 'food', 'cosmetics', 'entertainment', 'entertainment', 'entertainment', 'food', 'food', 'cosmetics'])]
== Physical Plan ==AdaptiveSparkPlan (8)+- == Final Plan ==   ObjectHashAggregate (6)   +- AQEShuffleRead (5)      +- ShuffleQueryStage (4)         +- Exchange (3)            +- ObjectHashAggregate (2)               +- Scan json  (1)+- == Initial Plan ==   ObjectHashAggregate (7)   +- Exchange (3)      +- ObjectHashAggregate (2)         +- Scan json  (1)(1) Scan json Output [2]: [category#563, user_id#566]Batched: falseLocation: InMemoryFileIndex [file:/Users/dpq/Practice/card_transactions.json]ReadSchema: struct<category:string,user_id:string>(2) ObjectHashAggregateInput [2]: [category#563, user_id#566]Keys [1]: [user_id#566]Functions [1]: [partial_collect_list(category#563, 0, 0)]Aggregate Attributes [1]: [buf#583]Results [2]: [user_id#566, buf#584](3) ExchangeInput [2]: [user_id#566, buf#584]Arguments: hashpartitioning(user_id#566, 200), ENSURE_REQUIREMENTS, [id=#698](4) ShuffleQueryStageOutput [2]: [user_id#566, buf#584]Arguments: 0(5) AQEShuffleReadInput [2]: [user_id#566, buf#584]Arguments: coalesced(6) ObjectHashAggregateInput [2]: [user_id#566, buf#584]Keys [1]: [user_id#566]Functions [1]: [collect_list(category#563, 0, 0)]Aggregate Attributes [1]: [collect_list(category#563, 0, 0)#579]Results [2]: [user_id#566, collect_list(category#563, 0, 0)#579 AS collect_list(category)#580](7) ObjectHashAggregateInput [2]: [user_id#566, buf#584]Keys [1]: [user_id#566]Functions [1]: [collect_list(category#563, 0, 0)]Aggregate Attributes [1]: [collect_list(category#563, 0, 0)#579]Results [2]: [user_id#566, collect_list(category#563, 0, 0)#579 AS collect_list(category)#580](8) AdaptiveSparkPlanOutput [2]: [user_id#566, collect_list(category)#580]Arguments: isFinalPlan=true

Popular posts from this blog

In detail Accumulators explanation in Spark (Java and Scala)

Hive joins in details with examples

RDD joins using PySpark examples