How to join two DataFrames in PySpark (inner join example)

Input details:

- File has JSON records
- Each record has the fields:
  - user_id
  - card_num
  - merchant
  - category
  - amount
  - ts

Analysis to be done: for each user, find the maximum amount spent in a single transaction and the category (or categories) it was spent on.
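
spark.read.json reads newline-delimited JSON by default, so each line of card_transactions.json is one record. Based on the first row of the sample data below, a record would look like this:

{"user_id": "U_104", "card_num": "C_108", "merchant": "M_102", "category": "food", "amount": 243, "ts": 1579532902}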

Sample data:

+------+--------+---------+--------+----------+-------+
|amount|card_num| category|merchant|        ts|user_id|
+------+--------+---------+--------+----------+-------+
|   243|   C_108|     food|   M_102|1579532902|  U_104|
|   699|   C_106|cosmetics|   M_103|1581759040|  U_103|
|   228|   C_104| children|   M_110|1584161986|  U_103|
+------+--------+---------+--------+----------+-------+

Solution:

from pyspark.sql.functions import col

# This can be done without a window function: aggregate per user, then join back.
creditCardData = spark.read.json("card_transactions.json")

# Maximum amount spent by each user.
useridMaxSpendDF = creditCardData.groupBy("user_id").max("amount")

# Rename the aggregate and key columns to avoid ambiguity in the join.
useridMaxSpendDF = useridMaxSpendDF.withColumnRenamed("max(amount)", "max_amount")
useridMaxSpendDF = useridMaxSpendDF.withColumnRenamed("user_id", "m_user_id")

# Inner join: keep only the transactions that match a user's maximum amount.
cond = [creditCardData.user_id == useridMaxSpendDF.m_user_id,
        creditCardData.amount == useridMaxSpendDF.max_amount]
joinedData = creditCardData.join(useridMaxSpendDF, cond, "inner")

# distinct() removes duplicate (amount, category, user_id) rows, if any.
joinedData.select(col("amount"), col("category"), col("user_id")).distinct().show()

+------+-------------+-------+
|amount|     category|user_id|
+------+-------------+-------+
|  1000|entertainment|  U_101|
|   997|    groceries|  U_102|
|   977|    groceries|  U_103|
|   977|         food|  U_103|
|   996|         food|  U_104|
|   996|     children|  U_104|
+------+-------------+-------+
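
As the comment in the solution notes, the same result can also be obtained with a window function, which avoids the column renames and the self-join. A minimal sketch of that alternative, assuming the same creditCardData DataFrame:

from pyspark.sql.functions import col, max as max_
from pyspark.sql.window import Window

# For each row, compute the maximum amount within that user's partition.
w = Window.partitionBy("user_id")
windowed = creditCardData.withColumn("max_amount", max_("amount").over(w))

# Keep only the transactions equal to the user's maximum, then project and de-duplicate.
windowed.filter(col("amount") == col("max_amount")) \
    .select("amount", "category", "user_id") \
    .distinct() \
    .show()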
