RDD joins in PySpark with examples

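All examples below assume an active SparkContext available as sc, as in the PySpark shell. Outside the shell, a minimal local setup might look like the following sketch (the master and app name are illustrative assumptions):

from pyspark import SparkContext

# Hypothetical local setup; inside the pyspark shell, sc is already defined
sc = SparkContext(master='local[2]', appName='rdd-join-examples')
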
Full Outer Join:

The result contains all records from both RDDs; a key that exists in only one RDD is paired with None on the missing side.


rdd1 = sc.parallelize([(101, 'John'), (102, 'Matthew'), (103, 'Aakash'), (104, 'Sandeep'),
                       (105, 'Sakura'), (106, 'Abed'), (107, 'Mary'), (108, 'Kamla')], 2)
rdd2 = sc.parallelize([(101, 'USA'), (108, 'UK'), (105, 'Japan'), (102, 'Sri Lanka'),
                       (103, 'India'), (104, 'UAE'), (107, 'Germany'), (110, 'Australia')], 2)
# print(rdd1.glom().collect())  # uncomment to see how records land in the 2 partitions
# print(rdd2.glom().collect())

fullOuterJoin = rdd1.fullOuterJoin(rdd2)
print(fullOuterJoin.collect())

Output:

[(104, ('Sandeep', 'UAE')), (108, ('Kamla', 'UK')), (101, ('John', 'USA')), (105, ('Sakura', 'Japan')), (102, ('Matthew', 'Sri Lanka')), (106, ('Abed', None)), (110, (None, 'Australia')), (103, ('Aakash', 'India')), (107, ('Mary', 'Germany'))]
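
The None entries mark keys present in only one RDD: 106 (Abed) has no country and 110 (Australia) has no matching person. If downstream code expects complete records, mapValues can substitute a default; a minimal sketch (the 'Unknown' placeholder is an illustrative assumption):

# Replace None on either side of the joined value with a default
cleaned = fullOuterJoin.mapValues(
    lambda pair: (pair[0] if pair[0] is not None else 'Unknown',
                  pair[1] if pair[1] is not None else 'Unknown'))
print(cleaned.collect())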


Inner Join:

The result contains only the records whose keys appear in both RDDs.

rdd1 = sc.parallelize([(101, 'John'), (102, 'Matthew'), (103, 'Aakash'), (104, 'Sandeep'),
                       (105, 'Sakura'), (106, 'Abed'), (107, 'Mary'), (108, 'Kamla')], 2)
rdd2 = sc.parallelize([(101, 'USA'), (108, 'UK'), (105, 'Japan'), (102, 'Sri Lanka'),
                       (103, 'India'), (104, 'UAE'), (107, 'Germany'), (110, 'Australia')], 2)
# print(rdd1.glom().collect())
# print(rdd2.glom().collect())

innerJoin = rdd1.join(rdd2)
print(innerJoin.collect())

Output:

[(104, ('Sandeep', 'UAE')), (108, ('Kamla', 'UK')), (101, ('John', 'USA')), (105, ('Sakura', 'Japan')), (102, ('Matthew', 'Sri Lanka')), (103, ('Aakash', 'India')), (107, ('Mary', 'Germany'))]
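
Because the inner join drops unmatched keys, 106 (Abed) and 110 (Australia) do not appear. To work with plain (name, country) pairs instead of keyed tuples, a simple map can discard the key; a minimal sketch:

# Drop the join key, keeping only the (name, country) value
nameCountry = innerJoin.map(lambda kv: kv[1])
print(nameCountry.collect())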

Left Outer Join:

The result contains all records from the left RDD; a key with no match in the right RDD gets None for the right-side value.

rdd1 = sc.parallelize([(101, 'John'), (102, 'Matthew'), (103, 'Aakash'), (104, 'Sandeep'),
                       (105, 'Sakura'), (106, 'Abed'), (107, 'Mary'), (108, 'Kamla')], 2)
rdd2 = sc.parallelize([(101, 'USA'), (108, 'UK'), (105, 'Japan'), (102, 'Sri Lanka'),
                       (103, 'India'), (104, 'UAE'), (107, 'Germany'), (110, 'Australia')], 2)
# print(rdd1.glom().collect())
# print(rdd2.glom().collect())

leftOuterJoin = rdd1.leftOuterJoin(rdd2)
print(leftOuterJoin.collect())

Output:

[(104, ('Sandeep', 'UAE')), (108, ('Kamla', 'UK')), (101, ('John', 'USA')), (105, ('Sakura', 'Japan')), (102, ('Matthew', 'Sri Lanka')), (106, ('Abed', None)), (103, ('Aakash', 'India')), (107, ('Mary', 'Germany'))]
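
Note that 106 (Abed) is kept with None for the country, while 110 never appears because it only exists in rdd2. A leftOuterJoin also gives a convenient way to find left-side records with no match on the right (an anti-join pattern); a minimal sketch:

# Keep only keys from rdd1 that have no match in rdd2
unmatchedLeft = leftOuterJoin.filter(lambda kv: kv[1][1] is None)
print(unmatchedLeft.collect())  # [(106, ('Abed', None))]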

Right Outer Join:

The result contains all records from the right RDD; a key with no match in the left RDD gets None for the left-side value.

rdd1 = sc.parallelize([(101, 'John'), (102, 'Matthew'), (103, 'Aakash'), (104, 'Sandeep'),
                       (105, 'Sakura'), (106, 'Abed'), (107, 'Mary'), (108, 'Kamla')], 2)
rdd2 = sc.parallelize([(101, 'USA'), (108, 'UK'), (105, 'Japan'), (102, 'Sri Lanka'),
                       (103, 'India'), (104, 'UAE'), (107, 'Germany'), (110, 'Australia')], 2)
# print(rdd1.glom().collect())
# print(rdd2.glom().collect())

rightOuterJoin = rdd1.rightOuterJoin(rdd2)
print(rightOuterJoin.collect())

Output:

[(104, ('Sandeep', 'UAE')), (108, ('Kamla', 'UK')), (101, ('John', 'USA')), (105, ('Sakura', 'Japan')), (102, ('Matthew', 'Sri Lanka')), (110, (None, 'Australia')), (103, ('Aakash', 'India')), (107, ('Mary', 'Germany'))]
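
Here 110 (Australia) is kept with None for the name, while 106 is dropped because it only exists in rdd1. Symmetrically, filtering on a None left side finds right-side records with no match; a minimal sketch:

# Keep only keys from rdd2 that have no match in rdd1
unmatchedRight = rightOuterJoin.filter(lambda kv: kv[1][0] is None)
print(unmatchedRight.collect())  # [(110, (None, 'Australia'))]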

