RDD joins in PySpark, with examples
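
All examples below assume an active SparkContext bound to sc, as you get automatically in a pyspark shell or notebook. For a standalone script, a minimal setup sketch might look like this (the app name is arbitrary):

from pyspark import SparkContext

# Assumption: local mode; in a pyspark shell, `sc` already exists and this is unnecessary.
sc = SparkContext("local[*]", "RDDJoinExamples")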

Full Outer Join:

The final result returns all records from both RDDs; where a key appears in only one of them, the missing side is filled with None.

 

rdd1 = sc.parallelize([(101, 'John'), (102, 'Matthew'), (103, 'Aakash'), (104, 'Sandeep'),
                       (105, 'Sakura'), (106, 'Abed'), (107, 'Mary'), (108, 'Kamla')], 2)
rdd2 = sc.parallelize([(101, 'USA'), (108, 'UK'), (105, 'Japan'), (102, 'Sri Lanka'),
                       (103, 'India'), (104, 'UAE'), (107, 'Germany'), (110, 'Australia')], 2)
# print(rdd1.glom().collect())  # uncomment to see how records land in the 2 partitions
# print(rdd2.glom().collect())

fullOuterJoin = rdd1.fullOuterJoin(rdd2)
print(fullOuterJoin.collect())

Output:

[(104, ('Sandeep', 'UAE')), (108, ('Kamla', 'UK')), (101, ('John', 'USA')), (105, ('Sakura', 'Japan')), (102, ('Matthew', 'Sri Lanka')), (106, ('Abed', None)), (110, (None, 'Australia')), (103, ('Aakash', 'India')), (107, ('Mary', 'Germany'))]
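
Because the unmatched side comes back as None, a common follow-up is to substitute placeholders before further processing. A small sketch (the placeholder strings 'Unknown' and 'No Country' are made up for illustration):

# Replace None on either side with a placeholder, using mapValues
cleaned = fullOuterJoin.mapValues(lambda v: (v[0] or 'Unknown', v[1] or 'No Country'))
print(cleaned.collect())
# e.g. (106, ('Abed', 'No Country')), (110, ('Unknown', 'Australia')), ...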

 

Inner Join:

The final result returns only the records whose keys appear in both RDDs.

rdd1 = sc.parallelize([(101, 'John'), (102, 'Matthew'), (103, 'Aakash'), (104, 'Sandeep'),
                       (105, 'Sakura'), (106, 'Abed'), (107, 'Mary'), (108, 'Kamla')], 2)
rdd2 = sc.parallelize([(101, 'USA'), (108, 'UK'), (105, 'Japan'), (102, 'Sri Lanka'),
                       (103, 'India'), (104, 'UAE'), (107, 'Germany'), (110, 'Australia')], 2)
# print(rdd1.glom().collect())  # uncomment to see how records land in the 2 partitions
# print(rdd2.glom().collect())

innerJoin = rdd1.join(rdd2)
print(innerJoin.collect())

Output:

[(104, ('Sandeep', 'UAE')), (108, ('Kamla', 'UK')), (101, ('John', 'USA')), (105, ('Sakura', 'Japan')), (102, ('Matthew', 'Sri Lanka')), (103, ('Aakash', 'India')), (107, ('Mary', 'Germany'))]
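
Since an inner join keeps only matched keys, every value is a complete (name, country) pair and can be mapped over directly. A quick sketch that flattens the result into readable strings:

# kv is (key, (name, country)); turn each matched pair into a sentence
sentences = innerJoin.map(lambda kv: '{} lives in {}'.format(kv[1][0], kv[1][1]))
print(sentences.collect())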

Left Outer Join:

The final result returns every record from the left RDD (rdd1), paired with the matching value from the right RDD (rdd2), or None where there is no match.

rdd1 = sc.parallelize([(101, 'John'), (102, 'Matthew'), (103, 'Aakash'), (104, 'Sandeep'),
                       (105, 'Sakura'), (106, 'Abed'), (107, 'Mary'), (108, 'Kamla')], 2)
rdd2 = sc.parallelize([(101, 'USA'), (108, 'UK'), (105, 'Japan'), (102, 'Sri Lanka'),
                       (103, 'India'), (104, 'UAE'), (107, 'Germany'), (110, 'Australia')], 2)
# print(rdd1.glom().collect())  # uncomment to see how records land in the 2 partitions
# print(rdd2.glom().collect())

leftOuterJoin = rdd1.leftOuterJoin(rdd2)
print(leftOuterJoin.collect())

Output:

[(104, ('Sandeep', 'UAE')), (108, ('Kamla', 'UK')), (101, ('John', 'USA')), (105, ('Sakura', 'Japan')), (102, ('Matthew', 'Sri Lanka')), (106, ('Abed', None)), (103, ('Aakash', 'India')), (107, ('Mary', 'Germany'))]
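
The None on the right side makes it easy to emulate a left anti join, i.e. to find records that exist only in rdd1. A sketch:

# Keep only keys from rdd1 that found no match in rdd2
leftOnly = leftOuterJoin.filter(lambda kv: kv[1][1] is None)
print(leftOnly.collect())   # [(106, ('Abed', None))]

The same keys are reachable without a join via rdd1.subtractByKey(rdd2), though that returns the original (key, value) pairs rather than joined tuples.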

Right Outer Join:

The final result returns every record from the right RDD (rdd2), paired with the matching value from the left RDD (rdd1), or None where there is no match.

rdd1 = sc.parallelize([(101, 'John'), (102, 'Matthew'), (103, 'Aakash'), (104, 'Sandeep'),
                       (105, 'Sakura'), (106, 'Abed'), (107, 'Mary'), (108, 'Kamla')], 2)
rdd2 = sc.parallelize([(101, 'USA'), (108, 'UK'), (105, 'Japan'), (102, 'Sri Lanka'),
                       (103, 'India'), (104, 'UAE'), (107, 'Germany'), (110, 'Australia')], 2)
# print(rdd1.glom().collect())  # uncomment to see how records land in the 2 partitions
# print(rdd2.glom().collect())

rightOuterJoin = rdd1.rightOuterJoin(rdd2)
print(rightOuterJoin.collect())

Output:

[(104, ('Sandeep', 'UAE')), (108, ('Kamla', 'UK')), (101, ('John', 'USA')), (105, ('Sakura', 'Japan')), (102, ('Matthew', 'Sri Lanka')), (110, (None, 'Australia')), (103, ('Aakash', 'India')), (107, ('Mary', 'Germany'))]
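
Symmetrically, filtering on a None left side gives the records that exist only in rdd2 (or use rdd2.subtractByKey(rdd1)):

# Keep only keys from rdd2 that found no match in rdd1
rightOnly = rightOuterJoin.filter(lambda kv: kv[1][0] is None)
print(rightOnly.collect())  # [(110, (None, 'Australia'))]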

 
