RDD joins using PySpark examples
Full Outer Join:
Final result will return all records from both RDDs
rdd1 = sc.parallelize([(101,’John’),(102,’Matthew’),(103,’Aakash’),(104,’Sandeep’)
,(105,’Sakura’),(106,’Abed’),(107,’Mary’),(108,’Kamla’)],2)
rdd2 = sc.parallelize([(101,’USA’),(108,’UK’),(105,’Japan’),(102,’Sri Lanka’)
,(103,’India’),(104,’UAE’),(107,’Germany’),(110,’Australia’)],2)
#print(rdd1.glom().collect())
#print(rdd2.glom().collect())
fullOuterJoin = rdd1.fullOuterJoin(rdd2)
print(fullOuterJoin.collect())
Output:
[(104, ('Sandeep', 'UAE')), (108, ('Kamla', 'UK')), (101, ('John', 'USA')), (105, ('Sakura', 'Japan')), (102, ('Matthew', 'Sri Lanka')), (106, ('Abed', None)), (110, (None, 'Australia')), (103, ('Aakash', 'India')), (107, ('Mary', 'Germany'))]
Inner Join:
Final result will return matching records from both RDDs
rdd1 = sc.parallelize([(101,’John’),(102,’Matthew’),(103,’Aakash’),(104,’Sandeep’)
,(105,’Sakura’),(106,’Abed’),(107,’Mary’),(108,’Kamla’)],2)
rdd2 = sc.parallelize([(101,’USA’),(108,’UK’),(105,’Japan’),(102,’Sri Lanka’)
,(103,’India’),(104,’UAE’),(107,’Germany’),(110,’Australia’)],2)
#print(rdd1.glom().collect())
#print(rdd2.glom().collect())
innerJoin = rdd1.join(rdd2)
print(innerJoin.collect())
Output:
[(104, ('Sandeep', 'UAE')), (108, ('Kamla', 'UK')), (101, ('John', 'USA')), (105, ('Sakura', 'Japan')), (102, ('Matthew', 'Sri Lanka')), (103, ('Aakash', 'India')), (107, ('Mary', 'Germany'))]
Left Join:
Final result will return all records from Left RDD and matching records from Right RDD
rdd1 = sc.parallelize([(101,’John’),(102,’Matthew’),(103,’Aakash’),(104,’Sandeep’)
,(105,’Sakura’),(106,’Abed’),(107,’Mary’),(108,’Kamla’)],2)
rdd2 = sc.parallelize([(101,’USA’),(108,’UK’),(105,’Japan’),(102,’Sri Lanka’)
,(103,’India’),(104,’UAE’),(107,’Germany’),(110,’Australia’)],2)
#print(rdd1.glom().collect())
#print(rdd2.glom().collect())
leftOuterjoin = rdd1.leftOuterJoin(rdd2)
print(leftOuterjoin.collect())
Output:
[(104, ('Sandeep', 'UAE')), (108, ('Kamla', 'UK')), (101, ('John', 'USA')), (105, ('Sakura', 'Japan')), (102, ('Matthew', 'Sri Lanka')), (110, (None, 'Australia')), (103, ('Aakash', 'India')), (107, ('Mary', 'Germany'))]
Right Join:
Final result will return all records from Right RDD and matching records from Left RDD
rdd1 = sc.parallelize([(101,’John’),(102,’Matthew’),(103,’Aakash’),(104,’Sandeep’)
,(105,’Sakura’),(106,’Abed’),(107,’Mary’),(108,’Kamla’)],2)
rdd2 = sc.parallelize([(101,’USA’),(108,’UK’),(105,’Japan’),(102,’Sri Lanka’)
,(103,’India’),(104,’UAE’),(107,’Germany’),(110,’Australia’)],2)
#print(rdd1.glom().collect())
#print(rdd2.glom().collect())
rightOuterjoin = rdd1.rightOuterJoin(rdd2)
print(rightOuterjoin.collect())
Output:
[(104, ('Sandeep', 'UAE')), (108, ('Kamla', 'UK')), (101, ('John', 'USA')), (105, ('Sakura', 'Japan')), (102, ('Matthew', 'Sri Lanka')), (106, ('Abed', None)), (103, ('Aakash', 'India')), (107, ('Mary', 'Germany'))]