
Analyse a weather input dataset and apply RDD aggregation functions

# Input weather dataset
fileRdd = sc.textFile('weather.csv')   # sc is the SparkContext, so textFile() is called on it directly
fileRdd.collect()

Output:
['2016-05-09,234893,34', '2019-09-08,234896,3', '2019-11-19,234895,24', '2017-04-04,234900,43', '2013-12-04,234900,47', '2019-08-28,234894,5', '2013-11-29,234897,-5', '2018-08-19,234895,40', '2019-06-06,234890,9', '2017-02-09,234900,21', '2017-03-30,234893,36', '2019-01-01,234895,-9', '2010-12-23,234898,-1', '2015-09-03,234890,13', '2011-07-19,234898,25', '2014-09-27,234897,29', '2013-11-18,234891,-9', '2012-02-09,234893,10', '2014-07-03,234897,29', '2011-11-07,234895,38', '2014-02-14,234891,24', '2012-02-18,234893,5', '2010-01-31,234896,-8', '2015-08-19,234890,-7', '2017-03-26,234891,-1', '2011-04-23,234894,23', '2014-09-15,234898,-8', '2011-06-16,234890,33', '201
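The excerpt above only shows the dataset being loaded; the aggregation step itself is cut off. A minimal sketch of what such aggregations could look like follows, assuming the three columns are date, station id and temperature (the names pairRdd and maxPerStation are illustrative, not from the original post):

# Assumption: each row is 'date,station,temperature'
pairRdd = fileRdd.map(lambda line: line.split(',')) \
                 .map(lambda cols: (cols[1], int(cols[2])))   # (station, temperature)

# Maximum temperature recorded per station
maxPerStation = pairRdd.reduceByKey(lambda a, b: a if a > b else b)
print(maxPerStation.collect())

# Overall minimum temperature across the whole dataset
print(fileRdd.map(lambda line: int(line.split(',')[2])).min())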

RDD joins in PySpark with examples

Full Outer Join: the final result returns all records from both RDDs; keys that exist in only one RDD get None on the other side.

rdd1 = sc.parallelize([(101,'John'),(102,'Matthew'),(103,'Aakash'),(104,'Sandeep'),
                       (105,'Sakura'),(106,'Abed'),(107,'Mary'),(108,'Kamla')], 2)
rdd2 = sc.parallelize([(101,'USA'),(108,'UK'),(105,'Japan'),(102,'Sri Lanka'),
                       (103,'India'),(104,'UAE'),(107,'Germany'),(110,'Australia')], 2)
# print(rdd1.glom().collect())
# print(rdd2.glom().collect())
fullOuterJoin = rdd1.fullOuterJoin(rdd2)
print(fullOuterJoin.collect())

Output:
[(104, ('Sandeep', 'UAE')), (108, ('Kamla', 'UK')), (101, ('John', 'USA')), (105, ('Sakura', 'Japan')), (102, ('Matthew', 'Sri Lanka')), (106, ('Abed', None)), (110, (None, 'Australia')), (103, ('Aakash', 'India')), (107, ('Mary', 'Germany'))]

Inner Join: the final result returns only the records whose keys match in both RDDs.

rdd1 = sc.parallelize([(101,'John'),(102,'Matthew'),(103,'Aa
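The inner-join excerpt is cut off above. A minimal sketch of how it could continue, assuming it reuses the same rdd1 and rdd2 pairs as the full outer join example (the output is therefore not reproduced here):

# Sketch only: rdd1/rdd2 assumed identical to the full outer join example above
innerJoin = rdd1.join(rdd2)    # join() is the inner join: keeps only keys present in both RDDs
print(innerJoin.collect())     # keys 106 (Abed) and 110 (Australia) would be dropped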