Showing posts from June, 2021

Spark - RDD Transformations & Actions examples in Java

Transformation: A transformation produces a new RDD. There are several ways to obtain one. Read input from a Hadoop-compatible source (such as HDFS, Hive, or HBase) to create an RDD. Transform a parent RDD to get a new RDD. Create a distributed RDD from data on a single machine with the parallelize or makeRDD functions. The two differ as follows: (A) makeRDD has an overload that accepts location preferences for the data, which parallelize does not accept. (B) Both functions return a ParallelCollectionRDD, but parallelize lets you specify the number of partitions, while the location-aware makeRDD overload fixes it at the size of the seq parameter. Create an RDD from a database (MySQL), a NoSQL store (HBase), S3, or data streams.

Action: An action returns a value or a result (an RDD marked for caching is materialized in memory at this point).

All transformations are lazy. The computation won’t be triggered if you submit a transformation function only. Instead, the computation is t