Spark - RDD Transformation & Action Examples in Java

Transformation:

With a transformation, we get a new RDD. There are several ways to achieve this:

  • Read input from a Hadoop-compatible file system (such as HDFS, Hive, or HBase) to create an RDD.
  • Transform a parent RDD to get a new RDD.
  • Create a distributed RDD from data on a single machine through the parallelize or makeRDD functions. The differences: (a) makeRDD has an overload that accepts location-preference information for the data, which parallelize does not; (b) both return a ParallelCollectionRDD, but parallelize lets you specify the number of partitions, while in that makeRDD overload the partition count is fixed to the size of the seq parameter.
  • Create an RDD from a database (MySQL), a NoSQL store (HBase), S3, or data streams.
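The third point above can be sketched in Java. Note that makeRDD exists only on the Scala SparkContext; from Java you create such an RDD with JavaSparkContext.parallelize. This is a minimal sketch assuming spark-core is on the classpath; the class and method names here are illustrative, not from the original post.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TransformationExample {

    // Double every element, then keep only values greater than 4.
    static List<Integer> doubleAndFilter(JavaSparkContext sc) {
        // parallelize() distributes an in-memory collection into 2 partitions.
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5), 2);

        // map() and filter() are transformations: each returns a new RDD
        // derived from its parent, leaving the parent untouched.
        return numbers.map(n -> n * 2)
                      .filter(n -> n > 4)
                      .collect(); // collect() is an action, covered below
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("TransformationExample")
                .setMaster("local[*]"); // local mode, for demonstration only
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            System.out.println(doubleAndFilter(sc)); // [6, 8, 10]
        }
    }
}
```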

Action:

With an action, we get a value or a result back in the driver program (or the RDD is materialized, for example by caching it in memory).

All transformations adopt a lazy strategy: submitting a transformation alone does not trigger any computation. The computation is triggered only when you submit an action.
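To make the lazy strategy concrete, here is a small sketch (again assuming spark-core is on the classpath; names are illustrative). The map below is only recorded in the RDD lineage; nothing runs until reduce, an action, is invoked and returns a single value to the driver.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyEvaluationExample {

    // Sum the lengths of all words; map() is lazy, reduce() triggers it.
    static int totalLength(JavaSparkContext sc) {
        JavaRDD<String> words = sc.parallelize(Arrays.asList("spark", "rdd", "action"));

        // Nothing executes here: map() only records this step in the lineage.
        JavaRDD<Integer> lengths = words.map(String::length);

        // reduce() is an action: it triggers the whole pipeline and
        // returns a single value to the driver.
        return lengths.reduce(Integer::sum);
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("LazyEvaluationExample")
                .setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            System.out.println(totalLength(sc)); // 5 + 3 + 6 = 14
        }
    }
}
```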
