Spark - RDD Transformation & Action Examples in Java

Transformation:

With a transformation, we get a new RDD. There are several ways to achieve this:

  • Read input from a Hadoop-compatible file system (such as HDFS, Hive, or HBase) to create an RDD.
  • Transform a parent RDD to get a new RDD.
  • Create a distributed RDD from data on a single machine through the parallelize or makeRDD functions. The differences: (a) makeRDD has an overload that accepts location-preference information for the data, which parallelize does not; (b) both return a ParallelCollectionRDD, but parallelize lets you specify the number of partitions, while in that makeRDD overload the partition count is fixed to the size of the seq parameter.
  • Create an RDD from a database (MySQL), a NoSQL store (HBase), S3, or data streams.
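The third point above can be sketched in Java. Note that makeRDD exists only on the Scala SparkContext; from Java you create such an RDD with JavaSparkContext.parallelize. This is a minimal sketch assuming spark-core is on the classpath; the class and method names here are illustrative, not from the original post.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TransformationExample {

    // Double every element, then keep only values greater than 4.
    static List<Integer> doubleAndFilter(JavaSparkContext sc) {
        // parallelize() distributes an in-memory collection into 2 partitions.
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5), 2);

        // map() and filter() are transformations: each returns a new RDD
        // derived from its parent, leaving the parent untouched.
        return numbers.map(n -> n * 2)
                      .filter(n -> n > 4)
                      .collect(); // collect() is an action, covered below
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("TransformationExample")
                .setMaster("local[*]"); // local mode, for demonstration only
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            System.out.println(doubleAndFilter(sc)); // [6, 8, 10]
        }
    }
}
```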

Action:

With an action, we get a value or a result back in the driver program (or the RDD is materialized, for example by caching it in memory).

All transformations adopt a lazy strategy: submitting a transformation alone does not trigger any computation. The computation is triggered only when you submit an action.
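To make the lazy strategy concrete, here is a small sketch (again assuming spark-core is on the classpath; names are illustrative). The map below is only recorded in the RDD lineage; nothing runs until reduce, an action, is invoked and returns a single value to the driver.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyEvaluationExample {

    // Sum the lengths of all words; map() is lazy, reduce() triggers it.
    static int totalLength(JavaSparkContext sc) {
        JavaRDD<String> words = sc.parallelize(Arrays.asList("spark", "rdd", "action"));

        // Nothing executes here: map() only records this step in the lineage.
        JavaRDD<Integer> lengths = words.map(String::length);

        // reduce() is an action: it triggers the whole pipeline and
        // returns a single value to the driver.
        return lengths.reduce(Integer::sum);
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("LazyEvaluationExample")
                .setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            System.out.println(totalLength(sc)); // 5 + 3 + 6 = 14
        }
    }
}
```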
