What could be the reasons that reduce() is defined as an action while reduceByKey() is defined as a transformation by the Spark developers?
Because reduce() aggregates/combines all the elements of an RDD into a single value, it cannot result in an RDD: the output is one value, returned to the driver. reduceByKey(), defined on RDDs of pairs, aggregates/combines the elements for each key, so its output is a set of (key, value) results. Since that output may still be a large distributed collection and may still be processed by further transformations, it makes sense to keep it as an RDD[(K, V)]; this is optimal from a pipelining perspective.
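Here is a minimal sketch of that difference in behavior (a hypothetical local-mode app; the object name and sample data are illustrative, the Spark calls themselves are standard RDD API):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReduceVsReduceByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("reduce-vs-reduceByKey").setMaster("local[*]"))

    // reduce is an action: it collapses the whole RDD into one value
    // and returns that value to the driver immediately.
    val total: Int = sc.parallelize(Seq(1, 2, 3, 4)).reduce(_ + _) // 10

    // reduceByKey is a transformation: it returns another RDD, so the
    // result stays distributed and can be pipelined lazily.
    val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
      .reduceByKey(_ + _)   // still an RDD[(String, Int)]; nothing has run yet
      .mapValues(_ * 10)    // keep chaining transformations

    counts.collect().foreach(println) // only this action triggers the job

    sc.stop()
  }
}
```

Note that the reduceByKey() line by itself launches no work at all; the job only runs when an action such as collect() is finally called.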
The key differences between reduce() and reduceByKey() are:
reduce() returns a plain value on the driver, which does not add a node to the directed acyclic graph (DAG), so it is implemented as an action: once the value is returned, we no longer refer to it as an RDD, which is the basic dataset unit in Spark. reduceByKey(), however, returns an RDD, which is just another level/stage in the DAG, and is therefore a transformation.
reduce() is a function that operates on an RDD of objects, while reduceByKey() operates on an RDD of key-value pairs. To put it more technically, reduce() is a member of the RDD[T] class, while reduceByKey() is a member of the PairRDDFunctions[K, V] class.
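The return types in the two signatures make this concrete. A small sketch (again a hypothetical local-mode app; the signatures quoted in the comments are from the Spark RDD API):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ReduceSignatures {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("signatures").setMaster("local[*]"))

    // RDD[T] declares:  def reduce(f: (T, T) => T): T
    // The plain return type T is why reduce is an action.
    val objects: RDD[Int] = sc.parallelize(1 to 4)
    val sum: Int = objects.reduce(_ + _)

    // PairRDDFunctions[K, V] declares:
    //   def reduceByKey(func: (V, V) => V): RDD[(K, V)]
    // It is reachable from any RDD[(K, V)] through an implicit conversion,
    // and its RDD return type is why reduceByKey is a transformation.
    val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
    val reduced: RDD[(String, Int)] = pairs.reduceByKey(_ + _)

    println(sum)                       // 10
    reduced.collect().foreach(println) // (a,3) and (b,3)

    sc.stop()
  }
}
```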