What could be the reasons that reduce() is defined as an action while reduceByKey() is defined as a transformation by the Spark developers?
Because reduce() aggregates/combines all the elements of an RDD into a single value, it cannot result in an RDD: the output is one value, returned to the driver. reduceByKey(), defined on RDDs of pairs, aggregates/combines the elements for each key, so its output is a set of (key, value) results. Since that output may still be a large distributed collection and may still be processed by further transformations, it makes sense to keep it as an RDD[(K, V)]; this is optimal from a pipelining perspective.
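Here is a minimal sketch of that difference in behavior (a hypothetical local-mode app; the object name and sample data are illustrative, the Spark calls themselves are standard RDD API):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReduceVsReduceByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("reduce-vs-reduceByKey").setMaster("local[*]"))

    // reduce is an action: it collapses the whole RDD into one value
    // and returns that value to the driver immediately.
    val total: Int = sc.parallelize(Seq(1, 2, 3, 4)).reduce(_ + _) // 10

    // reduceByKey is a transformation: it returns another RDD, so the
    // result stays distributed and can be pipelined lazily.
    val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
      .reduceByKey(_ + _)   // still an RDD[(String, Int)]; nothing has run yet
      .mapValues(_ * 10)    // keep chaining transformations

    counts.collect().foreach(println) // only this action triggers the job

    sc.stop()
  }
}
```

Note that the reduceByKey() line by itself launches no work at all; the job only runs when an action such as collect() is finally called.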
The key differences between reduce() and reduceByKey() are:
reduce() returns a plain value on the driver, which does not add a node to the directed acyclic graph (DAG), so it is implemented as an action: once the value is returned, we no longer refer to it as an RDD, which is the basic dataset unit in Spark. reduceByKey(), however, returns an RDD, which is just another level/stage in the DAG, and is therefore a transformation.
reduce() is a function that operates on an RDD of objects, while reduceByKey() operates on an RDD of key-value pairs. To put it more technically, reduce() is a member of the RDD[T] class, while reduceByKey() is a member of the PairRDDFunctions[K, V] class.
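The return types in the two signatures make this concrete. A small sketch (again a hypothetical local-mode app; the signatures quoted in the comments are from the Spark RDD API):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ReduceSignatures {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("signatures").setMaster("local[*]"))

    // RDD[T] declares:  def reduce(f: (T, T) => T): T
    // The plain return type T is why reduce is an action.
    val objects: RDD[Int] = sc.parallelize(1 to 4)
    val sum: Int = objects.reduce(_ + _)

    // PairRDDFunctions[K, V] declares:
    //   def reduceByKey(func: (V, V) => V): RDD[(K, V)]
    // It is reachable from any RDD[(K, V)] through an implicit conversion,
    // and its RDD return type is why reduceByKey is a transformation.
    val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
    val reduced: RDD[(String, Int)] = pairs.reduceByKey(_ + _)

    println(sum)                       // 10
    reduced.collect().foreach(println) // (a,3) and (b,3)

    sc.stop()
  }
}
```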