Spark Accumulators Explained in Detail (Java and Scala)
Accumulators are
- Shared variables in Spark
- Only “added” to through an associative and commutative operation
- Used to implement counters (similar to MapReduce counters) or sum operations
Spark by default supports creating accumulators of any numeric type and provides the ability to add custom accumulator types.
Developers can create the following kinds of accumulators:
- Named accumulators – visible on the Spark web UI under the “Accumulator” tab. This tab shows two tables: the first, “accumulable”, lists all named accumulator variables and their values; the second, “Tasks”, shows the value of each accumulator as modified by each task.
- Unnamed accumulators – these are not shown in the UI, so for all practical purposes it is advisable to use named accumulators.
Spark by default provides accumulator methods for long, double, and collection types.
For example, you can create a long accumulator on spark-shell using:
scala> val accum = sc.longAccumulator("SumAccumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(SumAccumulator), value: 0)
The above statement creates a named accumulator “SumAccumulator”. Now, let’s see how to add up the elements of an array using this accumulator.
Each of these accumulator classes has several methods. Among these, the add() method is called from tasks running on the cluster. Tasks cannot read the accumulator’s value; only the driver program can read it, using the value() method.
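Putting this together, here is a minimal sketch that sums the elements of an array with a long accumulator. It assumes a local-mode SparkSession; the application name is illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Local-mode session for illustration; the app name is arbitrary.
val spark = SparkSession.builder()
  .appName("AccumulatorExample")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

val accum = sc.longAccumulator("SumAccumulator")

// add() runs inside the tasks on the executors;
// only the driver can read the result via value.
sc.parallelize(Array(1, 2, 3, 4, 5)).foreach(x => accum.add(x))

println(accum.value) // 15

spark.stop()
```

Note that calling accum.value inside a task would not work; reading is driver-only.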
Long Accumulator
The longAccumulator() method of SparkContext returns a LongAccumulator.
Syntax
You can create a named accumulator of long type using SparkContext.longAccumulator(name),
and for an unnamed one, use the overload that takes no argument.
The LongAccumulator class provides the following methods:
- isZero
- copy
- reset
- add
- count
- sum
- avg
- merge
- value
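A short sketch of the aggregation methods listed above, again in local mode (the accumulator name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LongAccumulatorMethods")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

val acc = sc.longAccumulator("MethodsDemo")
sc.parallelize(Array(10L, 20L, 30L)).foreach(x => acc.add(x))

// Read the aggregated state on the driver.
val total = acc.sum   // 60
val n     = acc.count // 3 -- number of add() calls
val mean  = acc.avg   // 20.0

println(s"sum=$total count=$n avg=$mean isZero=${acc.isZero}")

acc.reset() // back to the zero state
println(acc.isZero) // true

spark.stop()
```

count tracks how many values were added, which is what makes avg possible on top of sum.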
Double Accumulator
Syntax
You can create a named accumulator of double type using SparkContext.doubleAccumulator(name),
and for an unnamed one, use the overload that takes no argument.
The DoubleAccumulator class provides methods similar to LongAccumulator.
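For example, a named double accumulator can be used the same way (a sketch in local mode; names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DoubleAccumulatorExample")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

val dblAcc = sc.doubleAccumulator("DoubleSum")
sc.parallelize(Array(1.5, 2.5, 3.0)).foreach(x => dblAcc.add(x))

println(dblAcc.sum) // 7.0
println(dblAcc.avg) // roughly 2.333

spark.stop()
```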
Collection Accumulator
Syntax
You can create a named collection accumulator using SparkContext.collectionAccumulator(name),
and for an unnamed one, use the overload that takes no argument.
The CollectionAccumulator class provides the following methods:
- isZero
- copyAndReset
- copy
- reset
- add
- merge
- value
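A collection accumulator gathers individual values added by tasks rather than a numeric aggregate. A sketch in local mode (names are illustrative); note that value returns a java.util.List and element order across tasks is not guaranteed:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CollectionAccumulatorExample")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

val evens = sc.collectionAccumulator[Int]("EvensSeen")
sc.parallelize(1 to 6).foreach { x => if (x % 2 == 0) evens.add(x) }

// value is a java.util.List; ordering depends on task scheduling.
println(evens.value)

spark.stop()
```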