Spark RDD Operations with Examples
Resilient Distributed Datasets (RDDs)
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
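Both creation paths look like this in the Scala shell; the HDFS path below is a hypothetical placeholder, not a real dataset:
// Parallelizing an existing collection in the driver program
val fromCollection = sc.parallelize(Array(1, 2, 3, 4, 5))
// Referencing a dataset in an external storage system (hypothetical path)
val fromFile = sc.textFile("hdfs://namenode:9000/data/sample.txt")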
RDD method types:
- Transformations: filter, map, flatMap, reduceByKey, groupByKey
- Actions: count, take, collect (see the sketch below)
- Persistence: cache (also sketched below)
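The actions and the cache method are not covered by the transformation examples that follow, so here is a minimal sketch using a small parallelized collection (actionRdd is a name introduced just for this illustration):
val actionRdd = sc.parallelize(Array(10, 20, 30, 40, 50))
actionRdd.count     // action: number of elements, returns 5
actionRdd.take(2)   // action: first 2 elements, returns Array(10, 20)
actionRdd.collect   // action: all elements to the driver, returns Array(10, 20, 30, 40, 50)
actionRdd.cache     // persistence: keeps this RDD in memory after it is first computed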
Transformation Examples
Filter: keeps only the elements that satisfy a predicate; here, only the even numbers remain.
val numbers = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
val numbersRdd = sc.parallelize(numbers, 1)
numbersRdd.filter(num => num % 2 == 0).collect   // Array(2, 4, 6, 8, 10)
Map: used for a one-to-one transformation; each input element produces exactly one output element.
val newMapRdd = numbersRdd.map(num => num * 2).collect   // Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
FlatMap: used for a one-to-many transformation; each input element can produce zero or more output elements.
val mapObj = Array("Deepak B", "Deepak K", "Bhardwaj D")
val mapRdd = sc.parallelize(mapObj)
mapRdd.flatMap(data => data.split(" ")).collect   // Array(Deepak, B, Deepak, K, Bhardwaj, D)
GroupByKey: groups all values that share the same key into a single collection per key.
val groupArray = Array(("Citi", "101"), ("Citi", "101"), ("Citi", "101"), ("Citi2", "10"))
val groupByRdd = sc.parallelize(groupArray)
groupByRdd.groupByKey().collect   // e.g. Array((Citi,CompactBuffer(101, 101, 101)), (Citi2,CompactBuffer(10)))
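ReduceByKey is listed among the transformations above but has no example; here is a minimal sketch, using assumed numeric counts per key purely for illustration:
val reduceArray = Array(("Citi", 1), ("Citi", 1), ("Citi2", 1))
val reduceByRdd = sc.parallelize(reduceArray)
reduceByRdd.reduceByKey((a, b) => a + b).collect   // e.g. Array((Citi,2), (Citi2,1))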
Note: All examples are written in Scala and can be run in the Spark shell (spark-shell), where sc is the pre-created SparkContext.