Spark RDD Operations with Examples

Resilient Distributed Datasets (RDDs)

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
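Both creation paths can be tried directly in the spark-shell, where sc is the pre-built SparkContext. The lines below are a minimal sketch; the HDFS path is a hypothetical placeholder for whatever dataset you have available.

// 1. Parallelize an existing collection in the driver program
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Reference a dataset in external storage (local file, HDFS, S3, ...)
//    "hdfs:///data/sample.txt" is a placeholder path used only for illustration
val fromFile = sc.textFile("hdfs:///data/sample.txt")

fromCollection.count   // 5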

RDD method types:
  • Transformation: filter, map, flatMap, reduceByKey
  • Action: count, take, collect
  • Persistence: cache
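The transformation examples below all end with an action (collect) so that results are returned to the driver. As a quick sketch of the other two method types, assuming a small RDD built with parallelize:

val demoRdd = sc.parallelize(1 to 100)

// Actions trigger computation and return results to the driver
demoRdd.count       // 100
demoRdd.take(3)     // Array(1, 2, 3)
demoRdd.collect     // Array(1, 2, ..., 100)

// Persistence: cache keeps the RDD in memory after it is first computed,
// so later actions reuse it instead of recomputing the whole lineage
demoRdd.cache()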

Transformation Examples

Filter: keeps only the elements that satisfy the predicate.

val numbers = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
val numbersRdd = sc.parallelize(numbers, 1)
numbersRdd.filter(num => num % 2 == 0).collect   // Array(2, 4, 6, 8, 10)

Map: applies a function to each element, a one-to-one transformation.

val newMapRdd = numbersRdd.map(num => num * 2)
newMapRdd.collect   // Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)

FlatMap: applies a function to each element and flattens the results, so one input element can produce zero or more output elements.

val mapObj = Array("Deepak B", "Deepak K", "Bhardwaj D")
val mapRdd = sc.parallelize(mapObj)
mapRdd.flatMap(data => data.split(" ")).collect   // Array(Deepak, B, Deepak, K, Bhardwaj, D)

GroupByKey: groups the values of a pair RDD by key.

val groupArray = Array(Tuple2("Citi", "101"), Tuple2("Citi", "101"), Tuple2("Citi", "101"), Tuple2("Citi2", "10"))
val groupByRdd = sc.parallelize(groupArray)
groupByRdd.groupByKey().collect   // Array((Citi, CompactBuffer(101, 101, 101)), (Citi2, CompactBuffer(10)))
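ReduceByKey, listed among the transformations above, is usually preferred over groupByKey when you only need one aggregated value per key, because values are combined within each partition before the shuffle. A rough sketch on similar tuple data, using integer counts (an assumption here) so they can be summed:

val countArray = Array(("Citi", 101), ("Citi", 101), ("Citi", 101), ("Citi2", 10))
val countRdd = sc.parallelize(countArray)
countRdd.reduceByKey(_ + _).collect   // Array((Citi, 303), (Citi2, 10))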

 

Note: All examples are written in Scala and can be run directly in the spark-shell.
