When would you choose RDDs over Spark SQL and DataFrames?

RDDs, or Resilient Distributed Datasets, are the fundamental data structure of Spark. An RDD is a collection of objects partitioned across the nodes of a cluster, which allows the partitions to be processed in parallel.
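
A minimal sketch of this in PySpark, assuming a local Spark installation; the app name, partition count, and sample data are illustrative:

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "rdd-basics")

# Distribute a collection across 4 partitions; each partition can be
# processed by a different executor core in parallel.
rdd = sc.parallelize(range(1, 101), numSlices=4)

# Transformations (map, filter) are lazy; the action (sum) triggers
# the parallel computation across all partitions.
total = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0).sum()
print(total)

sc.stop()
```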

RDDs are also fault-tolerant: if a node fails after you have performed multiple transformations, Spark can rebuild the lost partitions automatically from the RDD's lineage of transformations.
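
You can inspect that lineage yourself with `toDebugString()`. A short sketch, where `logs.txt` is a hypothetical input file used only for illustration:

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-lineage")

# Build a small chain of transformations over a text file.
words = sc.textFile("logs.txt").flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# toDebugString() shows the lineage graph; if a partition is lost,
# Spark re-runs only these recorded steps to rebuild it.
print(counts.toDebugString().decode("utf-8"))

sc.stop()
```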

When to use RDDs?

We can use RDDs in the following situations (a short PySpark sketch follows the list):

  1. you want low-level transformations and actions, and fine-grained control over your dataset;
  2. your data is unstructured, such as media streams or streams of text;
  3. you want to manipulate your data with functional programming constructs rather than domain-specific expressions;
  4. you don’t care about imposing a schema, such as a columnar format, while processing or accessing data attributes by name or column; and
  5. you can forgo some of the optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.
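
A sketch of points 2 and 3: functional transformations over unstructured text with no schema imposed. The file name `access.log` and the filtering logic are assumptions for illustration:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-unstructured")

# Read raw, schema-less lines of text.
lines = sc.textFile("access.log")

# Per-record control with plain Python functions instead of
# domain-specific (SQL-like) expressions.
errors = (lines
          .filter(lambda line: "ERROR" in line)
          .map(lambda line: line.strip().lower()))

print(errors.take(5))
sc.stop()
```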
