When would you choose RDDs over Spark SQL and DataFrames?

RDDs, or Resilient Distributed Datasets, are the fundamental data structure of Spark. An RDD is a collection of objects partitioned across the nodes of a cluster, which allows the partitions to be processed in parallel.
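
A minimal sketch of this in PySpark, assuming a local Spark installation; the app name, partition count, and sample data are illustrative:

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "rdd-basics")

# Distribute a collection across 4 partitions; each partition can be
# processed by a different executor core in parallel.
rdd = sc.parallelize(range(1, 101), numSlices=4)

# Transformations (map, filter) are lazy; the action (sum) triggers
# the parallel computation across all partitions.
total = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0).sum()
print(total)

sc.stop()
```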

RDDs are also fault-tolerant: if a node fails after you have performed multiple transformations, Spark can rebuild the lost partitions automatically from the RDD's lineage of transformations.
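
You can inspect that lineage yourself with `toDebugString()`. A short sketch, where `logs.txt` is a hypothetical input file used only for illustration:

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-lineage")

# Build a small chain of transformations over a text file.
words = sc.textFile("logs.txt").flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# toDebugString() shows the lineage graph; if a partition is lost,
# Spark re-runs only these recorded steps to rebuild it.
print(counts.toDebugString().decode("utf-8"))

sc.stop()
```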

When to use RDDs?

We can use RDDs in the following situations (a short PySpark sketch follows the list):

  1. you want low-level transformations and actions, and fine-grained control over your dataset;
  2. your data is unstructured, such as media streams or streams of text;
  3. you want to manipulate your data with functional programming constructs rather than domain-specific expressions;
  4. you don’t care about imposing a schema, such as a columnar format, while processing or accessing data attributes by name or column; and
  5. you can forgo some of the optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.
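
A sketch of points 2 and 3: functional transformations over unstructured text with no schema imposed. The file name `access.log` and the filtering logic are assumptions for illustration:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-unstructured")

# Read raw, schema-less lines of text.
lines = sc.textFile("access.log")

# Per-record control with plain Python functions instead of
# domain-specific (SQL-like) expressions.
errors = (lines
          .filter(lambda line: "ERROR" in line)
          .map(lambda line: line.strip().lower()))

print(errors.take(5))
sc.stop()
```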
