RDD vs DataFrame vs Dataset

Spark initially introduced the concept of RDDs in 2011, followed by DataFrames in 2013 and Datasets in 2015. None of them has been deprecated; we can still use all of them. In this article, we will understand the differences between all three.

RDDs vs DataFrames vs Datasets

Data Representation
- RDD: a distributed collection of data elements without any schema.
- DataFrame: a distributed collection of data organized into named columns.
- Dataset: an extension of DataFrames with more features, such as type safety and an object-oriented interface.

Optimization
- RDD: no built-in optimization engine; developers need to write optimized code themselves.
- DataFrame: uses the Catalyst optimizer.
- Dataset: also uses the Catalyst optimizer.

Projection of Schema
- RDD: the schema must be defined manually.
- DataFrame: automatically infers the schema of the dataset.
- Dataset: also automatically infers the schema, using the SQL engine.

Aggregation Operations
- RDD: slower than both DataFrames and Datasets for simple operations like grouping data.
- DataFrame: provides an easy API for aggregation operations and performs them faster than both RDDs and Datasets.
- Dataset: faster than RDDs but a bit slower than DataFrames.
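To make the three representations concrete, here is a minimal Scala sketch (Datasets are only available in Scala and Java). The session setup, the `Person` case class, and the sample rows are all hypothetical illustrations, not from the original article:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session for illustration only.
val spark = SparkSession.builder()
  .appName("api-comparison")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// The case class gives the Dataset its compile-time schema.
case class Person(name: String, age: Int)
val people = Seq(Person("Alice", 34), Person("Bob", 28))

// RDD: no schema; Spark sees opaque Person objects.
val rdd = spark.sparkContext.parallelize(people)

// DataFrame: named columns, schema inferred, plans optimized by Catalyst.
val df = people.toDF()
df.printSchema()

// Dataset: named columns plus type safety; the mapping to Person
// is checked at compile time.
val ds = df.as[Person]
ds.map(p => p.age + 1)  // a typo like p.agee would fail to compile
```

The `map` on the Dataset takes a typed lambda over `Person`, which is exactly the type-safety advantage the table describes: with a DataFrame, a misspelled column name only fails at runtime.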
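The aggregation row can also be sketched in Scala. This is a self-contained sketch of the same grouping written against each API; the session, `Person` class, and sample data are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session for illustration only.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)
val ds = Seq(Person("Alice", 34), Person("Bob", 28), Person("Cara", 34)).toDS()

// RDD path: hand-rolled pair aggregation; no Catalyst optimization applies.
val byAgeRdd = ds.rdd.map(p => (p.age, 1)).reduceByKey(_ + _)

// DataFrame path: declarative groupBy, planned by Catalyst --
// typically the fastest of the three.
val byAgeDf = ds.toDF().groupBy("age").count()

// Dataset path: typed groupByKey; rows are deserialized into Person
// objects first, which is why it tends to be a bit slower than the
// untyped DataFrame version.
val byAgeDs = ds.groupByKey(_.age).count()
```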


End Notes

In this article, we have seen the differences between the three major APIs of Apache Spark. To conclude: if you want rich semantics, high-level abstractions, and type safety, go for DataFrames or Datasets. If you need more control over the pre-processing part, you can always use RDDs.
