Difference between DataFrame and DataSet in Apache Spark

Dataframe
========
** A Dataframe is simply a Dataset[Row], i.e. a Dataset of generic Row objects
** A Dataframe is weakly typed: mistakes such as a wrong column name surface only at runtime
** The Dataframe API is available in Java, Scala, Python, and R
** Conversion between an RDD and a Dataframe is possible in both directions (a short sketch follows this list)
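
As a minimal, hedged sketch (assuming a local SparkSession named spark and an illustrative case class Person, neither of which comes from the original post), the round trip between an RDD and a Dataframe looks like this:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    object DataframeSketch {
      // Illustrative record type, used only for this example
      case class Person(name: String, age: Int)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("dataframe-sketch").master("local[*]").getOrCreate()
        import spark.implicits._   // enables toDF() on RDDs of case classes

        // RDD -> Dataframe: column names are taken from the case class fields
        val rdd = spark.sparkContext.parallelize(Seq(Person("Asha", 30), Person("Ravi", 25)))
        val df: DataFrame = rdd.toDF()

        // Dataframe -> RDD: gives back an RDD[Row], so field types are no longer tracked
        val backToRdd = df.rdd

        df.printSchema()
        spark.stop()
      }
    }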

Dataset
=======
** A Dataset is strongly typed and provides compile-time type safety
** The Dataset API is available in Scala and Java only
** Conversion between a Dataframe and a Dataset is seamless in both directions (see the sketch after this list)
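
A minimal sketch of that seamless conversion, again assuming the same local spark session and the illustrative Person case class:

    import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

    object DatasetSketch {
      case class Person(name: String, age: Int)   // illustrative type

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("dataset-sketch").master("local[*]").getOrCreate()
        import spark.implicits._

        // Strongly typed Dataset built directly from Scala objects
        val ds: Dataset[Person] = Seq(Person("Asha", 30), Person("Ravi", 25)).toDS()

        // Dataset -> Dataframe: compile-time type information is dropped
        val df: DataFrame = ds.toDF()

        // Dataframe -> Dataset: re-attach the type with as[T]
        val typedAgain: Dataset[Person] = df.as[Person]

        typedAgain.show()
        spark.stop()
      }
    }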

Let’s understand this with a simple example.
Suppose you refer to a column that does not exist.
With a Dataframe, you find out only at runtime, when Spark fails to analyse the query.
With a Dataset, the same mistake in the typed API is caught earlier, at compile time. A hedged sketch follows.
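
Here is a sketch of that difference, under the same assumptions (local spark session, illustrative Person case class, and a deliberately misspelled column name):

    import org.apache.spark.sql.{AnalysisException, SparkSession}

    object ColumnSafetySketch {
      case class Person(name: String, age: Int)   // illustrative type

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("column-safety").master("local[*]").getOrCreate()
        import spark.implicits._

        val ds = Seq(Person("Asha", 30)).toDS()
        val df = ds.toDF()

        // Dataframe: the typo compiles fine but Spark rejects it at runtime
        try {
          df.select("nmae").show()   // misspelled column name
        } catch {
          case e: AnalysisException => println(s"Caught only at runtime: ${e.getMessage}")
        }

        // Dataset: the equivalent typo is rejected by the compiler itself
        // ds.map(p => p.nmae)       // does not compile: value nmae is not a member of Person

        spark.stop()
      }
    }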

So in summary,
the Dataset API extends the Dataframe API with typed operations,
so a Dataframe can be regarded as an untyped view of a Dataset
(in the Scala API, DataFrame is in fact just a type alias for Dataset[Row]).
