Difference between DataFrame and Dataset in Apache Spark
DataFrame
=========
** In Scala, DataFrame is simply a type alias for Dataset[Row]
** DataFrame is untyped: column names and types are checked only at runtime, not by the compiler
** The DataFrame API is available in Java, Scala, Python, and R
** Conversion from RDD to DataFrame and vice versa is possible
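The RDD conversion mentioned above can be sketched in Scala roughly as follows. This is a minimal illustration, not a definitive recipe: the object name, the helper names `toDataFrame` and `toRdd`, the column names, and the sample data are all made up for the example, and it assumes a local Spark session.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

// Hypothetical helper object, used only for illustration.
object RddDataFrameRoundTrip {

  // RDD of tuples -> DataFrame with named columns.
  def toDataFrame(spark: SparkSession, rdd: RDD[(String, Int)]): DataFrame = {
    import spark.implicits._ // brings the toDF syntax into scope
    rdd.toDF("name", "age")
  }

  // DataFrame -> RDD; the elements come back as generic Row objects.
  def toRdd(df: DataFrame): RDD[Row] = df.rdd

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-dataframe-round-trip")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(Seq(("alice", 30), ("bob", 25)))

    val df = toDataFrame(spark, rdd)
    df.printSchema()

    val rows = toRdd(df)
    rows.collect().foreach(println)

    spark.stop()
  }
}
```

Note that the round trip does not restore the original tuple type: going back through `df.rdd` yields `Row` objects, which is the untyped nature of a DataFrame showing through.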
Dataset
=======
** Dataset is strongly typed and provides compile-time type safety.
** The Dataset API is available in Scala and Java only.
** Conversion from DataFrame to Dataset (as[T]) and back (toDF()) is straightforward.
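As a rough sketch of that conversion, the snippet below moves from a DataFrame to a typed Dataset with `as[T]` and back with `toDF()`. The `Person` case class, the object and helper names, and the sample data are hypothetical and exist only for this example; a local Spark session is assumed.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, defined at top level so Spark can derive an encoder for it.
case class Person(name: String, age: Int)

object DataFrameDatasetConversion {

  // DataFrame -> Dataset[Person]: columns are matched to case class fields by name.
  def toTyped(spark: SparkSession, df: DataFrame): Dataset[Person] = {
    import spark.implicits._ // provides the implicit Encoder[Person]
    df.as[Person]
  }

  // Dataset[Person] -> DataFrame: the static element type is dropped again.
  def toUntyped(ds: Dataset[Person]): DataFrame = ds.toDF()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-dataset-conversion")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

    val people = toTyped(spark, df)
    // Typed operation: p is a Person, so p.age is checked by the compiler.
    val adults = people.filter(p => p.age >= 30)

    toUntyped(adults).show()

    spark.stop()
  }
}
```

The `as[Person]` call itself is still validated at runtime: if the DataFrame lacks a column that `Person` requires, Spark reports the mismatch when the conversion is analyzed, not when the code is compiled.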
Let’s understand this with a simple example.
Suppose you refer to a column that does not exist.
With a DataFrame, the mistake surfaces only at runtime, when Spark analyzes the query and raises an AnalysisException.
With a Dataset accessed through typed operations (for example, a lambda over a case class), the compiler rejects the code before it ever runs.
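The missing-column scenario can be sketched as follows. The `Employee` case class, the non-existent `salary` column, and the object and method names are hypothetical, chosen only to illustrate the point; a local Spark session is assumed. The line that would fail to compile is left commented out so the rest of the example still builds.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import scala.util.Try

// Hypothetical record type; note that it has no `salary` field.
case class Employee(name: String, age: Int)

object MissingColumnDemo {

  // Untyped access: the column name is just a string, so the compiler cannot
  // check it. Returns true if Spark rejects the query with an AnalysisException.
  def untypedLookupFailsAtRuntime(spark: SparkSession): Boolean = {
    import spark.implicits._
    val df = Seq(Employee("alice", 30)).toDF()
    Try(df.select("salary").collect()).failed.toOption
      .exists(_.isInstanceOf[AnalysisException])
  }

  // Typed access: fields are resolved by the Scala compiler.
  def typedNames(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    val ds = Seq(Employee("alice", 30)).toDS()
    // ds.map(_.salary) // would not compile: salary is not a member of Employee
    ds.map(_.name).collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("missing-column-demo")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    println(s"Untyped lookup failed at runtime: ${untypedLookupFailsAtRuntime(spark)}")
    println(s"Typed names: ${typedNames(spark)}")

    spark.stop()
  }
}
```

The compile-time guarantee applies only to typed operations such as `map` or `filter` with a lambda. Calling `ds.select("salary")` on a Dataset uses a string column name again and fails at runtime, just like the DataFrame version.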
In summary,
the Dataset API extends the DataFrame API with compile-time types,
so a DataFrame can be viewed as the untyped view of a Dataset, namely a Dataset[Row].