Airline dataset processing with SparkSQL
Now let's explore the dataset using Spark SQL and DataFrame transformations. After we register the DataFrame as a SQL temporary view, we can call the `sql` function on the SparkSession to run SQL queries, which return the results as a DataFrame. We cache the DataFrame because we will reuse it, and because Spark can cache DataFrames and tables in memory in a columnar format, which can improve memory usage and performance.

```scala
// cache DataFrame in columnar format in memory
df.cache

// create a table view of the DataFrame for Spark SQL
df.createOrReplaceTempView("flights")

// cache the flights table in columnar format in memory
spark.catalog.cacheTable("flights")
```

Below, we display information for the top five longest departure delays with Spark SQL and with DataFrame transformations (where a delay is considered greater than 40 minutes):

```scala
// Spark SQL
spark.sql("select carrier, origin, dest, depdelay, crsdephour, dist, dofW from flights where depdelay > 40 order by ...
```
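The Spark SQL query above is cut off in this excerpt, and the DataFrame-transformation version is missing entirely. A sketch of what the equivalent DataFrame transformations could look like is shown below; it assumes the same `df` with the columns referenced in the query (`carrier`, `origin`, `dest`, `depdelay`, `crsdephour`, `dist`, `dofW`) and is not the article's original code:

```scala
import org.apache.spark.sql.functions.{col, desc}

// DataFrame transformations: select the same columns, keep flights
// delayed more than 40 minutes, and sort by delay, longest first
df.select("carrier", "origin", "dest", "depdelay", "crsdephour", "dist", "dofW")
  .filter(col("depdelay") > 40)
  .orderBy(desc("depdelay"))
  .limit(5)
  .show()
```

Because `df` is cached and the `flights` view is cached via `spark.catalog.cacheTable`, both the SQL and the DataFrame versions read from the in-memory columnar cache after the first action.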