Airline dataset processing with SparkSQL
Now let’s explore the Dataset using Spark SQL and DataFrame transformations. After we register the DataFrame as a SQL temporary view, we can use SQL functions on the SparkSession to run SQL queries, which will return the results as a DataFrame. We cache the DataFrame, since we will reuse it and because Spark can cache DataFrames or Tables in columnar format in memory, which can improve memory usage and performance.
// cache DataFrame in columnar format in memorydf.cache// create Table view of DataFrame for Spark SQLdf.createOrReplaceTempView("flights")// cache flights table in columnar format in memoryspark.catalog.cacheTable("flights")
Below, we display information for the top five longest departure delays with Spark SQL and with DataFrame transformations (where a delay is considered greater than 40 minutes):
// Spark SQLspark.sql("select carrier,origin, dest, depdelay,crsdephour, dist, dofW from flights where depdelay > 40 order by depdelay desc limit 5").show// same query using DataFrame transformationsdf.select($"carrier",$"origin",$"dest",$"depdelay", $"crsdephour").filter($"depdelay" > 40).orderBy(desc( "depdelay" )).show(5)---result:---+-------+------+----+--------+----------+|carrier|origin|dest|depdelay|crsdephour|+-------+------+----+--------+----------+| AA| SFO| ORD| 1440.0| 8|| DL| BOS| ATL| 1185.0| 17|| UA| DEN| EWR| 1138.0| 12|| DL| ORD| ATL| 1087.0| 19|| UA| MIA| EWR| 1072.0| 20|+-------+------+----+--------+----------+
Below, we display the average departure delay by carrier:
// DataFrame transformationsdf.groupBy("carrier").agg(avg("depdelay")).show---result:---+-------+------------------+|carrier| avg(depdelay)|+-------+------------------+| UA|17.477878450696764|| AA| 10.45768118831622|| DL|15.316061660865241|| WN|13.491000418585182|+-------+-------------------
Let’s explore this data for flight delays, when the departure delay is greater than 40 minutes. Below, we see that United Airlines and Delta have the highest count of flight delays for January and February 2017 (the training set).
// __Count of Departure Delays by Carrier (where delay=40 minutes)__df.filter($"depdelay" > 40).groupBy("carrier").count.orderBy(desc( "count")).show(5)---result:---+-------+-----+|carrier|count|+-------+-----+| UA| 2420|| DL| 1043|| AA| 757|| WN| 244|+-------+-----+
// Count of Departure Delays by Carrier (where delay=40 minutes)%sqlselect carrier, count(depdelay)from flights where depdelay > 40group by carrier
In the query below, we see that Monday (1), Tuesday (2), and Sunday (7) have the highest count of flight delays.
// Count of Departure Delays by Day of the Week%sqlselect dofW, count(depdelay)from flights where depdelay > 40group by dofW
In the query below, we see that the hours between 13:00-19:00 have the highest count of flight delays.
%sqlselect crsdephour, count(depdelay)from flights where depdelay > 40group by crsdephour order by crsdephour
In the query below, we see that the originating airports, Chicago and Atlanta, have the highest count of flight delays.
%sqlselect origin, count(depdelay)from flights where depdelay > 40group by originORDER BY count(depdelay) desc
In the query below, we see the count of departure delays by origin and destination. The routes ORD->SFO and DEN->SFO have the highest delays, maybe because of weather in January and February. Adding weather to this Dataset would give better results.
%sqlselect origin, dest, count(depdelay)from flights where depdelay > 40group by origin, destORDER BY count(depdelay) desc