Sampling Data with Spark SQL: Examples Using explode Functions on Map and Array Columns
What is the explode function
The Spark SQL explode function splits an array or map DataFrame column into rows, producing one row per element. Spark defines several flavors of this function: explode_outer, which also handles null and empty arrays or maps; posexplode, which returns each element together with its position; and posexplode_outer, which returns positions and also handles null and empty arrays or maps.
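As a minimal sketch of the basic behavior, the snippet below explodes a small array column into rows. The object name ExplodeSketch, the helper exploded, and the "language" alias are illustrative choices and not part of the Spark API. It assumes a local Spark session is available.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.explode

// Hypothetical object name, used only for this illustration.
object ExplodeSketch {

  // Builds a two-row DataFrame and returns one row per (name, language) pair.
  def exploded(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val df = Seq(
      ("deepak", Seq("Java", "Scala")),
      ("anant", Seq("Spark"))
    ).toDF("name", "knownLanguages")

    // explode emits one output row per array element;
    // "language" is an alias chosen here (the default column name is "col").
    df.select($"name", explode($"knownLanguages").as("language"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("explode-sketch")
      .master("local[1]")
      .getOrCreate()
    exploded(spark).show() // three rows: two for deepak, one for anant
    spark.stop()
  }
}
```

The two input rows become three output rows, because each array element gets its own row alongside the repeated name value.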
Difference between explode vs explode_outer
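explode creates a row for each element in the array or map column and drops input rows whose array or map is null or empty. explode_outer also creates a row for each element, but it keeps input rows whose array or map is null or empty and returns null for them. Null elements inside a non-empty array are returned by both functions.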
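The difference shows up in the row counts. The following sketch assumes a local Spark session. ExplodeOuterSketch and its helper names are hypothetical. It applies both functions to three input rows: a populated array, a null array, and an empty array.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{explode, explode_outer}

// Hypothetical object name, used only for this illustration.
object ExplodeOuterSketch {

  // Three input rows: a populated array, a null array (None), and an empty array.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[(String, Option[Seq[String]])](
      ("deepak", Some(Seq("Java", "Scala"))),
      ("pune", None),
      ("Jeferson", Some(Seq.empty[String]))
    ).toDF("name", "knownLanguages")
  }

  // explode: rows with a null or empty array produce no output.
  def inner(df: DataFrame): DataFrame =
    df.select(df("name"), explode(df("knownLanguages")))

  // explode_outer: rows with a null or empty array produce one row holding null.
  def outer(df: DataFrame): DataFrame =
    df.select(df("name"), explode_outer(df("knownLanguages")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("explode-outer-sketch")
      .master("local[1]")
      .getOrCreate()
    val df = sample(spark)
    inner(df).show() // 2 rows, both for deepak
    outer(df).show() // 4 rows: 2 for deepak, 1 each for pune and Jeferson
    spark.stop()
  }
}
```

Option is used here only so that the null array can be written without a raw null; a Row-based DataFrame with an explicit schema would behave the same way.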
Difference between explode vs posexplode
explode creates a row for each element in the array or map column. posexplode also creates a row for each element, but adds the element's position. For an array it creates two columns: ‘pos’, which holds the position of the element, and ‘col’, which holds the element's value. For a map it creates three columns: ‘pos’, ‘key’ and ‘value’.
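To make the generated columns concrete, here is a small sketch that applies posexplode to one array column and one map column. PosExplodeSketch and its helper names are hypothetical, and a local Spark session is assumed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.posexplode

// Hypothetical object name, used only for this illustration.
object PosExplodeSketch {

  // One input row with an array column and a map column.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("deepak", Seq("Java", "Scala"), Map("hair" -> "black"))
    ).toDF("name", "knownLanguages", "properties")
  }

  // Array input: output columns are name, pos, col.
  def arrayPositions(df: DataFrame): DataFrame =
    df.select(df("name"), posexplode(df("knownLanguages")))

  // Map input: output columns are name, pos, key, value.
  def mapPositions(df: DataFrame): DataFrame =
    df.select(df("name"), posexplode(df("properties")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("posexplode-sketch")
      .master("local[1]")
      .getOrCreate()
    val df = sample(spark)
    arrayPositions(df).show() // pos starts at 0: (0, Java), (1, Scala)
    mapPositions(df).show()   // a single entry: (0, hair, black)
    spark.stop()
  }
}
```

Positions are zero-based, so the first array element is reported with pos 0.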
package com.dpq.spark.dataframe.functions

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{ArrayType, MapType, StringType, StructType}

object ExplodeArrayAndMap {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .appName("dpq.spark.example.com")
      .master("local[1]")
      .getOrCreate()

    // Create the DataFrame: rows with populated, null-containing,
    // empty-string, null, and empty array and map values
    val arrayData = Seq(
      Row("deepak", List("Java", "Scala", "C++"), Map("hair" -> "black", "eye" -> "brown")),
      Row("anant", List("Spark", "Java", "C++", null), Map("hair" -> "brown", "eye" -> null)),
      Row("lokesh", List("CSharp", "Python", ""), Map("hair" -> "red", "eye" -> "")),
      Row("pune", null, null),
      Row("Jeferson", List(), Map())
    )

    val arraySchema = new StructType()
      .add("name", StringType)
      .add("knownLanguages", ArrayType(StringType))
      .add("properties", MapType(StringType, StringType))

    val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayData), arraySchema)
    df.printSchema()
    df.show()

    // Enables the $"column" syntax used below
    import spark.implicits._

    // Array examples

    // explode
    df.select($"name", explode($"knownLanguages")).show()

    // explode_outer
    df.select($"name", explode_outer($"knownLanguages")).show()

    // posexplode
    df.select($"name", posexplode($"knownLanguages")).show()

    // posexplode_outer
    df.select($"name", posexplode_outer($"knownLanguages")).show()

    // Map examples

    // explode
    df.select($"name", explode($"properties")).show()

    // explode_outer
    df.select($"name", explode_outer($"properties")).show()

    // posexplode
    df.select($"name", posexplode($"properties")).show()

    // posexplode_outer
    df.select($"name", posexplode_outer($"properties")).show()
  }
}