Spark SQL Examples Using explode Functions on Array and Map Columns

What is the explode function

The Spark SQL explode function splits an array or map DataFrame column into rows, producing one row per element. Spark defines several flavors of this function: explode_outer, which also handles null and empty values; posexplode, which explodes with the position of each element; and posexplode_outer, which combines both behaviors.

Difference between explode and explode_outer

explode – creates a row for each element in the array or map column, skipping rows where the array or map is null or empty. explode_outer returns all rows, including those with a null or empty array or map, emitting null for the exploded value.
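As a rough analogy in plain Scala collections (not the Spark API itself), explode behaves like a flatMap that drops rows whose array is null or empty, while explode_outer keeps those rows and emits a null element for them. A minimal sketch, using hypothetical sample data modeled on the DataFrame below:

```scala
// Plain-Scala sketch of explode vs explode_outer semantics (not Spark API).
object ExplodeAnalogy {
  // Hypothetical sample: name -> list of known languages (may be null or empty)
  val rows: Seq[(String, List[String])] = Seq(
    ("deepak", List("Java", "Scala")),
    ("pune", null),
    ("Jeferson", List())
  )

  // explode: one row per element; null/empty arrays produce no rows at all
  val exploded: Seq[(String, String)] = rows.flatMap {
    case (name, langs) if langs != null && langs.nonEmpty => langs.map(l => (name, l))
    case _ => Nil
  }

  // explode_outer: null/empty arrays still yield one row, with a null value
  val explodedOuter: Seq[(String, String)] = rows.flatMap {
    case (name, langs) if langs != null && langs.nonEmpty => langs.map(l => (name, l))
    case (name, _) => Seq((name, null: String))
  }

  def main(args: Array[String]): Unit = {
    println(exploded)      // only deepak's two rows survive
    println(explodedOuter) // deepak's rows plus (pune,null) and (Jeferson,null)
  }
}
```

The same contrast holds for map columns: explode skips null/empty maps, while explode_outer emits a row with null key and value for them.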

Difference between explode and posexplode

explode – creates a row for each element in the array or map column. posexplode also creates a row per element, but for an array it adds two columns: 'pos' holds the position of the element and 'col' holds the actual value. For a map, it creates three columns: 'pos', 'key' and 'value'.
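Again as a plain-Scala analogy (not the Spark API), the 'pos' column posexplode adds corresponds to what zipWithIndex supplies on an ordinary collection. A minimal sketch with hypothetical sample data:

```scala
// Plain-Scala sketch of posexplode semantics (not Spark API):
// zipWithIndex supplies the 'pos' value alongside each element.
object PosexplodeAnalogy {
  // posexplode on an array column -> rows of (name, pos, col)
  val rows: Seq[(String, List[String])] = Seq(
    ("deepak", List("Java", "Scala", "C++"))
  )
  val posExploded: Seq[(String, Int, String)] = rows.flatMap { case (name, langs) =>
    langs.zipWithIndex.map { case (lang, pos) => (name, pos, lang) }
  }

  // posexplode on a map column -> rows of (name, pos, key, value)
  val props: Seq[(String, Map[String, String])] =
    Seq(("deepak", Map("hair" -> "black", "eye" -> "brown")))
  val posExplodedMap: Seq[(String, Int, String, String)] = props.flatMap { case (name, m) =>
    m.toSeq.zipWithIndex.map { case ((k, v), pos) => (name, pos, k, v) }
  }

  def main(args: Array[String]): Unit = {
    posExploded.foreach(println)    // (deepak,0,Java) (deepak,1,Scala) (deepak,2,C++)
    posExplodedMap.foreach(println)
  }
}
```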

package com.dpq.spark.dataframe.functions

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{ArrayType, MapType, StringType, StructType}

object ExplodeArrayAndMap {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().appName("dpq.spark.example.com")
      .master("local[1]")
      .getOrCreate()

    // Create a DataFrame with an array column and a map column,
    // including null, empty, and blank values to contrast the flavors
    val arrayData = Seq(
      Row("deepak", List("Java", "Scala", "C++"), Map("hair" -> "black", "eye" -> "brown")),
      Row("anant", List("Spark", "Java", "C++", null), Map("hair" -> "brown", "eye" -> null)),
      Row("lokesh", List("CSharp", "Python", ""), Map("hair" -> "red", "eye" -> "")),
      Row("pune", null, null),
      Row("Jeferson", List(), Map())
    )

    val arraySchema = new StructType()
      .add("name", StringType)
      .add("knownLanguages", ArrayType(StringType))
      .add("properties", MapType(StringType, StringType))

    val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayData), arraySchema)
    df.printSchema()
    df.show()

    import spark.implicits._

    // Array examples
    df.select($"name", explode($"knownLanguages")).show()          // explode
    df.select($"name", explode_outer($"knownLanguages")).show()    // explode_outer
    df.select($"name", posexplode($"knownLanguages")).show()       // posexplode
    df.select($"name", posexplode_outer($"knownLanguages")).show() // posexplode_outer

    // Map examples
    df.select($"name", explode($"properties")).show()          // explode
    df.select($"name", explode_outer($"properties")).show()    // explode_outer
    df.select($"name", posexplode($"properties")).show()       // posexplode
    df.select($"name", posexplode_outer($"properties")).show() // posexplode_outer
  }
}
