Spark SQL Built-in Standard Functions

Spark SQL String Functions

String functions are grouped as “ string_funcs” in spark SQL. Below is a list of functions defined under this group. Click on each link to learn with a Scala example.

Spark SQL Date and Time Functions:

Spark SQL Collection Functions:

mapCreates a new map column.
map_keysReturns an array containing the keys of the map.
map_valuesReturns an array containing the values of the map.
map_concatMerges maps specified in arguments.
map_from_entriesReturns a map from the given array of StructType entries.
map_entriesReturns an array of all StructType in the given map.
explode(e: Column)Creates a new row for every key-value pair in the map by ignoring null & empty. It creates two new columns one for key and one for value.
explode_outer(e: Column)Creates a new row for every key-value pair in the map including null & empty. It creates two new columns one for key and one for value.
posexplode(e: Column)Creates a new row for each key-value pair in a map by ignoring null & empty. It also creates 3 columns “pos” to hold the position of the map element, “key” and “value” columns for every row.
posexplode_outer(e: Column)Creates a new row for each key-value pair in a map including null & empty. It also creates 3 columns “pos” to hold the position of the map element, “key” and “value” columns for every row.
transform_keys(expr: Column, f: (Column, Column) => Column)Transforms map by applying functions to every key-value pair and returns a transformed map.
transform_values(expr: Column, f: (Column, Column) => Column)Transforms map by applying functions to every key-value pair and returns a transformed map.
map_zip_with( left: Column, right: Column, f: (Column, Column, Column) => Column)Merges two maps into a single map.
element_at(column: Column, value: Any)Returns a value of a key in a map.
size(e: Column)Returns length of a map column.

Spark SQL Math Functions

Below are a subset of Mathematical and Statistical functions

Aggregate Functions

AGGREGATE FUNCTION SYNTAXAGGREGATE FUNCTION DESCRIPTION
approx_count_distinct(e: Column)Returns the count of distinct items in a group.
approx_count_distinct(e: Column, rsd: Double)Returns the count of distinct items in a group.
avg(e: Column)Returns the average of values in the input column.
collect_list(e: Column)Returns all values from an input column with duplicates.
collect_set(e: Column)Returns all values from an input column with duplicate values .eliminated.
corr(column1: Column, column2: Column)Returns the Pearson Correlation Coefficient for two columns.
count(e: Column)Returns number of elements in a column.
countDistinct(expr: Column, exprs: Column*)Returns number of distinct elements in the columns.
covar_pop(column1: Column, column2: Column)Returns the population covariance for two columns.
covar_samp(column1: Column, column2: Column)Returns the sample covariance for two columns.
first(e: Column, ignoreNulls: Boolean)Returns the first element in a column when ignoreNulls is set to true, it returns first non null element.
first(e: Column): ColumnReturns the first element in a column.
grouping(e: Column)Indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.
kurtosis(e: Column)Returns the kurtosis of the values in a group.
last(e: Column, ignoreNulls: Boolean)Returns the last element in a column. when ignoreNulls is set to true, it returns last non null element.
last(e: Column)Returns the last element in a column.
max(e: Column)Returns the maximum value in a column.
mean(e: Column)Alias for Avg. Returns the average of the values in a column.
min(e: Column)Returns the minimum value in a column.
skewness(e: Column)Returns the skewness of the values in a group.
stddev(e: Column)alias for `stddev_samp`.
stddev_samp(e: Column)Returns the sample standard deviation of values in a column.
stddev_pop(e: Column)Returns the population standard deviation of the values in a column.
sum(e: Column)Returns the sum of all values in a column.
sumDistinct(e: Column)Returns the sum of all distinct values in a column.
variance(e: Column)alias for `var_samp`.
var_samp(e: Column)Returns the unbiased variance of the values in a column.
var_pop(e: Column)returns the population variance of the values in a column.

Window Functions

WINDOW FUNCTION SYNTAXWINDOW FUNCTION DESCRIPTION
row_number(): ColumnReturns a sequential number starting from 1 within a window partition
rank(): ColumnReturns the rank of rows within a window partition, with gaps.
percent_rank(): ColumnReturns the percentile rank of rows within a window partition.
dense_rank(): ColumnReturns the rank of rows within a window partition without any gaps. Where as Rank() returns rank with gaps.
ntile(n: Int): ColumnReturns the ntile id in a window partition
cume_dist(): ColumnReturns the cumulative distribution of values within a window partition
lag(e: Column, offset: Int): Column
lag(columnName: String, offset: Int): Column
lag(columnName: String, offset: Int, defaultValue: Any): Column
returns the value that is `offset` rows before the current row, and `null` if there is less than `offset` rows before the current row.
lead(columnName: String, offset: Int): Column
lead(columnName: String, offset: Int): Column
lead(columnName: String, offset: Int, defaultValue: Any): Column
returns the value that is `offset` rows after the current row, and `null` if there is less than `offset` rows after the current row.

Sorting Functions

SORT FUNCTION SYNTAXSORT FUNCTION DESCRIPTION
asc(columnName: String): Columnasc function is used to specify the ascending order of the sorting column on DataFrame or DataSet
asc_nulls_first(columnName: String): ColumnSimilar to asc function but null values return first and then non-null values
asc_nulls_last(columnName: String): ColumnSimilar to asc function but non-null values return first and then null values
desc(columnName: String): Columndesc function is used to specify the descending order of the DataFrame or DataSet sorting column.
desc_nulls_first(columnName: String): ColumnSimilar to desc function but null values return first and then non-null values.
desc_nulls_last(columnName: String): ColumnSimilar to desc function but non-null values return first and then null values.

I will try to cover most of these functions with examples in upcoming posts

Popular posts from this blog

Window function in PySpark with Joins example using 2 Dataframes (inner join)

Complex SQL: fetch the users who logged in consecutively 3 or more times (lead perfect example)

Credit Card Data Analysis using PySpark (how to use auto broadcast join after disabling it)