Spark SQL Built-in Standard Functions

Spark SQL String Functions

String functions are grouped as “ string_funcs” in spark SQL. Below is a list of functions defined under this group. Click on each link to learn with a Scala example.

Spark SQL Date and Time Functions:

Spark SQL Collection Functions:

mapCreates a new map column.
map_keysReturns an array containing the keys of the map.
map_valuesReturns an array containing the values of the map.
map_concatMerges maps specified in arguments.
map_from_entriesReturns a map from the given array of StructType entries.
map_entriesReturns an array of all StructType in the given map.
explode(e: Column)Creates a new row for every key-value pair in the map by ignoring null & empty. It creates two new columns one for key and one for value.
explode_outer(e: Column)Creates a new row for every key-value pair in the map including null & empty. It creates two new columns one for key and one for value.
posexplode(e: Column)Creates a new row for each key-value pair in a map by ignoring null & empty. It also creates 3 columns “pos” to hold the position of the map element, “key” and “value” columns for every row.
posexplode_outer(e: Column)Creates a new row for each key-value pair in a map including null & empty. It also creates 3 columns “pos” to hold the position of the map element, “key” and “value” columns for every row.
transform_keys(expr: Column, f: (Column, Column) => Column)Transforms map by applying functions to every key-value pair and returns a transformed map.
transform_values(expr: Column, f: (Column, Column) => Column)Transforms map by applying functions to every key-value pair and returns a transformed map.
map_zip_with( left: Column, right: Column, f: (Column, Column, Column) => Column)Merges two maps into a single map.
element_at(column: Column, value: Any)Returns a value of a key in a map.
size(e: Column)Returns length of a map column.

Spark SQL Math Functions

Below are a subset of Mathematical and Statistical functions

Aggregate Functions

AGGREGATE FUNCTION SYNTAXAGGREGATE FUNCTION DESCRIPTION
approx_count_distinct(e: Column)Returns the count of distinct items in a group.
approx_count_distinct(e: Column, rsd: Double)Returns the count of distinct items in a group.
avg(e: Column)Returns the average of values in the input column.
collect_list(e: Column)Returns all values from an input column with duplicates.
collect_set(e: Column)Returns all values from an input column with duplicate values .eliminated.
corr(column1: Column, column2: Column)Returns the Pearson Correlation Coefficient for two columns.
count(e: Column)Returns number of elements in a column.
countDistinct(expr: Column, exprs: Column*)Returns number of distinct elements in the columns.
covar_pop(column1: Column, column2: Column)Returns the population covariance for two columns.
covar_samp(column1: Column, column2: Column)Returns the sample covariance for two columns.
first(e: Column, ignoreNulls: Boolean)Returns the first element in a column when ignoreNulls is set to true, it returns first non null element.
first(e: Column): ColumnReturns the first element in a column.
grouping(e: Column)Indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.
kurtosis(e: Column)Returns the kurtosis of the values in a group.
last(e: Column, ignoreNulls: Boolean)Returns the last element in a column. when ignoreNulls is set to true, it returns last non null element.
last(e: Column)Returns the last element in a column.
max(e: Column)Returns the maximum value in a column.
mean(e: Column)Alias for Avg. Returns the average of the values in a column.
min(e: Column)Returns the minimum value in a column.
skewness(e: Column)Returns the skewness of the values in a group.
stddev(e: Column)alias for `stddev_samp`.
stddev_samp(e: Column)Returns the sample standard deviation of values in a column.
stddev_pop(e: Column)Returns the population standard deviation of the values in a column.
sum(e: Column)Returns the sum of all values in a column.
sumDistinct(e: Column)Returns the sum of all distinct values in a column.
variance(e: Column)alias for `var_samp`.
var_samp(e: Column)Returns the unbiased variance of the values in a column.
var_pop(e: Column)returns the population variance of the values in a column.

Window Functions

WINDOW FUNCTION SYNTAXWINDOW FUNCTION DESCRIPTION
row_number(): ColumnReturns a sequential number starting from 1 within a window partition
rank(): ColumnReturns the rank of rows within a window partition, with gaps.
percent_rank(): ColumnReturns the percentile rank of rows within a window partition.
dense_rank(): ColumnReturns the rank of rows within a window partition without any gaps. Where as Rank() returns rank with gaps.
ntile(n: Int): ColumnReturns the ntile id in a window partition
cume_dist(): ColumnReturns the cumulative distribution of values within a window partition
lag(e: Column, offset: Int): Column
lag(columnName: String, offset: Int): Column
lag(columnName: String, offset: Int, defaultValue: Any): Column
returns the value that is `offset` rows before the current row, and `null` if there is less than `offset` rows before the current row.
lead(columnName: String, offset: Int): Column
lead(columnName: String, offset: Int): Column
lead(columnName: String, offset: Int, defaultValue: Any): Column
returns the value that is `offset` rows after the current row, and `null` if there is less than `offset` rows after the current row.

Sorting Functions

SORT FUNCTION SYNTAXSORT FUNCTION DESCRIPTION
asc(columnName: String): Columnasc function is used to specify the ascending order of the sorting column on DataFrame or DataSet
asc_nulls_first(columnName: String): ColumnSimilar to asc function but null values return first and then non-null values
asc_nulls_last(columnName: String): ColumnSimilar to asc function but non-null values return first and then null values
desc(columnName: String): Columndesc function is used to specify the descending order of the DataFrame or DataSet sorting column.
desc_nulls_first(columnName: String): ColumnSimilar to desc function but null values return first and then non-null values.
desc_nulls_last(columnName: String): ColumnSimilar to desc function but non-null values return first and then null values.

I will try to cover most of these functions with examples in upcoming posts

Popular posts from this blog

What is Garbage collection in Spark and its impact and resolution

Window function in PySpark with Joins example using 2 Dataframes (inner join)

Complex SQL: fetch the users who logged in consecutively 3 or more times (lead perfect example)