Spark SQL Built-in Standard Functions

Spark SQL String Functions

String functions are grouped as “ string_funcs” in spark SQL. Below is a list of functions defined under this group. Click on each link to learn with a Scala example.

STRING FUNCTION SIGNATURE	STRING FUNCTION DESCRIPTION
ascii(e: Column): Column	Computes the numeric value of the first character of the string column, and returns the result as an int column.
base64(e: Column): Column	Computes the BASE64 encoding of a binary column and returns it as a string column.This is the reverse of unbase64.
concat_ws(sep: String, exprs: Column*): Column	Concatenates multiple input string columns together into a single string column, using the given separator.
decode(value: Column, charset: String): Column	Computes the first argument into a string from a binary using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
encode(value: Column, charset: String): Column	Computes the first argument into a binary from a string using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
format_number(x: Column, d: Int): Column	Formats numeric column x to a format like ‘#,###,###.##’, rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string column.
format_string(format: String, arguments: Column*): Column	Formats the arguments in printf-style and returns the result as a string column.
initcap(e: Column): Column	Returns a new string column by converting the first letter of each word to uppercase. Words are delimited by whitespace. For example, “hello world” will become “Hello World”.
instr(str: Column, substring: String): Column	Locate the position of the first occurrence of substr column in the given string. Returns null if either of the arguments are null.
length(e: Column): Column	Computes the character length of a given string or number of bytes of a binary string. The length of character strings include the trailing spaces. The length of binary strings includes binary zeros.
lower(e: Column): Column	Converts a string column to lower case.
levenshtein ( l : Column , r : Column ) : Column	Computes the Levenshtein distance of the two given string columns.
locate(substr: String, str: Column): Column	Locate the position of the first occurrence of substr.
locate(substr: String, str: Column, pos: Int): Column	Locate the position of the first occurrence of substr in a string column, after position pos.
lpad(str: Column, len: Int, pad: String): Column	Left-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters.
ltrim(e: Column): Column	Trim the spaces from left end for the specified string value.
regexp_extract(e: Column, exp: String, groupIdx: Int): Column	Extract a specific group matched by a Java regex, from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned.
regexp_replace(e: Column, pattern: String, replacement: String): Column	Replace all substrings of the specified string value that match regexp with rep.
regexp_replace(e: Column, pattern: Column, replacement: Column): Column	Replace all substrings of the specified string value that match regexp with rep.
unbase64(e: Column): Column	Decodes a BASE64 encoded string column and returns it as a binary column. This is the reverse of base64.
rpad(str: Column, len: Int, pad: String): Column	Right-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters.
repeat(str: Column, n: Int): Column	Repeats a string column n times, and returns it as a new string column.
rtrim(e: Column): Column	Trim the spaces from right end for the specified string value.
rtrim(e: Column, trimString: String): Column	Trim the specified character string from right end for the specified string column.
soundex(e: Column): Column	Returns the soundex code for the specified expression
split(str: Column, regex: String): Column	Splits str around matches of the given regex.
split(str: Column, regex: String, limit: Int): Column	Splits str around matches of the given regex.
substring(str: Column, pos: Int, len: Int): Column	Substring starts at `pos` and is of length `len` when str is String type or returns the slice of byte array that starts at `pos` in byte and is of length `len` when str is Binary type
substring_index(str: Column, delim: String, count: Int): Column	Returns the substring from string str before count occurrences of the delimiter delim. * If count is positive, everything the left of the final delimiter (counting from left) is * returned. If count is negative, every to the right of the final delimiter (counting from the * right) is returned. substring_index performs a case-sensitive match when searching for delim.
overlay(src: Column, replaceString: String, pos: Int, len: Int): Column	Overlay the specified portion of `src` with `replaceString`, * starting from byte position `pos` of `inputString` and proceeding for `len` bytes.
overlay(src: Column, replaceString: String, pos: Int): Column	Overlay the specified portion of `src` with `replaceString`, * starting from byte position `pos` of `inputString`.
translate(src: Column, matchingString: String, replaceString: String): Column	Translate any character in the src by a character in replaceString. * The characters in replaceString correspond to the characters in matchingString. * The translate will happen when any character in the string matches the character * in the `matchingString`.
trim(e: Column): Column	Trim the spaces from both ends for the specified string column.
trim(e: Column, trimString: String): Column	Trim the specified character from both ends for the specified string column.
upper(e: Column): Column	Converts a string column to upper case.

Spark SQL Date and Time Functions:

DATE FUNCTIONS SYNTAX	DATE FUNCTION DESCRIPTION
current_date () : Column	Returns the current date as a date column.
date_format(dateExpr: Column, format: String): Column	Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.
to_date(e: Column): Column	Converts the column into `DateType` by casting rules to `DateType`.
to_date(e: Column, fmt: String): Column	Converts the column into a `DateType` with a specified format
add_months(startDate: Column, numMonths: Int): Column	Returns the date that is `numMonths` after `startDate`.
date_add(start: Column, days: Int): Column date_sub(start: Column, days: Int): Column	Returns the date that is `days` days after `start`
datediff(end: Column, start: Column): Column	Returns the number of days from `start` to `end`.
months_between(end: Column, start: Column): Column	Returns number of months between dates `start` and `end`. A whole number is returned if both inputs have the same day of month or both are the last day of their respective months. Otherwise, the difference is calculated assuming 31 days per month.
months_between(end: Column, start: Column, roundOff: Boolean): Column	Returns number of months between dates `end` and `start`. If `roundOff` is set to true, the result is rounded off to 8 digits; it is not rounded otherwise.
next_day(date: Column, dayOfWeek: String): Column	Returns the first date which is later than the value of the `date` column that is on the specified day of the week. For example, `next_day(‘2015-07-27’, “Sunday”)` returns 2015-08-02 because that is the first Sunday after 2015-07-27.
trunc(date: Column, format: String): Column	Returns date truncated to the unit specified by the format. For example, `trunc(“2018-11-19 12:01:19”, “year”)` returns 2018-01-01 format: ‘year’, ‘yyyy’, ‘yy’ to truncate by year, ‘month’, ‘mon’, ‘mm’ to truncate by month
date_trunc(format: String, timestamp: Column): Column	Returns timestamp truncated to the unit specified by the format. For example, `date_trunc(“year”, “2018-11-19 12:01:19”)` returns 2018-01-01 00:00:00 format: ‘year’, ‘yyyy’, ‘yy’ to truncate by year, ‘month’, ‘mon’, ‘mm’ to truncate by month, ‘day’, ‘dd’ to truncate by day, Other options are: ‘second’, ‘minute’, ‘hour’, ‘week’, ‘month’, ‘quarter’
year(e: Column): Column	Extracts the year as an integer from a given date/timestamp/string
quarter(e: Column): Column	Extracts the quarter as an integer from a given date/timestamp/string.
month(e: Column): Column	Extracts the month as an integer from a given date/timestamp/string
dayofweek(e: Column): Column	Extracts the day of the week as an integer from a given date/timestamp/string. Ranges from 1 for a Sunday through to 7 for a Saturday
dayofmonth(e: Column): Column	Extracts the day of the month as an integer from a given date/timestamp/string.
dayofyear(e: Column): Column	Extracts the day of the year as an integer from a given date/timestamp/string.
weekofyear(e: Column): Column	Extracts the week number as an integer from a given date/timestamp/string. A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601
last_day(e: Column): Column	Returns the last day of the month which the given date belongs to. For example, input “2015-07-27” returns “2015-07-31” since July 31 is the last day of the month in July 2015.
from_unixtime(ut: Column): Column	Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format.
from_unixtime(ut: Column, f: String): Column	Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
unix_timestamp(): Column	Returns the current Unix timestamp (in seconds) as a long
unix_timestamp(s: Column): Column	Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale.
unix_timestamp(s: Column, p: String): Column	Converts time string with given pattern to Unix timestamp (in seconds).

TIMESTAMP FUNCTION SYNTAX	TIMESTAMP FUNCTION DESCRIPTION
current_timestamp () : Column	Returns the current timestamp as a timestamp column
hour(e: Column): Column	Extracts the hours as an integer from a given date/timestamp/string.
minute(e: Column): Column	Extracts the minutes as an integer from a given date/timestamp/string.
second(e: Column): Column	Extracts the seconds as an integer from a given date/timestamp/string.
to_timestamp(s: Column): Column	Converts to a timestamp by casting rules to `TimestampType`.
to_timestamp(s: Column, fmt: String): Column	Converts time string with the given pattern to timestamp.

DATE & TIME WINDOW FUNCTION SYNTAX	DATE & TIME WINDOW FUNCTION DESCRIPTION
window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): Column	Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported.
window(timeColumn: Column, windowDuration: String, slideDuration: String): Column	Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC
window(timeColumn: Column, windowDuration: String): Column	Generates tumbling time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC.

Spark SQL Collection Functions:

ARRAY FUNCTION SYNTAX	ARRAY FUNCTION DESCRIPTION
array_contains(column: Column, value: Any)	Check if a value presents in an array column. Return below values. true – Returns if value presents in an array. false – When valu eno presents. null – when array is null.
array_distinct(e: Column)	Return distinct values from the array after removing duplicates.
array_except(col1: Column, col2: Column)	Returns all elements from col1 array but not in col2 array.
array_intersect(col1: Column, col2: Column)	Returns all elements that are present in col1 and col2 arrays.
array_join(column: Column, delimiter: String, nullReplacement: String) array_join(column: Column, delimiter: String)	Concatenates all elments of array column with using provided delimeter. When Null valeus are present, they replaced with ‘nullReplacement’ string
array_max(e: Column)	Return maximum values in an array
array_min(e: Column)	Return minimum values in an array
array_position(column: Column, value: Any)	Returns a position/index of first occurrence of the ‘value’ in the given array. Returns position as long type and the position is not zero based instead starts with 1. Returns zero when value is not found. Returns null when any of the arguments are null.
array_remove(column: Column, element: Any)	Returns an array after removing all provided ‘value’ from the given array.
array_repeat(e: Column, count: Int)	Creates an array containing the first argument repeated the number of times given by the second argument.
array_repeat(left: Column, right: Column)	Creates an array containing the first argument repeated the number of times given by the second argument.
array_sort(e: Column)	Returns the sorted array of the given input array. All null values are placed at the end of the array.
array_union(col1: Column, col2: Column)	Returns an array of elements that are present in both arrays (all elements from both arrays) with out duplicates.
arrays_overlap(a1: Column, a2: Column)	true – if `a1` and `a2` have at least one non-null element in common false – if `a1` and `a2` have completely different elements. null – if both the arrays are non-empty and any of them contains a `null`
arrays_zip(e: Column*)	Returns a merged array of structs in which the N-th struct contains all N-th values of input
concat(exprs: Column*)	Concatenates all elements from a given columns
element_at(column: Column, value: Any)	Returns an element of an array located at the ‘value’ input position.
exists(column: Column, f: Column => Column)	Checks if the column presents in an array column.
explode(e: Column)	Create a row for each element in the array column
explode_outer ( e : Column )	Create a row for each element in the array column. Unlike explode, if the array is null or empty, it returns null.
filter(column: Column, f: Column => Column) filter(column: Column, f: (Column, Column) => Column)	Returns an array of elements for which a predicate holds in a given array
flatten(e: Column)	Creates a single array from an array of arrays column.
forall(column: Column, f: Column => Column)	Returns whether a predicate holds for every element in the array.
posexplode(e: Column)	Creates a row for each element in the array and creaes a two columns “pos’ to hold the position of the array element and the ‘col’ to hold the actual array value.
posexplode_outer(e: Column)	Creates a row for each element in the array and creaes a two columns “pos’ to hold the position of the array element and the ‘col’ to hold the actual array value. Unlike posexplode, if the array is null or empty, it returns null,null for pos and col columns.
reverse(e: Column)	Returns the array of elements in a reverse order.
sequence(start: Column, stop: Column)	Generate the sequence of numbers from start to stop number.
sequence ( start : Column , stop : Column , step : Column )	Generate the sequence of numbers from start to stop number by incrementing with given step value.
shuffle(e: Column)	Shuffle the given array
size(e: Column)	Return the length of an array.
slice(x: Column, start: Int, length: Int)	Returns an array of elements from position ‘start’ and the given length.
sort_array(e: Column)	Sorts the array in an ascending order. Null values are placed at the beginning.
sort_array(e: Column, asc: Boolean)	Sorts the array in an ascending or descending order based of the boolean parameter. For assending, Null values are placed at the beginning. And for desending they are places at the end.
transform(column: Column, f: Column => Column) transform(column: Column, f: (Column, Column) => Column)	Returns an array of elments after applying transformation.
zip_with(left: Column, right: Column, f: (Column, Column) => Column)	Merges two input arrays.
aggregate( expr: Column, zero: Column, merge: (Column, Column) => Column, finish: Column => Column)	Aggregates

map	Creates a new map column.
map_keys	Returns an array containing the keys of the map.
map_values	Returns an array containing the values of the map.
map_concat	Merges maps specified in arguments.
map_from_entries	Returns a map from the given array of StructType entries.
map_entries	Returns an array of all StructType in the given map.
explode(e: Column)	Creates a new row for every key-value pair in the map by ignoring null & empty. It creates two new columns one for key and one for value.
explode_outer(e: Column)	Creates a new row for every key-value pair in the map including null & empty. It creates two new columns one for key and one for value.
posexplode(e: Column)	Creates a new row for each key-value pair in a map by ignoring null & empty. It also creates 3 columns “pos” to hold the position of the map element, “key” and “value” columns for every row.
posexplode_outer(e: Column)	Creates a new row for each key-value pair in a map including null & empty. It also creates 3 columns “pos” to hold the position of the map element, “key” and “value” columns for every row.
transform_keys(expr: Column, f: (Column, Column) => Column)	Transforms map by applying functions to every key-value pair and returns a transformed map.
transform_values(expr: Column, f: (Column, Column) => Column)	Transforms map by applying functions to every key-value pair and returns a transformed map.
map_zip_with( left: Column, right: Column, f: (Column, Column, Column) => Column)	Merges two maps into a single map.
element_at(column: Column, value: Any)	Returns a value of a key in a map.
size(e: Column)	Returns length of a map column.

Spark SQL Math Functions

Below are a subset of Mathematical and Statistical functions

MATH TRIGONOMETRIC FUNCTION SYNTAX	MATH TRIGONOMETRIC FUNCTION DESCRIPTION
asin(e: Column): Column asin(columnName: String): Column	Return arcsine or inverse sine of the input argument, same as java.lang.Math.asin() function.
acos(e: Column): Column acos(columnName: String): Column	Return arccosine or inverse cosine of input argument, same as java.lang.Math.acos() function.
atan(e: Column): Column atan(columnName: String): Column	Return arctangent or inverse tangent of input argument, same as java.lang.Math.atan() function.
sin ( e : Column ) : Column sin(columnName: String): Column	Return sine of the angle, same as java.lang.Math.sin() function.
sinh ( e : Column ) : Column sinh(columnName: String): Column	Return hyperbolic sine of the given value, same as java.lang.Math.sinh() function.
cos(e: Column): Column cos(columnName: String): Column	Return cosine of the angle, same as java.lang.Math.cos() function.
cosh(e: Column): Column cosh(columnName: String): Column	Return hyperbolic cosine of the angle, same as java.lang.Math.cosh() function.
tan(e: Column): Column tan(columnName: String): Column	Return tangent of the given value, same as java.lang.Math.tan() function.
tanh(e: Column): Column tanh(columnName: String): Column	Return hyperbolic tangent of the given value, same as java.lang.Math.tanh() function.

SPARK SQL LOG MATH FUNCTIONS SIGNATURE	SPARK FUNCTIONS DESCRIPTION
log(columnName: String): Column	Computes the natural logarithm of the given column.
log(base: Double, a: Column): Column log(base: Double, columnName: String): Column	Returns the first argument-base logarithm of the second argument.
log10(e: Column): Column log10(columnName: String): Column	Computes the logarithm of the given value in base 10.
log1p(e: Column): Column log1p(columnName: String): Column	Computes the natural logarithm of the given value plus one.
log2(expr: Column): Column log2(columnName: String): Column	Computes the logarithm of the given column in base 2.

Aggregate Functions

AGGREGATE FUNCTION SYNTAX	AGGREGATE FUNCTION DESCRIPTION
approx_count_distinct(e: Column)	Returns the count of distinct items in a group.
approx_count_distinct(e: Column, rsd: Double)	Returns the count of distinct items in a group.
avg(e: Column)	Returns the average of values in the input column.
collect_list(e: Column)	Returns all values from an input column with duplicates.
collect_set(e: Column)	Returns all values from an input column with duplicate values .eliminated.
corr(column1: Column, column2: Column)	Returns the Pearson Correlation Coefficient for two columns.
count(e: Column)	Returns number of elements in a column.
countDistinct(expr: Column, exprs: Column*)	Returns number of distinct elements in the columns.
covar_pop(column1: Column, column2: Column)	Returns the population covariance for two columns.
covar_samp(column1: Column, column2: Column)	Returns the sample covariance for two columns.
first(e: Column, ignoreNulls: Boolean)	Returns the first element in a column when ignoreNulls is set to true, it returns first non null element.
first(e: Column): Column	Returns the first element in a column.
grouping(e: Column)	Indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.
kurtosis(e: Column)	Returns the kurtosis of the values in a group.
last(e: Column, ignoreNulls: Boolean)	Returns the last element in a column. when ignoreNulls is set to true, it returns last non null element.
last(e: Column)	Returns the last element in a column.
max(e: Column)	Returns the maximum value in a column.
mean(e: Column)	Alias for Avg. Returns the average of the values in a column.
min(e: Column)	Returns the minimum value in a column.
skewness(e: Column)	Returns the skewness of the values in a group.
stddev(e: Column)	alias for `stddev_samp`.
stddev_samp(e: Column)	Returns the sample standard deviation of values in a column.
stddev_pop(e: Column)	Returns the population standard deviation of the values in a column.
sum(e: Column)	Returns the sum of all values in a column.
sumDistinct(e: Column)	Returns the sum of all distinct values in a column.
variance(e: Column)	alias for `var_samp`.
var_samp(e: Column)	Returns the unbiased variance of the values in a column.
var_pop(e: Column)	returns the population variance of the values in a column.

Window Functions

WINDOW FUNCTION SYNTAX	WINDOW FUNCTION DESCRIPTION
row_number(): Column	Returns a sequential number starting from 1 within a window partition
rank(): Column	Returns the rank of rows within a window partition, with gaps.
percent_rank(): Column	Returns the percentile rank of rows within a window partition.
dense_rank(): Column	Returns the rank of rows within a window partition without any gaps. Where as Rank() returns rank with gaps.
ntile(n: Int): Column	Returns the ntile id in a window partition
cume_dist(): Column	Returns the cumulative distribution of values within a window partition
lag(e: Column, offset: Int): Column lag(columnName: String, offset: Int): Column lag(columnName: String, offset: Int, defaultValue: Any): Column	returns the value that is `offset` rows before the current row, and `null` if there is less than `offset` rows before the current row.
lead(columnName: String, offset: Int): Column lead(columnName: String, offset: Int): Column lead(columnName: String, offset: Int, defaultValue: Any): Column	returns the value that is `offset` rows after the current row, and `null` if there is less than `offset` rows after the current row.

Sorting Functions

SORT FUNCTION SYNTAX	SORT FUNCTION DESCRIPTION
asc(columnName: String): Column	asc function is used to specify the ascending order of the sorting column on DataFrame or DataSet
asc_nulls_first(columnName: String): Column	Similar to asc function but null values return first and then non-null values
asc_nulls_last(columnName: String): Column	Similar to asc function but non-null values return first and then null values
desc(columnName: String): Column	desc function is used to specify the descending order of the DataFrame or DataSet sorting column.
desc_nulls_first(columnName: String): Column	Similar to desc function but null values return first and then non-null values.
desc_nulls_last(columnName: String): Column	Similar to desc function but non-null values return first and then null values.

I will try to cover most of these functions with examples in upcoming posts

Tech Studio Online

Spark SQL Built-in Standard Functions

Spark SQL String Functions

Spark SQL Date and Time Functions:

Spark SQL Collection Functions:

Spark SQL Math Functions

Aggregate Functions

Window Functions

Sorting Functions

Popular posts from this blog

Window function in PySpark with Joins example using 2 Dataframes (inner join)

Complex SQL: fetch the users who logged in consecutively 3 or more times (lead perfect example)

Credit Card Data Analysis using PySpark (how to use auto broadcast join after disabling it)