Internal working of Spark SQL

A Spark SQL query goes through several phases before it is executed.

Let’s understand these phases one by one.
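
A quick way to see all of these phases is to ask Spark to print the plans for a query. Below is a minimal PySpark sketch (the sample_table view and the range DataFrame are just placeholders for illustration); explain(extended=True) prints the parsed, analyzed, optimized, and physical plans.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan-demo").getOrCreate()

# A small DataFrame standing in for a real table
df = spark.range(100)                      # single column: id
df.createOrReplaceTempView("sample_table")

# Prints the Parsed Logical Plan, Analyzed Logical Plan,
# Optimized Logical Plan and Physical Plan
spark.sql("SELECT id FROM sample_table WHERE id > 10").explain(extended=True)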

1. Parsed Logical Plan – unresolved

The query is parsed and checked for syntax errors.

If the syntax is correct, the query moves on to step 2.
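
For example, a query with a typo in a keyword fails right here at parse time. A small sketch, reusing the spark session and sample_table view from above (in most Spark versions the exception class lives in pyspark.sql.utils; newer releases also expose it from pyspark.errors):

from pyspark.sql.utils import ParseException

try:
    # "SELEC" instead of "SELECT" -- fails in the parsing phase
    spark.sql("SELEC id FROM sample_table")
except ParseException as e:
    print("Syntax error caught at parse time:", e)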

2. Resolved/Analyzed Logical Plan

In this phase, Spark tries to resolve the table names, column names, etc.

It refers to the catalog to resolve them.

If a column name or table name is not found, we get an AnalysisException.

If everything resolves fine, the query moves on to step 3.
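
For instance, a query that refers to a column which does not exist parses fine but fails in this phase. A sketch, again using the sample_table view from the first example:

from pyspark.sql.utils import AnalysisException

try:
    # Syntactically valid, but "not_a_column" cannot be resolved from the catalog
    spark.sql("SELECT not_a_column FROM sample_table").show()
except AnalysisException as e:
    print("Analysis error:", e)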

3. Optimized Logical Plan

The resolved logical plan goes through the Catalyst optimizer.

Catalyst is a rule-based optimization engine.

The plan is optimized by applying various rules.

Some of these rules are filter pushdown, combining of consecutive filters, and combining of projections.

Many such rules are already in place out of the box.

If we want, we can also add our own custom rules to the Catalyst optimizer.
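
An easy way to see one of these rules in action is to chain two filters and look at the optimized plan; the filter-combining rule should merge them into a single predicate. A sketch, reusing the df DataFrame from the first example:

# Two separate filters in the user code...
chained = df.filter("id > 10").filter("id < 50")

# ...but the Optimized Logical Plan is expected to show a single Filter node,
# roughly: Filter ((id > 10) AND (id < 50))
chained.explain(extended=True)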

4. Generation of physical plan

The optimized logical plan is converted into multiple candidate physical plans.

Out of these, the one with the lowest cost is selected.
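
From Spark 3.0 onwards, explain() accepts a mode argument that lets us look at the selected physical plan in a readable layout, and at the statistics used for costing where they are available. A sketch, reusing df from the first example:

# Formatted view of the chosen physical plan (Spark 3.0+)
df.filter("id > 10").explain(mode="formatted")

# Plan annotated with statistics used for costing, where available
df.filter("id > 10").explain(mode="cost")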

5. Code generation

The selected physical plan is converted into low-level RDD code (the lower-level API).

This code is then executed on the cluster.
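
From Spark 3.0 onwards, the generated code (whole-stage code generation) can be inspected directly from PySpark. A sketch, reusing df from the first example:

# Dumps the Java code produced by whole-stage code generation (Spark 3.0+)
df.filter("id > 10").explain(mode="codegen")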
