Internal working of Spark SQL
A Spark SQL query goes through several distinct phases before it is executed. Let’s walk through each of them.
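A quick way to see all of these phases at once is Dataset.explain(extended = true) (or EXPLAIN EXTENDED in SQL), which prints the parsed, analyzed, and optimized logical plans followed by the physical plan. Here is a minimal Scala setup, suitable for the spark-shell, that the later sketches reuse; the people view is just an illustrative example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-sql-phases")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A tiny example table to query against.
Seq((1, "alice"), (2, "bob")).toDF("id", "name")
  .createOrReplaceTempView("people")

// Prints the Parsed, Analyzed, and Optimized Logical Plans,
// followed by the selected Physical Plan.
spark.sql("SELECT name FROM people WHERE id > 1").explain(extended = true)
```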
1. Parsed Logical Plan – unresolved
The query text is parsed and checked for syntax errors, producing an unresolved logical plan. If the syntax is correct, the plan moves on to step 2; if not, parsing fails immediately, as the sketch below shows.
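For example, a malformed statement fails at this stage with a ParseException, before any table or column is ever looked up. A small sketch, reusing the session from the setup above:

```scala
import org.apache.spark.sql.catalyst.parser.ParseException

try {
  // "SELEC" is a typo, so parsing fails before analysis even starts.
  spark.sql("SELEC name FROM people")
} catch {
  case e: ParseException => println(s"Syntax error: ${e.getMessage}")
}
```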
2. Resolved/Analyzed Logical Plan
Spark then tries to resolve the table names, column names, function names, and so on, by looking them up in the catalog. If a referenced table or column does not exist, an AnalysisException is raised. If everything resolves, the plan moves on to step 3.
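To see this phase in action, here is a query that is syntactically valid but references a column that does not exist in the people view from the setup above:

```scala
import org.apache.spark.sql.AnalysisException

try {
  // Valid syntax, but the "people" view has no "salary" column,
  // so the analyzer raises an AnalysisException.
  spark.sql("SELECT salary FROM people")
} catch {
  case e: AnalysisException => println(s"Analysis error: ${e.getMessage}")
}
```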
3. Optimized Logical Plan
The resolved logical plan goes through the Catalyst optimizer, a rule-based engine. The plan is rewritten according to a set of rules, some of which are filter (predicate) pushdown, combining of adjacent filters, and collapsing of consecutive projections. Many such rules ship with Spark, and if we want we can also register our own custom rules with the Catalyst optimizer, as sketched below.
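As an illustration of a custom rule (the rule name and the rewrite here are made up for this example), a rule that simplifies expr * 1 to expr can be registered through spark.experimental.extraOptimizations. Note that this relies on internal Catalyst classes, so the exact APIs can differ between Spark versions:

```scala
import org.apache.spark.sql.catalyst.expressions.{Literal, Multiply}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative rule: rewrite "expr * 1" to just "expr".
object RemoveMultiplyByOne extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
    case m: Multiply => m.right match {
      case Literal(1, _) => m.left   // drop the redundant multiplication
      case _             => m
    }
  }
}

// Register the extra rule; Catalyst applies it during optimization.
spark.experimental.extraOptimizations = Seq(RemoveMultiplyByOne)
```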
4. Generation of physical plan
The optimized logical plan is converted into one or more candidate physical plans, and out of these the one with the lowest estimated cost is selected for execution.
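Both the optimized logical plan and the chosen physical plan can be inspected through the queryExecution handle on a DataFrame, for example:

```scala
val df = spark.sql("SELECT name FROM people WHERE id > 1")

// The optimized logical plan produced by Catalyst.
println(df.queryExecution.optimizedPlan)

// The physical plan selected for execution.
println(df.queryExecution.sparkPlan)

// The final plan after preparation rules such as whole-stage codegen.
println(df.queryExecution.executedPlan)
```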
5. Code generation
The selected physical plan is compiled down to code over the lower-level RDD API (in modern Spark, via whole-stage code generation, which emits and compiles Java code for entire plan stages). This generated code is then executed on the cluster.
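The generated code can be printed for inspection. A sketch, again reusing the setup above (explain("codegen") is available in Spark 3.x; debugCodegen() also works on older versions):

```scala
// Print the Java code produced by whole-stage codegen (Spark 3.x).
spark.sql("SELECT name FROM people WHERE id > 1").explain("codegen")

// Alternative: the debug helpers expose the same information.
import org.apache.spark.sql.execution.debug._
spark.sql("SELECT name FROM people WHERE id > 1").debugCodegen()
```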