How does Spark decide on stages and tasks during the execution of a job?

Let’s walk through an example. Below is the series of instructions in our Spark code (a PySpark sketch of the same pipeline follows the list). Let’s see how Spark decides on stages and tasks for this set of instructions.

  1. READ dataset_X
  2. FILTER on dataset_X
  3. MAP operation on dataset_X
  4. READ dataset_Y
  5. MAP operation on dataset_Y
  6. JOIN dataset_X and dataset_Y
  7. FILTER on joined dataset
  8. SAVE the output
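
To make the instruction list concrete, here is a minimal PySpark sketch of the same pipeline. The file paths, column names and the join key (customer_id) are hypothetical placeholders; only the shape of the pipeline – two reads, narrow transformations, a join and a save – matters for the discussion below.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stages-and-tasks-example").getOrCreate()

# 1. READ dataset_X
dataset_x = spark.read.parquet("/data/dataset_x")

# 2. FILTER on dataset_X (narrow transformation)
dataset_x = dataset_x.filter(F.col("status") == "active")

# 3. MAP operation on dataset_X (narrow transformation)
dataset_x = dataset_x.withColumn("amount_usd", F.col("amount") * 1.1)

# 4. READ dataset_Y
dataset_y = spark.read.parquet("/data/dataset_y")

# 5. MAP operation on dataset_Y (narrow transformation)
dataset_y = dataset_y.withColumn("country", F.upper(F.col("country")))

# 6. JOIN dataset_X and dataset_Y (wide transformation, needs a shuffle)
joined = dataset_x.join(dataset_y, on="customer_id", how="inner")

# 7. FILTER on the joined dataset (narrow transformation)
joined = joined.filter(F.col("amount_usd") > 100)

# 8. SAVE the output (the action that triggers the job)
joined.write.mode("overwrite").parquet("/data/output")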

Stages

Spark creates a separate stage for each dataset that is read.

All consecutive narrow transformations (e.g. FILTER, MAP) are grouped together inside the same stage.

Spark creates a new stage when it encounters a wide transformation (e.g. JOIN, reduceByKey), because a wide transformation requires shuffling data across partitions.

For the above set of instructions, Spark will create 3 stages –

First stage – Instructions 1, 2 and 3

Second stage – Instructions 4 and 5

Third stage – Instructions 6, 7 and 8
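
If you want to verify where Spark places these stage boundaries, one way (a sketch, assuming the joined DataFrame from the example above) is to print the physical plan: each Exchange operator marks a shuffle, and a shuffle is exactly where one stage ends and the next begins. The Spark UI shows the same stage breakdown once the job runs.

# Print the physical plan before running the job.
# Exchange nodes in the plan mark shuffle boundaries, i.e. stage boundaries.
joined.explain()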

Tasks

Spark creates a task to execute a set of instructions inside a stage.

The number of tasks in a stage equals the number of partitions of the dataset processed by that stage.

Each task executes all consecutive narrow transformations inside a stage on its own partition – this is called pipelining.

Tasks in the first stage will execute instructions 1, 2 and 3.

Tasks in the second stage will execute instructions 4 and 5.

Tasks in the third stage will execute instructions 6, 7 and 8.
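
As a rough sketch of how the task counts come about (again assuming the DataFrames from the example above): the first two stages get one task per input partition, while the third stage reads the shuffled data, so its task count is driven by the shuffle partition setting.

# Tasks in a stage = partitions processed by that stage.
print(dataset_x.rdd.getNumPartitions())  # tasks in the first stage
print(dataset_y.rdd.getNumPartitions())  # tasks in the second stage

# The join shuffles the data; by default the shuffled output has
# spark.sql.shuffle.partitions partitions (200 unless overridden),
# which determines the number of tasks in the third stage.
print(spark.conf.get("spark.sql.shuffle.partitions"))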
