How does Spark decide on stages and tasks during the execution of a job?

Let’s walk through an example. Below is the series of instructions in our Spark code (a PySpark sketch of the same pipeline follows the list). Let’s see how Spark decides on stages and tasks for this set of instructions.

  1. READ dataset_X
  2. FILTER on dataset_X
  3. MAP operation on dataset_X
  4. READ dataset_Y
  5. MAP operation on dataset_Y
  6. JOIN dataset_X and dataset_Y
  7. FILTER on joined dataset
  8. SAVE the output
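
To make the instruction list concrete, here is a minimal PySpark sketch of the same pipeline. The file paths, column names and the join key (customer_id) are hypothetical placeholders; only the shape of the pipeline – two reads, narrow transformations, a join and a save – matters for the discussion below.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stages-and-tasks-example").getOrCreate()

# 1. READ dataset_X
dataset_x = spark.read.parquet("/data/dataset_x")

# 2. FILTER on dataset_X (narrow transformation)
dataset_x = dataset_x.filter(F.col("status") == "active")

# 3. MAP operation on dataset_X (narrow transformation)
dataset_x = dataset_x.withColumn("amount_usd", F.col("amount") * 1.1)

# 4. READ dataset_Y
dataset_y = spark.read.parquet("/data/dataset_y")

# 5. MAP operation on dataset_Y (narrow transformation)
dataset_y = dataset_y.withColumn("country", F.upper(F.col("country")))

# 6. JOIN dataset_X and dataset_Y (wide transformation, needs a shuffle)
joined = dataset_x.join(dataset_y, on="customer_id", how="inner")

# 7. FILTER on the joined dataset (narrow transformation)
joined = joined.filter(F.col("amount_usd") > 100)

# 8. SAVE the output (the action that triggers the job)
joined.write.mode("overwrite").parquet("/data/output")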

Stages

Spark creates a separate stage for each dataset that is read.

All consecutive narrow transformations (e.g. FILTER, MAP) are grouped together inside the same stage.

Spark creates a new stage when it encounters a wide transformation (e.g. JOIN, reduceByKey), because a wide transformation requires shuffling data across partitions.

For the above set of instructions, Spark will create 3 stages –

First stage – Instructions 1, 2 and 3

Second stage – Instructions 4 and 5

Third stage – Instructions 6, 7 and 8
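
If you want to verify where Spark places these stage boundaries, one way (a sketch, assuming the joined DataFrame from the example above) is to print the physical plan: each Exchange operator marks a shuffle, and a shuffle is exactly where one stage ends and the next begins. The Spark UI shows the same stage breakdown once the job runs.

# Print the physical plan before running the job.
# Exchange nodes in the plan mark shuffle boundaries, i.e. stage boundaries.
joined.explain()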

Tasks

Spark creates a task to execute a set of instructions inside a stage.

The number of tasks in a stage equals the number of partitions of the dataset processed by that stage.

Each task executes all consecutive narrow transformations inside a stage on its own partition – this is called pipelining.

Tasks in the first stage will execute instructions 1, 2 and 3.

Tasks in the second stage will execute instructions 4 and 5.

Tasks in the third stage will execute instructions 6, 7 and 8.
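
As a rough sketch of how the task counts come about (again assuming the DataFrames from the example above): the first two stages get one task per input partition, while the third stage reads the shuffled data, so its task count is driven by the shuffle partition setting.

# Tasks in a stage = partitions processed by that stage.
print(dataset_x.rdd.getNumPartitions())  # tasks in the first stage
print(dataset_y.rdd.getNumPartitions())  # tasks in the second stage

# The join shuffles the data; by default the shuffled output has
# spark.sql.shuffle.partitions partitions (200 unless overridden),
# which determines the number of tasks in the third stage.
print(spark.conf.get("spark.sql.shuffle.partitions"))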
