How does Spark decide stages and tasks during execution of a Job?
Let's walk through an example. Below is the series of instructions in our Spark code (a sketch of this pipeline in code follows the list); let's see how Spark decides on stages and tasks for it.
1. READ dataset_X
2. FILTER on dataset_X
3. MAP operation on dataset_X
4. READ dataset_Y
5. MAP operation on dataset_Y
6. JOIN dataset_X and dataset_Y
7. FILTER on joined dataset
8. SAVE the output
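Here is a minimal PySpark sketch of that pipeline, assuming dataset_X and dataset_Y are comma-separated text files of key-value pairs; the paths, parsing logic, and output location are illustrative placeholders, not part of the original example.

```python
from pyspark.sql import SparkSession

# Hypothetical paths and parsing logic, only to make the eight instructions concrete.
spark = SparkSession.builder.appName("stage-demo").getOrCreate()
sc = spark.sparkContext

dataset_x = (sc.textFile("hdfs:///data/dataset_X")            # 1. READ dataset_X
             .filter(lambda line: line.strip() != "")          # 2. FILTER on dataset_X
             .map(lambda line: tuple(line.split(",", 1))))     # 3. MAP to (key, value) pairs

dataset_y = (sc.textFile("hdfs:///data/dataset_Y")             # 4. READ dataset_Y
             .map(lambda line: tuple(line.split(",", 1))))     # 5. MAP to (key, value) pairs

joined = dataset_x.join(dataset_y)                             # 6. JOIN (wide: needs a shuffle)
result = joined.filter(lambda kv: kv[1][0] != kv[1][1])        # 7. FILTER on joined dataset
result.saveAsTextFile("hdfs:///data/output")                   # 8. SAVE (action: triggers the job)
```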
Stages
- Spark creates a separate stage for each dataset that is read.
- All consecutive narrow transformations (e.g. FILTER, MAP) are grouped together inside the same stage.
- Spark creates a new stage when it encounters a wide transformation (e.g. JOIN, reduceByKey), because a wide transformation requires shuffling data across partitions.

For the above set of instructions, Spark will create three stages (the lineage sketch after this list shows one way to verify the boundaries):
- First stage: instructions 1, 2 and 3 (READ dataset_X, FILTER, MAP)
- Second stage: instructions 4 and 5 (READ dataset_Y, MAP)
- Third stage: instructions 6, 7 and 8 (JOIN, FILTER, SAVE)
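A quick way to check these boundaries yourself is the RDD lineage: `toDebugString()` prints the dependency chain with an extra indentation level at each shuffle dependency, and the Spark UI (port 4040 by default while the application runs) lists the same stages once the job is triggered. A small sketch, reusing the hypothetical `result` RDD from the pipeline above:

```python
# toDebugString() returns the RDD lineage; each extra indentation level marks a
# shuffle dependency, i.e. a stage boundary introduced here by the wide JOIN.
# In PySpark it returns bytes, so decode before printing.
print(result.toDebugString().decode("utf-8"))
```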
Tasks
- Spark creates tasks to execute the instructions inside a stage.
- The number of tasks in a stage equals the number of partitions of the dataset it processes (see the sketch after this list).
- Each task works on a single partition and runs all the consecutive narrow transformations of the stage on that partition; this is called pipelining.
- Tasks in the first stage will execute instructions 1, 2 and 3.
- Tasks in the second stage will execute instructions 4 and 5.
- Tasks in the third stage will execute instructions 6, 7 and 8.
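As a rough illustration of the partitions-to-tasks relationship, the snippet below (reusing the hypothetical RDDs from the earlier sketch) prints the partition counts: the first two stages launch one task per partition of their input, and the third stage launches one task per partition of the joined RDD.

```python
# Tasks in stage 1 and stage 2: one per partition of each input RDD.
print("dataset_X partitions:", dataset_x.getNumPartitions())
print("dataset_Y partitions:", dataset_y.getNumPartitions())

# Tasks in stage 3: one per partition of the shuffled, joined RDD. Passing
# numPartitions to join(), e.g. dataset_x.join(dataset_y, numPartitions=8),
# would fix the post-shuffle stage at eight tasks.
print("joined partitions:", joined.getNumPartitions())
```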