What are applications, jobs, stages and tasks in Spark?

We get a lot of questions about the differences between Spark applications, jobs, stages and tasks. We also see a lot of misunderstanding about these topics among new learners and experienced Spark developers alike.

Task

A task is the smallest execution unit in Spark. A task executes a series of instructions on a single partition of data. For example, reading data, filtering it and applying map() can be combined into one task. Tasks are executed inside an executor, as in the sketch below.
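As a minimal PySpark sketch (the session setup, app name and numbers are illustrative), the chain below is pipelined into one task per partition because none of the steps needs a shuffle:

# Minimal sketch: narrow operations (filter, map) are pipelined into one task per partition.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("task-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100), numSlices=4)   # 4 partitions -> 4 tasks
result = (rdd.filter(lambda x: x % 2 == 0)      # each task filters its own partition...
             .map(lambda x: x * 10)             # ...then maps it, in the same task
             .collect())                        # action that actually runs the tasks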

Stage

A stage comprises several tasks, and every task in the stage executes the same set of instructions, each on a different partition of the data.
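A quick way to see this (a sketch assuming the spark session from the previous example) is that the number of tasks in a stage matches the number of partitions the stage processes:

# Sketch: a stage launches one task per partition.
df = spark.range(0, 1000, numPartitions=8)      # 8 partitions
print(df.rdd.getNumPartitions())                # prints 8 -> the stage runs 8 tasks,
                                                # all executing the same instructions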

Job

A job comprises several stages. When Spark encounters a transformation that requires a shuffle, it creates a new stage. Transformations like reduceByKey(), join() etc. trigger a shuffle and therefore start a new stage. Spark also creates a stage when you read a dataset.
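For example (a hedged sketch reusing the SparkContext from above), reduceByKey() introduces a shuffle, so the single job below is split into two stages: one that reads and maps the pairs, and one that aggregates them after the shuffle.

# Sketch: one action, one job, two stages separated by a shuffle.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], numSlices=2)
counts = pairs.reduceByKey(lambda x, y: x + y)  # shuffle boundary -> new stage
counts.collect()                                # the action submits one job with two stages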

Application

An application comprises several jobs. A job is created whenever you execute an action such as write(), count() or collect().
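In the rough sketch below (the output path is illustrative), each action submits its own job, so this one application runs at least two jobs:

# Sketch: every action triggers a separate job within the same application.
df = spark.range(0, 10000)
df.count()                                               # action -> job 1
df.write.mode("overwrite").parquet("/tmp/demo_output")   # action -> job 2 (hypothetical path)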

Summary

A Spark application can have many jobs. A job can have many stages. A stage can have many tasks. A task executes a series of instructions.
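You can see this hierarchy yourself in the Spark UI (http://localhost:4040 when running in local mode) by drilling down from Jobs to Stages to Tasks. The sketch below, with an illustrative input path, labels the work so it is easy to find in the UI:

# Sketch: label the job, run an action, then inspect Jobs -> Stages -> Tasks in the Spark UI.
sc.setJobDescription("wordcount demo")           # names the next job in the UI
(sc.textFile("/tmp/input.txt")                   # hypothetical input path
   .flatMap(lambda line: line.split())
   .map(lambda w: (w, 1))
   .reduceByKey(lambda a, b: a + b)              # shuffle -> second stage
   .collect())                                   # action -> one job, two stages, many tasks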
