Why do I see 200 tasks during Spark execution?

It is quite common to see 200 tasks in one of your stages, specifically a stage that requires a wide transformation. The reason is that wide transformations in Spark require a shuffle. Operations such as join and group by are wide transformations, and they trigger a shuffle.

By default, Spark creates 200 partitions whenever a shuffle is needed. Each partition is processed by a task, so you end up with 200 tasks during execution.
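For example, here is a minimal PySpark sketch (the session name and grouping expression are purely illustrative) of a wide transformation picking up the default 200 shuffle partitions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-task-demo").getOrCreate()

# Illustrative DataFrame with a single "id" column.
df = spark.range(0, 1000000)

# groupBy is a wide transformation, so Spark shuffles the data.
# With the default spark.sql.shuffle.partitions = 200, the post-shuffle
# stage runs 200 tasks (visible in the Spark UI).
counts = df.groupBy((df.id % 10).alias("bucket")).count()

# Unless adaptive query execution coalesces partitions, this prints 200.
print(counts.rdd.getNumPartitions())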

How do I change the default 200 tasks?

The spark.sql.shuffle.partitions property controls the number of partitions created during a shuffle, and its default value is 200.

Set spark.sql.shuffle.partitions to a different value to change the number of partitions used during a shuffle.

sqlContext.setConf("spark.sql.shuffle.partitions", "4")
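On newer Spark versions where SparkSession is the entry point, the equivalent configuration call looks roughly like this (the session name is an assumption):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-config-demo").getOrCreate()

# Same effect as the sqlContext call above, via the SparkSession config API.
spark.conf.set("spark.sql.shuffle.partitions", "4")

print(spark.conf.get("spark.sql.shuffle.partitions"))  # prints "4"

You can also set it once per application at submit time with spark-submit --conf spark.sql.shuffle.partitions=4.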

Should you change the default?

The short answer is: it depends.

200 partitions can be too many when your data volume is small. Why? Each partition is processed by its own task, and when every task handles only a tiny amount of data, the scheduling and task-launch overhead outweighs the useful work. Decreasing the number of partitions gives you fewer tasks, each doing a meaningful amount of work, which usually improves performance.

If you write the data out after a shuffle and your data volume is small, you will end up with up to 200 small files by default. This is another reason to consider lowering the setting.
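As a rough sketch (the dataset and output path are made up), lowering the setting before a shuffle-then-write keeps both the task count and the output file count small:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-shuffle-demo").getOrCreate()

# A tiny, illustrative dataset; 200 shuffle partitions would be overkill.
orders = spark.createDataFrame(
    [(1, "A", 10.0), (2, "B", 20.0), (3, "A", 5.0)],
    ["order_id", "customer", "amount"],
)

# Lower the shuffle partition count before the wide transformation.
spark.conf.set("spark.sql.shuffle.partitions", "4")

totals = orders.groupBy("customer").sum("amount")

# The write now produces at most 4 part files instead of up to 200.
totals.write.mode("overwrite").parquet("/tmp/customer_totals")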

200 partitions can be too few when the amount of data involved in the shuffle is huge. If each task processes too much data, you may see out-of-memory exceptions or slow task execution. This can be fixed by increasing the number of partitions, thereby increasing the number of tasks, which in turn lets each task process a manageable amount of data.
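Conversely, here is a sketch of raising the setting for a heavy aggregation (the table name, output path, and the value 1000 are assumptions you would tune for your data and cluster):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("large-shuffle-demo").getOrCreate()

# More partitions mean more tasks, each handling a smaller,
# more manageable slice of the shuffled data.
spark.conf.set("spark.sql.shuffle.partitions", "1000")

# "events" stands in for a large table registered in the catalog.
events = spark.table("events")

daily_counts = events.groupBy("event_date").count()
daily_counts.write.mode("overwrite").parquet("/tmp/daily_counts")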
