How does a 'sort merge' join help performance when joining 2 large tables in Spark?
Apache Spark Join Optimization – 2 Large Tables
The intent should be to minimize shuffling and maximize parallelism.
Minimize Shuffling –
Filter the data before the shuffle.
Cut the dataset down as early as possible so that less data has to move,
and push any aggregations ahead of the join as well, as in the sketch below.
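A minimal Scala sketch of this idea (the paths, column names, and filter condition are hypothetical placeholders, not from the original):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("minimize-shuffle-sketch").getOrCreate()

val orders = spark.read.parquet("/data/orders")        // assumed large fact table

// Drop rows and columns, and pre-aggregate, BEFORE the join,
// so far less data crosses the network during the shuffle.
val slimOrders = orders
  .filter(col("order_date") >= "2024-01-01")           // filter early
  .select("customer_id", "amount")                     // keep only the needed columns
  .groupBy("customer_id")
  .agg(sum("amount").as("total_amount"))               // aggregate before joining

val customers = spark.read.parquet("/data/customers")  // assumed second large table
val result = slimOrders.join(customers, Seq("customer_id"))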
Maximize Parallelism –
1. Procure the right number of CPU cores.
If you request 20 executors with 5 cores each,
then at most 100 tasks can run in parallel.
2. Set the right number of shuffle partitions.
If spark.sql.shuffle.partitions is set to 50,
then at most 50 tasks can run in parallel.
3. The join key's cardinality should be high.
If there are only 20 distinct key values,
then at most 20 tasks will be doing work in parallel.
So if you get 20 containers with 5 cores each (100 cores),
your shuffle partitions are set to 50,
and there are only 20 distinct keys,
then still only 20 tasks can run in parallel.
Max Parallelism = Min(total_CPU_Cores, shuffle_Partitions, Cardinality)
4. Avoid partition skew – Consider 20 tasks where 1 of them is overloaded with work.
The completion of the job is bounded by the slowest task (see the configuration sketch after this list).
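As a rough sketch, the knobs from points 1, 2, and 4 map onto the settings below. The executor, core, and partition counts are the illustrative values from above, not recommendations, and the adaptive-execution settings require Spark 3.x:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-parallelism-sketch")
  .config("spark.executor.instances", "20")              // point 1: 20 executors...
  .config("spark.executor.cores", "5")                   // ...x 5 cores = 100 task slots
  .config("spark.sql.shuffle.partitions", "50")          // point 2: at most 50 parallel shuffle tasks
  .config("spark.sql.adaptive.enabled", "true")          // point 4: Spark 3.x AQE...
  .config("spark.sql.adaptive.skewJoin.enabled", "true") // ...can split skewed partitions
  .getOrCreate()

Cardinality (point 3) is a property of the data itself rather than a configuration knob.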
Answer:
Here the intent is to minimize the expensive work done during and after the shuffle, and the sort-merge join helps achieve exactly that.
Most join algorithms first repartition the data and then join it. After repartitioning, sort-merge join first sorts and then merges: the sorting happens at the partition level, with all partitions sorted in parallel. At merge time, the Spark engine simply has to merge the already-sorted data present in each partition.
A simple concept – sorting one big dataset takes longer than sorting many smaller partitions in parallel, and merging already-sorted inputs is a cheap linear pass. So sort first, then simply merge.
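A minimal sketch of a sort-merge join between two large DataFrames (the paths and the join key "customer_id" are hypothetical; disabling the broadcast threshold just forces the large-table code path for the demo):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("smj-sketch").getOrCreate()

// With broadcasting disabled and both sides large,
// the planner falls back to SortMergeJoin.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val orders    = spark.read.parquet("/data/orders")     // assumed large table
val customers = spark.read.parquet("/data/customers")  // assumed large table

val joined = orders.join(customers, Seq("customer_id"))

// The physical plan should show a SortMergeJoin node, with an Exchange
// (the shuffle/repartitioning) and a Sort on each input side.
joined.explain()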
Also, sort-merge join has been Spark's default algorithm for joining two large tables since Spark 2.3: the property spark.sql.join.preferSortMergeJoin is true by default.
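Continuing with the session from the sketch above, that default can be restated explicitly (a no-op on modern Spark; flipping it to false would let the planner consider a shuffled hash join where its size heuristics allow):

// Prefer SortMergeJoin over ShuffledHashJoin when no broadcast is possible.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")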