Showing posts from March, 2022

Spark Join Strategies (Internals of Spark Joins & Spark’s choice of Join Strategy)

While dealing with data, we have to deal with different kinds of joins, be it inner, outer, left or (maybe) left-semi. This article covers the different join strategies Spark employs to perform join operations. Knowing Spark’s join internals comes in handy for optimizing tricky join operations, for finding the root cause of some out-of-memory errors, and for improving the performance of Spark jobs (we all want that, don’t we?). Please read on to find out.

Broadcast Hash Join

Before getting to Broadcast Hash Join in Spark, let’s first understand Hash Join in general. As the name suggests, a Hash Join is performed by first building a hash table on the join_key of the smaller relation and then looping over the larger relation, probing the hash table for matching join_key values. It is only supported for ‘=’ joins (equi-joins). In Spark, Hash Join works at the per-node level: the strategy is used to join the partitions available on a node.

Now, coming to Broadcast Hash Join. In broadcast hash join, a copy of one of