Broadcast Nested Loop in detail in spark
Broadcast Nested Loop join works by broadcasting one of the entire datasets and performing a nested loop to join the data. So essentially every record from dataset 1 is attempted to join with every record from dataset 2. As you could guess, Broadcast Nested Loop is not preferred and could be quite slow. It works for both equi and non-equi joins and it is picked by default when you have a non-equi join. Example We don’t change the default values for both spark.sql.join.preferSortMergeJoin and spark.sql.autoBroadcastJoinThreshold . scala> spark.conf.get("spark.sql.join.preferSortMergeJoin")res0: String = truescala> spark.conf.get("spark.sql.autoBroadcastJoinThreshold")res1: String = 10485760 scala> val data1 = Seq(10, 20, 20, 30, 40, 10, 40, 20, 20, 20, 20, 50)data1: Seq[Int] = List(10, 20, 20, 30, 40, 10, 40, 20, 20, 20, 20, 50)scala> val df1 = data1.toDF("id1")df1: org.apache.spark.sql.DataFrame = [id1: int]scala> val data2 = Seq(30, ...