Do broadcast variables work for DataFrames in Spark?
You can't create a broadcast variable directly from a DataFrame. If what you want is a broadcast join, however, and you are using Spark 1.5 or newer, you can still get one like this:
    from pyspark.sql import SQLContext
    from pyspark.sql.functions import broadcast

    sqlContext = SQLContext(sc)
    df_tiny = sqlContext.sql('select * from tiny_table')
    df_large = sqlContext.sql('select * from massive_table')
    df3 = df_large.join(broadcast(df_tiny), df_large.some_sort_of_key == df_tiny.key)

By using the broadcast function on your DataFrame, you mark it as small enough for use in broadcast joins.