Does a broadcast variable work for a DataFrame in Spark?

We can’t create a broadcast variable for a DataFrame. However, if you are using Spark 1.5 or newer, you can still do a broadcast join, as follows:

from pyspark.sql import SQLContext
from pyspark.sql.functions import broadcast

# sc is an existing SparkContext
sqlContext = SQLContext(sc)

df_tiny = sqlContext.sql('select * from tiny_table')
df_large = sqlContext.sql('select * from massive_table')

# Wrap the small DataFrame in broadcast() to mark it for a broadcast join
df3 = df_large.join(broadcast(df_tiny), df_large.some_sort_of_key == df_tiny.key)

By calling the broadcast function on your DataFrame, you mark it as small enough for use in broadcast joins, so Spark can send a full copy of it to every executor instead of shuffling the large DataFrame.
