Spark scenario-based questions

Consider you have a 40-node Spark cluster (32 cores x 128 GB per node).

You are processing roughly 3.75 TB of data.

The processing involves filtering, aggregations, joins, etc.

You are getting an out-of-memory error when you run your Spark job.
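
Before getting into the questions, it helps to have a concrete picture of how such a job might be configured. The PySpark sketch below is one plausible baseline for a cluster of this size; the executor count, cores, and memory figures are illustrative assumptions, not values given in the scenario.

from pyspark.sql import SparkSession

# Illustrative baseline for a 40-node cluster (32 cores x 128 GB per node),
# assuming the job is launched with spark-submit on YARN or a similar
# resource manager. With ~5 cores per executor, each node can host about
# 6 executors, leaving roughly 18 GB of heap plus ~3 GB of overhead each.
# These numbers are assumptions for the sketch, not from the original post.
spark = (
    SparkSession.builder
    .appName("spark-oom-scenario")
    .config("spark.executor.instances", "235")   # ~6 per node, minus driver/AM
    .config("spark.executor.cores", "5")
    .config("spark.executor.memory", "18g")
    .config("spark.executor.memoryOverhead", "3g")
    .getOrCreate()
)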

Question 1:
===========

What could be all the possible reasons for out-of-memory errors?
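
As a starting point for thinking about this, the sketch below shows a few code patterns that commonly sit behind such errors. The DataFrames are tiny stand-ins so the example runs; in the scenario they would be multi-terabyte tables, and the table and column names are made up for illustration.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("oom-causes-sketch").getOrCreate()

# Tiny stand-in DataFrames so the sketch runs; names are hypothetical.
orders = spark.createDataFrame(
    [(1, 100.0), (1, 250.0), (2, 75.0)], ["customer_id", "amount"]
)
customers = spark.createDataFrame([(1, "A"), (2, "B")], ["customer_id", "name"])

# 1. Driver-side OOM: collect() pulls the entire result into driver memory.
rows = orders.collect()

# 2. Executor-side OOM from a skewed shuffle: if most rows share one
#    customer_id, a single task receives most of the joined data.
joined = orders.join(customers, "customer_id")

# 3. Oversized partitions during aggregation when the shuffle partition
#    count is too low for the data volume.
totals = joined.groupBy("customer_id").agg(F.sum("amount").alias("total"))
totals.show()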

Question 2:
==========

What are the ways to identify the exact issue?
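
Alongside the Spark UI (stage-level shuffle read/write sizes, task-time skew) and the driver and executor logs, a few programmatic checks can help narrow down where the memory pressure comes from. The sketch continues the hypothetical DataFrames from above.

# Continues the hypothetical `joined` DataFrame from the sketch above.

# Partition count: a very low count means each task handles a huge slice.
print(joined.rdd.getNumPartitions())

# Rows per partition: a heavily skewed distribution points to a hot key.
partition_sizes = joined.rdd.glom().map(len).collect()
print(sorted(partition_sizes, reverse=True)[:10])

# Physical plan: shows whether Spark chose a sort-merge or broadcast join
# and where the shuffles (exchanges) happen.
joined.explain(mode="formatted")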

Question 3:
==========

What are the ways to fix it?
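
Without prescribing a single answer, the sketch below shows a few knobs and code changes that are commonly tried for this kind of failure; the values are assumptions and would need to be validated against the job's own metrics. It continues the session and DataFrames from the earlier sketches.

from pyspark.sql import functions as F

# Continues `spark`, `orders`, and `customers` from the sketches above.

# Let adaptive query execution coalesce shuffle partitions and split
# skewed ones at runtime (Spark 3.x).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Raise the shuffle partition count so each task holds less data in memory.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

# Broadcast the smaller side of the join instead of shuffling both sides.
joined = orders.join(F.broadcast(customers), "customer_id")

# Write results out instead of collecting them to the driver.
(joined.groupBy("customer_id")
       .agg(F.sum("amount").alias("total"))
       .write.mode("overwrite")
       .parquet("/tmp/customer_totals"))   # output path is a placeholder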
