What is garbage collection in Spark, what is its impact, and how do you resolve it?

Garbage Collection

Spark runs on the Java Virtual Machine (JVM). Because Spark can store large amounts of data in memory, it relies heavily on Java’s memory management and garbage collection (GC). As a result, GC can become a major issue affecting many Spark applications.

Common symptoms of excessive GC in Spark are:

  • Slowness of application
  • Executor heartbeat timeout
  • GC overhead limit exceeded error

Spark’s memory-centric approach and data-intensive workloads make GC a more common problem than in typical Java applications. Thankfully, it’s easy to diagnose whether your Spark application is suffering from a GC problem: the Spark UI marks executors in red if they have spent too much time doing GC.

You can see whether executors are spending a significant share of their CPU cycles on garbage collection by looking at the “Executors” tab in the Spark application UI. Spark marks an executor in red if it has spent more than 10% of its task time in garbage collection, as you can see in the diagram below.

The Spark UI indicates excessive GC in red
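Beyond the Spark UI, a common complementary check is to enable GC logging on the executors via spark.executor.extraJavaOptions so that long pauses can be correlated with slow tasks. The sketch below is illustrative only (the flags shown are the classic JDK 8 ones; newer JDKs use -Xlog:gc instead):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch: surface GC activity in each executor's stdout.
// Adjust the JVM flags to match your JDK version.
val spark = SparkSession.builder()
  .appName("gc-logging-example")
  .config("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
  .getOrCreate()
```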

ADDRESSING GARBAGE COLLECTION ISSUES

Here are some of the basic things we can do to try to address GC issues.

DATA STRUCTURES

In RDD-based applications, prefer data structures that create fewer objects. For example, use an array instead of a linked list, as sketched below.
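A rough sketch of the difference (the application name, local master, and numbers are illustrative, not from the original post):

```scala
import org.apache.spark.sql.SparkSession

object FewerObjects {
  def main(args: Array[String]): Unit = {
    // local[*] master used here only so the sketch runs standalone
    val spark = SparkSession.builder()
      .appName("fewer-objects").master("local[*]").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 1000000)

    // Heavier on the GC: each row becomes a List of boxed Integers,
    // one cons cell plus one wrapper object per element.
    val boxed = rdd.map(i => List(i, i + 1, i + 2))

    // Lighter on the GC: Array[Int] is a single flat primitive array,
    // with no per-element boxing.
    val primitive = rdd.map(i => Array(i, i + 1, i + 2))

    println(s"${boxed.count()} / ${primitive.count()}")
    spark.stop()
  }
}
```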

SPECIALIZED DATA STRUCTURES

If you are dealing with primitive data types, consider using specialized data structures such as Koloboke or fastutil. These structures store primitives directly and avoid per-element boxing, so they create far fewer objects; see the sketch below.
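For example, assuming fastutil is on the executor classpath, a per-partition frequency count can use a primitive-keyed map instead of a boxed java.util.HashMap. This is an illustrative sketch, not code from the post:

```scala
import it.unimi.dsi.fastutil.ints.Int2IntOpenHashMap
import org.apache.spark.rdd.RDD

// Per-partition frequency count keyed by primitive ints.
// A java.util.HashMap[Integer, Integer] would box every key and value;
// Int2IntOpenHashMap stores plain ints, so far fewer objects are created.
def countValues(rdd: RDD[Int]): RDD[(Int, Int)] =
  rdd.mapPartitions { iter =>
    val freq = new Int2IntOpenHashMap()
    iter.foreach(v => freq.addTo(v, 1))   // addTo increments the count in place
    val keys = freq.keySet().toIntArray   // primitive int[] of distinct keys
    keys.iterator.map(k => (k, freq.get(k)))
  }
```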

STORING DATA OFF-HEAP

Both the Spark execution engine and Spark storage can keep data off-heap. You can switch on off-heap memory with the following settings:

  • --conf spark.memory.offHeap.enabled=true
  • --conf spark.memory.offHeap.size=Xgb

Be careful when using off-heap storage: enabling it does not shrink the on-heap (JVM heap) size. So to keep the overall memory footprint within a limit, assign a correspondingly smaller heap size.
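A minimal sketch of both ideas together, with illustrative sizes (2 GB off-heap plus a deliberately smaller 2 GB executor heap; adjust for your cluster):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only: enable off-heap memory and shrink the heap
// so the total per-executor memory stays bounded.
val spark = SparkSession.builder()
  .appName("offheap-example")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")
  .config("spark.executor.memory", "2g")   // smaller heap to compensate
  .getOrCreate()
```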

BUILT-IN VS USER DEFINED FUNCTIONS (UDFS)

If you are using Spark SQL, try to use the built-in functions as much as possible rather than writing new UDFs. Most of Spark’s built-in functions can operate directly on the internal UnsafeRow format and don’t need to convert values into JVM wrapper types. This avoids creating garbage, and it also plays well with whole-stage code generation.
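The difference is easiest to see side by side. The following sketch (with a made-up toy DataFrame) contrasts a UDF with the equivalent built-in function:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf, upper}

val spark = SparkSession.builder().appName("builtin-vs-udf").getOrCreate()
import spark.implicits._

val df = Seq("alice", "bob", "carol").toDF("name")

// UDF version: each row is deserialized from Spark's internal binary format
// into a java.lang.String, processed, then serialized back, creating
// throwaway objects per row.
val upperUdf = udf((s: String) => s.toUpperCase)
df.select(upperUdf(col("name"))).show()

// Built-in version: upper() is understood by Catalyst, works on the internal
// row format, and participates in whole-stage code generation.
df.select(upper(col("name"))).show()
```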

BE STINGY ABOUT OBJECT CREATION

Remember that we may be working with billions of rows. If we create even a small 100-byte temporary object for every row, then over 1 billion rows we generate roughly 1 billion × 100 bytes ≈ 100 GB of short-lived garbage for the collector to clean up.
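One way to be stingy is to hoist per-row allocations out of the row loop, for example with mapPartitions. The snippet below is an illustrative sketch (the helper names are made up), and the actual gain is workload-dependent:

```scala
import org.apache.spark.rdd.RDD

// Allocation-heavy: string concatenation builds several intermediate
// String/StringBuilder objects for every row.
def formatNaive(rdd: RDD[(Int, Double)]): RDD[String] =
  rdd.map { case (id, score) => "id=" + id + ",score=" + score }

// Leaner: one StringBuilder per partition, reused for every row.
// Safe here because each row emits an immutable String, not the builder itself.
def formatReused(rdd: RDD[(Int, Double)]): RDD[String] =
  rdd.mapPartitions { iter =>
    val sb = new StringBuilder(64)
    iter.map { case (id, score) =>
      sb.setLength(0)
      sb.append("id=").append(id).append(",score=").append(score)
      sb.toString
    }
  }
```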
