What is Apache Spark? Its life cycle and components in detail

Apache Spark is a general-purpose cluster computing system for processing big data workloads. What sets Spark apart from its predecessors, such as MapReduce, is its speed, ease of use, and sophisticated analytics.

Apache Spark was originally developed at AMPLab, UC Berkeley, in 2009. It was open sourced in 2010 under the BSD license and switched to the Apache 2.0 license in 2013. In late 2013, the creators of Spark founded Databricks to focus on Spark's development and future releases.

As for speed, Spark can achieve sub-second latency on big data workloads. To achieve such low latency, Spark uses memory for storage: whereas MapReduce uses memory primarily for the actual computation, Spark uses memory both to compute and to store objects.
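To make the contrast concrete, here is a minimal PySpark sketch of in-memory caching; the input path is a placeholder, not from the original post:

```python
from pyspark import SparkContext

sc = SparkContext(appName="cache-demo")

# Placeholder input path; any sizeable text file works.
logs = sc.textFile("hdfs:///data/logs.txt")

# cache() keeps the RDD in executor memory once it has been computed.
logs.cache()

# The first action reads from disk and fills the cache ...
errors = logs.filter(lambda line: "ERROR" in line).count()

# ... while the second action is served from memory, which is where
# Spark's speed advantage over disk-bound MapReduce comes from.
warnings = logs.filter(lambda line: "WARN" in line).count()
```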

Spark also provides a unified runtime connecting to various big data storage sources, such as HDFS, Cassandra, HBase, and S3. It also provides a rich set of higher-level libraries for different big data compute tasks, such as machine learning, SQL processing, graph processing, and real-time streaming. These libraries make development faster and can be combined in an arbitrary fashion, as the sketch below illustrates.
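The sketch mixes the core RDD API with the Spark SQL library in one application; the data and column names are made up, and the Spark 1.x-era SQLContext API is assumed:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="unified-demo")
sqlContext = SQLContext(sc)

# Start with the core RDD API (placeholder data).
events = sc.parallelize([Row(user="a", clicks=3), Row(user="b", clicks=7)])

# Hand the same data to the SQL library without leaving the application.
df = sqlContext.createDataFrame(events)
df.registerTempTable("events")
sqlContext.sql("SELECT user, clicks FROM events WHERE clicks > 5").show()
```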

The following figure shows the Spark ecosystem:

The Spark runtime runs on top of a variety of cluster managers, including YARN (Hadoop’s compute framework), Mesos, and Spark’s own cluster manager called standalone mode.
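The choice of cluster manager is expressed through the master URL when the application is created. A minimal sketch, with placeholder host names, default ports, and Spark 1.x-era master strings assumed:

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("master-demo")

# Exactly one master URL is active; the commented lines show alternatives.
conf.setMaster("local[*]")                    # run locally, one thread per core
# conf.setMaster("spark://master-host:7077")  # Spark standalone cluster manager
# conf.setMaster("mesos://mesos-host:5050")   # Apache Mesos
# conf.setMaster("yarn-client")               # YARN, client mode

sc = SparkContext(conf=conf)
```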

Tachyon is a memory-centric distributed file system that enables reliable file sharing at memory speed across cluster frameworks. In short, it is an off-heap storage layer in memory, which helps share data across jobs and users.

Mesos is a cluster manager, which is evolving into a data center operating system.

YARN is Hadoop's compute framework; its robust resource management features can be used by Spark seamlessly.

 

The following diagram depicts the high-level architecture of a Spark cluster:

Yet Another Resource Negotiator (YARN) is Hadoop's compute framework that runs on top of HDFS, Hadoop's storage layer. YARN follows a master-slave architecture: the master daemon is called the ResourceManager and the slave daemon is called the NodeManager. Besides this, application life-cycle management is done by the ApplicationMaster, which can be spawned on any slave node and stays alive for the lifetime of an application. When Spark runs on YARN, the ResourceManager performs the role of the Spark master and the NodeManagers work as executor nodes. Each Spark executor runs as a YARN container.

Spark applications on YARN run in two modes:

  • yarn-client: The Spark driver runs in the client process outside of the YARN cluster, and the ApplicationMaster is used only to negotiate resources from the ResourceManager
  • yarn-cluster: The Spark driver runs in the ApplicationMaster, spawned by a NodeManager on a slave node

The yarn-cluster mode is recommended for production deployments, while the yarn-client mode is good for development and debugging, when you would like to see immediate output. There is no need to specify the Spark master in either mode, as it is picked from the Hadoop configuration; the master parameter is either yarn-client or yarn-cluster.
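A minimal sketch of the two modes, assuming Spark 1.x-era master strings (the application and file names are placeholders):

```python
from pyspark import SparkConf, SparkContext

# yarn-client: the driver runs here, in the client process, and the
# ApplicationMaster only negotiates resources from the ResourceManager.
conf = SparkConf().setAppName("yarn-modes-demo").setMaster("yarn-client")
sc = SparkContext(conf=conf)

# yarn-cluster cannot be selected from driver code, because the driver
# itself runs inside the ApplicationMaster; the application is instead
# submitted from the command line, along the lines of:
#   spark-submit --master yarn-cluster my_app.py
```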

The following figure shows how Spark is run with YARN in client mode:

 

In YARN mode, the following configuration parameters can be set (a programmatic sketch follows the list):

  • --num-executors: Configures how many executors will be allocated
  • --executor-memory: RAM per executor
  • --executor-cores: CPU cores per executor
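These are spark-submit flags; each one maps to a Spark configuration property that can also be set programmatically. A hedged sketch of the equivalents (the values are arbitrary examples):

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("yarn-resources-demo")
        .setMaster("yarn-client")
        .set("spark.executor.instances", "4")  # equivalent of --num-executors 4
        .set("spark.executor.memory", "2g")    # equivalent of --executor-memory 2g
        .set("spark.executor.cores", "2"))     # equivalent of --executor-cores 2

sc = SparkContext(conf=conf)
```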

Tachyon in detail:

Using Tachyon as an off-heap storage layer:

Spark RDDs are a great way to store datasets in memory, but they can end up creating multiple copies of the same data across different applications. Tachyon solves some of the challenges of Spark RDD management. A few of them are:

  • An RDD only exists for the duration of its Spark application
  • The same process performs both the compute and the RDD in-memory storage, so if the process crashes, the in-memory storage goes away with it
  • Different jobs cannot share an RDD even when they are for the same underlying data (for example, the same HDFS block), which leads to slow writes to disk and duplication of data in memory, with a higher memory footprint
  • If the output of one application needs to be shared with another application, it is slow due to replication on disk

Tachyon provides an off-heap memory layer to solve these problems. This layer, being off-heap, is immune to process crashes and is also not subject to garbage collection.

This also lets RDDs be shared across applications and outlive a specific job or session; in essence, a single copy of the data resides in memory, as the following figure shows.
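A hedged sketch of this sharing pattern in Spark 1.x-era PySpark, assuming the Tachyon client jars are on Spark's classpath and a Tachyon master is running; the host, port (19998 was Tachyon's default), and paths are placeholders:

```python
from pyspark import SparkContext

sc = SparkContext(appName="tachyon-demo")

data = sc.textFile("hdfs:///data/input.txt")  # placeholder input

# Persist the dataset into Tachyon's off-heap memory layer; the copy
# lives outside the executor JVMs, so it survives process crashes and
# outlives this application.
data.saveAsTextFile("tachyon://tachyon-master:19998/shared/input")

# Any other Spark application can then read the same single in-memory
# copy (shown here in the same application for brevity):
shared = sc.textFile("tachyon://tachyon-master:19998/shared/input")
print(shared.count())
```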

 
