Spark Advanced Tutorials (Complete guide book)

Introduction

Apache Spark is a general-purpose cluster computing system to process big data workloads. What sets Spark apart from its predecessors, such as MapReduce, is its speed, ease-of-use, and sophisticated analytics.

Apache Spark was originally developed at AMPLab, UC Berkeley, in 2009. It was made open source in 2010 under the BSD license and switched to the Apache 2.0 license in 2013. Toward the later part of 2013, the creators of Spark founded Databricks to focus on Spark’s development and future releases.

Talking about speed, Spark can achieve sub-second latency on big data workloads. To achieve such low latency, Spark makes use of the memory for storage. In MapReduce, memory is primarily used for actual computation. Spark uses memory both to compute and store objects.

Spark also provides a unified runtime connecting to various big data storage sources, such as HDFS, Cassandra, HBase, and S3. It also provides a rich set of higher-level libraries
for different big data compute tasks, such as machine learning, SQL processing, graph processing, and real-time streaming. These libraries make development faster and can be combined in an arbitrary fashion.

Though Spark is written in Scala, Spark also supports Java and Python.

Spark is an open source community project, and everyone uses the pure open source Apache distributions for deployments, unlike Hadoop, which has multiple distributions available with vendor enhancements.

The following figure shows the Spark ecosystem:

The Spark runtime runs on top of a variety of cluster managers, including YARN (Hadoop’s compute framework), Mesos, and Spark’s own cluster manager called standalone mode. Tachyon is a memory-centric distributed file system that enables reliable file sharing at memory speed across cluster frameworks. In short, it is an off-heap storage layer in memory, which helps share data across jobs and users. Mesos is a cluster manager, which is evolving into a data center operating system. YARN is Hadoop’s compute framework that has a robust resource management feature that Spark can seamlessly use.

Spark can be either built from the source code or precompiled binaries can be downloaded from http://spark.apache.org. For a standard use case, binaries are good enough, and this recipe will focus on installing Spark using binaries.

Getting ready

All the recipes in this are developed using Ubuntu Linux but should work fine on any POSIX environment. Spark expects Java to be installed and the JAVA_HOME environment variable to be set.

In Linux/Unix systems, there are certain standards for the location of files and directories, which we are going to follow, The following is a quick cheat sheet:

Directory Description

/bin Essential command binaries

/etc Host-specific system configuration

/opt Add-on application software packages

/var Variable data

/tmp Temporary files

/home User home directories

How to do it…

At the time of writing this, Spark’s current version is 1.4. Please check the latest version from Spark’s download page at http://spark.apache.org/downloads.html. Binaries are developed with a most recent and stable version of Hadoop. To use a specific version of Hadoop, the recommended approach is to build from sources, which will be covered in the next recipe.

The following are the installation steps:
1. Open the terminal and download binaries using the following command:

       $ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.4.0-bin-       hadoop2.4.tgz

2. Unpack binaries:

   $ tar -zxf spark-1.4.0-bin-hadoop2.4.tgz

Rename the folder containing binaries by stripping the version information: $ sudo mv spark-1.4.0-bin-hadoop2.4 spark
Move the configuration folder to the /etc folder so that it can be made a symbolic link later: $ sudo mv spark/conf/* /etc/spark
Create your company-specific installation directory under /opt. As the recipes in this book are tested on infoobjects sandbox, we are going to use infoobjects as directory name. Create the /opt/infoobjects directory: $ sudo mkdir -p /opt/infoobjects
Move the spark directory to /opt/infoobjects as it’s an add-on software package: $ sudo mv spark /opt/infoobjects/
Change the ownership of the spark home directory to root: $ sudo chown -R root:root /opt/infoobjects/spark
Change permissions of the spark home directory, 0755 = user:read-write- execute group:read-execute world:read-execute: $ sudo chmod -R 755 /opt/infoobjects/spark
Move to the spark home directory: $ cd /opt/infoobjects/spark
Create the symbolic link: $ sudo ln -s /etc/spark conf
Append to PATH in .bashrc:
$ echo “export PATH=$PATH:/opt/infoobjects/spark/bin” >> /home/ hduser/.bashrc
Open a new terminal.
Create the log directory in /var: $ sudo mkdir -p /var/log/spark
Make hduser the owner of the Spark log directory.
$ sudo chown -R hduser:hduser /var/log/spark
Create the Spark tmp directory: $ mkdir /tmp/spark

16. Configure Spark with the help of the following command lines:

   $ cd /etc/spark

       $ echo "export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop"       >> spark-env.sh

       $ echo "export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/Hadoop"       >> spark-env.sh

       $ echo "export SPARK_LOG_DIR=/var/log/spark" >> spark-env.sh       $ echo "export SPARK_WORKER_DIR=/tmp/spark" >> spark-env.sh

Building the Spark source code with Maven

Installing Spark using binaries works fine in most cases. For advanced cases, such as the following (but not limited to), compiling from the source code is a better option:

f Compiling for a specific Hadoop version f Adding the Hive integration
f Adding the YARN integration

Getting ready

The following are the prerequisites for this recipe to work: f Java 1.6 or a later version

f Maven 3.x

How to do it…

The following are the steps to build the Spark source code with Maven:

Increase MaxPermSize for heap: $ echo “export _JAVA_OPTIONS=\”-XX:MaxPermSize=1G\”” >> /home/ hduser/.bashrc
Open a new terminal window and download the Spark source code from GitHub: $ wget https://github.com/apache/spark/archive/branch-1.4.zip

Chapter 1

Unpack the archive: $ gunzip branch-1.4.zip
Move to the spark directory: $ cd spark

Compile the sources with these flags: Yarn enabled, Hadoop version 2.4, Hive enabled, and skipping tests for faster compilation: $ mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package
Move the conf folder to the etc folder so that it can be made a symbolic link: $ sudo mv spark/conf /etc/
Move the spark directory to /opt as it’s an add-on software package: $ sudo mv spark /opt/infoobjects/spark
Change the ownership of the spark home directory to root: $ sudo chown -R root:root /opt/infoobjects/spark
Change the permissions of the spark home directory 0755 = user:rwx group:r-x world:r-x: $ sudo chmod -R 755 /opt/infoobjects/spark
Move to the spark home directory: $ cd /opt/infoobjects/spark
Create a symbolic link: $ sudo ln -s /etc/spark conf
Put the Spark executable in the path by editing .bashrc:
$ echo “export PATH=$PATH:/opt/infoobjects/spark/bin” >> /home/ hduser/.bashrc
Create the log directory in /var:
$ sudo mkdir -p /var/log/spark
Make hduser the owner of the Spark log directory:
$ sudo chown -R hduser:hduser /var/log/spark
Create the Spark tmp directory: $ mkdir /tmp/spark
Configure Spark with the help of the following command lines: $ cd /etc/spark $ echo “export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop” >> spark-env.sh $ echo “export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/Hadoop” >> spark-env.sh $ echo “export SPARK_LOG_DIR=/var/log/spark” >> spark-env.sh $ echo “export SPARK_WORKER_DIR=/tmp/spark” >> spark-env.sh

Launching Spark on Amazon EC2

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute instances in the cloud. Amazon EC2 provides the following features:

f On-demand delivery of IT resources via the Internet
f The provision of as many instances as you like
f Payment for the hours you use instances like your utility bill
f No setup cost, no installation, and no overhead at all
f When you no longer need instances, you either shut down or terminate and walk away f The availability of these instances on all familiar operating systems

EC2 provides different types of instances to meet all compute needs, such as general-purpose instances, micro instances, memory-optimized instances, storage-optimized instances, and others. They have a free tier of micro-instances to try.

Getting ready

The spark-ec2 script comes bundled with Spark and makes it easy to launch, manage, and shut down clusters on Amazon EC2.

Before you start, you need to do the following things:

Log in to the Amazon AWS account (http://aws.amazon.com).
Click on Security Credentials under your account name in the top-right corner.
Click on Access Keys and Create New Access Key:

Note down the access key ID and secret access key.
Now go to Services | EC2.

Click on Key Pairs in left-hand menu under NETWORK & SECURITY.
Click on Create Key Pair and enter kp-spark as key-pair name:
Download the private key file and copy it in the /home/hduser/keypairs folder.
Set permissions on key file to 600.
Set environment variables to reflect access key ID and secret access key (please replace sample values with your own values): $ echo “export AWS_ACCESS_KEY_ID=\”AKIAOD7M2LOWATFXFKQ\”” >> / home/hduser/.bashrc $ echo “export AWS_SECRET_ACCESS_KEY=\”+Xr4UroVYJxiLiY8DLT4DLT4D4s xc3ijZGMx1D3pfZ2q\”” >> /home/hduser/.bashrc $ echo “export PATH=$PATH:/opt/infoobjects/spark/ec2” >> /home/ hduser/.bashrc

How to do it…

1. Spark comes bundled with scripts to launch the Spark cluster on Amazon EC2. Let’s launch the cluster using the following command:

       $ cd /home/hduser

       $ spark-ec2 -k <key-pair> -i <key-file> -s <num-slaves> launch       <cluster-name>

2. Launch the cluster with the example value:

   $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem   --hadoop-major-version 2  -s 3 launch spark-cluster

f <key-pair>: This is the name of EC2 key-pair created in AWS f <key-file>: This is the private key file you downloaded
f <num-slaves>: This is the number of slave nodes to launch f <cluster-name>: This is the name of the cluster

Sometimes, the default availability zones are not available; in that case, retry sending the request by specifying the specific availability zone you are requesting: $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem -z us-east-1b –hadoop-major-version 2 -s 3 launch spark-cluster
If your application needs to retain data after the instance shuts down, attach EBS volume to it (for example, a 10 GB space): $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem –hadoop-major-version 2 -ebs-vol-size 10 -s 3 launch spark- cluster
If you use Amazon spot instances, here’s the way to do it: $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem -spot-price=0.15 –hadoop-major-version 2 -s 3 launch spark- cluster Spot instances allow you to name your own price for Amazon EC2 computing capacity. You simply bid on spare Amazon EC2 instances and run them whenever your bid exceeds the current spot price, which varies in real-time based on supply and demand (source: amazon.com).
After everything is launched, check the status of the cluster by going to the web UI URL that will be printed at the end.

7. Check the status of the cluster:

8. Now, to access the Spark cluster on EC2, let’s connect to the master node using secure shell protocol (SSH):

   $ spark-ec2 -k kp-spark -i /home/hduser/kp/kp-spark.pem  login   spark-cluster

You should get something like the following:

9. Check directories in the master node and see what they do:

External Data Sources

One of the strengths of Spark is that it provides a single runtime that can connect with various underlying data sources.

In this chapter, we will connect to different data sources. This chapter is divided into the following recipes:

Loading data from the local filesystem
Loading data from HDFS
Loading data from HDFS using a custom InputFormat f Loading data from Amazon S3
Loading data from Apache Cassandra
Loading data from relational databases

Introduction

Spark provides a unified runtime for big data. HDFS, which is Hadoop’s filesystem, is the most used storage platform for Spark as it provides cost-effective storage for unstructured and semi-structured data on commodity hardware. Spark is not limited to HDFS and can work with any Hadoop-supported storage.

Hadoop supported storage means a storage format that can work with Hadoop’s InputFormat and OutputFormat interfaces. InputFormat is responsible for creating InputSplits from input data and dividing it further into records. OutputFormat is responsible for writing to storage.

We will start with writing to the local filesystem and then move over to loading data from HDFS. In the Loading data from HDFS recipe, we will cover the most common file format: regular text files. In the next recipe, we will cover how to use any InputFormat interface to load data in Spark. We will also explore loading data stored in Amazon S3, a leading cloud storage platform.

We will explore loading data from Apache Cassandra, which is a NoSQL database. Finally, we will explore loading data from a relational database.

Loading data from the local filesystem

Though the local filesystem is not a good fit to store big data due to disk size limitations and lack of distributed nature, technically you can load data in distributed systems using the local filesystem. But then the file/directory you are accessing has to be available on each node.

Please note that if you are planning to use this feature to load side data, it is not a good idea. To load side data, Spark has a broadcast variable feature, which will be discussed in upcoming chapters.

In this recipe, we will look at how to load data in Spark from the local filesystem.

How to do it…

Let’s start with the example of Shakespeare’s “to be or not to be”:

Create the words directory by using the following command: $ mkdir words
Get into the words directory: $ cd words
Create the sh.txt text file and enter “to be or not to be” in it: $ echo “to be or not to be” > sh.txt
Start the Spark shell: $ spark-shell
Load the words directory as RDD:
scala> val words = sc.textFile(“file:///home/hduser/words”)
Count the number of lines: scala> words.count
Divide the line (or lines) into multiple words: scala> val wordsFlatMap = words.flatMap(_.split(“\\W+”))
Convert word to (word,1)—that is, output 1 as the value for each occurrence of word as a key: scala> val wordsMap = wordsFlatMap.map( w => (w,1))

Use the reduceByKey method to add the number of occurrences for each word as a key (this function works on two consecutive values at a time, represented by a and b): scala> val wordCount = wordsMap.reduceByKey( (a,b) => (a+b))
Print the RDD: scala> wordCount.collect.foreach(println)
Doing all of the preceding operations in one step is as follows: scala> sc.textFile(“file:///home/hduser/ words”). flatMap(_. split(“\\W+”)).map( w => (w,1)). reduceByKey( (a,b) => (a+b)). foreach(println)

This gives the following output:

Loading data from HDFS

HDFS is the most widely used big data storage system. One of the reasons for the wide adoption of HDFS is schema-on-read. What this means is that HDFS does not put any restriction on data when data is being written. Any and all kinds of data are welcome and can be stored in a raw format. This feature makes it ideal storage for raw unstructured data and semi-structured data.

When it comes to reading data, even unstructured data needs to be given some structure to make sense. Hadoop uses InputFormat to determine how to read the data. Spark provides complete support for Hadoop’s InputFormat so anything that can be read by Hadoop can be read by Spark as well.

The default InputFormat is TextInputFormat. TextInputFormat takes the byte offset of a line as a key and the content of a line as a value. Spark uses the sc.textFile method to read using TextInputFormat. It ignores the byte offset and creates an RDD of strings.

Sometimes the filename itself contains useful information, for example, time-series data. In that case, you may want to read each file separately. The sc.wholeTextFiles method allows you to do that. It creates an RDD with the filename and path (for example, hdfs://localhost:9000/ user/hduser/words) as a key and the content of the whole file as the value.

Spark also supports reading various serialization and compression-friendly formats such as Avro, Parquet, and JSON using DataFrames. These formats will be covered in coming chapters.

In this recipe, we will look at how to load data in the Spark shell from HDFS.

Let’s do the word count, which counts the number of occurrences of each word. In this recipe, we will load data from HDFS:

Create the words directory by using the following command: $ mkdir words
Change the directory to words: $ cd words
Create the sh.txt text file and enter “to be or not to be” in it: $ echo “to be or not to be” > sh.txt
Start the Spark shell: $ spark-shell
Load the words directory as the RDD:
scala> val words = sc.textFile(“hdfs://localhost:9000/user/hduser/ words”)

The sc.textFile method also supports passing an additional argument for the number of partitions. By default, Spark creates one partition for each InputSplit class, which roughly corresponds to one block.

You can ask for a higher number of partitions. It works really well for compute-intensive jobs such as in machine learning. As one partition cannot contain more than one block, having fewer partitions than blocks is not allowed.

Count the number of lines (the result will be 1): scala> words.count
Divide the line (or lines) into multiple words: scala> val wordsFlatMap = words.flatMap(_.split(“\\W+”))
Convert word to (word,1)—that is, output 1 as a value for each occurrence of word as a key: scala> val wordsMap = wordsFlatMap.map( w => (w,1))
Use the reduceByKey method to add the number of occurrences of each word as a key (this function works on two consecutive values at a time, represented by a and b): scala> val wordCount = wordsMap.reduceByKey( (a,b) => (a+b))

10. Print the RDD:

   scala> wordCount.collect.foreach(println)

11. Doing all of the preceding operations in one step is as follows:

       scala> sc.textFile("hdfs://localhost:9000/user/hduser/words").       flatMap(_.split("\\W+")).map( w => (w,1)). reduceByKey( (a,b) =>       (a+b)).foreach(println)

Data here is divided in such a way that each filename contains useful information, that is, USAF-WBAN-year, where USAF is the US air force station number and WBAN is the weather bureau army navy location number.

You will also notice that all files are compressed as gzip with a .gz extension. Compression is handled automatically so all you need to do is to upload data in HDFS. We will come back to this dataset in the coming chapters.

Since the whole dataset is not large, it can be uploaded in HDFS in the pseudo-distributed mode also:

Download data: $ wget -r ftp://ftp.ncdc.noaa.gov/pub/data/noaa/
Load the weather data in HDFS: $ hdfs dfs -put ftp.ncdc.noaa.gov/pub/data/noaa weather/
Start the Spark shell: $ spark-shell
Load weather data for 1901 in the RDD: scala> val weatherFileRDD = sc.wholeTextFiles(“hdfs:// localhost:9000/user/hduser/weather/1901”)
Cache weather in the RDD so that it is not recomputed every time it’s accessed: scala> val weatherRDD = weatherFileRDD.cache

In Spark, there are various StorageLevels at which the RDD can be persisted. rdd.cache is a shorthand for the rdd. persist(MEMORY_ONLY) StorageLevel.

6. Count the number of elements:

   scala> weatherRDD.count

Since the whole contents of a file are loaded as an element, we need to manually interpret the data, so let’s load the first element: scala> val firstElement = weatherRDD.first
Read the value of the first RDD: scala> val firstValue = firstElement._2 The firstElement contains tuples in the form (string, string). Tuples can be accessed in two ways:
- ‰ Using a positional function starting with _1.
- ‰ Using the productElement method, for example, tuple. productElement(0). Indexes here start with 0 like most other methods.
Split firstValue by lines: scala> val firstVals = firstValue.split(“\\n”)

10. Count the number of elements in firstVals: scala> firstVals.size

11. The schema of weather data is very rich with the position of the text working as a delimiter. You can get more information about schemas at the national weather service website. Let’s get wind speed, which is from section 66-69 (in meter/sec):

       scala> val windSpeed = firstVals.map(line => line.substring(65,69)

Loading data from HDFS using a custom InputFormat

Sometimes you need to load data in a specific format and TextInputFormat is not a good fit for that. Spark provides two methods for this purpose:

f sparkContext.hadoopFile: This supports the old MapReduce API
f sparkContext.newAPIHadoopFile: This supports the new MapReduce API

These two methods provide support for all of Hadoop’s built-in InputFormats interfaces as well as any custom InputFormat.

We are going to load text data in key-value format and load it in Spark using KeyValueTextInputFormat:

Create the currency directory by using the following command: $ mkdir currency
Change the current directory to currency: $ cd currency
Create the na.txt text file and enter currency values in key-value format delimited by tab (key: country, value: currency):
$ vi na.txt
United States of America US Dollar Canada Canadian Dollar Mexico Peso You can create more files for each continent.
Upload the currency folder to HDFS: $ hdfs dfs -put currency /user/hduser/currency
Start the Spark shell: $ spark-shell
Import statements: scala> import org.apache.hadoop.io.Text scala> import org.apache.hadoop.mapreduce.lib.input. KeyValueTextInputFormat
Load the currency directory as the RDD: val currencyFile = sc.newAPIHadoopFile(“hdfs://localhost:9000/ user/hduser/currency”,classOf[KeyValueTextInputFormat],classOf[Tex t],classOf[Text])
Convert it from tuple of (Text,Text) to tuple of (String,String): val currencyRDD = currencyFile.map( t => (t._1.toString,t._2. toString))
Count the number of elements in the RDD: scala> currencyRDD.count

10. Print the values:

       scala> currencyRDD.collect.foreach(println)

Amazon Simple Storage Service (S3) provides developers and IT teams with a secure, durable, and scalable storage platform. The biggest advantage of Amazon S3 is that there is no up-front IT investment and companies can build capacity (just by clicking a button a button) as they need.

Though Amazon S3 can be used with any compute platform, it integrates really well with Amazon’s cloud services such as Amazon Elastic Compute Cloud (EC2) and Amazon Elastic Block Storage (EBS). For this reason, companies who use Amazon Web Services (AWS) are likely to have significant data is already stored on Amazon S3.

This makes a good case for loading data in Spark from Amazon S3 and that is exactly what this recipe is about.

Enter the bucket name—for example, com.infoobjects.wordcount. Please make sure you enter a unique bucket name (no two S3 buckets can have the same name globally).
Select Region, click on Create, and then on the bucket name you created and you will see the following screen

Click on Create Folder and enter words as the folder name.
Create the sh.txt text file on the local filesystem: $ echo “to be or not to be” > sh.txt
Navigate to Words | Upload | Add Files and choose sh.txt from the dialog box, as shown in the following screenshot:

Tech Studio Online

Spark Advanced Tutorials (Complete guide book)

External Data Sources

Popular posts from this blog

How to change column name in Dataframe and selection of few columns in Dataframe using Pyspark with example

What is Garbage collection in Spark and its impact and resolution

Window function in PySpark with Joins example using 2 Dataframes (inner join)