Loading data from Apache Cassandra vs. HDFS storage, and why Cassandra?

Apache Cassandra is a NoSQL database with a masterless ring cluster architecture. While HDFS is a good fit for streaming data access, it does not work well with random access. For example, HDFS works well when your average file size is 100 MB and you want to read the whole file; if you frequently access the nth line of a file, or some other part of it as a record, HDFS is too slow.

Relational databases have traditionally offered a solution, providing low-latency random access, but they do not work well with big data. NoSQL databases such as Cassandra fill the gap by providing relational-database-style access, but in a distributed architecture on commodity servers.

Spark can load data from Cassandra as a Spark RDD. To make that happen, DataStax, the company behind Cassandra, has contributed spark-cassandra-connector. This connector lets you load Cassandra tables as Spark RDDs, write Spark RDDs back to Cassandra, and execute CQL queries; the last two are sketched just below.
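For example, writing an RDD back and running raw CQL look roughly like this (a minimal sketch, assuming the people keyspace and person table created in the steps below; the names and values are only illustrative):

       import com.datastax.spark.connector._
       import com.datastax.spark.connector.cql.CassandraConnector

       // Write an RDD of tuples back to Cassandra; tuple fields map,
       // in order, to the columns listed in SomeColumns.
       val newPeople = sc.parallelize(Seq((3, "Jane", "Doe")))
       newPeople.saveToCassandra("people", "person",
         SomeColumns("id", "first_name", "last_name"))

       // Execute raw CQL through the connector's managed session.
       CassandraConnector(sc.getConf).withSessionDo { session =>
         session.execute("SELECT * FROM people.person WHERE id = 1")
       }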

Commands to load data from Cassandra

Perform the following steps to load data from Cassandra:
1. Create a keyspace named people in Cassandra using the CQL shell:

       cqlsh> CREATE KEYSPACE people WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };

2. Create a column family named person (from CQL 3.0 onwards, a column family can also be called a table):

       cqlsh> create columnfamily person(id int primary key, first_name varchar, last_name varchar);

3. Insert a few records into the column family:

       cqlsh> insert into person(id, first_name, last_name) values(1, 'Barack', 'Obama');
       cqlsh> insert into person(id, first_name, last_name) values(2, 'Joe', 'Smith');

4. Add the Cassandra connector dependency to SBT:

       libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.2.0"

5. Alternatively, you can add the connector dependency to Maven:

       <dependency>
         <groupId>com.datastax.spark</groupId>
         <artifactId>spark-cassandra-connector_2.10</artifactId>
         <version>1.2.0</version>
       </dependency>
Alternatively, you can download the spark-cassandra-connector JAR to use directly with the Spark shell:

       $ wget http://central.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.10/1.2.0/spark-cassandra-connector_2.10-1.2.0.jar
  6. Now start the Spark shell with the connector JAR and the spark.cassandra.connection.host property, which points the connector at your Cassandra node. Set the property at launch; calling set on sc.getConf inside the shell has no effect, because getConf returns a copy of the running configuration:
           $ spark-shell --jars spark-cassandra-connector_2.10-1.2.0.jar --conf spark.cassandra.connection.host=localhost
  7. Import Cassandra-specific libraries:
           scala> import com.datastax.spark.connector._
  8. Load the person column family as an RDD:
           scala> val personRDD = sc.cassandraTable("people", "person")
  9. Count the number of records in the RDD:
           scala> personRDD.count
  10. Print data in the RDD:
           scala> personRDD.collect.foreach(println)
  11. Retrieve the first row:
           scala> val firstRow = personRDD.first
  12. Get the column names (typed access to individual columns is sketched after these steps):
           scala> firstRow.columnNames
  13. Cassandra can also be accessed through Spark SQL. It has a wrapper around SQLContext called CassandraSQLContext; let's load it:
           scala> val cc = new org.apache.spark.sql.cassandra.CassandraSQLContext(sc)
  14. Load the person data as a SchemaRDD:
           scala> val p = cc.sql("select * from people.person")
  15. Retrieve the person data:
           scala> p.collect.foreach(println)
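
The rows returned by cassandraTable are CassandraRow objects; as referenced in step 12, individual columns can be read with typed getters. A minimal sketch against the person table above:

           scala> personRDD.map(row => (row.getInt("id"), row.getString("first_name"))).collect.foreach(println)
           scala> val firstName = firstRow.get[String]("first_name")
           scala> val obama = sc.cassandraTable("people", "person").where("id = ?", 1)

The where clause is pushed down to Cassandra as a CQL predicate, so only the matching rows travel to Spark.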

The Spark Cassandra Connector library has a lot of dependencies. The connector itself and several of its dependencies are third-party to Spark and are not available as part of the Spark installation, so they must be put on the classpath yourself (for example with --jars, as in step 6).
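
If you would rather not track those JARs by hand, Spark 1.3 and later can resolve the connector and its transitive dependencies from Maven at launch via the --packages flag:

       $ spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.10:1.2.0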
