Posts

Showing posts with the label HDFS

Loading data from Apache Cassandra vs. HDFS storage, and why Cassandra?

Apache Cassandra is a NoSQL database with a masterless ring cluster architecture. While HDFS is a good fit for streaming data access, it does not work well with random access. For example, HDFS works well when your average file size is 100 MB and you want to read the whole file; if you frequently access the nth line in a file, or some other part of it as a record, HDFS would be too slow. Relational databases have traditionally provided a solution to that, offering low-latency random access, but they do not work well with big data. NoSQL databases such as Cassandra fill the gap by providing relational-database-style access, but in a distributed architecture on commodity servers. Spark can load data from Cassandra as a Spark RDD. To make that happen, Datastax, the company behind Cassandra, has contributed the spark-cassandra-connector. This connector lets you load Cassandra tables as Spark RDDs, write Spark RDDs back to Cassandra, and execute CQL queries. …
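
As an illustration of the connector, here is a minimal Java sketch of loading a Cassandra table as a Spark RDD; the keyspace "test", table "words", contact point 127.0.0.1, and local master are placeholder assumptions, not values from the post.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import com.datastax.spark.connector.japi.CassandraRow;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

public class CassandraLoadExample {
    public static void main(String[] args) {
        // Point Spark at the Cassandra cluster (host and master are assumptions for this sketch).
        SparkConf conf = new SparkConf()
                .setAppName("CassandraLoadExample")
                .setMaster("local[2]")
                .set("spark.cassandra.connection.host", "127.0.0.1");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load the hypothetical table test.words as an RDD and render each row as a string.
        JavaRDD<String> rows = javaFunctions(sc)
                .cassandraTable("test", "words")
                .map((CassandraRow row) -> row.toString());

        // A couple of actions to show the data is available as a normal RDD.
        System.out.println("row count = " + rows.count());
        rows.take(5).forEach(System.out::println);

        sc.stop();
    }
}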

Word count program as a MapReduce job

This is a simple MapReduce job that processes any text file and outputs each word with its number of occurrences.

Program:

package com.dpq.retail;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCountDriver {
    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration c = new Configuration();
        String[] fi…
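
Since the excerpt cuts off partway through the driver, here is a complete, minimal word-count sketch for reference; the mapper and reducer class names and the remainder of the driver are assumptions, not the post's actual code.

package com.dpq.retail;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCountDriver {

    // Mapper: emits (word, 1) for every token in the input line.
    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration c = new Configuration();
        String[] files = new GenericOptionsParser(c, args).getRemainingArgs();

        Job job = Job.getInstance(c, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // First argument: input path on HDFS; second argument: output directory (must not exist yet).
        FileInputFormat.addInputPath(job, new Path(files[0]));
        FileOutputFormat.setOutputPath(job, new Path(files[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}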

Analysing stock data and getting the maximum selling price from a stock dataset using a MapReduce job

MaxClosingPricingMapReduceApp. Here we compute the maximum selling price for each stock symbol over the last 20 years. A small sample dataset is provided; running the same program against 10 GB of data on a cluster with 10 mappers took around 35 seconds. We have added a Partitioner just to illustrate how partitioning splits the map output and how a reducer is assigned to process a particular partition. We used a MapReduce job with HDFS storage to produce these stats; later I plan to compare it with Spark, and we will see how Spark speeds up the processing and reduces I/O. The sample data is also present in the inputFile directory of the same project, so you can see what is inside it.

Sample data:
ABCSE,B7J,2009-08-14,7.93,7.94,7.70,7.85,64600,7.68
ABCSE,B6J,2009-08-14,7.93,7.94,7.70,4.55,64600,6.68
ABCSE,B8J,2009-08-14,7.93,7.94,7.70,6.85,64600,4.68
ABCSE,B9J,2009-08-14,7.93,7.94,7.70,8.85,64600,73.68
ABCSE,A7J,2009-08-14,7.93,7.94,7.70,9.85,64600,7.68
ABCSE,S7J,2009-08-14,7…
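
A minimal sketch of such a job is shown below, assuming the second comma-separated field is the stock symbol and the seventh is the closing (selling) price; the class names and the symbol-based partitioner are illustrative, not the post's actual code.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxClosingPriceDriver {

    // Mapper: emits (symbol, closing price) for each CSV record.
    public static class MaxPriceMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length >= 7) {
                String symbol = fields[1];
                double close = Double.parseDouble(fields[6]);
                context.write(new Text(symbol), new DoubleWritable(close));
            }
        }
    }

    // Partitioner: routes keys by the first character of the symbol, so we can
    // observe which reducer handles which partition (only matters with >1 reducer).
    public static class SymbolPartitioner extends Partitioner<Text, DoubleWritable> {
        @Override
        public int getPartition(Text key, DoubleWritable value, int numPartitions) {
            return key.toString().charAt(0) % numPartitions;
        }
    }

    // Reducer: keeps the maximum closing price seen for each symbol.
    public static class MaxPriceReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double max = Double.NEGATIVE_INFINITY;
            for (DoubleWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(key, new DoubleWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "max closing price");
        job.setJarByClass(MaxClosingPriceDriver.class);
        job.setMapperClass(MaxPriceMapper.class);
        job.setPartitionerClass(SymbolPartitioner.class);
        job.setReducerClass(MaxPriceReducer.class);
        job.setNumReduceTasks(3); // more than one reducer so the partitioner has an effect
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}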

Important and useful HDFS commands

To check the Hadoop version:
$ hadoop version

Creating a user home directory:
hadoop fs -mkdir -p /user/dpq   (-p creates the directory if it is not present; if it already exists, no exception is thrown)
hadoop fs -mkdir /user/retails   (without -p, an exception is thrown if the directory already exists)

List directories:
hdfs dfs -ls /
hdfs dfs -ls /user

Copy a file from local to HDFS:
hadoop fs -copyFromLocal /etc/data/retail_data.csv /user/retails
hadoop fs -put /etc/data/retail_data.csv /user/retails   (if 'retail_data.csv' is already present inside the '/user/retails' HDFS directory, it throws a 'File already present' exception)
hadoop fs -put -f /etc/data/retail_data.csv /user/retails   (-f forcefully overwrites retail_data.csv if it is already present)

Checking data in HDFS:
hdfs dfs -cat /user/retails/retail_data.csv

Create an empty file on HDFS:
hdfs dfs -touchz /user/retails/empty.csv

Copy a file from HDFS:
hdfs dfs -get /user/retails/retail_data.csv
hdfs dfs -…
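
The same operations can also be done programmatically. Below is a minimal sketch using the Hadoop FileSystem Java API, reusing the paths from the commands above; it assumes a core-site.xml with fs.defaultFS is on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCommandsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the Hadoop configuration on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // hadoop fs -mkdir -p /user/dpq
        fs.mkdirs(new Path("/user/dpq"));

        // hadoop fs -put -f /etc/data/retail_data.csv /user/retails  (overwrite if present)
        fs.copyFromLocalFile(false, true,
                new Path("/etc/data/retail_data.csv"), new Path("/user/retails"));

        // hdfs dfs -ls /user
        for (FileStatus status : fs.listStatus(new Path("/user"))) {
            System.out.println(status.getPath());
        }

        // hdfs dfs -touchz /user/retails/empty.csv  (create a zero-length file)
        fs.create(new Path("/user/retails/empty.csv")).close();

        // hdfs dfs -get /user/retails/retail_data.csv .
        fs.copyToLocalFile(new Path("/user/retails/retail_data.csv"), new Path("."));

        fs.close();
    }
}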

What is Apache Spark and its lifecycle, with components in detail

Apache Spark is a general-purpose cluster computing system for processing big data workloads. What sets Spark apart from its predecessors, such as MapReduce, is its speed, ease of use, and sophisticated analytics. Apache Spark was originally developed at AMPLab, UC Berkeley, in 2009. It was made open source in 2010 under the BSD license and switched to the Apache 2.0 license in 2013. Toward the latter part of 2013, the creators of Spark founded Databricks to focus on Spark's development and future releases. Talking about speed, Spark can achieve sub-second latency on big data workloads. To achieve such low latency, Spark makes use of memory for storage: in MapReduce, memory is primarily used for the actual computation, whereas Spark uses memory both to compute and to store objects. Spark also provides a unified runtime connecting to various big data storage sources, such as HDFS, Cassandra, HBase, and S3. It also provides a rich set of higher-level libraries for …
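
To make the in-memory point concrete, here is a small Java sketch (the HDFS path is an assumption for this sketch): caching an RDD so the second action is served from memory instead of re-reading from HDFS.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCacheExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkCacheExample");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read a text file from HDFS (hypothetical path).
        JavaRDD<String> lines = sc.textFile("hdfs:///user/retails/retail_data.csv");

        // cache() keeps the RDD in memory, so later actions reuse it
        // rather than re-reading from HDFS -- the in-memory storage described above.
        lines.cache();

        long total = lines.count();                               // first action: reads from HDFS, fills the cache
        long nonEmpty = lines.filter(l -> !l.isEmpty()).count();  // second action: served from memory

        System.out.println("total = " + total + ", non-empty = " + nonEmpty);
        sc.stop();
    }
}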