Posts

Showing posts from January, 2022

Covid Data Analysis with Bed availability and other details

Covid Data Analysis: I have provided a small sample dataset, and running the same program with 10 GB of data on a cluster with 10 mappers took around 25 seconds. We have added a Partitioner just to understand how the partitioner splits the data and how a mapper is assigned to process a particular partition, and we implemented caching as a performance booster. The job reports:

Country-wise total cases
Country-wise new cases
Country-wise other details, such as available beds, booster details, etc.

For more details, please follow the git repository below: https://github.com/Deepak-Bhardwaj-Architect/CovidDataAnalysis

Implementation:

package com.citi.covid.spark.driver;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CovidDataAnalysis {

    public static void main(String[] args) throws InterruptedException {
        // App name assumed; the excerpt truncates here, and the full driver is in the repository above.
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("CovidDataAnalysis"));
    }
}
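Since the excerpt stops at the driver setup, here is a minimal sketch of the country-wise aggregation the post describes, written against the Spark SQL Java API. The CSV path and the column names (country, new_cases, total_cases, available_beds, booster_doses) are assumptions for illustration, not taken from the repository:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.sum;

public class CovidCountryAggregates {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CovidCountryAggregates")
                .getOrCreate();

        // Assumed CSV layout: country, new_cases, total_cases, available_beds, booster_doses
        Dataset<Row> covid = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("hdfs:///data/covid/covid_data.csv");   // path is a placeholder

        // Cache the input, mirroring the post's "cache for performance booster":
        // the dataset is scanned once and reused by each country-wise report.
        covid.cache();

        // Country-wise totals for cases, beds, and boosters
        Dataset<Row> byCountry = covid.groupBy("country")
                .agg(sum("total_cases").alias("total_cases"),
                     sum("new_cases").alias("new_cases"),
                     sum("available_beds").alias("available_beds"),
                     sum("booster_doses").alias("booster_doses"));

        byCountry.show();
        spark.stop();
    }
}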

Everything about YARN and its modes

Yet Another Resource Negotiator (YARN) is Hadoop's compute framework, and it runs on top of HDFS, which is Hadoop's storage layer. YARN follows a master-slave architecture: the master daemon is called the ResourceManager and the slave daemon is called the NodeManager. Besides these, application life-cycle management is done by the ApplicationMaster, which can be spawned on any slave node and stays alive for the lifetime of the application. When Spark is run on YARN, the ResourceManager performs the role of the Spark master and the NodeManagers work as executor nodes; each Spark executor runs as a YARN container. Spark applications on YARN run in two modes:

yarn-client: the Spark driver runs in the client process outside of the YARN cluster, and the ApplicationMaster is used only to negotiate resources from the ResourceManager.

yarn-cluster: the Spark driver runs in an ApplicationMaster spawned by a NodeManager on a slave node.

The yarn-cluster mode is recommended for production deployments, while the yarn-client mode is good for development and debugging, when you want to see the driver's output immediately.
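As a quick illustration, the mode is chosen when the application is submitted; the class name and JAR below are placeholders, not from the post:

    # yarn-client mode: the driver runs in the local client JVM
    spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar

    # yarn-cluster mode: the driver runs inside the ApplicationMaster on the cluster
    spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar

On older Spark 1.x releases, the same choice was written directly as --master yarn-client or --master yarn-cluster.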

How to load data from a Relational Database?

A lot of important data lies in relational databases that Spark needs to query. JdbcRDD is a Spark feature that allows relational tables to be loaded as RDDs, and this recipe explains how to use it. Spark SQL, another option, includes a data source for JDBC; it should be preferred over the current recipe, as results are returned as DataFrames (to be introduced in the next chapter), which can be easily processed by Spark SQL and joined with other data sources. Please make sure that the JDBC driver JAR is visible on the client node and on all slave nodes on which executors will run. Perform the following steps to load data from relational databases:

1. Create a table named person in MySQL using the following DDL:

CREATE TABLE `person` (
  `person_id` int(11) NOT NULL AUTO_INCREMENT,
  `first_name` varchar(30) DEFAULT NULL,
  `last_name` varchar(30) DEFAULT NULL,
  `gender` char(1) DEFAULT NULL,
  PRIMARY KEY (`person_id`)
);
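The excerpt ends at the DDL, so the following is a minimal sketch of reading that table through JdbcRDD's Java API. The JDBC URL, credentials, and the person_id bounds are assumptions for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.JdbcRDD;

public class LoadPersonTable {

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("LoadPersonTable"));

        // Connection details are placeholders; point them at your own MySQL instance.
        JdbcRDD.ConnectionFactory connectionFactory = new JdbcRDD.ConnectionFactory() {
            public Connection getConnection() throws Exception {
                return DriverManager.getConnection("jdbc:mysql://localhost:3306/testdb", "user", "password");
            }
        };

        // The query must contain exactly two '?' placeholders, which JdbcRDD
        // binds to the lower/upper bounds when splitting the table into partitions.
        JavaRDD<String> names = JdbcRDD.create(
            sc,
            connectionFactory,
            "SELECT first_name, last_name FROM person WHERE person_id >= ? AND person_id <= ?",
            1,     // lowerBound (assumed ID range)
            1000,  // upperBound
            3,     // numPartitions
            (ResultSet rs) -> rs.getString("first_name") + " " + rs.getString("last_name"));

        names.collect().forEach(System.out::println);
        sc.stop();
    }
}

With the DataFrame route the recipe recommends, the same table can instead be loaded via spark.read().format("jdbc") with the url and dbtable options, which returns a DataFrame rather than an RDD.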

Loading data from Apache Cassandra v/s HDFS Storage, and why Cassandra?

Apache Cassandra is a NoSQL database with a masterless ring cluster structure. While HDFS is a good fit for streaming data access, it does not work well with random access. For example, HDFS works well when your average file size is 100 MB and you want to read the whole file; but if you frequently access the nth line in a file, or some other part of it, as a record, HDFS is too slow. Relational databases have traditionally provided a solution to that, offering low-latency random access, but they do not work well with big data. NoSQL databases such as Cassandra fill the gap by providing relational-database-style access, but in a distributed architecture on commodity servers. This recipe loads data from Cassandra as a Spark RDD. To make that happen, Datastax, the company behind Cassandra, has contributed the spark-cassandra-connector. This connector lets you load Cassandra tables as Spark RDDs, write Spark RDDs back to Cassandra, and execute CQL queries. Perform the following steps to load data from Cassandra:
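The excerpt cuts off before the steps, so here is a minimal sketch of loading a table with the connector's Java API. The keyspace and table names ("test"/"words", with a "word" column) and the contact-point address are assumptions for illustration:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import com.datastax.spark.connector.japi.CassandraRow;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

public class CassandraLoadExample {

    public static void main(String[] args) {
        // spark.cassandra.connection.host points the connector at the Cassandra ring;
        // 127.0.0.1 is a placeholder contact point.
        SparkConf conf = new SparkConf()
                .setAppName("CassandraLoadExample")
                .set("spark.cassandra.connection.host", "127.0.0.1");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load the (hypothetical) keyspace "test", table "words" as an RDD of CassandraRow.
        JavaRDD<CassandraRow> rows = javaFunctions(sc).cassandraTable("test", "words");

        // Pull one column out of each row and print it on the driver.
        rows.map(row -> row.getString("word"))
            .collect()
            .forEach(System.out::println);

        sc.stop();
    }
}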