Posts

Spark Join Strategies (Internals of Spark Joins & Spark’s choice of Join Strategy)

While dealing with data, we have to deal with different kinds of joins, be it inner, outer, left or (maybe) left-semi. This article covers the different join strategies employed by Spark to perform join operations. Knowing Spark join internals comes in handy for optimizing tricky join operations, finding the root cause of some out-of-memory errors, and improving the performance of Spark jobs (we all want that, don’t we?). Please read on to find out. Broadcast Hash Join. Before getting to Spark’s broadcast hash join, let’s first understand the hash join in general: as the name suggests, a hash join is performed by first building a hash table on the join key of the smaller relation and then looping over the larger relation to probe it for matching hashed join-key values. Also, this is only supported for ‘=’ (equi) joins. In Spark, the hash join plays its role at the per-node level, where the strategy is used to join the partitions available on that node. Now, coming to the broadcast hash join. In a broadcast hash join, a copy of one of...
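As a quick illustration of the idea above, here is a minimal sketch (not from the post; the input paths and the join column are assumptions) of hinting Spark to use a broadcast hash join from the Java API:

import static org.apache.spark.sql.functions.broadcast;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BroadcastJoinSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("BroadcastJoinSketch")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical inputs: a large fact table and a small dimension table.
        Dataset<Row> orders = spark.read().parquet("/data/orders");
        Dataset<Row> countries = spark.read().parquet("/data/countries");

        // broadcast() hints Spark to ship the small relation to every executor,
        // so each task builds a local hash table and the large side is not shuffled.
        Dataset<Row> joined = orders.join(broadcast(countries), "country_code");

        joined.explain(); // the physical plan should show BroadcastHashJoin
        spark.stop();
    }
}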

REST API Simple Application with HTTP Methods

We will be developing a REST API using JAX-RS (Jersey) and the Tomcat server, and we will be implementing the 4 basic methods, so let’s get started. Here is the pom.xml file:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.dpq.webservices</groupId>
  <artifactId>SimpleRestApiApp</artifactId>
  <packaging>war</packaging>
  <version>0.0.1-SNAPSHOT</version>
  <name>SimpleRestApiApp</name>
  <build>
    <finalName>SimpleRestApiApp</finalName>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.8.1...
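Since the excerpt cuts off inside the pom.xml, here is a minimal, hypothetical JAX-RS resource sketch (not the post’s actual code; the path and class names are made up) showing what the 4 basic HTTP methods typically look like with Jersey:

import javax.ws.rs.*;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

@Path("/messages")
public class MessageResource {

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public Response getAll() {
        // GET: return all messages (stubbed as an empty JSON array)
        return Response.ok("[]").build();
    }

    @POST
    @Consumes(MediaType.APPLICATION_JSON)
    public Response create(String body) {
        // POST: create a new message from the request body (stubbed)
        return Response.status(Response.Status.CREATED).entity(body).build();
    }

    @PUT
    @Path("/{id}")
    @Consumes(MediaType.APPLICATION_JSON)
    public Response update(@PathParam("id") long id, String body) {
        // PUT: replace the message with the given id (stubbed)
        return Response.ok(body).build();
    }

    @DELETE
    @Path("/{id}")
    public Response delete(@PathParam("id") long id) {
        // DELETE: remove the message with the given id (stubbed)
        return Response.noContent().build();
    }
}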

REST API understanding with its Architecture

REST is an acronym for REpresentational State Transfer and an architectural style for distributed hypermedia systems. 1. Principles of REST: Uniform Interface: The following four constraints can achieve a uniform REST interface: Identification of resources – the interface must uniquely identify each resource involved in the interaction between the client and the server. Manipulation of resources through representations – the resources should have uniform representations in the server response. API consumers should use these representations to modify the resource’s state on the server. Self-descriptive messages – each resource representation should carry enough information to describe how to process the message. It should also provide information about the additional actions that the client can perform on the resource. Hypermedia as the engine of application state – the client should have only the initial URI of the application. The client application should dynamically driv...
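To make the last two constraints concrete, here is a small hedged sketch (assuming JAX-RS, with made-up paths; this is not taken from the post) of a response whose hypermedia links tell the client which actions it can take next:

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.Link;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

@Path("/orders")
public class OrderResource {

    @GET
    @Path("/{id}")
    @Produces(MediaType.APPLICATION_JSON)
    public Response getOrder(@PathParam("id") long id) {
        // Hypothetical representation of the resource.
        String body = "{\"id\": " + id + ", \"status\": \"OPEN\"}";

        // Hypermedia links advertise the next possible state transitions,
        // so the client only needs the initial URI and discovers the rest dynamically.
        Link self = Link.fromUri("/orders/" + id).rel("self").build();
        Link cancel = Link.fromUri("/orders/" + id + "/cancel").rel("cancel").build();

        return Response.ok(body).links(self, cancel).build();
    }
}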

United Kingdom Sponsor Data Analysis with Java Streams and Parallel Streams and comparisons

United Kingdom Sponsor Data Analysis with Java Streams and Parallel Streams and comparisons. Things which are covered: reading CSV data using CSVReader; checking the record count; preparing a List of objects to process; evaluating a range check; checking whether a given company is registered or not; processing the above with streams and parallel streams; comparing times between streams and parallel streams; inconsistency in parallel streams. Inconsistency in parallel streams: when searching for an element in the records returned from a parallel stream, the result is not guaranteed, because the records are split into chunks that are processed in parallel and a different chunk of data may be considered on each run; that is why there is inconsistency in parallel streams. How Spark handles this with minimal code and with consistency, we will see in an upcoming post. Later we will compare the same analysis with Spark and see the differences in execution and time complexity. Implementation: package com.dpq.sp...
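As a hedged illustration of the comparison described above (the post’s own implementation is truncated here, so the record type and numbers below are made up), a minimal sketch that times a sequential stream against a parallel stream and shows why findAny() on a parallel stream is non-deterministic:

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class StreamComparisonSketch {

    // Hypothetical record standing in for one row of the sponsor CSV.
    record Company(String name, boolean registered) {}

    public static void main(String[] args) {
        List<Company> companies = IntStream.range(0, 5_000_000)
                .mapToObj(i -> new Company("Company-" + i, i % 7 == 0))
                .collect(Collectors.toList());

        long t1 = System.nanoTime();
        long sequentialCount = companies.stream().filter(Company::registered).count();
        long t2 = System.nanoTime();
        long parallelCount = companies.parallelStream().filter(Company::registered).count();
        long t3 = System.nanoTime();

        System.out.printf("sequential: %d matches in %d ms%n", sequentialCount, (t2 - t1) / 1_000_000);
        System.out.printf("parallel:   %d matches in %d ms%n", parallelCount, (t3 - t2) / 1_000_000);

        // findAny() may return a different registered company on each run,
        // because the chunks are processed concurrently; findFirst() is deterministic.
        companies.parallelStream().filter(Company::registered).findAny()
                .ifPresent(c -> System.out.println("some registered company: " + c.name()));
    }
}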

Covid Data Analysis with Bed availability and other details

Covid Data Analysis: I have provided a small sample dataset and ran the same program with 10 GB of data on a cluster with 10 mappers, and it took around 25 seconds to process the data. We have added a Partitioner just to understand how the partitioner partitions the data and how a mapper is assigned to process a particular partition. Implemented cache as a performance booster. Country-wise total cases. Country-wise new cases. Country-wise other details like available beds, booster details etc. For more details please follow the git repository below: https://github.com/Deepak-Bhardwaj-Architect/CovidDataAnalysis Implementation:
package com.citi.covid.spark.driver;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CovidDataAnalysis {
    public static void main(String[] args) throws InterruptedException {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName...
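Because the driver above is cut off, here is a hedged sketch (the column names and input path are assumptions, not the repository’s code) of the kind of country-wise aggregations the post describes, with cache() so the parsed data is reused across queries:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CovidAggregationSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CovidAggregationSketch")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical CSV layout: country, new_cases, total_cases, available_beds, boosters
        Dataset<Row> covid = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("/data/covid.csv");

        covid.cache(); // performance booster: the parsed data is reused by both aggregations

        // Country-wise total cases
        covid.groupBy(col("country"))
             .agg(sum(col("total_cases")).alias("total_cases"))
             .show();

        // Country-wise new cases
        covid.groupBy(col("country"))
             .agg(sum(col("new_cases")).alias("new_cases"))
             .show();

        spark.stop();
    }
}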

Everything about YARN and its modes

Yet Another Resource Negotiator (YARN) is Hadoop’s compute framework that runs on top of HDFS, Hadoop’s storage layer. YARN follows the master-slave architecture: the master daemon is called the ResourceManager and the slave daemon is called the NodeManager. Besides these, application life-cycle management is done by the ApplicationMaster, which can be spawned on any slave node and is alive for the lifetime of an application. When Spark is run on YARN, the ResourceManager performs the role of the Spark master and the NodeManagers work as executor nodes. While running Spark with YARN, each Spark executor runs as a YARN container. Spark applications on YARN run in two modes: yarn-client: the Spark driver runs in the client process outside of the YARN cluster, and the ApplicationMaster is only used to negotiate resources from the ResourceManager; yarn-cluster: the Spark driver runs in the ApplicationMaster spawned by a NodeManager on a slave node. The yarn-cluster mode is recommended for production deployments, while...
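As a hedged sketch of the two modes (not from the post), the Java snippet below shows the corresponding configuration; in practice the master and deploy mode are usually supplied through spark-submit (--master yarn --deploy-mode client|cluster) rather than hard-coded:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class YarnModeSketch {
    public static void main(String[] args) {
        // "yarn" as master hands resource negotiation to the ResourceManager;
        // spark.submit.deployMode selects where the driver lives:
        //   "client"  -> driver stays in this JVM, ApplicationMaster only negotiates containers
        //   "cluster" -> driver runs inside the ApplicationMaster on a NodeManager
        SparkConf conf = new SparkConf()
                .setAppName("YarnModeSketch")
                .setMaster("yarn")
                .set("spark.submit.deployMode", "client");

        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.println("Running on YARN in " + sc.getConf().get("spark.submit.deployMode") + " mode");
        sc.stop();
    }
}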

How to load data from a Relational Database?

A lot of important data lies in relational databases that Spark needs to query. JdbcRDD is a Spark feature that allows relational tables to be loaded as RDDs. This recipe will explain how to use JdbcRDD. Spark SQL, which includes a data source for JDBC, is another option. It should be preferred over the current recipe, as results are returned as DataFrames (to be introduced in the next chapter), which can easily be processed by Spark SQL and joined with other data sources. Please make sure that the JDBC driver JAR is visible on the client node and on all slave nodes on which executors will run. Perform the following steps to load data from relational databases: 1. Create a table named person in MySQL using the following DDL: CREATE TABLE `person` ( `person_id` int(11) NOT NULL AUTO_INCREMENT, `first_name` varchar(30) DEFAULT NULL, `last_name` varchar(30) DEFAULT NULL, `gender` char(1) DEFAULT NULL, PRI...
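For the Spark SQL route mentioned above, here is a minimal hedged sketch (the connection URL, database name, credentials and driver class are assumptions) of reading the person table through the JDBC data source into a DataFrame:

import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcReadSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("JdbcReadSketch")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical connection details; the MySQL driver JAR must be on the driver and executor classpath.
        Properties props = new Properties();
        props.setProperty("user", "hduser");
        props.setProperty("password", "changeme");
        props.setProperty("driver", "com.mysql.cj.jdbc.Driver");

        Dataset<Row> person = spark.read()
                .jdbc("jdbc:mysql://localhost:3306/hadoopdb", "person", props);

        person.show();
        spark.stop();
    }
}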

Loading data from Apache Cassandra v/s HDFS Storage and why Cassandra?

Apache Cassandra is a NoSQL database with a masterless ring cluster structure. While HDFS is a good fit for streaming data access, it does not work well with random access. For example, HDFS will work well when your average file size is 100 MB and you want to read the whole file; if you frequently access the nth line in a file, or some other part of it as a record, HDFS would be too slow. Relational databases have traditionally provided a solution to that, offering low-latency random access, but they do not work well with big data. NoSQL databases such as Cassandra fill the gap by providing relational-database-type access, but in a distributed architecture on commodity servers. Spark can load data from Cassandra as a Spark RDD. To make that happen, Datastax, the company behind Cassandra, has contributed spark-cassandra-connector. This connector lets you load Cassandra tables as Spark RDDs, write Spark RDDs back to Cassandra, and execute CQL queries. Command to use to load data from Cassandra: Perform the ...
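As the excerpt’s own command is cut off above, here is a hedged sketch of reading a Cassandra table with the spark-cassandra-connector (the keyspace, table name and contact point are assumptions, not from the post):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CassandraReadSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CassandraReadSketch")
                .master("local[*]")
                // Contact point of the Cassandra cluster (assumed to be local here).
                .config("spark.cassandra.connection.host", "127.0.0.1")
                .getOrCreate();

        // Load a Cassandra table through the connector's data source.
        Dataset<Row> people = spark.read()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "test_ks")   // hypothetical keyspace
                .option("table", "person")       // hypothetical table
                .load();

        people.show();
        spark.stop();
    }
}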

Twitter Data streaming using a pipeline in PySpark

Twitter data analysis using PySpark along with a Pipeline. We are processing Twitter data using PySpark and we have tried to use all possible methods to understand it. The Twitter data is parsed in 2 stages, which are sequential, which is why we are using a pipeline for these 3 stages. Calling the fit function on the pipeline trains the model, and then the computations are done.
from pyspark import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.streaming import StreamingContext
import pyspark.sql.types as tp
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler
from pyspark.ml.feature import StopWordsRemover, Word2Vec, RegexTokenizer
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import Row, Column
import sys

# define the function to get the predicted sentiment on the data received
def get_prediction(tweet_text):
    t...