Tech Studio Online

Posts

Covid Data Analysis with Bed availability and other details

- January 31, 2022

Covid data analysis: I have provided a small sample dataset. The same program, run on a cluster against 10 GB of data with 10 mappers, took around 25 seconds to process the data. We added a Partitioner just to understand how the data is partitioned and how a mapper is assigned to process each partition. A cache is used to boost performance.

The analysis covers:

- Country-wise total cases
- Country-wise new cases
- Country-wise other details such as available beds, booster details, etc.

For more details, please see the Git repository: https://github.com/Deepak-Bhardwaj-Architect/CovidDataAnalysis

Implementation:

    package com.citi.covid.spark.driver;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CovidDataAnalysis {
        public static void main(String[] args) throws InterruptedException {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName...
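The partitioning idea mentioned in this post can be sketched without Spark. The snippet below is only an illustration in plain Python, not the code from the repository: the sample rows are made up, and CRC32 is a stand-in for whatever hash the real Partitioner uses (Spark's own HashPartitioner works on the key's hashCode). It shows that all rows for one country land in the same partition, so each partition can be summed independently.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index using a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_records(records, num_partitions):
    """Group (country, cases) rows by the partition their country hashes to."""
    partitions = defaultdict(list)
    for country, cases in records:
        partitions[partition_for(country, num_partitions)].append((country, cases))
    return dict(partitions)


# Hypothetical sample rows: (country code, new cases)
sample = [("IN", 120), ("US", 300), ("IN", 80), ("KR", 15), ("US", 50)]
parts = partition_records(sample, num_partitions=3)

# Each partition is processed on its own, as one worker would do,
# and the per-country totals are collected at the end.
totals = {}
for index, rows in parts.items():
    for country, cases in rows:
        totals[country] = totals.get(country, 0) + cases

print(totals)
```

Because the partition index depends only on the key, no country is split across partitions, which is what makes the per-partition sums safe to combine.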

Everything about YARN and its modes

- January 05, 2022

Yet Another Resource Negotiator (YARN) is Hadoop's compute framework that runs on top of HDFS, which is Hadoop's storage layer. YARN follows a master-slave architecture. The master daemon is called ResourceManager and the slave daemon is called NodeManager. Besides this, application life-cycle management is done by the ApplicationMaster, which can be spawned on any slave node and is alive for the lifetime of an application.

When Spark is run on YARN, the ResourceManager performs the role of the Spark master and the NodeManagers work as executor nodes. Each Spark executor runs as a YARN container.

Spark applications on YARN run in two modes:

- yarn-client: the Spark driver runs in the client process outside of the YARN cluster, and the ApplicationMaster is only used to negotiate resources from the ResourceManager
- yarn-cluster: the Spark driver runs in the ApplicationMaster spawned by a NodeManager on a slave node

The yarn-cluster mode is recommended for production deployments, while...

How to load data from a Relational Database?

- January 03, 2022

A lot of important data lies in relational databases that Spark needs to query. JdbcRDD is a Spark feature that allows relational tables to be loaded as RDDs. This recipe will explain how to use JdbcRDD.

Another option is Spark SQL, which includes a data source for JDBC. It should be preferred over the current recipe because results are returned as DataFrames (to be introduced in the next chapter), which can be easily processed by Spark SQL and also joined with other data sources.

Please make sure that the JDBC driver JAR is visible on the client node and on all slave nodes on which executors will run.

Perform the following steps to load data from relational databases:

1. Create a table named person in MySQL using the following DDL:

    CREATE TABLE `person` (
      `person_id` int(11) NOT NULL AUTO_INCREMENT,
      `first_name` varchar(30) DEFAULT NULL,
      `last_name` varchar(30) DEFAULT NULL,
      `gender` char(1) DEFAULT NULL,
      PRI...

Loading data from Apache Cassandra vs. HDFS storage, and why Cassandra?

- January 03, 2022

Apache Cassandra is a NoSQL database with a masterless ring cluster structure. While HDFS is a good fit for streaming data access, it does not work well with random access. For example, HDFS will work well when your average file size is 100 MB and you want to read the whole file. If you frequently access the nth line in a file or some other part as a record, HDFS would be too slow.

Relational databases have traditionally provided a solution to that, offering low-latency random access, but they do not work well with big data. NoSQL databases such as Cassandra fill the gap by providing relational-database-type access, but in a distributed architecture on commodity servers.

The goal here is to load data from Cassandra as a Spark RDD. To make that happen, DataStax, the company behind Cassandra, has contributed spark-cassandra-connector. This connector lets you load Cassandra tables as Spark RDDs, write Spark RDDs back to Cassandra, and execute CQL queries.

Commands to load data from Cassandra

Perform the ...

Twitter Data streaming by using pipeline in PySpark

- November 06, 2021

Twitter data analysis using PySpark along with Pipeline

We are processing Twitter data using PySpark, and we have tried to use all possible methods to understand it. The Twitter data is parsed in sequential stages, which is why we use a pipeline for these stages. Calling the fit function on the pipeline trains the model, and the computations are done after that.

    from pyspark import SparkContext
    from pyspark.sql.session import SparkSession
    from pyspark.streaming import StreamingContext
    import pyspark.sql.types as tp
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler
    from pyspark.ml.feature import StopWordsRemover, Word2Vec, RegexTokenizer
    from pyspark.ml.classification import LogisticRegression
    from pyspark.sql import Row, Column
    import sys

    # define the function to get the predicted sentiment on the data received
    def get_prediction(tweet_text):
        t...

Model Data Analysis Using DataFrame and Spark SQL in Java

- October 13, 2021

Here we analyze model data using pure Spark SQL and DataFrames, applying the most commonly used methods to sample data.

    package com.dpq.model.data.driver;

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ModelDataAnalysis {
        public static void main(String[] args) throws InterruptedException {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("Spark Count").setMaster("local"));
            SparkSession spark = SparkSession.builder().appName("spark-bigquery-demo").getOrCreate();
            Dataset<Row> row = spark.read().csv("/Users/dpq/springbootWrokspace/CountryDataAnalysis/resources/modeloutput.csv");
            // way 1 to change column name
            row = row.withColumnRenamed("_c0", "CountryName");
            row = row.withColumnRenamed("_c1", ...

Broadcast Nested Loop in detail in Spark

- October 02, 2021

Broadcast Nested Loop join works by broadcasting one of the entire datasets and performing a nested loop to join the data. So essentially every record from dataset 1 is attempted to join with every record from dataset 2. As you could guess, Broadcast Nested Loop is not preferred and could be quite slow. It works for both equi and non-equi joins, and it is picked by default when you have a non-equi join.

Example

We don't change the default values for both spark.sql.join.preferSortMergeJoin and spark.sql.autoBroadcastJoinThreshold.

    scala> spark.conf.get("spark.sql.join.preferSortMergeJoin")
    res0: String = true

    scala> spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
    res1: String = 10485760

    scala> val data1 = Seq(10, 20, 20, 30, 40, 10, 40, 20, 20, 20, 20, 50)
    data1: Seq[Int] = List(10, 20, 20, 30, 40, 10, 40, 20, 20, 20, 20, 50)

    scala> val df1 = data1.toDF("id1")
    df1: org.apache.spark.sql.DataFrame = [id1: int]

    scala> val data2 = Seq(30, ...
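The nested-loop idea described in this post can be sketched in plain Python. This is an illustration of the technique only, not Spark's implementation: `nested_loop_join` and `small_side` are hypothetical names, and the values in `small_side` are made up (the excerpt's second dataset is cut off). The small side plays the role of the broadcast dataset, and the condition is a non-equi one to show why this strategy is needed for such joins.

```python
def nested_loop_join(left, right, condition):
    """Compare every left row with every right row; keep the pairs that pass."""
    return [(l, r) for l in left for r in right if condition(l, r)]


# data1 is taken from the example in the post.
data1 = [10, 20, 20, 30, 40, 10, 40, 20, 20, 20, 20, 50]

# Hypothetical small dataset standing in for the broadcast side.
small_side = [25, 45]

# Non-equi join condition: left value strictly greater than right value.
joined = nested_loop_join(data1, small_side, lambda l, r: l > r)

print(joined)
print(len(data1) * len(small_side), "comparisons were needed")
```

The number of comparisons is always the product of the two sizes, which is why this join gets slow quickly as either side grows.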

Everything about Cartesian Product in Spark

- October 01, 2021

Cartesian Product join (also known as Shuffle-and-Replicate Nested Loop join) works very similarly to a Broadcast Nested Loop join, except that the dataset is not broadcast. Shuffle-and-Replication does not mean a "true" shuffle, in which records with the same keys are sent to the same partition. Instead, the entire partition of the dataset is sent over, or replicated, to all the partitions for a full cross or nested-loop join.

We will go through all the above points in detail with examples.

We are setting spark.sql.autoBroadcastJoinThreshold to -1 to disable broadcast.

    scala> spark.conf.get("spark.sql.join.preferSortMergeJoin")
    res1: String = true

    scala> spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
    res2: String = -1

    scala> val data1 = Seq(10, 20, 20, 30, 40, 10, 40, 20, 20, 20, 20, 50)
    data1: Seq[Int] = List(10, 20, 20, 30, 40, 10, 40, 20, 20, 20, 20, 50)

    scala> val df1 = data1.toDF("id1")
    df1: org.apache.spark.sql.DataFrame = [id1: int]

    scala> val data2 = Seq(30, 20, 40, 50)
    data2: Seq[Int] = List(30, 2...
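The "replicate every partition to every other partition" behaviour described in this post can be sketched in plain Python. This is a rough illustration, not Spark's implementation: `split` is a hypothetical helper, and the partition counts (3 and 2) are assumptions. The two input lists are the `data1` and `data2` values from the post.

```python
from itertools import product


def split(rows, num_partitions):
    """Split rows into contiguous partitions of roughly equal size."""
    size = -(-len(rows) // num_partitions)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


data1 = [10, 20, 20, 30, 40, 10, 40, 20, 20, 20, 20, 50]
data2 = [30, 20, 40, 50]

left_parts = split(data1, 3)   # assumed: 3 partitions on the left
right_parts = split(data2, 2)  # assumed: 2 partitions on the right

# Every left partition meets every right partition, and inside each
# pairing every row is combined with every row (a full cross join).
pairs = []
for left_part, right_part in product(left_parts, right_parts):
    pairs.extend(product(left_part, right_part))

# A join condition is applied only after the pairs are formed.
matches = [(l, r) for l, r in pairs if l == r]

print(len(pairs), "pairs,", len(matches), "of them match on equality")
```

No key is used to decide where a row goes; the pairing of partitions alone guarantees that every combination of rows is produced once.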

Country Risk Data Analysis by Spark with Dataset

- October 01, 2021

Here we will analyse country risk data and do some manipulation on it. To do the manipulation we will apply fxrate data to this dataset, and to achieve performance benefits we will use a broadcast variable.

Sample Data:

    AU,,2017Q1,Account,100.1020,2000.1040
    KR,,2017Q1,Account,100.1020,2000.1040
    US,,2017Q1,Account,100.1020,2000.1040
    AU,,2018Q1,Account,100.1020,2000.1040
    US,,2018Q1,Account,100.1020,2000.1040
    AU,,2019Q1,Account,100.1020,2000.1040
    KR,,2019Q1,Account,100.1020,2000.1040
    AU,,2016Q1,Account,100.1020,2000.1040
    KR,,2016Q1,Account,100.1020,2000.1040
    AU,,2017Q1,Segment,100.1020,2000.1040
    AU,,2017Q1,Segment,100.1020,2000.1040
    US,,2017Q1,Account,100.1020,2000.1040

    package com.dpq.country.data.driver;

    import java.io.Serializable;
    import java.math.BigDecimal;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Map;

    import org.apache.spark.SparkConf;
    import or...
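The idea of applying a small fxrate lookup to every record can be sketched in plain Python. This is not the post's Java code: the rates in `FX_RATES` are invented, the function names are hypothetical, and treating the last two columns as amounts to be converted is an assumption about the sample data. The read-only dictionary plays the role of the broadcast variable: it is handed to the per-partition function once and looked up locally, instead of joining against a second dataset.

```python
from decimal import Decimal

# Hypothetical FX rates keyed by country code (values are made up).
FX_RATES = {
    "AU": Decimal("0.70"),
    "KR": Decimal("0.00084"),
    "US": Decimal("1"),
}


def parse_line(line):
    """Split one CSV row; the second column is empty in the sample data."""
    country, _, quarter, kind, amount_a, amount_b = line.split(",")
    return country, quarter, kind, Decimal(amount_a), Decimal(amount_b)


def convert_partition(lines, rates):
    """Apply the shared rate table to every row of one partition."""
    for line in lines:
        country, quarter, kind, amount_a, amount_b = parse_line(line)
        rate = rates[country]  # local lookup, no shuffle or join needed
        yield country, quarter, kind, amount_a * rate, amount_b * rate


# Three rows copied from the sample data in the post.
sample = [
    "AU,,2017Q1,Account,100.1020,2000.1040",
    "KR,,2017Q1,Account,100.1020,2000.1040",
    "US,,2017Q1,Account,100.1020,2000.1040",
]

converted = list(convert_partition(sample, FX_RATES))
for row in converted:
    print(row)
```

Decimal is used here because the Java excerpt imports BigDecimal; floating-point values would work for a rough sketch but can introduce rounding noise in currency amounts.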