Posts

Showing posts from September, 2021

Airline dataset processing with SparkSQL

Now let's explore the dataset using Spark SQL and DataFrame transformations. After we register the DataFrame as a SQL temporary view, we can use SQL functions on the SparkSession to run SQL queries, which will return the results as a DataFrame. We cache the DataFrame, since we will reuse it and because Spark can cache DataFrames or tables in columnar format in memory, which can improve memory usage and performance.

// cache DataFrame in columnar format in memory
df.cache
// create a table view of the DataFrame for Spark SQL
df.createOrReplaceTempView("flights")
// cache the flights table in columnar format in memory
spark.catalog.cacheTable("flights")

Below, we display information for the top five longest departure delays with Spark SQL and with DataFrame transformations (where a delay is considered greater than 40 minutes):

// Spark SQL
spark.sql("select carrier, origin, dest, depdelay, crsdephour, dist, dofW from flights where depdelay > 40 order by
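As a rough sketch (not this post's exact code), the completed top-five query and a DataFrame-API equivalent could look like the following, assuming the query orders by depdelay descending and keeps five rows:

import org.apache.spark.sql.functions._

// minimal sketch: the ordering column and the limit of five rows are assumptions
spark.sql("""select carrier, origin, dest, depdelay, crsdephour, dist, dofW
             from flights
             where depdelay > 40
             order by depdelay desc limit 5""").show

// the same top-five result expressed with DataFrame transformations
df.select("carrier", "origin", "dest", "depdelay", "crsdephour", "dist", "dofW")
  .filter(col("depdelay") > 40)
  .orderBy(desc("depdelay"))
  .limit(5)
  .show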

SQL tutorial in detail with exercises and its solutions

I have taken MySQL to explain queries.

MySQL Introduction
Note: MySQL is a third-party product (from Sun Microsystems).
C:\mysql -u root

Types of Table (Engine)
MyISAM: does not support foreign key constraints
InnoDB: supports foreign key constraints
BDB: supported in UNIX environments
Heap: a temporary or virtual table, created only in memory, not on the hard disk
Merge: used if we want to merge more than one table (it is also a temporary or virtual table)
Syntax: Create table list ( ..., ..., ... ) engine=InnoDB;

MySQL Commands:
Show databases;
Create database db_name;
Use dbname;
Show tables;
Create table tb_name(id int, name varchar(20));
Desc tb_name;
Insert into tb_name values(101, 'lokesh kakkar');
Insert into tb_name (id) values(102);
Update tb_name set name='luck key' where id=102;
Select * from tb_name;
Delete from tb_name where id=102;

How to import data from a file:
mysql -u root < db.sql (for database and tables)
mysql -u root < data.sql (for data into tables)
Cre

What is the Catalyst query optimizer?

Spark SQL was designed with an optimizer called Catalyst, based on the functional programming of Scala. Its two main purposes are: first, to add new optimization techniques to solve some problems with "big data", and second, to allow developers to expand and customize the functions of the optimizer.

Catalyst
Figure: Spark SQL architecture and Catalyst optimizer integration

Catalyst components
The main components of the Catalyst optimizer are as follows:

Trees
The main data type in Catalyst is the tree. Each tree is composed of nodes, and each node has a node type and zero or more children. These node objects are immutable and can be manipulated with functional language constructs. As an example, let me show you the use of the following nodes:

Merge(Attribute(x), Merge(Literal(1), Literal(2)))

Where:
Literal(value: Int): a constant value
Attribute(name: String): an attribute of an input row
Merge(left: TreeNode, right: TreeNo
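To make the tree idea concrete, here is a minimal Scala sketch of my own, using the node names from the example above (Catalyst's real TreeNode classes are more elaborate), plus a tiny rule that folds two literals into one via pattern matching:

// minimal sketch, assuming the node names Literal, Attribute and Merge from the example;
// not Catalyst's actual API
sealed trait TreeNode
case class Literal(value: Int) extends TreeNode             // a constant value
case class Attribute(name: String) extends TreeNode         // an attribute of an input row
case class Merge(left: TreeNode, right: TreeNode) extends TreeNode

// a simple rule: fold a Merge of two Literals into a single Literal
def fold(node: TreeNode): TreeNode = node match {
  case Merge(Literal(a), Literal(b)) => Literal(a + b)
  case Merge(l, r)                   => Merge(fold(l), fold(r))
  case other                         => other
}

// Merge(Attribute("x"), Merge(Literal(1), Literal(2))) becomes Merge(Attribute("x"), Literal(3))
val optimized = fold(Merge(Attribute("x"), Merge(Literal(1), Literal(2))))

This is the same style Catalyst uses: rules are partial functions over immutable trees, applied repeatedly until the tree stops changing.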

When will you choose RDDs over Spark SQL and DataFrames?

RDDs, or Resilient Distributed Datasets, are the fundamental data structure of Spark. An RDD is a collection of objects that stores data partitioned across the multiple nodes of the cluster and allows it to be processed in parallel. It is fault-tolerant: if you perform multiple transformations on the RDD and then, for any reason, a node fails, the RDD is capable of recovering automatically.

When to use RDDs? We can use RDDs in the following situations:
- you want low-level transformations and actions and control over your dataset;
- your data is unstructured, such as media streams or streams of text;
- you want to manipulate your data with functional programming constructs rather than domain-specific expressions;
- you don't care about imposing a schema, such as columnar format, while processing or accessing data attributes by name or column;
- and you can forgo some optimization and performance benefits available with DataFrames and Datasets for structured and
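As a minimal sketch of that low-level, functional style on unstructured text (the file path and log format are placeholders, purely for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("rdd-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// unstructured text: no schema, just lines
val lines = sc.textFile("logs.txt")                 // placeholder path

// low-level, functional transformations: filter, map, reduceByKey
val errorCounts = lines
  .filter(_.contains("ERROR"))
  .map(line => (line.split(" ")(0), 1))             // assume the first token identifies the record
  .reduceByKey(_ + _)

errorCounts.take(5).foreach(println)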

RDD vs DataFrame vs Dataset

Initially, in 2011, they came up with the concept of RDDs, then in 2013 with DataFrames, and later in 2015 with the concept of Datasets. None of them has been deprecated; we can still use all of them. In this article, we will understand and see the difference between all three of them.

RDDs vs DataFrames vs Datasets

Data Representation: An RDD is a distributed collection of data elements without any schema. A DataFrame is also a distributed collection, but organized into named columns. A Dataset is an extension of DataFrames with more features, such as type safety and an object-oriented interface.

Optimization: There is no built-in optimization engine for RDDs; developers need to write the optimized code themselves. DataFrames use the Catalyst optimizer for optimization. Datasets also use the Catalyst optimizer for optimization purposes.

Projection of Schema: With RDDs, we need to define the schema manually. A DataFrame will automatically find out the schema of the dataset. A Dataset will also automatically find out the schema of th
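A small illustrative sketch of my own (the Person case class is hypothetical, not from the article) that puts the same records into an RDD, a DataFrame and a Dataset, so the schema and type-safety differences are visible:

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)            // illustrative case class, not from the article

val spark = SparkSession.builder.appName("rdd-df-ds").master("local[*]").getOrCreate()
import spark.implicits._

// RDD: no schema, just a distributed collection of objects
val rdd = spark.sparkContext.parallelize(Seq(Person("Ada", 36), Person("Linus", 29)))

// DataFrame: named columns with an inferred schema; rows are untyped Row objects
val df = rdd.toDF()
df.printSchema()

// Dataset: named columns plus compile-time types; a wrong field name fails at compile time
val ds = rdd.toDS()
ds.filter(_.age >= 30).show()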

All about Kryo serialiser in Spark

Q. Why is Kryo faster?
Ans: Kryo is significantly faster and more compact than Java serialization.

Q. If the Kryo serialiser is faster, then why is it not the default serialiser?
Ans: Kryo does not support all Serializable types and requires you to register the classes you'll use in the program in advance for best performance. So it is not used by default because:
- Not every java.io.Serializable is supported out of the box: if you have a custom class that extends Serializable, it still cannot be serialized with Kryo unless it is registered.
- One needs to register custom classes.
The only reason Kryo is not set as the default is that it requires custom registration. Also, although Kryo is supported for RDD caching and shuffling, it is not natively supported for serializing to disk: both saveAsObjectFile on RDD and the objectFile method on SparkContext support only Java serialization. Still, if you need a performance boost and also need to reduce memory usage, Kryo is definitely for you. The join operations
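As a quick sketch of what that registration looks like (the Flight and Airport classes here are placeholders, not from this post), Kryo is enabled and the custom classes are registered on the SparkConf:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// placeholder domain classes, purely for illustration
case class Flight(carrier: String, depdelay: Int)
case class Airport(code: String)

val conf = new SparkConf()
  .setAppName("kryo-sketch")
  .setMaster("local[*]")
  // switch from the default Java serializer to Kryo
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // register the custom classes in advance for best performance
  .registerKryoClasses(Array(classOf[Flight], classOf[Airport]))
  // optional: fail fast if an unregistered class is serialized
  .set("spark.kryo.registrationRequired", "true")

val spark = SparkSession.builder.config(conf).getOrCreate()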

Show a practical example to list files, insert data, retrieve data and shut down HDFS.

Assignment – 3
Question 3: Show a practical example to list files, insert data, retrieve data and shut down HDFS.

Initially, you have to format the configured HDFS file system, open the namenode (HDFS server), and execute the following command.
$ hadoop namenode -format

After formatting HDFS, start the distributed file system. The following command will start the namenode as well as the datanodes as a cluster.
$ start-dfs.sh

Listing Files in HDFS
After loading the information into the server, we can find the list of files in a directory, or the status of a file, using ls. Given below is the syntax of ls; you can pass a directory or a filename as an argument.
$ $HADOOP_HOME/bin/hadoop fs -ls <args>

Inserting Data into HDFS
Assume we have data in a file called file.txt in the local system which ought to be saved in the HDFS file system. Follow the steps given below to insert the required file into the Hadoop file system.
Step 1: You have to create an input directory.
$ $
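The same list, insert and retrieve steps can also be done programmatically; below is a minimal Scala sketch against the Hadoop FileSystem API, with placeholder paths (not part of the assignment itself):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()                       // picks up core-site.xml / hdfs-site.xml from the classpath
val fs   = FileSystem.get(conf)

// list files in a directory (like `hadoop fs -ls /user/input`)
fs.listStatus(new Path("/user/input")).foreach(status => println(status.getPath))

// insert data: copy a local file into HDFS (like `hadoop fs -put file.txt /user/input`)
fs.copyFromLocalFile(new Path("file.txt"), new Path("/user/input/file.txt"))

// retrieve data: copy a file from HDFS back to the local file system (like `hadoop fs -get`)
fs.copyToLocalFile(new Path("/user/input/file.txt"), new Path("file_copy.txt"))

fs.close()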

Perform the following tasks on a big data platform such as Hadoop:

Assignment – 2
Q2. Perform the following tasks on a big data platform such as Hadoop:
- Run Map and Reduce codes
- Data storage and retrieval operations
- Batch processing operations

Run Map and Reduce codes
First Hadoop MapReduce Program
Step 1) Create a new directory with the name MapReduceTutorial
sudo mkdir MapReduceTutorial
Give permissions:
sudo chmod -R 777 MapReduceTutorial

SalesMapper.java

package SalesCountry;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SalesMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String valueString = value.toString();
        String[] SingleCountryData = valueString.spl