
Showing posts with the label Hive

'Hive Default Partition' in detail

We will use the sample data below for this task. It contains employee details; some employees are members of one of the company's sports teams (Cricket, Football, Basketball, Tennis, or Badminton), while others are not members of any team (those records are highlighted in the picture above). You can download the sample data from here.

Solution: There are two types of partitions in Hive, static and dynamic. In a static partition table, we assign a value to the partition column ourselves, whereas in a dynamic partition, the value is assigned to the partition column automatically, based on the data available in the table for that column.

Step 1: Set up. In this step, we create a Hive table to load the sample data into. If you have already loaded the data into a table, skip this step and jump to the next one. Let's say we have a Hive table named emp_sports which stores the employee details (the sample data). We can create the table using the DDL below and load the dat…
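The "default partition" in the post title refers to Hive's behaviour on a dynamic-partition insert when the partition column is NULL: those rows land in the special `__HIVE_DEFAULT_PARTITION__` partition. A minimal HiveQL sketch (the column names `empId`, `name`, and `team` are assumptions based on the description; the actual sample data is not shown in the excerpt):

```sql
-- Partitioned target table; team is the partition column, not a data column.
CREATE TABLE emp_sports_part(
  empId INT,
  name  STRING)
PARTITIONED BY (team STRING);

-- Enable dynamic partitioning for this session.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Employees with no sports team (team IS NULL) go to the default partition.
INSERT OVERWRITE TABLE emp_sports_part PARTITION (team)
SELECT empId, name, team FROM emp_sports;

-- Those rows are then queryable as:
SELECT * FROM emp_sports_part
WHERE team = '__HIVE_DEFAULT_PARTITION__';
```

The partition name itself is configurable via `hive.exec.default.partition.name`; `__HIVE_DEFAULT_PARTITION__` is the out-of-the-box value.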

How to handle an incremental load for a Hive table

table_1 contains:

customer_id | items | price | updated_date
------------+-------+-------+-------------
         10 | watch |  1000 |     20170626
         11 | bat   |   400 |     20170625

table_2 contains:

customer_id | items    | price | updated_date
------------+----------+-------+-------------
         10 | computer | 20000 |     20170624

I want to update records of table_2 if the customer_id already exists in it; if not, the row should be appended to table_2. As Hive 0.13 does not support UPDATE, I tried using a join, but it fails.

There are two ways to handle this:
1st way: using row_number()
2nd way: using a full outer join plus UPDATE; but since Hive 0.13 does not support UPDATE, we cannot use the full-join approach.

Using row_number():

INSERT OVERWRITE TABLE table_1
SELECT customer_id, items, price, updated_date
FROM (
  SELECT customer_id, items, price, updated_date,
         row_number() OVER (PARTITION BY customer_id ORDER BY new_flag DESC) rn
  FROM (
    SELECT customer_id, items, price, updated_date, 0 AS new_flag
    FROM table_1…
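The excerpt truncates before the second half of the union. A complete version of the row_number() upsert pattern might look like the sketch below; the staging table name `incoming_data` is an assumption, standing in for wherever the new or changed rows arrive:

```sql
-- Merge-by-overwrite: keep exactly one row per customer_id,
-- preferring the incoming row (new_flag = 1) over the existing one.
INSERT OVERWRITE TABLE table_2
SELECT customer_id, items, price, updated_date
FROM (
  SELECT customer_id, items, price, updated_date,
         row_number() OVER (PARTITION BY customer_id
                            ORDER BY new_flag DESC) AS rn
  FROM (
    SELECT customer_id, items, price, updated_date, 0 AS new_flag
    FROM table_2                 -- existing rows
    UNION ALL
    SELECT customer_id, items, price, updated_date, 1 AS new_flag
    FROM incoming_data           -- hypothetical staging table of new rows
  ) unioned
) ranked
WHERE rn = 1;
```

Because INSERT OVERWRITE rewrites the whole table, this works on Hive versions without UPDATE support: matched customer_ids are replaced by the incoming row, and unmatched incoming rows are appended.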

Hive joins in detail, with examples

Hive Join | HiveQL Select Joins Query | Types of Join in Hive. In Apache Hive, we use the Hive JOIN clause (the HiveQL SELECT joins query) to combine specific fields from two tables using values common to each. To use it, we need to know the syntax of Hive joins, so in this article, "Hive Join – HiveQL Select Joins Query and its types", we cover the syntax of joins in Hive, along with an example to understand it well. There are several types of Hive join: inner join, left outer join, right outer join, and full outer join; we will also look at joining Hive tables in depth.

Apache Hive Join – HiveQL Select Joins Query. Basically, the Hive JOIN clause combines records from two or more tables in the database by matching values common to each, and it is more or less similar to the SQL JOIN. Also, we use it to combine ro…
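The four join types listed above can be sketched in HiveQL as follows. The `emp` and `dept` table names and the `deptno` join key are assumptions for illustration, not tables from the article:

```sql
-- Inner join: only employees whose deptno matches a department.
SELECT e.name, d.dept_name
FROM emp e JOIN dept d ON e.deptno = d.deptno;

-- Left outer join: every employee; dept_name is NULL when there is no match.
SELECT e.name, d.dept_name
FROM emp e LEFT OUTER JOIN dept d ON e.deptno = d.deptno;

-- Right outer join: every department; name is NULL when it has no employees.
SELECT e.name, d.dept_name
FROM emp e RIGHT OUTER JOIN dept d ON e.deptno = d.deptno;

-- Full outer join: all rows from both sides, NULL-padded where unmatched.
SELECT e.name, d.dept_name
FROM emp e FULL OUTER JOIN dept d ON e.deptno = d.deptno;
```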

Hive interview questions

Q1. What is Hive, and why is it useful?
Ans. Hive is a data warehouse application where data is stored in a structured format. It is used for querying and managing large datasets, and it provides a SQL-like interface to the data called HiveQL (HQL).

Q2. What are the advantages of Hive?
Ans. The advantages of Hive:
==> Hive stores its data on distributed storage.
==> It supports a variety of storage formats (Text, Sequence, Parquet, ORC).
==> It also enables an easy ETL process (Extract, Transform, Load).

Q3. What are the execution engines of Hive?
Ans. There are two Hive execution engines: MapReduce and Tez. Hive on Tez represents the query as a Directed Acyclic Graph (DAG) instead of multiple stages of a MapReduce program, which involve a lot of synchronization barriers and I/O overhead. Tez improves on this by writing intermediate data sets to memory instead of to disk.

Q4. What are the different file formats supported in Hive?
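Switching between the two engines described in Q3 is a session-level setting; a minimal sketch:

```sql
-- Run subsequent queries in this session on Tez.
SET hive.execution.engine=tez;

-- Switch back to classic MapReduce.
SET hive.execution.engine=mr;
```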

Hive Partitioning in detail

Requirement: Suppose there is some source data that needs to be stored in a partitioned Hive table. Our requirement is to store the data in the Hive table using static and dynamic partitions; with an understanding of partitioning in Hive, we will see where to use static and where to use dynamic partitions.

Solution: First, create a temp table to hold the data. This table will have all the data, and from it we will load data into static and dynamic partitioned Hive tables based on the partition column(s).

Create the table:

CREATE TABLE partitioned_temp(
  empId INT,
  firstname STRING,
  lastname STRING,
  city STRING,
  mobile STRING,
  yearofexperience INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Load data into the created table:

LOAD DATA LOCAL INPATH '/user/data/employee.txt' INTO TABLE partitioned_temp;

Static partition, implemented on a managed table: suppose below are the sample data which we will store in the…
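Following on from the temp table above, loading a partitioned target table both ways might look like the sketch below. The `partitioned_emp` table name and the choice of `city` as the partition column are assumptions; the excerpt truncates before the post names them:

```sql
-- Target table partitioned by city (city moves out of the column list).
CREATE TABLE partitioned_emp(
  empId INT,
  firstname STRING,
  lastname STRING,
  mobile STRING,
  yearofexperience INT)
PARTITIONED BY (city STRING);

-- Static partition: we assign the partition value ourselves.
INSERT OVERWRITE TABLE partitioned_emp PARTITION (city = 'Bangalore')
SELECT empId, firstname, lastname, mobile, yearofexperience
FROM partitioned_temp
WHERE city = 'Bangalore';

-- Dynamic partition: Hive derives the value from the last SELECT column.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE partitioned_emp PARTITION (city)
SELECT empId, firstname, lastname, mobile, yearofexperience, city
FROM partitioned_temp;
```

Note that in the dynamic insert the partition column must come last in the SELECT list, since Hive maps it positionally.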

Hive scenario-based interview questions

Q. Let's say a Hive table is created as an external table. If we drop the table, will the data still be accessible?
Answer: Yes, the data will still be accessible even after the table is dropped. We can get the data from the table's HDFS location.

Q. A Hive table is created as an external table at a location, say hdfs://usr/data/table_name. If we dump a data set that matches the table structure into that location, will we be able to fetch the records from the table using a SELECT query?
Answer: Yes, we will be able to fetch the records from the table after dumping the data set at the table's external location.

Q. A Hive partitioned table is created, partitioned by a column, say yearofexperience. If we create a directory, say yearofexperience=3, at the HDFS path of the table and dump a data set that matches the table structure into it, will the data be available if we execute a SELECT query on the table?
Answer: No, the data will not be accessible by executing a SELECT query on the table. After dumping the…
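The reason behind that last answer is that the Hive metastore does not know about a partition directory created directly on HDFS until the partition is registered. A sketch of the two common ways to register it, assuming the table is named `table_name` as in the path above:

```sql
-- Option 1: register this specific directory as a partition.
ALTER TABLE table_name ADD PARTITION (yearofexperience = 3);

-- Option 2: scan the table location and add all missing partitions at once.
MSCK REPAIR TABLE table_name;
```

After either command, a SELECT query on the table will return the rows from the yearofexperience=3 directory.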

What is Apache Spark? Its life cycle and components in detail

Apache Spark is a general-purpose cluster computing system for processing big data workloads. What sets Spark apart from its predecessors, such as MapReduce, is its speed, ease of use, and sophisticated analytics. Apache Spark was originally developed at AMPLab, UC Berkeley, in 2009. It was made open source in 2010 under the BSD license and switched to the Apache 2.0 license in 2013. Toward the later part of 2013, the creators of Spark founded Databricks to focus on Spark's development and future releases.

Talking about speed, Spark can achieve sub-second latency on big data workloads. To achieve such low latency, Spark makes use of memory for storage: whereas MapReduce uses memory primarily for the actual computation, Spark uses memory both to compute and to store objects. Spark also provides a unified runtime connecting to various big data storage sources, such as HDFS, Cassandra, HBase, and S3, along with a rich set of higher-level libraries for dif…