'Hive Default Partition' in details

We will use below sample data for the task.

This is the sample data of employee details. Some employees are the member of company’s sports team like Cricket, Football, Basketball, Tennis, Badminton. Some employees are not a member of any sports team that records are highlighted in the above picture.

You can download the sample data from here.

Solution

There are 2 types of partitions in hive – Static and Dynamic. We do assign a value to a partition column in static partition table whereas, in the dynamic partition, the value gets assigned to the partitioned column dynamically based on the data available in the table for the defined partition column

Step 1: Set Up

In this step, we are creating a hive table for loading the sample data. If you have already loaded the data into a table then skip this step and jump to the next step.

Let’s say we are having a hive table named emp_sports which stores employees details (sample data).  We can create the table using below DDL and load the data into it:

 

CREATE TABLE db_bdpbase.emp_sports(

id INT,

firstname STRING,

lastname STRING,

sports STRING,

city STRING,

country STRING

) ROW FORMAT DELIMITED

FIELDS TERMINATED BY ‘,’

STORED AS TEXTFILE

TBLPROPERTIES(“skip.header.line.count”=“1”);

LOAD DATA INPATH ‘/usr/bdp/hive/sample_data.csv’ INTO db_bdpbase.emp_sports;

Here, we have created a table and loaded the sample data.

Let’s check the data in this table:

 SELECT * FROM db_bdpbase.emp_sports LIMIT 15;

Step 2: Create Partitioned Table

In this step, we are going to create a hive partition table which is partitioned by Sports column.

The DDL will look like:

CREATE TABLE DB_BDPBASE.DEFAULT_PARTITION_TEST(

ID    INT,

FIRSTNAME   STRING,

LASTNAME STRING,

CITY STRING,

COUNTRY  STRING

) PARTITIONED BY (

SPORTS STRING

) ROW FORMAT DELIMITED

FIELDS TERMINATED  BY  ‘,’;

Step 3: Load data into Partitioned Table

In this step, we are inserting data from the base table into the created hive partitioned table.

In this step, we are inserting data from the base table into the created hive partitioned table.

INSERT OVERWRITE TABLE DB_BDPBASE.DEFAULT_PARTITION_TEST PARTITION (SPORTS)SELECT ID, FIRSTNAME, LASTNAME, CITY, COUNTRY, SPORTS FROM DB_BDPBASE.EMP_SPORTS;

Here, total 6 partitions have created which are Badminton, Basketball, Cricket, Football, Tennis, and HIVE_DEFAULT_PARTITION. The HIVE_DEFAULT_PARTITION partition has been created because of NULL value of the partitioned column. The records which are having a NULL value for sports are getting stored in HIVE_DEFAULT_PARTITION partition.

SHOW PARTITIONS db_bdpbase.DEFAULT_PARTITION_TEST;

Let’s check records of this default partition:

 SELECT * FROM db_bdpbase.DEFAULT_PARTITION_TEST WHERE sports=“__HIVE_DEFAULT_PARTITION__”;

 

Wrapping Up

In this post, we understood what is the default partition in hive. This gets create when we load data into partitioned table dynamically and partition column has a NULL value.

Popular posts from this blog

Window function in PySpark with Joins example using 2 Dataframes (inner join)

Complex SQL: fetch the users who logged in consecutively 3 or more times (lead perfect example)

Credit Card Data Analysis using PySpark (how to use auto broadcast join after disabling it)