Hive interview Questions

Q1. What is Hive and why it is useful?

Ans. Hive is a data warehouse application built on top of Hadoop that stores data in a structured format. It is used for querying and managing large datasets, and it provides a SQL-like interface to access the data, called HiveQL (HQL).
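For illustration, a minimal HiveQL query reads just like standard SQL. The `employees` table and its columns here are hypothetical:

```sql
-- HiveQL example: aggregate over a hypothetical employees table.
SELECT department, COUNT(*) AS headcount
FROM employees
WHERE hire_year >= 2020
GROUP BY department;
```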

Q2. What are the advantages of Hive?
Ans. The advantages of Hive:
==> Hive stores data in a distributed fashion (on HDFS).
==> It supports a variety of data storage formats (Text, Sequence, Parquet, ORC).
==> It also enables an easy ETL process (Extract, Transform, Load).

Q3. What are the execution engines of Hive?
Ans. There are two main Hive execution engines – MapReduce and Tez (later Hive versions also support Spark).

Hive on Tez represents the query as a single Directed Acyclic Graph (DAG) instead of multiple stages of MapReduce jobs, which involve many synchronization barriers and I/O overheads.

Tez further improves on this by keeping intermediate data sets in memory instead of writing them to the hard disk.
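The engine can be switched per session with the standard `hive.execution.engine` property:

```sql
-- Switch the current session to the Tez engine.
SET hive.execution.engine=tez;

-- Switch back to classic MapReduce.
SET hive.execution.engine=mr;
```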

Q4. What are the different file formats supported in Hive?

Ans. The file formats supported by Hive are:
Text
Sequence
ORC
Parquet
Avro
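The storage format is chosen at table-creation time with the STORED AS clause. A minimal sketch, using a hypothetical `logs_orc` table:

```sql
-- Any of TEXTFILE, SEQUENCEFILE, ORC, PARQUET, or AVRO can follow STORED AS.
CREATE TABLE logs_orc (
  id INT,
  message STRING
)
STORED AS ORC;
```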

Q5. What is metastore in Hive?

Ans. The metastore in Hive is a central repository where information about databases, tables, and partitions is stored. This information is called metadata.

Local and remote metastores are the two types of Hive metastore configurations.

Metastore sharing between multiple users is not supported by Hive when the embedded Derby database is used.

Local metastores run in the same Java Virtual Machine (JVM) as the Hive service, whereas remote metastores run in a separate, distinct JVM.

The default metastore database is the embedded Derby database provided by Hive, which is backed by the local disk.

Q6. What are the different types of tables available in Hive?

Ans. There are two types of tables in Hive:
Managed Table (Internal Table)
External Table

Q7. Where to use the external and managed table in Hive?

Ans. Managed Table: For this table type, Hive controls the lifecycle of the data. This is the default table type in Hive.

When you load data into a managed table, it is stored in a subdirectory under the warehouse directory, which is defined by the property hive.metastore.warehouse.dir in the configuration file hive-site.xml and defaults to /user/hive/warehouse.

External Table: For this table type, Hive does not manage the data itself. The data lives at an external location, and only the table definition is registered in the metastore.

While external tables give Hive control of the schema but not of the data, managed tables give Hive both schema and data control.
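The difference shows up directly in the DDL. A sketch with hypothetical tables:

```sql
-- Managed table: Hive owns both schema and data;
-- DROP TABLE deletes the files under the warehouse directory.
CREATE TABLE sales_managed (id INT, amount DOUBLE);

-- External table: Hive owns only the schema;
-- DROP TABLE leaves the files at the given location untouched.
CREATE EXTERNAL TABLE sales_external (id INT, amount DOUBLE)
LOCATION '/data/sales';
```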

Q8. CAN HIVE BE USED IN OLTP SYSTEMS?

Since Hive does not support row-level data insertion, it is not suitable for use in OLTP systems.

Q9. CAN A TABLE NAME BE CHANGED IN HIVE?

Yes, you can change a table name in Hive. You can rename a table by using: ALTER TABLE table_name RENAME TO new_name;

Q10. WHAT ARE THE THREE DIFFERENT MODES IN WHICH HIVE CAN BE OPERATED?

The three modes in which Hive can be operated are Local mode, distributed mode, and pseudo-distributed mode.

Q11. IS THERE A DATA TYPE IN HIVE TO STORE DATE INFORMATION?

The TIMESTAMP data type in Hive stores date information in the java.sql.Timestamp format. Hive also provides a DATE type for calendar dates.

Q12. WHY IS PARTITIONING USED IN HIVE?

Partitioning is used in Hive as it allows for the reduction of query latency. Instead of scanning entire tables, only relevant partitions and corresponding datasets are scanned.
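A minimal partitioning sketch, using a hypothetical `page_views` table; each distinct `dt` value becomes its own HDFS subdirectory, and queries that filter on it scan only that directory:

```sql
CREATE TABLE page_views (user_id INT, url STRING)
PARTITIONED BY (dt STRING);

-- Only the dt='2024-01-01' partition is scanned, not the whole table.
SELECT COUNT(*) FROM page_views WHERE dt = '2024-01-01';
```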

Q13. WHAT ARE THE HIVE COLLECTION DATA TYPES?

ARRAY, MAP, and STRUCT are the three Hive collection data types.
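A sketch showing all three collection types in one hypothetical table, along with the element-access syntax:

```sql
CREATE TABLE employees_complex (
  name    STRING,
  skills  ARRAY<STRING>,                 -- ordered list of values
  scores  MAP<STRING, INT>,              -- key/value pairs
  address STRUCT<city:STRING, zip:STRING>  -- named fields
);

-- Access: array index, map key, struct field.
SELECT name, skills[0], scores['hive'], address.city
FROM employees_complex;
```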

Q14. IS IT POSSIBLE TO RUN UNIX SHELL COMMANDS IN HIVE?

Yes, one can run shell commands in Hive by adding a ‘!’ before the command.

Q15. IS IT POSSIBLE TO EXECUTE HIVE QUERIES FROM A SCRIPT FILE?

Yes, one can do so with the help of the SOURCE command. For example – hive> source /path/queryfile.hql

Q16. WHAT IS A .HIVERC FILE?

It is a file that consists of a list of commands that are run when the Hive Command Line Interface (CLI) is started.

Q17. HOW CAN YOU CHECK IF A SPECIFIC PARTITION EXISTS?

Use the following command: SHOW PARTITIONS table_name PARTITION (partitioned_column='partition_value')

Q18. IF YOU HAD TO LIST ALL DATABASES THAT BEGAN WITH THE LETTER ‘C’, HOW WOULD YOU DO IT?

By using the following command: SHOW DATABASES LIKE 'c*'

Q19. IS IT POSSIBLE TO DELETE DBPROPERTY IN HIVE?

No, there is no way to delete the DBPROPERTY.

Q20. WHICH JAVA CLASS HANDLES THE INPUT RECORD ENCODING INTO FILES THAT STORE HIVE TABLES?

The ‘org.apache.hadoop.mapred.TextInputFormat’ class.

Q21. WHICH JAVA CLASS HANDLES OUTPUT RECORD ENCODING INTO HIVE QUERY FILES?

The ‘org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat’ class.

Q22. WHEN A HIVE TABLE PARTITION IS POINTED TO A NEW DIRECTORY, WHAT HAPPENS TO THE DATA?

The data remains in the old directory and needs to be transferred manually.

Q23. DO YOU SAVE SPACE IN THE HDFS BY ARCHIVING HIVE TABLES?

No, archiving Hive tables does not save HDFS space; it only reduces the number of files, which makes managing the data easier.

Q24. HOW CAN YOU STOP A PARTITION FROM BEING ACCESSED IN A QUERY?

Use the ENABLE OFFLINE clause along with the ALTER TABLE command.

Q25. WHAT IS A TABLE GENERATING FUNCTION ON HIVE?

A table-generating function (UDTF) takes a single input row and produces multiple output rows, i.e. a table. A built-in example is explode(), which emits one row per element of an array or map.
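A short sketch of the built-in explode() UDTF; the `employees` table with an `ARRAY<STRING>` column `skills` is hypothetical:

```sql
-- One input row (the array literal) becomes three output rows.
SELECT explode(array(1, 2, 3)) AS n;

-- LATERAL VIEW applies the UDTF to every row of a table.
SELECT name, skill
FROM employees
LATERAL VIEW explode(skills) s AS skill;
```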

Q26. CAN YOU AVOID MAPREDUCE ON HIVE?

You can make Hive avoid MapReduce to return query results by setting the hive.exec.mode.local.auto property to ‘true’.
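The session-level setting looks like this:

```sql
-- Let Hive decide when a query is small enough to run locally,
-- without launching a MapReduce job.
SET hive.exec.mode.local.auto=true;
```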

Q27. CAN A CARTESIAN JOIN BE CREATED BETWEEN TWO HIVE TABLES?

Yes, a Cartesian join (CROSS JOIN in HiveQL) can be created, but it is very inefficient: it cannot be parallelized well in MapReduce, since every row of one table must be combined with every row of the other.

Q28. WHAT IS A VIEW IN HIVE?

A view is a logical construct that allows search queries to be treated as tables.
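A minimal sketch, assuming a hypothetical `employees` table; the view stores only the query text, not data:

```sql
CREATE VIEW senior_employees AS
SELECT name, department
FROM employees
WHERE years_of_service > 10;

-- The view can then be queried like a table.
SELECT * FROM senior_employees;
```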

Q29. CAN THE NAME OF A VIEW BE THE SAME AS A HIVE TABLE NAME?

No, the name of a view must be unique among the tables and views in the database.

Q30. CAN WE USE THE LOAD OR INSERT COMMAND ON A VIEW?

No, these commands cannot be used with respect to a view in Hive.

Q31. WHAT IS INDEXING IN HIVE?

Hive indexing is a query optimization technique to reduce the time needed to access a column or a set of columns within a Hive database.

Q32. ARE MULTI-LINE COMMENTS SUPPORTED BY HIVE?

No, multi-line comments are not supported by Hive.

Q33. HOW CAN YOU VIEW THE INDEXES OF A HIVE TABLE?

By using the following command: SHOW INDEX ON table_name

Q34. WHAT IS THE HIVE OBJECTINSPECTOR FUNCTION?

ObjectInspector is used to analyze the structure of individual columns and the internal structure of row objects. It provides access to complex objects that can be stored in multiple formats in Hive.

Q35. WHAT IS BUCKETING?

Bucketing is the process of hashing the values in a column into several user-defined buckets which helps avoid over-partitioning.
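A minimal bucketing sketch with a hypothetical table; rows are hashed on `user_id` into a fixed number of bucket files:

```sql
CREATE TABLE page_views_bucketed (user_id INT, url STRING)
CLUSTERED BY (user_id) INTO 4 BUCKETS;
```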

Q36. HOW IS BUCKETING HELPFUL?

Bucketing helps optimize the sampling process and shortens the query response time.

Q37. CAN YOU SPECIFY THE NAME OF THE TABLE CREATOR IN HIVE?

Yes, by using the TBLPROPERTIES clause. For example – TBLPROPERTIES ('creator'='john')

Q38. WHAT IS HCATALOG?

HCatalog is a table and storage management tool that helps share Hive's data structures with other external systems in the Hadoop ecosystem.

Q39. WHAT IS UDF IN HIVE?

A UDF is a user-defined function, typically written as a Java program, that implements specific logic not covered by the existing built-in Hive functions.
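Once the Java class is packaged into a jar, registering it from HiveQL typically looks like this; the jar path, function name, and class name below are placeholders:

```sql
-- Make the jar containing the UDF class available to the session.
ADD JAR /path/to/my_udfs.jar;

-- Register the class under a SQL-callable name.
CREATE TEMPORARY FUNCTION my_upper AS 'com.example.hive.MyUpperUDF';

-- Use it like any built-in function (employees is hypothetical).
SELECT my_upper(name) FROM employees;
```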

Q40. WHAT DOES /*+ STREAMTABLE(TABLE_NAME) */ DO?

It is a query hint that tells Hive which table to stream through a join (read row by row) while the other join tables are buffered in memory. By default, Hive streams the rightmost table in the join.
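A sketch of the hint in use, with hypothetical tables; here `b` is streamed while `s` is buffered:

```sql
SELECT /*+ STREAMTABLE(b) */ b.id, s.label
FROM big_table b
JOIN small_table s ON b.id = s.id;
```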

Q41. WHAT ARE THE LIMITATIONS OF HIVE?

Hive has the following limitations:

  • Real-time queries cannot be executed, and row-level updates are not supported.
  • Hive cannot be used for online transaction processing.

Q42. WHY DO YOU NEED HCATALOG?

For sharing data structures with external systems, HCatalog is a necessary tool. It offers access to the Hive metastore for reading and writing data in a Hive data warehouse.

Q43. NAME THE COMPONENTS OF A HIVE QUERY PROCESSOR.

Following are the components of a Hive query processor:

  1. Logical Plan Generation.
  2. Physical Plan Generation.
  3. Execution Engine.
  4. UDFs and UDAFs.
  5. Operators.
  6. Optimizer.
  7. Parser.
  8. Semantic Analyzer.
  9. Type Checking.

Q44. HOW DOES HIVE DISTRIBUTE ROWS INTO BUCKETS?

Hive uses the formula hash_function (bucketing_column) modulo (num_of_buckets) to calculate a row's bucket number. Here, hash_function depends on the data type of the column. For an integer column, the hash function is simply the value itself:

hash_function (int_type_column) = value of int_type_column

Q45. HOW DO ORC FORMAT TABLES HELP HIVE ENHANCE PERFORMANCE?

Storing Hive data in the ORC (Optimized Row Columnar) format helps overcome several limitations of the other file formats.

ORC stores data in a more optimized, columnar, compressed way than the other file formats and can reduce the size of the original data by up to 75%. As a result, the speed of data processing also increases, and performance is better than with the Text, Sequence, and RC file formats.

Q46. WHAT ARE THE DIFFERENT COMPONENTS OF A HIVE ARCHITECTURE?

Following are the five components of a Hive architecture:

  1. User Interface: It lets the user submit queries and other operations to the Hive system. The user interface includes the Hive Web UI, the Hive command line, and Hive HDInsight.
  2. Driver: It creates a session handle for the query and sends the query to the compiler to generate an execution plan.
  3. Metastore: It contains the organized data and information on the various warehouse tables and partitions.
  4. Compiler: It creates the execution plan for the query, performs semantic analysis on the different query blocks, and generates the query expression.
  5. Execution Engine: It executes the plans created by the compiler.

Q47. When should we use SORT BY instead of ORDER BY?

Ans. We should use SORT BY instead of ORDER BY when we have to sort huge datasets. The reason is that the SORT BY clause sorts the data using multiple reducers, whereas ORDER BY sorts all of the data together using a single reducer.

Hence, using ORDER BY takes a long time to execute for a large number of input rows.
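The two clauses side by side, using a hypothetical table:

```sql
-- ORDER BY: one reducer, globally sorted output (slow on huge data).
SELECT * FROM page_views ORDER BY user_id;

-- SORT BY: many reducers; each reducer's output is sorted
-- independently, so there is no single global order.
SELECT * FROM page_views SORT BY user_id;
```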

Q48. What is dynamic partitioning and when is it used?

Ans. With dynamic partitioning, the values for the partition columns become known only at runtime, i.e. while the data is being loaded into a Hive table.

  • Usage:
  1. While loading data from an existing non-partitioned table, in order to improve sampling and thus decrease query latency.
  2. Also, when we do not know all the values of the partitions beforehand; finding these partition values manually in a huge dataset is a tedious task.
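A minimal sketch, with hypothetical tables; the partition value is taken from the last column of the SELECT at load time:

```sql
-- Enable dynamic partitioning for the session.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Each distinct dt value in the source creates its own partition.
INSERT OVERWRITE TABLE page_views PARTITION (dt)
SELECT user_id, url, dt FROM raw_page_views;
```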

Q49. Why do we need buckets?

Ans. Basically, there are two main reasons for bucketing a partition:

  • A map-side join requires the data belonging to a unique join key to be present in the same bucket.
  • It allows us to decrease the query time and makes the sampling process more efficient.

Q50. Explain the different types of join in Hive.

Ans. There are 4 different types of joins in HiveQL –

  • JOIN – It is similar to the inner join in SQL; only the records that fulfill the join condition in both tables are returned.
  • FULL OUTER JOIN – This join combines the records of both the left and the right table, returning all rows from both sides, whether or not they fulfill the join condition.
  • LEFT OUTER JOIN – Through this join, all the rows from the left table are returned even if there are no matches in the right table.
  • RIGHT OUTER JOIN – Here, all the rows from the right table are returned even if there are no matches in the left table.
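A sketch of one of these, with hypothetical tables; every order is returned, and the customer columns are NULL where no match exists:

```sql
SELECT o.id, c.name
FROM orders o
LEFT OUTER JOIN customers c ON o.customer_id = c.id;
```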

Q51. Is it possible to change the default location of managed tables in Hive, and if so, how?

Ans. Yes, by using the LOCATION keyword while creating the managed table, we can change its default location. The only condition is that the user has to specify the desired storage path of the managed table as the value of the LOCATION keyword.

Q52. How does data transfer happen from HDFS to Hive?

Ans. If the data is already present in HDFS, the user does not need to use LOAD DATA, which moves the files to /user/hive/warehouse/. Instead, using the EXTERNAL keyword, which creates only the table definition in the Hive metastore, the user just has to define the table over the existing location:
CREATE EXTERNAL TABLE table_name (
  id INT,
  myfields STRING
)
LOCATION '/my/location/in/hdfs';

Q53. Wherever (in whatever directory) I run a Hive query, it creates a new metastore_db. What is the reason for this?

Ans. When Hive runs in embedded mode, it creates a local metastore in the current working directory, after first checking whether a metastore already exists there. This behavior comes from the property javax.jdo.option.ConnectionURL in the configuration file hive-site.xml, whose default value is "jdbc:derby:;databaseName=metastore_db;create=true". To change the behavior, set this location to an absolute path; the metastore will then be used from that single location.

Q54. Is it possible for multiple users to use the same metastore, in the case of the embedded Hive?

Ans. No, we cannot use the metastore in sharing mode with the embedded Derby database. To share a metastore, use a standalone "real" database such as MySQL or PostgreSQL.

Q55. Usage of Hive.

Ans. Here, we will look at the following Hive usages.
– We use Hive for schema flexibility as well as schema evolution.
– Moreover, it is possible to partition and bucket tables in Apache Hive.
– Also, we can use JDBC/ODBC drivers, since they are available for Hive.
