Basic information about Hadoop ecosystem tools
If you are new to Big Data, welcome to these amazing tools.
HIVE:-
Hive is built on top of Apache Hadoop, an open-source framework for storing and processing large datasets efficiently.
As a result, Hive is closely integrated with Hadoop and is designed to work quickly on petabytes of data.
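Hive exposes a SQL-like language called HiveQL for this. As a rough sketch (the table name, columns, and path below are made up for illustration), you lay a table over files that already sit in HDFS and query it like an ordinary database:

    -- define a table over existing HDFS files (hypothetical schema and path)
    CREATE EXTERNAL TABLE page_views (user_id STRING, url STRING, ts BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/page_views';

    -- an ordinary SQL-style query; Hive compiles it into jobs on the cluster
    SELECT url, COUNT(*) AS hits
    FROM page_views
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;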
PIG:-
Pig is an alternative to writing MapReduce jobs directly in Java: you describe your analysis in a scripting language called Pig Latin, and Pig automatically generates the MapReduce functions.
Pig is used to analyze large amounts of data and to perform all kinds of data-manipulation operations in Hadoop.
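For a feel of the language, here is a small Pig Latin sketch (the path and field names are hypothetical) that filters and aggregates a dataset without a single line of Java:

    -- load tab-separated records from HDFS (hypothetical path and schema)
    views  = LOAD '/data/page_views' AS (user_id:chararray, url:chararray, ts:long);
    -- keep rows for one site, group them by url, and count per group
    hits   = FILTER views BY url MATCHES '.*example[.]com.*';
    by_url = GROUP hits BY url;
    counts = FOREACH by_url GENERATE group AS url, COUNT(hits) AS n;
    DUMP counts;  -- Pig turns this script into MapReduce jobs behind the scenes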
SQOOP (SQL-to-Hadoop):-
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and external datastores such as relational databases and enterprise data warehouses.
Sqoop imports data from those external stores into the Hadoop Distributed File System (HDFS) or into related ecosystem tools such as Hive and HBase, and can export it back out again.
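A typical import is a single command; everything below (host, database, credentials, table) is a placeholder:

    # pull one relational table into HDFS as delimited files
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/shop \
      --username etl_user -P \
      --table orders \
      --target-dir /data/orders \
      --num-mappers 4

Adding --hive-import to the same command would land the data directly in a Hive table instead.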
HBASE:-
Hadoop and HBase are both used to store massive amounts of data, but they differ in how: the Hadoop Distributed File System (HDFS)
stores data as files spread in a distributed manner across the different nodes of the cluster,
whereas HBase is a database that stores data as rows and columns in tables and gives you random, real-time access to individual records.
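The HBase shell shows that row-and-column model directly; the table and column names below are made up:

    # create a table with one column family, then write and read a cell
    create 'users', 'info'
    put 'users', 'row-42', 'info:name', 'Ada'
    get 'users', 'row-42'           # random read of a single row by key
    scan 'users', {LIMIT => 10}     # iterate over the first 10 rows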
OOZIE:-
Oozie is a workflow scheduler that manages the different jobs running in a Hadoop cluster.
Its main purpose is to chain the different types of jobs being processed in the Hadoop system (MapReduce, Hive, Pig, shell scripts, and so on) into a single workflow with defined dependencies.
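Workflows are described in XML. A minimal sketch with one shell action (the workflow name and script are placeholders) looks like this:

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
        <start to="run-script"/>
        <action name="run-script">
            <shell xmlns="uri:oozie:shell-action:0.2">
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <exec>cleanup.sh</exec>  <!-- placeholder script -->
            </shell>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
        </kill>
        <end name="end"/>
    </workflow-app>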
APACHE SPARK:-
Hadoop and Spark, both developed by the Apache Software Foundation, are widely used open-source frameworks for big data architectures.
Each framework contains an extensive ecosystem of open-source technologies that prepare, process, manage and analyze big data sets.
Spark is widely hailed as a faster successor to Hadoop's MapReduce engine because it keeps intermediate data in memory, although in practice the two often run together, with Spark executing on YARN and reading data from HDFS.
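As a quick taste, here is a short PySpark sketch (the HDFS path is a placeholder) that counts words, with Spark keeping intermediate data in memory between steps:

    from pyspark.sql import SparkSession

    # start a session; on a Hadoop cluster this typically runs on YARN
    spark = SparkSession.builder.appName("word-count").getOrCreate()

    # read lines from HDFS (placeholder path), split into words, count each word
    lines  = spark.read.text("hdfs:///data/books")   # DataFrame with a 'value' column
    words  = lines.rdd.flatMap(lambda row: row.value.split())
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

    for word, n in counts.take(10):  # inspect a small sample on the driver
        print(word, n)

    spark.stop()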