The relationships and differences among several technologies in the Hadoop ecosystem: Hive, Pig and HBase


2026-02-13 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

Pig

Pig is a lightweight scripting language for working with Hadoop. It was first introduced by Yahoo but is now in decline: Yahoo gradually withdrew from maintaining Pig and contributed it to the open source community, where it is maintained by enthusiasts. Some companies still use it, but I think Hive is a better choice than Pig. :)

Pig is a data flow language used to process very large data sets quickly and easily.

Pig consists of two parts: the Pig interface and Pig Latin.

Pig can easily process data in HDFS and HBase. Like Hive, Pig handles the work it is designed for very efficiently, and running Pig queries directly can save a lot of labor and time. You can use Pig when you want to apply some transformations to your data and do not want to write MapReduce jobs.
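To make the "data flow" idea concrete, here is a minimal sketch in plain Python rather than in Pig itself. The sample records, the field names and the 100-byte threshold are all made up for illustration, and the Pig Latin in the comments is only a rough equivalent of each step, not a tested script.

```python
# Hypothetical sample records: (user, bytes_transferred).
records = [
    ("alice", 2048),
    ("bob", 40),
    ("carol", 512),
    ("dave", 1024),
]

# Roughly corresponding Pig Latin, shown for comparison only:
#   logs   = LOAD 'logs.tsv' AS (user:chararray, bytes:int);
#   big    = FILTER logs BY bytes > 100;
#   scaled = FOREACH big GENERATE user, bytes / 1024.0 AS kb;
#   sorted = ORDER scaled BY kb DESC;


def filter_big(rows, threshold=100):
    """FILTER step: keep rows whose byte count exceeds the threshold."""
    return [(user, size) for user, size in rows if size > threshold]


def to_kilobytes(rows):
    """FOREACH ... GENERATE step: derive a new field from each row."""
    return [(user, size / 1024.0) for user, size in rows]


def order_desc(rows):
    """ORDER step: sort by the derived field, largest first."""
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Each step consumes the previous step's output, like a Pig script.
result = order_desc(to_kilobytes(filter_big(records)))
print(result)
```

Each named step feeds the next one, which is the shape a Pig script has; on a real cluster Pig would turn such a script into jobs instead of running it in one process.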

Hive

People who do not want to develop MapReduce programs in a programming language, such as DBAs and others who are familiar with SQL, can use Hive for offline data processing and analysis.

Note that Hive is currently suited to offline data processing. It is not suitable for real-time online queries or operations in a live production environment, for one reason: it is slow.

Hive originated at Facebook and plays the role of a data warehouse in Hadoop. Built on top of a Hadoop cluster, it provides a SQL-like interface to the data stored on that cluster. You can use HiveQL for select, join, and so on.

If you have data warehouse requirements and you are good at writing SQL and do not want to write MapReduce jobs, you can use Hive instead.
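For readers who have not written this kind of query, the sketch below shows the shape of a select with a join. It is an assumption-laden stand-in: it uses Python's built-in sqlite3 module, not Hive, purely because a simple select/join reads the same way in both. The dept table and all rows are invented; the teacher table name is borrowed from a query mentioned later in this article.

```python
import sqlite3

# SQLite stands in for Hive here; no Hadoop cluster is involved.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE teacher (id INTEGER, name TEXT, dept_id INTEGER);
    CREATE TABLE dept (id INTEGER, dept_name TEXT);
    INSERT INTO teacher VALUES (1, 'Li', 10), (2, 'Wang', 20), (3, 'Zhao', 10);
    INSERT INTO dept VALUES (10, 'Math'), (20, 'Physics');
""")

# A select with a join and a filter, the kind of statement the text describes.
rows = conn.execute("""
    SELECT t.name, d.dept_name
    FROM teacher t
    JOIN dept d ON t.dept_id = d.id
    WHERE d.dept_name = 'Math'
    ORDER BY t.name
""").fetchall()
conn.close()

print(rows)
```

The difference in Hive is what happens underneath: the statement is run as batch work over files on the cluster, which is why it suits offline analysis.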

HBase

HBase runs on top of HDFS as a column-oriented database. HDFS lacks real-time read and write operations, which is why HBase exists. HBase is modeled on Google's BigTable and stores data as key-value pairs. The goal of the project is to quickly locate and access the required data among billions of rows.

HBase is a NoSQL database and, like other databases, it provides real-time reads and writes. Hadoop cannot meet real-time needs, but HBase can. If you need to access some data in real time, store it in HBase.

You can use Hadoop as a static data warehouse and HBase as the data store for data that changes as a result of your operations.
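The key-value idea can be sketched with a small in-memory class. This is a toy model, not the HBase client API: the TinyTable class, its put and get methods, and the sample row keys are hypothetical, and real HBase additionally keeps timestamped versions of each cell, which is left out here.

```python
from collections import defaultdict


class TinyTable:
    """Hypothetical in-memory stand-in for an HBase-style table.

    Each cell is addressed by a row key plus a "family:qualifier" column.
    """

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write one cell immediately; a later put overwrites it."""
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Read one cell, or the whole row when no column is given."""
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)


users = TinyTable()
users.put("user#1001", "info:name", "Li")
users.put("user#1001", "info:city", "Beijing")
users.put("user#1001", "info:city", "Shanghai")  # the data changes in place

print(users.get("user#1001", "info:city"))
print(users.get("user#1001"))
```

The point is the access pattern: a single row is read or updated directly by its key, with no batch job in between.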

Pig VS Hive

Hive is better suited to data warehouse tasks; it is mainly used for static structures and work that requires frequent analysis. Hive's similarity to SQL makes it an ideal meeting point between Hadoop and other BI tools.

Pig gives developers more flexibility when working with large data sets and allows them to write concise scripts that transform data flows for embedding into larger applications.

Pig is relatively lightweight compared to Hive, and its main advantage is that it can significantly reduce the amount of code compared to using Hadoop Java APIs directly. Because of this, Pig still attracts a large number of software developers.

Both Hive and Pig can be used in combination with HBase. They provide high-level language support for HBase, which makes it very easy to compute statistics over data stored in HBase.

Hive VS HBase

Hive is a batch processing system built on Hadoop that reduces the work of writing MapReduce jobs, while HBase is a project that makes up for Hadoop's shortcomings in real-time operations.

Imagine you are operating an RDBMS. If the operation is a full table scan, use Hive+Hadoop; if it is an indexed access, use HBase+Hadoop.

Hive queries run as MapReduce jobs, which can take anywhere from 5 minutes to several hours. For lookups, HBase is very efficient, certainly much more efficient than Hive.
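The scan-versus-index distinction can be illustrated on a single machine. The sketch below is only an analogy under stated assumptions: the row keys and values are generated, a Python list plays the role of the table, and binary search over sorted keys stands in for keyed access.

```python
import bisect

# Hypothetical table: 100,000 rows with zero-padded, sorted row keys.
row_keys = [f"row{n:06d}" for n in range(100_000)]
values = [n * 2 for n in range(100_000)]


def full_scan(predicate):
    """Scan-style access: every row is examined, whatever the answer."""
    examined = 0
    matches = []
    for key, value in zip(row_keys, values):
        examined += 1
        if predicate(value):
            matches.append(key)
    return matches, examined


def keyed_lookup(key):
    """Index-style access: binary search over the sorted row keys."""
    i = bisect.bisect_left(row_keys, key)
    if i < len(row_keys) and row_keys[i] == key:
        return values[i]
    return None


matches, examined = full_scan(lambda value: value >= 199_990)
print(len(matches), "matches after examining", examined, "rows")
print(keyed_lookup("row000042"))
```

The scan touches all 100,000 rows to answer an analytical question, while the lookup only needs on the order of log2(100,000), about 17, key comparisons to fetch one row.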

Introduction:

What is Hive?

1. Hive is a data warehouse tool based on Hadoop.

2. It can map structured data files to database tables and provides SQL-like query functionality.

3. It can convert SQL statements into MapReduce tasks and run them.

4. It can be used for data extraction, transformation and loading (ETL).

5. Hive is a SQL parsing engine: it converts SQL statements into MapReduce jobs and then runs them on Hadoop.

A Hive table is really just a directory (folder) in HDFS.

The data in a Hive table is the set of files in that HDFS directory. Folders are separated by table name. For a partitioned table, each partition value is a subfolder, and the data can be used directly in a MapReduce job.

6. Pros and cons of Hive:

Advantage: it provides SQL-like statements for quickly implementing simple MapReduce statistics, with no need to develop a dedicated MapReduce application.

Disadvantage: real-time queries are not supported.

7. Hive data is divided into the actual stored data and the metadata.

The actual data is stored in HDFS, and the metadata is stored in MySQL.

Metastore: the metadata storage database

Hive stores its metadata in a database such as MySQL or Derby.

The metadata in Hive includes the name of the table, the columns and partitions of the table and its properties, the attributes of the table (whether it is an external table, etc.), the directory where the data of the table is located, and so on.
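The table-as-directory layout and the metadata fields listed above can be sketched as follows. This is an illustration under assumptions: /user/hive/warehouse is Hive's default warehouse location but clusters may configure another, the helper functions and the access_logs table are hypothetical, only the default database is modeled, and the dictionary merely mirrors the kinds of fields a metastore records rather than its real schema.

```python
import posixpath

# Hive's default warehouse location; real clusters may configure another.
WAREHOUSE_ROOT = "/user/hive/warehouse"


def table_dir(table, root=WAREHOUSE_ROOT):
    """Directory that holds a table's data files (default database only)."""
    return posixpath.join(root, table)


def partition_dir(table, partition, root=WAREHOUSE_ROOT):
    """Subdirectory for one partition, named column=value at each level."""
    parts = [f"{column}={value}" for column, value in partition]
    return posixpath.join(table_dir(table, root), *parts)


# Hypothetical metastore record, mirroring the fields described in the text.
access_logs_meta = {
    "name": "access_logs",
    "columns": [("user", "string"), ("bytes", "int")],
    "partition_columns": [("dt", "string")],
    "external": False,
    "location": table_dir("access_logs"),
}

print(access_logs_meta["location"])
print(partition_dir("access_logs", [("dt", "2024-01-01")]))
```

The metadata says where the data lives and how to interpret it; the files themselves stay in HDFS under the table and partition directories.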

Second, the architecture of Hive:

User interfaces, including the CLI (shell), JDBC/ODBC, and the WebUI (via a browser)

Metadata storage, usually in a relational database such as MySQL or Derby

The interpreter, compiler, optimizer and executor take an HQL query through parsing, compilation, optimization and query plan generation. The generated plan is stored in HDFS and then invoked and executed by MapReduce.

Hadoop: HDFS is used for storage and MapReduce for computation (a query such as select * from teacher does not generate a MapReduce task; it is only a full table scan)
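To show what "MapReduce for computation" means for a query, here is a single-process sketch of how an aggregate such as select dept, count(*) from teacher group by dept breaks down into map, shuffle and reduce steps. The rows, the dept column and the function names are assumptions made for illustration, and the plans Hive really generates are considerably more elaborate.

```python
from collections import defaultdict

# Hypothetical rows of the teacher table.
teacher_rows = [
    {"name": "Li", "dept": "Math"},
    {"name": "Wang", "dept": "Physics"},
    {"name": "Zhao", "dept": "Math"},
    {"name": "Chen", "dept": "Math"},
]


def map_phase(rows):
    """Map: emit one (group key, 1) pair per input row."""
    for row in rows:
        yield row["dept"], 1


def shuffle(pairs):
    """Shuffle: bring all values for the same key together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(ones) for key, ones in grouped.items()}


counts = reduce_phase(shuffle(map_phase(teacher_rows)))
print(counts)
```

A plain select of every row needs none of these steps, since the rows can simply be read from the table's files, which matches the remark about full table scans.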

