What is the use of the hive function

2025-07-12 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains in detail what Hive is used for. I found the material practical, so I am sharing it here for reference, and I hope you get something out of it.

First of all, we need to know exactly what Hive does. The following two paragraphs describe its characteristics well:

1. Hive is a data warehouse tool based on Hadoop. It maps structured data files to database tables, provides SQL-like query capability, and translates SQL statements into MapReduce jobs to run. Its advantage is a low learning cost: simple MapReduce statistics can be produced quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in a data warehouse.

2. Hive is a data warehouse infrastructure built on Hadoop. It provides a set of tools for data extraction, transformation, and loading (ETL), and a mechanism for storing, querying, and analyzing large-scale data held in Hadoop. Hive defines a simple SQL-like query language called HQL, which lets users who are familiar with SQL query the data. The language also lets developers who are familiar with MapReduce plug in custom mappers and reducers to handle complex analysis that the built-in mappers and reducers cannot do.
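
To make the idea of turning SQL into MapReduce more concrete, here is a minimal Python sketch of how a query such as select value, count(*) from test group by value could be expressed as a map phase, a shuffle, and a reduce phase. It is a toy, single-process model: the function names (map_phase, shuffle, reduce_phase, group_by_count) are assumptions made for this illustration, and it does not show how Hive's own compiler generates jobs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(rows: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Mapper: emit (group key, 1) for every input row."""
    for row in rows:
        yield row.strip(), 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: collect all values that share the same key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reducer: aggregate each key's values, here as COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def group_by_count(rows: Iterable[str]) -> Dict[str, int]:
    """Toy equivalent of: select value, count(*) from test group by value."""
    return reduce_phase(shuffle(map_phase(rows)))


result = group_by_count(["a", "b", "a", "c", "a"])
print(result)  # {'a': 3, 'b': 1, 'c': 1}
```

On a real cluster the mappers and reducers run on many machines and the shuffle moves data across the network, which is the work Hive hides behind a single SQL-like statement.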

To understand Hive, you must first understand Hadoop and MapReduce. Readers who are not familiar with them can look them up on Baidu.

Using Hive's command line interface feels like operating a relational database, but Hive and a relational database are still very different. The differences are as follows:

1. Hive and relational databases store their files in different systems. Hive uses Hadoop's HDFS (the Hadoop Distributed File System), while a relational database uses the server's local file system.

2. Hive uses MapReduce as its computing model, while a relational database uses a computing model of its own design.

3. Relational databases are designed for real-time query workloads, while Hive is designed for mining massive data sets, and its real-time performance is very poor. This difference in latency means that Hive and relational databases suit very different application scenarios.

4. Hive can easily scale its storage and computing capacity, a property it inherits from Hadoop, while relational databases are much weaker in this respect.

The above compares Hive and relational databases from a macro point of view. There are many more similarities and differences between the two, which I will describe one by one later in the article.

Next, the technical architecture of Hive. Take a look at the following architecture diagram:

As the figure shows, Hadoop and MapReduce are the foundation of the Hive architecture. The architecture includes the following components: CLI (command line interface), JDBC/ODBC, Thrift Server, Web GUI, metastore, and Driver (Compiler, Optimizer, and Executor). These can be divided into two categories: server components and client components.

First, let's talk about server-side components:

Driver component: this component includes the Compiler, Optimizer, and Executor. It parses, compiles, and optimizes the HiveQL (SQL-like) statements we write, generates an execution plan, and then calls the underlying MapReduce computing framework.

Metastore component: the metadata service component. It stores Hive's metadata in a relational database, such as Derby or MySQL. Metadata is very important to Hive, so Hive supports splitting out the metastore service and installing it on a remote server cluster, which decouples the Hive service from the metastore service and makes Hive more robust. I explain this in detail in the metastore section below.

Thrift service: Thrift is a software framework developed by Facebook for building scalable, cross-language services. Hive integrates this service so that programs written in different languages can call Hive's interfaces.

Client components:

CLI: the command line interface.

Thrift client: the Thrift client is not drawn in the architecture diagram above, but many of Hive's client interfaces are built on top of it, including the JDBC and ODBC interfaces.

Web GUI: the Hive client also provides a way to access Hive's services through web pages. This interface corresponds to Hive's hwi component (Hive Web Interface), and the hwi service must be started before it can be used.

Next, I will focus on the metastore component:

The metastore component is where Hive's metadata is stored centrally. It consists of two parts: the metastore service and the backing data store. The backing store is a relational database, such as Derby (Hive's default embedded, disk-based database) or MySQL. The metastore service sits on top of the backing store and interacts with the Hive service. By default, the metastore service and the Hive service are installed together and run in the same process. We can also split the metastore service out of the Hive service: the metastore is installed independently on a cluster, and Hive calls it remotely. This lets us put the metadata layer behind a firewall; clients access the Hive service, which in turn connects to the metadata layer, giving better manageability and security. With a remote metastore, the metastore service and the Hive service run in different processes, which also improves the stability of Hive and the efficiency of the Hive service.
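
As a rough illustration of the remote setup, the sketch below renders a hive-site.xml fragment that points the Hive service at a separately running metastore. hive.metastore.uris is the Hive property used for this purpose; the host name metastore-host and the port 9083 are placeholder assumptions, and the helper render_hive_site is hypothetical, written only for this example.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def render_hive_site(properties: Dict[str, str]) -> str:
    """Render name/value pairs in the <configuration> layout used by hive-site.xml."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumed values: replace the host (and the port, if you changed it) with your own.
remote_metastore = {
    "hive.metastore.uris": "thrift://metastore-host:9083",
}

print(render_hive_site(remote_metastore))
```

The printed fragment belongs in the hive-site.xml of the machine running the Hive service. As far as I know, leaving hive.metastore.uris unset keeps the default embedded arrangement described above, but check the documentation for your Hive version.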

The execution process of Hive is shown in the following figure:

The picture is very clear, so I won't repeat it here.

Let me show you a simple example of how hive works.

First of all, let's create a normal text file with only one line of data and only one string stored in that line. The command is as follows:

echo 'sharpxiajun' > /home/hadoop/test.txt

Then we build a hive table:

hive -e "create table test (value string);"

Next, load the data:

hive -e "load data local inpath '/home/hadoop/test.txt' overwrite into table test;"

Finally, we query the table:

hive -e 'select * from test;'
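
For readers who prefer to script the walkthrough, here is a hedged Python sketch that wraps the same steps. It assumes the hive executable is on PATH and that /home/hadoop is writable; the function names build_steps and run_example are made up for this example. Nothing runs until run_example() is called.

```python
import shutil
import subprocess
from typing import List

DATA_FILE = "/home/hadoop/test.txt"  # path used in the walkthrough above


def build_steps(data_file: str = DATA_FILE) -> List[List[str]]:
    """Return the three hive invocations from the walkthrough as argv lists."""
    return [
        ["hive", "-e", "create table test (value string);"],
        ["hive", "-e",
         f"load data local inpath '{data_file}' overwrite into table test;"],
        ["hive", "-e", "select * from test;"],
    ]


def run_example(data_file: str = DATA_FILE) -> None:
    """Write the one-line input file, then run each step in order."""
    if shutil.which("hive") is None:
        raise RuntimeError("hive executable not found on PATH")
    with open(data_file, "w") as handle:
        handle.write("sharpxiajun\n")
    for argv in build_steps(data_file):
        subprocess.run(argv, check=True)  # stop at the first failing step
```

Passing each statement as a separate argument list avoids shell quoting problems around the single-quoted path.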

As you can see, Hive is simple, easy to get started with, and very similar to SQL in use. Next, I will look more deeply at the differences between Hive and a relational database. Some readers may not follow this part yet, but it is worth raising early. I will go further into Hive in later articles, and readers who come back to this part then will find that many questions have become much clearer. The points are as follows:

1. In a relational database, the schema of a table is enforced when data is loaded (the schema here includes the file format in which the database stores the data). If the data being loaded does not conform to the schema, the database refuses to load it. This is called "schema on write": the data is checked against the schema at load time. Hive, by contrast, does not check the data or modify the data file when loading; the format is checked only when a query runs. This is called "schema on read". In practice, schema on write lets the database index columns and compress data during loading, so loading is slow but queries on the loaded data are fast. When the data is unstructured and its storage layout is not known in advance, a relational database becomes much more troublesome to work with, and this is where Hive plays to its strengths.

2. An important feature of a relational database is that it can update and delete a single row or a set of rows. Hive, in the versions this article describes, does not support operations on specific rows; it only supports overwriting the existing data and appending data. It likewise does not support transactions or indexes. Updates, transactions, and indexes are relational database features that Hive was not designed around, because Hive targets massive data sets where a full scan of the data is the normal case and operations on a few specific rows are very inefficient. (Later releases, starting with Hive 0.14, added ACID transactions with row-level update and delete for transactional tables.) To perform an update, Hive transforms the data of the original table with a query and stores the result in a new table, which is very different from an update in a traditional database.

3. Hive can also contribute to real-time queries on Hadoop through integration with HBase. HBase supports fast lookups but does not support SQL-like statements, so Hive can act as a SQL parsing shell for HBase, letting you operate on HBase tables with SQL-like statements.
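
Points 1 and 2 above can be sketched in a few lines of Python. This is a toy in-memory model, not Hive's implementation: the two-column schema (id int, amount int), the \N text marker for NULL, and all function names are assumptions chosen for the illustration. Loading stores the raw lines untouched, the schema is applied only at query time, and an "update" is done by reading every row and writing out a new table.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[int]]
COLUMNS = ("id", "amount")  # assumed schema: id int, amount int


def load(raw_lines: List[str]) -> List[str]:
    """Schema on read: loading only stores the lines, with no validation."""
    return list(raw_lines)


def query(stored: List[str]) -> List[Row]:
    """Apply the schema while reading; fields that do not fit become None (NULL)."""
    rows: List[Row] = []
    for line in stored:
        fields = line.split(",")
        row: Row = {}
        for index, column in enumerate(COLUMNS):
            try:
                row[column] = int(fields[index])
            except (IndexError, ValueError):
                row[column] = None
        rows.append(row)
    return rows


def overwrite_with(stored: List[str], transform: Callable[[Row], Row]) -> List[str]:
    """'Update' by rewriting: read every row, transform it, emit a new table."""
    def render(value: Optional[int]) -> str:
        return "\\N" if value is None else str(value)

    new_table = []
    for row in query(stored):
        changed = transform(row)
        new_table.append(",".join(render(changed[c]) for c in COLUMNS))
    return new_table


def double_amount(row: Row) -> Row:
    """Example transform: double the amount, leaving NULL amounts as NULL."""
    amount = row["amount"]
    return {"id": row["id"], "amount": None if amount is None else amount * 2}


table = load(["1,10", "2,oops", "3"])  # malformed lines are accepted at load time
print(query(table))
# [{'id': 1, 'amount': 10}, {'id': 2, 'amount': None}, {'id': 3, 'amount': None}]
new_table = overwrite_with(table, double_amount)  # the original table is untouched
```

A schema-on-write system would have rejected the second and third lines at load time; here they are only noticed, and turned into NULLs, when the data is read.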

This concludes the article on what Hive is used for. I hope the content above is of some help and lets you learn something new. If you found the article useful, please share it so more people can see it.