Logically, Hive consists of three major parts:
Hive Clients
Hive Services
Hive Storage and Computing
Users can operate Hive through three main interfaces: the CLI, the Client, and the WUI.
The most common is the CLI; when the CLI starts, it launches a local copy of Hive at the same time.
The Client is Hive's client, through which the user connects to Hive Server. When starting in Client mode, you must specify the node where Hive Server runs and start Hive Server on that node. Clients come in three types: the Thrift Client, the JDBC Client, and the ODBC Client; a hedged JDBC sketch follows below.
The Web Interface (WUI) accesses Hive through a browser.
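As a concrete illustration of the JDBC route, here is a minimal sketch of connecting to HiveServer2 and running a statement. The host name and credentials are assumptions for this sketch; 10000 is HiveServer2's default Thrift port.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2's JDBC driver, provided by the hive-jdbc artifact.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // hiveserver-host is a placeholder; 10000 is the default HiveServer2 port.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```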
Hive stores its metadata in a database such as MySQL or Derby. The metadata includes the names of tables, their columns and partitions with their properties, table attributes (whether a table is external, and so on), and the directory where each table's data lives. The interpreter, compiler, and optimizer carry a HiveQL statement through lexical analysis, syntax analysis, compilation, optimization, and query-plan generation. The generated query plan is stored in HDFS and subsequently executed by MapReduce calls. Hive's data is stored in HDFS, and most queries and computations are carried out by MapReduce (note that simple queries such as select * from tbl, which only fetch data, do not generate a MapReduce task).
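To make that last point concrete, the fragment below (reusing the connection and imports from the earlier JDBC sketch; tbl is a hypothetical table) contrasts a query Hive can answer with a plain fetch task against one that is compiled into a MapReduce job:

```java
// conn is a java.sql.Connection to HiveServer2, as in the earlier sketch.
try (Statement stmt = conn.createStatement()) {
    // A bare projection is served by a fetch task that reads the table's
    // files from HDFS directly; no MapReduce job is launched.
    try (ResultSet rs = stmt.executeQuery("SELECT * FROM tbl")) {
        while (rs.next()) { /* consume rows */ }
    }
    // An aggregation, by contrast, is compiled into a MapReduce job.
    try (ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM tbl")) {
        if (rs.next()) {
            System.out.println("row count: " + rs.getLong(1));
        }
    }
}
```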
The Driver handles all requests from the application, passing them on to the metastore and the file system for subsequent operations.

Hive components

Driver
The Driver implements the session handler and, on top of the JDBC/ODBC interface, provides the APIs for executing statements and fetching information.
Compiler
This component parses queries, performs semantic analysis on the different query expressions, and eventually generates an execution plan based on the table and partition metadata fetched from the metastore.
Execution Engine
This component executes the plan created by the compiler. As a data structure, the plan is a DAG of stages; the execution engine manages the dependencies between the stages and runs each stage on the appropriate system component.
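One way to see both components at work is EXPLAIN: it prints the plan the compiler produced, including the stage DAG the execution engine will walk. A minimal sketch, reusing the connection from the earlier JDBC example and a hypothetical employees table:

```java
// conn is a java.sql.Connection to HiveServer2, as in the earlier sketch.
try (Statement stmt = conn.createStatement();
     ResultSet rs = stmt.executeQuery(
         "EXPLAIN SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
    // The output begins with STAGE DEPENDENCIES (the DAG of stages) and is
    // followed by STAGE PLANS, the operator tree for each stage.
    while (rs.next()) {
        System.out.println(rs.getString(1));
    }
}
```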
Metastore
Hive's metastore component is the central store of Hive metadata. It keeps structured information such as the columns and column types of each table, as well as partition information for the data warehouse (including column and column-type details, the serialization and deserialization information needed to read and write the data, and the HDFS locations where the data files are stored).
The Metastore component consists of two parts: the metastore service and the backing metadata database.
The metadata database is a relational database, such as Hive's default embedded Derby database or a MySQL database. The metastore service is the service component that sits on top of this backing store (the relational database) and interacts with the Hive services.
By default, the metastore service and the Hive service are installed together and run in the same process. The metastore service can also be split out and deployed independently in the cluster, with Hive calling it remotely. That way the metadata layer can be placed behind a firewall: clients reach the Hive service, which in turn connects to the metadata layer, giving better manageability and security.
With a remote metastore service, the metastore and the Hive service run in different processes, which also improves Hive's stability and the efficiency of the Hive service. A hedged sketch of connecting to a remote metastore follows.
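Remote mode is configured by pointing hive.metastore.uris at the standalone metastore service; 9083 is its default Thrift port. The sketch below, with an assumed host name, talks to such a service directly through the metastore client API:

```java
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;

public class RemoteMetastoreExample {
    public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();
        // metastore-host is a placeholder; 9083 is the default metastore port.
        conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore-host:9083");
        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        try {
            // List the databases this metastore knows about.
            for (String db : client.getAllDatabases()) {
                System.out.println(db);
            }
        } finally {
            client.close();
        }
    }
}
```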
Hive execution process
The general steps of the process are:
1. The user submits a query or other task to the Driver.
2. The Driver creates a session handler for the operation and sends the query to the Compiler to generate an execution plan.
3. The Compiler fetches the metadata the task needs from the MetaStore; this metadata is used for type checking and for pruning the abstract syntax tree in later stages.
4. The Compiler compiles the task: HiveQL is first converted into an abstract syntax tree, the tree into a query block, and the query block into a logical query plan; the logical plan is rewritten and then transformed into a physical plan (MapReduce), and finally the best strategy is chosen.
5. The final plan is submitted to the Driver.
6. The Driver hands the plan to the Execution Engine, which submits it, together with the acquired metadata, to the JobTracker or ResourceManager; the tasks read their data directly from HDFS.
7. The execution result is fetched and returned.

Creating a table
Hive parses the statement submitted by the user and decomposes it into Hive objects such as tables, fields, and partitions.
Based on the parsed information, the corresponding table, field, and partition objects are constructed. The next ID for each object is obtained from SEQUENCE_TABLE, and the object information (name, type, and so on) is written into the metastore tables through the DAO layer; on success, the value in SEQUENCE_TABLE is advanced to the latest ID + 5.
In fact, common RDBMSs are organized the same way: the IDs appear in the system tables just as they do in the Hive metadata, and the stored data can easily be read back through this metadata.
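For illustration, the sketch below reads those system tables directly from a MySQL-backed metastore. TBLS and SEQUENCE_TABLE are part of Hive's metastore schema (column names can vary across Hive versions); the host, database name, and credentials are assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MetastoreDbPeek {
    public static void main(String[] args) throws Exception {
        // hive_metastore, hive, and secret are placeholders for this sketch.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://mysql-host:3306/hive_metastore", "hive", "secret");
             Statement stmt = conn.createStatement()) {
            // Each Hive table is one row in TBLS, keyed by its numeric TBL_ID.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT TBL_ID, TBL_NAME, TBL_TYPE FROM TBLS")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1) + "  "
                            + rs.getString(2) + "  " + rs.getString(3));
                }
            }
            // SEQUENCE_TABLE records the next ID to hand out per object class.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT SEQUENCE_NAME, NEXT_VAL FROM SEQUENCE_TABLE")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
                }
            }
        }
    }
}
```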
Optimizer
The optimizer is a component that is continually evolving, and most plan transformations pass through it. Its main rewrites are listed below, with a sketch of the multi-way join rewrite after the list.
Merges multiple joins into a single multi-way join, and re-divides the work of join, group-by, and custom MapReduce operations.
Prunes columns that are not needed.
Pushes predicates down into table scan operations.
For partitioned tables, prunes partitions that are not needed.
In sampling queries, prunes buckets that are not needed.
The optimizer also adds local aggregation operations to handle large group aggregations, and adds extra re-partitioning operations to handle skewed group aggregations.
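As a sketch of the first rewrite: when several joins share the same join key, the compiled plan contains a single multi-way join stage rather than a chain of two-way joins. The tables a, b, c and key k below are hypothetical, and conn is the connection from the earlier JDBC sketch:

```java
// Both joins use the same key a.k, so the optimizer merges them into one
// multi-way join; the printed plan shows a single join stage over all three
// tables instead of two chained join stages.
String query = "EXPLAIN "
        + "SELECT a.k, b.v1, c.v2 "
        + "FROM a JOIN b ON a.k = b.k "
        + "JOIN c ON a.k = c.k";
try (Statement stmt = conn.createStatement();
     ResultSet rs = stmt.executeQuery(query)) {
    while (rs.next()) {
        System.out.println(rs.getString(1));
    }
}
```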