
The execution and compilation of Hive SQL

2025-02-24 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/03 Report --

First, when executing SQL, Hive accesses and processes the data in MapReduce fashion, which mainly involves the following stages:

1. Hive first locates the data files on HDFS for the tables referenced in the SQL statement and splits them so that the required data can be read into memory line by line;

2. The map function turns each row in memory into a key-value pair based on the chosen key, for example the gender field of a user table; the sketch after this list shows what such map output looks like;

3. In practice, multiple machines take part in the map processing. After the maps complete, records with the same key must be sent to the same node for subsequent processing; this redistribution is called the shuffle;

4. If the SQL contains join, count, sum or similar operations, the reduce step runs at this point; for a count, each reducer sums the values that share a key to produce the final result (see the sketch below).
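To make stages 2 to 4 concrete, here is a minimal, hand-written Hadoop MapReduce sketch of the kind of job Hive would produce for a count by gender over a user table. The table layout (tab-separated text with gender in the third column) and all class names are assumptions made for illustration only, not code that Hive actually emits.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GenderCount {

    // Map stage: read each line of the user table and emit a (gender, 1) pair.
    public static class GenderMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text gender = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t"); // assumed: tab-separated rows
            if (fields.length > 2) {
                gender.set(fields[2]);                       // assumed: gender is the 3rd column
                context.write(gender, ONE);
            }
        }
    }

    // Reduce stage: after the shuffle groups pairs by gender, sum the 1s to get the count.
    public static class GenderReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "gender count");
        job.setJarByClass(GenderCount.class);
        job.setMapperClass(GenderMapper.class);
        job.setReducerClass(GenderReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The map output here looks like ("male", 1), ("female", 1), ... line by line; the shuffle routes every pair with the same gender key to the same reducer, which sums the 1s to produce the final counts.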

Second, under the hood, Hive also compiles the SQL above; the compilation process mainly consists of the following six steps:

For ease of understanding, let's use a simple query against the region dimension table for May 30:

select * from dim.dim_region where dt = '2019-05-30'

1. According to the SQL grammar rules defined with Antlr, the SQL is analyzed lexically and syntactically and transformed into an AST (abstract syntax tree).

ABSTRACT SYNTAX TREE:
TOK_QUERY
   TOK_FROM
      TOK_TABREF
         TOK_TABNAME
            dim
            dim_region
   TOK_INSERT
      TOK_DESTINATION
         TOK_DIR
            TOK_TMP_FILE
      TOK_SELECT
         TOK_SELEXPR
            TOK_ALLCOLREF
      TOK_WHERE
         =
            TOK_TABLE_OR_COL
               dt
            '2019-05-30'
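If you want to reproduce such a dump yourself, a rough sketch follows. It assumes the hive-exec jar is on the classpath; ParseDriver and ASTNode are Hive-internal classes whose details vary between versions, so treat this as illustrative rather than a stable API.

import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.ParseDriver;

public class DumpAst {
    public static void main(String[] args) throws Exception {
        String sql = "select * from dim.dim_region where dt = '2019-05-30'";
        // Lexical and syntactic analysis according to the Antlr grammar, producing the AST.
        ParseDriver parseDriver = new ParseDriver();
        ASTNode ast = parseDriver.parse(sql);
        // dump() prints the indented tree of TOK_* nodes, similar to the output shown above.
        System.out.println(ast.dump());
    }
}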

2. Traverse the AST and abstract it into QueryBlocks, the basic unit of composition of a query.

The generated AST is still quite complex and not convenient to translate directly into a MapReduce program, so it is further abstracted and structured into QueryBlocks (QB). A QueryBlock is the most basic building block of SQL and consists of three parts: an input source, a computation, and an output; put simply, a QueryBlock is a subquery. QB generation is a recursive process: the AST is traversed in pre-order, and when different token nodes (think of them as special markers) are encountered, their contents are saved to the corresponding attributes. The main steps are listed below (a simplified sketch follows the list):

TOK_QUERY: create the QB object and recurse into its child nodes

TOK_FROM: save the table-name syntax in the aliasToTabs and related attributes of the QB object

TOK_INSERT: recurse into its child nodes

TOK_DESTINATION: save the output-target syntax in the nameToDest attribute of the QBParseInfo object

TOK_SELECT: save the query-expression syntax in the destToSelExpr, destToAggregationExprs and destToDistinctFuncExprs attributes respectively

TOK_WHERE: save the WHERE-clause syntax in the destToWhereExpr attribute of the QBParseInfo object
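To show what that recursive walk looks like in code, here is a toy Java sketch. It is not Hive's real SemanticAnalyzer, QB or QBParseInfo classes; the names and fields are invented for illustration. The idea is the same: walk the AST recursively, dispatch on the token type, and save each subtree into the matching attribute of a query-block object.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyQbBuilder {

    // Minimal stand-in for an AST node: a token type plus children.
    static class Node {
        final String token;
        final List<Node> children = new ArrayList<>();
        Node(String token, Node... kids) {
            this.token = token;
            children.addAll(List.of(kids));
        }
    }

    // Simplified query block: input source, computation, output.
    static class QueryBlock {
        Map<String, Node> aliasToTabs = new HashMap<>(); // input tables
        List<Node> selectExprs = new ArrayList<>();      // computation: select expressions
        Node whereExpr;                                  // computation: filter
        Node destination;                                // output target
    }

    // Recursive walk: dispatch on the token type and store the subtree in the matching attribute.
    static void visit(Node node, QueryBlock qb) {
        switch (node.token) {
            case "TOK_QUERY":
            case "TOK_INSERT":
                for (Node child : node.children) {
                    visit(child, qb);                    // just recurse into children
                }
                break;
            case "TOK_FROM":
                // toy shortcut: real Hive extracts the alias and table name from the subtree
                qb.aliasToTabs.put("dim.dim_region", node);
                break;
            case "TOK_DESTINATION":
                qb.destination = node;                   // save output target
                break;
            case "TOK_SELECT":
                qb.selectExprs.add(node);                // save select expressions
                break;
            case "TOK_WHERE":
                qb.whereExpr = node;                     // save filter expression
                break;
            default:
                break;                                   // other tokens ignored in this toy model
        }
    }

    public static void main(String[] args) {
        Node ast = new Node("TOK_QUERY",
                new Node("TOK_FROM"),
                new Node("TOK_INSERT",
                        new Node("TOK_DESTINATION"),
                        new Node("TOK_SELECT"),
                        new Node("TOK_WHERE")));
        QueryBlock qb = new QueryBlock();
        visit(ast, qb);
        System.out.println("tables=" + qb.aliasToTabs.keySet()
                + ", hasWhere=" + (qb.whereExpr != null));
    }
}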

3. Traverse QueryBlock, translate to OperatorTree

In the MapReduce tasks that Hive ultimately generates, both the Map phase and the Reduce phase are made up of OperatorTrees. A logical operator performs a single, specific operation within the Map phase or the Reduce phase.

Basic operators include TableScanOperator, SelectOperator, FilterOperator, JoinOperator, GroupByOperator, ReduceSinkOperator

ReduceSinkOperator serializes the Map-side fields into the Reduce Key/Value and the Partition Key. It can appear only in the Map phase, and in the MapReduce program generated by Hive it marks where the Map phase ends.

Data transfer between Operators in the Map and Reduce phases is a streaming process: after an Operator finishes its operation on a row of data, it passes the row to its child Operator for further computation.

Since Join/GroupBy/OrderBy must be completed in the Reduce phase, a ReduceSinkOperator is generated just before the Operator for the corresponding operation, combining and serializing the fields into the Reduce Key/Value and Partition Key.
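To illustrate this streaming model for the example query, here is a toy Java sketch of a linear operator chain, roughly TableScan -> Filter (dt = '2019-05-30') -> Select (*) -> sink. The classes are simplified stand-ins invented for this illustration, not Hive's actual Operator implementations.

import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class ToyOperatorChain {

    // Each operator handles one row, then forwards it to its child (streaming).
    abstract static class Operator {
        Operator child;
        Operator setChild(Operator child) { this.child = child; return child; }
        abstract void process(Map<String, String> row);
        void forward(Map<String, String> row) { if (child != null) child.process(row); }
    }

    static class TableScanOperator extends Operator {
        @Override void process(Map<String, String> row) { forward(row); }  // read a row, pass it on
    }

    static class FilterOperator extends Operator {
        final String column, value;
        FilterOperator(String column, String value) { this.column = column; this.value = value; }
        @Override void process(Map<String, String> row) {
            if (value.equals(row.get(column))) forward(row);               // keep only matching rows
        }
    }

    static class SelectOperator extends Operator {
        @Override void process(Map<String, String> row) { forward(row); }  // select *: project all columns
    }

    static class PrintSinkOperator extends Operator {
        @Override void process(Map<String, String> row) { System.out.println(row); } // stand-in for a file sink
    }

    public static void main(String[] args) {
        Operator scan = new TableScanOperator();
        scan.setChild(new FilterOperator("dt", "2019-05-30"))
            .setChild(new SelectOperator())
            .setChild(new PrintSinkOperator());

        List<Map<String, String>> rows = Arrays.asList(
                Map.of("region", "east", "dt", "2019-05-30"),
                Map.of("region", "west", "dt", "2019-05-29"));
        for (Map<String, String> row : rows) scan.process(row);            // only the 05-30 row is printed
    }
}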

4. The logical optimizer optimizes the OperatorTree

Merging ReduceSinkOperators, for example, reduces the amount of shuffled data; most logical-layer optimizers reduce the number of MapReduce jobs and the amount of shuffle data by transforming the OperatorTree and merging operators.
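As a heavily simplified illustration of operator merging, the toy sketch below collapses adjacent filter operators in a linear plan into one. The FIL/TS/SEL/FS labels and the merging rule are assumptions made for illustration, not one of Hive's actual optimizer rules.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ToyOperatorMerge {

    // Merge adjacent filter operators ("FIL[predicate]") into one operator with a combined
    // predicate: fewer operators to evaluate, the same rows produced.
    static List<String> mergeAdjacentFilters(List<String> ops) {
        List<String> merged = new ArrayList<>();
        for (String op : ops) {
            int last = merged.size() - 1;
            if (op.startsWith("FIL[") && last >= 0 && merged.get(last).startsWith("FIL[")) {
                String prev = merged.remove(last);
                String combined = "FIL[" + prev.substring(4, prev.length() - 1)
                        + " AND " + op.substring(4, op.length() - 1) + "]";
                merged.add(combined);                 // one operator instead of two
            } else {
                merged.add(op);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> plan = Arrays.asList(
                "TS", "FIL[dt = '2019-05-30']", "FIL[region = 'east']", "SEL", "FS");
        // Prints [TS, FIL[dt = '2019-05-30' AND region = 'east'], SEL, FS]
        System.out.println(mergeAdjacentFilters(plan));
    }
}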

5. Traverse OperatorTree and translate to MapReduce task

The transformation of the OperatorTree into a task tree goes through the following stages:

Generate a MoveTask for the output table

Perform a depth-first traversal downward from one of the root nodes of the OperatorTree

ReduceSinkOperator marks the Map/Reduce boundary and the boundaries between multiple jobs (see the sketch after this list)

Traverse the other root nodes; when a JoinOperator is encountered, merge the MapReduce tasks

Generate a task to update the metadata

Cut the Operator links between the Map and Reduce sides
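A very rough sketch of the boundary rule above: walk a linear operator chain and cut it into stages wherever a ReduceSinkOperator appears, which is essentially how the Map side and Reduce side of each job are separated. The code is a toy model with invented names, not Hive's actual task-generation code.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ToyTaskSplitter {

    // Split a linear operator chain into pieces at each ReduceSinkOperator, mirroring how
    // the ReduceSinkOperator marks the Map/Reduce boundary and the boundaries between jobs.
    static List<List<String>> split(List<String> operators) {
        List<List<String>> stages = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String op : operators) {
            current.add(op);
            if (op.equals("ReduceSinkOperator")) {    // boundary: close the map-side stage here
                stages.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) stages.add(current);  // trailing reduce-side stage
        return stages;
    }

    public static void main(String[] args) {
        List<String> chain = Arrays.asList(
                "TableScanOperator", "FilterOperator", "ReduceSinkOperator",
                "GroupByOperator", "SelectOperator", "FileSinkOperator");
        // Prints [[TableScanOperator, FilterOperator, ReduceSinkOperator],
        //         [GroupByOperator, SelectOperator, FileSinkOperator]]
        System.out.println(split(chain));
    }
}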

6. The physical-layer optimizer optimizes the MapReduce tasks and generates the final execution plan
