
The execution process of Hive

2025-11-03 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

1. Overview of the execution process

To view the execution flow of a Hive statement, use: explain select … . from t_table... ;

The operator is the smallest execution unit in Hive. Each operator represents either an HDFS operation or a MapReduce job. Hive runs MapReduce programs through ExecMapper and ExecReducer, and supports two execution modes: local mode and distributed mode.

Operators in Hive:

Responsibilities of the Hive compiler:

Parser: converts HQL statements into an abstract syntax tree (AST).

Semantic Analyzer: converts the abstract syntax tree into query blocks.

Logical Plan Generator: converts the query blocks into a logical query plan.

Logical Optimizer: rewrites the logical query plan to optimize the logical execution plan.

Physical Plan Generator: converts the logical execution plan into a physical plan.

Physical Optimizer: chooses the best join strategy and optimizes the physical execution plan.

2. How Hive works

The general steps of the process are:

1. The user submits a query or other task to the Driver.

2. The compiler receives the user's task in order to produce a plan.

3. Based on the user's task, the compiler queries the MetaStore for Hive's metadata.

4. With the metadata available, the compiler compiles the task: it converts the HiveQL into an abstract syntax tree, the abstract syntax tree into query blocks, and the query blocks into a logical query plan. It then rewrites the logical query plan, translates the logical plan into a physical plan (MapReduce), and finally chooses the best strategy.

5. The compiler submits the final plan to the Driver.

6. The Driver passes the plan to the ExecutionEngine, which obtains the metadata and submits the job to the JobTracker or ResourceManager for execution. The tasks read the files in HDFS directly and perform the corresponding operations.

7. The result of the execution is retrieved.

8. The result is fetched and returned to the user.

3. Analysis of the specific implementation process of Hive

(1) Join (reduce join)

Example: SELECT pv.pageid, u.age FROM page_view pv JOIN user u ON pv.userid = u.userid

Map side: use the column in the JOIN ON condition as the key, and use the required fields of the table (for example, those of page_view) together with a table tag as the value. The map output is then sorted by the key, that is, by the join field.

Shuffle phase: hash the key and send each key/value pair to a reducer chosen by the hash value, so that all records with the same key arrive at the same reducer.

Reduce side: group the records by key, separate them by table tag, and splice the fields from the two tables together.
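The three phases above can be sketched as a small in-memory simulation. This is only an illustration of the idea, not Hive's actual implementation: the sample rows, the tag names "pv" and "u", the reducer count, and every function name are assumptions made for the example.

```python
from collections import defaultdict

# Toy rows, invented for illustration.
page_view = [(1, 111), (2, 111), (1, 222)]  # (pageid, userid)
user = [(111, 25), (222, 32)]               # (userid, age)

NUM_REDUCERS = 2  # assumed reducer count


def map_page_view(row):
    pageid, userid = row
    # key = join column, value = (table tag, required field)
    return userid, ("pv", pageid)


def map_user(row):
    userid, age = row
    return userid, ("u", age)


def partition(key, num_reducers):
    # Stand-in for hash partitioning: equal keys always go to the same reducer.
    return hash(key) % num_reducers


def shuffle(pairs, num_reducers):
    # One bucket per reducer; inside a bucket, values are grouped by key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets


def reduce_join(values):
    # Separate the values by table tag, then emit every pv x u combination.
    pageids = [v for tag, v in values if tag == "pv"]
    ages = [v for tag, v in values if tag == "u"]
    return [(pageid, age) for pageid in pageids for age in ages]


def run_join():
    pairs = [map_page_view(r) for r in page_view] + [map_user(r) for r in user]
    result = []
    for bucket in shuffle(pairs, NUM_REDUCERS):
        for key in sorted(bucket):  # each reducer processes its keys in sorted order
            result.extend(reduce_join(bucket[key]))
    return result


if __name__ == "__main__":
    print(sorted(run_join()))  # (pageid, age) pairs
```

The table tag is what lets a single reducer tell apart values that share a key but come from different tables.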

(2) group by

Example: SELECT pageid, age, count(1) FROM pv_users GROUP BY pageid, age

Map side:

Key: pageid and age together form the key; a combiner runs on the map output side.

Value: 1

Reduce side: sum the values for each key.
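A minimal sketch of this group-by flow, assuming two input splits with made-up rows. The function names are hypothetical, and the combiner is modeled as a local pre-sum inside each map task.

```python
from collections import Counter, defaultdict

# Toy rows of pv_users as (pageid, age), invented for illustration.
split_1 = [(1, 25), (1, 25), (2, 25)]
split_2 = [(1, 25), (1, 32)]


def map_with_combiner(rows):
    # The map step emits ((pageid, age), 1) per row; the combiner
    # sums those 1s per key before anything leaves the map task.
    combined = Counter()
    for pageid, age in rows:
        combined[(pageid, age)] += 1
    return list(combined.items())


def reduce_sum(map_outputs):
    # Group the partial counts by key, then sum them.
    grouped = defaultdict(list)
    for output in map_outputs:
        for key, partial in output:
            grouped[key].append(partial)
    return {key: sum(partials) for key, partials in grouped.items()}


def run_group_by():
    return reduce_sum([map_with_combiner(split_1), map_with_combiner(split_2)])


if __name__ == "__main__":
    for (pageid, age), count in sorted(run_group_by().items()):
        print(pageid, age, count)
```

Because of the combiner, split_1 sends two records to the reduce side instead of three, which is the point of pre-aggregating on the map side.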

(3) distinct

Example: select distinct age from log

Map side:

Key: age

Value: null

Reduce side:

Each group needs only one output: context.write(key, null).
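The same idea as a short simulation, with invented age values and hypothetical function names: the map step emits the column as the key with an empty value, and the reduce step writes each key once, however many values the group holds.

```python
from collections import defaultdict

ages = [25, 32, 25, 41, 32]  # toy values of log.age, invented for illustration


def map_distinct(age):
    # key = age, value = None (the value carries no information)
    return age, None


def run_distinct(values):
    groups = defaultdict(list)
    for v in values:
        key, value = map_distinct(v)
        groups[key].append(value)
    # Reduce: one output per group, regardless of the group's size.
    return [key for key in sorted(groups)]


if __name__ == "__main__":
    print(run_distinct(ages))
```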

(4) distinct+count

Example: select count(distinct userid) from weibo_temp

Even if the number of reduce tasks is set to 3, only one will actually run, because count() is a global aggregate and only one reduce task can be started for it.

Map side:

Key: userid

Value: null

Reduce side:

Each group counts only once: define a global variable as the counter, increment it once per group, and output context.write(key, count) in cleanup(Context context).

Because every key is funneled through a single reduce task, count(distinct) easily leads to data skew and should be avoided where possible. If it cannot be avoided, rewrite the query like this:

select count(1) from (select distinct userid from weibo_temp) t;

The deduplication in the subquery can then run in parallel across multiple reduce tasks, which relieves the pressure on a single node. Note that Hive requires an alias (here t) for a subquery in the FROM clause.
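The difference between the two approaches can be sketched as follows. This is a simplified model under stated assumptions: the user ids, the reducer count of 3, the byte-sum partitioner, and all function names are invented for the example and do not reflect how Hive partitions data internally.

```python
userids = ["a", "b", "a", "c", "b", "d"]  # toy weibo_temp.userid values
NUM_REDUCERS = 3  # assumed reducer count


def stable_partition(key, num_reducers):
    # Deterministic stand-in for a hash partitioner.
    return sum(key.encode()) % num_reducers


def single_reducer_count(values):
    # count(distinct userid): one reducer sees every group
    # and increments a global counter once per group.
    count = 0
    for _key in sorted(set(values)):
        count += 1
    return count


def two_stage_count(values, num_reducers):
    # Stage 1 (select distinct userid): each reducer deduplicates
    # only the keys that were routed to its own partition.
    partitions = [set() for _ in range(num_reducers)]
    for v in values:
        partitions[stable_partition(v, num_reducers)].add(v)
    # Stage 2 (count(1)): add up the already deduplicated rows.
    return sum(len(p) for p in partitions)


if __name__ == "__main__":
    print(single_reducer_count(userids), two_stage_count(userids, NUM_REDUCERS))
```

Both functions return the same number. The difference is where the expensive deduplication happens: in one task in the first case, spread over several in the second.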

