Hive architecture, skew optimization, SQL, and frequently asked questions

Updated 2025-02-28. From: SLTechnology News&Howtos (shulou)

Hive architecture

The Hive architecture is shown in the figure. The client interacts with the driver; a query then passes through the parser, the planner, and the optimizer, and is finally executed as MapReduce jobs. The specific steps are as follows:

When the driver receives a SQL statement, the parser turns it into an abstract syntax tree (AST), a syntax tree that does not yet carry any metadata information.

The AST is then converted into QueryBlocks. A QueryBlock contains the input, the output, and the computing logic; in other words, a QueryBlock is a subroutine.

The planner traverses all QueryBlocks and turns them into Operators (for example, TableScanOperator), which together form the OperatorTree.

The optimizer optimizes the OperatorTree, applying predicate push-down, pruning, and so on.

The OperatorTree is then traversed and divided into multiple MapReduce jobs. Once this physical plan has been formed, physical optimization is applied, for example deciding whether to use a map join.

Hive data skew optimization

For group by, there are two optimization points.

Map-side aggregation: set hive.map.aggr=true. Rows with the same key are partially aggregated on the map side before being sent to the reducers.

Split into two jobs: set hive.groupby.skewindata=true. The original job is divided into two jobs. The first distributes rows to the reducers at random and pre-aggregates them; the second distributes the partial results by key and produces the final aggregate.
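The two-job idea can be pictured with a small Python simulation. This is only a sketch of the principle, not Hive code: the function name two_stage_sum, the reducer count, and the sample rows are made up for illustration.

```python
import random
from collections import defaultdict

def two_stage_sum(rows, num_reducers=4, seed=0):
    """Sum values per key in two stages (hypothetical helper, not a Hive API)."""
    rng = random.Random(seed)

    # Job 1: send each row to a random reducer, ignoring its key, so that
    # a hot key is spread over all reducers. Each reducer pre-aggregates.
    partials = [defaultdict(int) for _ in range(num_reducers)]
    rows_per_reducer = [0] * num_reducers
    for key, value in rows:
        r = rng.randrange(num_reducers)
        partials[r][key] += value
        rows_per_reducer[r] += 1

    # Job 2: route the (much smaller) partial sums by key and merge them.
    totals = defaultdict(int)
    for partial in partials:
        for key, subtotal in partial.items():
            totals[key] += subtotal
    return dict(totals), rows_per_reducer

# One heavily skewed key plus two ordinary keys.
rows = [("hot", 1)] * 1000 + [("a", 5), ("b", 7)]
totals, load = two_stage_sum(rows)
print(totals)  # per-key sums
print(load)    # rows handled by each reducer in job 1
```

In the first stage no single reducer receives all rows of the hot key; the second stage only sees at most one partial sum per key and reducer.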

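Note: both settings help only aggregate functions whose result can be merged from partial results, such as sum and count; they are of no use for aggregates that need all rows of a key in one place. avg is the usual example: partial averages cannot simply be averaged again, so it benefits only if each partial result carries a sum and a count.

Join also has two optimization points.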

Map join: set hive.auto.convert.join=true, which is enabled by default in newer versions of Hive. If the left table of the join is small enough, its contents are loaded directly into memory and the join is performed on the map side, so the rows of the large table do not have to be shuffled by key.
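As a rough illustration of what loading the small table into memory means, here is a minimal Python sketch of a hash join. It is not Hive's implementation; map_join and the sample tables are hypothetical names chosen for this example.

```python
from collections import defaultdict

def map_join(small_table, big_table_rows):
    """Inner-join two lists of (key, payload) pairs (illustrative helper)."""
    # Build phase: the small table fits in memory, so index it by join key.
    lookup = defaultdict(list)
    for key, payload in small_table:
        lookup[key].append(payload)

    # Probe phase: stream the big table and look each key up locally.
    # No row has to be shuffled to a reducer by key.
    for key, payload in big_table_rows:
        for small_payload in lookup.get(key, ()):
            yield key, small_payload, payload

small = [(1, "cn"), (2, "us")]
big = [(1, "order-1"), (1, "order-2"), (3, "order-3"), (2, "order-4")]
result = list(map_join(small, big))
print(result)
# [(1, 'cn', 'order-1'), (1, 'cn', 'order-2'), (2, 'us', 'order-4')]
```

Because every mapper holds the whole small table, a skewed key in the big table no longer concentrates work on one reducer.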

Two jobs: set hive.optimize.skewjoin=true; set hive.skewjoin.key=skew_key_threshold (default 100000). The two jobs here work differently from the group by case: a key with more than 100000 rows is handled by a separate map join, and the results are merged at the end.

Hive FAQ

Hive does not support non-equi joins (before Hive 2.2.0)

Error: select * from a inner join b on a.id<>b.id

Alternative: select * from a inner join b on a.id=b.id and a.id is null;

Hive does not support joins written without the join keyword (before Hive 0.13.0)

Error: select * from dual a, dual b where a.key = b.key

Correct: select * from dual a join dual b on a.key = b.key

Hive does not support or in join conditions (before Hive 2.2.0)

Error: select * from a inner join b on a.id=b.id or a.name=b.name

Alternative: select * from a inner join b on a.id=b.id union all select * from a inner join b on a.name=b.name

Note that with union all, a pair of rows that satisfies both conditions appears twice in the result.

Differences between internal and external tables in Hive

When creating a table: creating an internal table moves the data to the path of the data warehouse; creating an external table only records the path where the data is located and does not change the location of the data.

When deleting a table: dropping an internal table deletes its metadata and its data together, while dropping an external table deletes only the metadata, not the data. External tables are therefore relatively safer, allow more flexible data organization, and make it convenient to share source data.

sort by, order by, distribute by

order by performs a global sort: all data is sent to a single reducer and sorted there, which can easily exceed the disk and memory capacity of a single node and cause the task to fail.

distribute by + sort by is the alternative. The field named in distribute by is the key: rows are distributed by its hash to different reducers, and sort by then sorts the rows within each reducer. The output is ordered per reducer, not globally.
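The difference from a global sort can be seen in a short Python simulation. It is a sketch under stated assumptions, not Hive code: distribute_and_sort, the column names user and ts, and the use of zlib.crc32 as a stand-in for Hive's hash function are all illustrative choices.

```python
import zlib

def distribute_and_sort(rows, distribute_key, sort_key, num_reducers=3):
    """Hash-partition rows by one column, then sort inside each partition."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # crc32 is a stable stand-in for the hash used to pick a reducer.
        h = zlib.crc32(str(row[distribute_key]).encode("utf-8"))
        reducers[h % num_reducers].append(row)
    # Each reducer sorts only its own rows; no reducer sees the full data set.
    return [sorted(part, key=lambda r: r[sort_key]) for part in reducers]

rows = [
    {"user": "u1", "ts": 30}, {"user": "u2", "ts": 10},
    {"user": "u1", "ts": 5},  {"user": "u3", "ts": 20},
    {"user": "u2", "ts": 40}, {"user": "u3", "ts": 1},
]
parts = distribute_and_sort(rows, "user", "ts")
for i, part in enumerate(parts):
    print(i, part)
```

All rows of one user end up on the same reducer and are sorted there, but concatenating the partitions does not in general give a globally sorted result.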
