What Are the Tuning Techniques for Hive?

2025-07-09 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/01 Report--

This article shares common Hive tuning techniques. Many readers may not be familiar with them, so it is offered here as a reference; I hope you find it useful.

1. Fetch task (Hive can avoid MapReduce)

Some Hive queries can be evaluated without MapReduce. For example, for SELECT * FROM employees; Hive can simply read the files in the storage directory of the employees table and print the query results to the console.

In the hive-default.xml.template file, hive.fetch.task.conversion defaults to more; older versions of Hive default to minimal. With the property set to more, MapReduce is not used for full-table queries, single-column queries, LIMIT queries, and so on.

<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
  <description>
    Expects one of [none, minimal, more].
    Some select queries can be converted to single FETCH task minimizing latency.
    Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incurs RS), lateral views and joins.
    0. none : disable hive.fetch.task.conversion
  </description>
</property>

1.1 Case practice

Set hive.fetch.task.conversion to none and then run the queries; each of them launches a MapReduce program.

hive (default)> set hive.fetch.task.conversion=none;
hive (default)> select * from score;
hive (default)> select s_score from score;
hive (default)> select s_score from score limit 3;

Set hive.fetch.task.conversion to more and then run the same queries; none of them launches a MapReduce program.

1.2 Local mode

Most Hadoop jobs need the full scalability that Hadoop provides in order to handle large datasets. Sometimes, however, the input to a Hive query is very small. In that case, the overhead of launching the job for the query can take far longer than the actual work. For most of these cases, Hive can run all tasks on a single machine in local mode, which significantly reduces the execution time for small datasets.

Setting hive.exec.mode.local.auto to true lets Hive apply this optimization automatically when appropriate.

set hive.exec.mode.local.auto=true; -- enable local MapReduce
-- Maximum amount of input data for local MR. Local MR is used when the input is smaller than this value. Default is 134217728, that is, 128 MB.
set hive.exec.mode.local.auto.inputbytes.max=51234560;
-- Maximum number of input files for local MR. Local MR is used when the number of input files is less than this value. Default is 4.
set hive.exec.mode.local.auto.input.files.max=10;
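The two thresholds above can be read as a simple eligibility rule. The following is a minimal Python sketch of that rule, not Hive's actual implementation: the function name can_run_locally and the "both conditions must hold" logic are assumptions made for illustration, and Hive may apply further conditions that this article does not cover.

```python
# Sketch of the automatic local-mode check described in the text.
# The defaults mirror the two settings quoted above; the function name and
# the simplified rule are illustrative assumptions, not Hive source code.

DEFAULT_MAX_INPUT_BYTES = 134217728  # hive.exec.mode.local.auto.inputbytes.max (128 MB)
DEFAULT_MAX_INPUT_FILES = 4          # hive.exec.mode.local.auto.input.files.max


def can_run_locally(input_bytes, input_files,
                    max_bytes=DEFAULT_MAX_INPUT_BYTES,
                    max_files=DEFAULT_MAX_INPUT_FILES,
                    auto_enabled=True):
    """Return True when the job is small enough for local mode under this sketch."""
    if not auto_enabled:
        # Corresponds to hive.exec.mode.local.auto=false.
        return False
    # Both values must be strictly below their thresholds.
    return input_bytes < max_bytes and input_files < max_files


if __name__ == "__main__":
    print(can_run_locally(10 * 1024 * 1024, 2))               # small input, few files
    print(can_run_locally(500 * 1024 * 1024, 2))              # input too large
    print(can_run_locally(10 * 1024 * 1024, 8, max_files=10)) # raised file limit
```

Raising the limits, as the settings above do for the file count, widens the set of queries that qualify for local execution.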

Practical case

Turn on local mode and execute the query statement

hive (default)> set hive.exec.mode.local.auto=true;
hive (default)> select * from score cluster by players;
Time taken: 1.568 seconds

Turn off local mode and execute the query statement

hive (default)> set hive.exec.mode.local.auto=false;
hive (default)> select * from score cluster by players;

2. Group By

By default, all rows with the same key are sent from the map phase to a single reducer, so the job becomes skewed when one key carries too much data.

Not every aggregation has to be completed on the reduce side. Many aggregations can be partially performed on the map side, with the final result produced on the reduce side.
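To make the benefit of map-side aggregation concrete, here is a small Python model of a count-by-key job. It is only a sketch under simplifying assumptions: each inner list stands for the rows read by one map task, and count_by_key is a hypothetical helper, not a Hive API.

```python
from collections import Counter


def count_by_key(splits, map_aggr=True):
    """Count rows per key; return (totals, number of records sent to reducers).

    splits is a list of lists: the keys read by each map task.
    """
    if map_aggr:
        # Each map task emits one (key, partial_count) pair per distinct key.
        map_outputs = [list(Counter(split).items()) for split in splits]
    else:
        # Each map task emits one (key, 1) pair per input row.
        map_outputs = [[(key, 1) for key in split] for split in splits]

    shuffled = sum(len(output) for output in map_outputs)

    # Reduce side: merge whatever the map tasks emitted.
    totals = Counter()
    for output in map_outputs:
        for key, count in output:
            totals[key] += count
    return dict(totals), shuffled


if __name__ == "__main__":
    splits = [["a", "a", "b"], ["a", "c", "c"]]
    print(count_by_key(splits, map_aggr=True))
    print(count_by_key(splits, map_aggr=False))
```

Both variants return the same totals; the difference is how many records have to travel from the map tasks to the reducers.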

Parameter settings for enabling map-side aggregation

-- Whether to aggregate on the map side. The default is true.
set hive.map.aggr = true;
-- Number of entries for the aggregation operation on the map side
set hive.groupby.mapaggr.checkinterval = 1000000;
-- Perform load balancing when the data is skewed (default is false)
set hive.groupby.skewindata = true;

When hive.groupby.skewindata is set to true, the generated query plan has two MR jobs. In the first job, the map output is distributed randomly across the reducers; each reducer performs a partial aggregation and outputs its result. Because rows with the same GROUP BY key may land on different reducers, the load is balanced. The second job then distributes the pre-aggregated results to the reducers by GROUP BY key (which guarantees that the same key goes to the same reducer) and completes the final aggregation.
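The two-job plan described above can be modelled in a few lines of Python. This is a sketch of the idea only, not Hive's implementation: the function names, the use of CRC32 as a stand-in for hash partitioning, and the fixed random seed are all assumptions made for illustration.

```python
import random
import zlib
from collections import Counter


def reducer_for(key, num_reducers):
    """Deterministic stand-in for hash partitioning by GROUP BY key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def single_stage_load(keys, num_reducers=4):
    """Rows received per reducer when every row is routed by its key."""
    load = [0] * num_reducers
    for key in keys:
        load[reducer_for(key, num_reducers)] += 1
    return load


def two_stage_count(keys, num_reducers=4, seed=0):
    """Return (final counts, rows received per reducer in the first job)."""
    rng = random.Random(seed)

    # Job 1: rows go to a random reducer, which aggregates what it receives.
    stage1 = [Counter() for _ in range(num_reducers)]
    for key in keys:
        stage1[rng.randrange(num_reducers)][key] += 1
    stage1_load = [sum(partial.values()) for partial in stage1]

    # Job 2: partial results are routed by key, so each key meets one reducer.
    stage2 = [Counter() for _ in range(num_reducers)]
    for partial in stage1:
        for key, count in partial.items():
            stage2[reducer_for(key, num_reducers)][key] += count

    final = Counter()
    for partial in stage2:
        final.update(partial)
    return dict(final), stage1_load


if __name__ == "__main__":
    keys = ["hot"] * 9000 + ["k%d" % i for i in range(1000)]
    counts, stage1_load = two_stage_count(keys)
    print("busiest reducer, routed by key:", max(single_stage_load(keys)))
    print("busiest reducer, first job of two-stage plan:", max(stage1_load))
    print("count for the hot key:", counts["hot"])
```

With one dominant key, routing by key puts almost all rows on one reducer, while the random first stage spreads them roughly evenly; the final counts are the same either way.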

3. Count (distinct)

With a small amount of data this does not matter. With a large amount of data, a COUNT DISTINCT operation has to be completed by a single reduce task; the amount of data that reducer must process is so large that the whole job struggles to finish. COUNT DISTINCT is therefore usually replaced by a GROUP BY followed by a COUNT:

Environment preparation

create table bigtable (id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
load data local inpath '/home/bigtable' into table bigtable;

Use count (distinct)

set hive.exec.reducers.bytes.per.reducer=32123456;
SELECT count(DISTINCT id) FROM bigtable;

Result:
c0
10000
Time taken: 35.49 seconds, Fetched: 1 row(s)

Use the GROUP BY rewrite

set hive.exec.reducers.bytes.per.reducer=32123456;
SELECT count(id) FROM (SELECT id FROM bigtable GROUP BY id) a;

Result:
Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 13.07 sec HDFS Read: 120749896 HDFS Write: 464 SUCCESS
Stage-Stage-2: Map: 3 Reduce: 1 Cumulative CPU: 5.14 sec HDFS Read: 8987 HDFS Write: 7 SUCCESS
_c0
10000
Time taken: 51.202 seconds, Fetched: 1 row(s)

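The rewrite needs an extra job, and on this test data it is slower (51.202 seconds versus 35.49 seconds), but it is worth it when the amount of data is large, because the deduplication is spread across several reducers instead of one.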

Test data download link: https://pan.baidu.com/s/1LwKKJTeXR4h0iaOAknZ7_g extraction code: 5252
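The reason the GROUP BY rewrite scales better can be sketched with a small Python model. It is a simplification under stated assumptions: the function names are hypothetical, CRC32 stands in for Hive's hash partitioning, and map-side optimizations are ignored, so the numbers illustrate the work distribution rather than real Hive behavior.

```python
import zlib


def count_distinct_single_reducer(ids):
    """Model of COUNT(DISTINCT id): one reducer receives every row.

    Returns (distinct count, rows handled by the busiest reducer).
    """
    return len(set(ids)), len(ids)


def count_distinct_via_group_by(ids, num_reducers=4):
    """Model of SELECT count(id) FROM (SELECT id ... GROUP BY id).

    Returns (distinct count, rows handled by the busiest reducer in job 1).
    """
    # Job 1: GROUP BY id, rows partitioned across reducers by id.
    groups = [set() for _ in range(num_reducers)]
    rows_per_reducer = [0] * num_reducers
    for value in ids:
        reducer = zlib.crc32(str(value).encode("utf-8")) % num_reducers
        groups[reducer].add(value)
        rows_per_reducer[reducer] += 1

    # Job 2: a single reducer only has to count the already-deduplicated ids.
    distinct = sum(len(group) for group in groups)
    return distinct, max(rows_per_reducer)


if __name__ == "__main__":
    ids = [i % 10000 for i in range(100000)]
    print(count_distinct_single_reducer(ids))
    print(count_distinct_via_group_by(ids))
```

Both functions return the same distinct count; in the second, no single reducer has to process the full input.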

4. Cartesian product

Try to avoid Cartesian products, that is, joins with no ON condition or with an invalid ON condition, because Hive can use only one reducer to compute a Cartesian product.
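A short Python sketch shows why a missing join condition is costly: the output grows with the product of the two row counts. The helpers cross_join and equi_join and the sample rows are hypothetical and exist only for this illustration.

```python
from itertools import product


def cross_join(left, right):
    """Join without a condition: every left row pairs with every right row."""
    return list(product(left, right))


def equi_join(left, right, key):
    """Inner join on equality of one column, using a lookup table."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [(l_row, r_row) for l_row in left for r_row in index.get(l_row[key], [])]


if __name__ == "__main__":
    left = [{"id": i} for i in range(3)]
    right = [{"id": i} for i in (1, 2, 2, 5)]
    print(len(cross_join(left, right)))       # len(left) * len(right) rows
    print(len(equi_join(left, right, "id")))  # only the matching pairs
```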

5. Use partition pruning and column pruning

In SELECT, take only the columns you need, use partition filters wherever possible, and avoid SELECT *.

For partition pruning with outer joins: if the filter condition on the secondary table is written in the WHERE clause, the full tables are joined first and the filter is applied afterwards. For example:

Data preparation

create table ori (id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
create table bigtable (id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
load data local inpath '/home/bigtable' into table bigtable;
load data local inpath '/home/ori' into table ori;

Join first, then filter in WHERE

Select a.id FROM bigtable a left join ori o on a.id=o.id where o.id
