Fetch Task Conversion in Hive Performance Tuning

This article looks at Fetch task conversion in Hive performance tuning, then walks through several related optimizations: local mode, Group By aggregation, COUNT(DISTINCT) rewrites, Cartesian products, and partition and column pruning.
When we first start learning Hive, we are told that it lowers programmers' learning and development costs; concretely, it converts SQL statements into MapReduce programs and runs them for us.
However, some queries in Hive do not have to be computed with MapReduce at all. For example, for SELECT * FROM employees; Hive can simply read the files in the storage directory corresponding to employees and print the query results to the console.
In the hive-default.xml.template file, hive.fetch.task.conversion defaults to more (older versions of Hive defaulted to minimal). With this property set to more, full-table scans, single-column selects, LIMIT queries, and similar simple statements do not launch MapReduce:
<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
  <description>
    Expects one of [none, minimal, more].
    Some select queries can be converted to single FETCH task minimizing latency.
    Currently the query should be single sourced not having any subquery and should not have
    any aggregations or distincts (which incurs RS), lateral views and joins.
    0. none    : disable hive.fetch.task.conversion
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
  </description>
</property>
Next, let's check the effect through some practical exercises!
Case practice
Set hive.fetch.task.conversion to none and then execute the following query statements; every one of them runs a MapReduce job.
hive (default)> set hive.fetch.task.conversion=none;
hive (default)> select * from score;
hive (default)> select s_score from score;
hive (default)> select s_score from score limit 3;
Set hive.fetch.task.conversion to more and then execute the same queries; none of the following runs a MapReduce job.
hive (default)> set hive.fetch.task.conversion=more;
hive (default)> select * from score;
hive (default)> select s_score from score;
hive (default)> select s_score from score limit 3;
We can clearly see that with hive.fetch.task.conversion set to none, every statement takes extra time because it goes through MapReduce. But even with the property set to more, only simple SQL statements skip MapReduce. Is there any way to optimize the remaining cases?
Local mode
Most Hadoop jobs need the full scalability that Hadoop provides in order to handle large datasets. Sometimes, however, the input to a Hive query is very small, and the overhead of launching a distributed job can far exceed the actual work. In most such cases, Hive can run all tasks on a single machine in local mode, which significantly reduces execution time for small datasets.
Users can let Hive apply this optimization automatically when appropriate by setting hive.exec.mode.local.auto to true.
set hive.exec.mode.local.auto=true;  -- enable local MapReduce
Set the maximum input size for local MapReduce; local mode is used when the input data is smaller than this value. The default is 134217728 bytes, i.e. 128 MB.
set hive.exec.mode.local.auto.inputbytes.max=51234560;
Set the maximum number of input files for local MapReduce; local mode is used when the number of input files is smaller than this value. The default is 4.
set hive.exec.mode.local.auto.input.files.max=10;
Case practice:
Turn on local mode and execute the query statement
hive (default)> set hive.exec.mode.local.auto=true;
hive (default)> select * from score cluster by s_id;
18 rows selected (1.568 seconds)
Turn off local mode and execute the query statement
hive (default)> set hive.exec.mode.local.auto=false;
hive (default)> select * from score cluster by s_id;
18 rows selected (11.865 seconds)
Group By
By default, in the Map phase all rows with the same key are sent to a single reducer; when one key carries too much data, the job becomes skewed.
Not every aggregation has to be completed on the Reduce side. Many aggregations can be partially performed on the Map side, with the final result assembled on the Reduce side.
Enable map-side aggregation:
set hive.map.aggr = true;
Number of rows aggregated on the map side (check interval):
set hive.groupby.mapaggr.checkinterval = 100000;
Perform load balancing when the data is skewed (default is false):
set hive.groupby.skewindata = true;
When this option is set to true, the generated query plan contains two MR jobs. In the first job, the map output is distributed randomly across the reducers, and each reducer performs a partial aggregation and emits its result; because rows with the same Group By key may land on different reducers, the load is balanced. The second job then distributes the pre-aggregated results by Group By key (this guarantees that identical keys reach the same reducer) and completes the final aggregation.
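As a minimal sketch of how these settings fit together (the score table and s_id column come from the earlier examples; the query itself is illustrative):

-- Enable map-side partial aggregation and skew handling first.
set hive.map.aggr = true;
set hive.groupby.skewindata = true;

-- With skewindata=true this aggregation compiles into two MR jobs:
-- a randomized partial aggregation, then a final aggregation by key.
select s_id, count(*) as cnt
from score
group by s_id;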
Count (distinct)
This does not matter when the data volume is small. With a large dataset, however, a COUNT DISTINCT is completed by a single Reduce task, and the amount of data that one reducer has to process makes the whole job hard to finish. COUNT DISTINCT is therefore generally replaced by a GROUP BY followed by COUNT:
Environment preparation:
create table bigtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
load data local inpath '/home/admin/softwares/data/1 million large table data (id divided by 10 rounded)/bigtable' into table bigtable;
set hive.exec.reducers.bytes.per.reducer=32123456;
Test: SELECT count(DISTINCT id) FROM bigtable;
Results: _c0  10000  Time taken: 35.49 seconds, Fetched: 1 row(s)
Can be converted to:
set hive.exec.reducers.bytes.per.reducer=32123456;
SELECT count(id) FROM (SELECT id FROM bigtable GROUP BY id) a;
Results:
Stage-Stage-1: Map: 1  Reduce: 4  Cumulative CPU: 13.07 sec  HDFS Read: 120749896  HDFS Write: 464  SUCCESS
Stage-Stage-2: Map: 3  Reduce: 1  Cumulative CPU: 5.14 sec  HDFS Read: 8987  HDFS Write: 7  SUCCESS
_c0  10000
Time taken: 51.202 seconds, Fetched: 1 row(s)
Although this costs an extra job, it is well worth it when the data volume is large.
Cartesian product
Avoid Cartesian products, that is, joins with no ON condition or with an invalid ON condition; Hive can only use a single reducer to compute a Cartesian product.
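A minimal illustration, reusing the bigtable and ori tables from the surrounding examples (the queries themselves are illustrative):

-- No ON condition: Hive computes a Cartesian product on one reducer. Avoid this.
SELECT a.id, b.id
FROM bigtable a
JOIN ori b;

-- Joining on a key lets the work be distributed across reducers.
SELECT a.id, b.id
FROM bigtable a
JOIN ori b ON a.id = b.id;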
Partition pruning and column pruning
In a SELECT, take only the columns you need; when the table is partitioned, filter on the partition columns as much as possible, and avoid SELECT *. A related pitfall with outer joins: if the filter condition on the secondary table is written in the WHERE clause, the full tables are joined first and the filter is applied afterwards. The example below (after a short partition-filter sketch) demonstrates this.
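First, a minimal sketch of partition pruning itself (the logs table and its dt partition column are hypothetical, for illustration only):

-- dt is a hypothetical partition column; filtering on it lets Hive read only
-- the matching partition directory instead of scanning the whole table.
SELECT uid, keyword
FROM logs
WHERE dt = '2020-06-01';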
Environment preparation:
create table ori(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
create table bigtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
load data local inpath '/home/admin/softwares/data/raw data plus incremental id/ori' into table ori;
load data local inpath '/home/admin/softwares/data/1 million large table data (id divided by 10 rounded)/bigtable' into table bigtable;
Join first, then filter with WHERE:
SELECT a.id
FROM bigtable a
LEFT JOIN ori b ON a.id = b.id
WHERE b.id <= 10;  -- the predicate value is illustrative; the original truncates after b.id
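For contrast, a sketch of the optimized pattern, filtering the secondary table in a subquery before the join so the predicate is applied first (same illustrative predicate as above):

-- Filter ori down before joining: the predicate runs before the join,
-- so only the surviving rows from ori participate in it.
SELECT a.id
FROM bigtable a
LEFT JOIN (SELECT id FROM ori WHERE id <= 10) b
ON a.id = b.id;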