This article shares the common tuning methods in Hive. They are quite practical, so they are offered here as a reference.
As a commonly used data warehouse component in the big data field, Hive must be designed and developed with efficiency in mind. It is not only excessive data volume that affects Hive's efficiency: data skew, data redundancy, too many jobs or too much I/O, and unreasonable MapReduce allocation all have an impact as well. Tuning Hive therefore covers not only the optimization of HiveQL statements themselves, but also the tuning of Hive configuration items and of MapReduce.
Tuning is carried out from the following three aspects: architecture optimization, parameter optimization, and SQL optimization.
1. Architecture optimization
For the execution engine, choose the engine best suited to the resources of the company's platform, such as MR, TEZ, or Spark.
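For example, the engine can be switched per session with a configuration property (assuming the chosen engine is actually installed on the cluster):

```sql
SET hive.execution.engine=tez;  -- other options include mr and spark
```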
If you choose the TEZ engine, you can turn on the vectorized optimizer, and you can also choose the cost-based optimizer (CBO). The configuration is as follows:
SET hive.vectorized.execution.enabled = true;         -- default false
SET hive.vectorized.execution.reduce.enabled = true;  -- default false
SET hive.cbo.enable = true;                           -- default true
SET hive.compute.query.using.stats = true;            -- default false, available from v0.14.0
SET hive.stats.fetch.column.stats = true;             -- default false
SET hive.stats.fetch.partition.stats = true;          -- default true
Optimize the design of tables, for example by choosing partitioned tables, bucketed tables, and an appropriate storage format. To reduce data transfer, you can use compression; a few compression parameters are listed below (see the official website for more).
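As a quick illustration of the table-design advice above, a partitioned, bucketed ORC table might be declared like the following sketch; the table and column names are hypothetical:

```sql
-- hypothetical fact table: partitioned by date, bucketed by user_id, stored as ORC
create table user_events (
  user_id bigint,
  event   string,
  ts      timestamp
)
partitioned by (dt string)
clustered by (user_id) into 32 buckets
stored as orc
tblproperties ('orc.compress' = 'SNAPPY');
```

Partitioning prunes whole directories at query time, bucketing helps joins and sampling, and a columnar format such as ORC reduces both storage and I/O.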
-- compress intermediate results
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- compress the final output
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

2. Parameter optimization
The second part is parameter optimization. In fact, part of the architecture layer above is also controlled by parameters. Parameter control in this part mainly covers the following aspects:
Local mode, strict mode, JVM reuse, parallel execution, speculative execution, merge small files, Fetch mode
2.1 Local mode
When the amount of data is small, starting distributed processing is actually slower than processing locally, because the startup takes longer. Use the following parameters to adjust this:
SET hive.exec.mode.local.auto=true;                     -- default false
SET hive.exec.mode.local.auto.inputbytes.max=50000000;  -- the input size must be smaller than this value
SET hive.exec.mode.local.auto.input.files.max=5;        -- the number of map tasks must be less than this value, default 4

2.2 Strict mode
This is essentially a switch. When it is turned on, the following three kinds of statements will fail; when it is off, they execute normally.
hive.mapred.mode=strict;  -- default nonstrict

The three kinds of statements are:
-- queries on a partitioned table that do not filter on the partition column
-- joins of two tables that produce a Cartesian product
-- statements that sort with order by but do not specify a limit

2.3 JVM reuse
In MR, the unit of execution is the process, and one process is one JVM. For workloads with many short jobs, reusing these processes makes the jobs complete very quickly. The disadvantage is that a reused task slot is not released until the whole job finishes, which is more noticeable when the data is skewed. Turn this on with the following parameter:
SET mapreduce.job.jvm.numtasks=5;

2.4 Parallel execution
A Hive query is converted into one or more stages. When these stages are not interdependent, they can be executed in parallel, using the following parameters:
SET hive.exec.parallel=true;              -- default false
SET hive.exec.parallel.thread.number=16;  -- default 8

2.5 Speculative execution
This feature trades space resources for time to obtain the final result sooner. For example, when some tasks run very slowly because resources in the cluster are unevenly distributed, a backup task is started to process the same data, and the first result to complete successfully is taken as the final one.
SET mapreduce.map.speculative=true;
SET mapreduce.reduce.speculative=true;
SET hive.mapred.reduce.tasks.speculative.execution=true;

2.6 Merge small files
Before map execution, merge small files to reduce the number of map tasks:
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
Merge small files at the end of the task
-- merge small files at the end of a map-only task, default true
SET hive.merge.mapfiles = true;
-- merge small files at the end of a map-reduce task, default false
SET hive.merge.mapredfiles = true;
-- size of the merged file, default 256M
SET hive.merge.size.per.task = 268435456;
-- when the average size of the output files is smaller than this value, start a
-- separate map-reduce task to merge them, default 16M
SET hive.merge.smallfiles.avgsize = 16777216;

2.7 Fetch mode
Finally, fetch mode tries to avoid running MR at all in certain cases, such as queries on a few columns, global lookups, column lookups, limit queries, and so on:
SET hive.fetch.task.conversion=more;

3. SQL optimization
This part is more involved and may touch on the problem of data skew. Data skew has always been an unavoidable problem in big data processing, and there are many ways to deal with it.
3.1 SQL optimization
SQL optimization is the part that developers can control most directly. It is usually done from experience and can be roughly summed up in the following techniques: column pruning and partition pruning, sort by instead of order by, group by instead of count(distinct), group-by pre-aggregation (controlled by parameters), skew configuration items, map join, filtering null values separately, and properly adjusting the numbers of map and reduce tasks. Almost all of these come up in everyday work, and optimizing them as far as possible is what you need to do.
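As one illustration, the group by rewrite of count(distinct) mentioned above might look like the following sketch; the table and column names (visits, user_id) are hypothetical:

```sql
-- count(distinct) funnels all deduplication through a single reducer:
select count(distinct user_id) from visits;

-- the group by rewrite spreads the deduplication across many reducers:
select count(1)
from (select user_id from visits group by user_id) tmp;

-- group-by pre-aggregation is controlled by parameters such as:
SET hive.map.aggr=true;            -- map-side partial aggregation, default true
SET hive.groupby.skewindata=true;  -- two-job load balancing for skewed group by, default false
```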
3.2 Skew join configuration items
This configuration is similar to the skew-balancing configuration for group by. It is enabled through hive.optimize.skewjoin, which defaults to false. When enabled, during the join Hive temporarily writes the rows whose skewed key count exceeds the threshold hive.skewjoin.key (default 100000) to files, and then starts another job to produce the result with a map join. The number of mappers for this second job can be controlled with the hive.skewjoin.mapjoin.map.tasks parameter, which defaults to 10000.
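Put together, the configuration described above looks like this:

```sql
SET hive.optimize.skewjoin=true;            -- default false
SET hive.skewjoin.key=100000;               -- per-key row threshold above which a key counts as skewed
SET hive.skewjoin.mapjoin.map.tasks=10000;  -- mappers for the follow-up map-join job
```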
3.3 Handle skewed keys separately
If the skewed keys have practical meaning, there are generally very few of them. You can extract them separately and store the corresponding rows in a temporary table, prefix them with a small random number (such as 0-9), aggregate, and then aggregate again after removing the prefix. Also, do not write too many joins in a single select statement, and be sure to understand the business and the data: split a statement joining many tables (A0-A9) into several statements executed step by step, for example (A0-A4) and then (A5-A9), and execute the joins between large tables and small tables first.
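A minimal sketch of the random-prefix idea, assuming a hypothetical table t with a skewed key column k:

```sql
select split(salted_k, '_')[1] as k,
       sum(cnt) as total
from (
  select salted_k, count(1) as cnt
  from (
    -- salt: prefix each key with a random digit 0-9 so one hot key
    -- is spread across up to ten groups
    select concat(floor(rand() * 10), '_', k) as salted_k
    from t
  ) salted
  group by salted_k
) pre
group by split(salted_k, '_')[1];
```

The first aggregation works on the salted key, splitting a single hot key across up to ten reducers; the second aggregation strips the prefix and combines the partial counts.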
4. Two SQL problems
4.1 Find the teams that have won in three or more consecutive years.
team, year
Pistons, 1990
Bulls, 1991
Bulls, 1992
-- 1. rank within each team by year
select team, year,
       row_number() over (partition by team order by year) as rank
from t1;

-- 2. compute the group id
select team, year,
       row_number() over (partition by team order by year) as rank,
       (year - row_number() over (partition by team order by year)) as groupid
from t1;

-- 3. group and solve
select team, count(1) as years
from (
  select team,
         (year - row_number() over (partition by team order by year)) as groupid
  from t1
) tmp
group by team, groupid
having count(1) >= 3;

4.2 Find all the peaks and troughs for each id within one day.
Crest: the value at this moment is greater than the value at the previous moment and greater than the value at the next moment. Trough: the value at this moment is less than the value at the previous moment and less than the value at the next moment.

id, time, price
sh76688, 9:35, 29.48
sh76688, 9:40, 28.72
sh76688, 9:45, 27.74
sh76688, 9:50, 26.75
sh76688, 9:55, 27.13
sh76688, 10:00, 26.30
sh76688, 10:05, 27.09
sh76688, 10:10, 26.46
sh76688, 10:15, 26.11
sh76688, 10:20, 26.88
sh76688, 10:25, 27.49
sh76688, 10:30, 26.70
sh76688, 10:35, 27.57
sh76688, 10:40, 28.26
sh76688, 10:45, 28.03

The value at the previous moment can be obtained with lag(), and the value at the next moment with lead().

-- Approach: the key is to identify the features of crests and troughs.
-- Crest: greater than the values of both the previous and the next time period.
-- Trough: smaller than the values of both the previous and the next time period.
-- Once the features are found, the SQL is easy to write:
select id, time, price,
       case when price > beforeprice and price > afterprice then 'crest'
            when price < beforeprice and price < afterprice then 'trough'
       end as feature
from (
  select id, time, price,
         lag(price)  over (partition by id order by time) as beforeprice,
         lead(price) over (partition by id order by time) as afterprice
  from t2
) tmp
where (price > beforeprice and price > afterprice)
   or (price < beforeprice and price < afterprice);

Thank you for reading! This concludes the overview of common tuning methods in Hive; I hope it helps you in your work.