SLTechnology News&Howtos, shulou.com. 2025-02-24 Update.
This article explains how to tune Hive. Most readers are not very familiar with the topic, so it is shared here for reference; I hope you learn a lot from it.
Hive tuning covers SQL tuning, data skew mitigation, small-file handling, data compression, and more.
1. Data compression and storage format
Choose the file format and the compression codec together.
How to set it:
1. Compress intermediate (map-phase) output. Prefer a codec with low CPU overhead at this stage:
set hive.exec.compress.intermediate=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
or, for LZO:
set mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
2. Compress the final output:
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
Of course, you can also specify the file format and compression codec directly when creating a table in Hive.
Conclusion: ORCFile/Parquet with Snappy is generally the best choice.
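As a sketch of that per-table approach (the table and column names here are made up for illustration), declaring the format and codec at table-creation time might look like:

```sql
-- Hypothetical table stored as ORC with Snappy compression.
create table orders_orc (
  cid    bigint,
  amount double
)
stored as orc
tblproperties ("orc.compress" = "SNAPPY");
```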
2. Use partitions and buckets appropriately
Partitioning physically splits a table's data into separate folders, so a query can target exactly the partition directories it needs and read far less data.
Bucketing hashes the table's data on a specified column and splits it into separate files. At query time, Hive can use the bucket layout to quickly locate the file containing a given row, improving read efficiency.
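A minimal sketch of both ideas (table, column names, and bucket count are illustrative assumptions):

```sql
-- Hypothetical table: partitioned by day, bucketed by customer id.
create table orders_part (
  cid    bigint,
  amount double
)
partitioned by (dt string)
clustered by (cid) into 32 buckets
stored as orc;

-- Partition pruning: only the dt='20180808' directory is scanned.
select count(1) from orders_part where dt = '20180808';
```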
3. Hive parameter optimization
Enable fetch tasks so that simple queries skip MapReduce entirely:
set hive.fetch.task.conversion=more;
Enable parallel execution. When one SQL statement contains multiple jobs with no dependencies between them (typically when union all is used), they can run in parallel instead of sequentially:
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;  -- max threads for parallel jobs in one SQL statement
Enable JVM reuse. JVM reuse has a large impact on Hive performance, especially in scenarios with unavoidable small files or very many short-lived tasks: JVM startup overhead is considerable when a job contains thousands of tasks:
set mapred.job.reuse.jvm.num.tasks=10;
Set a reasonable number of reducers, either by adjusting the data volume each reducer receives or by fixing the count directly:
set hive.exec.reducers.bytes.per.reducer=500000000;  -- method 1: 500 MB per reducer
set mapred.reduce.tasks=20;  -- method 2: set the count directly
4. SQL optimization
(1) WHERE condition optimization.
Before optimization (a relational database would push this down automatically, but don't count on it in Hive):
select m.cid, u.id from order m join customer u on (m.cid = u.id) where m.dt = '20180808';
After optimization (the where condition is applied on the map side rather than the reduce side):
select m.cid, u.id from (select * from order where dt = '20180808') m join customer u on (m.cid = u.id);
(2) union optimization
Avoid union where possible (it removes duplicate records, which costs an extra stage); use union all and then deduplicate with group by.
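A sketch of the rewrite (t1 and t2 are hypothetical tables):

```sql
-- union would deduplicate in an extra stage; instead:
select id from (
  select id from t1
  union all
  select id from t2
) tmp
group by id;  -- deduplicate here
```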
(3) count distinct optimization
Do not use count(distinct column); use a subquery instead:
select count(1) from (select id from tablename group by id) tmp;
(4) replace join with in
When you only need to filter one table by the keys of another, prefer in over join; in is faster here.
select id, name from tb1 a join tb2 b on (a.id = b.id);
select id, name from tb1 where id in (select id from tb2);
(5) Eliminate group by, count(distinct), max, and min inside subqueries where possible; this reduces the number of jobs.
(6) join optimization:
Common/shuffle/reduce join: the join happens in the reduce phase; suitable for joining two large tables (the default).
Map join: the join happens in the map phase; suitable for joining a small table to a large table.
The large table's data is streamed from files, while the small table's data is held in memory (Hive detects the small table automatically and caches it).
set hive.auto.convert.join=true;
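As a hedged sketch (the table names follow the earlier example; the size threshold shown is the common default, verify it for your Hive version):

```sql
-- Threshold in bytes below which a table counts as "small"
-- for automatic map-join conversion.
set hive.mapjoin.smalltable.filesize = 25000000;

-- Older explicit syntax: force a map join with a hint,
-- where u (customer) is the small table.
select /*+ mapjoin(u) */ m.cid, u.id
from orders m join customer u on (m.cid = u.id);
```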
SMB join
Sort-Merge-Bucket join uses bucketed tables to optimize joins between two large tables: the join is performed bucket against corresponding bucket (both sides must be bucketed tables on the join key).
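A sketch of tables that qualify for an SMB join (names and bucket count are illustrative):

```sql
-- Both tables bucketed and sorted on the join key; the bucket
-- counts must be equal or multiples of each other.
create table big_a (id bigint, v string)
clustered by (id) sorted by (id) into 64 buckets;

create table big_b (id bigint, w string)
clustered by (id) sorted by (id) into 64 buckets;
```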
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
5. Data skew
Symptom: task progress sticks at 99% (or 100%) for a long time. The monitoring page shows that only a small number of reduce subtasks (one or a few) are unfinished, because they process far more data than the others.
Cause: one reducer's input is much larger than the other reducers'.
1) Uneven key distribution.
2) Characteristics of the business data itself (hot keys).
3) Poor choices in table design.
4) Certain SQL constructs that are prone to skew.
Common skew scenarios (keyword, situation, consequence):
- join: one of the tables is small, but its keys are concentrated; the data for those keys, routed to one or a few reducers, is far above average.
- join: a large table joined to a large table, but the join field has too many 0 or null values; all of them are handled by a single reducer, which is very slow.
- group by: the grouping dimension is too small and one value is far too frequent; the reducer handling that value is very time-consuming.
- count distinct: too many rows share one special value; processing that value dominates the runtime.
(1) Parameter adjustment
set hive.map.aggr=true;  -- aggregate on the map side to reduce the data shipped to reducers
set hive.groupby.skewindata=true;  -- enable Hive's built-in skew optimization for group by
(2) Understand the data distribution, optimize the SQL logic, and find the actual cause of the skew.
If the skew arises in a group by, first ask whether the grouping dimension can be made finer. If it cannot, add a random prefix to the original group key and aggregate once, then strip the prefix from the results and aggregate again.
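A sketch of that two-stage "salting" pattern (table t and column key are hypothetical; this works for aggregates such as count/sum that can be re-aggregated):

```sql
-- Stage 1: salt the key with a random 0-9 prefix and pre-aggregate;
-- Stage 2: strip the salt and aggregate again.
select split(salted, '_')[1] as key, sum(cnt) as cnt
from (
  select salted, count(1) as cnt
  from (
    select concat(cast(floor(rand() * 10) as string), '_', key) as salted
    from t
  ) s
  group by salted
) agg
group by split(salted, '_')[1];
```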
In a join, if many join keys are null, convert the nulls to random values so the rows scatter across reducers instead of piling up on one.
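A sketch of the null-key trick (log and users are hypothetical tables with a string user_id):

```sql
-- Null user_ids are replaced with distinct random strings, so they
-- spread across reducers; they still match no row on the right side.
select a.id, b.name
from log a
left join users b
  on (case when a.user_id is null
           then concat('hive_', cast(rand() as string))
           else a.user_id end = b.user_id);
```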
(3) When both join inputs are large and the long tail comes from hot values, process the hot values and the non-hot values separately, then merge the results.
6. Merge small files
Small files arise in three places: map input, map output, and reduce output. Too many small files hurt Hive's efficiency:
Merge small files on the map input side:
set mapred.max.split.size=256000000;  -- maximum split size
set mapred.min.split.size.per.node=100000000;  -- minimum split size per node (determines whether files on one DataNode are merged)
set mapred.min.split.size.per.rack=100000000;  -- minimum split size per rack (determines whether files across racks are merged)
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;  -- combine small files before the map phase
Parameters for merging map output and reduce output:
set hive.merge.mapfiles = true;  -- merge small files produced by map-only jobs (default true)
set hive.merge.mapredfiles = true;  -- merge small files produced by reduce output (default false)
set hive.merge.size.per.task = 256000000;  -- target size of merged files
set hive.merge.smallfiles.avgsize = 16000000;  -- when the average output file size is below this, start a separate MapReduce job to merge
7. View the execution plan with explain sql
Learning to read a query's execution plan, optimizing the business logic, and reducing the data each job handles are also very important for tuning.
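A sketch of the usage (the query reuses the earlier illustrative orders table):

```sql
-- Inspect the stage plan: the number of stages, the join strategy,
-- and the estimated data sizes at each operator.
explain
select m.cid, count(1)
from (select * from orders where dt = '20180808') m
group by m.cid;
```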
That is all for "How to tune Hive". Thank you for reading; I hope this article helps you.