How to tune Hive

2025-01-18 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/01 Report --

This article explains in detail how to tune Hive. The editor finds it very practical and shares it with you as a reference; I hope you gain something from reading it.

1 Fetch

Fetch means that some queries do not have to be computed with MapReduce at all.

Set hive.fetch.task.conversion to more so that MapReduce is not used for global lookups, column lookups, LIMIT lookups, and so on.
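A minimal sketch (the table and column names emp, empno, ename are illustrative); with the parameter set to more, a simple projection-plus-limit query runs as a fetch task and launches no MapReduce job:

```sql
-- enable fetch conversion ("more" is the default in recent Hive versions)
set hive.fetch.task.conversion=more;

-- now runs as a fetch task; no MapReduce job is launched:
SELECT empno, ename FROM emp LIMIT 10;

-- with the setting at "none", even this query would go through MapReduce:
set hive.fetch.task.conversion=none;
```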

2 Local mode

Most Hadoop jobs require the full scalability that Hadoop provides in order to handle large data sets. Sometimes, however, the input data for a Hive query is very small, and in that case the overhead of triggering execution tasks for the query can take much longer than the actual job execution. For most such cases, Hive can handle all tasks on a single machine through local mode, which can significantly shorten execution time for small data sets.

Set hive.exec.mode.local.auto to true to let Hive start this optimization automatically at the appropriate time.
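A minimal sketch; the two threshold parameters below are real Hive settings (shown with their default values) that gate when automatic local mode kicks in:

```sql
-- let Hive decide automatically when to run a job locally
set hive.exec.mode.local.auto=true;
-- local mode is used only when the total input is below this size (default 128 MB)
set hive.exec.mode.local.auto.inputbytes.max=134217728;
-- ...and the number of input files is below this count (default 4)
set hive.exec.mode.local.auto.input.files.max=4;
```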

3 Table Optimization

3.1 Empty KEY filtering

Sometimes a JOIN times out because some KEYs have too much data: all rows with the same KEY are sent to the same Reducer, which then runs out of memory. We should analyze such cases carefully; in many of them the data for these KEYs is abnormal data, and we need to filter it out in the SQL statement.
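A minimal sketch of filtering out empty keys in a subquery before the join (the table and column names ori, ori2, and id are illustrative):

```sql
-- drop NULL-key rows before they reach the join
SELECT n.*
FROM (SELECT * FROM ori WHERE id IS NOT NULL) n
JOIN ori2 o ON n.id = o.id;
```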

3.2 Null KEY conversion

Sometimes, although there is a lot of data with an empty KEY, those rows are not abnormal and must be included in the JOIN result. In this case, we can assign a random value to the empty-KEY field so that the rows are distributed randomly and evenly across different Reducers.
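A minimal sketch, again with illustrative table and column names; the CASE expression substitutes a random value for NULL keys so those rows spread evenly across Reducers instead of piling onto one (a random key never matches a real id, so the rows still appear in the result of the full join with no partner):

```sql
SELECT n.*
FROM ori n
FULL JOIN ori2 o
  ON CASE WHEN n.id IS NULL THEN concat('hive_', rand()) ELSE n.id END = o.id;
```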

3.3 MapJOIN

If you do not specify a MapJOIN, or the conditions for MapJOIN are not met, the Hive parser converts the JOIN into a Common JOIN, i.e. the JOIN is completed in the Reduce phase, where data skew easily occurs. You can use MapJOIN to load the small table into memory on the Map side and perform the JOIN there, avoiding Reducer processing altogether.
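Two real settings control this behavior; a sketch showing them with their default values:

```sql
-- enable automatic conversion of a Common JOIN into a MapJoin (default true)
set hive.auto.convert.join = true;
-- a table below this size (default ~25 MB) is considered small enough to broadcast
set hive.mapjoin.smalltable.filesize = 25000000;
```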

3.4 Group By

By default, rows with the same Key are distributed to one Reducer in the Map phase; when a single key has too much data, the job is skewed.

# whether to aggregate on the Map side
set hive.map.aggr = true;
# number of entries for the aggregation operation on the Map side
set hive.groupby.mapaggr.checkinterval = 10000;
# load balancing when the data is skewed
set hive.groupby.skewindata = true;

When this option is set to true, the generated query plan contains two MR jobs. In the first MR job, Map output is distributed randomly across the Reducers; each Reducer performs a partial aggregation and outputs the result, so rows with the same Group By Key may land on different Reducers, which achieves load balancing. The second MR job then distributes the pre-aggregated results to the Reducers by Group By Key (this guarantees that the same Group By Key goes to the same Reducer) and completes the final aggregation.

3.5 Count (Distinct)

This does not matter when the data volume is small, but with a large data volume COUNT DISTINCT is a global aggregation: no matter how many Reduce Tasks are configured, Hive starts only one Reducer, which then has to process all the data and struggles to complete the job. COUNT DISTINCT is therefore generally replaced by a GROUP BY followed by a COUNT.
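A minimal sketch of the rewrite (bigtable and id are illustrative names):

```sql
-- instead of:
--   SELECT count(DISTINCT id) FROM bigtable;
-- deduplicate with GROUP BY first, then count the groups:
SELECT count(id) FROM (SELECT id FROM bigtable GROUP BY id) t;
```

The GROUP BY spreads the deduplication across many Reducers; only the final count runs on a single Reducer, over already-deduplicated rows.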

3.6 Cartesian product

Do not use Cartesian products, i.e. joins without an ON condition or with an invalid ON condition; Hive can use only one Reducer to complete such a join.

3.7 Column and row filtering

Do not use SELECT *; select only the columns you actually need, and push row filters (WHERE conditions) into a subquery before the join rather than filtering after it.
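A minimal sketch of both ideas (bigtable, ori, and id are illustrative names):

```sql
-- column filtering: project only the columns needed, never SELECT *
-- row filtering: restrict ori inside a subquery before it is joined
SELECT b.id
FROM bigtable b
JOIN (SELECT id FROM ori WHERE id <= 10) o ON b.id = o.id;
```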

3.8 dynamic partition adjustment

When inserting data into a partitioned table, the database automatically routes each row into the corresponding partition according to the value of the partition field; this is dynamic partitioning.

# enable the dynamic partitioning function (default true)
hive.exec.dynamic.partition=true
# set to non-strict mode
hive.exec.dynamic.partition.mode=nonstrict
# maximum number of dynamic partitions that can be created across all nodes executing MR, default 1000
hive.exec.max.dynamic.partitions=1000
# maximum number of dynamic partitions that can be created on each node executing MR; this parameter needs to be set according to the actual data
hive.exec.max.dynamic.partitions.pernode=100
# maximum number of HDFS files that can be created in the entire MR job, default 100000
hive.exec.max.created.files=100000
# whether to throw an exception when an empty partition is generated; generally there is no need to set it
hive.error.on.empty.partition=false

4 Reasonably setting the number of Maps and Reduces

4.1 Increase the number of Maps for complex files
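A minimal sketch of a dynamic-partition insert (the table and column names are illustrative); the partition column is simply listed last in the SELECT and Hive routes each row by its value:

```sql
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE target_partitioned PARTITION (p_day)
SELECT id, name, p_day      -- each row's p_day value decides its partition
FROM source_table;
```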

When the input files are large, the task logic is complex, and Map execution is very slow, consider increasing the number of Maps to reduce the amount of data each Map processes and so improve task execution efficiency.
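One common way to do this (a sketch; the 100 MB value is only an example) is to lower the maximum split size, since the number of Maps is driven by the number of input splits:

```sql
-- default max split size is 256 MB; lowering it yields more splits, hence more Maps
set mapreduce.input.fileinputformat.split.maxsize=104857600;  -- 100 MB
```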

4.2 merge small files

Merge small files before Map execution to reduce the number of Map:

set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

Merge settings for small files at the end of a Map-Reduce task:

# merge small files at the end of a map-only task, default true
SET hive.merge.mapfiles = true;
# merge small files at the end of a map-reduce task, default false
SET hive.merge.mapredfiles = true;
# size of the merged files, default 256 MB
SET hive.merge.size.per.task = 268435456;
# when the average size of the output files is less than this value, start a separate Map-Reduce task to merge them, default 16 MB
SET hive.merge.smallfiles.avgsize = 16777216;

4.3 Reasonably set the number of Reduces

# amount of data processed by each Reduce, default 256 MB
hive.exec.reducers.bytes.per.reducer=256000000
# maximum number of Reduces per task, default 1009
hive.exec.reducers.max=1009
# formula for calculating the number of Reduces:
# N = min(hive.exec.reducers.max, total data size / hive.exec.reducers.bytes.per.reducer)

5 Parallel execution

Hive converts a query into one or more phases: MapReduce phases, sampling phases, merge phases, limit phases, or other phases that may be needed during Hive execution. By default, Hive executes only one phase at a time, but a particular job may contain many phases that are not entirely interdependent; some of them can be executed in parallel, which may shorten the execution time of the whole job. The more phases that can run in parallel, the faster the job may complete.

Concurrent execution can be enabled by setting the parameter hive.exec.parallel to true. In a shared cluster, however, note that if a job has many parallel phases, cluster utilization will increase.
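A minimal sketch; both settings are real Hive parameters:

```sql
-- allow independent phases of one query to run concurrently
set hive.exec.parallel=true;
-- maximum number of phases allowed to run in parallel (default 8)
set hive.exec.parallel.thread.number=16;
```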

6 strict mode

Hive provides a strict mode to prevent users from executing queries that may have an unexpected negative impact.

To enable strict mode, change the value of hive.mapred.mode to strict. Strict mode disables three types of queries:

For partitioned tables, execution is not allowed unless the WHERE clause contains a filter on the partition field to limit the scan range.

For queries that use ORDER BY, a LIMIT clause is required, because ORDER BY sends all result rows to a single Reducer.

Queries that produce a Cartesian product are restricted.
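A sketch of what strict mode rejects and accepts (a table logs partitioned by day is an illustrative assumption):

```sql
set hive.mapred.mode=strict;

-- rejected: partitioned table scanned without a partition filter
--   SELECT uid, ts FROM logs;
-- rejected: ORDER BY without LIMIT
--   SELECT uid, ts FROM logs WHERE day='2025-01-18' ORDER BY ts;
-- accepted:
SELECT uid, ts FROM logs WHERE day='2025-01-18' ORDER BY ts LIMIT 100;
```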

7 JVM reuse

JVM reuse is a Hadoop tuning parameter, but it has a great impact on Hive performance, especially in scenarios where small files are hard to avoid or where there are very many tasks, most of which have short execution times.

Hadoop's default configuration usually forks a new JVM to execute each map and reduce task, and the JVM startup process can cause considerable overhead, especially when a job contains hundreds or thousands of tasks.

JVM reuse allows JVM instances to be reused multiple times within the same job.

It needs to be configured in Hadoop's mapred-site.xml; the drawback is that if the reuse count is set too large, the long-lived JVMs accumulate a lot of garbage.

<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <value>10</value>
  <description>How many tasks to run per jvm. If set to -1, there is no limit.</description>
</property>

8 Speculative execution

Note, however, that if a Map or Reduce Task already needs to run for a long time because of a large input volume, the waste caused by starting speculative execution is very great.
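Speculative execution is controlled by real Hadoop and Hive parameters; a minimal sketch of turning it on (these are the defaults in standard Hadoop configurations):

```sql
-- Hadoop-side switches for map and reduce tasks
set mapreduce.map.speculative=true;
set mapreduce.reduce.speculative=true;
-- Hive's own switch for reduce-side speculative execution
set hive.mapred.reduce.tasks.speculative.execution=true;
```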

This concludes the article on how to tune Hive. I hope the content above has been of some help and lets you learn more. If you think the article is good, please share it so more people can see it.
