
Hive tuning skills

2025-07-06 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/03 Report--

1. Fetch

set hive.fetch.task.conversion=more; -- default

Fetch conversion means that certain queries in Hive can be answered without running a MapReduce job: Hive simply reads the files behind the table and returns the rows.

When this property is set to more, full-table selects (select *), selects of specific columns, limit queries, and similar simple queries do not run MapReduce. When it is set to none, every query runs as a MapReduce job.
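As a rough illustration, the effect of the two values can be compared in a single Hive session. The table emp and the column ename are hypothetical names that do not come from the article:

```sql
-- emp and ename are hypothetical names used only for illustration.
set hive.fetch.task.conversion=none;
select * from emp;               -- runs as a MapReduce job
select ename from emp limit 3;   -- runs as a MapReduce job

set hive.fetch.task.conversion=more;
select * from emp;               -- answered by a fetch task, no MapReduce
select ename from emp limit 3;   -- answered by a fetch task, no MapReduce
```

On a small table the second pair should return noticeably faster, because no job has to be submitted.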

2. Local mode

set hive.exec.mode.local.auto=true; -- enable local mode

In local mode, Hive runs all tasks of a query on a single machine. For small datasets this can reduce the execution time significantly.
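One way to see the effect, sketched here with a hypothetical small table named emp, is to run the same query with automatic local mode switched on and off and compare the times Hive reports:

```sql
-- emp is a hypothetical small table used only for illustration.
set hive.exec.mode.local.auto=true;    -- let Hive choose local mode for small inputs
select count(*) from emp;

set hive.exec.mode.local.auto=false;   -- always submit the job to the cluster
select count(*) from emp;
```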

1. When local mode is enabled, set the maximum amount of input data for local MR. When the input is smaller than this value, the job runs as local MR.

set hive.exec.mode.local.auto.inputbytes.max=134217728; -- default, 128 MB

2. When local mode is enabled, set the maximum number of input files for local MR. When the number of input files is smaller than this value, the job runs as local MR.

set hive.exec.mode.local.auto.input.files.max=4; -- default

3. Table optimization

3.1 Small table join large table (the small table needs to be on the left)

Note: newer versions of Hive optimize both small table JOIN large table and large table JOIN small table. It no longer makes a noticeable difference whether the small table is on the left or on the right.

3.2 Large table join large table

When the join key of a table contains many null values, null is treated as a single key during MapReduce. All rows with that key are sent to the same reducer, which can run out of memory, so these null keys need to be either filtered out or spread across reducers.

1. Filter out the null keys before joining

insert overwrite table jointable select n.* from (select * from nullidtable where id is not null) n left join ori o on n.id = o.id;

2. Replace each null key with a random value so that the null keys no longer collapse into one key

insert overwrite table jointable select n.* from nullidtable n full join ori o on case when n.id is null then concat('hive', rand()) else n.id end = o.id;

Note: this method can solve the problem of data skew.
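Before choosing between the two approaches, it can help to measure how many rows actually carry a null key. A minimal sketch against the nullidtable table used above:

```sql
-- Count the rows whose join key is null; a large share suggests reduce-side skew.
select count(*) as total_rows,
       sum(case when id is null then 1 else 0 end) as null_id_rows
from nullidtable;
```

The first approach drops the null-key rows from the result, while the second keeps them, so the choice also depends on whether those rows are needed downstream.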

3.3 MapJoin

If MapJoin is not specified, or the conditions for MapJoin are not met, the Hive parser converts the join into a Common Join, which completes the join in the Reduce phase, where data skew occurs easily. With MapJoin, the small table is loaded entirely into memory and joined on the map side, so no reducer is needed for the join.
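A join can also request MapJoin explicitly through a query hint. The sketch below assumes a hypothetical small table named smalltable next to bigtable. It also assumes a Hive version in which the hint is ignored by default and has to be re-enabled through hive.ignore.mapjoin.hint, so treat it as an illustration rather than the recommended route:

```sql
-- smalltable is a hypothetical name; the hint asks Hive to load alias s into memory.
set hive.ignore.mapjoin.hint=false;

select /*+ MAPJOIN(s) */ b.id
from bigtable b
join smalltable s on s.id = b.id;
```

Prefixing the query with explain shows whether the plan contains a map-side join or falls back to a Common Join.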

Enable automatic conversion to MapJoin:

set hive.auto.convert.join = true; -- default

Threshold that separates small tables from large tables (by default, a table smaller than 25 MB is treated as a small table):

set hive.mapjoin.smalltable.filesize=25000000;

3.4 Group By

By default, all rows with the same key are sent from the Map phase to a single reducer, so the job skews when one key has too much data. Not every aggregation has to be completed on the reduce side: many aggregations can be partially computed on the map side, with the final result produced in the Reduce phase.

Parameters for map-side aggregation:

Whether to aggregate on the map side (default is true):

set hive.map.aggr = true;

The number of rows on which aggregation is performed on the map side:

set hive.groupby.mapaggr.checkinterval = 100000;

Whether to load-balance when the data is skewed (default is false):

set hive.groupby.skewindata = true;

Note: when this option is set to true, the generated query plan has two MR jobs. In the first job, the map output is distributed randomly across the reducers, and each reducer performs a partial aggregation and writes out its result. Rows with the same Group By key may therefore land on different reducers, which balances the load. The second job distributes the pre-aggregated results to the reducers by Group By key (this guarantees that rows with the same Group By key reach the same reducer) and completes the final aggregation.
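The two-job plan can be pictured as a hand-written two-stage aggregation. The sketch below is only an analogy for what hive.groupby.skewindata does, not the plan Hive actually generates. The table events, the column user_id, and the bucket count of 10 are hypothetical:

```sql
-- Hypothetical table events(user_id); 10 salt buckets chosen arbitrarily.
select user_id, sum(partial_cnt) as cnt
from (
    -- Stage 1: a random salt spreads each key over up to 10 groups.
    select user_id, salt, count(*) as partial_cnt
    from (
        select user_id, cast(floor(rand() * 10) as int) as salt
        from events
    ) s
    group by user_id, salt
) p
-- Stage 2: merge the partial counts of each key.
group by user_id;
```

A hot key is split into several smaller groups in the first stage, so no single reducer has to process all of its rows.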

3.5 Count(Distinct) deduplication

Count(Distinct) runs as a single MapReduce job in which one reducer has to process all the data, so with large data the job is hard to complete. The usual replacement is to Group By first and then Count, which takes two MapReduce jobs. With set mapreduce.job.reduces = 5; the first job runs its reduce phase on 5 reducers, which lowers the load on each reducer. This costs an extra job, but with a large amount of data it is well worth it.
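The rewrite described above can be sketched as follows, using the bigtable table and its id column from the examples in this article:

```sql
set mapreduce.job.reduces = 5;

-- Single job: all distinct values are counted by one reducer.
select count(distinct id) from bigtable;

-- Two jobs: deduplicate with group by across 5 reducers, then count the groups.
select count(id) from (select id from bigtable group by id) a;
```

Both queries return the number of distinct non-null id values.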

Column processing: in select, take only the columns you need, use partition filtering as much as possible, and avoid select *.

Row processing: in partition pruning, when using an outer join, if the filter condition on the secondary table is written in the where clause, the full tables are joined first and the result is filtered afterwards.

Example:

1. Join the two tables first, then filter with a where condition

hive (default)> select o.id from bigtable b join ori o on o.id = b.id where o.id ...

2. Filter with a subquery first, then join the tables

hive (default)> select b.id from bigtable b join (select id from ori where id ...
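The difference between filtering after the join and filtering before it can be sketched as below. The filter conditions are cut off in the text above, so the comparison id <= 10 is an assumption chosen only to make the two queries complete:

```sql
-- The threshold 10 is hypothetical; the original condition is truncated.

-- Filter after the join: the tables are joined first, then rows are discarded.
select o.id
from bigtable b
join ori o on o.id = b.id
where o.id <= 10;

-- Filter before the join: ori is reduced in a subquery, then joined.
select b.id
from bigtable b
join (select id from ori where id <= 10) o on b.id = o.id;
```

For an inner join both forms return the same rows; the second simply hands fewer rows to the join.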
