The Causes and Optimization Methods of Hive Data Skew

2026-03-21 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains the causes of data skew in Hive and the methods for optimizing against it. The content is short and practical, so follow along step by step.

The cause of data skew: data is unevenly distributed across keys, so a large share of the records is concentrated on a single point and forms a data hotspot. Specifically, one reducer receives n times as much data as the other reducers, which produces an obvious barrel effect: the job can only finish as fast as its slowest reducer.
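The hotspot described above can be reproduced with a small simulation. This is a minimal sketch, not Hive's actual partitioner: `zlib.crc32` stands in for the key hash, and the key names, row counts, and reducer count are made-up values.

```python
import zlib
from collections import Counter

def reducer_for(key, num_reducers):
    # Stand-in for the partitioner: hash the key, then take it modulo the reducer count.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reducer_loads(keys, num_reducers):
    # Count how many rows each reducer would receive.
    loads = Counter(reducer_for(k, num_reducers) for k in keys)
    return [loads.get(r, 0) for r in range(num_reducers)]

# Hypothetical data: one hot key with 9,000 rows plus 1,000 keys with one row each.
keys = ["hot_user"] * 9000 + [f"user_{i}" for i in range(1000)]
loads = reducer_loads(keys, num_reducers=10)

print("rows per reducer:", loads)
print("max / average:", max(loads) / (sum(loads) / len(loads)))
```

Every row of the hot key hashes to the same reducer, so that reducer carries at least 9,000 of the 10,000 rows while the other nine share the rest.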

Symptoms:

1. Run select key, count(1) from tb group by key on the table to see whether a few keys account for a large number of rows.

2. Check the monitoring interface: the task progress stays at 99% (or 100%) for a long time while only a few reduce subtasks (one or more) remain unfinished, or some reduce subtasks take n times the average reduce duration.

In the figure above, one reduce task of the job takes far longer than the others, which indicates that it processes far more data than they do. In other words, data skew has occurred in this job.
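Both checks can be sketched in a few lines. This is only an illustration: the threshold `factor=3.0` and the sample keys and durations are assumptions, and in practice the durations would be read from the monitoring interface.

```python
from collections import Counter
from statistics import mean

def top_keys(keys, limit=3):
    # Same idea as: select key, count(1) from tb group by key, sorted by count descending.
    return Counter(keys).most_common(limit)

def slow_reducers(durations, factor=3.0):
    # Indexes of reduce tasks whose duration exceeds `factor` times the average duration.
    threshold = factor * mean(durations)
    return [i for i, d in enumerate(durations) if d > threshold]

# Hypothetical sample: key column values and reduce task durations in seconds.
sample_keys = ["hot"] * 5 + ["a", "b", "c"]
sample_durations = [60, 62, 58, 61, 900]

print(top_keys(sample_keys, limit=1))
print(slow_reducers(sample_durations))
```

Because a single slow task also raises the average, a median-based threshold may be more robust when there are only a few reducers.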

Solution

Parameter tuning:

1. set hive.groupby.skewindata=true: with this parameter on, rows with the same group-by key are no longer all sent to the same reducer. The map output is first distributed randomly across the reducers, each of which does a partial aggregation; a second round of MR then sends the partial results to reducers by key and computes the final result. The parameter therefore does something similar to hive.map.aggr, except that the partial aggregation is moved to the reduce side, and it needs an extra job to start. For that reason it is not recommended, and its effect is often not obvious.

2. set hive.skewjoin.key=100000: if the number of records for a join key exceeds this value, that key is treated as skewed and the join on it is optimized. The setting takes effect when hive.optimize.skewjoin=true.

3. set mapred.reduce.tasks=500: increase the number of reducers. Data (key-value pairs) is normally shuffled to a reducer by hashing the key and taking the result modulo the number of reducers, so more reducers spread distinct keys more thinly. All records of a single hot key still go to one reducer.
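The two-round aggregation behind hive.groupby.skewindata can be simulated as follows. This is a simplified sketch of the idea rather than Hive's execution plan: the data, the reducer count, and the fixed random seed are assumptions.

```python
import random
from collections import Counter

def two_stage_count(keys, num_reducers, seed=0):
    rng = random.Random(seed)
    # Round 1: each row goes to a random reducer, so a hot key is spread over all of them,
    # and every reducer keeps partial counts per key.
    stage1 = [Counter() for _ in range(num_reducers)]
    for key in keys:
        stage1[rng.randrange(num_reducers)][key] += 1
    stage1_loads = [sum(partial.values()) for partial in stage1]
    # Round 2: partial counts for the same key are merged. In Hive this is the second job,
    # shuffled by key; each key now contributes at most num_reducers small records.
    final = Counter()
    for partial in stage1:
        final.update(partial)
    return final, stage1_loads

# Hypothetical data: one hot key with 9,000 rows plus 1,000 keys with one row each.
keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]
final, stage1_loads = two_stage_count(keys, num_reducers=10)

print("round 1 rows per reducer:", stage1_loads)
print("count for hot key:", final["hot"])
```

The first round is balanced because it ignores the key, and the result is still correct because counts can be merged. The cost is the extra job mentioned above.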

HQL statement optimization:

1. Small table join big table:

Place the small table on the left of the join to reduce the chance of an OOM (out of memory) error.

Use mapjoin; the small table should contain fewer than 1000 records.

select /*+ mapjoin(a) */ count(1) from tb_a a join tb_b b on a.uid = b.uid
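A map join avoids skew because no rows are shuffled by key: the small table is loaded into memory and the big table is streamed against it. The sketch below assumes an inner join on uid and represents each row by its uid only; the function name and sample values are hypothetical.

```python
def map_join_count(small_uids, big_uids):
    # Build an in-memory hash table from the small table: uid -> number of rows with that uid.
    small_index = {}
    for uid in small_uids:
        small_index[uid] = small_index.get(uid, 0) + 1
    # Stream the big table and probe the hash table. Each big-table row matches
    # as many small-table rows as share its uid, which gives count(1) of the join.
    return sum(small_index.get(uid, 0) for uid in big_uids)

# Hypothetical rows: the small table has uid 2 twice, the big table has uid 2 twice.
small = [1, 2, 2]
big = [2, 2, 3, 1]

print(map_join_count(small, big))
```

The whole small table must fit in memory on every mapper, which is why the technique applies only when one side is small.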

2. Big table join big table:

Replace null keys with a string plus a random number so that the skewed rows are spread across different reducers. Because null keys never match anything in the join, the final result is not affected by this processing.

select * from tb_a a left outer join tb_b b on (case when a.userid is null then concat('xxx', rand()) else a.userid end) = b.userid
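The effect of the random suffix on the shuffle can be checked with a simulation. This is a sketch under assumptions: `zlib.crc32` stands in for the key hash, `None` plays the role of NULL, and the row counts are invented. The prefix 'xxx' must not be the start of any real userid, otherwise a salted key could match a real row.

```python
import random
import zlib
from collections import Counter

def salted_key(userid, rng):
    # Mirrors: case when userid is null then concat('xxx', rand()) else userid end
    return f"xxx{rng.random()}" if userid is None else userid

def reducer_loads(join_keys, num_reducers):
    # Rows per reducer when the shuffle hashes the join key modulo the reducer count.
    loads = Counter(zlib.crc32(str(k).encode("utf-8")) % num_reducers for k in join_keys)
    return [loads.get(r, 0) for r in range(num_reducers)]

rng = random.Random(0)
# Hypothetical table: 9,000 rows with a null userid and 1,000 rows with distinct userids.
userids = [None] * 9000 + [f"u{i}" for i in range(1000)]
salted = [salted_key(u, rng) for u in userids]

before = reducer_loads(userids, 10)  # all null keys land on the same reducer
after = reducer_loads(salted, 10)    # salted keys are spread over the reducers

print("before:", before)
print("after: ", after)
```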

3. Data skew caused by joining columns of different data types. Convert the columns to the same type before the join:

select * from users a left outer join logs b on a.usr_id = cast(b.user_id as string)

4. count distinct optimization

Use group by with sum() or count() in place of count(distinct).

Original statement: select a, count(distinct b) as c from tbl group by a

After rewriting: select a, count(b) as c from (select distinct a, b from tbl) t group by a

In addition, handle null values separately when doing count distinct. If only the count distinct is being calculated, the nulls can simply be filtered out; add 1 to the final result only if null is meant to count as one distinct value, since count(distinct) itself ignores nulls. If there are other calculations that need a group by, first process the records with null values separately, and then union the result with the other results.
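The equivalence between the two statements can be checked on a small example. The sketch assumes rows are (a, b) pairs with `None` standing for NULL, and that the outer step counts only non-null b, which is how count(distinct b) treats nulls; the sample rows are invented.

```python
def count_distinct(rows):
    # select a, count(distinct b) from tbl group by a   (NULL b is ignored)
    seen = {}
    for a, b in rows:
        group = seen.setdefault(a, set())
        if b is not None:
            group.add(b)
    return {a: len(values) for a, values in seen.items()}

def count_via_dedup(rows):
    # Inner query: select distinct a, b from tbl
    distinct_pairs = set(rows)
    # Outer query: select a, count(b) ... group by a   (count(b) skips NULL)
    result = {}
    for a, b in distinct_pairs:
        result[a] = result.get(a, 0) + (0 if b is None else 1)
    return result

# Hypothetical rows, including duplicates and null values of b.
rows = [("x", 1), ("x", 1), ("x", 2), ("y", None), ("y", 3), ("z", None)]

print(count_distinct(rows))
print(count_via_dedup(rows))
```

The rewritten form helps because the deduplication step is shuffled on the pair (a, b) rather than on a alone, so the work for one large group can be shared by several reducers before the final count.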

Thank you for reading. That covers the causes of Hive data skew and the methods for optimizing against it; how well each method works in a specific case still needs to be verified in practice.
