
The Concept and Handling of Hive Data Skew

2025-01-21 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/02 Report

This article explains the concept of Hive data skew and how to handle it. The methods introduced here are simple, fast, and practical; interested readers may wish to follow along and learn them step by step.

① The concept and causes of data skew

1.1 What is data skew

Data skew is an uneven distribution of data across processing units: a few units receive a very large share of the data while the others receive very little. The lightly loaded units finish quickly, the heavily loaded ones run for a long time, and the overall task stalls waiting for them. This phenomenon is called data skew.

In a MapReduce job there are multiple reduce tasks. If one or a few of them receive a large amount of data while the others receive relatively little, the small reduces finish quickly, but the overloaded ones take a long time, and the whole job waits on them and cannot complete.

When running MR jobs, a common symptom is that the reduce progress stays stuck at 99% for a long time; this is most likely caused by data skew.

1.2 Causes of data skew

Uneven distribution of keys

As mentioned above, data skew arises because the amount of data per reduce differs greatly. The data each reduce receives is the result of partitioning, and partitioning is based on the hash of the key: each key is assigned to a partition according to its hash value and then sent to the corresponding reduce. If the keys are highly concentrated or identical, their hash values are the same, so they land in the same partition and the same reduce, which then has far too much data to process.

The characteristics of the business data itself

For example, if the field chosen as the key is highly concentrated in the business data itself, the result will inevitably be data skew.

There are other contributing factors, but they all come down to the same thing: the root cause of data skew is the uneven distribution of keys.

1.3 Symptoms of data skew

The task progress stays at 99% (or even 100%) for a long time. The task monitoring page shows that only a small number of reduce subtasks (one or a few) have not completed, because the amount of data they handle differs greatly from that of the other reduces.

The number of records in a single reduce differs from the average by a large factor, often 3x or more, and the longest-running reduce takes far longer than the average.

② Solutions for data skew

2.1 Setting parameters

hive.map.aggr = true: partial aggregation on the map side, equivalent to a Combiner.

hive.groupby.skewindata = true: load balancing when data is skewed. With this set to true, the generated query plan contains two MR jobs. In the first job, the map output is distributed randomly across the reduces; each reduce performs a partial aggregation and emits its result, so rows with the same Group By key may end up in different reduces, which balances the load. The second job then distributes the pre-aggregated results to the reduces by the Group By key (this time guaranteeing that rows with the same key go to the same reduce) and performs the final aggregation.
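As a minimal sketch, the two parameters above can be set per session before running the skewed aggregation (the table and column names here are hypothetical):

```sql
-- Enable map-side partial aggregation and skew-aware GROUP BY.
SET hive.map.aggr = true;
SET hive.groupby.skewindata = true;

-- With skewindata=true, this GROUP BY is compiled into two MR jobs:
-- a randomized partial aggregation, then the final aggregation by user_id.
SELECT user_id, COUNT(*) AS cnt
FROM   page_views        -- hypothetical table
GROUP BY user_id;
```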

2.2 SQL statement optimization

Small table joined with large table

Use a map join to load the small dimension table (fewer than about 1000 records) into memory, so the join is completed on the map side and the reduce phase is skipped entirely.
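A sketch of the map-join hint, assuming a hypothetical fact table `orders` and dimension table `dim_city`:

```sql
-- Hint Hive to broadcast the small dimension table to every mapper.
SELECT /*+ MAPJOIN(d) */
       f.order_id, d.city_name
FROM   orders f
JOIN   dim_city d ON f.city_id = d.city_id;
```

In newer Hive versions, `SET hive.auto.convert.join = true;` lets the optimizer convert such joins to map joins automatically when the small table fits in memory.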

Large table joined with large table

Replace NULL join keys with a string plus a random number, so the skewed rows are spread across different reduces. Because NULL keys never match anything in the join anyway, the final result is unaffected by this rewrite.
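The rewrite can be sketched like this (table and column names are hypothetical):

```sql
-- Spread NULL join keys across reducers by replacing them with distinct
-- random strings that are guaranteed not to match the right-hand side.
SELECT a.*, b.user_name
FROM   log a
LEFT JOIN users b
  ON CASE WHEN a.user_id IS NULL
          THEN concat('null_', rand())   -- unmatched by design
          ELSE a.user_id
     END = b.user_id;
```

Without the CASE expression, all NULL-keyed rows would hash to the same reduce; with it, they are scattered evenly while still producing NULL on the right side, as a left join requires.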

count distinct over a large number of identical special values

When using count distinct, handle NULL (or other special) values separately. If you are only computing the count distinct, you can simply filter those values out and add 1 to the final result. If other aggregations are needed and you must group by, first process the records with empty keys separately, then union the result with the other computations.
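For the simple case, a sketch (hypothetical table and column; the `+ 1` accounts for the filtered-out special value, assuming it occurs at least once):

```sql
-- Exclude the skew-causing empty keys before the shuffle,
-- then add 1 back for the single filtered-out value.
SELECT COUNT(DISTINCT user_id) + 1 AS uv
FROM   page_views
WHERE  user_id IS NOT NULL AND user_id <> '';
```

The point of the WHERE clause is that the skewed rows are dropped before partitioning, so no single reduce receives the flood of identical special values.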

Handling special cases separately

When optimizing the business logic does not help much, the skewed data can sometimes be extracted and processed separately, with the results unioned back at the end.
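A sketch of the split-and-union pattern, assuming a hypothetical table `events` with one known hot key:

```sql
-- Aggregate the non-skewed keys normally, handle the hot key in its own
-- query (which touches only one key), then UNION ALL the results.
SELECT key, COUNT(*) AS cnt
FROM   events
WHERE  key <> 'hot_key'
GROUP BY key

UNION ALL

SELECT 'hot_key' AS key, COUNT(*) AS cnt
FROM   events
WHERE  key = 'hot_key';
```

The hot-key branch has no GROUP BY skew to suffer from, since it filters down to a single key before aggregating.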

At this point, I believe you have a deeper understanding of the concept and handling of Hive data skew. You might as well try these techniques in practice, and keep learning!

