This article explains the main causes of data skew in Hive and how to deal with them. The content is straightforward and easy to follow; I hope it helps you resolve your doubts about why data skew happens in Hive and what can be done about it.
Hive Data Skew Summary
When optimizing the Shuffle stage, data skew is a recurring problem, and it can make optimizations look ineffective. The main reason is that the Counters collected after a Job completes are totals for the whole Job, and optimization is based on the averages of those Counters; when skew makes the amount of data processed by each map differ widely, those averages lose much of their meaning. Hive executes in stages, and the amount of data each reduce processes depends on the map output of the previous stage, so distributing data evenly across the reduces is the root of solving data skew. Avoiding skew up front is more effective than fixing it after the fact. After reviewing the available material, the findings are summarized below.
1 Causes of data skew

1.1 Operations:
Keyword | Situation | Consequence
Join | One of the tables is small, but its keys are concentrated | The data sent to one or a few Reducers is far above the average
Join | Large table joined with large table, but the join key has too many 0 or null values | All of those null values are handled by a single reduce, which is often slow
Group by | The group-by dimension is too small and some value occurs very often | The reduce that processes that value is often time-consuming
Count Distinct | Too many rows share a special value | The reduce that processes the special value takes a long time
1.2 Causes:
1) Uneven key distribution
2) Characteristics of the business data itself
3) Poor choices made when the table was created
4) Some SQL statements are inherently prone to data skew
1.3 Symptoms:
The task progress stays at 99% (or 100%) for a long time. The task monitoring page shows that only a small number of reduce subtasks (one or a few) have not finished, because the amount of data they handle differs greatly from that of the other reduces.
The number of records in a single reduce differs from the average by a wide margin, often 3x or more, and its running time is far longer than the average.
2 Solutions to data skew

2.1 Parameter tuning:
hive.map.aggr=true
Performs partial aggregation on the map side, equivalent to a Combiner.
hive.groupby.skewindata=true
Performs load balancing when the data is skewed. With this option set to true, the generated query plan has two MR Jobs. In the first MR Job, the map output is distributed randomly to the Reduces; each Reduce does a partial aggregation and emits its result, so rows with the same Group By Key may end up on different Reduces, which achieves the load balancing. The second MR Job then distributes the pre-aggregated results to the Reduces by Group By Key (this guarantees that the same Group By Key lands on the same Reduce) and completes the final aggregation.
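A minimal sketch of how these two settings might be applied before a skewed aggregation; the logs table and its columns are hypothetical, not from the article:

set hive.map.aggr=true;            -- partial aggregation on the map side (Combiner-like)
set hive.groupby.skewindata=true;  -- two-job plan that balances the skewed group by

select user_id, count(*) as pv
from logs
group by user_id;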
2.2 SQL statement tuning:
How to join:
When choosing the driver table, pick the one whose join key is most evenly distributed.
Do column pruning and filtering early, so that the amount of data is already relatively small by the time the two tables are joined (see the sketch below).
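A minimal sketch of pruning and filtering before a join; the table, partition, and column names are hypothetical:

select a.user_id, a.event_time, b.user_name
from (select user_id, event_time from log where dt = '2023-01-01') a   -- filter and prune early
join (select user_id, user_name from users) b                          -- keep only the needed columns
on a.user_id = b.user_id;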
Small table joined with large table:
Use map join to load the small dimension table (fewer than about 1000 records) into memory in advance and complete the join on the map side, eliminating the reduce phase.
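A minimal sketch of a map join hint; the orders fact table and dim_city dimension table are hypothetical:

select /*+ mapjoin(c) */ o.order_id, c.city_name
from orders o
join dim_city c on o.city_id = c.city_id;   -- dim_city is small enough to be cached in memory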
Large table joined with large table:
Turn the null-valued keys into a string plus a random number, which spreads the skewed data across different reduces. Since null keys cannot match anything, the final result is unaffected (see the SQL in scenario 3.1 below).
Count distinct with a large number of identical special values:
When computing count distinct, handle the null values separately. For a plain count distinct, the nulls can simply be filtered out and 1 added to the final result. If other aggregations are involved and a group by is needed, first process the records with null values separately, then union the result with the other computations.
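A minimal sketch of the plain count distinct case, assuming a hypothetical logs table that contains at least one row with a null user_id:

select count(distinct user_id) + 1 as uv   -- add 1 back for the null group
from logs
where user_id is not null;                 -- keep the nulls away from the distinct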
The group by dimension is too small:
Use a sum() over a group by subquery in place of count(distinct) to complete the calculation.
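A minimal sketch of that rewrite, again on a hypothetical logs table: the inner group by deduplicates the keys across many reduces, and the outer sum just counts the groups:

select sum(1) as uv
from (select user_id from logs group by user_id) t;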
Special handling for special cases:
When optimizing the business logic does not help much, the skewed data can sometimes be pulled out and processed separately, with the results union-ed back together at the end, as sketched below.
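A minimal sketch, assuming a single hypothetical hot key '12345' in a log/users join (the user_name column is also hypothetical):

select /*+ mapjoin(b) */ a.*, b.user_name
from (select * from log where user_id = '12345') a    -- skewed key handled alone via map join
join users b on a.user_id = b.user_id
union all
select a.*, b.user_name
from (select * from log where user_id <> '12345') a   -- the remaining keys take the normal path
join users b on a.user_id = b.user_id;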
3 Typical business scenarios

3.1 Data skew caused by null values
Scenario: logs often suffer from missing information, such as a missing user_id. If you join the user_id in the log against the user_id in the users table, you run into data skew.
Solution 1: rows with a null user_id do not participate in the join:

select * from log a
join users b
on a.user_id is not null
and a.user_id = b.user_id
union all
select * from log a
where a.user_id is null;

(Note that for union all the two branches must produce matching select lists; in practice you would enumerate the columns explicitly.)
Solution 2: assign new key values to the nulls:

select * from log a
left outer join users b
on case when a.user_id is null then concat('hive', rand())
else a.user_id end = b.user_id;
Conclusion: method 2 is more efficient than method 1, with less I/O and fewer tasks. In solution 1, log is read twice and there are 2 jobs; in solution 2 there is only 1 job. This optimization suits skew caused by invalid ids (such as -99, '', null). Turning the null key into a string plus a random number spreads the skewed data across different reduces and resolves the skew.
3.2 Data skew caused by joining different data types
Scenario: the user_id field in the users table is int, while the user_id field in the log table contains both string and int values. When the two tables are joined on user_id, the default hash partitioning is done on the int ids, so every record with a string id ends up on a single Reducer.
Solution: convert the numeric type to string:

select * from users a
left outer join logs b
on a.user_id = cast(b.user_id as string);
3.3 Using map join to solve skew
Map join solves data skew when a small table (few records) is joined with a large table, and it is used very frequently. But when the "small" table is so large that map join produces bugs or exceptions, special handling is required. For example:

select * from log a
left outer join users b
on a.user_id = b.user_id;

The users table has over 6 million records. Distributing users to every map is a lot of overhead, and map join does not support a "small" table of that size; yet an ordinary join runs into the data skew problem.
Solution:

select /*+ mapjoin(x) */ *
from log a
left outer join (
select /*+ mapjoin(c) */ d.*
from (select distinct user_id from log) c
join users d on c.user_id = d.user_id
) x
on a.user_id = x.user_id;   -- join back against the pruned users subset x
If log contains millions of distinct user_id values, this runs back into the original map join problem. Fortunately, there are not that many daily active members: members with transactions, with clicks, with commissions, and so on are each small sets. So this method solves the data skew problem in many scenarios.
4 Summary
Distributing the map output evenly across the reduces is our ultimate goal. Because of the limitations of the hash algorithm, hashing by key will always cause some degree of data skew. A great deal of experience shows that data skew comes from human oversight in table design or can be avoided in the business logic. The more general steps are given here:
1. Sample the log table to find which user_id values are skewed, producing a result table tmp1 (see the sketch after this list). The computing framework does not know the distribution of the data, so sampling is indispensable.
2. Data distribution follows the rules of sociological statistics: wealth is unevenly spread. There will not be many skewed keys, just as a society has few very rich people and few outliers, so tmp1 will contain very few records. Map join tmp1 with users to generate tmp2, and push tmp2 out through the distributed file cache. This is a map-only process.
3. The maps read both users and log. If a record comes from log, check whether its user_id is in tmp2: if so, write it to a local file a; otherwise emit a key/value pair. If the record comes from users, also emit a key/value pair; these pairs enter the reduce phase.
4. Finally, merge file a with the output files of the Stage 3 reduce phase and write the result to HDFS.
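A minimal sketch of step 1 only; the 10 percent sample rate and the 100000 threshold are hypothetical assumptions, and whether block sampling applies depends on the file format and Hive version:

create table tmp1 as
select user_id, count(*) as cnt
from log tablesample(10 percent)   -- block sampling over input splits
group by user_id
having count(*) > 100000;          -- hypothetical skew threshold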
If you confirm that the business genuinely requires such skewed logic, consider the following optimizations:
1. For joins, when the small table is judged to be no larger than 1G, use map join.
2. For group by or distinct, set hive.groupby.skewindata=true.
3. Apply the SQL statement tuning described above wherever possible.