What are the causes of Hive data skew?

This article mainly explains what causes data skew in Hive and how to deal with it. The content is simple and clear; I hope it helps you resolve your doubts about "what are the causes of Hive data skew?"

1 Causes of data skew

1.1 Operations:

Keyword | Situation | Consequence
Join | One of the tables is small, but its keys are concentrated | The data sent to one or a few Reduces is far higher than the average
Join | Both tables are large, but the join field has too many 0 or null values | All of these null values are handled by a single Reduce, which is usually very slow
Group by | The group-by dimension is too low and one value accounts for far too many rows | The Reduce that processes that value is extremely time-consuming
Count Distinct | Too many rows share one special value | The Reduce that processes this special value takes a long time

1.2 Root causes:

1) Uneven key distribution
2) Characteristics of the business data itself
3) Poor choices when the tables were designed
4) Some SQL statements inherently produce data skew

1.3 Symptoms:

The task progress stays at 99% (or 100%) for a long time. The task monitoring page shows that only a small number of reduce subtasks (one or a few) have not finished, because the amount of data they handle differs greatly from that of the other reduces.

The number of records handled by a single reduce differs from the average by a large margin, usually three times or more, and its running time is far longer than the average.

2 Solutions to data skew

2.1 Parameter tuning:

hive.map.aggr = true

Performs partial aggregation on the Map side, equivalent to a Combiner.

hive.groupby.skewindata = true

Performs load balancing when the data is skewed. When this option is set to true, the generated query plan contains two MR jobs. In the first MR job, the map output is distributed randomly to the reduces, and each reduce performs a partial aggregation and emits its result; rows with the same Group By key may therefore end up in different reduces, which balances the load. The second MR job then distributes the pre-aggregated results to the reduces by the Group By key (this guarantees that rows with the same Group By key go to the same reduce) and completes the final aggregation.
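As a minimal sketch (assuming a log table with a user_id column; both names are illustrative), these parameters are typically set for the session before running the group-by query that skews:

-- illustrative session settings followed by a group-by that may skew on user_id
set hive.map.aggr=true;
set hive.groupby.skewindata=true;
select user_id, count(*) as pv
from log
group by user_id;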

2.2 SQL statement tuning:

How to Join:

When choosing the driver table, pick the table whose join key is most evenly distributed as the driver table.

Apply column pruning and filtering early, so that the amount of data is already relatively small when the two tables are joined; a sketch of this idea follows below.
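For example, a sketch of column pruning and filter pushdown (the column and filter values are illustrative assumptions): select only the needed columns and filter each side inside a subquery, so the join inputs are already small.

-- prune columns and filter each side before the join
select a.user_id, a.url, b.user_name
from (select user_id, url from log where dt = '2024-01-01') a
join (select user_id, user_name from users where status = 'active') b
  on a.user_id = b.user_id;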

Small table joined with a large table:

Use a map join to load the small dimension table (fewer than 1000 records) into memory in advance and complete the join on the map side, so no reduce step is needed.
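A minimal sketch, assuming a small dimension table dim_city (table and column names are illustrative): the small table is broadcast to the mappers, so the join finishes without a reduce phase.

-- broadcast the small dimension table with a map join hint
select /*+ mapjoin(b) */ a.user_id, b.city_name
from log a
join dim_city b
  on a.city_id = b.city_id;
-- newer Hive versions can also convert a small-table join automatically:
set hive.auto.convert.join=true;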

Large table joined with a large table:

Turn the null-valued keys into a string plus a random number so that the skewed data is spread across different reduces. Since null values cannot be joined anyway, the final result is not affected; a concrete example is shown in section 3.1 below.

Count distinct with a large number of identical special values:

When computing count distinct, handle the null values separately. If only the count distinct itself is needed, the nulls can be filtered out directly and 1 added to the final result. If other metrics also need to be computed with a group by, process the records with null values separately first and then union the result with the other calculations. A sketch follows below.
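A sketch of the first case (log and user_id are illustrative): drop the nulls from the count distinct and add 1 back for the null group, assuming null user_id rows exist.

-- only the distinct count is needed: filter nulls, then add 1 back
select count(distinct user_id) + 1 as uv
from log
where user_id is not null;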

The group-by dimension is too low:

Use sum() with group by instead of count(distinct) to complete the calculation, as sketched below.
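A sketch of the rewrite (dt and user_id are illustrative column names): the inner query deduplicates with group by, and the outer query finishes the count with sum().

-- count(distinct user_id) per dt, rewritten as group by + sum(1)
select dt, sum(1) as uv
from (
  select dt, user_id
  from log
  group by dt, user_id
) t
group by dt;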

Special handling for special cases:

When optimizing the business logic does not help much, the skewed data can sometimes be extracted and processed separately, with the results unioned back in at the end; see the sketch below.
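A sketch of this idea, assuming a single hypothetical hot key 'hot_user' and an illustrative users.user_name column: join the hot key with a map join, join the remaining keys normally, then union the two results.

-- skewed key handled with a broadcast join, everything else with a normal join
select /*+ mapjoin(b) */ a.user_id, b.user_name
from log a
join users b on a.user_id = b.user_id
where a.user_id = 'hot_user'
union all
select a.user_id, b.user_name
from log a
join users b on a.user_id = b.user_id
where a.user_id <> 'hot_user';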

3 Typical business scenarios

3.1 Data skew caused by null values

Scenario: logs often suffer from missing information, for example a missing user_id. If the user_id in the log is joined with the user_id in the users table, you will run into data skew.

Solution 1: rows with a null user_id do not participate in the join

select * from log a
join users b
  on a.user_id is not null and a.user_id = b.user_id
union all
select * from log a
where a.user_id is null;

Solution 2: assign a new key value to the null rows

select *
from log a
left outer join users b
  on (case when a.user_id is null then concat('hive', rand()) else a.user_id end) = b.user_id;

Conclusion: method 2 is more efficient than method 1, with less IO and fewer tasks. In solution 1, log is read twice and two jobs are run; in solution 2 there is only one job. This optimization suits skew caused by invalid ids (such as -99, '', null, and so on). By turning the null key into a string plus a random number, the skewed data is spread across different reduces, which resolves the data skew.

3.2 Data skew caused by joining fields of different data types

Scenario: the user_id field in the users table is an int, while the user_id field in the log table contains both string and int values. When the two tables are joined on user_id, the default hash partitioning is done on the int id, so all records with a string id are sent to a single Reducer.

Solution: cast the numeric type to a string

select *
from users a
left outer join logs b
  on a.user_id = cast(b.user_id as string);

3.3 The small table is neither small nor large: how to solve the skew with map join

Use a map join to solve data skew when a small table (with few records) is joined with a large table. This method is used very frequently, but if the small table is so large that the map join produces bugs or exceptions, special handling is required. For example:

select * from log a
left outer join users b
  on a.user_id = b.user_id;

The users table has more than 6 million records, so distributing users to every map is a lot of overhead, and map join does not support such a large "small" table. An ordinary join, however, runs into the data skew problem.

Solution:

select /*+ mapjoin(x) */ *
from log a
left outer join (
  select /*+ mapjoin(c) */ d.*
  from (select distinct user_id from log) c
  join users d
    on c.user_id = d.user_id
) x
  on a.user_id = x.user_id;

If log contains millions of distinct user_ids, this falls back to the original map join problem. Fortunately, the number of daily active members is usually not that large: members with transactions, clicks, commissions, and so on are limited, so this method solves the data skew problem in many scenarios.

That is everything in this article on the causes of Hive data skew. Thank you for reading! I hope the content shared here has been helpful; if you want to learn more, feel free to follow the industry information channel.
