16. Hive data skew and solution

2025-07-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

Data skew

1. What is data skew

Data skew occurs when data is unevenly distributed, so that a large amount of data is concentrated on a single point and creates a data hotspot.

2. Symptoms of data skew

During execution, the progress of the job stays at about 99% for a long time. The task monitoring page shows that only a few reduce subtasks (one or several) have not finished, because the amount of data they handle differs too much from that of the other reducers. The number of records in a single reducer differs greatly from the average, usually by a factor of 3 or more, and the longest reducer run time is much longer than the average.

3. Cases of data skew
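The "3 times the average" symptom can be reproduced with a small simulation. The sketch below is plain Python, not Hive code: the sha256-based partitioner is only a stand-in for Hive's own hash, and the workload (one hot key among 1,000 ordinary keys) is hypothetical.

```python
import hashlib
from collections import Counter


def reducer_for(key, num_reducers):
    """Pick a reducer by hashing the key (sha256 stands in for Hive's own hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_reducers


def records_per_reducer(keys, num_reducers):
    """Count how many records each reducer receives."""
    counts = Counter(reducer_for(k, num_reducers) for k in keys)
    return [counts.get(r, 0) for r in range(num_reducers)]


def skewed_reducers(counts, factor=3.0):
    """Return the reducers whose record count is at least `factor` times the average."""
    average = sum(counts) / len(counts)
    return [r for r, n in enumerate(counts) if n >= factor * average]


# Hypothetical workload: 1,000 users with one record each, plus one hot key.
keys = [f"user_{i}" for i in range(1000)] + ["hot_user"] * 9000
counts = records_per_reducer(keys, num_reducers=10)
print(counts)
print(skewed_reducers(counts))
```

Only the reducer that receives the hot key crosses the threshold; it is the one that would keep the job at 99% while the others have already finished.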

4. Causes of data skew

1) Uneven distribution of keys
2) Characteristics of the business data itself
3) Insufficient consideration when the tables were designed
4) Some SQL statements inherently produce data skew

5. Solutions to data skew

5.1 Map-side aggregation

-- Partial aggregation on the map side, equivalent to a Combiner
set hive.map.aggr=true;
-- Load balancing when the data is skewed
set hive.groupby.skewindata=true;

When hive.groupby.skewindata is set to true, the generated query plan has two MR jobs. In the first job, the map output is distributed randomly to the reducers; each reducer performs a partial aggregation and outputs the result. Rows with the same group by key may therefore go to different reducers, which balances the load. The second job distributes the pre-aggregated results to the reducers by group by key (this guarantees that the same group by key reaches the same reducer) and completes the final aggregation.

5.2 SQL statement tuning
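As a reference point for the tuning techniques in this section, the two-job plan described in 5.1 can be simulated in a few lines. This is a sketch in plain Python under stated assumptions, not Hive's implementation: the function names, the sha256-based partitioner, and the workload are all hypothetical, and the aggregation is a simple count.

```python
import hashlib
import random
from collections import Counter


def partition(key, num_reducers):
    """Route a key to a reducer by hash (sha256 stands in for Hive's own hash)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_reducers


def single_job_load(keys, num_reducers):
    """One job: every record is shuffled by its group by key."""
    load = [0] * num_reducers
    for key in keys:
        load[partition(key, num_reducers)] += 1
    return load


def two_job_count(keys, num_reducers, seed=0):
    """Two jobs: random distribution with partial counts, then final counts by key."""
    rng = random.Random(seed)
    # Job 1: map output goes to a random reducer, which aggregates partially.
    partials = [Counter() for _ in range(num_reducers)]
    for key in keys:
        partials[rng.randrange(num_reducers)][key] += 1
    job1_load = [sum(p.values()) for p in partials]
    # Job 2: partial results are shuffled by key and merged into final counts.
    totals = Counter()
    job2_load = [0] * num_reducers
    for partial in partials:
        for key, count in partial.items():
            job2_load[partition(key, num_reducers)] += 1
            totals[key] += count
    return totals, job1_load, job2_load


# Hypothetical workload: one hot key with 9,000 rows and 1,000 ordinary keys.
keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]
totals, job1_load, job2_load = two_job_count(keys, num_reducers=10)
print(max(single_job_load(keys, 10)), max(job1_load), max(job2_load))
```

In the single-job case one reducer receives at least 9,000 records. In the two-job case the first job spreads them roughly evenly, and the second job only has to move a handful of partial results per key.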

How to Join

For the driver table, choose the table whose join key is most evenly distributed. Apply column pruning and filters first, so that the amount of data is relatively small by the time the two tables are joined.

Joining a large table with a small table

Use a map join to load small dimension tables (fewer than 1,000 records) into memory ahead of time, so that the join is completed on the map side and no reduce phase is needed.
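The idea behind a map join can be sketched as an in-memory hash join. The Python below is an illustration only, with a hypothetical function name and hypothetical sample rows; it shows why no shuffle by key is needed once the small table fits in memory.

```python
def map_join(large_rows, small_rows, key):
    """Inner-join two lists of dicts on `key` by hashing the small side in memory."""
    # Build phase: the small dimension table is loaded into memory once.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # Probe phase: each large-table row is matched locally, with no shuffle by key.
    for row in large_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}


# Hypothetical sample data: a small users table and a larger log table.
users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
logs = [
    {"user_id": 1, "url": "/a"},
    {"user_id": 1, "url": "/b"},
    {"user_id": 3, "url": "/c"},
]
for joined in map_join(logs, users, "user_id"):
    print(joined)
```

Because every mapper holds the whole small table, a hot key in the large table never has to be gathered on one reducer.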

Joining a large table with a large table

Replace null keys with a string plus a random number, so that the skewed rows are spread across different reducers. Because null keys never match in the join anyway, the final result is not affected.
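The effect of replacing null keys can be seen in a small simulation. This is a sketch in plain Python, assuming that all null keys are routed to one reducer (modelled here as reducer 0) and using a sha256-based partitioner as a stand-in for Hive's hash; the helper names and the workload are hypothetical.

```python
import hashlib
import random


def partition(key, num_reducers):
    """Null keys all share reducer 0 (assumption); other keys are hashed."""
    if key is None:
        return 0
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_reducers


def salt_null_key(key, rng):
    """Replace a null key with 'hive' plus a random number; keep other keys as-is."""
    if key is None:
        return f"hive{rng.random()}"
    return key


def reducer_load(keys, num_reducers):
    """Count the records that each reducer receives."""
    load = [0] * num_reducers
    for key in keys:
        load[partition(key, num_reducers)] += 1
    return load


# Hypothetical log: 9,000 rows without a user_id and 1,000 rows with one.
rng = random.Random(0)
keys = [None] * 9000 + [str(i) for i in range(1000)]
salted = [salt_null_key(k, rng) for k in keys]
print(max(reducer_load(keys, 10)), max(reducer_load(salted, 10)))
```

The salted keys start with "hive", so they cannot collide with a real numeric id, and the rows they belong to still find no match in the join.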

Count distinct with many identical special values

When computing count distinct, handle the empty value separately. If count distinct is the only calculation, the empty values can simply be filtered out beforehand, and 1 is added to the final result if the empty value should count as one distinct value. If there are other calculations that require group by, first process the records with empty values separately, then union them with the other results.

Group by dimension is too small

Use sum() with group by in place of count(distinct) to complete the calculation.
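Why this rewrite helps can be sketched in Python. A query of the shape select sum(1) from (select col from t group by col) x lets many reducers deduplicate in parallel before the final sum. The simulation below is illustrative only: the function names and data are hypothetical, and sha256 stands in for Hive's hash.

```python
import hashlib


def partition(value, num_reducers):
    """Route a value to a reducer by hash (sha256 stands in for Hive's own hash)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_reducers


def count_distinct_via_group_by(values, num_reducers):
    """Count distinct values as a group by stage followed by a sum stage."""
    # Stage 1 (group by): rows are shuffled by value, so each reducer
    # deduplicates only its own share of the values.
    groups = [set() for _ in range(num_reducers)]
    for value in values:
        groups[partition(value, num_reducers)].add(value)
    # Each reducer emits one row per group, which the outer sum(1) adds up.
    partial_sums = [len(group) for group in groups]
    # Stage 2 (sum): only one small number per reducer is left to combine.
    return sum(partial_sums)


# Hypothetical data: 10,000 rows over 500 distinct user ids.
values = [f"u{i % 500}" for i in range(10000)]
print(count_distinct_via_group_by(values, num_reducers=10))
```

The result matches a direct distinct count because equal values always reach the same reducer in stage 1, so no value is counted twice.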

Handling special cases separately

When optimizing the business logic does not help much, the skewed data can sometimes be extracted and processed separately, then merged back with union.

5.3 Typical business scenarios

Data skew caused by null values

In scenarios such as logs, information is often missing, for example the user_id field in the log. Joining the log's user_id with the user_id in the users table then runs into data skew, because all rows with a null user_id end up on the same reducer.

Solution 1: rows with a null user_id do not take part in the join

-- a.* is selected in both branches so that the two sides of union all have the same columns
select a.*
from log a
join users b
  on a.user_id is not null
 and a.user_id = b.user_id
union all
select a.*
from log a
where a.user_id is null;

Solution 2: give null values a new key

select *
from log a
left outer join users b
  on case when a.user_id is null then concat('hive', rand()) else a.user_id end = b.user_id;

Data skew caused by joining different data types

Scenario: the user_id field in the users table is an int, while the user_id field in the log table contains both string and int values. When the two tables are joined on user_id, the default hash operation distributes rows by the int id, so all records with a string id are assigned to a single reducer.

Solution

Convert the numeric type to a string type:

-- users.user_id is the int column, so it is the side that is cast
select *
from users a
left outer join logs b
  on cast(a.user_id as string) = b.user_id;
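The scenario can be modelled in a few lines of Python. This is a sketch of the behaviour described above, not of Hive's internals: it assumes that every id which cannot be read as an int collapses onto one partition key (reducer 0 here), and it uses sha256 as a stand-in hash for string keys. All names and data are hypothetical.

```python
import hashlib


def reducer_when_compared_as_int(user_id, num_reducers):
    """Assumed model of the scenario: ids that are not valid ints share one reducer."""
    try:
        return int(user_id) % num_reducers
    except (TypeError, ValueError):
        return 0  # every non-numeric id collapses onto the same reducer


def reducer_when_compared_as_string(user_id, num_reducers):
    """After casting to string, every id is hashed as text."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_reducers


def reducer_load(user_ids, route, num_reducers):
    """Count the records that each reducer receives under a routing function."""
    load = [0] * num_reducers
    for user_id in user_ids:
        load[route(user_id, num_reducers)] += 1
    return load


# Hypothetical log: 1,000 numeric ids and 9,000 string ids.
user_ids = list(range(1000)) + [f"guest_{i}" for i in range(9000)]
print(reducer_load(user_ids, reducer_when_compared_as_int, 10))
print(reducer_load(user_ids, reducer_when_compared_as_string, 10))
```

With the int comparison one reducer receives all 9,000 string ids; once both sides are strings the rows spread roughly evenly.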
