Data skew in Hive

2025-07-19 Update From: SLTechnology News&Howtos

Data skew arises from the way Hive distributes rows: each row is sent to an execution node according to the hash of its key, so all rows with the same key land on the same node. When some keys carry far more rows than others, the nodes that receive them run much longer than the rest, and the whole job takes longer to finish.
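Before tuning anything, it can help to confirm which keys are actually heavy. The query below is a minimal sketch; `access_log` and `user_id` are hypothetical names standing in for whatever table and join or group key are suspected of skew.

```sql
-- Hypothetical table and column names, for illustration only.
-- Lists the keys that carry the most rows; a few keys with counts far
-- above the rest are the likely cause of skew.
SELECT user_id,
       count(*) AS row_cnt
FROM access_log
GROUP BY user_id
ORDER BY row_cnt DESC
LIMIT 20;
```

If the top few keys dominate the row counts, the fixes for special values and null values described later in the article are the ones to look at first.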

SQL executed in Hive runs in two stages, map and reduce. Skew in the map stage arises mainly when data is read from disk into memory; skew in the reduce stage arises mainly in join, group by, and count distinct. Each of these operations has its own way of avoiding skew.

I. Map stage

1. The input files are unevenly sized and include many small files, so some maps read and process far more data than others.

In this case, adjust parameters so that the small files no longer leave each map with an uneven amount of data, for example mapred.max.split.size=256000000 (the maximum amount of input each map handles; raising it reduces the number of maps).
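As a sketch, the setting above could be applied at the start of a Hive session as follows. The `hive.input.format` line is an assumption that is not in the original text: small files are only merged into shared splits when a combining input format is in use, and many Hive releases already default to it. On newer Hadoop versions the split-size property is named `mapreduce.input.fileinputformat.split.maxsize`.

```sql
-- Value taken from the text above: cap each input split at about 256 MB.
SET mapred.max.split.size=256000000;

-- Assumption (not from the article): make sure small files can be combined
-- into one split. This is already the default in many Hive releases.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
```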

II. Reduce stage

1. join

Skew during a join arises in two situations:

(1) Skew when joining a small table with a large table

In this case, use a hint such as /*+ MAPJOIN(a) */ to load the small table entirely into memory and then scan the large table sequentially to complete the join. mapjoin has usage restrictions: the secondary table of the join must be the small one (the secondary table is the right table in a left join and the left table in a right join), and the small table can be at most 2 GB.
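A minimal sketch of the hint in a query, assuming a large `orders` table and a small `city_dim` table (both names are hypothetical):

```sql
-- Hypothetical tables: orders is large, city_dim is small enough for memory.
-- The hint names the alias of the table to be loaded into memory.
SELECT /*+ MAPJOIN(c) */
       o.order_id,
       c.city_name
FROM orders o
JOIN city_dim c
  ON o.city_id = c.city_id;
```

Because the join is completed on the map side, no rows are shuffled by key and the reduce-side skew does not occur. Depending on the Hive release, hints may be ignored unless `hive.ignore.mapjoin.hint=false` is set; alternatively, `hive.auto.convert.join=true` lets Hive convert the join to a map join on its own when the small table is under its size threshold.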

(2) Skew when joining two large tables

This case requires analyzing the specific cause:

Data skew caused by some special values

Parameter setting method: with hive.optimize.skewjoin=true, Hive does not process the special values that cause the skew in the main join; it writes them to HDFS and then starts a new mapjoin to handle them. A threshold parameter determines which keys count as special values: for example, with hive.skewjoin.key=10000, any key with more than 10000 rows is treated as a special value.
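The two parameters above might be set for a session like this. The join itself is left unchanged; `orders` and `users` are hypothetical table names.

```sql
-- Enable runtime skew handling for joins.
SET hive.optimize.skewjoin=true;
-- Threshold from the text above: a key with more than 10000 rows
-- is treated as a special (skewed) value.
SET hive.skewjoin.key=10000;

-- Hypothetical large-to-large join that benefits from the settings.
SELECT o.order_id,
       u.user_name
FROM orders o
JOIN users u
  ON o.user_id = u.user_id;
```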

Special values can also be handled in the SQL itself: process the special values and the remaining values in separate queries and splice the results together with union all, although this increases IO.

Data skew caused by null values

Replace the null key with a string plus a random number so that the null rows are spread across reducers; the SQL-level handling of special values described above can be used as well.

Data skew caused by joins on different data types

For example, when an int user id is joined with a string user id, Hive hashes on the int type by default, and all rows whose id is a string are sent to a single reducer. In this case, convert the int column to string before joining.
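The three SQL-level fixes described above (separate handling of a special value with union all, randomized null keys, and casting to a common type) can be sketched as follows. All table and column names are hypothetical, and the hot key `0` is an assumed example. In `orders` and `users` the `user_id` column is taken to be int, while in `access_log` it is taken to be string.

```sql
-- (a) Special value handled separately, then spliced with UNION ALL.
-- Assumption: user_id = 0 is the hot key. The two branches scan orders
-- twice, which is the extra IO mentioned in the text.
SELECT t.order_id, t.user_name
FROM (
  SELECT o.order_id, u.user_name
  FROM orders o
  JOIN users u ON o.user_id = u.user_id
  WHERE o.user_id <> 0
  UNION ALL
  SELECT o.order_id, u.user_name
  FROM orders o
  JOIN users u ON o.user_id = u.user_id
  WHERE o.user_id = 0
) t;

-- (b) Null keys replaced by a string plus a random number, so the null
-- rows hash to different reducers instead of a single one. The random
-- keys match nothing in users, so the LEFT JOIN result is unchanged.
SELECT l.log_id, u.user_name
FROM access_log l
LEFT JOIN users u
  ON CASE WHEN l.user_id IS NULL
          THEN concat('null_', cast(rand() AS string))
          ELSE l.user_id
     END = cast(u.user_id AS string);

-- (c) Type mismatch: cast the int side to string before joining.
SELECT l.log_id, u.user_name
FROM access_log l
JOIN users u
  ON l.user_id = cast(u.user_id AS string);
```

In (a), the hot-key branch joins against a single `users` row, so it is a natural candidate for the mapjoin approach from the small-table case.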

2. group by + count distinct

When a query combines group by with count distinct, first deduplicate on the group by fields together with the field being counted, and then count.
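A minimal sketch of this rewrite, assuming a hypothetical `access_log` table with `dt` and `user_id` columns:

```sql
-- Skew-prone form, shown for comparison only:
-- SELECT dt, count(DISTINCT user_id) AS uv FROM access_log GROUP BY dt;

-- Rewritten form: deduplicate first, then count.
-- The inner GROUP BY spreads the deduplication over (dt, user_id) pairs;
-- the outer query only counts the rows that are already distinct.
SELECT d.dt,
       count(d.user_id) AS uv   -- count(col) skips NULL, like count(DISTINCT col)
FROM (
  SELECT dt, user_id
  FROM access_log
  GROUP BY dt, user_id
) d
GROUP BY d.dt;
```

The rewrite costs an extra stage, so it is worth applying mainly when a few `dt` values hold most of the rows.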
