Hive advanced operations (optimization, data skew optimization)


Thursday, 2019-2-21


Partition tables / bucket tables, skew, map join // see the notes on basic Hive syntax

Row-column conversion

Hive optimization

Hive optimization approach

Using EXPLAIN

Classic case (distinct count)

Causes of data skew

Typical operations, the situations that cause skew, and the consequences:

1. Join: one of the tables is small but its keys are concentrated, so the data sent to one or a few Reduce tasks is far above the average.

2. Join: large table with large table, where the join field has too many zero or null values; these all land on a single reduce, which is extremely slow.

3. Group by: the group-by dimension is too small and one value is heavily over-represented, so the reduce that handles this value is very time-consuming.

4. Count distinct: a special value occurs too often, so the reduce that handles this special value is very time-consuming.

Summary of the reasons:

1) Uneven key distribution.

2) Characteristics of the business data itself.

3) Poor decisions when designing the table.

4) Certain SQL statements that inherently produce data skew.

Symptoms:

The task progress stays at 99% (or 100%) for a long time. The task monitoring page shows that only a small number (one or a few) of reduce subtasks have not finished, because the amount of data they handle differs greatly from that of the other reduces. The record count of a single reduce deviates from the average by a factor of 3 or more, and its running time is far longer than the average.

Solutions to data skew

1. Parameter adjustment:

hive.map.aggr=true

Performs partial aggregation on the map side, equivalent to a Combiner.

hive.groupby.skewindata=true

Performs load balancing when the data is skewed. When this option is set to true, the generated query plan contains two MR jobs. In the first job, the map output is distributed randomly across the reduces, and each reduce performs a partial aggregation and emits its result; records with the same Group By key may therefore land on different reduces, which achieves the load balancing. The second job then distributes the pre-aggregated results to the reduces by Group By key (this guarantees that the same Group By key goes to the same reduce) and completes the final aggregation.
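A minimal sketch of applying these two settings before a skewed aggregation (table and column names are illustrative):

set hive.map.aggr=true;
set hive.groupby.skewindata=true;

-- With skewindata enabled, this group by runs as two MR jobs:
-- a randomized partial aggregation, then the final aggregation by key.
select user_id, count(1) as pv
from access_log
group by user_id;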

2. SQL statement adjustment:

How to join:

When selecting the driving table, choose the one whose join key distribution is the most uniform.

Apply column pruning and filtering early, so that the data volume is already small by the time the two tables are joined.

Small table join large table:

Use map join to load the small dimension table (fewer than 1000 records) into memory in advance, so the join completes on the map side and no reduce step is needed.
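A minimal sketch of a map join using the MAPJOIN hint, assuming a small dimension table dim_city and a large fact table orders (both names illustrative):

select /*+ MAPJOIN(c) */ o.order_id, c.city_name
from orders o
join dim_city c on o.city_id = c.city_id;

Newer Hive versions can also perform this conversion automatically when hive.auto.convert.join is enabled.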

Large table join large table:

Turn the null keys into a string plus a random number, so the skewed data is spread across different reduces. Since null values never match in the join, the final result is not affected by this handling.

Count distinct where a particular special value is extremely frequent:

When computing count distinct, handle null values separately. If count distinct is the only computation, the nulls can simply be filtered out and 1 added to the final result (the 1 accounts for null as one distinct value). If other aggregations require a group by, first process the records with null values separately, then union all that result with the other computations.

Group by dimension too small: replace count(distinct ...) with a sum(...) group by rewrite, i.e. deduplicate with group by first and count afterwards.
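A minimal sketch of both rewrites, assuming a log table with a nullable user_id (names illustrative):

-- Handle nulls separately: count the distinct non-null ids, then add 1
-- to the result if null should itself count as one distinct value.
select count(distinct user_id) from log where user_id is not null;

-- Replace count(distinct) with group by + count, so deduplication is
-- spread across many reduces instead of concentrating on one.
select count(1)
from (select user_id from log group by user_id) t;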

Special handling for special cases:

When business-logic optimization does not help much, the skewed data can sometimes be pulled out, processed separately, and finally union all-ed back, as in the sketch below.
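A minimal sketch, assuming a single known hot key 'hot_user' (key and table names illustrative): join the skewed slice with a map join, join the remainder normally, then union all the two results.

select * from (
  select /*+ MAPJOIN(b) */ a.user_id, b.user_name
  from (select user_id from log where user_id = 'hot_user') a
  join (select user_id, user_name from users where user_id = 'hot_user') b
  on a.user_id = b.user_id
  union all
  select a.user_id, b.user_name
  from log a join users b on a.user_id = b.user_id
  where a.user_id <> 'hot_user'
) t;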

Problem-solving methods for typical business scenarios // from production

1. Data skew caused by null values

Scenario: logs often suffer from missing information, for example a missing user_id. If you join the log's user_id with the user table's user_id, you run into data skew.

Solution 1: rows whose user_id is null do not participate in the join.

select * from log a
join users b on a.user_id is not null and a.user_id = b.user_id
union all
select * from log a
where a.user_id is null;

Solution 2: assign a new key value to a null value

select * from log a
left outer join users b
on case when a.user_id is null then concat('hive', rand()) else a.user_id end = b.user_id;

Conclusion: method 2 is more efficient than method 1: less I/O and fewer jobs.

Why: method 1 reads log twice and runs as two jobs, while method 2 runs as a single job.

This optimization suits skew caused by invalid ids (such as -99, '', null). Turning the null key into a string plus a random number spreads the skewed data across different reduces and resolves the skew.

2. Data skew caused by joining different data types

Scenario: the user_id field in the users table is int, while the user_id field in the log table holds both string and int values. When the two tables are joined on user_id, the default hash partitioning is done on the int id, so all records with string ids end up on a single Reducer.

Solution: cast the numeric type to string

select * from users a
left outer join logs b
on a.user_id = cast(b.user_id as string);

3. The small table is neither small enough nor truly large: using map join to resolve the skew

Map join solves the data skew of small tables (few records) joined with large tables, and it is used very frequently. But if the small table is large enough that the map join hits bugs or exceptions, special handling is needed. For example:

select * from log a
left outer join users b
on a.user_id = b.user_id;

The users table has 6 million+ records. Distributing users to every map is no small overhead, and map join does not support such a large "small" table; yet an ordinary join runs into the data skew problem.

Solution:

select /*+ mapjoin(x) */ * from log a
left outer join (
  select /*+ mapjoin(c) */ d.*
  from (select distinct user_id from log) c
  join users d on c.user_id = d.user_id
) x
on a.user_id = x.user_id;

If log contains millions of distinct user_ids, this brings back the original map join problem. Fortunately, on any given day there are not that many active members: not many members with transactions, with clicks, with commissions, and so on. So this method solves the data skew problem in many scenarios.

Summary:

// Our ultimate goal is to distribute the map output to the reduces more evenly.

Due to the limitations of the hash algorithm, hashing by key will always cause some degree of skew. But a great deal of experience shows that data skew usually stems from human negligence in table design, or can be avoided in the business logic.

A more general procedure:

1. Sample the log table to find which user_ids are skewed, producing a result table tmp1. The computing framework cannot know the distribution of all the data, so sampling is indispensable.

2. Data distribution follows the rules of sociological statistics: wealth is unevenly distributed. There will not be many skewed keys, just as a society does not have many rich people. So tmp1 will have very few records. Map join tmp1 with users to generate tmp2, and read tmp2 into the distributed file cache (a map-only process).

3. The maps read both users and log. For a record from log, check whether its user_id is in tmp2; if so, write it to a local file a, otherwise emit a key/value pair. Records from users likewise emit key/value pairs and enter the reduce phase.

4. Finally, file a is merged with the output of the Stage 3 reduce phase and written to HDFS.
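A minimal sketch of the sampling in step 1, assuming a 1% sample and an illustrative skew threshold (all names and numbers are illustrative):

-- Sample log and keep the user_ids whose record counts are far above
-- average; these are the skewed keys.
create table tmp1 as
select user_id, count(1) as cnt
from log tablesample(bucket 1 out of 100 on rand()) s
group by user_id
having count(1) > 100000;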

If it is confirmed that the business genuinely requires such skewed logic, consider the following optimizations:

1. For joins, when the small table is judged to be no larger than 1 GB, use map join.

2. For group by or distinct, set hive.groupby.skewindata=true.

3. Wherever possible, apply the SQL-level rewrites described above.

Tips:

1. When joining a large table with a small table, use MapJoin. When performing a MapJoin, pay attention to the join order and to the row count of the filtered small table, which must not be too large. The usual upper bound for the small table is 200,000 rows (for reference only); beyond that, MapJoin is not recommended. In particular, if the large side of the join is extremely large and the shuffle carries a great deal of data, then, considering the relative sizes of the two tables, the acceptable small-table limit may be raised to 1 million rows. (See the settings sketch after these tips.)

2. Watch whether the data volume in the shuffle and join stages is large enough to spill to disk; if so, increase the number of Reduce tasks (again, see the sketch below). Avoid direct joins between two large tables whenever possible: inspect and analyze the SQL before execution, and if small tables exist, use them (or tables with a high filtering rate) to filter the large tables first, that is, do the joins involving small tables first, and only then bring in the large tables.

3. When a join between large tables is unavoidable: turn the null keys into a string plus a random number, spreading the skewed data across different reduces. Since null values never match, the final result is not affected.
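For tips 1 and 2, a sketch of the relevant settings (parameter values are illustrative; these are standard Hive/MapReduce knobs, not taken from this article):

-- Tip 1: automatic map-join conversion and its small-table threshold.
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;  -- threshold in bytes, ~25 MB

-- Tip 2: raise reduce parallelism.
set hive.exec.reducers.bytes.per.reducer=256000000;  -- less data per reducer => more reducers
set mapred.reduce.tasks=200;                         -- or fix the reduce count explicitly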

Normally, such table joins should be handled at the business-logic level.

Summary:

1. First prune the columns of both tables, so that their data volumes become comparable and the data left to process is smaller.

2. Small table join large table: a map-side join can load the small dimension table (on the order of 1000 records) into memory in advance; alternatively, let small tables or tables with a high filtering rate filter the large table, i.e. do the joins involving small tables first, then bring in the large table.

3. Large table join large table: turn the null keys into a string plus a random number, spreading the skewed data across different reduces; since null values never match, the final result is unaffected. You can also shrink a large table into a small one (by deduplication or filtering) and then map join it.
