Example Analysis of SQL data skew and Optimization in Hive


This article analyzes examples of SQL data skew in Hive and how to optimize around it. It has reference value, and interested readers can refer to it. I hope you learn a lot from reading it.

1 Causes of Data Skew

1.1 Operations:

Keyword        | Situation                                                                              | Consequence
---------------|----------------------------------------------------------------------------------------|----------------------------------------------------------------------------
Join           | One of the tables is small, but its keys are concentrated                              | The data sent to one or a few reducers is far above the average
Join           | Large table joined with large table, but the join field has too many 0 or null values  | All the null values are handled by a single reducer, which is invariably slow
Group by       | The group-by dimension is too small and one value is over-represented                  | The reducer that handles that value is often time-consuming
Count Distinct | Too many rows share one special value                                                  | The reducer that handles that special value is time-consuming

1.2 Causes:

1) Uneven key distribution

2) Characteristics of the business data itself

3) Poor choices when designing the table

4) Some SQL statements are inherently prone to data skew

1.3 Symptoms:

Task progress stays at 99% (or 100%) for a long time. The task monitoring page shows that only a small number of reduce subtasks (one or a few) have not completed, because the amount of data they handle differs too much from that of the other reducers.

The number of records handled by a single reduce differs from the average by too much, often by a factor of 3 or more, and its running time is much longer than the average.

2 Solutions to Data Skew

2.1 Parameter tuning:

hive.map.aggr=true

Performs partial aggregation on the map side, equivalent to a Combiner.

hive.groupby.skewindata=true

Performs load balancing when the data is skewed. When this option is set to true, the generated query plan contains two MR jobs. In the first MR job, the map output is distributed randomly across the reducers; each reducer does a partial aggregation and emits the result, so rows with the same Group By Key may land on different reducers, which achieves the load balancing. The second MR job then distributes the pre-aggregated results to the reducers by Group By Key (this guarantees that rows with the same Group By Key go to the same reducer) and completes the final aggregation.
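As a minimal sketch (using the log table and user_id column from the examples later in this article), the two parameters are simply set before running the skewed aggregation:

set hive.map.aggr=true;
set hive.groupby.skewindata=true;
-- the skew-prone aggregation itself is unchanged
select user_id, count(1) from log group by user_id;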

2.2 SQL statement tuning:

How to join:

When choosing the driving table, pick the table whose join key distribution is the most uniform.

Do column pruning and filtering early, so that the data volume is already relatively small when the two tables are joined.

Small table join large table:

Use map join to load the small dimension table (fewer than about 1000 records) into memory, completing the join on the map side; a sketch follows below.
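A minimal sketch of the map join hint (the fact and dim table names and their columns are hypothetical):

select /*+ mapjoin(d) */ f.*, d.dim_name
from fact f
join dim d on f.dim_id = d.dim_id;

In newer Hive versions, hive.auto.convert.join can perform this conversion automatically when the small table fits under the configured size threshold.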

Large table join large table:

Turn the null-valued keys into a string plus a random number, so the skewed data is spread across different reducers. Since null keys cannot join to anything, the final result is unaffected (see solution 2 in section 3.1 below).

Count distinct with a large number of one special value:

When computing count distinct, handle the null values separately. If only the count distinct is being computed, the nulls can be filtered out directly, adding 1 to the final result. If there are other computations and a group by is needed, handle the records with null values separately first, then union the result with the other computations.
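A minimal sketch of the first case, on the log table used below (the + 1 adds the null back as one distinct value):

select count(distinct user_id) + 1 as uv
from log
where user_id is not null;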

When the group by dimension is too small:

Use sum() with group by to replace count(distinct); a sketch follows below.
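A minimal sketch of this rewrite (the table t and columns c1 and uid are hypothetical): deduplicate with an inner group by, then aggregate the deduplicated rows:

-- skew-prone original
select c1, count(distinct uid) from t group by c1;

-- rewritten form
select c1, sum(1) as uv
from (select c1, uid from t group by c1, uid) tmp
group by c1;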

Special handling for special cases:

When business-logic optimization does not help much, the skewed data can sometimes be pulled out and processed separately, then union'ed back at the end.

3 Typical Business Scenarios

3.1 Data skew caused by null values

Scenario: logs often suffer from information loss, for example a missing user_id. If the user_id in the log is joined against the user_id in the users table, you run into data skew.

Solution 1: rows where user_id is null do not participate in the join

select * from log a
join users b on a.user_id is not null and a.user_id = b.user_id
union all
select * from log a where a.user_id is null;

Solution 2: assign new key values to the nulls

select * from log a
left outer join users b
on case when a.user_id is null then concat('hive', rand()) else a.user_id end = b.user_id;

Conclusion: method 2 is more efficient than method 1, with less I/O as well as fewer tasks. In solution 1, log is read twice and there are 2 jobs; in solution 2 there is 1 job. This optimization suits skew caused by invalid ids (such as -99, '', null). Turning the null keys into a string plus a random number spreads the skewed data across different reducers and resolves the data skew.

3.2 Data skew caused by joining different data types

Scenario: the user_id field in the users table is int; the user_id field in the log table holds both string and int values. When the two tables are joined on user_id, the default hash partitioning operates on the int ids, so every record with a string-typed id is sent to a single Reducer.

Solution: convert the numeric type to a string type

select * from users a
left outer join logs b on a.user_id = cast(b.user_id as string);

3.3 When the small table is neither small nor large: solving the skew with map join

Map join solves the data skew problem of a small table (few records) joined with a large table, and it is used very frequently. But when the small table is large enough that map join hits bugs or exceptions, special handling is needed. For example:

select * from log a left outer join users b on a.user_id = b.user_id;

The users table has 6 million+ records, so distributing users to every map task is considerable overhead, and map join does not support a "small" table this large. Yet with an ordinary join, you run into the data skew problem.

Solution:

select /*+ mapjoin(x) */ *
from log a
left outer join (
  select /*+ mapjoin(c) */ d.*
  from (select distinct user_id from log) c
  join users d on c.user_id = d.user_id
) x
on a.user_id = x.user_id;

If log has millions of distinct user_id values, this runs back into the original map join problem. Fortunately, the members active on a given day are not that numerous: not many members have uv, transactions, clicks, commission, and so on. So this approach resolves the data skew in many scenarios.

3.4 Replacing COUNT(DISTINCT) with GROUP BY for optimization

COUNT(DISTINCT) is commonly used to compute uv, but it is slow when the data is skewed. In that case, try rewriting the code with GROUP BY to compute uv.

INSERT OVERWRITE TABLE s_dw_tanx_adzone_uv PARTITION (ds=20120329)
SELECT 20120329 AS thedate, adzoneid, COUNT(DISTINCT acookie) AS uv
FROM s_ods_log_tanx_pv t WHERE t.ds=20120329 GROUP BY adzoneid;
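The article shows only the original COUNT(DISTINCT) form; a plausible GROUP BY rewrite of the same statement (deduplicating on (adzoneid, acookie) in a subquery, an assumption not shown in the source) would be:

INSERT OVERWRITE TABLE s_dw_tanx_adzone_uv PARTITION (ds=20120329)
SELECT 20120329 AS thedate, adzoneid, COUNT(1) AS uv
FROM (SELECT adzoneid, acookie
      FROM s_ods_log_tanx_pv
      WHERE ds=20120329
      GROUP BY adzoneid, acookie) t
GROUP BY adzoneid;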

Whether COUNT(DISTINCT) causes data skew cannot be generalized; it depends on the situation. Here is a set of data I tested:

Test data: 169,857 rows

-- Count the day's distinct IPs
CREATE TABLE ip_2014_12_29 AS
SELECT COUNT(DISTINCT ip) AS IP FROM logdfs WHERE logdate='2014_12_29';
-- time taken: 24.805 seconds

-- Count the day's distinct IPs (rewritten)
CREATE TABLE ip_2014_12_29 AS
SELECT COUNT(1) AS IP
FROM (SELECT DISTINCT ip FROM logdfs WHERE logdate='2014_12_29') tmp;
-- time taken: 46.833 seconds

Test result: the rewritten statement is clearly slower than the original, because it has two SELECTs and one extra job. When the data volume is this small, the data does not skew in the first place, so the rewrite buys nothing.

3.5 Working around the limitations of Hive's UNION ALL optimization

Hive's union all optimization is limited to non-nested queries.

Eliminate group by in a subquery

Example 1: group by in a subquery

SELECT * FROM
(SELECT * FROM t1 GROUP BY c1, c2, c3
 UNION ALL
 SELECT * FROM t2 GROUP BY c1, c2, c3) t3
GROUP BY c1, c2, c3;

In terms of business logic, the GROUP BY in the subquery looks redundant (functionally redundant, unless there is a COUNT(DISTINCT)), kept only because of a Hive bug or for performance (there was once a Hive bug where the data came out wrong unless the subquery GROUP BY was present). So the statement can be empirically converted to:

SELECT * FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t2) t3 GROUP BY c1, c2, c3;

Tuning result: testing shows that the union all Hive bug no longer exists and the data stays consistent. The number of MapReduce jobs drops from 3 to 1.

t1 is equivalent to one directory and t2 to another; for a Map/Reduce program, t1 and t2 can serve as multiple inputs to a single Map/Reduce job, so the whole thing is handled in one Map/Reduce pass. Hadoop's computing framework is not afraid of large data volumes; it is afraid of large numbers of jobs.

On another computing platform such as Oracle this is not necessarily true: splitting the large input in two, sorting and summarizing each part, and then merging (if the two sub-sorts run in parallel) may actually perform better (in the way that, say, Shell sort beats bubble sort).

Eliminate COUNT(DISTINCT), MAX, and MIN in a subquery

SELECT * FROM
(SELECT * FROM t1
 UNION ALL
 SELECT c1, c2, c3, COUNT(DISTINCT c4) FROM t2 GROUP BY c1, c2, c3) t3
GROUP BY c1, c2, c3;

Because the subquery contains a COUNT(DISTINCT), simply hoisting the GROUP BY out will not meet the business goal. Eliminating the COUNT(DISTINCT) with a temporary table not only solves the skew problem but also effectively reduces the number of jobs.

INSERT OVERWRITE TABLE t4
SELECT c1, c2, c3, c4 FROM t2 GROUP BY c1, c2, c3, c4;

SELECT c1, c2, c3, SUM(income) AS income, SUM(uv) AS uv
FROM (SELECT c1, c2, c3, income, 0 AS uv FROM t1
      UNION ALL
      SELECT c1, c2, c3, 0 AS income, 1 AS uv FROM t4) t3
GROUP BY c1, c2, c3;

The number of jobs is 2, cut in half, and two rounds of Map/Reduce are more efficient than one COUNT(DISTINCT).

Tuning result: a ten-million-row category table and member table joined against a billion-row commodity table. The task originally took 1963 s; after tuning it completed in 1152 s.

Eliminate JOIN in a subquery

SELECT * FROM
(SELECT * FROM t1
 UNION ALL
 SELECT * FROM t4
 UNION ALL
 SELECT * FROM t2 JOIN t3 ON t2.id = t3.id) x
GROUP BY c1, c2;

The code above runs as 5 jobs. If the JOIN is first materialized into a temporary table t5 and then UNION ALL'ed, it becomes 2 jobs.

INSERT OVERWRITE TABLE t5
SELECT * FROM t2 JOIN t3 ON t2.id = t3.id;

SELECT * FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t4 UNION ALL SELECT * FROM t5) x;

Tuning result: for a ten-million-row advertising table, the original 5 jobs took 15 minutes in total; split into 2 jobs, one takes 8-10 minutes and the other 3 minutes.

4 Summary

Distributing the map output evenly across the reducers is our ultimate goal. Because of the limitations of hash algorithms, hashing by key will more or less cause data skew. A great deal of experience shows that data skew stems from human negligence in table design or can be avoided in the business logic. The more general steps are:

1. Sample the log table to find which user_id values are skewed, producing a result table tmp1. The computing framework does not know the distribution of the data, so sampling is indispensable; a sketch of this step follows the list.

2. The distribution of data follows the rules of sociological statistics: wealth is unevenly distributed. There will not be many skewed keys, just as a society has few rich people and few outliers, so tmp1 will have very few records. Map join tmp1 with users to produce tmp2, and write tmp2 out to the distributed file cache. This is a map-only process.

3. The map phase reads both users and log. If a record comes from log, check whether its user_id is in tmp2: if so, output it to the local file a; otherwise emit the key/value pair. If the record comes from users, emit the key/value pair into the reduce phase.

4. Finally, merge file a with the output files of the Stage 3 reduce phase and write the result to HDFS.
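A rough sketch of the sampling in step 1 (the table names are from the steps above, but the bucket fraction and the threshold of 1000 are illustrative assumptions):

INSERT OVERWRITE TABLE tmp1
SELECT user_id
FROM log TABLESAMPLE(BUCKET 1 OUT OF 100 ON rand()) s
GROUP BY user_id
HAVING COUNT(1) > 1000;   -- keep only the heavily skewed keys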

If the business genuinely requires such skewed logic, consider the following optimization options:

1. For join, when the small table is judged to be no larger than 1G, use map join

2. For group by or distinct, set hive.groupby.skewindata=true

3. Apply the SQL statement adjustments above to optimize wherever possible

Hadoop's data-processing pipeline has several notable characteristics:

It is not afraid of large data volumes, only of data skew.

Jobs built from a large number of sub-jobs run inefficiently: for example, joining and summarizing a table of a few hundred rows many times can generate a dozen or more jobs, which will not finish within half an hour, because MapReduce jobs take a long time to initialize.

sum and count have no data skew problem.

count(distinct) is inefficient; with a large amount of data it becomes problematic, and multiple count(distinct)s are even less efficient.

Optimization can start from several aspects:

Good model design gets twice the result with half the effort.

Solve the problem of data skew.

Reduce the number of jobs.

Setting a reasonable number of map and reduce tasks can effectively improve performance (for example, for a computation over ~100k rows, using 160 reducers is quite wasteful; 1 is enough; see the sketch after this list).

Writing your own SQL to solve the data skew problem is a good choice. set hive.groupby.skewindata=true is a generic algorithmic optimization, but algorithmic optimization always ignores the business and habitually offers a one-size-fits-all solution. ETL developers understand the business and the data better, so solving the skew through business logic is often more precise and effective.

Avoid count(distinct), especially on large data, where it easily leads to skew problems; do not take chances. Doing it yourself pays off.

Merging small files is an effective way to improve scheduling efficiency. If our jobs produce a reasonable number of files, it also has a positive impact on the overall scheduling efficiency of the cluster.

Grasp the whole picture when optimizing: the optimum of a single job is not as good as the optimum of the whole.
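A minimal sketch of capping the task count mentioned above (the table t, its columns, and the value 1 are illustrative assumptions):

-- force a single reducer for a small aggregation
set mapred.reduce.tasks=1;
select c1, sum(income) as income from t group by c1;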

The details are as follows:

Remove unneeded columns from the query (column pruning).

Push where-condition filtering and the like down into the TableScan stage.

Use partition information to read only the qualifying partitions.

Map-side join: drive with the large table and load the small table into every mapper's memory.

Adjust the Join order to ensure that the large table is used as the driving table

For a group by over unevenly distributed data, to keep the data from concentrating on a small number of reducers, split the work into two map-reduce phases: in the first phase, shuffle on the distinct columns and partially aggregate on the reduce side to shrink the data; in the second map-reduce phase, aggregate by the group-by columns.

Perform partial hash-based aggregation on the map side to reduce the amount of data the reduce side has to process.

Thank you for reading this article carefully. I hope this "Example Analysis of SQL Data Skew and Optimization in Hive" has been helpful to you.
