An example Analysis of join data skew in hive 07/03 Update SLTechnology News&Howtos

An example Analysis of join data skew in hive

2025-07-03 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/01 Report--

This article shares with you the content of an example analysis of join data skew in hive. The editor thinks it is very practical, so share it with you as a reference and follow the editor to have a look.

1. Check the sql statement set mapred.reduce.tasks = 30 insert overwrite directory 'xxx' select cus.idA,cus.name,addr.bb from tableA as cus join tableB as addr on cus.idA = addr.idB that hasn't been finished for 5 hours.

A very simple hql statement, there is not much room for optimization (in the example, the amount of data of addr is smaller than that of cus, so it should be said that addr is put in front to drive join). The order of magnitude of tableA is hundreds of millions, and that of tableB is several million. With such a simple sql, Nima started running at 10:30 in the morning and had not finished running until 03:30 in the afternoon. I can't stand it. I dropped the kill.

two。 Preliminary analysis

First of all, the diagram from the previous query

Seeing this situation, the first reaction of the students with a little bit of experience must be: shit, this Nima must be skewed. Yes, map has long been completed, the reduce phase has been stuck at 99%, and the cumulative cpu time has been growing, indicating that the whole job is still running in the background. In this case, there is a 99% chance that the data is skewed and the entire query task is waiting for a node to finish. no, no, no.

3. The analysis of that part of the data produced a tilt.

Now that the problem has been located, the next step is to solve the problem. Unfortunately, there has been something wrong with the cluster these days. So, first of all, in order to confirm whether it is the problem of the cluster itself or the code, we first find the other two tables, both of which are billion-level data. There is no data skew in these two tables. Join tried it and the results came out within two minutes. Fortunately, it means that there is no problem with the cluster, so check the data and code.

The code itself is simple, so check it in the direction of tilting the data. Because the above two tables are related according to id, if it is tilted, then id must be tilted.

Set mapred.reduce.tasks = 5

Select idA,count (*) as num from tableA group by idA distribute by idA sort by num desc limit 10

The result is:

192928 58285292000000000496592833 240628918000 17060314000288 13863242000000003624295444 12011782000000001720892923 10294752000000002292880478 9912992000000000736661289 8819542000000000740899183 8734872000000000575115116 803250

For a table with hundreds of millions of data, this data is not very skewed. The largest key is more than 5 million and less than 6 million. All right, forget it. Check another watch.

Set mapred.reduce.tasks = 5 select idB,count (*) as num from tableB group by idB distribute by idB sort by num desc limit 10 results will come out soon.

192928 38341218000 60318617279581 2302851010262 46434000286 35282000000000575115116 32181366173280 30124212339 29722000000002025620390 27042000000001312577574 2622

The tilt of this data is not particularly serious.

However, after comparing the two results, Nima suddenly realized. The largest key of the two tables is 192928, one appears nearly 6 million times, and the other appears nearly 400000 times. These two tables have another join, and the key of Nima is 6 million * 400000 of the calculation. Most crucially, the amount of computation is allocated to a node. I am not very good at math. What is 6 million * 400000? get down on your knees and ask the students who are good at math to help me with the calculation. However, according to experience, not to mention five hours, adding another 0 may not be the end.

4. How to solve

Now that we have found the location of the tilt of the data, it is easy to solve it. Because the real need of this blogger is not to calculate the Cartesian product of two tables (it is estimated that there are very few real requirements to calculate the Cartesian product of 6 million * 400000 data. If so, the picture is too beautiful for me to look at), so the most easy solution is to filter out these key:

Set mapred.reduce.tasks = 30 insert overwrite directory 'xxx' select cus.idA,cus.name,addr.bb from tableA as cus join tableB as addr on cus.idA = addr.idB where cus.idA not in (192928pr. 200000036295444. 200000036295444. 20000001720892923. 200000022980478. 2000000073666899183, 200000005751151161617279581, 51026213663280,2000000202562039020000001312577574)

Resubmit this code, 5min time, job finish work!

Thank you for reading! This is the end of the article on "case analysis of join data tilt in hive". I hope the above content can be of some help to you, so that you can learn more knowledge. if you think the article is good, you can share it for more people to see!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.