2025-01-28 Update From: SLTechnology News&Howtos
This article introduces how to optimize Hive joins over tables with hundreds of millions of rows. It should have some reference value; I hope interested readers learn a lot from it, so let me walk you through it.
Environment: the company decided to build a wide table by fully joining 10 related large tables. Among these tables, table a has hundreds of millions of rows, five tables have tens of millions of rows each, and the remaining tables are under a million rows.
I spent two days researching and testing.
The first attempt looked like this:

select a.userid, a.citycode, b.register_num, ..., g.active_num
from (select userid, citycode from a) a
left outer join (select userid, register_num from b) b
  on (a.userid = b.userid)
...
left outer join (select userid, active_num from g) g
  on (a.userid = g.userid)
You will find that the last job is unusually slow, and its reduce count is 1.
Many people will say: you idiot, just set the reduce number. Well, what was the result?
# set output compression format
set mapred.output.compress=true;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
# set 200 reduces
set mapred.reduce.tasks=200;
# enable parallel execution
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=16;  -- maximum parallelism allowed for one sql; the default is 8
Damn it, I spent the whole day testing every tutorial on the Internet, and the last reduce was still 1. (My table has hundreds of millions of rows!)
How Hive automatically calculates the number of reduces:
1. hive.exec.reducers.bytes.per.reducer (default is 1000^3 bytes, i.e. about 1 GB)
2. hive.exec.reducers.max (default is 999)
The formula for the reducer count is simple:
N = min(parameter 2, total input size / parameter 1)
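The formula above can be sketched in Python. This is a hypothetical helper, not Hive source code; the defaults mirror hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max:

```python
import math

def estimate_reducers(total_input_bytes,
                      bytes_per_reducer=1_000_000_000,  # hive.exec.reducers.bytes.per.reducer
                      max_reducers=999):                # hive.exec.reducers.max
    """Sketch of N = min(parameter 2, total input size / parameter 1)."""
    n = min(max_reducers, math.ceil(total_input_bytes / bytes_per_reducer))
    return max(1, n)  # always at least one reducer

# 50 GB of input with the defaults -> 50 reducers
print(estimate_reducers(50 * 10**9))  # -> 50
```

So a multi-terabyte input is simply capped at 999 reduces, while anything under 1 GB gets a single reducer.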
Further digging revealed that the reduce count is forced to 1 when:
1. The query does not use group by
2. The query uses order by
3. The query produces a Cartesian product
I ran the whole thing all over again, and reduce was still 1. I was speechless at the time; couldn't this be documented more clearly? (I'm a rookie!)
Run time: it still hadn't finished after 3 hours, stuck at 83% the whole time.
So when Hadoop sees a statement like this, it assigns a single reduce.
How do you trick Hive into allocating more reduces?
I modified the script as follows (of course, the reduce-number settings above must stay in place).
# how to trick hive into allocating more reduces
select x.userid, x.citycode, sum(b.register_num), ..., sum(g.active_num)  -- use aggregate functions
from (select userid, citycode from x) x               -- x and y stand for the smallest of these tables
full outer join (select userid, unregister from y) y
  on (x.userid = y.userid)
full outer join (select userid, register_num from b) b
  on (x.userid = b.userid)                            -- join conditions all reference the small tables
...
right outer join (select userid, active_num from a) a -- the largest table goes last
  on (x.userid = a.userid)
group by x.userid, x.citycode                         -- finally, group by
Use aggregate functions, plus group by.
Then put the small tables in front (some people say: I want all the data, so use full joins for that).
The big table generally goes at the end, joining from small to large, one step at a time.
In this way, Hive is tricked into allocating multiple reduces, achieving the tuning effect.
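Why does group by get more reduces? Each row is routed to a reducer by hashing the grouping key, so all 200 configured reducers receive a share of the keys instead of everything landing on one. A minimal Python illustration (hypothetical; crc32 stands in for Hive's actual partitioner):

```python
import zlib

def reducer_for_key(userid, num_reducers=200):
    # a deterministic hash of the grouping key picks the target reducer,
    # so all rows with the same userid meet on the same reducer
    return zlib.crc32(str(userid).encode()) % num_reducers

# distinct userids spread across (almost) all 200 reducers
used = {reducer_for_key(u) for u in range(10_000)}
print(len(used))
```

The same key always maps to the same reducer, which is what makes the per-key aggregation (sum, etc.) correct while still spreading the work.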
Run time: less than 15 minutes. Excited yet? Haha.
Disadvantages:
It generates 200 output files, which is troublesome to deal with.
Parallel execution demands a fair amount of cluster resources, so set the parallelism moderately.
The parallel parameters, for reference only:
When the parameter is false, the three jobs are executed sequentially:
set hive.exec.parallel=false;
But you can see that the sql in the two subqueries is unrelated and can run in parallel:
set hive.exec.parallel=true;
hive> set hive.exec.parallel.thread.number; (on an ordinary machine, a parallelism of 3 feels reasonable)
hive.exec.parallel.thread.number=8 -- the default parallelism is 8
Thank you for reading this article carefully. I hope this article on how to tune Hive joins over hundreds-of-millions-row tables will be helpful to you.