2025-01-28 Update From: SLTechnology News&Howtos
This article introduces how to optimize Hive joins over tables with hundreds of millions of rows. It should have some reference value; I hope interested readers learn a lot from it, so let me walk you through it.
Environment: the company decided to build a wide table by fully joining 10 related large tables. Among these tables, table a has hundreds of millions of rows, five tables have tens of millions of rows each, and the remaining tables are under a million rows.
I spent two days researching and testing.
The first attempt looked like this:

select a.userid, a.citycode, b.register_num, ..., g.active_num
from (select userid, citycode from a) a
left outer join (select userid, register_num from b) b
  on (a.userid = b.userid)
...
left outer join (select userid, active_num from g) g
  on (a.userid = g.userid)
You will find that the last job is unusually slow, and its reduce count is 1.
Many people will say: you idiot, just set the reduce number. Well, what was the result?
# set output compression format
set mapred.output.compress=true;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
# set 200 reduces
set mapred.reduce.tasks=200;
# enable parallel execution
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=16;  -- maximum parallelism allowed for one sql; the default is 8
Damn it, I spent the whole day testing every tutorial on the Internet, and the last reduce was still 1. (My table has hundreds of millions of rows!)
How Hive automatically calculates the number of reduces:
1. hive.exec.reducers.bytes.per.reducer (default is 1000^3 bytes, i.e. about 1 GB)
2. hive.exec.reducers.max (default is 999)
The formula for the reducer count is simple:
N = min(parameter 2, total input size / parameter 1)
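The formula above can be sketched in Python. This is a hypothetical helper, not Hive source code; the defaults mirror hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max:

```python
import math

def estimate_reducers(total_input_bytes,
                      bytes_per_reducer=1_000_000_000,  # hive.exec.reducers.bytes.per.reducer
                      max_reducers=999):                # hive.exec.reducers.max
    """Sketch of N = min(parameter 2, total input size / parameter 1)."""
    n = min(max_reducers, math.ceil(total_input_bytes / bytes_per_reducer))
    return max(1, n)  # always at least one reducer

# 50 GB of input with the defaults -> 50 reducers
print(estimate_reducers(50 * 10**9))  # -> 50
```

So a multi-terabyte input is simply capped at 999 reduces, while anything under 1 GB gets a single reducer.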
Further digging revealed that the reduce count is forced to 1 when:
1. The query does not use group by
2. The query uses order by
3. The query produces a Cartesian product
I ran the whole thing all over again, and reduce was still 1. I was speechless at the time; couldn't this be documented more clearly? (I'm a rookie!)
Run time: it still hadn't finished after 3 hours, stuck at 83% the whole time.
So when Hadoop sees a statement like this, it assigns a single reduce.
How do you trick Hive into allocating more reduces?
I modified the script as follows (of course, the reduce-number settings above must stay in place).
# how to trick hive into allocating more reduces
select x.userid, x.citycode, sum(b.register_num), ..., sum(g.active_num)  -- use aggregate functions
from (select userid, citycode from x) x               -- x and y stand for the smallest of these tables
full outer join (select userid, unregister from y) y
  on (x.userid = y.userid)
full outer join (select userid, register_num from b) b
  on (x.userid = b.userid)                            -- join conditions all reference the small tables
...
right outer join (select userid, active_num from a) a -- the largest table goes last
  on (x.userid = a.userid)
group by x.userid, x.citycode                         -- finally, group by
Use aggregate functions, plus group by.
Then put the small tables in front (some people say: I want all the data, so use full joins for that).
The big table generally goes at the end, joining from small to large, one step at a time.
In this way, Hive is tricked into allocating multiple reduces, achieving the tuning effect.
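Why does group by get more reduces? Each row is routed to a reducer by hashing the grouping key, so all 200 configured reducers receive a share of the keys instead of everything landing on one. A minimal Python illustration (hypothetical; crc32 stands in for Hive's actual partitioner):

```python
import zlib

def reducer_for_key(userid, num_reducers=200):
    # a deterministic hash of the grouping key picks the target reducer,
    # so all rows with the same userid meet on the same reducer
    return zlib.crc32(str(userid).encode()) % num_reducers

# distinct userids spread across (almost) all 200 reducers
used = {reducer_for_key(u) for u in range(10_000)}
print(len(used))
```

The same key always maps to the same reducer, which is what makes the per-key aggregation (sum, etc.) correct while still spreading the work.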
Run time: less than 15 minutes. Excited yet? Haha.
Disadvantages:
It generates 200 output files, which is troublesome to deal with.
Parallel execution demands a fair amount of cluster resources, so set the parallelism moderately.
The parallel parameters, for reference only:
When the parameter is false, the three jobs are executed sequentially:
set hive.exec.parallel=false;
But you can see that the sql in the two subqueries is unrelated and can run in parallel:
set hive.exec.parallel=true;
hive> set hive.exec.parallel.thread.number; (on an ordinary machine, a parallelism of 3 feels reasonable)
hive.exec.parallel.thread.number=8 -- the default parallelism is 8
Thank you for reading this article carefully. I hope this article on how to tune Hive joins over hundreds-of-millions-row tables will be helpful to you.