How to optimize Hive statements

2025-07-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article shares how to optimize Hive statements. Many readers are not very familiar with the topic, so it is offered here for reference.

Data skew comes in two kinds: skew caused by group by and skew caused by join.

Assume the site access log records the user_id of each visitor. Registered users carry the user_id from the user table, while all non-registered users are represented by user_id = 0. Because most visitors are non-registered (they only read and never write), user_id = 0 accounts for the vast majority of records. If user_id is used as the group by dimension or as the join key, a single reducer receives far more data than the others, because it has to process every user_id = 0 record. That reducer performs very poorly and keeps running long after the other reducers have finished.
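The effect described above can be sketched with a small simulation. This is only an illustration under assumptions: the log contents, the reducer count, and the helper names reducer_for and reducer_loads are hypothetical, and integer keys are simply taken modulo the number of reducers as a stand-in for hash partitioning.

```python
def reducer_for(key: int, num_reducers: int) -> int:
    # Stand-in for hash partitioning: an integer key hashes to itself.
    return key % num_reducers


def reducer_loads(user_ids, num_reducers):
    """Count how many records each reducer would receive."""
    loads = {r: 0 for r in range(num_reducers)}
    for uid in user_ids:
        loads[reducer_for(uid, num_reducers)] += 1
    return loads


# Hypothetical log: 9,000 visits by non-registered users (user_id = 0)
# plus one visit each by registered users 1..1000.
log_user_ids = [0] * 9000 + list(range(1, 1001))
loads = reducer_loads(log_user_ids, num_reducers=4)
print(loads)
```

One reducer ends up with every user_id = 0 record on top of its fair share of the registered users, which is the skew the rest of the article tries to remove.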

1. Skew caused by group by

Skew caused by group by can be addressed with two parameters:

The map-side parameter is hive.map.aggr. Its default value is true, which means a combiner-style partial aggregation is done on the map side. So if your group by query only computes count(*), you will see almost no skew, but if it computes count(distinct), you will still see some skew.
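Why map-side aggregation helps count(*) can be sketched as follows. This is a toy model, not Hive's implementation: the three map splits and the helper names are assumptions made for illustration.

```python
from collections import Counter


def map_side_combine(split):
    """Partial count(*) per key inside one map task, as a combiner would do."""
    return Counter(split)


def shuffled_records_for_key(splits, key, combine):
    """Number of records for `key` that are sent to its reducer."""
    if combine:
        # Each map task sends at most one partial count per key.
        return sum(1 for s in splits if key in map_side_combine(s))
    return sum(s.count(key) for s in splits)


# Hypothetical input: 3 map tasks, each holding 3,000 rows with
# user_id = 0 plus a few rows for other users.
splits = [[0] * 3000 + [1, 2, 3] for _ in range(3)]

without_combiner = shuffled_records_for_key(splits, 0, combine=False)
with_combiner = shuffled_records_for_key(splits, 0, combine=True)
final_count = sum(map_side_combine(s)[0] for s in splits)
print(without_combiner, with_combiner, final_count)
```

With partial aggregation the hot key reaches its reducer as a handful of partial counts instead of thousands of rows. A distinct count cannot be collapsed into a single number per map task in the same way, which is consistent with the remark that count(distinct) still shows some skew.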

The reduce-side parameter is hive.groupby.skewindata. When it is set, rows with the same key are no longer all sent to the same reducer. They are distributed randomly, and each reducer performs a partial aggregation. A second MapReduce round then computes the final result from the partially aggregated data.
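The two rounds described above can be modelled roughly like this. It is a sketch of the idea only, assuming a seeded random distribution in the first round; the function name two_stage_count and the sample data are hypothetical and do not reflect Hive internals.

```python
import random
from collections import Counter


def two_stage_count(keys, num_reducers, seed=0):
    """Count rows per key in two rounds, spreading hot keys in round one."""
    rng = random.Random(seed)
    # Round 1: each row goes to a random reducer, whatever its key,
    # and every reducer builds partial counts per key.
    partials = [Counter() for _ in range(num_reducers)]
    for k in keys:
        partials[rng.randrange(num_reducers)][k] += 1
    round1_loads = [sum(p.values()) for p in partials]
    # Round 2: partial counts are regrouped by key and summed.
    final = Counter()
    for p in partials:
        final.update(p)
    return final, round1_loads


keys = [0] * 9000 + list(range(1, 1001))
final, round1_loads = two_stage_count(keys, num_reducers=4)
print(final[0], round1_loads)
```

The final counts match a plain group by, while no first-round reducer has to handle all rows of the hot key. The price is the extra round.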

For skewed joins there is a related switch, hive.optimize.skewjoin:

set hive.optimize.skewjoin = true;

Hive must also be told how to recognize special values. It does this through hive.skewjoin.key: with the default value of 100000, any key with more than 100000 records is treated as a special value.

hive.groupby.skewindata therefore does much the same job as hive.map.aggr, only on the reduce side, and it needs an extra job round. For that reason it is not recommended, and its effect is not obvious.

The optimization idea is to deduplicate first and aggregate afterwards.

/* before rewriting */
select a, count(distinct b) as c from tbl group by a;

/* after rewriting */
select a, count(*) as c from (select distinct a, b from tbl) t group by a;

count(distinct) is inefficient on large data volumes, because the rows are distributed by the group by field and sorted by the distinct field, and this distribution is generally very skewed.
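The rewrite can be checked on a tiny table. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, so it only demonstrates that the two statements return the same rows; the table contents are made up, and nothing here measures Hive performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tbl (a text, b integer)")
conn.executemany(
    "insert into tbl values (?, ?)",
    [("x", 1), ("x", 1), ("x", 2), ("y", 3), ("y", 3), ("y", 3)],
)

# Before rewriting: count(distinct) inside the group by.
before = conn.execute(
    "select a, count(distinct b) as c from tbl group by a order by a"
).fetchall()

# After rewriting: deduplicate (a, b) first, then count rows per a.
after = conn.execute(
    "select a, count(*) as c from (select distinct a, b from tbl) t "
    "group by a order by a"
).fetchall()

print(before, after)
```

One caveat when applying the rewrite: count(distinct b) ignores rows where b is NULL, while the rewritten form counts a NULL b as one value, so the two agree only when b has no NULLs or the NULLs are filtered out in the subquery.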

2. Skew caused by join

An example of skew caused by join is joining the website access log with the user table, as described above:

select a.* from logs a join users b on a.user_id = b.user_id;

1. Handling the skewed values separately

Special values are usually tied to the business, so the SQL can also be rewritten from a business point of view. For example, the skewed join above can isolate the special value and join it separately (from a business point of view the users table should contain no user_id = 0 row, but one is assumed here to keep the pattern general):

select t.* from (
  select a.* from (select * from logs where user_id = 0) a
  join (select * from users where user_id = 0) b
  on a.user_id = b.user_id
  union all
  select a.* from logs a join users b
  on a.user_id <> 0 and a.user_id = b.user_id
) t;
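That the split query returns the same rows as the plain join can be verified on sample data. The sketch below again uses sqlite3 in place of Hive; the column url, the column name, and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table logs (user_id integer, url text)")
conn.execute("create table users (user_id integer, name text)")
conn.executemany(
    "insert into logs values (?, ?)",
    [(0, "/a"), (0, "/b"), (0, "/c"), (1, "/a"), (2, "/b"), (5, "/c")],
)
conn.executemany(
    "insert into users values (?, ?)",
    [(0, "guest"), (1, "ann"), (2, "bob")],
)

# The original skewed join.
plain = conn.execute(
    "select a.* from logs a join users b on a.user_id = b.user_id"
).fetchall()

# The special value user_id = 0 is joined on its own, then unioned back.
split = conn.execute(
    """
    select t.* from (
      select a.* from (select * from logs where user_id = 0) a
      join (select * from users where user_id = 0) b
      on a.user_id = b.user_id
      union all
      select a.* from logs a join users b
      on a.user_id <> 0 and a.user_id = b.user_id
    ) t
    """
).fetchall()

print(sorted(plain) == sorted(split))
```

The benefit in Hive comes from the first branch: the hot key is joined against a tiny slice of users, so it no longer competes with the rest of the keys in one shuffle.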

2. Randomizing the skewed keys

select * from log a
left outer join bmw_users b
on case when a.user_id is null then concat('dp_hive', rand()) else a.user_id end = b.user_id;
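What the case expression achieves can be sketched as follows. The model rests on assumptions: all NULL keys are taken to share one reducer, zlib.crc32 stands in for the shuffle hash, and the data and helper names are hypothetical.

```python
import random
import zlib

NUM_REDUCERS = 4


def reducer_for(join_key: str) -> int:
    # Deterministic stand-in for the shuffle hash.
    return zlib.crc32(join_key.encode("utf-8")) % NUM_REDUCERS


def join_key(user_id, rng=None):
    """Mirror the case expression: NULL ids get a random 'dp_hive...' key."""
    if user_id is None:
        if rng is None:
            return "NULL"  # unsalted: every NULL shares one key
        return "dp_hive" + str(rng.random())
    return str(user_id)


def loads(user_ids, rng=None):
    counts = [0] * NUM_REDUCERS
    for uid in user_ids:
        counts[reducer_for(join_key(uid, rng))] += 1
    return counts


# Hypothetical log: 8,000 rows without a user_id, 2,000 with one.
log_user_ids = [None] * 8000 + list(range(1, 2001))
plain_loads = loads(log_user_ids)
salted_loads = loads(log_user_ids, random.Random(42))
print(plain_loads, salted_loads)
```

Because a key of the form dp_hive followed by a random number matches no real user_id, the NULL rows still come out of the left outer join unmatched, exactly as before; they are merely spread over the reducers instead of piling up on one.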

3. Hash skew with character-type keys

Unifying the hash rule: what is the difference between hashing an int and hashing a string?

In essence:

h(1) and h('1') both simply assign the key to a partition. Switching between them does not by itself solve data skew.

The difference is:

h(10) may collide with h(1), because the resulting hash values may be the same, which causes further skew.

And:

h('10') and h('1') produce different hash values, but if the number of partitions is small they may still be assigned to the same partition.

HashPartitioner is the default partitioner for MapReduce.

The reducer is computed as:

reducer = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
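The formula can be tried out with a small re-implementation. This is a sketch under assumptions: Java's String.hashCode() and Integer.hashCode() are used as stand-ins for the key hash, and the hash Hive actually applies to a given column type may differ.

```python
def java_string_hash(s: str) -> int:
    """Java's String.hashCode() in 32-bit signed arithmetic.

    Matches Java for characters in the Basic Multilingual Plane.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def java_int_hash(i: int) -> int:
    """Java's Integer.hashCode() returns the value itself."""
    return i


def hash_partition(hash_code: int, num_reduce_tasks: int) -> int:
    """(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks"""
    return (hash_code & 0x7FFFFFFF) % num_reduce_tasks


# Integer keys 1 and 10 share a reducer when there are 3 reducers.
print(hash_partition(java_int_hash(1), 3), hash_partition(java_int_hash(10), 3))

# String keys '1' and '10' hash to 49 and 1567: different values that
# still share a reducer with 6 reducers, but not with 7.
print(hash_partition(java_string_hash("1"), 6), hash_partition(java_string_hash("10"), 6))
print(hash_partition(java_string_hash("1"), 7), hash_partition(java_string_hash("10"), 7))
```

Either key type can therefore put distinct keys on the same reducer; what matters for the join below is that both sides are hashed under the same rule.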

This is the idea behind the optimization for Problem 2:

Problem 2: joining ids of different data types causes data skew.

The S8 log table has one record per item and must be joined with the product table, but the join runs into skew. The S8 log holds both string item ids and numeric item ids in a column of type string, whereas the item id in the product table is bigint. The suspected cause is that the S8 item ids are converted to numeric ids before being hashed to assign reducers, so all S8 log rows with string ids end up on one reducer. The solution below confirmed this guess.

Method: convert the numeric type to a string type

select * from s8_log a
left outer join r_auction_auctions b
on a.auction_id = cast(b.auction_id as string);
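The guess and the fix can be illustrated with a rough model. It assumes that a non-numeric string becomes NULL when converted to a number and that all NULL keys share one reducer; the ids, the reducer count, and the helper names are hypothetical, and zlib.crc32 stands in for the real string hash.

```python
import zlib

NUM_REDUCERS = 4


def cast_to_number(value: str):
    """Mimic a numeric cast: non-numeric text becomes None (NULL)."""
    try:
        return int(value)
    except ValueError:
        return None


def reducer_for(key) -> int:
    if key is None:
        return 0  # assumption: every NULL key lands on the same reducer
    if isinstance(key, int):
        return key % NUM_REDUCERS
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS


def loads(keys):
    counts = [0] * NUM_REDUCERS
    for k in keys:
        counts[reducer_for(k)] += 1
    return counts


# Hypothetical s8_log ids: 1,000 numeric strings and 9,000 string ids.
log_ids = [str(i) for i in range(1, 1001)] + ["abc%d" % i for i in range(9000)]

numeric_side = loads([cast_to_number(v) for v in log_ids])  # join on numbers
string_side = loads(log_ids)  # join on strings, as in the fix
print(numeric_side, string_side)
```

Under these assumptions the numeric comparison funnels every string id onto one reducer, while comparing as strings keeps the ids distinct and lets them spread out.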
