How to tune Hive SQL

2025-07-16 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to tune Hive SQL through two worked cases. The content is meant to be easy to follow, and I hope it helps clear up some common doubts.

1. Is deduplication with distinct less efficient than with group by?

A rule you will often see online is that Hive tuning must avoid using distinct for deduplication and replace it with group by. But does this hold in all cases? Take a look at the following case.

SELECT count(1)
FROM (
  SELECT s_age
  FROM student_tb_orc
  GROUP BY s_age
) b;

This counts the distinct values of age in the student table. But why not use the distinct version below?

SELECT count(DISTINCT s_age)
FROM student_tb_orc;

We generally assume that when the amount of data is large, the first form avoids data skew on the reduce side. For this query, however, the concise SQL below is actually more efficient.

In the author's run, the two statements took 47s and 28s respectively.

Why is that?

Because the column being deduplicated, s_age, represents age in business terms, its number of distinct values is very limited. s_age is deduplicated in the Map phase, so each Map emits only a limited number of s_age values, and the amount of data reaching the Reduce stage is far too small to cause data skew. In addition, the implementation of group by varies greatly from version to version: some versions deduplicate by building a hash table and some by sorting, and sort-based deduplication cannot reach the O(1) per-key lookup cost of a hash table. Furthermore, the first SQL is translated into two tasks, which consume more disk and network I/O resources. Hive 3.0 has also added a count(distinct) optimization: with "hive.optimize.countdistinct" configured, Hive can automatically rewrite the SQL execution logic even when there is data skew.
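The switch mentioned above can be tried per session. This is a minimal sketch, assuming Hive 3.0 or later and a session that is allowed to override configuration; the query and table name are the ones used in this article.

```sql
-- Print the current value of the count(distinct) rewrite switch.
SET hive.optimize.countdistinct;

-- Enable it for this session so that Hive may rewrite count(distinct)
-- into a two-stage plan.
SET hive.optimize.countdistinct=true;

SELECT count(DISTINCT s_age)
FROM student_tb_orc;
```

Whether the rewrite actually changes the plan depends on the Hive version and the execution engine, so it is worth comparing the plans before and after changing the setting.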

So the first way of writing the SQL above is somewhat over-optimized. Let's move on to the execution flowcharts:

[Figure: execution flow chart of the first SQL]

[Figure: execution flow chart of the second SQL]

[Figure: comparison of the two SQL execution flows]

The time difference between the two SQL statements comes mainly from data transfer and the creation of intermediate tasks, which is the part inside the dotted frame in the comparison figure above. The distinct keyword is therefore more efficient here than the subquery.
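The difference in execution flow can be checked on your own cluster with EXPLAIN, which prints the stages Hive plans without running the query. This is a sketch using the two statements from this article; the plan text depends on the Hive version and the execution engine, so no output is shown here.

```sql
-- Plan for the group by subquery form.
EXPLAIN
SELECT count(1)
FROM (
  SELECT s_age
  FROM student_tb_orc
  GROUP BY s_age
) b;

-- Plan for the count(distinct) form.
EXPLAIN
SELECT count(DISTINCT s_age)
FROM student_tb_orc;
```

Comparing the STAGE DEPENDENCIES section of the two outputs shows how many stages each form needs.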

Of course, if the Spark engine is used here, it saves the time Map1 spends writing to disk and the time Reduce spends reading the intermediate data, so the difference in running time between the two may be even smaller. Still, for SQL of the same complexity, distinct is the better choice because it is more concise.

So under what circumstances is the SQL written the first way more efficient than the SQL written the second way?

When there is data skew, the first form is better.

Once the data reaches a certain order of magnitude, the two jobs of the first form let the processing be split into two stages: the first stage handles part of the work to reduce the amount of data, and the second stage continues on the reduced data set.

With the second form, the Map stage still has a lot of data to process, yet all of its output must go through a single Reduce node, like thousands of troops crossing a single-log bridge. This not only fails to use the advantages of a distributed cluster but also wastes a lot of time waiting, and that waiting time far exceeds the extra time the first form spends on its additional MapReduce job.

However, as mentioned earlier, in Hive 3.0 you can set hive.optimize.countdistinct to true, and then the second form achieves the same effect as the first even when there is data skew.

I tried running the same SQL on my own cluster with the Spark engine. Probably because the amount of data was small, there was little difference: both took about 4s.
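To repeat this kind of comparison, the execution engine can be switched per session. This is a sketch that assumes Hive on Spark is installed and configured on the cluster, which is not the case for every Hive distribution or version; the queries are the two from this article.

```sql
-- Assumption: Hive on Spark is available on this cluster.
SET hive.execution.engine=spark;

SELECT count(1)
FROM (
  SELECT s_age
  FROM student_tb_orc
  GROUP BY s_age
) b;

SELECT count(DISTINCT s_age)
FROM student_tb_orc;

-- Switch back to classic MapReduce to compare timings on the same data.
SET hive.execution.engine=mr;
```

Timings from a small table say little about behavior at scale, so the comparison is most useful on data of realistic size.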

2. Rewrite SQL to optimize union.

Requirement: for each age in the student table, find the earliest and latest birth dates and write them into a table.

So the SQL is as follows:

INSERT INTO TABLE student_stat PARTITION (tp)
SELECT s_age, min(s_birth) stat, 'min' tp
FROM student_tb_txt
GROUP BY s_age
UNION ALL
SELECT s_age, max(s_birth) stat, 'max' tp
FROM student_tb_txt
GROUP BY s_age;
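The statement above writes to student_stat through a dynamic partition column (tp). This is a minimal sketch of the session settings such an insert usually needs, together with a hypothetical table definition; the column types are my assumption, since the article does not show the schema of student_stat.

```sql
-- Hypothetical definition: the column types are assumed, not taken from the article.
CREATE TABLE IF NOT EXISTS student_stat (
  s_age INT,
  stat  STRING
)
PARTITIONED BY (tp STRING);

-- Allow dynamic partition inserts in this session.
SET hive.exec.dynamic.partition=true;

-- Nonstrict mode is needed when every partition column is dynamic, as tp is here.
SET hive.exec.dynamic.partition.mode=nonstrict;

-- After the insert has run, check that both partitions received rows.
SELECT tp, count(*)
FROM student_stat
GROUP BY tp;
```

In a dynamic partition insert the partition column must come last in the select list, which is why tp follows s_age and stat.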

But the union all statement above actually runs as 5 jobs, corresponding to 4 MR tasks, so it is relatively inefficient. How can it be optimized? Can we read the table only once, compute both the minimum and the maximum, and write them to the final result table in turn, without the intermediate union? Look at the following SQL:

FROM student_tb_txt
INSERT INTO TABLE student_stat PARTITION (tp)
SELECT s_age, min(s_birth) stat, 'min' tp
GROUP BY s_age
INSERT INTO TABLE student_stat PARTITION (tp)
SELECT s_age, max(s_birth) stat, 'max' tp
GROUP BY s_age;

This is known as the multi-table insert (multi-output) syntax. Executing the SQL above starts only one Job, so the efficiency improvement is significant.
