There are several sort operations in Hive 07/01 Update SLTechnology News&Howtos

There are several sort operations in Hive

2025-07-01 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/01 Report--

This article mainly shows you "there are several sorting operations in Hive". The content is simple and clear. I hope it can help you solve your doubts. Let me lead you to study and learn this article "there are several sorting operations in Hive".

Four kinds of sorting in Hive

Sorting operation is a relatively common operation, especially in data analysis, we often need to sort the data. There are four keywords related to sorting in hive. Today we will see what they all do.

Data preparation

Below we have a piece of temperature data, tab segmentation

2008 32.0 2008 21.0 2008 31.5 2008 17.0 2013 34.0 2015 32.0 2015 33.0 2015 15.9 2015 31.0 2015 19.9 2015 27.0 2016 23.0 2016 39.9 2016 32.0

Build table and load data

Create table ods_temperature (`year`int, temper float) row format delimited fields terminated by'\ t'; load data local inpath'/ Users/liuwenqiang/workspace/hive/temperature.data' overwrite into table ods_temperature

1. Order by (global sort)

Order by sorts inputs globally, so there is only one Reducer (multiple Reducer cannot guarantee global order), while only one reducer consumes longer computing time when the input is large.

Descending order: desc

Ascending order: asc does not need to be specified. The default is ascending order.

It should be noted that it is affected by hive.mapred.mode, in strict mode, we must use limit to limit the amount of data sorted, because the amount of data is very large and there is only one reducer, there will be OOM or long running time, so in strict mode, if it is not applicable to limit, it will report an error. For more information, please refer to Hive's strict mode and local mode.

Error: Error while compiling statement: FAILED: SemanticException 1:39 Order by-s without limit are disabled for safety reasons. If you know what you are doing, please set hive.strict.checks.orderby.no.limit to false and make sure that hive.mapred.mode is not set to 'strict' to proceed. Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features.. Error encountered near token 'year' (state=42000,code=40000)

Next, let's take a look at the ranking result of order by, select * from ods_temperature order by year.

2. Sort by (sorting within partitions)

Instead of a global sort, it sorts the data before it enters the reducer, that is, it produces a sorted file for each reducer before the data enters the reduce. Therefore, if you sort with sort by and set mapreduce.job.reduces > 1, sort by only guarantees that the output of each reducer is ordered, not globally.

It is not affected by the Hive.mapred.mode attribute, and the data of sort by can only guarantee that the data in the same reduce can be sorted by the specified field. With sort by you can specify the number of reduce executed (specified by set mapred.reduce.tasks=n), and then perform merge sorting on the output data to get all the results.

Set mapred.reduce.tasks=3; select * from ods_temperature sort by year

It is found that the output above does not seem to show anything, only that it is not in order. Haha, let's output the data to a file in a different way, because we set the reduce number to 3, so there should be three file outputs.

Set mapred.reduce.tasks=3; insert overwrite local directory'/ Users/liuwenqiang/workspace/hive/sort' row format delimited fields terminated by'\ t 'select * from ods_temperature sort by year

We can see that this is much clearer. We can see that the year in a partition does not agree, and the data for that year are all available.

Execution efficiency of sort by and order by

First of all, let's look at a phenomenon, generally speaking, we think that sort by should be faster than order by, because order by can only use one reducer to sort all, but not necessarily when the amount of data is relatively small, because the startup of reducer may take a long time to process data, as in the following example, order by is faster than sort by.

Limt in sort by

You can reduce the amount of data by using the limit clause in sort by. After using limit n, the number of data records transferred to the reduce side will be reduced to n * (the number of map), that is, what we use limit in sort by to limit the number of data in each reducer, and then order by according to the sorting field of sort by, and finally return n pieces of data to the client, that is to say, if you use the limit clause in sort by, you will still use order by for the final sorting.

Using limit in order by is to limit the sorted result file and then give it to reducer. You can see that the limit clause in sort by reduces the amount of data participating in sorting, while not in order by, it only limits the amount of data returned to the client.

From the above execution efficiency, we can see that sort by limit is almost twice as efficient as order by limit, so we can guess that there should be some extra link.

Next, let's take a look at the implementation plans of order by limit and sort by limit respectively.

Explain select * from ods_temperature order by year limit 2

Explain select * from ods_temperature sort by year limit 2

You can see it from the screenshot above where I circled it.

Sort by limit has an extra stage (order limit) than order by limit.

Sort by limit actually executes limit twice, reducing the amount of data involved in sorting

3. Distribute by (data distribution)

Distribute by controls how the data is split to the reduce side on the map side. Similar to partitioning partationer in MapReduce to partition data

Hive distributes the data to the corresponding reducer according to the following columns of distribute by. By default, hash algorithm + remainder is used.

Sort by generates a sort file for each reduce, and in some cases you need to control which reducer a particular line should go to, usually for subsequent aggregation operations. Distribute by can just do it. Therefore, distribute by is often used in conjunction with sort by.

For example, in the example of sort by above, we find that the data for different years are not in the same file, that is, not in the same reducer. Let's take a look at how to output the same years together, and then sort them in ascending order of temperature.

First, let's try the implementation of SQL without distribute by

Insert overwrite local directory'/ Users/liuwenqiang/workspace/hive/sort' row format delimited fields terminated by'\ t 'select * from ods_temperature sort by temper

It is found that the data of the same year are not allocated together, so let's use distribute by

Insert overwrite local directory'/ Users/liuwenqiang/workspace/hive/sort' row format delimited fields terminated by'\ t 'select * from ods_temperature distribute by year sort by temper

Now we see that the same year is put together for a while, and we can see that 2013 and 2016 are put together, but there is no certain order. At this time, we can sort the distribute by field again.

Insert overwrite local directory'/ Users/liuwenqiang/workspace/hive/sort' row format delimited fields terminated by'\ t 'select * from ods_temperature distribute by year sort by year,temper

4. Cluster by

Cluster by not only has the function of distribute by but also has the function of sort by. However, sorting can only be in ascending order, and the collation cannot be specified as ASC or DESC.

When the partition field is the same as the sort field, cluster by can simplify the SQL writing of distribute by+sort by, that is, when the distribute by and sort by fields are the same, you can use cluster by instead of distribute by and sort by

Insert overwrite local directory'/ Users/liuwenqiang/workspace/hive/sort' row format delimited fields terminated by'\ t 'select * from ods_temperature distribute by year sort by year

Insert overwrite local directory'/ Users/liuwenqiang/workspace/hive/sort' row format delimited fields terminated by'\ t 'select * from ods_temperature cluster by year

We see that the output of the above two SQL writing methods is the same, which proves our argument that when the distribute by and sort by fields are the same, we can use cluster by instead of distribute by and sort by.

When you try to specify a sort direction for cluster by, you will get the following error.

Error: Error while compiling statement: FAILED: ParseException line 2:46 extraneous input 'desc' expecting EOF near' (state=42000,code=40000)

Order by is a global sort, and the performance may be poor.

The sort by partition is orderly and often cooperates with distribute by to determine what data is in the partition.

Distribute by determines the rules for data distribution, and data that meets the same conditions is distributed to a reducer.

Cluster by can use cluster by instead of distribute by and sort by when the distribute by and sort by fields are the same, but cluster by defaults to ascending order and cannot specify the sorting direction

Sort by limit is equivalent to the data limit of each reduce, followed by order by and then limit

The above is all the content of the article "there are several sorting operations in Hive". Thank you for reading! I believe we all have a certain understanding, hope to share the content to help you, if you want to learn more knowledge, welcome to follow the industry information channel!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.