What is the use of Order by, Sort by, Distribute by and Cluster By in Hive

2025-07-15 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 05/31 Report

This article introduces the use of Order by, Sort by, Distribute by and Cluster By in Hive. It is meant as a reference for readers who want to understand how the four clauses differ.

The function and usage of Order by, Sort by, Distribute by and Cluster By in Hive

1. Order by

set hive.mapred.mode=nonstrict; (the default in older Hive releases)

set hive.mapred.mode=strict;

Order by works like order by in a relational database: it sorts the output by one or more columns.

The difference from order by in a database is that, when hive.mapred.mode=strict, a limit must be specified or the query fails with an error.

hive> select * from test order by id;

FAILED: Error in semantic analysis: 1:28 In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'id'

Reason: with order by, all data is sent to a single machine for the reduce phase, that is, there is only one reducer. With a large amount of data the query may never produce a result. With limit n, each map task passes on at most n records, so the reducer receives at most n * (number of map tasks) records, which a single reducer can handle.
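The effect of limit on the single reducer can be illustrated with a small simulation. This is only a sketch in plain Python, not Hive code: the splits, the function name order_by_with_limit and the value of n are made up for illustration, and each inner list stands in for the input of one map task.

```python
import heapq

def order_by_with_limit(map_outputs, n):
    """Model ORDER BY id LIMIT n with a single reducer (illustrative only)."""
    # Map side: every map task keeps only its n smallest ids.
    candidates = [heapq.nsmallest(n, split) for split in map_outputs]
    # The single reducer receives at most n * len(map_outputs) records.
    received = [row for part in candidates for row in part]
    # Reduce side: one final sort, then keep the first n rows.
    return sorted(received)[:n], len(received)

# Three hypothetical map tasks, each holding an unsorted split of ids.
splits = [[9, 1, 7, 30], [4, 8, 2, 25], [6, 3, 5, 40]]
top, received_count = order_by_with_limit(splits, n=2)
print(top)             # [1, 2]
print(received_count)  # 6, i.e. n * (number of map tasks)
```

Without the limit, the reducer in this model would receive all 12 ids instead of 6; on a real table that difference is what decides whether one reducer can finish.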

2. Sort by

Sort by is not affected by whether hive.mapred.mode is strict or nonstrict.

Sort by only guarantees that the data within each reducer is sorted by the specified fields.

With sort by you can set the number of reducers (set mapred.reduce.tasks=), so that more data can be processed and output in parallel.

Merging the sorted outputs of the reducers (a merge sort) then gives the complete sorted result.

Note: you can use the limit clause to greatly reduce the amount of data. With limit n, the number of records transferred to the reduce side (a single machine) drops to n * (number of map tasks). Without it, the query may fail to produce a result because the data is too large.

http://www.alidata.org/archives/622
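The difference between per-reducer order and global order can be shown with a short Python model. It is a sketch under assumptions: the round-robin placement is only a stand-in for however Hive assigns rows to reducers when no distribute by is given, and sort_by is a hypothetical helper, not a Hive API.

```python
import heapq

def sort_by(rows, num_reducers):
    """Model SORT BY: each reducer sorts only the rows it received."""
    reducers = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        # Assumption: round-robin placement stands in for Hive's row assignment.
        reducers[i % num_reducers].append(row)
    return [sorted(part) for part in reducers]

ids = [5, 2, 9, 1, 8, 3]
outputs = sort_by(ids, num_reducers=2)
print(outputs)  # [[5, 8, 9], [1, 2, 3]]

# Each output is sorted, but the outputs read one after another are not.
# A merge of the sorted outputs yields the global order.
merged = list(heapq.merge(*outputs))
print(merged)   # [1, 2, 3, 5, 8, 9]
```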

3. Distribute by

The data is divided among different reducers, and therefore different output files, according to the specified fields.

insert overwrite local directory '/home/hadoop/out' select * from test distribute by length(name);

This statement sends rows to different reducers according to the length of name, and each reducer writes its own output file.

length is a built-in function; you can also use other built-in functions or custom (user-defined) functions here.
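As a rough model of what distribute by length(name) does, the following Python sketch assigns each row to a reducer from its key. The names list and the helper distribute_by are invented for illustration, and the plain modulo is an assumption standing in for Hive's hash partitioning.

```python
def distribute_by(rows, key, num_reducers):
    """Model DISTRIBUTE BY: rows with the same key value go to the same reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: key % num_reducers stands in for Hive's hash partitioning.
        reducers[key(row) % num_reducers].append(row)
    return reducers

names = ["amy", "bob", "carl", "dave", "erika", "al"]
files = distribute_by(names, key=len, num_reducers=3)
print(files)  # [['amy', 'bob'], ['carl', 'dave'], ['erika', 'al']]
```

Two things are visible in the output: names of equal length always land together, and one reducer can still hold several lengths (5 and 2 here). The rows inside a reducer are not sorted.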

4. DISTRIBUTE BY with SORT BY

DISTRIBUTE BY controls how the map output is divided among the reducers. It sends rows to different reducers, and therefore different output files, according to the specified fields.

DISTRIBUTE BY is somewhat similar to GROUP BY in that rows with the same key end up in the same reducer. DISTRIBUTE BY controls which reducer receives each row, while SORT BY controls how the rows within each reducer are sorted.

Note: Hive requires the DISTRIBUTE BY clause to appear before the SORT BY clause.
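The division of labor between the two clauses can be sketched in Python. The (dept_id, salary) rows and the helper distribute_then_sort are hypothetical, and the modulo is again only a stand-in for Hive's hash partitioning; the point is that the distribution key and the sort key may differ, and that the sort may be descending.

```python
def distribute_then_sort(rows, dist_key, sort_key, num_reducers, descending=False):
    """Model DISTRIBUTE BY dist_key SORT BY sort_key [DESC]."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # DISTRIBUTE BY decides which reducer receives the row (modulo is a stand-in).
        reducers[dist_key(row) % num_reducers].append(row)
    # SORT BY orders the rows inside each reducer only.
    return [sorted(part, key=sort_key, reverse=descending) for part in reducers]

# Hypothetical (dept_id, salary) rows.
rows = [(1, 300), (2, 500), (1, 900), (2, 100), (3, 700)]
result = distribute_then_sort(
    rows,
    dist_key=lambda r: r[0],   # distribute by dept_id
    sort_key=lambda r: r[1],   # sort by salary
    num_reducers=2,
    descending=True,
)
print(result)  # [[(2, 500), (2, 100)], [(1, 900), (3, 700), (1, 300)]]
```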

5. Cluster By

Cluster by not only has the function of distribute by but also has the function of sort by.

Sorting is ascending only. The fields of DISTRIBUTE BY and SORT BY must be the same, and no sort direction (asc or desc) can be specified.
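Cluster by can be modeled as the two steps applied with one and the same key. The sketch below is plain Python with a made-up helper cluster_by and a modulo standing in for Hive's hash partitioning; it is not how Hive implements the clause.

```python
def cluster_by(rows, key, num_reducers):
    """Model CLUSTER BY key as DISTRIBUTE BY key followed by SORT BY key."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Same key decides the reducer (modulo is a stand-in for Hive's hashing).
        reducers[key(row) % num_reducers].append(row)
    # Same key orders the rows inside each reducer, ascending only.
    return [sorted(part, key=key) for part in reducers]

ids = [7, 2, 5, 4, 1, 8]
clustered = cluster_by(ids, key=lambda x: x, num_reducers=2)
print(clustered)  # [[2, 4, 8], [1, 5, 7]]
```

Note that the function has no parameter for a second key or for a descending sort, which mirrors the restriction on the clause itself.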

Summary:

ORDER BY performs a global sort, but with a large amount of data it takes a long time because a single reducer does all the work.

SORT BY sorts the output of each reducer separately and cannot guarantee global order.

DISTRIBUTE BY divides data among different reducers according to the specified fields.

When the fields of DISTRIBUTE BY and SORT BY are the same, you can use CLUSTER BY in place of DISTRIBUTE BY with SORT BY.
