The function and usage of order by, sort by, distribute by and cluster by in Hive

2025-04-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

1. Order by

The order by clause in Hive behaves like order by in traditional SQL: it sorts the query results globally. To do so, Hive sends all data through a single reducer whenever a query contains order by, no matter how many map tasks or block files are involved. For a large amount of data, this makes the query slow.

There is one difference from traditional SQL: if hive.mapred.mode=strict is set (the default is nonstrict, at least in older Hive releases), an order by query must also specify limit. Because all rows pass through one reducer, a query over a large amount of data may never return a result, so strict mode forces you to cap the number of output rows.
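The strict-mode rule can be sketched as follows. This is a minimal illustration, assuming the store table described in section 3 and an arbitrary limit of 10; newer Hive releases may control this check through other properties, so treat the set line as tied to the property named above.

```sql
-- Enable strict mode for the current session (property named in the text above).
set hive.mapred.mode=strict;

-- Rejected in strict mode because order by has no limit:
-- select mid, money, name from store order by money desc;

-- Accepted: limit caps the number of rows the single reducer has to return.
-- The value 10 is an arbitrary choice for illustration.
select mid, money, name from store order by money desc limit 10;
```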

2. Sort by

If sort by is specified, Hive sorts the rows within each reducer, so only local ordering is guaranteed: the output of each reducer is ordered, but the data as a whole is not, unless there is only one reducer. The advantage is that the locally sorted outputs make a later global sort cheaper, since a merge sort over them is enough to produce a global order.
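A small sketch of local ordering, assuming the store table from section 3 and the MapReduce execution engine. The reducer count of 2 is chosen only to make the effect visible.

```sql
-- Request two reducers so that sort by produces two separately sorted outputs.
set mapred.reduce.tasks=2;

-- Each reducer sorts only the rows it receives.
-- Every output is ordered by money, but the outputs are not merged into one global order.
select mid, money, name from store sort by money asc;
```

With a single reducer the same statement would return a globally ordered result, which matches the exception noted above.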

3. Distribute by and sort by are used together

Distribute by controls how the map output is divided among the reducers. For example, take a table store in which mid identifies the merchant that a store belongs to, money is the store's profit, and name is the store's name.

Store:

mid    money    name
AA     15.0     Store 1
AA     20.0     Store 2
BB     22.0     Store 3
CC     44.0     Store 4
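To try the statements in this article, the sample table could be created along these lines. The column types are assumptions (the article lists only the values), and insert ... values requires a reasonably recent Hive release.

```sql
-- Hypothetical DDL: string/double types are assumed, not given in the article.
create table store (mid string, money double, name string);

-- Load the four sample rows shown above.
insert into table store values
  ('AA', 15.0, 'Store 1'),
  ('AA', 20.0, 'Store 2'),
  ('BB', 22.0, 'Store 3'),
  ('CC', 44.0, 'Store 4');
```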

Execute the hive statement:

select mid, money, name from store distribute by mid sort by mid asc, money asc

Because distribute by mid is specified, all rows with the same mid are sent to the same reducer, so we can rank the profits of the stores within each merchant. The order within a merchant is complete, not just local, because all of a merchant's rows are processed by the same reducer. Note that distribute by must be written before sort by.
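If the ranking should run from the highest profit to the lowest, one possible variant only changes the direction on money. This is a sketch against the same store table, not a statement from the original article.

```sql
-- Rows with the same mid still go to the same reducer.
-- Within each reducer, rows are ordered by mid, then by money from high to low.
select mid, money, name
from store
distribute by mid
sort by mid asc, money desc;
```

With the sample rows, merchant AA would list Store 2 (20.0) before Store 1 (15.0). Which reducer handles which merchant depends on how Hive hashes mid.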

4. Cluster by

The function of cluster by is to combine distribute by with sort by. The following two statements are equivalent:

select mid, money, name from store cluster by mid

select mid, money, name from store distribute by mid sort by mid

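Cluster by alone cannot reproduce the statement in 3, because it sorts only by the clustering column. Hive also rejects a query that contains both cluster by and sort by, so to sort by money within each merchant, keep the explicit form:

select mid, money, name from store distribute by mid sort by mid asc, money asc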

Note that columns specified by cluster by can only be sorted in ascending order; asc and desc cannot be specified.
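As a brief illustration of that restriction, the sketch below contrasts cluster by with the explicit form needed for a descending sort. It assumes the same store table as in section 3.

```sql
-- cluster by distributes by mid and sorts by mid in ascending order.
-- It takes no asc or desc keyword.
select mid, money, name from store cluster by mid;

-- For descending order on the same column, write the two clauses out.
select mid, money, name from store distribute by mid sort by mid desc;
```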
