How to sort in Hive

2025-07-01 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to sort data in Hive. It covers the four sorting clauses in turn: order by, sort by, distribute by, and cluster by.

1. Global sorting: order by

The order by clause appears at the end of the select statement and sorts the final result. The default is ascending order (ASC); write DESC after a field name to sort that field in descending order.

order by performs a global sort, which uses only one reducer.

-- sort by alias

select empno, ename, job, mgr, sal + nvl(comm, 0) salcomm, deptno from emp order by salcomm desc;

-- multi-column sorting

select empno, ename, job, mgr, sal + nvl(comm, 0) salcomm, deptno from emp order by deptno, salcomm desc;
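Because order by passes every row through a single reducer, it is often paired with limit when only the first few rows are needed. The following is a minimal sketch that assumes the same emp table as above; the limit of 5 is an arbitrary illustrative value.

```sql
-- Sketch: the five highest earners by salary plus commission.
-- Assumes the emp table used above; the limit value 5 is illustrative.
select empno, ename, sal + nvl(comm, 0) salcomm
from emp
order by salcomm desc
limit 5;
```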

2. Sorting within each reducer: sort by

order by is inefficient for large-scale data. In many business scenarios we do not need globally ordered data, so we can use sort by instead. sort by produces one sorted file per reducer: the rows are sorted within each reducer, which gives locally ordered results.

-- set the number of reducers

set mapreduce.job.reduces=2;

-- view employee information in descending order of salary

select * from emp sort by sal desc;

-- write the query results to a local directory; this generates two output files, each internally sorted in descending order of salary

insert overwrite local directory '/home/hadoop/output/sortsal' select * from emp sort by sal desc;
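sort by accepts several sort keys, just like order by, but on its own it does not control which reducer receives a given row. The sketch below assumes the same emp table and two reducers; each output is ordered by deptno and then by sal, yet rows with the same deptno may still appear in both outputs.

```sql
-- Sketch: local ordering on two keys, assuming the emp table used above.
set mapreduce.job.reduces=2;

-- Each reducer's output is sorted by deptno ascending, then sal descending.
-- Rows of one department can still be spread across both reducers.
select empno, ename, deptno, sal
from emp
sort by deptno asc, sal desc;
```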

3. Partition sorting: distribute by

distribute by sends rows with the same field value to the same reducer, which facilitates subsequent aggregation and sorting. It is similar to the partitioning step in MapReduce and can be combined with sort by to make the data within each partition ordered. distribute by must be written before sort by.

-- split the data into three partitions, each containing data

set mapreduce.job.reduces=3;

insert overwrite local directory '/home/hadoop/output/distBy1' select empno, ename, job, deptno, sal + nvl(comm, 0) salcomm from emp distribute by deptno sort by salcomm desc;
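When there are fewer reducers than distinct deptno values, several departments land on the same reducer, and sorting by salcomm alone interleaves their rows. One way to keep each department's rows together is to sort by the distribution key first. This is a sketch under that assumption, using the same emp table and an illustrative reducer count of 2.

```sql
-- Sketch: fewer reducers than departments (reducer count 2 is illustrative).
set mapreduce.job.reduces=2;

-- Rows are routed by deptno; within each reducer they are ordered by
-- deptno first, then by salcomm descending, so departments stay contiguous.
select empno, ename, deptno, sal + nvl(comm, 0) salcomm
from emp
distribute by deptno
sort by deptno, salcomm desc;
```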

4. cluster by

When distribute by and sort by use the same field, you can use cluster by to simplify the syntax. cluster by sorts in ascending order only; you cannot specify ASC or DESC.

-- the following two statements are equivalent

select * from emp distribute by deptno sort by deptno;

select * from emp cluster by deptno;
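Since cluster by cannot sort in descending order, a descending result on the same field needs the longer form. A minimal sketch, assuming the same emp table:

```sql
-- Sketch: same routing as cluster by deptno, but descending within each reducer.
select *
from emp
distribute by deptno
sort by deptno desc;
```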

Sort summary:

order by: performs a global sort through a single reducer, which is inefficient on large data. Use it cautiously in production environments.

sort by: makes the data locally ordered (within each reducer).

distribute by: routes rows to reducers according to the specified fields; often used together with sort by to make the data in each partition ordered.

cluster by: when distribute by and sort by use the same field, you can use cluster by to simplify the syntax.

That is everything in this article on how to sort in Hive. Thank you for reading!
