How to realize full sorting in hive 04/19 Update SLTechnology News&Howtos

Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

How to realize full sorting in hive

2025-04-19 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >

Share

Shulou(Shulou.com)06/01 Report--

Hive in how to achieve full sorting, many novices are not very clear about this, in order to help you solve this problem, the following editor will explain in detail for you, people with this need can come to learn, I hope you can gain something.

Full sort

The sorting keyword of hive is SORT BY, which is intentionally different from the ORDER BY of traditional databases to emphasize the difference between the two-SORT BY can only be sorted within the scope of a single machine.

1.1.1 case 1

Set mapred.reduce.tasks=2

Original value

Select cookie_id,page_id,id fromc02_clickstat_fatdt1

Where cookie_idIN ('1.193.131.218.1288611279693.0)

1.193.148.164.1288609861509.2 113181412886099008861288609901078194082403 684000005

1.193.148.164.1288609861509.2 127001128860563972141288609859828580660473 684000015

1.193.148.164.1288609861509.2 113181412886099165721288609915890452725326 684000018

1.193.131.218.1288611279693.0 01c183da6e4bc50712881288611540109914561053 684000114

1.193.131.218.1288611279693.0 01c183da6e4bc22412881288611414343558274174 684000118

1.193.131.218.1288611279693.0 01c183da6e4bc50712881288611511781996667988 684000121

1.193.131.218.1288611279693.0 01c183da6e4bc22412881288611523640691739999 684000126

1.193.131.218.1288611279693.0 01c183da6e4bc50712881288611540109914561053 684000128

Hive > select cookie_id,page_id,id fromc02_clickstat_fatdt1 where

Cookie_idIN ('1.193.131.218.1288611279693.0)

SORT BY COOKIE_ID,PAGE_ID

Values sorted by SORT

1.193.131.218.1288611279693.0 684000118 01c183da6e4bc22412881288611414343558274174 684000118

1.193.131.218.1288611279693.0 684000114 01c183da6e4bc50712881288611540109914561053 684000114

1.193.131.218.1288611279693.0 684000128 01c183da6e4bc50712881288611540109914561053 684000128

1.193.148.164.1288609861509.2 684000005 113181412886099008861288609901078194082403 684000005

1.193.148.164.1288609861509.2 684000018 113181412886099165721288609915890452725326 684000018

1.193.131.218.1288611279693.0 684000126 01c183da6e4bc22412881288611523640691739999 684000126

1.193.131.218.1288611279693.0 684000121 01c183da6e4bc50712881288611511781996667988 684000121

1.193.148.164.1288609861509.2 684000015 127001128860563972141288609859828580660473 684000015

Select cookie_id,page_id,id fromc02_clickstat_fatdt1

Where cookie_idIN ('1.193.131.218.1288611279693.0)

ORDER BY PAGE_ID,COOKIE_ID

1.193.131.218.1288611279693.0 684000118 01c183da6e4bc22412881288611414343558274174 684000118

1.193.131.218.1288611279693.0 684000126 01c183da6e4bc22412881288611523640691739999 684000126

1.193.131.218.1288611279693.0 684000121 01c183da6e4bc50712881288611511781996667988 684000121

1.193.131.218.1288611279693.0 684000114 01c183da6e4bc50712881288611540109914561053 684000114

1.193.131.218.1288611279693.0 684000128 01c183da6e4bc50712881288611540109914561053 684000128

1.193.148.164.1288609861509.2 684000005 113181412886099008861288609901078194082403 684000005

1.193.148.164.1288609861509.2 684000018 113181412886099165721288609915890452725326 684000018

1.193.148.164.1288609861509.2 684000015 127001128860563972141288609859828580660473 684000015

You can see that the values sorted by SORT and ORDER are not the same. At first I designated two reduce to distribute the data (each sorted). The main reason for the difference is that no reduce key,hive in the above query generates a random number as a reduce key. In this way, the input records are randomly distributed to different reducer machines. To ensure that there are no duplicate cookie_id records between reducer, you can use the DISTRIBUTE BY keyword to specify that the distribution key is cookie_id.

Select cookie_id,country,id,page_id,id fromc02_clickstat_fatdt1 where cookie_idIN ('1.193.131.218.1288611279693.0) distribute by cookie_id SORT BYCOOKIE_ID,page_id

1.193.131.218.1288611279693.0 684000118 01c183da6e4bc22412881288611414343558274174 684000118

1.193.131.218.1288611279693.0 684000126 01c183da6e4bc22412881288611523640691739999 684000126

1.193.131.218.1288611279693.0 684000121 01c183da6e4bc50712881288611511781996667988 684000121

1.193.131.218.1288611279693.0 684000114 01c183da6e4bc50712881288611540109914561053 684000114

1.193.131.218.1288611279693.0 684000128 01c183da6e4bc50712881288611540109914561053 684000128

1.193.148.164.1288609861509.2 684000005 113181412886099008861288609901078194082403 684000005

1.193.148.164.1288609861509.2 684000018 113181412886099165721288609915890452725326 684000018

1.193.148.164.1288609861509.2 684000015 127001128860563972141288609859828580660473 684000015

1.1.2 cases 2

CREATE TABLE if not exists t_order (

Id int,-- order number

Sale_id int,-- selling ID

Customer_id int,-- customer ID

Product _ id int,-- Product ID

Amount int-quantity

) PARTITIONED BY (ds STRING)

Query all sales records in the table and sort by sales ID and quantity:

Set mapred.reduce.tasks=2

Select sale_id, amount from t_order

Sort by sale_id, amount

This query may get an unexpected sort. The data distributed to the two specified reducer may be (sorted by each):

Reducer1:

Sale_id | amount

0 | 100

| 1 | 30 |

| 1 | 50 |

| 2 | 20 |

Reducer2:

Sale_id | amount

0 | 110

0 | 120

3 | 50

4 | 20

Use the DISTRIBUTE BY keyword to specify that the distribution key is sale_id. The modified HQL is as follows:

Set mapred.reduce.tasks=2

Select sale_id, amount from t_order

Distribute by sale_id

Sort by sale_id, amount

This ensures that the number of sales ID in the query's sales record collection is sorted correctly, but the sales ID is not sorted correctly because hive distributes data using hadoop's default HashPartitioner.

This involves the problem of full ordering. There are no more than two solutions:

1.) Do not distribute data, use a single reducer:

Set mapred.reduce.tasks=1

The disadvantage of this method is that the reduce side has become a performance bottleneck, and the results can not be obtained in the case of large amount of data. But in practice, this is still the most common method, because usually sorted queries are designed to get top-ranked results, so you can use the limit clause to greatly reduce the amount of data. After using limit n, the number of data records transferred to the reduce side (stand-alone) is reduced to n * (the number of map).

2.) Modify Partitioner, this method can achieve full sorting. Here you can use the TotalOrderPartitioner that comes with Hadoop (from Yahoo! The TeraSort project, which is a Partitioner developed to support orderly data distribution across reducer, requires a file in SequenceFile format to specify the data range for distribution. If we have generated this file (stored in / tmp/range_key_list, divided into 100 reducer), we can rewrite the above query as

Set mapred.reduce.tasks=100

Set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner

Settotal.order.partitioner.path=/tmp/ range_key_list

Select sale_id, amount from t_order

Cluster by sale_id

Sort by amount

There are many ways to generate this interval file (such as the o.a.h.mapreduce.lib.partition.InputSampler tool that comes with hadoop). The method of generating with Hive is introduced here, for example, there is a t _ salt table sorted by id:

CREATE TABLE if not exists t_sale (

Id int

Name string

Loc string

);

The method to generate interval files distributed by sale_id is:

Create external tablerange_keys (sale_id int)

Row format serde

'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'

Stored as

Inputformat

'org.apache.hadoop.mapred.TextInputFormat'

Outputformat

'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'

Location'/ tmp/range_key_list'

Insert overwrite table range_keys

Select distinct sale_id

From source t_salesampletable (BUCKET 100OUT OF 100ON rand ()) s

Sort by sale_id

The resulting file (under the / tmp/range_key_list directory) allows TotalOrderPartitioner to distribute the data processed by reduce in an orderly manner as sale_id. The main problem to be considered in interval files is the balance of data distribution, which depends on an in-depth understanding of the data.

Is it helpful for you to read the above content? If you want to know more about the relevant knowledge or read more related articles, please follow the industry information channel, thank you for your support.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Weibo

Weibo

Tencent

Tencent

Renren

Renren

QQZone

QQZone

Douban

Douban

More

More

Weibo

Weibo

Tencent

Tencent

Renren

Renren

QQZone

QQZone

Douban

Douban

Yixin

Yixin

Related

Build zoopker+hbase environment

Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope

2023-12-25 21:17:29 shulou Views: 339
Lenovo Rescue Y700 2023 tablet push ZUI 15.0.723 system Grayscale Test: add

CTOnews.com, December 23, ZUI officially announced that the Rescue Y700 2023 tablet will open the ZU

2023-12-25 21:14:50 shulou Views: 359
Cybertruck: future species redefine cars

"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un

2023-12-25 21:13:36 shulou Views: 364
An example Analysis of the Wechat Web Landing Authorization of the Wechat Public platform of php version

This article mainly shows you the "php version of Wechat public platform Wechat web login authorizat

2023-12-25 20:31:50 shulou Views: 270
What are the relevant knowledge points of PHP class

2023-12-25 20:31:29 shulou Views: 322

Servers

More Servers >

Latest Network Security More Network Security >

Latest Internet Technology More Internet Technology >

Latest Development More Development >

Latest Database More Database >

Latest Servers More Servers >

Latest Mobile Phone More Mobile Phone >

Latest Android Software More Android Software >

Latest Apple Software More Apple Software >

Latest Computer Software News More Computer Software News >

Latest IT Information More IT Information >

Wechat

About us Contact us Product review car news thenatureplanet

© 2024 shulou.com SLNews company. All rights reserved.

12

Report