In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-03-29 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >
Share
Shulou(Shulou.com)06/01 Report--
Hive in how to achieve full sorting, many novices are not very clear about this, in order to help you solve this problem, the following editor will explain in detail for you, people with this need can come to learn, I hope you can gain something.
Full sort
The sorting keyword of hive is SORT BY, which is intentionally different from the ORDER BY of traditional databases to emphasize the difference between the two-SORT BY can only be sorted within the scope of a single machine.
1.1.1 case 1
Set mapred.reduce.tasks=2
Original value
Select cookie_id,page_id,id fromc02_clickstat_fatdt1
Where cookie_idIN ('1.193.131.218.1288611279693.0)
1.193.148.164.1288609861509.2 113181412886099008861288609901078194082403 684000005
1.193.148.164.1288609861509.2 127001128860563972141288609859828580660473 684000015
1.193.148.164.1288609861509.2 113181412886099165721288609915890452725326 684000018
1.193.131.218.1288611279693.0 01c183da6e4bc50712881288611540109914561053 684000114
1.193.131.218.1288611279693.0 01c183da6e4bc22412881288611414343558274174 684000118
1.193.131.218.1288611279693.0 01c183da6e4bc50712881288611511781996667988 684000121
1.193.131.218.1288611279693.0 01c183da6e4bc22412881288611523640691739999 684000126
1.193.131.218.1288611279693.0 01c183da6e4bc50712881288611540109914561053 684000128
Hive > select cookie_id,page_id,id fromc02_clickstat_fatdt1 where
Cookie_idIN ('1.193.131.218.1288611279693.0)
SORT BY COOKIE_ID,PAGE_ID
Values sorted by SORT
1.193.131.218.1288611279693.0 684000118 01c183da6e4bc22412881288611414343558274174 684000118
1.193.131.218.1288611279693.0 684000114 01c183da6e4bc50712881288611540109914561053 684000114
1.193.131.218.1288611279693.0 684000128 01c183da6e4bc50712881288611540109914561053 684000128
1.193.148.164.1288609861509.2 684000005 113181412886099008861288609901078194082403 684000005
1.193.148.164.1288609861509.2 684000018 113181412886099165721288609915890452725326 684000018
1.193.131.218.1288611279693.0 684000126 01c183da6e4bc22412881288611523640691739999 684000126
1.193.131.218.1288611279693.0 684000121 01c183da6e4bc50712881288611511781996667988 684000121
1.193.148.164.1288609861509.2 684000015 127001128860563972141288609859828580660473 684000015
Select cookie_id,page_id,id fromc02_clickstat_fatdt1
Where cookie_idIN ('1.193.131.218.1288611279693.0)
ORDER BY PAGE_ID,COOKIE_ID
1.193.131.218.1288611279693.0 684000118 01c183da6e4bc22412881288611414343558274174 684000118
1.193.131.218.1288611279693.0 684000126 01c183da6e4bc22412881288611523640691739999 684000126
1.193.131.218.1288611279693.0 684000121 01c183da6e4bc50712881288611511781996667988 684000121
1.193.131.218.1288611279693.0 684000114 01c183da6e4bc50712881288611540109914561053 684000114
1.193.131.218.1288611279693.0 684000128 01c183da6e4bc50712881288611540109914561053 684000128
1.193.148.164.1288609861509.2 684000005 113181412886099008861288609901078194082403 684000005
1.193.148.164.1288609861509.2 684000018 113181412886099165721288609915890452725326 684000018
1.193.148.164.1288609861509.2 684000015 127001128860563972141288609859828580660473 684000015
You can see that the values sorted by SORT and ORDER are not the same. At first I designated two reduce to distribute the data (each sorted). The main reason for the difference is that no reduce key,hive in the above query generates a random number as a reduce key. In this way, the input records are randomly distributed to different reducer machines. To ensure that there are no duplicate cookie_id records between reducer, you can use the DISTRIBUTE BY keyword to specify that the distribution key is cookie_id.
Select cookie_id,country,id,page_id,id fromc02_clickstat_fatdt1 where cookie_idIN ('1.193.131.218.1288611279693.0) distribute by cookie_id SORT BYCOOKIE_ID,page_id
1.193.131.218.1288611279693.0 684000118 01c183da6e4bc22412881288611414343558274174 684000118
1.193.131.218.1288611279693.0 684000126 01c183da6e4bc22412881288611523640691739999 684000126
1.193.131.218.1288611279693.0 684000121 01c183da6e4bc50712881288611511781996667988 684000121
1.193.131.218.1288611279693.0 684000114 01c183da6e4bc50712881288611540109914561053 684000114
1.193.131.218.1288611279693.0 684000128 01c183da6e4bc50712881288611540109914561053 684000128
1.193.148.164.1288609861509.2 684000005 113181412886099008861288609901078194082403 684000005
1.193.148.164.1288609861509.2 684000018 113181412886099165721288609915890452725326 684000018
1.193.148.164.1288609861509.2 684000015 127001128860563972141288609859828580660473 684000015
1.1.2 cases 2
CREATE TABLE if not exists t_order (
Id int,-- order number
Sale_id int,-- selling ID
Customer_id int,-- customer ID
Product _ id int,-- Product ID
Amount int-quantity
) PARTITIONED BY (ds STRING)
Query all sales records in the table and sort by sales ID and quantity:
Set mapred.reduce.tasks=2
Select sale_id, amount from t_order
Sort by sale_id, amount
This query may get an unexpected sort. The data distributed to the two specified reducer may be (sorted by each):
Reducer1:
Sale_id | amount
0 | 100
| 1 | 30 |
| 1 | 50 |
| 2 | 20 |
Reducer2:
Sale_id | amount
0 | 110
0 | 120
3 | 50
4 | 20
Use the DISTRIBUTE BY keyword to specify that the distribution key is sale_id. The modified HQL is as follows:
Set mapred.reduce.tasks=2
Select sale_id, amount from t_order
Distribute by sale_id
Sort by sale_id, amount
This ensures that the number of sales ID in the query's sales record collection is sorted correctly, but the sales ID is not sorted correctly because hive distributes data using hadoop's default HashPartitioner.
This involves the problem of full ordering. There are no more than two solutions:
1.) Do not distribute data, use a single reducer:
Set mapred.reduce.tasks=1
The disadvantage of this method is that the reduce side has become a performance bottleneck, and the results can not be obtained in the case of large amount of data. But in practice, this is still the most common method, because usually sorted queries are designed to get top-ranked results, so you can use the limit clause to greatly reduce the amount of data. After using limit n, the number of data records transferred to the reduce side (stand-alone) is reduced to n * (the number of map).
2.) Modify Partitioner, this method can achieve full sorting. Here you can use the TotalOrderPartitioner that comes with Hadoop (from Yahoo! The TeraSort project, which is a Partitioner developed to support orderly data distribution across reducer, requires a file in SequenceFile format to specify the data range for distribution. If we have generated this file (stored in / tmp/range_key_list, divided into 100 reducer), we can rewrite the above query as
Set mapred.reduce.tasks=100
Set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner
Settotal.order.partitioner.path=/tmp/ range_key_list
Select sale_id, amount from t_order
Cluster by sale_id
Sort by amount
There are many ways to generate this interval file (such as the o.a.h.mapreduce.lib.partition.InputSampler tool that comes with hadoop). The method of generating with Hive is introduced here, for example, there is a t _ salt table sorted by id:
CREATE TABLE if not exists t_sale (
Id int
Name string
Loc string
);
The method to generate interval files distributed by sale_id is:
Create external tablerange_keys (sale_id int)
Row format serde
'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
Stored as
Inputformat
'org.apache.hadoop.mapred.TextInputFormat'
Outputformat
'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
Location'/ tmp/range_key_list'
Insert overwrite table range_keys
Select distinct sale_id
From source t_salesampletable (BUCKET 100OUT OF 100ON rand ()) s
Sort by sale_id
The resulting file (under the / tmp/range_key_list directory) allows TotalOrderPartitioner to distribute the data processed by reduce in an orderly manner as sale_id. The main problem to be considered in interval files is the balance of data distribution, which depends on an in-depth understanding of the data.
Is it helpful for you to read the above content? If you want to know more about the relevant knowledge or read more related articles, please follow the industry information channel, thank you for your support.
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.