In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-02-21 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >
Share
Shulou(Shulou.com)06/01 Report--
This article mainly shows you the "sample analysis of hive partitions and buckets", which is easy to understand and clear. I hope it can help you solve your doubts. Let the editor lead you to study and learn the "sample analysis of hive partitions and buckets".
1. Why zoning?
When the amount of data in a single table becomes larger and larger, the hive query usually scans the whole table, which will waste a lot of time that we don't care about the scanning of the data. Thus hive leads to the concept of partition partition.
two。 How to partition?
Looking at the specific business, you can split a pile of data into multiple heaps of data. The differences between id, year, month, day, region, province, hive partition and mysql partition are commonly used. The in-table field used for the partition field of the mysql. The partition field of hive uses an off-table field.
3.hive partition details?
1. The essence of partitioning is to create a corresponding directory under this table. 2. Partition names are not case-sensitive, so it is recommended not to use Chinese. 3. You can query the partition information. But our partition field is equivalent to a pseudo field that exists in the metadata but does not actually exist in the data content.
4. Specify a partition when loading data
4. Partition operation
Create a first-level partition table:
Create table if not exists day_part (
Uid int
Uname string
)
Partitioned by (year int)
Row format delimited fields terminated by'\ t'
Load data local inpath'/ root/Desktop/student.txt' into table day_part partition (year=2017)
Load data local inpath'/ root/Desktop/score.txt' into table day_part partition (year=2016)
Show partitions day_part
Secondary zoning
Create table if not exists day_part1 (
Uid int
Uname string
)
Partitioned by (year int,month int)
Row format delimited fields terminated by'\ t'
Load data local inpath'/ root/Desktop/student.txt' into table day_part1 partition (year=2017,month=04)
Load data local inpath'/ root/Desktop/score.txt' into table day_part1 partition (year=2017,month=03)
Level 3 zoning:
Create table if not exists day_part2 (
Uid int
Uname string
)
Partitioned by (year int,month int,day int)
Row format delimited fields terminated by'\ t'
Operate on the partition: display the partition:
Show partitions day_part
New partition: empty
Alter table day_part1 add partition (year=2017,month=2)
Alter table day_part1 add partition (year=2017,month=1) partition (year=2016,month=12)
Add a new partition and load the data:
Alter table day_part1 add partition (year=2016,month=11) location "/ user/hive/warehouse/qf1603.db/day_part1/year=2017/month=2"
Modify the storage path corresponding to the partition:
# # the path must be written from hdfs
Alter table day_part1 partition (year=2016,month=11) set location "hdfs://linux1:9000/user/hive/warehouse/qf1603.db/day_part1/year=2017/month=3"
Delete partition: deleting a partition will delete the corresponding partition directory (data)
# # deleting a Partition
Alter table day_part1 drop partition (year=2017,month=2)
# # deleting multiple
Alter table day_part1 drop partition (year=2017,month=3), partition (year=2017,month=4)
Static Partition, dynamic Partition, mixed Partition static Partition: the partition name has been specified when adding a partition or loading partition data. Dynamic partition: the partition name is unknown when adding a partition or loading partition data. Mixed partitions: both static and dynamic partitions exist.
Related properties of dynamic partitions: hive.exec.dynamic.partition=true: whether dynamic partitions are allowed hive.exec.dynamic.partition.mode=strict: partition mode setting nostrict strict: at least one is required to be static partitions nostrict: can all be dynamic partitions hive.exec.max.dynamic.partitions=1000: maximum number of dynamic partitions allowed hive.exec.max.dynamic.partitions.pernode = 100: maximum partitions allowed to be created by mapper/reducer on a single node
Create a temporary table:
# # creating temporary tables
Create table if not exists tmp (
Uid int
Commentid bigint
Recommentid bigint
Year int
Month int
Day int
)
Row format delimited fields terminated by'\ t'
# # loading data
Load data local inpath'/ root/Desktop/comm' into table tmp
Create a dynamic partition:
# # creating dynamic Partition Table
Create table if not exists dyp1 (
Uid int
Commentid bigint
Recommentid bigint
)
Partitioned by (year int,month int,day int)
Row format delimited fields terminated by'\ t'
Load data for dynamic partitions:
# # strict mode
Insert into table dyp1 partition (year=2016,month,day)
Select uid,commentid,recommentid,month,day from tmp
# # non-strict mode
# # setting non-strict Mode dynamic Partition
Set hive.exec.dynamic.partition.mode=nostrict
# # creating dynamic Partition Table
Create table if not exists dyp2 (
Uid int
Commentid bigint
Recommentid bigint
)
Partitioned by (year int,month int,day int)
Row format delimited fields terminated by'\ t'
# # loading data for non-strict Mode dynamic Partition
Insert into table dyp2 partition (year,month,day)
Select uid,commentid,recommentid,year,month,day from tmp
Hive provides us with a strict model: to prevent users from accidentally submitting malicious hql hive.mapred.mode=nostrict: strict
If the schema value is strict, the following three queries will be blocked: 1. For the partition table query, the filter field in where is not a partition field. 2. Cartesian product join query, join query statement, without on condition or where condition.
Select
Stu.id
Stu.name
Score.grade
From student stu
Join score
You can:
Select
Stu.id
Stu.name
Score.grade
From student stu
Join score
Where stu.id = score.uid
3. For order by queries, queries with order by do not contain limit statements.
Select
Student.*
From student
Order by student.id desc
Note: 1, try not to use dynamic partitioning, because when dynamic partitioning, the number of reducer will be allocated for each partition, when the number of partitions is large, the number of reducer will increase, which is a disaster for the server. 2, the difference between dynamic partitions and static partitions, static partitions will create the partition with or without data, dynamic partitions will be created with result sets, otherwise they will not be created. 3. The strict mode of hive dynamic partition and the strict mode of hive.mapred.mode provided by hive.
Split the barrel 1. Why divide the bucket?
Partition data is still very large, and more fine-grained management of partition data or table data. Sub-bucket keywords: clustered by (uid) into n buckets, bucket, sub-bucket using table fields how to split the bucket? Perform a hash value on the bucket field, then model the hash value to the total number of barrels, and then get the number of barrels.
two。 The meaning of dividing buckets:
1. Quick sampling query. Tablesample 2. Reduce the amount of data scanned by the query and improve the query efficiency.
# # create a bucket table and set 4 buckets
Create table if not exists bucket1 (
Uid int
Uname String
)
Clustered by (uid) into 4 buckets
Row format delimited fields terminated by'\ t'
3. Operation of dividing buckets:
Load data for bucket table: sub-bucket cannot use load to load data, but needs iinsert into to load data and needs to set properties:
# # set buckets to enable
Hive > set hive.enforce.bucketing=true
# # incorrect method of loading data
Load data local inpath'/ root/Desktop/student' into table bucket1
# # create a bucket table and set 4 buckets
Create table if not exists bucket7 (
Uid int
Uname String
)
Clustered by (uid) into 4 buckets
Row format delimited fields terminated by'\ t'
# # loading data for bucket table
Insert into table bucket7
Select id,name from student
Bucket query: tablesample (bucket x out of y on uid) Note: X cannot be greater than y x: the starting position of the bucket taken, y: the total number of buckets taken, y is the factor of the total number of barrels. If y is greater than the total number of barrels of the source, it is equivalent to stretching, and if y is less than the total number of barrels of the source, it is equivalent to compressing 1 out of 2 1, 1 1, 5, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2.
1 out of 4 1 1 1 4
Select * from bucket7
Select * from bucket7 tablesample (bucket 1 out of 4 on uid)
Select * from bucket7 tablesample (bucket 2 out of 4 on uid)
Select * from bucket7 tablesample (bucket 1 out of 2 on uid)
Select * from bucket7 tablesample (bucket 2 out of 2 on uid)
Select * from bucket7 tablesample (bucket 3 out of 2 on uid)
Select * from bucket7 tablesample (bucket 1 out of 8 on uid)
Select * from bucket7 tablesample (bucket 5 out of 8 on uid)
Partition + sub-bucket: (qfstu) uid,uname,class,master gender sub-bucket uid basis even sub-bucket query girls in the student number is the cardinal number?
# # creating a Table
Create table if not exists qftmp (
Uid int
Uname string
Class int
Gender int)
Row format delimited fields terminated by'\ t'
# # loading data
Load data local inpath'/ home/qf' into table qftmp
# # creating a dynamic Partition Bucket Table
Create table if not exists qf (
Uid int
Uname string
Class int)
Partitioned by (gender int)
Clustered by (uid) into 2 buckets
Row format delimited fields terminated by'\ t'
# # loading data for dynamic Partition Bucket Table
Insert into table qf partition (gender)
Select uid,uname,class,gender from qftmp
Query the student number of the girls as the base number?
Select * from qf where gender = 2 and uid%2! = 0
Select * from qf tablesample (bucket 2 out of 2 on uid) where gender = 2
The bucket uses internal keywords, and the partition uses external fields. Both are an optimization of hive. The number of partitions and buckets should be set reasonably, not as many as possible.
Sampling:
Select * from student order by rand () limit 3
Select * from student limit 3
Select * from student tablesample (3 rows)
Select * from student tablesample (20B); # # the smallest unit is B
Select * from student tablesample (20 percent); # # percentage
The above is all the contents of the article "sample Analysis of hive Partition and Bucket". Thank you for reading! I believe we all have a certain understanding, hope to share the content to help you, if you want to learn more knowledge, welcome to follow the industry information channel!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.