Example Analysis of hive Partition and Bucket 07/15 Update SLTechnology News&Howtos

Example Analysis of hive Partition and Bucket

2025-07-15 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/01 Report--

This article mainly shows you the "sample analysis of hive partitions and buckets", which is easy to understand and clear. I hope it can help you solve your doubts. Let the editor lead you to study and learn the "sample analysis of hive partitions and buckets".

1. Why zoning?

When the amount of data in a single table becomes larger and larger, the hive query usually scans the whole table, which will waste a lot of time that we don't care about the scanning of the data. Thus hive leads to the concept of partition partition.

two。 How to partition?

Looking at the specific business, you can split a pile of data into multiple heaps of data. The differences between id, year, month, day, region, province, hive partition and mysql partition are commonly used. The in-table field used for the partition field of the mysql. The partition field of hive uses an off-table field.

3.hive partition details?

1. The essence of partitioning is to create a corresponding directory under this table. 2. Partition names are not case-sensitive, so it is recommended not to use Chinese. 3. You can query the partition information. But our partition field is equivalent to a pseudo field that exists in the metadata but does not actually exist in the data content.

4. Specify a partition when loading data

4. Partition operation

Create a first-level partition table:

Create table if not exists day_part (

Uid int

Uname string

)

Partitioned by (year int)

Row format delimited fields terminated by'\ t'

Load data local inpath'/ root/Desktop/student.txt' into table day_part partition (year=2017)

Load data local inpath'/ root/Desktop/score.txt' into table day_part partition (year=2016)

Show partitions day_part

Secondary zoning

Create table if not exists day_part1 (

Uid int

Uname string

)

Partitioned by (year int,month int)

Row format delimited fields terminated by'\ t'

Load data local inpath'/ root/Desktop/student.txt' into table day_part1 partition (year=2017,month=04)

Load data local inpath'/ root/Desktop/score.txt' into table day_part1 partition (year=2017,month=03)

Level 3 zoning:

Create table if not exists day_part2 (

Uid int

Uname string

)

Partitioned by (year int,month int,day int)

Row format delimited fields terminated by'\ t'

Operate on the partition: display the partition:

Show partitions day_part

New partition: empty

Alter table day_part1 add partition (year=2017,month=2)

Alter table day_part1 add partition (year=2017,month=1) partition (year=2016,month=12)

Add a new partition and load the data:

Alter table day_part1 add partition (year=2016,month=11) location "/ user/hive/warehouse/qf1603.db/day_part1/year=2017/month=2"

Modify the storage path corresponding to the partition:

# # the path must be written from hdfs

Alter table day_part1 partition (year=2016,month=11) set location "hdfs://linux1:9000/user/hive/warehouse/qf1603.db/day_part1/year=2017/month=3"

Delete partition: deleting a partition will delete the corresponding partition directory (data)

# # deleting a Partition

Alter table day_part1 drop partition (year=2017,month=2)

# # deleting multiple

Alter table day_part1 drop partition (year=2017,month=3), partition (year=2017,month=4)

Static Partition, dynamic Partition, mixed Partition static Partition: the partition name has been specified when adding a partition or loading partition data. Dynamic partition: the partition name is unknown when adding a partition or loading partition data. Mixed partitions: both static and dynamic partitions exist.

Related properties of dynamic partitions: hive.exec.dynamic.partition=true: whether dynamic partitions are allowed hive.exec.dynamic.partition.mode=strict: partition mode setting nostrict strict: at least one is required to be static partitions nostrict: can all be dynamic partitions hive.exec.max.dynamic.partitions=1000: maximum number of dynamic partitions allowed hive.exec.max.dynamic.partitions.pernode = 100: maximum partitions allowed to be created by mapper/reducer on a single node

Create a temporary table:

# # creating temporary tables

Create table if not exists tmp (

Uid int

Commentid bigint

Recommentid bigint

Year int

Month int

Day int

)

Row format delimited fields terminated by'\ t'

# # loading data

Load data local inpath'/ root/Desktop/comm' into table tmp

Create a dynamic partition:

# # creating dynamic Partition Table

Create table if not exists dyp1 (

Uid int

Commentid bigint

Recommentid bigint

)

Partitioned by (year int,month int,day int)

Row format delimited fields terminated by'\ t'

Load data for dynamic partitions:

# # strict mode

Insert into table dyp1 partition (year=2016,month,day)

Select uid,commentid,recommentid,month,day from tmp

# # non-strict mode

# # setting non-strict Mode dynamic Partition

Set hive.exec.dynamic.partition.mode=nostrict

# # creating dynamic Partition Table

Create table if not exists dyp2 (

Uid int

Commentid bigint

Recommentid bigint

)

Partitioned by (year int,month int,day int)

Row format delimited fields terminated by'\ t'

# # loading data for non-strict Mode dynamic Partition

Insert into table dyp2 partition (year,month,day)

Select uid,commentid,recommentid,year,month,day from tmp

Hive provides us with a strict model: to prevent users from accidentally submitting malicious hql hive.mapred.mode=nostrict: strict

If the schema value is strict, the following three queries will be blocked: 1. For the partition table query, the filter field in where is not a partition field. 2. Cartesian product join query, join query statement, without on condition or where condition.

Select

Stu.id

Stu.name

Score.grade

From student stu

Join score

You can:

Select

Stu.id

Stu.name

Score.grade

From student stu

Join score

Where stu.id = score.uid

3. For order by queries, queries with order by do not contain limit statements.

Select

Student.*

From student

Order by student.id desc

Note: 1, try not to use dynamic partitioning, because when dynamic partitioning, the number of reducer will be allocated for each partition, when the number of partitions is large, the number of reducer will increase, which is a disaster for the server. 2, the difference between dynamic partitions and static partitions, static partitions will create the partition with or without data, dynamic partitions will be created with result sets, otherwise they will not be created. 3. The strict mode of hive dynamic partition and the strict mode of hive.mapred.mode provided by hive.

Split the barrel 1. Why divide the bucket?

Partition data is still very large, and more fine-grained management of partition data or table data. Sub-bucket keywords: clustered by (uid) into n buckets, bucket, sub-bucket using table fields how to split the bucket? Perform a hash value on the bucket field, then model the hash value to the total number of barrels, and then get the number of barrels.

two。 The meaning of dividing buckets:

1. Quick sampling query. Tablesample 2. Reduce the amount of data scanned by the query and improve the query efficiency.

# # create a bucket table and set 4 buckets

Create table if not exists bucket1 (

Uid int

Uname String

)

Clustered by (uid) into 4 buckets

Row format delimited fields terminated by'\ t'

3. Operation of dividing buckets:

Load data for bucket table: sub-bucket cannot use load to load data, but needs iinsert into to load data and needs to set properties:

# # set buckets to enable

Hive > set hive.enforce.bucketing=true

# # incorrect method of loading data

Load data local inpath'/ root/Desktop/student' into table bucket1

# # create a bucket table and set 4 buckets

Create table if not exists bucket7 (

Uid int

Uname String

)

Clustered by (uid) into 4 buckets

Row format delimited fields terminated by'\ t'

# # loading data for bucket table

Insert into table bucket7

Select id,name from student

Bucket query: tablesample (bucket x out of y on uid) Note: X cannot be greater than y x: the starting position of the bucket taken, y: the total number of buckets taken, y is the factor of the total number of barrels. If y is greater than the total number of barrels of the source, it is equivalent to stretching, and if y is less than the total number of barrels of the source, it is equivalent to compressing 1 out of 2 1, 1 1, 5, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2.

1 out of 4 1 1 1 4

Select * from bucket7

Select * from bucket7 tablesample (bucket 1 out of 4 on uid)

Select * from bucket7 tablesample (bucket 2 out of 4 on uid)

Select * from bucket7 tablesample (bucket 1 out of 2 on uid)

Select * from bucket7 tablesample (bucket 2 out of 2 on uid)

Select * from bucket7 tablesample (bucket 3 out of 2 on uid)

Select * from bucket7 tablesample (bucket 1 out of 8 on uid)

Select * from bucket7 tablesample (bucket 5 out of 8 on uid)

Partition + sub-bucket: (qfstu) uid,uname,class,master gender sub-bucket uid basis even sub-bucket query girls in the student number is the cardinal number?

# # creating a Table

Create table if not exists qftmp (

Uid int

Uname string

Class int

Gender int)

Row format delimited fields terminated by'\ t'

# # loading data

Load data local inpath'/ home/qf' into table qftmp

# # creating a dynamic Partition Bucket Table

Create table if not exists qf (

Uid int

Uname string

Class int)

Partitioned by (gender int)

Clustered by (uid) into 2 buckets

Row format delimited fields terminated by'\ t'

# # loading data for dynamic Partition Bucket Table

Insert into table qf partition (gender)

Select uid,uname,class,gender from qftmp

Query the student number of the girls as the base number?

Select * from qf where gender = 2 and uid%2! = 0

Select * from qf tablesample (bucket 2 out of 2 on uid) where gender = 2

The bucket uses internal keywords, and the partition uses external fields. Both are an optimization of hive. The number of partitions and buckets should be set reasonably, not as many as possible.

Sampling:

Select * from student order by rand () limit 3

Select * from student limit 3

Select * from student tablesample (3 rows)

Select * from student tablesample (20B); # # the smallest unit is B

Select * from student tablesample (20 percent); # # percentage

The above is all the contents of the article "sample Analysis of hive Partition and Bucket". Thank you for reading! I believe we all have a certain understanding, hope to share the content to help you, if you want to learn more knowledge, welcome to follow the industry information channel!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.