Introduction to hive partition and bucket operation 07/01 Update SLTechnology News&Howtos

Introduction to hive partition and bucket operation

2025-07-01 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >

Shulou(Shulou.com)06/01 Report--

This article mainly explains "introduction to hive partition and bucket operation". Interested friends may wish to have a look at it. The method introduced in this paper is simple, fast and practical. Next let the editor to take you to learn the "introduction to hive partition and bucket operation"!

Partition operation

The partition of Hive is achieved by starting PARTITION BY when the table is created, and the dimension used to partition is not a column of the actual data, and the flag of the specific partition is given when the content is inserted. You can use the WHERE statement when you want to query the contents of a partition, such as using "WHERE tablename.partition_key > a" to create a table with a partition. The syntax for creating a partition is as follows.

CREATE TABLE table_name (...) PARTITION BY (dt STRING,country STRING)

1. Create a partition

There are no complex types of partitions for creating partition tables in Hive (range partitions, list partitions, hash partitions, mixed partitions, etc.). A partitioned column is not an actual field in the table, but one or more pseudo columns. This means that the information and data of the partition column are not actually saved in the data file of the table.

Create a simple partition table.

Hive > create table partition_test (member_id string,name string) partitioned by (stat_date string,province string) row format delimited fields terminated by','

In this example, two fields, stat_date and province, are created as partition columns. Typically, you need to create a partition in advance before you can use it. For example:

Hive > alter table partition_test add partition (stat_date='2015-01-18)

This creates a partition. You will see that Hive has created a corresponding folder in the HDFS store.

$hadoop fs-ls / user/hive/warehouse/partition_test/stat_date=2015-01-18/user/hive/warehouse/partition_test/stat_date=2015-01-18Universe provincebeijingMurray-shows the partition you just created

Each partition has a separate folder. In the above example, stat_date is the primary level and province is the secondary level.

2. Insert data

Prepare to insert data into partition_test using an auxiliary non-partitioned table, partition_test_input, as follows.

1) to view the structure of the partition_test_input table, the command is as follows.

Hive > desc partition_test_input

2) to view the partition_test_input data, the command is as follows.

Hive > select * from partition_test_input

3) insert data into the partition of partition_test, as follows.

Insert overwrite table partition_test partition (stat_date='2015-01-18) select member_id,name from partition_test_input where stat_date='2015-01-18 'and province='jiangsu'

Insert data into multiple partitions with the following command.

Hive > from partition_test_input insert overwrite table partition_test partition (stat_date='2015-01-18 pro-vinceurs) select member_id,name from partition_test_input where stat_date='2015-01-18' and province='jiangsu' insert overwrite table partition_test partition (stat_date='2015-01-28) select member_id,name from partition_test_input where stat_date='2015-01-28' and province='sichuan' insert overwrite table partition_test partition (stat_date='2015-01-28') Province='beijing') select member_id,name from partition_test_input where stat_date='2015-01-28' and province='beijing'

3. Dynamic partitioning

According to the above method to insert data into the partition table, if the data source is very large, it is very troublesome to write an insert for a partition. The above problems can be well solved by using dynamic partitioning. The dynamic partition can be automatically matched to the corresponding partition according to the data obtained by the query.

Dynamic partitions can be opened with the following settings:

Set hive.exec.dynamic.partition=true;set hive.exec.dynamic.partition.mode=nonstrict

The use of dynamic partitioning is simple, assuming that data is inserted under the partition stat_date='2015-01-18', and it is up to the database to determine which subpartition the province is inserted into. Stat_date is called static partition column, and province is called dynamic partition column.

Hive > insert overwrite table partition_test partition (stat_date='2015-01-18) select member_id,name province from partition_test_input where stat_date='2015-01-18'

Note that dynamic partitions do not allow primary partitions to use dynamic columns and secondary partitions to use static columns, which will cause all primary partitions to create partitions defined by secondary partition static columns.

Hive.exec.max.dynamic.partitions.pernode: the maximum number of partitions allowed to be created per MapReduce Job. If you exceed this number, an error will be reported (default of100).

Hive.exec.max.dynamic.partitions: the maximum number of all partitions allowed to be created by a dml statement (default of100).

Hive.exec.max.created.files: the maximum number of files allowed to be created by all MapReduce Job (default is 10000).

Try to keep the data with the same value of the partition column in the same MapReduce, so that each MapReduce can generate as few new folders as possible, and the data with the same value of the partition column can be put together through DISTRIBUTE BY, as follows.

Hive > insert overwrite table partition_test partition (stat_date,province) select memeber_id,name,stat_date,province from partition_test_input distribute by stat_date,province; bucket operation

The table in Hive can be split into Partition table and BUCKET. The bucket operation is realized through Partition's CLUSTERED BY, and the data in BUCKET can be sorted by SORT BY.

The main functions of BUCKET are as follows.

1) data sampling

2) improve the efficiency of some query operations, such as Map-Side Join.

What needs to be particularly important is that CLUSTERED BY and SORT BY do not affect the import of data, which means that users must be responsible for the import of data, including data decimation and sorting. 'set hive.enforce.bucketing=true' can automatically control the number of Reduce in the last round to match the number of BUCKET. Of course, you can also configure mapred.reduce.tasks to match the number of BUCKET. It is recommended to use:

Hive > set hive.enforce.bucketing=true

Examples of operations are as follows.

1) create a temporary table student_tmp and import the data.

Hive > desc student_tmp;hive > select * from student_tmp

2) create student table.

Hive > create table student (id int,age int,name string) partitioned by (stat_date string) clustered by (id) sorted by (age) into 2 bucketrow format delimited fields terminated by','

3) set the environment variable.

Hive > set hive.enforce.bucketing=true

4) insert data.

Hive > from student_tmp insert overwrite table student partition (stat_date='2015-01-19') select id,age,name where stat_date='2015-01-18' sort by age

5) View the file directory.

$hadoop fs-ls / usr/hive/warehouse/student/stat_date=2015-01-19 /

6) View sampling data.

Hive > select * from student tablesample (bucket 1 out of 2 on id)

Tablesample is a sample statement and the syntax is as follows.

Tablesample (bucket x out of y)

Y must be a multiple or factor of the total number of BUCKET in table.

At this point, I believe you have a deeper understanding of the "introduction to hive partition and bucket operation". You might as well come to the actual operation. Here is the website, more related content can enter the relevant channels to inquire, follow us, continue to learn!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.