Sample Analysis of Partitions and Buckets in Hive
This article shares a sample analysis of partitions and buckets in Hive. The material is practical, so work through it as a reference.
1. Hive partition table
A Hive SELECT query normally scans the contents of the entire table, which wastes a lot of time on unnecessary work when only part of the data is of interest. To let queries scan just the relevant portion, Hive introduces the concept of partitions at table-creation time: a partitioned table sets aside a separate partition storage space for each partition declared when the table is created.
Hive can partition the data by one or more columns; the following scenario illustrates the idea.
Internet applications today store large volumes of log files every day, easily a few gigabytes, tens of gigabytes, or more. Every log record carries the date on which it was generated, so the table can be divided by that date column, treating each day's logs as one partition.
Organizing data into partitions mainly improves query speed. It is up to the user to decide which partition each stored record goes into; that is, when loading data the user must explicitly specify the partition the data belongs to, as the sketch below illustrates.
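As an illustration of the log scenario above, a table definition might look like the following. The table name, columns, and file path here are hypothetical, not part of the examples later in this article:
-- hypothetical daily-log table; each distinct log_date value becomes its own partition
create table access_logs (ip string, url string, status int)
partitioned by (log_date string)
row format delimited fields terminated by '\t' stored as textfile;
-- loading must explicitly name the target partition:
load data local inpath '/tmp/logs-2013-08-16.txt' overwrite into table access_logs partition (log_date='2013-08-16');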
1.1 Implementation details
1. A table can have one or more partitions; each partition exists as a folder under the table's folder in the warehouse directory (see the sketch after this list).
2. Table and column names are not case-sensitive.
3. A partition exists in the table structure as a field: the describe table command shows the field, but the field stores no actual data content; it only marks the partition (a pseudo column).
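A minimal sketch of points 1 and 3, using the invites table created in section 1.2 below; the describe output and warehouse path are illustrative:
hive> describe invites;
id      int
name    string
ds      string    -- the partition pseudo column: listed, but stores no data
-- each partition is a folder under the table's folder in HDFS:
/user/hive/warehouse/invites/ds=2013-08-16/
/user/hive/warehouse/invites/ds=2013-08-20/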
1.2 Syntax
1. Create a partition table with ds as the partition column:
create table invites (id int, name string) partitioned by (ds string) row format delimited fields terminated by '\t' stored as textfile;
2. Add data to the partition for 2013-08-16:
load data local inpath '/home/hadoop/Desktop/data.txt' overwrite into table invites partition (ds='2013-08-16');
3. Add data to the partition for 2013-08-20:
load data local inpath '/home/hadoop/Desktop/data.txt' overwrite into table invites partition (ds='2013-08-20');
4. Query data from a partition:
select * from invites where ds = '2013-08-12';
5. Add data to a partition of a partition table:
insert overwrite table invites partition (ds='2013-08-12') select id, max(name) from test group by id;
You can view the details of a partition with the command:
hadoop fs -ls /home/hadoop.hive/warehouse/invites
Or:
show partitions tablename;
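For the invites table loaded above, the output would look roughly like this (illustrative):
hive> show partitions invites;
ds=2013-08-16
ds=2013-08-20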
2. Hive buckets
For each table or partition, Hive can further organize the data into buckets; that is, buckets divide the data at a finer granularity. Hive buckets on a single column: it hashes the column value and takes the remainder after dividing by the number of buckets, and that remainder determines which bucket the record is stored in.
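To make the rule concrete, the bucket index can be computed by hand with Hive's built-in hash() and pmod() functions. This query is illustrative rather than part of the original walkthrough (it refers to the users table that appears later in this section):
-- bucket = hash(column value) mod (number of buckets); here, 4 buckets
hive> SELECT id, pmod(hash(id), 4) AS bucket FROM users;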
There are two reasons to organize tables (or partitions) into buckets:
(1) To obtain higher query-processing efficiency. Buckets add extra structure to the table that Hive can exploit for certain queries. In particular, a join of two tables that are bucketed on the same column (the join column) can be implemented efficiently as a map-side join: since equal column values land in corresponding buckets, the JOIN only has to compare bucket against bucket, which greatly reduces the amount of data each JOIN processes.
(2) To make sampling more efficient. When working with large datasets, it is very convenient to be able to run trial queries against a small slice of the data while developing and refining them.
1. Create a bucketed table:
create table bucketed_user (id int, name string) clustered by (id) sorted by (name) into 4 buckets row format delimited fields terminated by '\t' stored as textfile;
First, let's look at how to tell Hive that a table should be divided into buckets. We use the CLUSTERED BY clause to specify the bucketing column and the number of buckets:
CREATE TABLE bucketed_user (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
Here, the user ID determines the bucket: Hive hashes the value and takes the remainder after dividing by the number of buckets, so any given bucket ends up with an effectively random subset of users.
For a map-side join, the two tables must be bucketed in the same way. A mapper processing a bucket of the left table knows that the matching rows of the right table are in the corresponding bucket, so it only needs to fetch that one bucket (a small fraction of the data stored in the right table) to perform the join. This optimization does not strictly require equal bucket counts; the two tables' bucket counts may also be multiples of one another. For joining bucketed tables in HiveQL, see the "Map Joins" section (P400).
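A minimal sketch of a bucket map join under these conditions, assuming a second table bucketed_orders bucketed on the same id column (the second table is hypothetical; the property and hint are standard Hive, though exact behavior varies by version):
set hive.optimize.bucketmapjoin = true;
SELECT /*+ MAPJOIN(o) */ u.id, u.name, o.total
FROM bucketed_users u JOIN bucketed_orders o ON u.id = o.id;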
The data within a bucket can additionally be sorted by one or more columns. This turns the join on each pair of buckets into an efficient merge sort, which can further improve the efficiency of map-side joins (a sketch of the relevant settings follows the declaration). The following syntax declares a table whose buckets are sorted:
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
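When both join tables are bucketed and sorted on the join column like this, the join can run as a sort-merge bucket (SMB) map join. A hedged sketch of the settings commonly used; the property names are standard Hive, but defaults and exact behavior depend on the version:
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;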
How can we ensure that the data in a table really is divided into buckets? Loading data generated outside Hive into a bucketed table is certainly possible, but it is easier to let Hive do the bucketing, usually from an existing table.
Hive does not check whether the buckets in the data files are consistent with the buckets in the table definition (either in the number of buckets or in the bucketing columns). If the two do not match, you may get errors or undefined results at query time. It is therefore recommended to let Hive perform the bucketing.
Take a users table that is not divided into buckets:
hive> SELECT * FROM users;
0 Nat
2 Joe
3 Kay
4 Ann
2. Force multiple reducers for the output:
To populate the bucketed table, set the hive.enforce.bucketing property to true ①, so that Hive knows to create the number of buckets declared in the table definition, and then use an INSERT command. Note that CLUSTERED BY and SORTED BY do not affect how data is imported: when loading data by other means, the user is responsible for bucketing and sorting it correctly.
set hive.enforce.bucketing = true automatically makes the number of reducers in the final round match the number of buckets. Users can instead set mapred.reduce.tasks to the bucket count themselves, but set hive.enforce.bucketing = true is the recommended approach.
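The two options side by side; the property names come from the text and note ① below, and the value 4 assumes the 4-bucket table defined earlier:
-- recommended: Hive derives the reducer count from the table definition
set hive.enforce.bucketing = true;
-- manual alternative (required before Hive 0.6.0, see note ① below):
set mapred.reduce.tasks = 4;      -- must equal the bucket count
set hive.enforce.sorting = true;  -- additionally, if the buckets are sorted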
3. Insert data into the table:
INSERT OVERWRITE TABLE bucketed_users SELECT * FROM users;
Physically, each bucket is just a file in the table (or partition) directory. The file names are not significant, but bucket n is the nth file in lexicographic order. In fact, buckets correspond to the output file partitions of MapReduce: a job produces as many buckets (output files) as reduce tasks. We can see this by looking at the layout of the bucketed_users table we just created. Run the following command:
4. View the structure of the table:
hive> dfs -ls /user/hive/warehouse/bucketed_users;
Four new files are shown, with names like the following (the names contain a timestamp and are generated by Hive, so they change on each run):
attempt_201005221636_0016_r_000000_0
attempt_201005221636_0016_r_000001_0
attempt_201005221636_0016_r_000002_0
attempt_201005221636_0016_r_000003_0
The first bucket contains the users with IDs 0 and 4, because the hash of an INT is the integer itself, taken here modulo the number of buckets (4): ②
5. Read the data and look at the data of each file:
hive> dfs -cat /user/hive/warehouse/bucketed_users/*0_0;
0 Nat
4 Ann
We can get the same result by sampling the table with the TABLESAMPLE clause, which limits the query to a portion of the table's buckets instead of the whole table:
6. Sample the data in the bucket:
hive> SELECT * FROM bucketed_users
    > TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);
0 Nat
4 Ann
Buckets are numbered starting from 1, so the previous query fetches all the users from the first of the four buckets. For a large, evenly distributed dataset, this returns roughly one quarter of the table's rows. We can also sample several buckets at other proportions (since sampling is not an exact operation, this ratio does not have to be an integral multiple of the number of buckets). For example, the following query returns half of the buckets:
7. Sample half of the buckets:
hive> SELECT * FROM bucketed_users
    > TABLESAMPLE (BUCKET 1 OUT OF 2 ON id);
0 Nat
4 Ann
2 Joe
Sampling a bucketed table is a very efficient operation, because the query only has to read the buckets that match the TABLESAMPLE clause. If you instead sample an unbucketed table with the rand() function, the entire input dataset is scanned, even if only a tiny sample is needed:
hive> SELECT * FROM users
    > TABLESAMPLE (BUCKET 1 OUT OF 4 ON rand());
2 Joe
① As of Hive 0.6.0. For earlier versions, mapred.reduce.tasks must be set to the number of buckets of the table being populated. If the buckets are sorted, hive.enforce.sorting must also be set to true.
② When the raw file is displayed, the fields appear squeezed together, because the delimiter is a non-printing control character.
3. A complete worked example:
(1) Create the student & student1 tables:
create table student (id INT, age INT, name STRING)
partitioned by (stat_date STRING)
clustered by (id) sorted by (age) into 2 buckets
row format delimited fields terminated by ',';
create table student1 (id INT, age INT, name STRING)
partitioned by (stat_date STRING)
clustered by (id) sorted by (age) into 2 buckets
row format delimited fields terminated by ',';
(2) Set the environment variable:
set hive.enforce.bucketing = true;
(3) Insert the data:
cat bucket.txt
1,20,zxm
2,21,ljz
3,19,cds
4,18,mac
5,22,android
6,23,symbian
7,25,wp
LOAD DATA local INPATH '/home/lijun/bucket.txt' OVERWRITE INTO TABLE student partition (stat_date="20120802");
from student
insert overwrite table student1 partition (stat_date="20120802")
select id, age, name where stat_date="20120802";
(4) View the file directory:
hadoop fs -ls /hive/warehouse/test.db/student1/stat_date=20120802
Found 2 items
-rw-r--r--   2 lijun supergroup   31 2013-11-24 19:16 /hive/warehouse/test.db/student1/stat_date=20120802/000000_0
-rw-r--r--   2 lijun supergroup   39 2013-11-24 19:16 /hive/warehouse/test.db/student1/stat_date=20120802/000001_0
(5) View sampled data:
hive> select * from student1 tablesample (bucket 1 out of 2 on id);
Total MapReduce jobs = 1
Launching Job 1 out of 1
.
OK
4 18 mac 20120802
2 21 ljz 20120802
6 23 symbian 20120802
Time taken: 20.608 seconds
Note: TABLESAMPLE is the sampling clause. Syntax: TABLESAMPLE (BUCKET x OUT OF y)
y must be a multiple or a factor of the table's total bucket count. Hive determines the sampling proportion from y: for example, if the table has 64 buckets in total, then when y=32, (64/32 =) 2 buckets of data are extracted, and when y=128, (64/128 =) 1/2 of one bucket of data is extracted. x indicates which bucket to start extracting from: for example, if the table has 32 buckets in total, tablesample (bucket 3 out of 16) extracts (32/16 =) 2 buckets in all, namely the data of the 3rd bucket and of the 19th (3 + 16) bucket.
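A hedged restatement of the note as queries, assuming a hypothetical table big_table with 32 buckets:
-- y=16 is a factor of 32, so 32/16 = 2 buckets are sampled:
-- bucket x (the 3rd) and bucket x+y (the 19th)
SELECT * FROM big_table TABLESAMPLE (BUCKET 3 OUT OF 16 ON id);
-- y=64 samples 32/64 = half of one bucket
SELECT * FROM big_table TABLESAMPLE (BUCKET 1 OUT OF 64 ON id);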
Thank you for reading! This concludes the sample analysis of partitions and buckets in Hive; I hope it is of some help to you.