This article explains how to implement partitioning and bucketing in Hive. It is quite detailed and should serve as a useful reference; interested readers are encouraged to read it through to the end!
1. Hive partitioned tables
A Hive SELECT query generally scans the contents of the entire table, which wastes a great deal of time on unnecessary work. Often only the portion of the data you care about needs to be scanned, which is why the concept of a partition is introduced at table-creation time. A partitioned table has its partition space specified when the table is created: Hive can partition the data by one or more columns, as the following example illustrates.
Internet applications today store large volumes of log files every day; a few gigabytes, tens of gigabytes, or even more are possible. When storing logs, one attribute is necessarily the date on which each log entry was generated. Partitions can then be created by dividing the data on the log-date column, treating each day's logs as one partition.
Organizing the data into partitions mainly improves query speed. It is up to the user to decide which partition each record is stored in; that is, when loading data, the user must explicitly specify the partition the data goes into.
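As a rough sketch of the log scenario just described (the table name weblogs, its columns, and the file path are hypothetical, not from the original article), a daily log table could be declared and loaded like this:
CREATE TABLE weblogs (ip STRING, url STRING, msg STRING)
PARTITIONED BY (dt STRING)                       -- one partition per day
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
-- loading must name the target partition explicitly
LOAD DATA LOCAL INPATH '/home/hadoop/logs/2013-08-16.txt'
OVERWRITE INTO TABLE weblogs PARTITION (dt='2013-08-16');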
1.1 Implementation details
1. A table can have one or more partitions, each stored as a folder under the table's directory.
2. Table and column names are not case-sensitive.
3. A partition appears in the table structure as a field. You can see the field with the DESCRIBE TABLE command, but the field does not store actual data content; it only represents the partition (a pseudo column).
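For example, for the invites table used below (partitioned by ds), each partition appears on HDFS as a subdirectory of the table's directory. The warehouse path shown here is the common default and may differ in your installation:
/user/hive/warehouse/invites/ds=2013-08-16/
/user/hive/warehouse/invites/ds=2013-08-20/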
1.2 Syntax
1. Create a partitioned table with ds as the partition column:
CREATE TABLE invites (id INT, name STRING) PARTITIONED BY (ds STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;  -- field delimiter assumed to be '\t'; the original character was garbled
2. Add data to the partition for 2013-08-16:
LOAD DATA LOCAL INPATH '/home/hadoop/Desktop/data.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2013-08-16');
3. Add data to the partition for 2013-08-20:
LOAD DATA LOCAL INPATH '/home/hadoop/Desktop/data.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2013-08-20');
4. Query data from a partition:
SELECT * FROM invites WHERE ds = '2013-08-12';
5. Insert data into a partition of a partitioned table:
INSERT OVERWRITE TABLE invites PARTITION (ds='2013-08-12') SELECT id, max(name) FROM test GROUP BY id;
To view the details of the partitions, use the command:
hadoop fs -ls /home/hadoop.hive/warehouse/invites
Or:
SHOW PARTITIONS tablename;
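For the invites table loaded above, the output would look roughly like this (illustrative):
hive> SHOW PARTITIONS invites;
ds=2013-08-16
ds=2013-08-20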
2. Hive buckets
For each table or partition, Hive can further organize the data into buckets; that is, buckets provide a finer-grained division of the data. Hive buckets the data by a column: it hashes the column value and takes the remainder after dividing by the number of buckets to decide which bucket a record is stored in.
There are two reasons to organize tables (or partitions) into buckets:
(1) To obtain higher query-processing efficiency. Buckets add extra structure to the table that Hive can exploit for certain queries. In particular, a join of two tables that are bucketed on the same column (which includes the join column) can be implemented efficiently as a map-side join. For a JOIN on a column the two tables share, if both tables are bucketed on that column, then buckets holding the same column values can be joined against each other, greatly reducing the amount of data involved in the JOIN.
(2) To make sampling more efficient. When working with a large data set, it is very convenient during query development and refinement to be able to run a query against a small portion of the data.
1. Create a bucketed table:
CREATE TABLE bucketed_user (id INT, name STRING) CLUSTERED BY (id) SORTED BY (name) INTO 4 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;  -- delimiter again assumed to be '\t'
First, let's look at how to tell Hive that a table should be divided into buckets. We use the CLUSTERED BY clause to specify the column(s) to bucket by and the number of buckets:
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
Here we use the user ID to determine the bucket (Hive hashes the value and takes the remainder after dividing it by the number of buckets), so any given bucket will contain an effectively random set of users.
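As a sketch (not part of the original example), you can compute the same hash-mod assignment yourself with Hive's built-in hash() and pmod() functions to see which bucket each row lands in:
-- bucket number (0-based) for each user, assuming 4 buckets as above
SELECT id, pmod(hash(id), 4) AS bucket_no FROM bucketed_users;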
In the case of a map-side join, the two tables are bucketed in the same way. A mapper processing a bucket of the left table knows that the matching rows of the right table are in the corresponding bucket, so it only needs to fetch that bucket (which is only a small fraction of the data stored in the right table) to perform the join. This optimization does not strictly require the two tables to have the same number of buckets; the two bucket counts may also be multiples of each other.
For joining two bucketed tables with HiveQL, see the "map joins" section (p. 400).
The data in a bucket can additionally be sorted by one or more columns. This lets the join of each bucket become an efficient merge sort, which further improves the efficiency of map-side joins. The following syntax declares a table with sorted buckets:
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
How can we ensure that the data in a table is actually divided into buckets?
It is certainly possible to load externally generated data into a bucketed table, but it is usually easier to let Hive do the bucketing, typically from an existing table.
Note that Hive does not check whether the buckets in the data files are consistent with the buckets in the table definition (either in the number of buckets or in the columns used for bucketing). If the two do not match, you may encounter errors or undefined results at query time. It is therefore recommended to let Hive perform the bucketing.
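One caveat worth adding as an aside (this is standard Hive behavior, not from the original article): in Hive versions before 2.0 you also had to tell Hive to enforce bucketing on insert, or the data would not actually be split into buckets; from Hive 2.0 onwards bucketing is always enforced:
set hive.enforce.bucketing = true;   -- needed before the INSERT on Hive < 2.0 only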
2. Insert data into the table:
INSERT OVERWRITE TABLE bucketed_users SELECT * FROM users;
Physically, each bucket is just one file in the table (or partition) directory. Its file name is not important, but bucket n is the nth file in lexicographic order.
In fact, buckets correspond to MapReduce output file partitions: a job produces as many buckets (output files) as it has reduce tasks.
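For instance, listing the table's directory after the insert above typically shows one output file per bucket; the file names follow the usual MapReduce output pattern, and the warehouse path is an assumed default:
hadoop fs -ls /user/hive/warehouse/bucketed_users
000000_0
000001_0
000002_0
000003_0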
3. Sample the data in the bucket:
hive> SELECT * FROM bucketed_users
    > TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
0 Nat
4 Ann
Bucket numbering starts at 1, so the previous query fetches all the users from the first of the four buckets. For a large, evenly distributed data set, this returns roughly a quarter of the table's rows. We can also sample at other proportions (since sampling is not a precise operation, the proportion need not be an exact multiple of the number of buckets).
Note: TABLESAMPLE is the sampling clause; its syntax is TABLESAMPLE(BUCKET x OUT OF y).
y must be a multiple or a factor of the table's total number of buckets. Hive determines the sampling proportion from y. For example, if the table is divided into 64 buckets in total, then when y=32, (64/32=) 2 buckets of data are sampled, and when y=128, (64/128=) 1/2 of one bucket of data is sampled. x indicates the bucket from which sampling starts. For example, if the table has 32 buckets in total, TABLESAMPLE(BUCKET 3 OUT OF 16) means that (32/16=) 2 buckets of data are sampled in total: the 3rd bucket and the (3+16=) 19th bucket.
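Two more hedged examples of the x/y rule: sampling half of the four buckets, and sampling a table that was not bucketed at all (the ON rand() form works but has to scan the full table):
-- returns buckets 1 and (1+2=) 3 of the 4 buckets, i.e. about half the rows
SELECT * FROM bucketed_users TABLESAMPLE(BUCKET 1 OUT OF 2 ON id);
-- sampling a non-bucketed table; rand() forces a scan of the whole table
SELECT * FROM users TABLESAMPLE(BUCKET 1 OUT OF 4 ON rand());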
That's all there is to the article "How to implement partitioning and bucketing in Hive". Thank you for reading! I hope the content shared here is helpful to you.