How to Partition Data in Hive

2026-04-01 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

Partitions in Hive are subdirectories of a table's storage directory. They divide a large dataset into smaller ones according to business needs. So how do you partition data in Hive? What problems should you watch for when partitioning? And how do you limit the number of partitions?

I. Hive only: a quick way to load data into a partition

If the specified partition does not exist, Hive will create a new partition
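A load statement of the kind this section describes might look like the following. It is a minimal sketch, assuming a table named call_logs partitioned by call_date (both inferred from the warehouse path mentioned in this section); the staging directory /tmp/staging is hypothetical.

```sql
-- Assumption: call_logs is partitioned by call_date.
-- /tmp/staging is a hypothetical HDFS directory holding the log file.
LOAD DATA INPATH '/tmp/staging/call-20141002.log'
INTO TABLE call_logs
PARTITION (call_date='2014-10-02');
```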

This command will:

(1) add the partition's metadata to the table if the partition does not already exist

(2) create the partition subdirectory if it does not already exist: /user/hive/warehouse/call_logs/call_date=2014-10-02

(3) move the HDFS file call-20141002.log into the partition subdirectory

II. View, add, and remove partitions

(1) View the table's current partitions

(2) Use ALTER TABLE to add or remove partitions
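The two operations above can be sketched as follows, assuming the same call_logs table partitioned by call_date; the specific dates are illustrative only.

```sql
-- List the partitions currently registered for the table.
SHOW PARTITIONS call_logs;

-- Register a new partition; IF NOT EXISTS avoids an error if it is already there.
ALTER TABLE call_logs ADD IF NOT EXISTS PARTITION (call_date='2014-10-03');

-- Remove a partition; for a managed (non-external) table this also deletes its data.
ALTER TABLE call_logs DROP IF EXISTS PARTITION (call_date='2014-10-01');
```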

Create a partition from an existing partition directory

(1) Partition directories and their data can be created in HDFS outside of Hive or Impala, for example by a Spark or MapReduce application

(2) Use the MSCK REPAIR TABLE command in Hive to register those directories as partitions of an existing table
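As a rough illustration, assuming another tool has already written a directory such as call_date=2014-10-04 under the call_logs table location (the date is hypothetical), the repair step could look like this:

```sql
-- Scan the table's directory and add metadata for any partition
-- subdirectories that the metastore does not know about yet.
MSCK REPAIR TABLE call_logs;

-- Confirm that the new partitions are now visible.
SHOW PARTITIONS call_logs;
```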

When to use partitions

Use partitions in the following situations:

(1) reading the entire dataset takes a long time

(2) queries almost always filter on the partition columns

(3) the partition column has a reasonable number of distinct values

(4) the data generation or ETL process already segments data by file or directory name

(5) the partition column value is not stored in the data itself
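A sketch of a table that fits these criteria is shown below. Only the names call_logs and call_date come from this article; the data columns and the comma delimiter are assumptions made for illustration. Note that call_date appears in PARTITIONED BY rather than in the column list, because its value comes from the directory name, not from the data files.

```sql
-- Hypothetical data columns; call_date is the partition column.
CREATE TABLE call_logs (
  call_time  STRING,
  caller_id  STRING,
  event_type STRING
)
PARTITIONED BY (call_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Filtering on the partition column lets Hive read only the matching subdirectory.
SELECT COUNT(*) FROM call_logs WHERE call_date = '2014-10-02';
```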

When not to use partitions

(1) Avoid partitioning data into many small data files

- Do not partition on columns that have too many unique values

(2) Note: this happens easily when dynamic partitioning is used

- For example, partitioning a customer table by fname would create thousands of partitions
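One way to check a candidate partition column before committing to it is to count its distinct values. This is a minimal sketch; the table name customers is an assumption, and fname is the column from the example above.

```sql
-- A result in the thousands suggests fname is a poor partition column,
-- since each distinct value would become its own subdirectory.
SELECT COUNT(DISTINCT fname) AS distinct_fnames
FROM customers;
```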

Partitioning performed by Hive (dynamic partitioning)

In older versions of Hive, dynamic partitioning is not enabled by default and is enabled by setting these two properties:
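The two settings commonly used for this are hive.exec.dynamic.partition and hive.exec.dynamic.partition.mode. The sketch below shows them followed by a dynamic-partition insert; the staging table call_logs_staging and the column names are hypothetical.

```sql
-- Enable dynamic partitioning for the current session.
SET hive.exec.dynamic.partition=true;
-- nonstrict mode allows every partition column to be determined dynamically.
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical staging table; the partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE call_logs PARTITION (call_date)
SELECT call_time, caller_id, event_type, call_date
FROM call_logs_staging;
```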

However, keep the following points in mind when Hive creates partitions dynamically:

(1) Note: Hive variables set in Beeline are valid only for the current session; a system administrator can configure them to take effect permanently.

(2) Note: if the partition column has many unique values, many partitions will be created.

In addition, we can configure parameters for Hive to limit the number of partitions:

(1) hive.exec.max.dynamic.partitions.pernode

The maximum number of dynamic partitions that a query can create on each node. Default is 100.

(2) hive.exec.max.dynamic.partitions

The maximum total number of dynamic partitions that a single HiveQL statement can create. Default is 1000.

(3) hive.exec.max.created.files

The maximum number of HDFS files that a query can create across all of its mappers and reducers. Default is 100000.
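These limits can be adjusted per session with SET. The values below are illustrative assumptions, not recommendations; suitable limits depend on the cluster and the data.

```sql
-- Example values only; choose limits that match your cluster and workload.
SET hive.exec.max.dynamic.partitions.pernode=200;
SET hive.exec.max.dynamic.partitions=2000;
SET hive.exec.max.created.files=200000;
```

If a statement exceeds one of these limits, Hive fails the query rather than creating the extra partitions or files, which guards against accidentally partitioning on a high-cardinality column.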

That covers the basics of data partitioning in Hive: how to load data into partitions, how to manage them, when to use them, and how to limit their number.
