2025-01-19 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
Partitions in Hive are subdirectories of a table's directory; they divide a large dataset into smaller ones according to business needs. So how do you partition data in Hive? What problems should you watch for when partitioning? And how do you limit the number of partitions?
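As a minimal sketch of what a partitioned table looks like, the statement below creates the call_logs table used later in this article, partitioned by call_date. The non-partition columns (caller, callee, duration) and the field delimiter are assumptions made for illustration; they do not appear in the original text.

```sql
-- Hypothetical schema: only the table name and the partition column
-- (call_date) come from the article; the other columns are assumed.
CREATE TABLE call_logs (
  caller   STRING,
  callee   STRING,
  duration INT
)
PARTITIONED BY (call_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```

Each partition of this table is stored in its own subdirectory of the table directory, named in the form call_date=<value>.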
1. Hive only: a shortcut for loading data into a partition
If the specified partition does not exist, Hive will create a new partition
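The command itself is missing from this copy of the article. Judging from the table, partition value, and file named in the steps that follow, it was presumably a LOAD DATA statement along these lines; the /tmp/ prefix of the source path is an assumption.

```sql
-- Load one HDFS file into a single partition of call_logs.
-- The directory part of the source path is hypothetical.
LOAD DATA INPATH '/tmp/call-20141002.log'
INTO TABLE call_logs
PARTITION (call_date = '2014-10-02');
```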
This command will:
(1) add the partition's metadata to the table if the partition does not already exist
(2) create the partition subdirectory if it does not already exist: /user/hive/warehouse/call_logs/call_date=2014-10-02
(3) move the HDFS file call-20141002.log into the partition subdirectory
2. View, add, and remove partitions
(1) view the table's current partitions
(2) use ALTER TABLE to add or remove partitions
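The two operations above can be sketched as follows, assuming the call_logs table partitioned by call_date; the partition value 2014-10-03 is only an example.

```sql
-- (1) List the partitions currently registered for the table.
SHOW PARTITIONS call_logs;

-- (2) Add a partition, then remove it again.
ALTER TABLE call_logs ADD IF NOT EXISTS PARTITION (call_date = '2014-10-03');
ALTER TABLE call_logs DROP IF EXISTS PARTITION (call_date = '2014-10-03');
```

Note that for a managed (non-external) table, dropping a partition also deletes that partition's data.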
Create a partition from an existing partition directory
(1) partition directories and their data can be created in HDFS outside of Hive or Impala, for example by a Spark or MapReduce application
(2) use the MSCK REPAIR TABLE command in Hive to register such directories as partitions of an existing table
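A minimal sketch, assuming another application has already written directories such as call_date=2014-10-04 under the call_logs table directory:

```sql
-- Scan the table directory and add metastore entries for any
-- partition directories that are not yet registered.
MSCK REPAIR TABLE call_logs;

-- Confirm that the new partitions are now visible.
SHOW PARTITIONS call_logs;
```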
When to use partitions
Use partitions in the following situations:
(1) reading the entire dataset takes a long time
(2) queries almost always filter on the partition columns
(3) the partition column has a reasonable number of distinct values
(4) the data generation or ETL process already segments the data by file or directory name
(5) the partition column value is not in the data itself
When not to use partitions
(1) avoid partitioning data into many small files
- do not partition on columns that have too many unique values
(2) note: this happens easily when dynamic partitioning is used
- for example, partitioning a customer table by fname will result in thousands of partitions
Dynamic partitioning in Hive
In older versions of Hive, dynamic partitioning is not enabled by default and is enabled by setting these two properties:
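The property names are missing from this copy of the article. The two settings normally used for this purpose are hive.exec.dynamic.partition and hive.exec.dynamic.partition.mode, so the sketch below assumes those are the ones meant. The staging table call_logs_staging and its columns are hypothetical.

```sql
-- Allow dynamic partitioning, and allow every partition column to be dynamic.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hive derives each row's partition from the last column of the SELECT
-- (call_date) instead of from a value fixed in the statement.
INSERT OVERWRITE TABLE call_logs PARTITION (call_date)
SELECT caller, callee, duration, call_date
FROM call_logs_staging;
```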
However, there are some points to watch when using dynamic partitioning in Hive:
(1) Note: a Hive variable set in Beeline is valid only for the current session; a system administrator can configure it so that it applies permanently.
(2) Note: if the partition column has many unique values, many partitions will be created.
In addition, Hive provides configuration parameters that limit the number of partitions:
(1) hive.exec.max.dynamic.partitions.pernode
The maximum number of dynamic partitions that a query can create on each mapper or reducer node. Default is 100.
(2) hive.exec.max.dynamic.partitions
The maximum total number of dynamic partitions that one HiveQL statement can create. Default is 1000.
(3) hive.exec.max.created.files
The maximum number of HDFS files that a query can create. Default is 100000.
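As a sketch, these limits can be adjusted for the current session with SET. The values below are illustrative only, not recommendations, and are not the defaults.

```sql
-- Illustrative values: raise the limits for a job that is known to
-- create more partitions and files than the defaults allow.
SET hive.exec.max.dynamic.partitions.pernode=200;
SET hive.exec.max.dynamic.partitions=2000;
SET hive.exec.max.created.files=200000;
```

If a statement exceeds one of these limits, Hive fails the job rather than creating the extra partitions or files, which guards against accidentally partitioning on a high-cardinality column.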
That covers the basics of data partitioning in Hive. Understanding these points well is an important part of learning big data.
© 2024 shulou.com SLNews company. All rights reserved.