Data partitioning in Impala and Hive (1)

2025-12-14 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/03 Report--

Data partitioning can greatly improve query efficiency, and it is essential knowledge for anyone working with big data today. So how do we create partitions, and how is data loaded into them?

I. Partitioning accounts by state in Impala/Hive

(1) Example: accounts is a non-partitioned table
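As a rough illustration, a non-partitioned accounts table could be defined as follows. The column list, the field delimiter, and the LOCATION path are assumptions made for this sketch; the article itself does not list them.

```sql
-- Hypothetical schema: column names, delimiter and path are illustrative only.
CREATE EXTERNAL TABLE accounts (
  cust_id INT,
  fname   STRING,
  lname   STRING,
  address STRING,
  city    STRING,
  state   STRING,   -- an ordinary column in the non-partitioned table
  zipcode STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/loudacre/accounts';   -- all data files sit directly in this directory
```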

A table created without partitions stores all of its data files directly in the accounts directory. But what if most of Loudacre's analysis of the customer table is done by state, with queries that filter on a single state?
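A query of that kind might look like the sketch below. The selected columns are hypothetical; the point is only the filter on state.

```sql
-- On a non-partitioned table this filter still reads every file
-- in the accounts directory before discarding non-matching rows.
SELECT cust_id, fname, lname
FROM accounts
WHERE state = 'NY';
```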

When the amount of data is large, we can partition the table to avoid a full table scan. Without partitions, every query has to scan all the files in the table directory. With partitions, the data is stored in separate subdirectories by state, so a query that filters on "NY" scans only the matching subdirectory. Let's look at how partitions are created.

II. Partition creation

(1) Use PARTITIONED BY to create a partitioned table
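A minimal sketch of such a table, assuming the same hypothetical columns as the non-partitioned accounts table; the table name accounts_by_state and the path are also assumptions.

```sql
-- Hypothetical partitioned version of the accounts table.
CREATE EXTERNAL TABLE accounts_by_state (
  cust_id INT,
  fname   STRING,
  lname   STRING,
  address STRING,
  city    STRING,
  zipcode STRING
  -- state is NOT listed here; it is declared in PARTITIONED BY below
)
PARTITIONED BY (state STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/loudacre/accounts_by_state';
```

With this definition, each state's rows live in their own subdirectory under the table location, named after the partition value (for example state=NY).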

Note that state is removed from the regular column list because it is now the partition field. Partition values are not stored in the data files themselves, so state does not appear among the file columns. In other words, the partition key is a virtual column that does not exist in the stored data. So how do we view the partition column? Does it appear in the table structure? It does.

III. Viewing the partition column

Use DESCRIBE to display the table structure. The partition column appears as the last column of the structure; it is a virtual column, not a column that actually exists in the data.
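For instance, assuming a partitioned table named accounts_by_state (a hypothetical name), the structure can be inspected like this:

```sql
-- Lists the regular columns first; the partition column (state) is shown last.
DESCRIBE accounts_by_state;

-- Lists the partitions that currently exist for the table.
SHOW PARTITIONS accounts_by_state;
```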

So far we have created a single level of partitioning, but sometimes partitions are nested. How do we handle that?

Create nested partitions:
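A sketch of a two-level partitioned table follows. The choice of zipcode as the second partition key, the table name, and the path are assumptions for illustration only.

```sql
-- Hypothetical nested partitioning: first by state, then by zipcode.
CREATE EXTERNAL TABLE accounts_by_state_zip (
  cust_id INT,
  fname   STRING,
  lname   STRING,
  address STRING,
  city    STRING
)
PARTITIONED BY (state STRING, zipcode STRING)   -- order defines the nesting
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/loudacre/accounts_by_state_zip';
```

The order of the keys in PARTITIONED BY determines the directory nesting: each state subdirectory contains one subdirectory per zipcode.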

Once the partitioned table is created, how do we load data into the partitions? There are two approaches: dynamic and static partitioning. With dynamic partitioning, Impala/Hive automatically adds new partitions during the load and stores each row in the correct partition (subdirectory) based on its column value. With static partitioning, we define the partition in advance with ADD PARTITION and specify the target partition when loading the data. What are the characteristics of dynamic and static partitions? That will be shared in a follow-up.
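The two loading styles can be sketched as follows, assuming a hypothetical non-partitioned source table accounts and a target table accounts_by_state partitioned by state. The column names are illustrative, not taken from the article.

```sql
-- Static partitioning: define the partition first, then load into it by name.
ALTER TABLE accounts_by_state ADD IF NOT EXISTS PARTITION (state = 'NY');

INSERT INTO TABLE accounts_by_state PARTITION (state = 'NY')
SELECT cust_id, fname, lname, address, city, zipcode
FROM accounts
WHERE state = 'NY';

-- Dynamic partitioning: the partition value comes from the last SELECT column,
-- and missing partitions are created automatically during the load.
-- In Hive, depending on version and configuration, this may first require:
--   SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE accounts_by_state PARTITION (state)
SELECT cust_id, fname, lname, address, city, zipcode, state
FROM accounts;
```

In the static form the partition value is fixed in the PARTITION clause, so state is left out of the SELECT list; in the dynamic form state must be the last selected column so that each row can be routed to its partition.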

Big data rewards active, continuous learning: the field does not yet have a mature system and is still developing rapidly, so only ongoing study can keep pace with it. It helps to study and exchange ideas regularly. I follow the official WeChat account "big data cn" and personally find it very good; I recommend it.
