This article introduces the differences between internal tables, external tables, partition tables, and bucket tables in Hive. Many people run into questions about these table types in practice, so this walkthrough covers how each one works and how to use it. I hope you read it carefully and get something out of it!
In a Hive data warehouse, the four table types are a key concept. Tables in Hive are divided into internal tables, external tables, partition tables, and bucket tables.
Internal table
Tables created by default are so-called internal tables, sometimes called managed tables, because Hive controls (more or less) the life cycle of their data. Hive stores the data for these tables by default in a subdirectory of the directory defined by the configuration item hive.metastore.warehouse.dir (for example, /user/hive/warehouse). When we delete a managed table, Hive also deletes the data in that table. Managed tables are therefore not well suited for sharing data with other tools.
The specific internal table creation command is as follows:
CREATE TABLE emp (empno INT, ename STRING, job STRING, mgr INT, hiredate TIMESTAMP, sal DECIMAL(7,2), comm DECIMAL(7,2), deptno INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"; -- delimiter \t
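As a quick, hedged check (assuming the emp table above has been created), you can ask Hive which type a table is and where its data lives; DESCRIBE FORMATTED is a standard Hive statement:
DESCRIBE FORMATTED emp; -- the output should show "Table Type: MANAGED_TABLE" and a Location typically under /user/hive/warehouse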
External table
An external table has the type EXTERNAL_TABLE. You can specify the directory location (LOCATION) when creating the table. If you delete an external table, only the metadata is deleted, not the table data.
The specific external table creation command only adds the EXTERNAL keyword and a LOCATION clause compared with the internal table.
CREATE EXTERNAL TABLE emp_external (empno INT, ename STRING, job STRING, mgr INT, hiredate TIMESTAMP, sal DECIMAL(7,2), comm DECIMAL(7,2), deptno INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" LOCATION '/hive/emp_external';
The differences between internal and external tables:
When creating an internal table: the data is moved into the path pointed to by the data warehouse.
When creating an external table: Hive only records the path where the data is located and does not move the data.
When deleting an internal table: both the table metadata and the data are deleted.
When deleting an external table: only the metadata is deleted; the data is kept (a quick sketch of this drop behavior follows this list).
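As a minimal sketch of the drop behavior (assuming the emp and emp_external tables above exist and contain data; recreate them afterwards if you want to follow the load steps later in this article), the dfs command runs an HDFS listing from inside the Hive CLI:
DROP TABLE emp_external; -- removes only the metadata
dfs -ls /hive/emp_external; -- the data files should still be listed
DROP TABLE emp; -- removes the metadata and the files under the warehouse directory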
Partition table
A partition table corresponds to separate folders on the HDFS file system, each containing all the data files for that partition. Partitions in Hive are subdirectories that divide a large data set into smaller data sets according to business needs. When querying, selecting only the required partitions through expressions in the WHERE clause greatly improves query efficiency.
The specific partition table creation command is as follows; compared with the external table it adds PARTITIONED BY. You specify a partition column (declared in PARTITIONED BY rather than in the regular column list), and the data is split into different folders according to the values of that column.
CREATE EXTERNAL TABLE emp_partition (empno INT, ename STRING, job STRING, mgr INT, hiredate TIMESTAMP, sal DECIMAL(7,2), comm DECIMAL(7,2)) PARTITIONED BY (deptno INT) -- partitioned by department number
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" LOCATION '/hive/emp_partition';
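As a hedged illustration of partition pruning (assuming the partitions are loaded with the LOAD DATA commands shown later in this article), a query that filters on the partition column only needs to read the matching subdirectory, such as /hive/emp_partition/deptno=10; SHOW PARTITIONS is a standard Hive statement:
SHOW PARTITIONS emp_partition; -- lists deptno=10, deptno=20, deptno=30 once they are loaded
SELECT ename, sal FROM emp_partition WHERE deptno = 10; -- scans only the deptno=10 subdirectory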
Bucket table
A partition corresponds to a directory on HDFS, while a bucket corresponds to a separate file. Bucketing specifies a column of the bucketed table and distributes that column's values evenly across the bucket files by hashing.
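As a rough sketch of the idea (not necessarily the exact hashing that every Hive version uses internally for bucketing), the built-in hash and pmod functions show which of 4 buckets a value would roughly map to:
SELECT empno, pmod(hash(empno), 4) AS bucket_id FROM emp; -- approximate bucket a row would land in with 4 buckets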
The specific bucket table creation command is as follows; the difference from the partition table is CLUSTERED BY. CLUSTERED means grouped together. Bucketing, like partitioning, organizes the data by a specific column, but it does so by hashing that column's values, so the specified bucket column must be an existing column (field) in the table.
CREATE EXTERNAL TABLE emp_bucket (empno INT, ename STRING, job STRING, mgr INT, hiredate TIMESTAMP, sal DECIMAL(7,2), comm DECIMAL(7,2), deptno INT) CLUSTERED BY (empno) SORTED BY (empno ASC) INTO 4 BUCKETS -- bucketed into 4 buckets by employee number
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" LOCATION '/hive/emp_bucket';
The differences between partition tables and bucket tables:
A Hive table can be partitioned on certain fields to refine data management, which can make some queries faster. Tables and partitions can be further divided into buckets; the principle behind bucket tables is similar to that of the HashPartitioner in MapReduce programming. Both partitioning and bucketing refine data management, but partitions are added manually, and because Hive is schema-on-read, data loaded into a partition is not checked against the schema. Bucket table data, by contrast, is hashed on the bucket field into a fixed number of files, so the placement of the data is much more precise.
There are three ways to build a bucket table: direct table creation, CREATE TABLE LIKE and CREATE TABLE AS SELECT.
Note: you cannot load data directly into a bucket table with LOAD DATA; you need to use INSERT statements, so whenever you see LOAD DATA into a bucket table, the rows are essentially placed at random rather than properly bucketed. Usually the data of a bucket table can only be populated with an INSERT ... SELECT (CTAS-style) statement from another table; because this triggers a MapReduce job, loading a bucket table takes longer.
With the commands above, the internal table, external table, partition table, and bucket table have all been created.
Next, data is inserted into the four tables in turn. The contents of emp.txt are as follows:
7369 SMITH CLERK 7902 1980-12-17 00:00:00 800.00 20
7499 ALLEN SALESMAN 7698 1981-02-20 00:00:00 1600.00 300.00 30
7521 WARD SALESMAN 7698 1981-02-22 00:00:00 1250.00 500.00 30
7566 JONES MANAGER 7839 1981-04-02 00:00:00 2975.00 20
7654 MARTIN SALESMAN 7698 1981-09-28 00:00:00 1250.00 1400.00 30
7698 BLAKE MANAGER 7839 00:00:00 2850.00 30
7782 CLARK MANAGER 7839 1981-06-09 00:00:00 2450.00 10
7788 SCOTT ANALYST 7566 1987-04-19 00:00:00 1500.00 20
7839 KING PRESIDENT 1981-11-17 00:00:00 5000.00 10
7844 TURNER SALESMAN 7698 1981-09-08 00:00:00 1500.00 0.00 30
7876 ADAMS CLERK 7788 1987-05-23 00:00:00 1100.00 20
7900 JAMES CLERK 7698 1981-12-03 00:00:00 950.00 30
7902 FORD ANALYST 7566 00:00:00 3000.00 20
7934 MILLER CLERK 7782 1982-01-23 00:00:00 1300.00 10
The specific data loading commands are as follows:
## Internal table
load data local inpath "emp.txt" into table emp;
## External table
load data local inpath "emp.txt" into table emp_external;
## Partition table
LOAD DATA LOCAL INPATH "emp.txt" OVERWRITE INTO TABLE emp_partition PARTITION (deptno=10);
LOAD DATA LOCAL INPATH "emp.txt" OVERWRITE INTO TABLE emp_partition PARTITION (deptno=20);
LOAD DATA LOCAL INPATH "emp.txt" OVERWRITE INTO TABLE emp_partition PARTITION (deptno=30);
## Bucket table
-- enable bucketing
set hive.enforce.bucketing=true;
-- restrict load operations on bucket tables
set hive.strict.checks.bucketing=true;
INSERT INTO TABLE emp_bucket SELECT * FROM emp; -- the emp table here is the ordinary employee table
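As a hedged follow-up check (assuming the inserts above succeeded), bucketing makes it cheap to sample a single bucket; TABLESAMPLE is standard HiveQL, and an HDFS listing should show four bucket files under the table location:
dfs -ls /hive/emp_bucket; -- expect 4 bucket files
SELECT * FROM emp_bucket TABLESAMPLE(BUCKET 1 OUT OF 4 ON empno); -- reads only the first bucket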
In other words, every time you insert data into a bucket table, you effectively need an intermediate table to select the data from.
That's all for the differences between internal tables, external tables, partition tables, and bucket tables in Hive. Thank you for reading. If you want to learn more about the industry, you can follow this website, where more practical articles will be published.