Hive Notes (2)



Classification of tables in Hive

Managed table (also called a controlled table or internal table): the life cycle of the data is tied to the table itself. When the table is dropped, the data in the table is deleted as well. This is the kind of table created by default.

You can view the details of a table with desc extended tableName in the CLI, or look them up in Hive's metadata table TBLS in MySQL.
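As a minimal sketch of both checks (tableName is a placeholder, and the metastore database name hive is an assumption):

desc extended tableName;                          -- in the Hive CLI
use hive; select TBL_NAME, TBL_TYPE from TBLS;    -- in MySQL, against the metastore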

External table (external_table): the life cycle of the data is not tied to the table. When the table is dropped, the corresponding data still exists; the table is effectively just a reference to that data.

Create an external table:

create external table t6_external (id int);

Add data:

alter table t6_external set location "/input/hive/hive-t6.txt";

You can also specify the corresponding data when creating the external table:

create external table t6_external_1 (id int) location "/input/hive/hive-t6.txt";

The above HQL fails with:

MetaException (message:hdfs://ns1/input/hive/hive-t6.txt is not a directory or unable to create one)

which means the location specified when the table is created must be a directory, not a specific file:

create external table t6_external_1 (id int) location "/input/hive/";

With an external table, deleting the data through the table is not allowed, but data can be added, and adding data also changes the files referenced on HDFS.

Simple use differences between internal and external tables:

External tables are generally used when data security is a concern, when the data is shared across multiple departments, or when Hive needs to integrate with other frameworks such as HBase.
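A quick, hedged check of this behavior, reusing t6_external_1 and the /input/hive/ location from the example above:

drop table t6_external_1;      -- only the table definition is removed
dfs -ls /input/hive/;          -- the referenced files are still on HDFS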

Internal and external tables can be converted to each other:

External table -> internal table: alter table t6_external set tblproperties ("EXTERNAL" = "FALSE");

Internal table -> external table: alter table t2 set tblproperties ("EXTERNAL" = "TRUE");

Persistent tables and temporary tables

The tables above are all persistent tables; the existence of a persistent table has nothing to do with the session.

Temporary table:

When a temporary table is created in a session, all of its data (including metadata) disappears when the session is disconnected; the table data is held temporarily in memory. (In fact, after the temporary table is created, a /tmp/hive directory is created in HDFS to hold temporary files, but the data is deleted as soon as you exit the session.) The table is not shown in the metastore. This kind of temporary table is usually used for temporary data storage or exchange.

Characteristic: a temporary table cannot be a partitioned table.

Creating a temporary table is very simple: just like an external table, replace the external keyword with temporary.
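A minimal sketch following that pattern (the table name t_tmp is illustrative):

create temporary table t_tmp (id int);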

Partitioned tables

Suppose the structure of the up_web_log table is as follows:

/user/hive/warehouse/up_web_log/
    web_log_2017-03-09.log
    web_log_2017-03-10.log
    web_log_2017-03-11.log
    web_log_2017-03-12.log
    ....
    web_log_2018-03-12.log

The structure of the table is explained as follows:

This table stores web logs. In Hive, a table corresponds to a directory in HDFS. The web logs are collected and saved on a daily basis, so the log data is loaded into Hive at the end of every day; that is why there are many files under the up_web_log directory, while to Hive all of this log data belongs to the up_web_log table. Obviously, as time passes, there will be more and more data in this table.

Problems with the table:

To view the data of a single day, we can only define a date field (for example: dt) in the table and add a filter such as where dt = "2017-03-12" to the query to get the corresponding result. The problem is that this still has to load the data from all the files under the table, so a large amount of irrelevant data ends up in memory, making our HQL inefficient.
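For illustration only, such a query might look like the sketch below (dt is the date field mentioned above; the query still has to scan every file under the table):

select count(*) from up_web_log where dt = "2017-03-12";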

So how do you optimize the table?

To optimize this, we can use a characteristic of Hive tables: a table actually manages a folder, i.e. the table maps to a directory on HDFS. So we can create one or more levels of subdirectories under the table's directory and thereby split the large table, identifying each subdirectory by some feature, for example the subfolder name datadate=2017-03-09. Later, when querying the data of a single day, we only need to locate that subfolder and load everything under it, just as if we were loading a table. This way the large table is never fully loaded, only part of it, which reduces the amount of data in memory and improves the running efficiency of the HQL. This technique is called partitioning, such tables are called partitioned tables, and each such subfolder is called a partition of the partitioned table.

The composition of the partition table is as follows:

A partition consists of two parts, the partition field and the specific partition value, joined by "=". The partition field behaves like a regular field of the table: to query the data under a partition, use where datadate = "2017-03-09". The storage structure of the table on HDFS becomes:

/user/hive/warehouse/up_web_log/
    /datadate=2017-03-09
        web_log_2017-03-09.log
    /datadate=2017-03-10
        web_log_2017-03-10.log
    /datadate=2017-03-11
        web_log_2017-03-11.log
    /datadate=2017-03-12
        web_log_2017-03-12.log
    ....
        web_log_2018-03-12.log

Create a partition table:

create table t7_partition (id int) partitioned by (dt date comment "date partition field");

load data local inpath '/opt/data/hive/hive-t6.txt' into table t7_partition;

This fails with:

FAILED: SemanticException [Error 10062]: Need to specify partition columns because the destination table is partitioned

Data cannot be loaded directly into a partitioned table; you must specify which partition, that is, which subfolder, to load the data into.

DDL of the partition table:

Create a partition: alter table t7_partition add partition (dt="2017-03-10");

View the list of partitions: show partitions t7_partition;

Delete a partition: alter table t7_partition drop partition (dt="2017-03-10");

Add data:

Add data to the specified partition: load data local inpath '/opt/data/hive/hive-t6.txt' into table t7_partition partition (dt="2017-03-10"); loading this way creates the partition automatically.

Situations where there are multiple partition fields:

Example: per-school statistics of enrollment and employment, per discipline and per year.

create table t7_partition_1 (id int) partitioned by (year int, school string);

Add data:

load data local inpath '/opt/data/hive/hive-t6.txt' into table t7_partition_1 partition (year=2015, school='python');

Bucket tables

Problems with partitioned tables:

Partitioned tables can still end up with some partitions that are very large and others that are very small, leading to uneven queries, which is not what we want. We therefore need a technique to spread the data relatively evenly. This technique is called bucketing, and tables divided this way are called bucket tables.

Create a bucket table:

create table t8_bucket (id int) clustered by (id) into 3 buckets;
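Depending on the Hive version, bucketing enforcement may need to be switched on before inserting, otherwise the data may not actually be split into buckets. A hedged sketch (in Hive 2.0 and later this property was removed because it is always on):

set hive.enforce.bucketing = true;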

Add data to the bucket table:

Data can only be inserted from another table; the load statement shown earlier cannot be used (it does not split the data into buckets).

insert into t8_bucket select * from t7_partition_1 where year=2016 and school="mysql";
FAILED: SemanticException [Error 10044]: Line 1:12 Cannot insert into target table because column number/types are different 't8_bucket': Table insclause-0 has 1 columns, but query has 3 columns.

Our bucket table has only one field, while the partitioned table has three, so when importing data with insert into, make sure the number of columns is the same on both sides.

insert into t8_bucket select id from t7_partition_1 where year=2016 and school="mysql";

After adding the data, view the data in the table:

> select * from t8_bucket;
OK
6
3
4
1
5
2
Time taken: 0.08 seconds, Fetched: 6 row(s)

You can see that the order of the data has changed, because the data has been divided into three parts by a hash algorithm: 6 % 3 = 0 and 3 % 3 = 0 go into the first bucket, 4 % 3 = 1 and 1 % 3 = 1 go into the second bucket, and 5 % 3 = 2 and 2 % 3 = 2 go into the third bucket.

View the structure of t8_bucket on HDFS:

hive (mydb1)> dfs -ls /user/hive/warehouse/mydb1.db/t8_bucket;
Found 3 items
-rwxr-xr-x   3 uplooking supergroup   4 2018-03-09 23:00 /user/hive/warehouse/mydb1.db/t8_bucket/000000_0
-rwxr-xr-x   3 uplooking supergroup   4 2018-03-09 23:00 /user/hive/warehouse/mydb1.db/t8_bucket/000001_0
-rwxr-xr-x   3 uplooking supergroup   4 2018-03-09 23:00 /user/hive/warehouse/mydb1.db/t8_bucket/000002_0

You can see that the data is stored in three different files under the t8_bucket directory.

Note: local mode does not work when operating a bucket table.

Loading and exporting of data ([] marks optional parts)

load:

load data [local] inpath 'path' [overwrite] into table tblName [partition_psc];

local: if present => load data from the local Linux file system; if absent => load data from HDFS, which is equivalent to performing a mv operation (here "no local" means the source is on HDFS, not that the file is missing locally).

overwrite: if present => overwrite the existing data in the table; if absent => append the new data to what is already there.

insert from another table:

insert [overwrite|into] [table] t_des select [...] from t_src [...];

(the table keyword must be added when the preceding parameter is overwrite)

overwrite: if present => overwrite the existing data in the table; if absent => append the new data to what is already there. The statement is translated into a MapReduce job for execution.

Note: the columns of t_des and those in the select [...] from t_src [...] part must correspond one to one. When overwrite is used, table must follow it, e.g.: insert overwrite table test select * from t8_bucket;

load while creating the table:

create table t_des as select [...] from t_src [...];

This creates a table whose structure is given by the select [...] from t_src query, e.g. create temporary table tmp as select distinct(id) from t8_bucket;

Loading of dynamic partitions

Quick copy table structure:

create table t_d_partition like t_partition_1;

hive (default)> show partitions t_partition_1;
OK
partition
year=2015/class=linux
year=2016/class=linux
....

To put the 2016 data into the relevant partition of t_d_partition:

insert into table t_d_partition partition (class, year=2016) select id, name, class from t_partition_1 where year=2016;

To put all the data in t_partition_1 into the relevant partition of t_d_partition:

insert overwrite table t_d_partition partition (year, class) select id, name, year, class from t_partition_1;

(The hint that appears when I run this: FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict)
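A minimal sketch of the session settings usually involved in dynamic-partition inserts; hive.exec.dynamic.partition is not discussed in these notes and is included here as an assumption, while the other two appear in this section:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;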

Other questions:

Deleting the data on HDFS does not delete the table structure: show partitions returns what is recorded in the metastore, so if you manually delete the data on HDFS, the metadata information is still there.

insert into t10_p_1 partition (year=2016, class) select * from t_partition_1;
FAILED: SemanticException [Error 10094]: Line 1:30 Dynamic partition cannot be the parent of a static partition 'professional'
A dynamic partition cannot be the parent directory of a static partition.

hive.exec.dynamic.partition.mode needs to be set to nonstrict.
hive.exec.max.dynamic.partitions (1000): the maximum total number of dynamic partitions allowed to be created.

import: imports data from HDFS.

In my current tests of import table stu from '/data/stu'; the following errors occur:

hive (mydb1)> import table test from '/input/hive/';
FAILED: SemanticException [Error 10027]: Invalid path
hive (mydb1)> import table test from 'hdfs://input/hive/';
FAILED: SemanticException [Error 10324]: Import Semantic Analyzer Error
hive (mydb1)> import table test from 'hdfs://ns1/input/hive/';
FAILED: SemanticException [Error 10027]: Invalid path

export:

1. hadoop fs -cp src_uri dest_uri (or hdfs dfs -cp src_uri dest_uri)
2. hive> export table tblName to 'hdfs_uri';
   The HDFS directory exported to must be an empty directory; if it does not exist, it is created automatically. This approach also exports the metadata information.
3. insert overwrite [local] directory 'linux_fs_path' select ... from ... where ...;
   Without local the data is exported to HDFS, otherwise it is exported to the Linux file system. Either way, if the directory does not exist it is created automatically; if it does exist, it is overwritten.
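For reference, export and import are designed to work as a pair: import expects a directory produced by export, which contains both the data and the metadata. A minimal sketch with illustrative table and path names:

export table test to '/tmp/hive_export/test';
import table test_copy from '/tmp/hive_export/test';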
