
How to Use the Hive Data Warehouse


This article explains in detail how to use the Hive data warehouse. I think it is quite practical, so I am sharing it as a reference; I hope you get something out of it.

Hive is a basic data warehouse tool in the Hadoop ecosystem for working with structured data. It is built on top of Hadoop and makes querying and analyzing big data convenient: it provides a simple SQL-like query language and converts those SQL statements into MapReduce jobs to run.

Hive table types

Internal table

To prepare the data, first create a comma-separated text file, upload it to the /test directory on HDFS, and then create a table in Hive whose name matches the file name.

$ cat /tmp/table_test.csv
1,user1,1000
2,user2,2000
3,user3,3000
4,user4,4000
5,user5,5000

Create the table in Hive:

hive> CREATE TABLE table_test (id int, name string, value INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

The first half is ordinary SQL-style syntax; the trailing ROW FORMAT clause tells Hive that fields in the data file are separated by ','.

Load local data into Hive:

$ hive -e "load data local inpath '/tmp/table_test.csv' into table db_test.table_test"
Loading data to table db_test.table_test
OK
Time taken: 0.148 seconds

The same file can be loaded multiple times (the data is appended), and each load creates an additional file under the table's HDFS data directory. The LOCAL keyword means we are loading from the local filesystem; without LOCAL, the data is loaded from HDFS.
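To make the difference concrete, here is a small side-by-side sketch of the two forms (using the same paths as in this article; adding OVERWRITE would replace the existing data instead of appending):

-- LOCAL: copies a file from the local filesystem into the table's warehouse directory
hive> LOAD DATA LOCAL INPATH '/tmp/table_test.csv' INTO TABLE db_test.table_test;

-- without LOCAL: moves an existing HDFS file into the table's warehouse directory
hive> LOAD DATA INPATH '/test/table_test.csv' INTO TABLE db_test.table_test;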

View the data in Hive:

hive> select * from table_test;
OK
1    user1    1000
2    user2    2000
3    user3    3000
4    user4    4000
5    user5    5000
Time taken: 0.058 seconds, Fetched: 5 row(s)

You can also run queries such as select id from table_test, but note that in Hive only select * from a table is answered with a plain full-table scan; any other query is compiled into a MapReduce job.
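If you want to see how Hive will execute a statement, you can inspect its query plan with EXPLAIN (just an illustration; the plan output differs between Hive versions and settings):

hive> EXPLAIN select id from table_test;

The plan shows whether the query can be served by a simple fetch task or requires a MapReduce stage.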

View the data file on HDFS:

[hadoop@hadoop-nn ~]$ hdfs dfs -ls /user/hive/warehouse/db_test.db/table_test/
Found 1 items
-rwxrwxrwx   2 root supergroup         65 2017-06-15 22:27 /user/hive/warehouse/db_test.db/table_test/table_test.csv

Note that the file owner is root; this is because I entered hive as the root user. Normally you would enter the hive command line as the hadoop user to create tables.

Load data from HDFS into Hive. First upload the data to the HDFS cluster:

[hadoop@hadoop-nn ~]$ hdfs dfs -mkdir /test
[hadoop@hadoop-nn ~]$ hdfs dfs -put /tmp/table_test.csv /test/table_test.csv

Create the table:

[hadoop@hadoop-nn ~]$ hive
hive> CREATE TABLE hdfs_table (id int, name string, value INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Load the data:

hive> LOAD DATA INPATH '/test/table_test.csv' OVERWRITE INTO TABLE db_test.hdfs_table;
Loading data to table db_test.hdfs_table
OK
Time taken: 0.343 seconds
hive> select * from db_test.hdfs_table;
OK
1    user1    1000
2    user2    2000
3    user3    3000
4    user4    4000
5    user5    5000
Time taken: 0.757 seconds, Fetched: 5 row(s)

Note that when data is loaded from HDFS into Hive, the original HDFS file no longer exists at its source path; it is moved into the table's warehouse directory.

[hadoop@hadoop-nn ~]$ hdfs dfs -ls /test/table_test.csv
ls: `/test/table_test.csv': No such file or directory

View the HDFS data file:

[hadoop@hadoop-nn ~]$ hdfs dfs -ls /user/hive/warehouse/db_test.db/hdfs_table/
Found 1 items
-rwxrwxrwx   2 hadoop supergroup         65 2017-06-15 22:54 /user/hive/warehouse/db_test.db/hdfs_table/table_test.csv

Upload another file directly into the table's directory (/user/hive/warehouse/db_test.db/hdfs_table):

[hadoop@hadoop-nn ~]$ cat /tmp/table_test.csv
6,user6,6000
[hadoop@hadoop-nn ~]$ hdfs dfs -put /tmp/table_test.csv /user/hive/warehouse/db_test.db/hdfs_table/table_test_20170616.csv

Look at the Hive table again:

hive> select * from db_test.hdfs_table;
OK
1    user1    1000
2    user2    2000
3    user3    3000
4    user4    4000
5    user5    5000
6    user6    6000
Time taken: 0.053 seconds, Fetched: 6 row(s)

As you can see, the row we appended is displayed as well.

Partition table

When you create a partitioned table, you give it a partition field; this field does not need to exist in the data file, and it is added as a column automatically when the table is created. The concept of a Hive partition is similar to partitioning in MySQL. Let's create a table partitioned by day.

CREATE TABLE par_table (id int, name string, value INT) partitioned by (day int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

View the table information:

hive> desc par_table;
OK
id                   int
name                 string
value                int
day                  int

# Partition Information
# col_name            data_type            comment
day                  int
Time taken: 0.023 seconds, Fetched: 9 row(s)

Load data into the partitioned table; you need to specify the target partition when loading:

hive> LOAD DATA LOCAL INPATH '/tmp/table_test.csv' OVERWRITE INTO TABLE db_test.par_table PARTITION (day='22');
Loading data to table db_test.par_table partition (day=22)
OK
Time taken: 0.267 seconds
hive> LOAD DATA LOCAL INPATH '/tmp/table_test.csv' OVERWRITE INTO TABLE db_test.par_table PARTITION (day='23');
Loading data to table db_test.par_table partition (day=23)
OK
Time taken: 0.216 seconds

View how the data files are laid out on HDFS:

[hadoop@hadoop-nn ~]$ hdfs dfs -ls /user/hive/warehouse/db_test.db/par_table/
Found 2 items
drwxrwxrwx   - hadoop supergroup          0 2017-06-16 01:12 /user/hive/warehouse/db_test.db/par_table/day=22
drwxrwxrwx   - hadoop supergroup          0 2017-06-16 01:12 /user/hive/warehouse/db_test.db/par_table/day=23

You can see that a separate directory has been created for each partition.
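You can also ask Hive to list a table's partitions directly (a small extra illustration; this output is not from the original walkthrough):

hive> show partitions db_test.par_table;
OK
day=22
day=23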

Querying works slightly differently. If the WHERE condition filters on the partition field, only that partition is read and the whole table does not have to be scanned. If the filter field is not a partition field, all partitions must be scanned. Here are two examples:

hive> select * from db_test.par_table;
OK
6    user6    6000    22
6    user6    6000    23
Time taken: 0.054 seconds, Fetched: 2 row(s)
hive> select * from db_test.par_table where day=22;
OK
6    user6    6000    22
Time taken: 0.068 seconds, Fetched: 1 row(s)

External table

Hive also supports external tables, which work differently from internal and partitioned tables. You only need the corresponding file in HDFS; create a table in Hive that points at that directory, and you can query the data directly without running a load step. Let's test it:

First create a directory and upload files in HDFS:

[hadoop@hadoop-nn ~]$ hdfs dfs -mkdir -p /hive/external
[hadoop@hadoop-nn ~]$ hdfs dfs -put /tmp/table_test.csv /hive/external/ext_table.csv

Then create the table directly in Hive:

hive> CREATE EXTERNAL TABLE ext_table (id int, name string, value INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/hive/external';

At this point you can query the table directly; there is no need to load any data.

hive> select * from ext_table;
OK
6    user6    6000
Time taken: 0.042 seconds, Fetched: 1 row(s)

Hive also supports bucketed tables; I won't cover them here since I seldom use them. If you are interested, you can look up the documentation on your own.
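For completeness, here is a minimal sketch of what a bucketed table looks like (this is illustrative and not part of the original walkthrough): rows are distributed into a fixed number of buckets by hashing the clustering column, which helps with sampling and map-side joins.

hive> CREATE TABLE bucket_table (id int, name string, value INT)
    > CLUSTERED BY (id) INTO 4 BUCKETS
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> set hive.enforce.bucketing=true;   -- needed on older Hive releases so inserts respect the bucket layout; Hive 2.x enforces it by default
hive> INSERT INTO TABLE bucket_table SELECT * FROM table_test;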

Finally, here is what it looks like when Hive processes a query with MapReduce.

hive> select count(*) from table_test;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20170616021047_9c0dc1bf-383f-49ad-83e2-e2e5dfdcb20c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1497424827481_0004, Tracking URL = http://master:8088/proxy/application_1497424827481_0004/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1497424827481_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-06-16 02:10:57,062 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.11 sec
2017-06-16 02:11:02,204 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.53 sec
MapReduce Total cumulative CPU time: 2 seconds 530 msec
Ended Job = job_1497424827481_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.53 sec   HDFS Read: 7980 HDFS Write: 7980 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 530 msec
OK
10
Time taken: 15.254 seconds, Fetched: 1 row(s)

Take a good look at the process output; because this is a test environment, the MapReduce job takes a relatively long time.
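The WARNING in the output suggests moving away from Hive-on-MR. If Tez (or Spark) is actually deployed in your cluster, you can switch the execution engine for the current session, for example (assuming Tez is installed and configured for Hive):

hive> set hive.execution.engine=tez;
hive> select count(*) from table_test;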

Views

In addition, Hive supports views, which are easy to use. They are created as follows:

hive> create view view_test as select * from table_test;
OK
Time taken: 0.054 seconds
hive> select * from view_test;
OK
d1    user1    1000
d1    user2    2000
d1    user3    3000
d2    user4    4000
d2    user5    5000
Time taken: 0.057 seconds, Fetched: 5 row(s)

Hive metadata information

Finally, let's look at the Hive metadata tables. The information about the databases created by Hive is stored in the DBS table of the hive database in MySQL:

mysql> select * from DBS;
+-------+-----------------------+----------------------------------------------------+---------+------------+------------+
| DB_ID | DESC                  | DB_LOCATION_URI                                    | NAME    | OWNER_NAME | OWNER_TYPE |
+-------+-----------------------+----------------------------------------------------+---------+------------+------------+
|     1 | Default Hive database | hdfs://master:8020/user/hive/warehouse             | default | public     | ROLE       |
|     6 | NULL                  | hdfs://master:8020/user/hive/warehouse/db_test.db  | db_test | hadoop     | USER       |
+-------+-----------------------+----------------------------------------------------+---------+------------+------------+
2 rows in set (0.00 sec)

The important columns:
DB_ID: database ID, unique.
DESC: description of the database.
DB_LOCATION_URI: the URI of the database directory on HDFS.
NAME: database name.
OWNER_NAME: the owner of the database, i.e. the system user that logged in to Hive; generally you should log in to Hive as the hadoop user.
OWNER_TYPE: the type of the database owner.

The metadata of the tables we created is stored in the TBLS table of the hive database:

mysql> select * from TBLS;
+--------+-------------+-------+------------------+--------+-----------+-------+------------+----------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER  | RETENTION | SD_ID | TBL_NAME   | TBL_TYPE       | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
+--------+-------------+-------+------------------+--------+-----------+-------+------------+----------------+--------------------+--------------------+
|     11 |  1497579800 |     6 |                0 | root   |         0 |    11 | table_test | MANAGED_TABLE  | NULL               | NULL               |
|     16 |  1497581548 |     6 |                0 | hadoop |         0 |    16 | hdfs_table | MANAGED_TABLE  | NULL               | NULL               |
|     26 |  1497584489 |     6 |                0 | hadoop |         0 |    26 | par_table  | MANAGED_TABLE  | NULL               | NULL               |
|     28 |  1497591914 |     6 |                0 | hadoop |         0 |    31 | ext_table  | EXTERNAL_TABLE | NULL               | NULL               |
+--------+-------------+-------+------------------+--------+-----------+-------+------------+----------------+--------------------+--------------------+
4 rows in set (0.00 sec)

Again, the important columns:
TBL_ID: table ID, unique.
CREATE_TIME: when the table was created.
DB_ID: the ID of the database the table belongs to.
LAST_ACCESS_TIME: the last access time.
OWNER: the owner of the table, i.e. the system user that logged in to Hive when the table was created; generally you should log in to Hive as the hadoop user.
TBL_NAME: table name.
TBL_TYPE: table type. MANAGED_TABLE means a managed table (internal, partitioned, and bucketed tables); EXTERNAL_TABLE means an external table. The big difference between the two is what DROP TABLE does: dropping a managed table deletes both the Hive metadata and the HDFS data, while dropping an external table deletes only the metadata and leaves the HDFS data untouched.

This is the end of the article on "How to Use the Hive Data Warehouse". I hope the content above has been helpful and that you learned something from it. If you found the article useful, please share it so more people can see it.
