
(Part 7) A flexible, easy-to-use, and easy-to-maintain Hadoop data warehouse tool: Hive


Hive

Hive is a data warehouse tool built on top of Hadoop. It can map structured data files to database tables and provides complete SQL query functionality. Hive defines a simple SQL-like query language called HQL, which lets users who are familiar with SQL query the data: HQL statements are converted into MapReduce jobs, so there is no need to develop dedicated MapReduce programs. After all, far more people can write SQL than Java, which gives a large number of analysts direct access to huge amounts of data. Because Hive is flexible, easy to use, and easy to maintain, it is very well suited to the statistical analysis work of a data warehouse.
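For a sense of what that looks like in practice, here is a minimal, hedged HQL sketch (it uses the page_view table defined later in this article): one line of SQL that Hive turns into a MapReduce job, replacing what would otherwise be a hand-written Java program.

-- Count page views per country; Hive compiles this into a MapReduce job automatically
SELECT country, COUNT(1) AS pv FROM page_view GROUP BY country;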

The content of this chapter:

1) introduction to Hive

2) basic composition of Hive

3) Hive execution process

4) basic operation of Hive

1. Basic principles of Hive

Hive is built on top of Hadoop and includes the following components: CLI (command line interface), JDBC/ODBC, Thrift Server, WEB GUI, Metastore, and Driver (Compiler, Optimizer, and Executor).

1) Driver component: includes the Compiler, Optimizer, and Executor. Its job is to parse, compile, and optimize the HiveQL (SQL-like) statements we write, generate an execution plan, and then invoke the underlying MapReduce computing framework (see the EXPLAIN sketch after this component list).

2) Metastore component: the metadata service component. It stores Hive's metadata in a relational database; the relational databases supported by Hive are Derby and MySQL. Hive also supports installing the Metastore service on a remote server cluster, decoupling the Hive service from the Metastore service.

3) Thrift service: Thrift is a software framework developed by Facebook for building scalable, cross-language services. Hive integrates this service so that different programming languages can call Hive's interfaces.

4) CLI: the command line interface.

5) Thrift client: many client interfaces in the Hive architecture are built on the Thrift client, including the JDBC and ODBC interfaces.

6) WEB GUI: the Hive client provides a way to access the services offered by Hive through web pages.
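To make the Driver's role concrete, here is a hedged sketch: the EXPLAIN statement prints the execution plan that the Compiler and Optimizer produce for a query (the page_view table is the one used in the examples later in this article; the exact plan output depends on the Hive version).

-- Show the plan only; the query is not executed
EXPLAIN
SELECT country, COUNT(1) FROM page_view GROUP BY country;
-- The output lists the stages (typically a map-reduce stage plus a fetch stage) that the Driver will run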

There are three main user interfaces: the CLI, the client, and the WUI. The most commonly used is the CLI: within the company you can connect through the bastion host with ssh hdp_lbg_ectech@10.126.101.7, type hive directly, and you are connected to HiveServer.

The Metastore component of Hive is where Hive metadata is centrally stored. It consists of two parts: the metastore service and the backing data store. The medium for the backing store is a relational database, such as Hive's default embedded Derby database, or a MySQL database. The metastore service is the service component built on top of that backing store, and the Hive service interacts with it. By default, the metastore service and the Hive service are installed together and run in the same process. You can also split the metastore service out of the Hive service and install it independently in a cluster, with Hive invoking the metastore service remotely. This lets you put the metadata layer behind a firewall: clients access the Hive service, which in turn connects to the metadata layer, giving better manageability and security. Using a remote metastore service also makes the metastore service and the Hive service run in different processes, which improves the stability of Hive and the efficiency of the Hive service.

For data storage, Hive has no special data storage format, so tables in Hive can be organized very freely. You only need to tell Hive the column delimiter and row delimiter in the data when creating the table, and Hive can then parse the data. All Hive data is stored in HDFS, and the storage structures mainly include databases, files, tables, and views. Hive contains the following data models: internal tables (Table), external tables (External Table), partitions (Partition), and buckets (Bucket). Hive can load plain text files directly by default and also supports SequenceFile and RCFile.

The data model of Hive is described as follows:

1) Hive database

A database is similar to a database in a traditional RDBMS, for example: hive> CREATE DATABASE test_database;

2) Internal table

The internal tables of Hive are conceptually similar to tables in a database. Each Table has a corresponding directory in which Hive stores its data. For example, a table hive_test has the HDFS path /home/hdp_lbg_ectech/warehouse/hdp_lbg_ectech_bdw.db/hive_test, where /home/hdp_lbg_ectech/warehouse is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml; all Table data (excluding external tables) is stored in this directory. When you delete an internal table, both the metadata and the data are deleted.

Example of a table-building statement:

CREATE EXTERNAL TABLE hdp_lbg_ectech_bdw.hive_test
(`userid` string COMMENT '')
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001';

LOAD DATA INPATH '/home/hdp_lbg_ectech/resultdata/test.txt' OVERWRITE INTO TABLE hive_test;

3) external table

An external table points to data that already exists in HDFS, and partitions can be created on it. It is the same as an internal table in how its metadata is organized, but the actual data is stored quite differently. An internal table moves the actual data into the data warehouse directory while loading it; when you delete the table, the data and the metadata are deleted together. An external table involves only one step: loading the data and creating the table are completed at the same time (CREATE EXTERNAL TABLE ... LOCATION), the actual data is stored in the HDFS path specified after LOCATION, and it is not moved to the data warehouse directory. When you delete an external table, only the metadata of that table is deleted; the data in the actual external directory is not deleted. This mode is recommended.
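As a hedged sketch of the recommended external-table pattern (the table name and HDFS path below are invented for illustration):

-- The data already sits under the LOCATION path in HDFS; Hive only records metadata
CREATE EXTERNAL TABLE ext_test (`userid` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
LOCATION '/home/hdp_lbg_ectech/resultdata/ext_test';

-- Dropping the table removes only the metadata; the files under the LOCATION path remain
DROP TABLE ext_test;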

4) Partition

A partition corresponds roughly to an index on a column in a database, but Hive organizes partitions very differently from a database. In Hive, a partition of a table corresponds to a subdirectory under the table's directory, and all of the partition's data is stored in that directory.

Data is generally partitioned by time, region, or category, which makes localized queries convenient and avoids scanning the whole data source.
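A hedged sketch of a date-partitioned table (the table and column names are invented for the example): each day's data lands in its own subdirectory, and a query that filters on the partition column reads only that directory.

-- Each distinct dt value becomes a subdirectory such as .../order_log/dt=2008-08-15
CREATE TABLE order_log (orderid BIGINT, userid BIGINT)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Only the dt='2008-08-15' directory is scanned, not the whole table
SELECT COUNT(1) FROM order_log WHERE dt = '2008-08-15';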

5) Bucket

Bucketing further decomposes a table's data into different files by applying a hash algorithm to a specified column: the hash of that column is computed and the data is split according to the hash value so that it can be processed in parallel, with each bucket corresponding to one file. For example, if the userid column is divided into 32 buckets, the hash of each userid value is computed first; the HDFS path for hash value 0 is /home/hdp_lbg_ectech/resultdata/part-00000, and the HDFS path for hash value 20 is /home/hdp_lbg_ectech/resultdata/part-00020.
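A hedged sketch of bucketing in DDL and of sampling a single bucket (the table name is invented; hive.enforce.bucketing applies to older Hive versions, where it must be set so that inserts actually honor the bucket definition):

-- Make inserts respect the bucket definition (needed on older Hive versions)
set hive.enforce.bucketing = true;

CREATE TABLE user_bucketed (userid BIGINT, name STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;

-- Read roughly 1/32 of the data by sampling one bucket on the bucketing column
SELECT * FROM user_bucketed TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid);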

6) View of Hive

Views are similar to views in a traditional database. At present there are only logical views, not materialized views; views can only be queried, they cannot be the target of Load/Insert/Update/Delete; when a view is created, only a copy of the metadata is saved, and the subquery that defines the view is executed only when the view is queried.
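A minimal hedged example (the view name is invented; the invites table is the one used in the examples below) showing that only the definition is stored until the view is queried:

-- Only metadata is saved at creation time
CREATE VIEW recent_invites AS SELECT foo, bar FROM invites WHERE ds > '2008-08-09';

-- The view's underlying SELECT is executed now, when the view is queried
SELECT COUNT(1) FROM recent_invites;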

2. Basic operation of Hive

1) DDL operation: including

- Create and delete tables

- Modify table structure

- Create / delete views

- Create databases and display commands

- Add and delete partitions

- Rename tables

- Modify the name, type, position, and comment of a column

- Add / update columns

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name

[(col_name data_type [COMMENT col_comment],...)]

[COMMENT table_comment]

[PARTITIONED BY (col_name data_type [COMMENT col_comment],...)]

[CLUSTERED BY (col_name, col_name,...)

[SORTED BY (col_name [ASC | DESC],...)] INTO num_buckets BUCKETS]

[ROW FORMAT row_format]

[STORED AS file_format]

[LOCATION hdfs_path]

- CREATE TABLE creates a table with the specified name. If a table with the same name already exists, an exception is thrown; the user can use the IF NOT EXISTS option to ignore the exception.

- The EXTERNAL keyword lets the user create an external table and specify the path to the actual data (LOCATION) while creating the table.

- LIKE allows users to copy an existing table structure, but not its data.

- COMMENT adds descriptions to tables and columns.

- ROW FORMAT

DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]

[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]

| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

Users can supply a custom SerDe or use Hive's built-in SerDe when creating a table. If ROW FORMAT is not specified, or ROW FORMAT DELIMITED is specified, the built-in SerDe is used. When creating a table you also specify the table's columns; if a custom SerDe is specified, Hive uses that SerDe to determine the data of the table's specific columns (a short SerDe example follows the STORED AS notes below).

- STORED AS

SEQUENCEFILE

| TEXTFILE

| RCFILE

| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

If the file data is plain text, you can use STORED AS TEXTFILE. If the data needs to be compressed, use STORED AS SEQUENCEFILE.
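As a hedged sketch of the custom-SerDe path mentioned above, the following uses Hive's bundled RegexSerDe to parse lines with a regular expression instead of simple delimiters (the table name and the regex are invented for the example; the SerDe class name assumes a reasonably recent Hive build that ships org.apache.hadoop.hive.serde2.RegexSerDe):

-- Each line is split into two columns by the regex capture groups
CREATE TABLE regex_log (host STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(\\S+) (.+)")
STORED AS TEXTFILE;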

Example 1: create a simple table

CREATE TABLE pokes (foo INT, bar STRING)

Example 2: create an external table

CREATE EXTERNAL TABLE page_view (viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS TEXTFILE
LOCATION ''

Example 3: create a partition table

CREATE TABLE par_table (viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY (date STRING, pos STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS SEQUENCEFILE

Example 4: create a Bucket table

CREATE TABLE par_table (viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY (date STRING, pos STRING)
CLUSTERED BY (userid) SORTED BY (viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS SEQUENCEFILE

Example 5: create the table and create the index field ds

CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING)

Example 6: copy an empty table

CREATE TABLE empty_key_value_store

LIKE key_value_store

Example 7: show all tables

SHOW TABLES

Example 8: display the table by regular condition (regular expression)

SHOW TABLES '.*s'

Example 9: add a column to the table

ALTER TABLE pokes ADD COLUMNS (new_col INT)

Example 10: add a column and add column field comments

ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment')

Example 11: change the table name

ALTER TABLE events RENAME TO 3koobecaf

Example 12: delete a table

DROP TABLE pokes

Example 13: add or delete partitions

Add:

ALTER TABLE table_name ADD [IF NOT EXISTS] partition_spec [LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...

partition_spec:

PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)

Delete:

ALTER TABLE table_name DROP partition_spec, partition_spec,...

Example 14: rename table

ALTER TABLE table_name RENAME TO new_table_name

Example 15: modify column names, types, locations, comments

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST | AFTER column_name]

This command allows you to change column names, data types, comments, column locations, or any combination of them
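For instance, a hedged concrete use of this syntax, reusing the invites table and the new_col2 column added in Example 10 (the new name bar_count is invented):

-- Rename new_col2 to bar_count, change its type to BIGINT, and move it after the bar column
ALTER TABLE invites CHANGE COLUMN new_col2 bar_count BIGINT COMMENT 'renamed and retyped' AFTER bar;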

Example 16: create / delete views

CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment],...)] [COMMENT view_comment] [TBLPROPERTIES (property_name = property_value,...)] AS SELECT

Create a view.

If no column names are provided, the names of the view's columns are generated automatically from the defining SELECT expression.

If the properties of the underlying base table are changed, the change is not reflected in the view, and invalid queries against the view will fail.

Views are read-only; LOAD/INSERT/ALTER cannot be applied to them.

DROP VIEW view_name

Drop a view.

Example 17: create a database

CREATE DATABASE name

Example 18: display commands

SHOW TABLES

SHOW DATABASES

SHOW PARTITIONS table_name

SHOW FUNCTIONS

DESCRIBE EXTENDED table_name [.col_name]

2) DML operations

Hive does not support inserting rows one by one with INSERT statements, nor does it support UPDATE operations. Data is loaded into an established table with LOAD; once imported, the data cannot be modified.

DML includes:

- INSERT

- UPDATE

- DELETE

- Load files into a data table

- Insert query results into a Hive table

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2...)]

The Load operation is simply a copy / move operation, moving the data file to the location corresponding to the Hive table.

filepath can be:

A relative path, for example: project/data1

An absolute path, for example: /user/hive/project/data1

A full URI including the scheme, for example: hdfs://namenode:9000/user/hive/project/data1

Example 1: load a file into a data table

LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes

Example 2: loading local data while giving partition information

LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15')

Example 3: loading HDFS data while giving partition information

LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15')

Example 4: insert the query results into the Hive table

Basic mode:

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement

Multi-insert mode:

FROM from_statement

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1

[INSERT OVERWRITE TABLE tablename2 [PARTITION ...] select_statement2] ...

Automatic partition mode:

INSERT OVERWRITE TABLE tablename PARTITION (partcol1 [= val1], partcol2 [= val2] ...) select_statement FROM from_statement
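A hedged sketch of the automatic (dynamic) partition mode, reusing the invites table from the earlier examples (the staging_invites source table is invented; the two set commands assume the standard hive.exec.dynamic.partition properties):

-- Allow dynamic partitioning without requiring a static partition value
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

-- Each distinct ds value in the source rows lands in its own partition of invites
INSERT OVERWRITE TABLE invites PARTITION (ds)
SELECT foo, bar, ds FROM staging_invites;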

Example 5: write query results to the HDFS file system

INSERT OVERWRITE [LOCAL] DIRECTORY directory1 SELECT... FROM...

FROM from_statement

INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1

[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2]

When data is written to the file system, it is serialized as text, with columns separated by ^A and rows separated by newlines (\n).

Example 6: INSERT INTO

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement

3) DQL operation: data query SQL

DQL includes:

- Basic SELECT operations

- Partition-based queries

- Joins

Basic Select operations:

SELECT [ALL | DISTINCT] select_expr, select_expr,...

FROM table_reference

[WHERE where_condition]

[GROUP BY col_list [HAVING condition]]

[CLUSTER BY col_list

| [DISTRIBUTE BY col_list] [SORT BY | ORDER BY col_list]

]

[LIMIT number]

Use the ALL and DISTINCT options to distinguish between the processing of duplicate records. The default is ALL, which means that all records are queried. DISTINCT means to remove duplicate records.
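For example (a hedged illustration using the invites table from the earlier examples):

-- Default ALL: returns every matching row, duplicates included
SELECT bar FROM invites;

-- DISTINCT: returns each distinct bar value only once
SELECT DISTINCT bar FROM invites;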

- WHERE condition

- Similar to the WHERE condition in traditional SQL

- Currently supports AND and OR; BETWEEN is supported since version 0.9

- Supports IN and NOT IN

- Does not support EXISTS or NOT EXISTS

The difference between ORDER BY and SORT BY:

- ORDER BY performs a global sort and uses only one reduce task

- SORT BY only sorts within each reducer (locally on each machine)

LIMIT:

- LIMIT restricts the number of records returned by a query (see the sketch after this list)
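A hedged sketch contrasting these clauses, again on the invites table (the exact parallelism depends on the cluster configuration):

-- Global sort: all rows pass through a single reducer, then keep the first 10
SELECT foo, bar FROM invites ORDER BY foo LIMIT 10;

-- Rows with the same bar go to the same reducer and are sorted only within it
SELECT foo, bar FROM invites DISTRIBUTE BY bar SORT BY foo;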

Example 1: query with a condition

SELECT a.foo FROM invites a WHERE a.ds='2008-08-15'

Example 2: output query data to a directory

INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15'

Example 3: output the query results to the local directory

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a

Example 4: select all columns to the local directory

INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;

INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;

INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;

INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a WHERE a.ds='';

INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;

Example 5: insert the aggregated results of one table into another table

INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

Example 6: insert data from one table into multiple tables

FROM src

INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100

INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 AND src.key < 200

INSERT OVERWRITE TABLE dest3 PARTITION (ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 AND src.key < 300

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300

Example 7: inserting a file stream directly into a file

FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09'

In the Hadoop ecosystem, batch computing over big data usually requires one or more MapReduce jobs, but this batch-oriented approach cannot meet requirements for high real-time performance.

This is where Storm comes into play. How does Storm perform real-time computation efficiently? In the next article, I will introduce the open-source distributed real-time computing system, Storm.

