Introduction to Analysis of Hive (4)

2025-03-28 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

10 Hive architecture

10.1 concept

User interface: the entry point through which users access Hive

Metastore: Hive's metadata, including the metadata of tables

Interpreter: the component that parses HQL

Compiler: the component that compiles HQL

Optimizer: the component that optimizes HQL

10.2 Hive architecture and basic composition

1. Architecture diagram

2. Basic composition

User interface, including the CLI, JDBC/ODBC, and the WebUI

Metadata store, usually kept in a relational database such as MySQL or Derby

Interpreter, compiler, optimizer, executor

Hadoop: HDFS for storage, MapReduce for computation

3. The basic functions of each component

There are three main user interfaces: the CLI, JDBC/ODBC, and the WebUI

CLI: the shell command line

JDBC/ODBC: Hive's Java client interface, used much like JDBC with a traditional database

WebUI: access to Hive through a browser

Hive stores its metadata in a database. Currently only MySQL and Derby are supported, with more databases to be supported in later releases. The metadata in Hive includes table names, the columns and partitions of each table and their attributes, table attributes (whether it is an external table, etc.), the directory where the table's data is located, and so on.

The interpreter, compiler, and optimizer take an HQL query through lexical analysis, syntax analysis, compilation, optimization, and query-plan generation. The generated query plan is stored in HDFS and subsequently executed by MapReduce calls.

Hive's data is stored in HDFS, and most queries are executed by MapReduce (trivial queries such as SELECT * FROM table do not generate MapReduce jobs).

4. Metastore

The Metastore is the system catalog that holds metadata about the tables stored in Hive.

The Metastore is the feature that distinguishes Hive from other similar systems: it gives Hive a catalog like that of traditional database solutions such as Oracle and DB2.

The Metastore contains the following sections:

Database: the namespace for tables. The default database is named 'default'.

Table: a table's metadata includes its list of columns and their types, owner, storage, and SerDe information.

Partition: each partition can have its own columns, SerDe, and storage. This feature can be used to support schema evolution in Hive.

5. Compiler

The Driver invokes the compiler to process a HiveQL string, which may be a DDL, DML, or query statement.

The compiler converts the string into a plan.

The plan may consist purely of metadata operations (for DDL statements) or of HDFS operations (for LOAD statements).

For inserts and queries, the plan consists of a directed acyclic graph (DAG) of map-reduce tasks.

10.3 Hive operation mode

Hive's running mode is the execution environment of its tasks.

It can be either local or cluster mode.

The mode can be specified through mapred.job.tracker.

Setting it: hive> SET mapred.job.tracker=local;

10.4 data types

1. Original data type

Integers: TINYINT (1 byte), SMALLINT (2 bytes), INT (4 bytes), BIGINT (8 bytes)

Boolean: BOOLEAN (TRUE/FALSE)

Floating point: FLOAT (single precision), DOUBLE (double precision)

String: STRING (a sequence of characters in a specified character set)
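As a rough sketch of these primitive types in use (the table and column names here are hypothetical, not from the original):

```sql
-- Hypothetical table exercising Hive's primitive types
CREATE TABLE employee (
  id     BIGINT,    -- 8-byte integer
  age    TINYINT,   -- 1-byte integer
  dept   SMALLINT,  -- 2-byte integer
  salary DOUBLE,    -- double-precision floating point
  active BOOLEAN,   -- TRUE/FALSE
  name   STRING     -- character sequence
);
```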

2. Complex data types

Structs: e.g., {c INT; d INT}; fields are accessed with dot notation

Maps (key-value tuples): e.g., if a map M maps the key 'group' to a gid, the value is accessed as M['group']

Arrays (indexable lists): e.g., ['1', '2', '3']

TIMESTAMP: a new type added in version 0.8
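A minimal sketch of declaring and accessing these complex types (table and column names are hypothetical):

```sql
-- Hypothetical table exercising Hive's complex types
CREATE TABLE complex_demo (
  s  STRUCT<c:INT, d:INT>,  -- struct: field accessed as s.c
  m  MAP<STRING, INT>,      -- map: value accessed as m['group']
  a  ARRAY<STRING>,         -- array: element accessed as a[0]
  ts TIMESTAMP              -- added in version 0.8
);

-- Element access in a query
SELECT s.c, m['group'], a[0] FROM complex_demo;
```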

10.5 metadata storage for Hive

1. Storage mode and mode

Hive stores metadata in a database

There are three modes of connecting to the metadata database:

Single user mode

Multi-user mode

Remote server mode

2. Single user mode

This mode connects to an in-memory Derby database and is generally used for unit tests.

3. Multi-user mode

Connecting to a database over a network; this is the most frequently used mode.

4. Remote server mode

This mode is used when non-Java clients need to access the metadata database: a MetaStoreServer is started on the server side, and clients access the metadata database through it using the Thrift protocol.
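A hedged sketch of the client-side configuration for this mode; the host name is a placeholder, and 9083 is the conventional default metastore port:

```xml
<!-- hive-site.xml on the client: point at a remote MetaStoreServer
     over the Thrift protocol (host is a placeholder) -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>
```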

10.6 Hive data storage

1. The basic concept of Hive data storage.

Hive's data storage is based on Hadoop HDFS.

Hive does not have a special data storage format

The storage structure mainly includes: database, file, table and view.

Hive can load plain text files directly by default, and also supports SequenceFile and RCFile.

When creating a table, we directly tell Hive the column and row delimiters of the data, and Hive can parse the data

2. The data model of Hive-database

A database in Hive is similar to a database in a traditional DBMS.

In the metastore it is actually recorded in a table in the third-party database.

Simple example, on the command line: hive> CREATE DATABASE test_database;

3. Internal table

Similar in concept to Table in the database

In Hive, each Table has a corresponding directory in which its data is stored.

For example, a table test has the HDFS path /warehouse/test.

warehouse is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml.

All Table data (excluding External Table) is stored in this directory.

When you delete a table, both metadata and data are deleted
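As a sketch, the warehouse root mentioned above is set like this in hive-site.xml (the value shown is the conventional default):

```xml
<!-- hive-site.xml: root directory under which internal table data lives -->
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>
```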

4. Simple example of internal table

Create a data file test_inner_table.txt

Create a table

CREATE TABLE test_inner_table (key STRING);

Load data

LOAD DATA LOCAL INPATH 'filepath' INTO TABLE test_inner_table;

View data

SELECT * FROM test_inner_table;
SELECT COUNT(*) FROM test_inner_table;

Delete the table

DROP TABLE test_inner_table;

5. Partition table

A Partition corresponds to a dense index on the partition column in a database.

In Hive, a Partition of a table corresponds to a subdirectory under the table's directory, and all of the Partition's data is stored in that directory.

For example, if the test table has the two partition columns date and position, the HDFS subdirectory for date=20120801, position=zh is /warehouse/test/date=20120801/position=zh

The HDFS subdirectory for date=20120801, position=US is /warehouse/test/date=20120801/position=US

6. Simple example of partition table

Create a data file test_partition_table.txt

Create a table

CREATE TABLE test_partition_table (key STRING) PARTITIONED BY (dt STRING);

Load data

LOAD DATA INPATH 'filepath' INTO TABLE test_partition_table PARTITION (dt='2006');

View data

SELECT * FROM test_partition_table;
SELECT COUNT(*) FROM test_partition_table;

Delete the table

DROP TABLE test_partition_table;
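Because each partition is a separate directory, filtering on the partition column lets Hive prune the directories it has to scan. A sketch using the table above:

```sql
-- Reads only the dt=2006 subdirectory instead of the whole table
SELECT COUNT(*) FROM test_partition_table WHERE dt = '2006';
```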

7. External table

An external table points to data that already exists in HDFS, and it can also have partitions.

Its metadata is organized the same way as an internal table's, but the actual data is stored quite differently.

For an internal table, creation and data loading (which can be done in a single statement) move the actual data into the data warehouse directory; afterwards, access to the data happens directly in that directory. When the table is dropped, both its data and its metadata are deleted.

An external table involves only one step: creating the table and loading the data happen at the same time. The data is not moved into the data warehouse directory; only a link to the external data is established. When an external table is dropped, only that link is deleted.

8. Simple examples of external tables

Create a data file test_external_table.txt

Create a table

CREATE EXTERNAL TABLE test_external_table (key STRING);

Load data

LOAD DATA INPATH 'filepath' INTO TABLE test_external_table;

View data

SELECT * FROM test_external_table;
SELECT COUNT(*) FROM test_external_table;

Delete the table

DROP TABLE test_external_table;

9. Bucket Table (bucket table)

A table's data can be further decomposed, by applying a hash algorithm to a column, into different files (buckets).

For example, to divide the age column into 20 buckets, first hash the value of age: rows whose hash value is 0 are written to /warehouse/test/date=20120801/position=zh/part-00000, and rows whose hash value is 1 are written to /warehouse/test/date=20120801/position=zh/part-00001.

If you want to run many Map tasks in parallel, this is a good choice.

10. Simple example of Bucket Table

Create a data file test_bucket_table.txt

Create a table

CREATE TABLE test_bucket_table (key STRING) CLUSTERED BY (key) INTO 20 BUCKETS;

Load data

LOAD DATA INPATH 'filepath' INTO TABLE test_bucket_table;

View data

SELECT * FROM test_bucket_table;
SET hive.enforce.bucketing = true;
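One payoff of bucketing is efficient sampling. A sketch using Hive's TABLESAMPLE clause on the table above, which reads roughly one of the 20 buckets rather than the whole table:

```sql
-- Sample one bucket out of 20, bucketed on the clustering column
SELECT * FROM test_bucket_table TABLESAMPLE (BUCKET 1 OUT OF 20 ON key);
```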

11. Data model-view of Hive

Views are similar to views in a traditional database.

Views are read-only.

If the base table a view depends on changes, newly added columns do not affect what the view presents; but if the base table is dropped, querying the view fails.

If you do not specify columns for the view, they are derived from the SELECT statement.

Example

CREATE VIEW test_view AS SELECT * FROM test;
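If column names are specified explicitly, they override the names coming from the SELECT. A minimal sketch (the view and column names here are hypothetical):

```sql
-- The view exposes the column as k regardless of the underlying name
CREATE VIEW test_view_named (k) AS SELECT key FROM test;
```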

10.7 Hive web interface (HWI)

Configuration steps:

Add to hive-site.xml the property hive.hwi.war.file with the value lib/hive-hwi-0.8.1.war

Start Hive's web UI: sh $HIVE_HOME/bin/hive --service hwi

11 Hive principle

11.1 Hive principle

1. What you need to know to understand how Hive works

How many MR jobs will one Hive HQL statement be converted into?

How can Hive execution be sped up?

What can we do when writing Hive HQL?

How does Hive convert HQL into MR jobs?

What optimization methods does Hive adopt?

2. Hive architecture & execution flow chart

3. Hive execution process

The compiler converts a Hive QL statement into operators.

An operator is Hive's smallest processing unit.

Each operator represents an HDFS operation or a MapReduce job.

4. Operator

An Operator is a processing unit defined by Hive.

Operator is defined as:

Protected List
