The basic theory of hive

2025-01-16 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

1. Introduction to hive

What is hive: Hive is a data warehouse tool built on hadoop. It is essentially a MapReduce computing framework on top of HDFS that analyzes and manages the data stored in HDFS.

How hive works: hive abstracts the data it stores into two-dimensional tables and provides a SQL-like query language. These SQL statements are ultimately translated by hive into MapReduce programs, which run on the hadoop cluster; the results are written back to HDFS. (The data must be structured.) Hive does not check data when it is written, only when it is read (schema on read).
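This translation can be sketched with a minimal HiveQL query (assuming a hypothetical `student` table with an `age` column); hive compiles the aggregate into one or more MapReduce jobs, with the shuffle grouping rows by `age`:

```sql
-- Hypothetical table student(id INT, name STRING, age INT).
-- The GROUP BY column becomes the MapReduce shuffle key.
SELECT age, COUNT(*) AS cnt
FROM student
GROUP BY age;
```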

The advantage of hive: it greatly simplifies the writing of distributed computing programs, so that staff who cannot write distributed programs can still perform statistical analysis over large amounts of data.

The disadvantages of hive: it does not support row-level insert, update, and delete operations; query latency is high; and it does not support transactions. Hive is mainly used for OLAP (online analytical processing).

The applicable scenario of hive: the hive data warehouse mainly stores structured data produced by ETL operations (data cleaning, extraction, transformation, loading). However, there is no special requirement on the storage format: plain files, compressed files, and so on all work.

Hive's comparison with relational databases:

2. The architecture of hive

The architecture of hive consists of four parts:

User interface:

- CLI (command line interface): the shell terminal command line, used interactively to work with hive; the most commonly used interface (learning, production, debugging)

- JDBC/ODBC: the client hive provides based on JDBC; users (developers, operations staff) use it to connect to the hiveserver service

- Web UI: access hive through a browser (rarely used)

Thrift Server: Thrift is a software framework developed by facebook for building scalable, cross-language services. Hive integrates this service so that different programming languages can call hive's interfaces.

The four underlying driver components: these take an HQL query through lexical analysis, syntax analysis, compilation, and optimization, and generate a logical execution plan. The generated plan is stored in HDFS and then executed by calling MapReduce.

- Interpreter: converts the hiveSQL statement into an abstract syntax tree

- Compiler: compiles the syntax tree into a logical execution plan

- Optimizer: optimizes the logical execution plan

- Executor: calls the underlying execution framework to run the logical execution plan

The execution process: a HiveQL statement is submitted through the command line or a client, compiled by the compiler (which uses the metadata in the metastore for type checking and syntax analysis) to produce a logical plan, then optimized, and finally turned into a MapReduce job.

Metastore (metabase): stores descriptions of the data in hive, typically including the table name, the table's columns and partitions with their attributes, table properties (internal or external table), and the directory where the table's data resides. Hive has two metadata storage schemes:

- By default, metadata is stored in hive's embedded derby database. The disadvantages are that it is not suitable for multi-user operation and the data storage directory is not fixed: the database follows hive's startup directory, which is extremely inconvenient to manage.

- Hive interacts with mysql (locally or remotely) through the Metastore service.

3. Data storage of hive

Storage characteristics of hive:

- All data in hive is stored in HDFS, and there is no special data storage format. Because hive uses schema on read, it can support TextFile, SequenceFile (serialized), RCFile (row-column hybrid), custom formats, and so on.

- You only need to tell hive the column delimiter and row delimiter in the data when creating the table, and hive can parse the data. The default column separator is \x01 (Ctrl-A, an invisible character) and the default row separator is \n (the newline character).
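A minimal sketch of declaring those delimiters at table-creation time (table and column names are hypothetical):

```sql
CREATE TABLE student (
  id   INT,
  name STRING,
  age  INT
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','    -- overrides the default \x01 (Ctrl-A)
  LINES TERMINATED BY '\n'    -- the default row separator
STORED AS TEXTFILE;
```

If the delimiter clauses are omitted, hive falls back to \x01 for columns and \n for rows.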

Hive storage structure: databases, tables, views, partitions, table data, and so on. Databases, tables, views, and partitions each correspond to a directory on HDFS, and table data corresponds to files under the table's directory on HDFS.

Example:

hdfs://Hadoop01/user/hive/warehouse: hive's data warehouse directory
hdfs://Hadoop01/user/hive/warehouse/myhive.db: a database in hive
hdfs://Hadoop01/user/hive/warehouse/myhive.db/student: a table in hive
hdfs://Hadoop01/user/hive/warehouse/myhive.db/student/student.txt: a data file of the table

Note: when we create a table, hive first creates a directory in the appropriate location on HDFS, and adds a record for the newly created table in the hive metastore.

Specific storage structure of hive:

- Data warehouse: represented in HDFS as a folder under the ${hive.metastore.warehouse.dir} directory

- Tables: hive tables are divided into internal tables, external tables, partitioned tables, and bucketed tables. Tables are also represented as directories in HDFS, but different table types are laid out differently.

- Views: hive views are not materialized; a view is equivalent to a shortcut for a SQL statement, saving the statement under a name. Views are read-only and are created based on base tables.

- Data files: the real data in the table
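A view as described above can be sketched like this (names are hypothetical); only the query text is stored, no data:

```sql
-- Saves the SELECT as a read-only shortcut over the base table.
CREATE VIEW adult_students AS
SELECT id, name
FROM student
WHERE age >= 18;
```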

4. Important points about hive

1) The difference between internal tables and external tables in hive

Internal table: also known as a managed table; both the table definition and the data life cycle are managed by hive itself.

External table: the table structure is the same as an internal table, but the location of the stored data is defined by the user. When an external table is dropped, only the metadata is deleted; the original data is not deleted.

The difference between internal tables and external tables is mainly reflected in two aspects:

- Deletion: dropping an internal table deletes both metadata and data; dropping an external table deletes the metadata and keeps the data.

- Usage: internal tables are preferred if all data processing is done in Hive, but external tables are more appropriate when Hive and other tools work on the same dataset. A common pattern is to use an external table to access data already stored on HDFS, transform it with hive, and store the result in an internal table.
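A sketch of that pattern, assuming data already sits under a hypothetical HDFS path:

```sql
-- External table over pre-existing files; hive does not take
-- ownership of the data under LOCATION.
CREATE EXTERNAL TABLE raw_student (
  id INT, name STRING, age INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/student';

-- DROP TABLE raw_student; would remove only the metastore entry;
-- the files under /data/raw/student would remain.
```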

2) The difference between bucketed tables and partitioned tables in hive

Partitioned table: the data of one large table is stored in different subdirectories, one per partition.

For a single-partition table there is only one level of subdirectories under the table directory; for a multi-partition table the subdirectories are nested to as many levels as there are partition columns. In either case, data files cannot be stored directly in the table directory or in a non-final partition directory.

Example:

Single-partition table:
hdfs://Hadoop01/user/hive/warehouse/myhive.db/student/p0
hdfs://Hadoop01/user/hive/warehouse/myhive.db/student/p1
hdfs://Hadoop01/user/hive/warehouse/myhive.db/student/p2
Multi-partition table:
hdfs://Hadoop01/user/hive/warehouse/myhive.db/student/p1/p11
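A single-partition table like the ones above can be sketched as follows (the `dt` partition column and paths are hypothetical):

```sql
CREATE TABLE student_part (
  id INT, name STRING, age INT
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Loading into a partition creates a subdirectory such as
-- .../student_part/dt=2025-01-16 under the table directory.
LOAD DATA INPATH '/tmp/student.txt'
INTO TABLE student_part PARTITION (dt = '2025-01-16');
```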

Bucketed table: the principle is the same as a HashPartitioner; when the data of a hive table is classified into buckets, the classification rule is hash partitioning. (You need to specify a bucketing column and how many buckets to split into.)

Bucket: in HDFS, under the same table directory or partition directory, the data appears as multiple files after hashing on the value of one column; each bucket is represented as a separate file.

Example:

hdfs://Hadoop01/user/hive/warehouse/myhive.db/student/000000_0
hdfs://Hadoop01/user/hive/warehouse/myhive.db/student/000001_0
hdfs://Hadoop01/user/hive/warehouse/myhive.db/student/000002_0
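A bucketed table producing such per-bucket files can be sketched like this (names and the bucket count are hypothetical):

```sql
-- Rows are hashed on id into 3 buckets; each bucket is one file.
CREATE TABLE student_bkt (
  id INT, name STRING, age INT
)
CLUSTERED BY (id) INTO 3 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- On older hive versions this setting makes inserts honor the
-- declared bucket count:
SET hive.enforce.bucketing = true;
INSERT INTO TABLE student_bkt SELECT id, name, age FROM student;
```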

Beyond the storage layout, the most important difference between partitioned tables and bucketed tables is their function:

- Partitioned table: refines data management and reduces the amount of data a MapReduce job needs to scan.

- Bucketed table: improves the efficiency of join queries (bucket a table when it is frequently used in joins, with the join column as the bucketing column) and improves the efficiency of sampling.
