10 Hive architecture
10.1 Concepts
User interface: the entry point through which users access Hive
Metadata: Hive's user information and the metadata of its tables
Interpreter: the component that parses HQL
Compiler: the component that compiles HQL
Optimizer: the component that optimizes HQL
10.2 Hive architecture and basic composition
1. Architecture diagram (figure not included here)
2. Basic composition
User interfaces, including the CLI, JDBC/ODBC, and the WebUI
Metadata storage, usually in a relational database such as MySQL or Derby
Interpreter, compiler, optimizer, and executor
Hadoop: HDFS for storage, MapReduce for computation
3. The basic functions of each component
There are three main user interfaces: the CLI, JDBC/ODBC, and the WebUI
CLI: the shell command line
JDBC/ODBC: the Java client interface of Hive, used in much the same way as a traditional database's JDBC interface
WebGUI: accesses Hive through a browser
Hive stores its metadata in a database. Currently only MySQL and Derby are supported, and more databases will be supported in later releases. The metadata in Hive includes the table name, the table's columns and partitions and their attributes, table attributes (whether it is an external table, etc.), the directory where the table's data resides, and so on.
The interpreter, compiler, and optimizer carry an HQL query statement through lexical analysis, syntax analysis, compilation, optimization, and query plan generation. The generated query plan is stored in HDFS and subsequently executed by MapReduce calls.
Hive's data is stored in HDFS, and most queries are executed by MapReduce (simple select * queries, such as select * from table, do not generate MapReduce jobs)
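For instance, a plain fetch runs without a job while an aggregation compiles into one (a small illustration; test is a hypothetical table):
hive> select * from test;          -- plain fetch, no MapReduce job
hive> select count(*) from test;   -- compiled into a MapReduce job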
4. Metastore
The Metastore is the system catalog that holds the metadata of the tables stored in Hive
The Metastore is what distinguishes Hive from other similar systems when it is positioned as a traditional database solution in the vein of Oracle or DB2
The Metastore contains the following objects:
Database: the namespace of tables. The default database is named 'default'
Table: a table's metadata contains the list of columns and their types, the owner, storage information, and SerDe information
Partition: each partition has its own columns, SerDe, and storage information. This feature will be used to support schema evolution in Hive
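These catalog objects can be inspected from the CLI (a brief illustration; test is a hypothetical table):
hive> show databases;             -- lists namespaces, including 'default'
hive> describe formatted test;    -- shows columns, owner, data location, and SerDe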
5. Compiler
The Driver calls the compiler to process a HiveQL string, which may be a DDL or DML statement or a query
The compiler converts the string into a plan
For DDL statements the plan consists only of metadata operations, and for LOAD statements it consists only of HDFS operations
For inserts and queries, the plan consists of a directed acyclic graph (DAG) of map-reduce tasks
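The generated plan can be inspected with EXPLAIN (a quick illustration; test is a hypothetical table):
hive> explain select count(*) from test;   -- prints the stage DAG and operator tree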
10.3 Hive operation mode
The running mode of Hive is the execution environment of its tasks.
It can be local mode or cluster mode.
The mode is specified through the mapred.job.tracker property.
Setting method: hive> SET mapred.job.tracker=local
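A sketch of switching between the two modes (master:9001 is a hypothetical JobTracker address for a Hadoop 1.x-style cluster):
hive> SET mapred.job.tracker=local;        -- run the job locally
hive> SET mapred.job.tracker=master:9001;  -- submit the job to the cluster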
10.4 Data types
1. Primitive data types
Integers: TINYINT (1 byte), SMALLINT (2 bytes), INT (4 bytes), BIGINT (8 bytes)
Boolean type: BOOLEAN (TRUE/FALSE)
Floating-point numbers: FLOAT (single precision), DOUBLE (double precision)
String type: STRING (a sequence of characters in a specified character set)
2. Complex data types
Structs: for example {c INT; d INT}
Maps (key-value tuples): for example, if map M contains the mapping 'group' -> gid, the gid value is accessed as M['group']
Arrays (indexable lists): for example ['1', '2', '3']
TIMESTAMP: a new type added in version 0.8
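A sketch of these types in a table definition (the table and column names are made up for illustration):
create table complex_types_demo (
  s struct<c:int, d:int>,   -- struct fields accessed as s.c
  m map<string, int>,       -- map values accessed as m['group']
  a array<string>,          -- array elements accessed as a[0]
  ts timestamp              -- TIMESTAMP is available since Hive 0.8
);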
10.5 Metadata storage in Hive
1. Storage modes
Hive stores its metadata in a database
There are three modes for connecting to the metadata database:
Single-user mode
Multi-user mode
Remote server mode
2. Single-user mode
This mode connects to an in-memory Derby database and is generally used for unit tests
3. Multi-user mode
Connecting to a database (such as MySQL) over a network; this is the most frequently used mode
4. Remote server mode
Used when non-Java clients need to access the metadata database: a MetaStoreServer is started on the server side, and clients use the Thrift protocol to access the metadata database through the MetaStoreServer
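A minimal sketch of the relevant hive-site.xml properties (the host names are hypothetical; the property names are the standard Hive ones):
<!-- multi-user mode: metadata in a networked MySQL database -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metadb-host:3306/hive</value>
</property>
<!-- remote server mode: clients reach the metastore over Thrift (9083 is the default port) -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>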
10.6 Hive data storage
1. Basic concepts of Hive data storage
Hive's data storage is based on Hadoop HDFS
Hive has no special data storage format of its own
The storage structures mainly include databases, files, tables, and views
By default Hive can load text files directly; it also supports SequenceFile and RCFile
When creating a table, we tell Hive the column and row delimiters of the data, and Hive can then parse the data
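A sketch of declaring those delimiters at table creation (the table name and delimiters are illustrative):
create table demo_text_table (id int, name string)
row format delimited
fields terminated by '\t'   -- column delimiter
lines terminated by '\n'    -- row delimiter
stored as textfile;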
2. Hive's data model: database
A DataBase is similar to a database in a traditional DBMS
It is actually recorded as an entry in the third-party metastore database
Simple example, at the command line: hive> create database test_database
3. Internal tables
Similar in concept to a Table in a traditional database
Each Table has a corresponding directory in HDFS where Hive stores its data
For example, a table test has the HDFS path /warehouse/test
where warehouse is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml
All Table data (excluding External Table data) is stored in this directory
When a table is deleted, both its metadata and its data are deleted
4. Simple example of an internal table
Create a data file: test_inner_table.txt
Create a table:
create table test_inner_table (key string)
Load data:
LOAD DATA LOCAL INPATH 'filepath' INTO TABLE test_inner_table
View data:
select * from test_inner_table
select count(*) from test_inner_table
Delete the table:
drop table test_inner_table
5. Partition tables
A Partition corresponds to a dense index on the Partition column in a traditional database
In Hive, a Partition of a table corresponds to a subdirectory under the table's directory, and all Partition data is stored in that directory
For example, if the test table contains two Partitions, date and position, then the HDFS subdirectory corresponding to date=20120801 and position=zh is /warehouse/test/date=20120801/position=zh
and the HDFS subdirectory corresponding to date=20120801 and position=US is /warehouse/test/date=20120801/position=US
6. Simple example of a partition table
Create a data file: test_partition_table.txt
Create a table:
create table test_partition_table (key string) partitioned by (dt string)
Load data:
LOAD DATA INPATH 'filepath' INTO TABLE test_partition_table partition (dt='2006')
View data:
select * from test_partition_table
select count(*) from test_partition_table
Delete the table:
drop table test_partition_table
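One benefit worth noting with the table above (before it is dropped): a query that filters on the partition column reads only the matching subdirectory (partition pruning):
hive> select * from test_partition_table where dt='2006';   -- scans only the dt=2006 directory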
7. External tables
An external table points to data that already exists in HDFS, and Partitions can be created for it
It is organized the same way as an internal table in its metadata, but the storage of the actual data differs considerably
An internal table's creation and data loading (the two can be completed in a single statement) move the actual data into the data warehouse directory during loading; afterwards, access to the data takes place directly in the data warehouse directory. When the table is deleted, its data and metadata are deleted together
An external table involves only one process: loading the data and creating the table are completed at the same time, the actual data is not moved into the data warehouse directory, and only a link to the external data is established. When an external table is deleted, only that link is removed
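A sketch of pointing an external table at existing data (the LOCATION path is hypothetical):
create external table test_external_table (key string)
location '/existing/hdfs/path';   -- links to data already in HDFS; nothing is moved into the warehouse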
8. Simple example of an external table
Create a data file: test_external_table.txt
Create a table:
create external table test_external_table (key string)
Load data:
LOAD DATA INPATH 'filepath' INTO TABLE test_external_table
View data:
select * from test_external_table
select count(*) from test_external_table
Delete the table:
drop table test_external_table
9. Bucket tables
A table's columns can be further decomposed into different files for storage through a hash algorithm
For example, to spread the age column over 20 files, AGE is hashed first; rows whose hash value is 0 are written to /warehouse/test/date=20120801/position=zh/part-00000, and rows whose hash value is 1 are written to /warehouse/test/date=20120801/position=zh/part-00001
This is a good choice when you want to exploit a large number of Map tasks
10. Simple example of a bucket table
Create a data file: test_bucket_table.txt
Create a table:
create table test_bucket_table (key string) clustered by (key) into 20 buckets
Load data:
LOAD DATA INPATH 'filepath' INTO TABLE test_bucket_table
View data:
select * from test_bucket_table
set hive.enforce.bucketing = true   (must be enabled so that inserted data is actually distributed into the buckets)
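Once the data is bucketed, individual buckets can be sampled with Hive's TABLESAMPLE clause (a brief illustration):
hive> select * from test_bucket_table tablesample(bucket 1 out of 20 on key);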
11. Hive's data model: views
Views are similar to views in traditional databases
Views are read-only
Views depend on their base tables: if a base table changes, added columns do not affect how the view renders, but deleted columns cause problems
If you do not specify columns for the view, they are generated from the select statement
Example:
create view test_view as select * from test
10.7 Hive web interface (HWI)
Configuration steps:
Add to hive-site.xml the property hive.hwi.war.file with the value lib/hive-hwi-0.8.1.war
Start Hive's web UI: sh $HIVE_HOME/bin/hive --service hwi
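The hive-site.xml entry written out (the war file name is the one given above; adjust it to your Hive version):
<property>
  <name>hive.hwi.war.file</name>
  <value>lib/hive-hwi-0.8.1.war</value>
</property>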
11 Hive principles
11.1 Hive principles
1. Why it is worth learning how Hive works
How many MR jobs will one Hive HQL statement be converted into?
How can the execution of Hive be sped up?
What should we pay attention to when writing Hive HQL?
How does Hive convert HQL into MR jobs?
What optimization methods does Hive adopt?
2. Hive architecture & execution flow chart (figure not included here)
3. Hive execution process
The compiler converts a Hive QL statement into operators
An Operator is the smallest processing unit in Hive
Each operator represents an HDFS operation or a MapReduce job
4. Operator
An Operator is a processing unit defined by Hive.
The Operator class is defined roughly as follows (a sketch based on older Hive sources; the exact generic types vary by version):
protected List<Operator<? extends Serializable>> childOperators;
protected List<Operator<? extends Serializable>> parentOperators;