2025-01-16 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)05/31 Report--
This article introduces the basic concepts of Hive: what it is, why it is used, how its architecture is organized, and how it stores data.
About Hive
What is Hive?
Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like query functionality through its query language, HQL.
In essence, Hive converts SQL statements into MapReduce jobs, with HDFS providing the underlying data storage. Hive can therefore be understood as a tool for translating SQL into MapReduce tasks.
Hive can store and compute data
Data storage relies on HDFS
Data computation relies on MapReduce
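As a sketch of what this means in practice, an HQL query reads like ordinary SQL, but Hive compiles it into one or more MapReduce jobs that read files from HDFS. The table and column names below are hypothetical, for illustration only:

```sql
-- Looks like ordinary SQL, but Hive translates it into
-- MapReduce jobs over files stored in HDFS.
-- "employees" and "dept" are hypothetical names.
SELECT dept, COUNT(*) AS cnt
FROM   employees
GROUP  BY dept;
```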
Why Hive?
Problems with using Hadoop directly
The learning cost for staff is too high
Project development cycles are too short
Complex query logic is too difficult to implement in MapReduce
Why use Hive?
The operation interface adopts SQL-like syntax and provides rapid development capability.
Avoid writing MapReduce and reduce the learning costs of developers.
Functionality is easy to extend.
Hive's characteristics
Scalability
Hive can scale out by adding nodes to the cluster, generally without restarting services.
Extensibility
Hive supports user-defined functions (UDFs), so users can implement their own functions as needed.
Fault tolerance
Hive has good fault tolerance: a SQL query can still complete even if a node has a problem.
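The extensibility point can be sketched with Hive's function-registration syntax. The JAR path and class name below are hypothetical; the UDF class itself would be written in Java against Hive's UDF API:

```sql
-- Sketch of registering a user-defined function (UDF).
-- '/tmp/my_udfs.jar' and 'com.example.hive.MyLowerUDF'
-- are hypothetical placeholders.
ADD JAR /tmp/my_udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.MyLowerUDF';

-- The registered function can then be used like any built-in.
SELECT my_lower(name) FROM employees;
```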
Hive Architecture
Metadata:
The data that describes the data is metadata.
table names,
table columns,
column types
Hive internal execution process:
Interpreter -> Compiler (uses metadata) -> Optimizer -> Executor
basic composition
User interfaces: CLI, JDBC/ODBC, and WebGUI. The CLI (command-line interface) is the Hive shell; JDBC/ODBC is Hive's Java client interface, analogous to a traditional database's JDBC driver; WebGUI provides access to Hive through a browser.
Metadata storage: metadata is usually stored in a relational database such as MySQL or Derby. Hive metadata includes table names, table columns and partitions with their attributes, table properties (e.g., whether a table is external), and the HDFS directory where each table's data resides.
Interpreter, compiler, optimizer, executor: these components carry an HQL statement through lexical analysis, syntax analysis, compilation, optimization, and query-plan generation. The generated query plan is stored in HDFS and subsequently executed by MapReduce.
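Hive's EXPLAIN statement makes this pipeline visible: it prints the plan that the compiler and optimizer produce without actually running the query. The table name below is hypothetical:

```sql
-- EXPLAIN shows the generated query plan (the MapReduce
-- stages Hive would run) without executing the query.
EXPLAIN
SELECT dept, COUNT(*)
FROM   employees
GROUP  BY dept;
```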
Hive and Hadoop
Hive uses HDFS to store data and MapReduce to query and analyze it.
Hive vs. traditional databases
Hive targets offline analysis of massive data sets.
1. Data formats. Hive does not enforce a specific data format.
The data format can be specified by the user; a user-defined format requires three attributes:
a column separator (commonly a space, "\t", or "\x001"),
a row separator ("\n"), and
a method for reading file data (Hive supports three file formats by default: TextFile, SequenceFile, and RCFile).
2. Hive does not convert data from the user's format into a Hive-defined format during loading.
3. Hive does not modify, or even scan, the data itself during loading; it simply copies or moves the data into the appropriate HDFS directory.
4. Hive does not support rewriting or updating data in place; all data content is determined at load time.
5. Hive does not build indexes on keys in the data during loading. To access specific values that satisfy a condition, Hive must brute-force scan the entire data set, so access latency is high. This high latency makes Hive unsuitable for online data queries.
6. Hive is built on top of Hadoop, so its scalability matches that of Hadoop.
Summary: Hive has the appearance of a SQL database, but its application scenarios are completely different; Hive is suitable only for statistical analysis of batch data.
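The points above about user-specified formats (point 1) and load-time behavior (point 3) can be sketched in DDL. The table, columns, and file path below are hypothetical:

```sql
-- A table declaration spelling out the three user-specified
-- attributes: column separator, row separator, and file format.
CREATE TABLE access_log (
    ip  STRING,
    url STRING,
    ts  BIGINT
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- Loading simply moves the file into the table's HDFS directory;
-- no parsing, conversion, or indexing happens at this point.
LOAD DATA INPATH '/data/access_log.txt' INTO TABLE access_log;
```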
Hive's Data Storage
1. All data in Hive is stored in HDFS. There is no special required storage format; supported formats include Text, SequenceFile, ParquetFile, ORC, and RCFile.
2. Hive can parse the data as long as it is told the column and row delimiters when the table is created.
3. Hive includes the following data models: DB, table, external table, partition, and bucket.
DB: appears in HDFS as a folder under the ${hive.metastore.warehouse.dir} directory
Table: a folder under the DB directory in HDFS
External table: similar to a table, but its data can be stored at any specified path
Partition: a subdirectory under the table directory in HDFS
Bucket: multiple files under the same table directory, produced by hashing
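These data models map directly onto HDFS paths, which the following illustrative DDL sketches. All names and paths are hypothetical:

```sql
-- DB: creates a folder under ${hive.metastore.warehouse.dir}
CREATE DATABASE sales;

-- External table: data lives at an arbitrary user-specified path
CREATE EXTERNAL TABLE sales.raw_orders (id BIGINT, amount DOUBLE)
LOCATION '/data/external/raw_orders';

-- Partition: each dt value becomes a subdirectory of the table dir;
-- Bucket: rows are hashed on id into 4 files per partition.
CREATE TABLE sales.orders (id BIGINT, amount DOUBLE)
PARTITIONED BY (dt STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
```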
This concludes the study of Hive's basic concepts. Combining theory with practice makes the material easier to absorb, so go try it out!