This article presents a sample analysis of Hive. The material is organized to be easy to follow, and I hope it helps clear up any doubts as you study it.
Hive definition
Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for extracting, transforming, and loading data (ETL), along with a mechanism for storing, querying, and analyzing large-scale data kept in Hadoop. Hive defines a simple SQL-like query language, called HQL, that lets users familiar with SQL query the data. At the same time, the language lets developers familiar with MapReduce plug in custom mappers and reducers to handle complex analytical work that the built-in mappers and reducers cannot.
Hive does not impose a special data format. Hive works well on top of Thrift, controls delimiters, and allows users to specify their own data formats.
So when we talk about Hive, we are talking about this architecture, or this data warehouse as a whole, and there is no strict rule about whether the name also covers Hive SQL.
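As a minimal sketch of what HQL looks like in practice (the page_views table and its columns are assumptions for illustration, not part of the original text), the following query reads like ordinary SQL, while Hive compiles it into MapReduce work behind the scenes:

-- Count page views per URL for one day; page_views is a hypothetical table.
SELECT url, COUNT(*) AS views
FROM page_views
WHERE ds = '20090801'
GROUP BY url;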
Applicable scenarios for Hive
Since, as mentioned above, Hive differs from traditional databases, it must have its own distinctive characteristics:
Hive is built on top of Hadoop's static batch processing. Hadoop usually has high latency and considerable overhead when jobs are submitted and scheduled, so Hive cannot deliver low-latency, fast queries over large datasets; even on datasets of a few hundred MB, a Hive query typically shows minute-level latency. Hive is therefore not suitable for applications that require low latency, such as online transaction processing (OLTP).

Hive query execution strictly follows the Hadoop MapReduce job model: Hive translates the user's HiveQL statements into MapReduce jobs through its interpreter, submits them to the Hadoop cluster, and Hadoop monitors the job execution and returns the results to the user. Hive was not designed for online transaction processing, and it does not provide real-time queries or row-level data updates. The best use case for Hive is batch jobs over big datasets, for example web log analysis.
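For illustration only, a batch job of the kind Hive suits might look like the following HiveQL; the web_logs table, its columns, and the ds partition are assumptions, and such a query would normally be scheduled periodically rather than run interactively:

-- Hypothetical nightly report: request and visitor counts per HTTP status for one day of logs.
SELECT status, COUNT(*) AS requests, COUNT(DISTINCT ip) AS visitors
FROM web_logs
WHERE ds = '20090801'
GROUP BY status;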
Technical characteristics of Hive
Hive is a data warehouse processing tool that wraps Hadoop underneath and queries data with the SQL-like HiveQL language. All Hive data is stored in Hadoop-compatible file systems (for example, Amazon S3 or HDFS). Hive does not modify the data while loading it; it only moves the data into the directory Hive has configured in HDFS. As a result, Hive does not support rewriting or appending data in place: everything about the data is determined at load time. Hive's design features are as follows.
● Supports indexing to speed up data queries.
● Supports different storage types, such as plain text files and files in HBase.
● Keeps metadata in a relational database, greatly reducing the time spent on semantic checks during queries.
● Can use data stored in the Hadoop file system directly.
● Ships with a large number of built-in user functions (UDFs) for manipulating time, strings, and other data, and lets users write their own UDFs for operations the built-in functions cannot handle (see the sketch after this list).
● Offers SQL-like queries, which are converted into MapReduce jobs and executed on the Hadoop cluster.
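As a rough sketch of the last two points (the visits table, its columns, the jar path, and the UDF class name are all hypothetical), built-in string and date functions are used directly inside a query, and a custom UDF, once registered, is called the same way:

-- Built-in functions on strings and dates, applied to a hypothetical table.
SELECT upper(country), to_date(visit_time), concat(first_name, ' ', last_name)
FROM visits
LIMIT 10;

-- Registering and using a user-defined function; the jar path and class name are placeholders.
ADD JAR /tmp/my_udfs.jar;
CREATE TEMPORARY FUNCTION clean_url AS 'com.example.hive.udf.CleanUrl';
SELECT clean_url(url) FROM visits LIMIT 10;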
Architecture of Hive
User interface
There are three main user interfaces: the CLI, the Client, and the WUI. The most commonly used is the CLI; when the CLI starts, it also starts a copy of Hive. The Client is Hive's client, through which the user connects to the Hive Server; when starting in Client mode, you need to specify the node where the Hive Server runs and start the Hive Server on that node. The WUI accesses Hive through a browser.
Metadata storage
Hive stores its metadata in a database such as MySQL or Derby. The metadata in Hive includes table names, the columns and partitions of each table and their properties, table attributes (whether it is an external table, and so on), and the directory where the table's data resides.
Interpreter, compiler, optimizer, executor
The interpreter, compiler, and optimizer carry out lexical analysis, syntax analysis, compilation, optimization, and query plan generation for a HiveQL statement. The generated query plan is stored in HDFS and is subsequently executed by invoking MapReduce.
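One way to inspect the plan this pipeline produces, shown here as a sketch using the hypothetical pvs table discussed later in this article, is Hive's EXPLAIN statement, which prints the generated stages (typically one or more MapReduce stages) without executing the query:

-- Print the execution plan instead of running the query.
EXPLAIN
SELECT ctry, COUNT(*)
FROM pvs
WHERE ds = '20090801'
GROUP BY ctry;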
Hadoop
Hive's data is stored in HDFS, and most queries are completed by MapReduce (queries that consist only of *, such as select * from tbl, do not generate MapReduce jobs).
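As a small illustrative contrast (tbl and col1 are placeholders), the first statement below can be answered by simply reading the table's files and does not launch a MapReduce job, while the second requires aggregation and does:

-- No MapReduce job: Hive streams the files back directly.
SELECT * FROM tbl;

-- Launches a MapReduce job: a shuffle and aggregation are needed.
SELECT col1, COUNT(*) FROM tbl GROUP BY col1;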
Data storage of Hive
First of all, Hive has no special data storage format, nor does it build indexes on the data. Users can organize tables in Hive quite freely: as long as they tell Hive the column delimiter and row separator of the data when creating the table, Hive can parse the data (a sketch of such a table definition follows).
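A sketch of such a table definition, with the table name, column names, and delimiters chosen arbitrarily for illustration:

-- Declare the column delimiter and row separator so Hive can parse plain text files.
CREATE TABLE raw_events (
  userid STRING,
  url    STRING,
  ip     STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;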
Second, all the data in Hive is stored in HDFS, and Hive contains the following data models: table (Table), external table (External Table), partition (Partition), bucket (Bucket).
A Table in Hive is conceptually similar to a Table in a database, and every Table has a corresponding directory in Hive where its data is stored. For example, a table pvs has the HDFS path /wh/pvs, where wh is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml; all Table data (excluding External Table data) is stored in this directory.
A Partition roughly corresponds to a dense index on the partition columns in a database, but partitions are organized very differently in Hive. In Hive, a Partition of a table corresponds to a subdirectory under the table's directory, and all of a partition's data is stored in that subdirectory. For example, if the pvs table has two partition columns, ds and ctry, then the HDFS subdirectory corresponding to ds = 20090801, ctry = US is /wh/pvs/ds=20090801/ctry=US, and the HDFS subdirectory corresponding to ds = 20090801, ctry = CA is /wh/pvs/ds=20090801/ctry=CA.
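Continuing the pvs example, here is a sketch of how such partitions might be declared and populated (the data columns and the staging path are assumptions); each partition value becomes one directory level under /wh/pvs as described above:

-- ds and ctry are partition columns; each value pair maps to a subdirectory.
CREATE TABLE pvs (userid STRING, url STRING)
PARTITIONED BY (ds STRING, ctry STRING);

LOAD DATA INPATH '/staging/pvs_us_20090801'
INTO TABLE pvs PARTITION (ds = '20090801', ctry = 'US');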
A Bucket computes a hash on a specified column and splits the data by hash value so that it can be processed in parallel, with each Bucket corresponding to one file. For example, to distribute the user column into 32 buckets, Hive first computes the hash of each user value; rows whose hash value is 0 go into the HDFS file /wh/pvs/ds=20090801/ctry=US/part-00000, and rows whose hash value is 20 go into /wh/pvs/ds=20090801/ctry=US/part-00020.
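A sketch of declaring those 32 buckets on the user column (here called userid; the other columns are placeholders). Hive hashes the bucketing column to decide which of the 32 files within each partition a row is written to:

-- Hash userid into 32 buckets, one file per bucket within each partition.
CREATE TABLE pvs_bucketed (userid STRING, url STRING)
PARTITIONED BY (ds STRING, ctry STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;

-- On older Hive versions this setting is needed so inserts actually produce the bucket files.
SET hive.enforce.bucketing = true;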
An External Table points to data that already exists in HDFS, and it can also have Partitions. It is organized in metadata the same way as a Table, but the actual data is stored quite differently.
A Table is created and its data loaded in two steps (although both can be done in a single statement). During loading, the actual data is moved into the data warehouse directory, and subsequent access to the data happens directly in that directory. When the table is deleted, its data and metadata are deleted together.
An External Table involves only a single step: creating the table and loading the data happen at the same time (CREATE EXTERNAL TABLE... LOCATION). The actual data is stored in the HDFS path given after LOCATION and is not moved into the data warehouse directory. When an External Table is deleted, only the metadata is removed; the data itself is not deleted.
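A sketch of that one-step pattern (the table name, path, and columns are placeholders): the table's metadata points at an existing HDFS directory, and dropping the table later removes only the metadata:

-- Point an external table at data that already exists in HDFS; nothing is moved.
CREATE EXTERNAL TABLE pvs_ext (userid STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/existing/pvs';

-- Dropping it deletes only the metadata; the files under /data/existing/pvs remain.
DROP TABLE pvs_ext;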
The above is all the content of this "sample Analysis of Hive". Thank you for reading! I hope it has been helpful; if you would like to learn more, welcome to follow the industry information channel.