
Hive principles in practice

2025-04-26 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article walks through how Hive works in practice: its basic architecture, Hive SQL tables, partitioning and bucketing, how Hive SQL statements are executed, and the main optimization problems.

Hive basic architecture

Driver component: the core of Hive. It includes the Compiler, the Optimizer, and the Executor. Together they parse Hive SQL statements, compile and optimize them, generate an execution plan, and then call the underlying MapReduce computing framework.

Metastore component: the metadata service, which stores Hive's metadata in a relational database. Supported databases include Derby and MySQL.

CLI: the command-line interface.

Thrift Server: provides JDBC and ODBC access. Thrift is a framework for developing scalable, cross-language services; Hive integrates it so that programs written in different programming languages can call Hive interfaces.

Hive Web Interface (HWI): a web interface through which clients access the services that Hive provides.

Hive receives Hive SQL queries through the CLI, JDBC/ODBC, or HWI. The Driver component compiles, analyzes, and optimizes each query, and the result is an executable MapReduce job.

Hive SQL

Hive table: internal table and external table

Internal table: when data is loaded, the HDFS files are moved into the directory that Hive manages for the table. Dropping the table deletes both the table structure (the metadata) and the data files.

External table: the associated HDFS files are not moved, and dropping the table deletes only the table structure; the data files stay where they are.

Usage scenarios: internal tables are the usual choice when all data processing is done in Hive, while external tables are more appropriate when Hive and other tools work on the same dataset.
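The difference in drop behavior can be illustrated with a toy model. The sketch below is not Hive's API: the `metastore` dictionary, `create_table`, and `drop_table` are hypothetical names, and a local temporary directory stands in for HDFS.

```python
import os
import shutil
import tempfile

# Toy stand-in for the metastore (hypothetical):
# table name -> {"location": path, "external": bool}
metastore = {}

def create_table(name, location, external):
    """Register a table and make sure its data directory exists."""
    os.makedirs(location, exist_ok=True)
    metastore[name] = {"location": location, "external": external}

def drop_table(name):
    """Drop a table: the metadata always goes, the data only for internal tables."""
    meta = metastore.pop(name)
    if not meta["external"]:
        shutil.rmtree(meta["location"])

root = tempfile.mkdtemp()  # local directory standing in for HDFS
managed_dir = os.path.join(root, "warehouse", "orders")
external_dir = os.path.join(root, "shared", "logs")

create_table("orders", managed_dir, external=False)
create_table("logs", external_dir, external=True)

drop_table("orders")
drop_table("logs")

print(os.path.exists(managed_dir))   # internal table: is the data still there?
print(os.path.exists(external_dir))  # external table: is the data still there?
```

In this model both tables disappear from the metadata, but only the internal table's directory is deleted, which is why external tables are the safer choice for data shared with other tools.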

Partitioning and bucketing

Partitions speed up queries that touch only part of the data. Tables or partitions can be further divided into buckets, which add extra structure to the data that queries can exploit.

Data is usually bucketed for two reasons: more efficient queries and more efficient sampling.
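As a rough illustration of why both structures help, the sketch below lays rows out by partition value and bucket number in memory. The column names (`dt`, `user_id`), the bucket count, and the rule "bucket = key modulo bucket count" are assumptions made for this example; it is a simplified model, not Hive's storage code.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count

rows = [
    {"user_id": 1, "dt": "2025-04-25", "amount": 10},
    {"user_id": 2, "dt": "2025-04-25", "amount": 20},
    {"user_id": 5, "dt": "2025-04-25", "amount": 30},
    {"user_id": 6, "dt": "2025-04-25", "amount": 40},
    {"user_id": 3, "dt": "2025-04-26", "amount": 50},
    {"user_id": 4, "dt": "2025-04-26", "amount": 60},
    {"user_id": 5, "dt": "2025-04-26", "amount": 70},
    {"user_id": 9, "dt": "2025-04-26", "amount": 80},
]

def bucket_of(user_id, num_buckets=NUM_BUCKETS):
    # Assumption: an integer key is used as its own hash value.
    return user_id % num_buckets

# layout[partition value][bucket number] -> rows stored there
layout = defaultdict(lambda: defaultdict(list))
for row in rows:
    layout[row["dt"]][bucket_of(row["user_id"])].append(row)

def scan_partition(dt):
    """Partition pruning: read only the rows stored under one dt value."""
    return [r for bucket in layout[dt].values() for r in bucket]

def sample_bucket(dt, bucket):
    """Sampling: read a single bucket instead of the whole partition."""
    return list(layout[dt][bucket])

one_day = scan_partition("2025-04-26")
sample = sample_bucket("2025-04-26", 1)
print(len(one_day), [r["user_id"] for r in sample])
```

A query filtered on `dt` touches half of the rows here, and a sample drawn from one bucket touches fewer still, without scanning the rest of the table.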

How Hive SQL is executed:

Hive SQL statements fall roughly into three categories: select statements, group by statements, and join statements.

Process: input splits -> Map phase -> Combiner (optional) -> Shuffle phase (partition, sort, split, copy, merge, etc.) -> Reduce phase -> output files.
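The process above can be sketched in plain Python for a group by count, roughly what a statement such as `SELECT city, COUNT(*) ... GROUP BY city` needs. This is a single-process simulation under stated assumptions: the function names, the `city` column, and the CRC32-based partitioner are illustrative choices, not Hive or Hadoop internals.

```python
import zlib
from collections import defaultdict

def map_phase(split):
    """Map: emit (key, 1) for every row of one input split."""
    return [(row["city"], 1) for row in split]

def combine(pairs):
    """Combiner (optional): pre-aggregate the output of one map task."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())

def shuffle(map_outputs, num_reducers):
    """Shuffle: partition by key hash, then group and sort by key per reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            # Assumed partitioner: a deterministic hash of the key.
            target = zlib.crc32(key.encode("utf-8")) % num_reducers
            partitions[target][key].append(value)
    return [sorted(p.items()) for p in partitions]

def reduce_phase(partition):
    """Reduce: sum the partial counts for each key."""
    return [(key, sum(values)) for key, values in partition]

splits = [
    [{"city": "Beijing"}, {"city": "Shanghai"}, {"city": "Beijing"}],
    [{"city": "Shanghai"}, {"city": "Shenzhen"}, {"city": "Beijing"}],
]

map_outputs = [combine(map_phase(split)) for split in splits]
output_files = [reduce_phase(p) for p in shuffle(map_outputs, num_reducers=2)]
result = dict(pair for out in output_files for pair in out)
print(result)
```

Because all values for one key land on the same reducer, each key appears in exactly one output file. A join statement follows the same pattern, with the join key as the shuffle key and a tag on each row marking which table it came from.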

Other SQL on Hadoop technologies: Impala, Drill, HAWQ, Presto, Dremel, Spark SQL.

Hive optimization

The main challenges are data skew caused by group by, count distinct, joining a large table with a small table (map join), and joining two large tables.
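Two of these ideas are easy to show in miniature. The sketch below is a simplified model, not Hive's implementation: `map_join` loads the small table into an in-memory hash table and probes it while streaming the large table, so the join needs no shuffle; `two_stage_count` handles a skewed group by key by adding a random salt in a first aggregation stage and merging the partial results in a second. All function names, table contents, and the salt count are assumptions made for illustration.

```python
import random
from collections import Counter

def map_join(big_rows, small_rows, key):
    """Map join: hash the small table in memory, then stream the big table.
    Assumes the join key is unique in the small table."""
    lookup = {row[key]: row for row in small_rows}
    return [{**big, **lookup[big[key]]} for big in big_rows if big[key] in lookup]

def two_stage_count(keys, num_salts=4, seed=0):
    """Two-stage group by count for skewed keys."""
    rng = random.Random(seed)
    # Stage 1: a random salt spreads one hot key across several groups,
    # so no single reducer has to process all of its rows.
    stage1 = Counter((key, rng.randrange(num_salts)) for key in keys)
    # Stage 2: drop the salt and merge the partial counts per key.
    stage2 = Counter()
    for (key, _salt), partial in stage1.items():
        stage2[key] += partial
    return dict(stage2)

orders = [
    {"city_id": 1, "amount": 10},
    {"city_id": 2, "amount": 5},
    {"city_id": 1, "amount": 7},
    {"city_id": 3, "amount": 2},  # no matching city: dropped by the inner join
]
cities = [
    {"city_id": 1, "name": "Beijing"},
    {"city_id": 2, "name": "Shanghai"},
]

joined = map_join(orders, cities, "city_id")
counts = two_stage_count(["hot"] * 1000 + ["cold"] * 3)
print(len(joined), counts)
```

The map join only works when the small table fits in memory. The salted aggregation trades one extra stage for a more even load, which pays off only when a few keys dominate the data.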
