What is the architecture of Apache Hive 3?

2025-04-05 Update. From: SLTechnology News&Howtos (shulou)

Shulou (Shulou.com) 06/01 Report

This article describes the architecture of Apache Hive 3 in detail and is intended as a reference for interested readers.

Apache Tez

Apache Tez is the Hive execution engine for the Hive on Tez service, which includes HiveServer (HS2) in Cloudera Manager. MapReduce is not supported as an execution engine. In a Cloudera cluster, an exception occurs if a legacy script or application specifies MapReduce for execution. Most user-defined functions (UDFs) need no changes to run on Tez instead of MapReduce.

By expressing a query as a directed acyclic graph (DAG) and using data transfer primitives, Hive on Tez improves query performance compared with MapReduce. In Cloudera Data Platform (CDP), Hive usually uses only the Tez engine, and the Tez application masters (AMs) are started and managed automatically when Hive on Tez starts. A SQL query that you submit to Hive is executed as follows:

1. Hive compiles the query.

2. Tez executes the query.

3. YARN allocates resources for applications across the cluster.

4. Hive updates the data in the data source and returns the query results.

Hive on Tez runs tasks on ephemeral containers and uses the standard YARN shuffle service.
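To make the DAG idea concrete, the following sketch models a query plan as a small graph of vertices and groups them into waves that can run in dependency order. It is only an illustration: the vertex names (Map 1, Map 2, Reducer 3, Reducer 4) and the plan itself are hypothetical, and real Tez scheduling involves far more than a topological sort.

```python
from graphlib import TopologicalSorter

# Hypothetical plan for a join followed by an aggregation.
# Each key is a vertex; its value is the set of vertices it depends on.
PLAN = {
    "Map 1": set(),                    # scan the first table
    "Map 2": set(),                    # scan the second table
    "Reducer 3": {"Map 1", "Map 2"},   # join the two scans
    "Reducer 4": {"Reducer 3"},        # aggregate the join output
}


def execution_waves(plan):
    """Group vertices into waves; vertices in one wave do not depend on each other."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(PLAN), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Because the two scans have no dependencies, they land in the same wave and could run in parallel, which is the kind of freedom a DAG gives an engine over a fixed map-then-reduce pipeline.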

Data storage and access control

One of the major architectural changes to support the Hive 3 design gives Hive much more control over metadata memory resources and over the file system or object store. The following architectural changes from Hive 2 to Hive 3 improve security:

Tightly controlled file system and computer memory resources replace flexible boundaries: definitive boundaries increase predictability, and greater file system control improves security.

Optimized workloads in shared files and YARN containers

By default, CDP Private Cloud Base stores Hive data on HDFS, and CDP Public Cloud stores Hive data on S3. In the public cloud, Hive uses HDFS only for temporary files. Hive 3 is optimized for object stores such as S3 in the following ways:

Hive uses ACID to determine which files to read, rather than relying on the storage system.

Hive 3 moves files less often than Hive 2.

Hive caches metadata and data aggressively to reduce file system operations.
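The first point can be illustrated with a deliberately simplified model of how a reader might choose files from transaction metadata instead of relying on renames or directory listings being atomic. The sketch assumes directory names of the form base_N and delta_X_Y, loosely modeled on Hive ACID table layouts, and treats a snapshot as a single highest visible write ID. Open and aborted transactions, delete deltas, and compaction details are ignored, so this is not Hive's actual selection logic.

```python
import re

BASE = re.compile(r"^base_(\d+)$")
DELTA = re.compile(r"^delta_(\d+)_(\d+)$")


def visible_dirs(dir_names, high_watermark):
    """Pick the directories a reader should scan for one snapshot.

    Simplified assumption: the snapshot is just the highest committed
    write ID that the reader is allowed to see.
    """
    bases = []
    deltas = []
    for name in dir_names:
        if (m := BASE.match(name)):
            bases.append((int(m.group(1)), name))
        elif (m := DELTA.match(name)):
            deltas.append((int(m.group(1)), int(m.group(2)), name))

    # Newest base that lies entirely within the snapshot, if any.
    usable = [b for b in bases if b[0] <= high_watermark]
    base_id, base_name = max(usable) if usable else (0, None)

    # Deltas written after that base and not beyond the snapshot.
    chosen = [name for lo, hi, name in sorted(deltas)
              if lo > base_id and hi <= high_watermark]
    return ([base_name] if base_name else []) + chosen


if __name__ == "__main__":
    listing = [
        "base_0000005",
        "delta_0000003_0000003",   # already folded into the base
        "delta_0000006_0000006",
        "delta_0000007_0000007",
        "delta_0000008_0000008",   # newer than the snapshot
    ]
    print(visible_dirs(listing, high_watermark=7))
```

A writer in this model only ever adds new directories, so a reader with an older snapshot keeps seeing a consistent set of files even on an object store without atomic renames.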

The major authorization model for Hive is Ranger. Hive enforces the access controls specified in Ranger. This model offers stronger security than other security schemes and more flexibility in managing policies.

This model permits only Hive to access the Hive warehouse. If you do not enable the Ranger security service or other security, Hive in CDP Private Cloud Base by default uses storage-based authorization (SBA) based on user impersonation.

HDFS permission changes

In CDP Private Cloud Base, SBA relies heavily on HDFS access control lists (ACLs). ACLs are an extension to the permissions system in HDFS. CDP Private Cloud Base turns on ACLs in HDFS by default, which gives you the following advantages:

More flexibility when granting specific permissions to multiple groups and users

Convenient application of permissions to a directory tree rather than to individual files
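As a rough illustration of both advantages, the sketch below assembles an hdfs dfs -setfacl command that grants permissions to several users and groups across a directory tree in one call. The helper name, the path, and the user and group names are hypothetical; the function only builds the argument list and does not run anything against a cluster.

```python
import shlex


def setfacl_command(path, entries, recursive=True, as_default=False):
    """Return the argv list for an `hdfs dfs -setfacl -m` call.

    entries: iterable of (kind, name, perms), e.g. ("group", "analysts", "r-x").
    as_default: prefix each entry with "default:" so new children inherit it.
    """
    prefix = "default:" if as_default else ""
    spec = ",".join(f"{prefix}{kind}:{name}:{perms}" for kind, name, perms in entries)
    argv = ["hdfs", "dfs", "-setfacl"]
    if recursive:
        argv.append("-R")          # apply to the whole directory tree
    argv += ["-m", spec, path]     # -m modifies the ACL instead of replacing it
    return argv


if __name__ == "__main__":
    # Hypothetical path and principals, for illustration only.
    argv = setfacl_command(
        "/data/sales",
        [("group", "analysts", "r-x"), ("user", "etl", "rwx")],
    )
    print(shlex.join(argv))
```

Passing as_default=True would produce default ACL entries instead, which directories created later under the path inherit.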

Transaction processing

You can deploy new Hive application types by taking advantage of the following transaction processing characteristics:

Mature versions of ACID transaction processing: ACID tables are the default table type, and enabling ACID by default causes no performance or operational overload.

Simplified application development, operations with strong transactional guarantees, and simple semantics for SQL commands. You do not need to bucket ACID tables.

Materialized view rewrites

Automatic query cache

Advanced optimizations

Hive client changes

CDP Private Cloud Base supports the thin client Beeline for working at the command line. You can run Hive administrative commands from the command line. Beeline uses a JDBC connection to Hive on Tez to run commands. Parsing, compiling, and executing operations occur in Hive on Tez. Beeline supports many of the command-line options that Hive CLI supported. However, Beeline does not support hive -e set key=value to configure the Hive Metastore.

You enter supported Hive CLI commands by invoking Beeline with the hive keyword, a command option, and a command, for example, hive -e set. Using Beeline instead of the thick client Hive CLI, which is no longer supported, has several advantages, including low overhead, because Beeline does not use the entire Hive code base. The small number of daemons required to run queries simplifies monitoring and debugging.
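For scripted use, a Beeline invocation can be assembled as an argument list, as in the sketch below. It relies on the Beeline options -u (JDBC URL), -n (user name), -e (inline query), and -f (script file). The helper name, host name, and user name are hypothetical, and the function only builds the command without running it.

```python
import shlex


def beeline_command(jdbc_url, user, query=None, script=None):
    """Return the argv list for a Beeline call that runs a query or a script file."""
    if (query is None) == (script is None):
        raise ValueError("pass exactly one of query or script")
    argv = ["beeline", "-u", jdbc_url, "-n", user]
    if query is not None:
        argv += ["-e", query]      # run one inline statement
    else:
        argv += ["-f", script]     # run the statements in a file
    return argv


if __name__ == "__main__":
    # Hypothetical endpoint and user, for illustration only.
    argv = beeline_command(
        "jdbc:hive2://hs2.example.com:10000/default",
        "alice",
        query="SHOW TABLES",
    )
    print(shlex.join(argv))
```

Building a list rather than a single string keeps the query as one argument, so it can be passed to subprocess.run without shell quoting problems.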

Hive on Tez enforces whitelist and blacklist settings, which you can change using the SET command. Using blacklists, you can limit memory configuration changes to prevent instability. You can configure multiple Hive on Tez instances with different whitelists and blacklists to establish different levels of stability.
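The effect of these lists can be sketched as a small policy check that decides whether a SET request is accepted on a given instance. This is a simplified model and not Hive's implementation: the class name, the pattern syntax, and the choice of keys in the example are assumptions made for illustration.

```python
import re


class SetPolicy:
    """Decide whether `SET key=value` is allowed on one Hive on Tez instance."""

    def __init__(self, whitelist_patterns, blacklist_keys):
        self._whitelist = [re.compile(p) for p in whitelist_patterns]
        self._blacklist = set(blacklist_keys)

    def allows(self, key):
        # The blacklist wins: a listed key can never be changed at runtime.
        if key in self._blacklist:
            return False
        # Otherwise the key must match at least one whitelist pattern in full.
        return any(p.fullmatch(key) for p in self._whitelist)


# Two hypothetical instances with different levels of strictness.
batch = SetPolicy(
    whitelist_patterns=[r"hive\.exec\..*", r"hive\.tez\..*"],
    blacklist_keys=["hive.tez.container.size"],   # block memory changes
)
interactive = SetPolicy(
    whitelist_patterns=[r"hive\.exec\.reducers\.max"],
    blacklist_keys=["hive.tez.container.size"],
)

if __name__ == "__main__":
    for key in ("hive.exec.reducers.max", "hive.tez.container.size", "hive.exec.parallel"):
        print(key, batch.allows(key), interactive.allows(key))
```

Giving the interactive instance a narrower whitelist than the batch instance mirrors the idea of running several instances with different levels of stability.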

Apache Hive Metastore sharing

Hive, Impala, and other components can share a remote Hive metastore (HMS). In CDP Public Cloud, HMS uses a pre-installed MySQL database. In the public cloud, HMS needs little or no configuration.

Spark integration

Spark and Hive tables interoperate using the Hive Warehouse Connector (HWC).

You can use HWC to access ACID tables and external tables from Spark. However, you do not need HWC to read or write Hive external tables: Spark users read from or write to Hive directly. You can read Hive external tables in ORC or Parquet format, but you can write Hive external tables in ORC format only.

Query execution of batch and interactive workloads

You can connect to Hive using a JDBC command-line tool, such as Beeline, or using a JDBC/ODBC driver with a BI tool, such as Tableau. Clients communicate with an instance of the same Hive on Tez version. You configure the settings file for each instance to perform either batch or interactive processing.
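Since each instance is reached through its own JDBC endpoint, a client can keep one connection URL per workload type. The sketch below builds jdbc:hive2:// URLs with optional session variables appended after the database name. The helper name and both host names are hypothetical, and a real deployment may use a different addressing scheme.

```python
def hive_jdbc_url(hosts, database="default", session_vars=None):
    """Build a jdbc:hive2:// URL.

    hosts: list of (hostname, port) pairs.
    session_vars: optional dict appended as ;key=value pairs after the database.
    """
    authority = ",".join(f"{host}:{port}" for host, port in hosts)
    url = f"jdbc:hive2://{authority}/{database}"
    for key, value in (session_vars or {}).items():
        url += f";{key}={value}"
    return url


# Hypothetical endpoints: one instance tuned for batch, one for interactive use.
ENDPOINTS = {
    "batch": hive_jdbc_url([("hs2-batch.example.com", 10000)]),
    "interactive": hive_jdbc_url(
        [("hs2-interactive.example.com", 10000)],
        database="reports",
        session_vars={"ssl": "true"},
    ),
}

if __name__ == "__main__":
    for workload, url in ENDPOINTS.items():
        print(f"{workload}: {url}")
```

The resulting strings can be passed to Beeline with -u or used as the connection URL in a JDBC-based BI tool.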
