This article explains the main functions of Apache Hive 3 in detail. The editor finds it very practical and shares it here for reference; I hope you gain something from reading it.
Main functions of Apache Hive 3
Cloudera Runtime (CR) services include Hive and Hive Metastore. The Hive service is based on Apache Hive 3.x, a SQL-based data warehouse system. Compared with previous versions, the enhancements in Hive 3.x improve query performance and strengthen compliance with data-privacy regulations.
ACID transaction processing
Hive 3 tables conform to the ACID (atomicity, consistency, isolation, and durability) standard, which is essential for honoring the right to be forgotten under the GDPR (General Data Protection Regulation).
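For illustration, a minimal sketch of how an ACID table supports an erasure request; the table and column names are hypothetical:
-- Hypothetical example: erase one user's rows to honor a "right to be forgotten" request.
-- Row-level DELETE requires a transactional (ACID) table, the default in Hive 3.
DELETE FROM customer_events WHERE customer_id = 12345;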
Shared Hive Metastore
Hive Metastore (HMS) is interoperable with multiple engines, such as Impala and Spark, thus simplifying interoperability between engines and user data access.
Low latency analytical processing (CDP public cloud)
Hive uses low-latency analytical processing (LLAP) or the Apache Tez execution engine to process transactions. The Hive LLAP service is not available in CDP Data Center.
Hive integration with Spark
You can use Hive to query data from an Apache Spark application without a workaround. Hive Warehouse Connector supports reading and writing Hive tables from Spark.
Security improvements
By default, Apache Ranger protects Hive data. To support concurrency improvements, ACID guarantees for GDPR compliance, security enforcement, and other features, Hive tightly controls the location of the warehouse on the file system or object store, as well as its memory resources.
Query-level workload management
You can configure who may use query resources, how much of them can be used, and how quickly Hive responds to resource requests. Workload management can improve parallel query execution, cluster sharing among queries, and query performance.
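As a rough sketch, workload management is configured through SQL resource plans; the plan and pool names below are hypothetical, and workload management must be enabled on the cluster:
-- Hypothetical resource plan: give BI queries 80% of resources, up to 5 in parallel.
CREATE RESOURCE PLAN daytime;
CREATE POOL daytime.bi WITH ALLOC_FRACTION = 0.8, QUERY_PARALLELISM = 5;
ALTER RESOURCE PLAN daytime ACTIVATE;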
Materialized view
Because multiple queries often need the same intermediate summary or joined tables, expensive repeated computation can be avoided by precomputing and caching those intermediate results as materialized views.
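A minimal sketch with hypothetical table and view names (in Hive 3, the source table of a materialized view must be transactional):
-- Precompute a summary that many queries share; the optimizer can then rewrite
-- matching aggregate queries to read the view instead of the base table.
CREATE MATERIALIZED VIEW mv_sales_by_state AS
SELECT state, SUM(amount) AS total_amount
FROM sales_fact
GROUP BY state;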
Query result cache
Hive filters and caches similar or identical queries. Hive does not recompute data that has not changed. Caching repeated queries can greatly reduce the load when hundreds or thousands of BI-tool and web-service users query Hive.
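The result cache is controlled by session-level properties; a brief sketch of the main ones (both default to true in Hive 3):
-- Reuse cached results instead of recomputing unchanged data.
SET hive.query.results.cache.enabled=true;
-- Let a query wait for an identical in-flight query and reuse its result.
SET hive.query.results.cache.wait.for.pending.results=true;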
Information_schema
At startup, Hive creates two databases from the JDBC data source: information_schema and sys. All Metastore tables are mapped into your tablespace and are available in sys. The information_schema data shows the state of the system, similar to the sys database data. You can query information_schema using standard SQL queries.
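For example, a standard SQL query against information_schema; the table name in the filter is hypothetical:
-- List the columns and data types of a (hypothetical) table.
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_name = 'web_logs';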
Unavailable or unsupported interfaces
S3 and LLAP (CDP Data Center 7.0 only)
Hive CLI (replaced by Beeline)
WebHCat
Hcat CLI
SQL standard authorization
MapReduce execution engine (replaced by Tez)
Apache Hive 3 Architecture Overview
Understanding the main design features of Apache Hive 3, such as the default ACID transaction processing, can help you use Hive to meet the growing needs of enterprise data warehouse systems.
Apache Tez
Apache Tez is the Hive execution engine for the Hive-on-Tez service in Cloudera Manager. MapReduce is not supported. In a Cloudera cluster, if a legacy script or application specifies MapReduce execution, an exception occurs. Most user-defined functions (UDFs) require no change to run on Tez instead of MapReduce.
With expressions of directed acyclic graphs (DAGs) and data-transfer primitives, executing Hive queries on Tez instead of MapReduce improves query performance. In Cloudera Data Platform (CDP), Tez is usually used only by Hive, and HiveServer automatically starts and manages the Tez application master (AM) when HiveServer starts. A SQL query that you submit to Hive is executed as follows:
Hive compiles the query.
Tez executes the query.
Resources are allocated to applications throughout the cluster.
Hive updates the data in the data source and returns the query results.
Hive on Tez runs tasks on temporary containers and uses standard YARN shuffle services.
Data storage and access control
A major architectural change in the Hive 3 design gives Hive more control over metadata, memory resources, and the file system or object store. The following architectural changes from Hive 2 to Hive 3 provide improved security:
Tightly controlled file systems and computer memory resources replace flexible boundaries: clear boundaries improve predictability. Better file system control can improve security.
Optimized workloads in shared files and YARN containers.
By default, the CDP data center stores Hive data on HDFS, and the CDP public cloud stores Hive data on S3. In the cloud, Hive uses HDFS only to store temporary files. Hive 3 is optimized for object storage, such as S3, in the following ways:
Hive uses ACID to determine which files to read, rather than relying on the storage system.
In Hive 3, file movement is less than in Hive 2.
Hive actively caches metadata and data to reduce file system operations.
The main authorization model of Hive is Ranger. Hive enforces the access control specified in Ranger. Compared with other security schemes, this model provides stronger security and more flexibility in managing policies.
This model allows only Hive to access the data warehouse. If you do not enable the Ranger security service or another security mechanism, Hive in CDP Data Center defaults to storage-based authorization (SBA) based on user impersonation.
HDFS permission change
In CDP Data Center, SBA relies heavily on HDFS access control lists (ACLs). ACLs are an extension of the permission system in HDFS. CDP Data Center turns on ACLs in HDFS by default, which gives you the following benefits:
Increased flexibility when granting specific permissions to multiple groups and users
Easily apply permissions to a directory tree rather than a single file
Transaction processing
You can deploy new Hive application types with the following transaction features:
Mature versions of ACID transactions:
ACID tables are the default table type (see the sketch after this list).
Enabling ACID by default causes no performance or operational overhead.
Simplified application development, operations with strong transaction guarantees, and simple semantics of SQL commands
You do not need to bucket ACID tables.
Materialized view rewriting
Automatic query cache
Advanced optimization
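A minimal sketch of the default ACID behavior, using hypothetical table and column names:
-- In Hive 3, a plain managed ORC table is transactional (ACID) by default.
CREATE TABLE customers (id int, email string) STORED AS orc;
-- Row-level updates and deletes run with full transactional guarantees.
UPDATE customers SET email = 'new@example.com' WHERE id = 42;
DELETE FROM customers WHERE id = 7;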
Hive client changes
CDP Data Center supports the thin client Beeline on the command line. You can run Hive administrative commands from the command line. Beeline uses a JDBC connection to HiveServer to execute commands. Parsing, compiling, and executing operations occur in HiveServer. Beeline supports many of the command-line options that the Hive CLI supported. Beeline does not support hive -e set key=value commands to configure the Hive Metastore.
You can enter supported Hive CLI commands by invoking Beeline with the hive keyword, command option, and command; for example, hive -e set. Using Beeline instead of the no-longer-supported fat-client Hive CLI has several advantages, including lower overhead. Beeline does not use the entire Hive code base. The small number of daemons required to execute queries simplifies monitoring and debugging.
HiveServer enforces whitelist and blacklist settings that you can change using SET commands. Using the blacklist, you can restrict memory configuration changes to prevent HiveServer instability. You can configure multiple HiveServer instances with different whitelists and blacklists to establish different levels of stability.
You can use the grunt command line with Apache Pig.
Apache Hive Metastore sharing
HiveServer, Impala, and other components can share a remote Hive metastore. In the CDP public cloud, HMS uses a pre-installed MySQL database. You rarely, if ever, need to interact with or configure HMS in the cloud.
Spark integration
In some cases, Spark and Hive tables interoperate through the Hive Warehouse Connector.
You can use the Hive Warehouse Connector to access ACID tables and external tables from Spark. You do not need the Hive Warehouse Connector to read Hive external tables from Spark or to write Hive external tables from Spark.
Query execution of batch and interactive workloads
You can connect to Hive using a JDBC command-line tool (such as Beeline) or using JDBC/ODBC drivers with a BI tool (such as Tableau). Clients communicate with an instance of the same HiveServer version. You can configure settings files for each instance to perform either batch or interactive processing.
Apache Hive 3 performance tuning
Low-latency analytical processing
The CDP public cloud supports low-latency analytical processing (LLAP) of Hive queries. Using LLAP in the CDP data warehouse service, you can tune the data warehouse infrastructure, components, and client connection parameters to improve performance for business intelligence and other applications.
Enterprises increasingly want to run SQL workloads that return results faster than batch processing can provide. They usually want data-analysis applications that support interactive queries. Low-latency analytical processing (LLAP) can improve the performance of interactive queries. Hive interactive queries running on the CDP public cloud meet low-latency, variably gauged benchmarks to which Hive LLAP responds in 15 seconds or less. LLAP enables application development and IT infrastructure to run queries that return real-time or near-real-time results.
LLAP is not supported in the CDP Data Center edition.
Best practices for high-performance Hive
Before tuning Apache Hive, you should follow best practices. These guidelines include how to configure clusters, store data, and write queries.
In the CDP public cloud, adjust auto-scaling to scale up when resources are needed to process queries.
Accept the default setting to use Tez as the execution engine. In CDP, the MapReduce execution engine is replaced by Tez.
Accept the default setting that disables user impersonation. If it is enabled, disable it by setting hive.server2.enable.doAs in hive-site.xml through the Cloudera Manager safety valve feature.
LLAP caches data for multiple queries, and this feature does not support user impersonation.
Use Ranger security services to protect your cluster and related services.
Use the ORC file format to store data.
Check explain plans to ensure that queries are fully vectorized (see the sketch after this list).
Use the SmartSense tool to detect common system misconfigurations.
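To illustrate the vectorization check above, Hive's EXPLAIN VECTORIZATION clause reports whether each operator is vectorized; the queried table name is hypothetical:
-- Verify that a query is fully vectorized (hypothetical table name).
SET hive.vectorized.execution.enabled=true;
EXPLAIN VECTORIZATION
SELECT state, COUNT(*) FROM sales_fact GROUP BY state;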
Maximize storage resources using ORC
You can save storage space in a number of ways, but using the Optimized Row Columnar (ORC) file format to store Apache Hive data is most effective. ORC is the default storage format for Hive data.
The ORC file format for Hive data storage is recommended for the following reasons:
Efficient compression: data is stored in columns and compressed, which leads to smaller disk reads. The columnar format is also ideal for vectorized optimization in Tez.
Fast reads: ORC has built-in indexes, min/max values, and other aggregates that allow entire stripes to be skipped during reads. In addition, predicate pushdown pushes filters into the read so the fewest possible rows are read, and Bloom filters further reduce the number of rows returned.
Proven in large-scale deployments: Facebook uses the ORC file format in 300+ PB deployments.
ORC provides the best overall Hive performance. In addition to specifying the storage format, you can specify a compression algorithm for the table, as in the following example:
CREATE TABLE addresses (name string, street string, city string, state string, zip int) STORED AS orc TBLPROPERTIES ("orc.compress" = "ZLIB");
You usually do not need to set a compression algorithm, because your Hive settings include a default algorithm. Using advanced ORC properties, you can create Bloom filters for columns that are frequently used in point lookups.
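For example, a sketch that combines the compression property shown earlier with a Bloom filter on a point-lookup column; the table and column names are hypothetical:
-- A Bloom filter on customer_id lets readers skip stripes during point lookups.
CREATE TABLE orders (order_id bigint, customer_id bigint, amount decimal(10,2))
STORED AS orc
TBLPROPERTIES ("orc.compress" = "ZLIB",
               "orc.bloom.filter.columns" = "customer_id",
               "orc.bloom.filter.fpp" = "0.05");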
Hive supports Parquet and other formats for insert-only ACID tables and external tables. You can also write your own SerDe (serializer/deserializer) to support custom file formats.
Advanced ORC Properties
In general, you do not need to modify ORC properties, but occasionally Cloudera Support recommends such changes. You can change the properties using the safety valve feature in Cloudera Manager.
Using partitions to improve performance
You can use partitions to significantly improve performance. You can design Hive table and materialized-view partitions to map to physical directories on the file system or object store. For example, a table partitioned by date-time can organize the data loaded into Hive each day.
Large deployments can have thousands of partitions. Partition pruning occurs indirectly when Hive discovers the partition key during query processing; for example, after joining with a dimension table, the partition key might come from the dimension table. A query filters columns by partition, limiting the scan to one or a few matching partitions. Partition pruning occurs directly when a partition key appears in the WHERE clause. Partitioned columns are virtual and are not written into the main table, because they are the same for the entire partition. You define the partitions in a SQL query, as in the following example:
CREATE TABLE sale (id int, amount decimal) PARTITIONED BY (xdate string, state string);
To insert data into this table, specify the partition key for quick loading:
INSERT INTO sale PARTITION (xdate = '2016-03-08', state = 'CA') SELECT * FROM staging_table WHERE xdate = '2016-03-08' AND state = 'CA';
You do not need to specify dynamic partitioning columns. If dynamic partitioning is enabled, Hive generates partition specifications.
hive-site.xml settings for loading 1 to 9 partitions:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
To bulk-load data into partitioned ORC tables, use the following property, which optimizes the performance of loading data into 10 or more partitions.
hive-site.xml setting for loading 10 or more partitions:
SET hive.optimize.sort.dynamic.partition=true;
An example of dynamically loading data into partitions:
INSERT INTO sale PARTITION (xdate, state) SELECT * FROM staging_table;
When partitioning and querying partitioned tables, follow these best practices:
Do not partition on a unique ID.
Keep the average partition size at 1 GB or larger.
Design queries to process no more than 1000 partitions.
Handling bucketed tables
If you migrate data from an earlier version of Apache Hive to Hive 3, you might need to handle bucketed tables that affect performance.
You can divide tables or partitions into buckets, which can be stored in the following ways:
As files in the table directory.
As partition directories, if the table is partitioned.
You do not need to use buckets in new Hive 3 tables.
A common challenge related to bucketing is maintaining query performance as workloads or data grow or shrink. For example, you might have an environment that runs smoothly with 16 buckets supporting 1,000 users, but a spike to 100,000 users over a day or two can cause problems if you do not promptly tune the buckets and partitions. After you bucket a table, you must reload the entire table containing the bucketed data to reduce, add, or remove buckets, which complicates bucket tuning.
With Tez, you only need to deal with the buckets of the largest table. If workload requirements change rapidly, the buckets of smaller tables change dynamically to complete table JOINs.
You perform the following tasks related to bucketing:
Set hive-site.xml to enable bucket pruning:
SET hive.tez.bucket.pruning=true;
Bulk load tables with both partitions and buckets:
When loading data into a table that is both partitioned and bucketed, set the following property to optimize the process:
SET hive.optimize.sort.dynamic.partition=true;
If you have 20 buckets on user_id data, the following query returns only the data associated with user_id = 1:
SELECT * FROM tab WHERE user_id = 1
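For context, a hedged sketch of DDL that would produce such a table; apart from user_id, the names are hypothetical:
-- Hypothetical DDL for the query above: 20 buckets on user_id.
CREATE TABLE tab (user_id int, page string)
CLUSTERED BY (user_id) INTO 20 BUCKETS
STORED AS orc;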
To make the best use of the dynamic features of table buckets on Tez, take the following steps:
Use a single key for the buckets of the largest table.
Typically, you need to bucket the main table by the largest dimension table. For example, the sales table might be bucketed by customer and not by merchandise item or store. However, in this scenario, the sales table is sorted by item and store (see the sketch after this list).
In general, do not bucket and sort on the same column.
If a table has more bucket files than rows, you should reconsider how the table is bucketed.
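A hedged sketch of the sales-table scenario described above, with hypothetical names: the table is bucketed by the customer key but sorted by different columns (item and store):
-- Bucket and sort on different columns, per the guideline above.
CREATE TABLE sales (item_id int, store_id int, customer_id bigint, amount decimal(10,2))
CLUSTERED BY (customer_id) SORTED BY (item_id, store_id) INTO 32 BUCKETS
STORED AS orc;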
This concludes this article on "What are the main functions of Apache Hive 3". I hope the content above is helpful to you and helps you learn more. If you think the article is good, please share it for more people to see.