
What are the Hive tuning techniques?


This article explains in detail what the Hive tuning techniques are. The editor finds them very practical and shares them here for your reference; I hope you gain something after reading this article.

Apache Hive is a data warehouse software project built on Apache Hadoop to provide data query and analysis. Hive is Hadoop's SQL interface over HDFS; it provides a SQL-like interface to query data stored in the various databases and file systems that integrate with Hadoop. For anyone doing data development work, whether in day-to-day tasks or in interviews, Hive plays an important role, and Hive performance tuning in particular can not only improve efficiency on the job but also help you stand out in interviews.

The techniques are as follows:

1. Multiple INSERTs from a single table scan

By default, Hive performs a separate table scan for each statement. Therefore, if you want to perform multiple operations on the same Hive table, it is recommended that you use a single scan and perform all of those operations from that scan.

For example, suppose the data from one table is queried several times and loaded into other tables. In the following example, table my_table is a partitioned table with the partition field dt, and we need to query the data for two specific partition dates and load the records into two different tables.

INSERT INTO temp_table_20201115 SELECT * FROM my_table WHERE dt = '2020-11-15';

INSERT INTO temp_table_20201116 SELECT * FROM my_table WHERE dt = '2020-11-16';

In the above queries, Hive will scan the table twice. To avoid this, we can use the following multi-insert form:

FROM my_table

INSERT INTO temp_table_20201115 SELECT * WHERE dt = '2020-11-15'

INSERT INTO temp_table_20201116 SELECT * WHERE dt = '2020-11-16';

This ensures that only one scan is performed on the my_table table, which greatly reduces the execution time and resources required.

2. Partition table

For a larger table, designing it as a partitioned table can improve query performance. For a query against a specific partition, only the file data under the corresponding partition path is loaded. Therefore, when a user executes a SELECT query with a specific partition column value, the query runs only against that particular partition, which gives better performance because a smaller amount of data is scanned. It is worth noting that the choice of partition field is an important factor in query performance; try to avoid deeply nested partitions, which result in too many subdirectories.

Now the question is, which columns should be used for partitioning? A basic rule is to select a low cardinality attribute as the "partition key", such as "region" or "date".

Some common partition fields can be:

Date or time

Such as year, month, day, or hour, which can be used when there are time or date fields in the table.

Geographical location

Such as countries, provinces, cities, etc.

Business logic

Such as department, sales area, customer, etc.

CREATE TABLE table_name (
col1 data_type,
col2 data_type)
PARTITIONED BY (partition1 data_type, partition2 data_type, ...);
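For instance, here is a minimal sketch (the sales table is hypothetical) of a table partitioned by a dt date string, together with a query that touches only one partition:

CREATE TABLE sales (
order_id BIGINT,
amount DOUBLE)
PARTITIONED BY (dt STRING);

-- Only the files under the dt='2020-11-15' partition path are scanned.
SELECT SUM(amount) FROM sales WHERE dt = '2020-11-15';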

3. Bucket table

In general, we use buckets when it is difficult to partition on a column, for example a frequently filtered field that would produce a huge number of partitions if it were used as the partition field. In Hive, bucket fields are hashed into buckets, which provides an additional data structure that improves query efficiency.

Similar to partitioned tables, bucketed tables organize data by dividing the files on HDFS into multiple files. Bucketing can speed up data sampling and improve join performance (when the join field is the bucketing field). Because bucketing guarantees that all the data for a given key lands in a specific bucket (file), choosing the bucketing field well can greatly improve join performance. In general, good candidates for bucketing fields are fields that are frequently used in filter or join operations.

We can use set hive.enforce.bucketing = true to enable bucketing.

When using a bucketed table, it is best to set the bucket map join flag to true. The specific configuration parameter is:

SET hive.optimize.bucketmapjoin = true

CREATE TABLE table_name
PARTITIONED BY (partition1 data_type, partition2 data_type, ...)
CLUSTERED BY (column_name1, column_name2, ...)
[SORTED BY (column_name [ASC | DESC], ...)]
INTO num_buckets BUCKETS;
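As a minimal sketch (the users and orders tables are hypothetical, and both are assumed to be bucketed on user_id), a bucketed table can be declared and then joined on its bucketing column:

SET hive.enforce.bucketing = true;
SET hive.optimize.bucketmapjoin = true;

CREATE TABLE users (
user_id BIGINT,
name STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- Joining on the bucketing column lets Hive use a bucket map join.
SELECT o.order_id, u.name
FROM orders o JOIN users u ON o.user_id = u.user_id;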

4. Enable compression for intermediate data

Complex Hive queries are usually transformed into a series of multi-stage MapReduce jobs, which the Hive engine chains together to complete the entire query. The "intermediate output" here refers to the output of one MapReduce job that serves as input to the next MapReduce job.

Compression can significantly reduce the amount of intermediate data, thereby reducing the amount of data transferred between the Map and Reduce stages.

We can use the following properties to enable compression on the intermediate output.

set hive.exec.compress.intermediate=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.intermediate.compression.type=BLOCK;

To compress the data that is finally output to HDFS, you can use the following properties:

set hive.exec.compress.output=true;
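The codec for the final output can then be selected through the underlying Hadoop MapReduce property, picking one of the codecs listed below (a sketch; Snappy is used here only for illustration):

-- Hadoop MapReduce output compression codec.
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;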

Here are some compression codecs that can be used:

org.apache.hadoop.io.compress.DefaultCodec
org.apache.hadoop.io.compress.GzipCodec
org.apache.hadoop.io.compress.BZip2Codec
com.hadoop.compression.lzo.LzopCodec
org.apache.hadoop.io.compress.Lz4Codec
org.apache.hadoop.io.compress.SnappyCodec

5. Map-side JOIN

A map-side join is suitable when one of the tables is very small (small enough to fit in memory): the small table is loaded into memory on the map side. Hive supports automatic conversion to map-side join starting from version 0.7. The configuration is as follows:

SET hive.auto.convert.join=true; -- default true since Hive 0.11.0
SET hive.mapjoin.smalltable.filesize=600000000; -- the default is 25M
SET hive.auto.convert.join.noconditionaltask=true; -- default true, so there is no need to specify a map join hint
SET hive.auto.convert.join.noconditionaltask.size=10000000; -- controls the size of tables loaded into memory

Once map-side join configuration is enabled, Hive automatically checks whether the small table is larger than the size configured by hive.mapjoin.smalltable.filesize. If it is larger, the query runs as a normal join; if it is smaller, it is converted to a map-side join.

The principle of the map-side join is as follows:

First, Task A (a task executed locally by the client) reads the small table a, converts it into a HashTable data structure, writes it to a local file, and then loads it into the distributed cache.

Then Task B starts map tasks to read the large table b; in the Map phase, each record is joined against the HashTable of table a in the distributed cache, and the results are output.

Note: a map-side join has no reduce tasks, so the map tasks output the results directly; that is, there will be as many result files as there are map tasks.
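If automatic conversion is disabled, a map-side join can also be requested explicitly with the MAPJOIN hint. A minimal sketch, with b as the small table:

SELECT /*+ MAPJOIN(b) */ a.col1, b.col2
FROM a JOIN b ON a.col1 = b.col1;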

6. Vectorization

Vectorized query execution in Hive greatly reduces CPU usage for typical query operations such as scanning, filters, aggregations, and joins.

A standard query execution engine processes one row at a time; a single row of data passes through all the operators in the query before the next row is processed, which makes very inefficient use of the CPU. In vectorized query execution, rows are batched together (1024 rows by default) and represented as a set of column vectors.

To use vectorized query execution, you must store the data in ORC format (on CDH 5) and set the following variable:

SET hive.vectorized.execution.enabled=true

Hive query vectorization is enabled by default in CDH 6. After query vectorization is enabled, you can also set other properties to tune how queries are vectorized; for more information, refer to the Cloudera documentation.
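A minimal sketch (the events table is hypothetical) combining ORC storage with vectorized execution:

CREATE TABLE events (
event_id BIGINT,
event_type STRING)
STORED AS ORC;

SET hive.vectorized.execution.enabled=true;
-- Rows of events are processed in batches of column vectors.
SELECT event_type, COUNT(*) FROM events GROUP BY event_type;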

7. Predicate pushdown

By default, the generated execution plan applies a filter where it appears in the query, but in some cases a filter expression can be pushed down closer to the operator that first sees that particular data.

For example, the following query:

SELECT
a.*,
b.*
FROM
a JOIN b ON (a.col1 = b.col1)
WHERE a.col1 > 15 AND b.col2 > 16;

Without predicate pushdown, the filter condition (a.col1 > 15 and b.col2 > 16) is applied after the JOIN has been processed. In that case, the JOIN happens first and may produce many more rows, and the filtering operation runs afterwards.

With predicate pushdown, the two predicates (a.col1 > 15 and b.col2 > 16) are evaluated before the JOIN, so most of the rows from a and b can be filtered out before the join is processed; it is therefore recommended to enable predicate pushdown.

You can enable predicate push-down by setting hive.optimize.ppd to true.

SET hive.optimize.ppd=true;

8. Input format selection

Hive supports TEXTFILE, SEQUENCEFILE, AVRO, RCFILE, ORC, and PARQUET file formats, which can be specified in two ways:

CREATE TABLE ... STORED AS: specify the file format when creating the table; the default is TEXTFILE.

ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT: modify the file format of an existing table.

If the file storage format is not specified, the format set by the parameter hive.default.fileformat is used by default.

If the data is stored in small files smaller than the block size, you can use the SEQUENCEFILE format. If you want to store data in an optimized way that reduces storage space and improves performance, you can use the ORC format, which is also useful when columns contain deeply nested data. Therefore, you need to choose the input file format based on the data you have.
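A minimal sketch of both approaches (the logs and old_logs tables are hypothetical):

-- Specify the format when creating the table.
CREATE TABLE logs (
msg STRING)
STORED AS ORC;

-- Change the format of an existing table; this only updates the table metadata,
-- existing data files are not converted.
ALTER TABLE old_logs SET FILEFORMAT ORC;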

9. Enable strict mode

If you query a partitioned Hive table without providing a partition predicate (a condition on the partition column), the query runs against all partitions of the table, which can be very time-consuming and resource-intensive. Therefore, we set the following property to strict so that the compiler throws an error if no partition predicate is provided for a partitioned table.

SET hive.partition.pruning=strict;
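For example (reusing the my_table partitioned table from earlier), strict mode would reject the first query below but accept the second:

-- Rejected under strict mode: no predicate on the partition column dt.
SELECT * FROM my_table;

-- Accepted: the partition predicate limits the scan to a single partition.
SELECT * FROM my_table WHERE dt = '2020-11-15';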

10. Cost-based optimization

Hive optimizes each query's logical and physical execution plan before submitting it for final execution. Cost-based optimization goes further and optimizes the plan according to the query cost, which can lead to different decisions: the order of JOINs, the type of JOIN to execute, the degree of parallelism, and so on.

You can enable cost-based optimization by setting the following parameters.

set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

Statistics can be used to optimize queries and improve performance. The cost-based optimizer (CBO) also uses statistics to compare query plans and choose the best one. Consulting statistics is efficient because it avoids having to run the queries themselves.

Collect column statistics for the table:

ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS

View column statistics for the my_id column in my_table in the my_db database:

DESCRIBE FORMATTED my_db.my_table my_id

This is the end of the article on "What are the Hive tuning techniques?". I hope the above content is helpful and that you can learn more from it. If you think the article is good, please share it for more people to see.
