This article introduces how to optimize configuration parameters in Hive. It is fairly detailed and should be a useful reference; if you are interested, read it through to the end!
1. Create a regular table
CREATE TABLE test_user1 (id int, name string, code string, code_id string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
2. View the information on this table
DESCRIBE FORMATTED test_user1
Using the table's description, we will walk through some of the optimization points related to table creation.
2.1 Number of files in the table
NumFiles indicates the number of files in the table. A large value may mean the table contains too many small files, in which case we can apply some optimizations to solve the small-file problem. HDFS itself provides several solutions:
(1) Hadoop Archive/HAR: package small files into larger archive files.
(2) SEQUENCEFILE format: pack a large number of small files into a single SEQUENCEFILE file.
(3) CombineFileInputFormat: combine small files before map and reduce processing.
(4) HDFS Federation: use multiple NameNode nodes to manage the namespace.
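As an aside, Hive itself can build HAR archives for the partitions of a partitioned table. A minimal sketch, assuming a hypothetical partitioned table test_log with a dt partition:
-- Hypothetical example: archive one partition into a HAR file, then restore it.
SET hive.archive.enabled=true;
ALTER TABLE test_log ARCHIVE PARTITION (dt='2021-08-01');
-- The partition can later be restored to ordinary files.
ALTER TABLE test_log UNARCHIVE PARTITION (dt='2021-08-01');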
In addition, we can merge small files by setting Hive parameters.
(1) Merging at the input stage
This requires changing Hive's input file format via the parameter hive.input.format. Its default value is org.apache.hadoop.hive.ql.io.HiveInputFormat; change it to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat. Compared with simply adjusting the number of mappers, two more parameters come into play: mapred.min.split.size.per.node and mapred.min.split.size.per.rack, which set the minimum split size on a single node and on a single rack. If a split is found to be smaller than these two values (the default is 100 MB), a merge occurs. For the detailed logic, see the corresponding class in the Hive source code.
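A minimal sketch of these input-merge settings; the byte values below are only illustrative, and mapred.max.split.size is an additional related parameter not mentioned above:
-- Illustrative settings for merging small files at the input stage.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapred.max.split.size=256000000;           -- maximum size of a single split
SET mapred.min.split.size.per.node=100000000;  -- minimum split size on one node
SET mapred.min.split.size.per.rack=100000000;  -- minimum split size on one rack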
(2) Merging at the output stage
Simply set both hive.merge.mapfiles and hive.merge.mapredfiles to true. The former merges the output of map-only tasks, and the latter merges the output of map-reduce tasks; Hive starts an additional MR job to merge the small output files into larger ones. In addition, hive.merge.size.per.task specifies the target size of the merged files produced by each task, and hive.merge.smallfiles.avgsize specifies an average-size threshold for the output files (the default is 16 MB): if the average output file size falls below this threshold, another job is started to merge the files.
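A minimal sketch of the output-merge settings; the sizes shown are illustrative, not tuned recommendations:
-- Illustrative settings for merging small files at the output stage.
SET hive.merge.mapfiles=true;                -- merge output of map-only jobs
SET hive.merge.mapredfiles=true;             -- merge output of map-reduce jobs
SET hive.merge.size.per.task=256000000;      -- target size (bytes) of merged files
SET hive.merge.smallfiles.avgsize=16000000;  -- merge when average output file size is below this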
2.2 Storage format of the table
Through InputFormat and OutputFormat, you can see that the storage format of the table is TEXT. Hive supports TEXTFILE, SEQUENCEFILE, AVRO, RCFILE, ORC, and PARQUET file formats. You can specify the file format of the table in two ways:
(1) CREATE TABLE... STORED AS: specify the file format when creating the table. The default is TEXTFILE.
(2) ALTER TABLE... [PARTITION partition_spec] SET FILEFORMAT: modify the file format of an existing table or partition.
If you want to change the default file format used when creating tables, configure set hive.default.fileformat=, which applies to all tables. You can also configure set hive.default.fileformat.managed=, which applies only to managed (internal) tables.
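A few sketches of these three approaches; the table name test_user_orc is hypothetical:
-- Specify the file format at creation time.
CREATE TABLE test_user_orc (id int, name string, code string, code_id string)
STORED AS ORC;
-- Change the format of an existing table; this only changes metadata, existing data files are not converted.
ALTER TABLE test_user_orc SET FILEFORMAT PARQUET;
-- Change the session-level default format for new tables.
SET hive.default.fileformat=ORC;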
Expansion: comparison of the different storage formats
TEXT, SEQUENCE, and AVRO files are row-oriented storage formats. They are not the best choice, because even when only one column is queried, a table stored in these formats has to read complete rows. Column-oriented storage formats (RCFILE, ORC, PARQUET) solve this problem well. Each file format is described below:
(1) TEXTFILE
This is the default file format when a table is created; data is stored as plain text. Text files can be split and processed in parallel, and they can also be compressed, for example with GZip, LZO, or Snappy. However, most compressed files do not support splitting and parallel processing, which results in a job with only one mapper processing the data; when using compressed text files, keep each file from growing too large, generally close to the size of a couple of HDFS blocks.
(2) SEQUENCEFILE
A key/value binary storage format. Its advantage is better compression than the text format; sequence files can be compressed at the record level or the block level, and block-level compression gives a good compression ratio. To use block compression, the following configuration is needed: set hive.exec.compress.output=true; set io.seqfile.compression.type=BLOCK
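For illustration, a sketch of writing a block-compressed SEQUENCEFILE table from the test_user1 table above; test_user_seq is a hypothetical name:
-- Enable block-level compression for the output, then write a SEQUENCEFILE table.
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
CREATE TABLE test_user_seq (id int, name string, code string, code_id string)
STORED AS SEQUENCEFILE;
INSERT OVERWRITE TABLE test_user_seq SELECT * FROM test_user1;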
(3) AVRO
Besides being a binary file format, Avro is also a serialization and deserialization framework. Avro data carries a specific data schema.
(4) RCFILE
The full name is Record Columnar File. The table is first divided into several row groups, and within each row group the data is stored column by column, with each column stored separately; in other words, the data is split horizontally first and then vertically.
(5) ORC
The full name is Optimized Row Columnar, supported since Hive 0.11. The ORC format is an optimized version of the RCFILE format and provides a larger default block size (256 MB).
(6) PARQUET
Another columnar file format, very similar to ORC, but supported by a broader ecosystem than ORC; for example, older versions of Impala do not support the ORC format.
Create two tables with the same data and the same fields, one using the common TEXT row storage and the other ORC columnar storage, and compare their execution speed.
(Query-time screenshots: TEXT storage vs. ORC storage)
Summary: as the screenshots above show, columnar storage is faster when querying specific columns, so it is recommended to use a columnar storage format when creating tables.
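For reference, a sketch of how such a comparison could be set up; the table names test_user_text and test_user_orc_cmp are hypothetical:
-- Build the two comparison tables with the same data.
CREATE TABLE test_user_text (id int, name string, code string, code_id string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
CREATE TABLE test_user_orc_cmp (id int, name string, code string, code_id string)
STORED AS ORC;
INSERT OVERWRITE TABLE test_user_orc_cmp SELECT * FROM test_user_text;
-- Query a single column from each table and compare execution time.
SELECT code FROM test_user_text WHERE id = 1;
SELECT code FROM test_user_orc_cmp WHERE id = 1;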
2.3 Compression of tables
Compressing Hive tables is a common optimization. Some storage formats have their own compression options; for example, SEQUENCEFILE supports three compression options: NONE, RECORD, and BLOCK. RECORD-level compression gives a low compression ratio, so BLOCK compression is generally recommended.
ORC supports three compression options: NONE, ZLIB, and SNAPPY. Taking TEXT storage and ORC storage as examples, let's look at the compression of the table.
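The ORC compression codec can be chosen per table through TBLPROPERTIES; a sketch with hypothetical table names:
-- ORC table with SNAPPY compression.
CREATE TABLE test_user_orc_snappy (id int, name string, code string, code_id string)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");
-- ORC table with compression disabled.
CREATE TABLE test_user_orc_none (id int, name string, code string, code_id string)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="NONE");
-- Omitting orc.compress keeps the default ZLIB compression.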
Create four tables with the same data and the same fields: one with TEXT storage, and the other three with ORC storage using the default compression, SNAPPY compression, and NONE compression respectively; then look at their storage on HDFS:
TEXT storage
ORC storage, default (ZLIB) compression
ORC storage, SNAPPY compression
ORC storage, NONE compression
Summary: you can see that ORC stores the data as two blocks. The default (ZLIB) compressed files add up to 134.69 MB, the SNAPPY-compressed files to 196.67 MB, and the uncompressed (NONE) files to 366.58 MB, with default block sizes of 256 MB and 128 MB respectively. ORC's default compression produces smaller files than SNAPPY because ORC's default ZLIB compression uses the DEFLATE algorithm, which has a higher compression ratio and yields smaller files than the Snappy algorithm. As for execution speed, repeated tests show that the three compression options perform similarly, so it is recommended to store data with ORC's default settings.
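For anyone reproducing the comparison, file sizes can be checked from the Hive CLI with the dfs command; the warehouse path below is hypothetical:
-- Show the size of each table's files on HDFS (adjust the path to your warehouse location).
dfs -du -h /user/hive/warehouse/test_user_orc_snappy;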
2.4 Bucketing and partitioning
Num Buckets indicates the number of buckets. We can optimize a Hive table through bucketing and partitioning:
A larger table can be designed as a partitioned table. Without partitioning, queries scan the full data set; once the table is partitioned, a query only scans the specified partitions, which improves query efficiency. Avoid too many partition levels; two levels are generally enough (see the sketch after this list). Common partition fields:
(1) date or time, such as year, month, day, or hour, which can be used when there are time or date fields in the table.
(2) Geographic location, such as country, province, city, etc.
(3) Business logic, such as department, sales area, customer, etc.
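A minimal sketch of a partitioned table, assuming a string date partition dt; the table name test_user_part and the date value are hypothetical:
-- Partitioned table with one date partition level.
CREATE TABLE test_user_part (id int, name string, code string, code_id string)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- Static partition insert; queries filtering on dt scan only the matching partitions.
INSERT OVERWRITE TABLE test_user_part PARTITION (dt='2021-08-01')
SELECT id, name, code, code_id FROM test_user1;
SELECT * FROM test_user_part WHERE dt = '2021-08-01';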
Similar to a partitioned table, a bucketed table organizes a large table into multiple files on HDFS. Bucketing is a finer-grained division than partitioning: the data is distributed into buckets according to the hash value of the bucketing field. Bucketing can speed up data sampling and improve join performance when the join key is the bucketing field, because it guarantees that all rows with a given key land in a specific bucket (file); choosing the bucketing field well can therefore greatly improve join performance. In general, fields that are frequently used in filters or joins are good candidates for bucketing fields.
Create a bucket table
CREATE TABLE test_user_bucket (id int, name string, code string, code_id string) CLUSTERED BY (id) INTO 3 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
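A sketch of populating the bucketed table from test_user1; note that on Hive versions before 2.x, hive.enforce.bucketing must be set so that the insert actually produces three bucket files:
-- Enforce bucketing on write (always on in Hive 2.x and later).
SET hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE test_user_bucket
SELECT id, name, code, code_id FROM test_user1;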
View description information
DESCRIBE FORMATTED test_user_bucket
View the table's files on HDFS
With the same data, compare the query efficiency of the ordinary table and the bucketed table.
Ordinary table
Bucketed table
The ordinary table requires a full table scan. The bucketed table, whose data has been distributed into buckets by the hash value of the bucketing field, only needs to scan the specific bucket matching the join field or WHERE filter field, which improves efficiency.
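Two hedged query sketches that benefit from bucketing; test_user_bucket2 is a hypothetical second table also bucketed on id:
-- Sample one of the three buckets instead of scanning the whole table.
SELECT * FROM test_user_bucket TABLESAMPLE (BUCKET 1 OUT OF 3 ON id) t;
-- A join on the bucketing field can use a bucket map join when both sides are bucketed on id.
SET hive.optimize.bucketmapjoin=true;
SELECT a.id, a.name, b.code
FROM test_user_bucket a
JOIN test_user_bucket2 b ON a.id = b.id;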
That covers all the content of "how to optimize configuration parameters in Hive". Thank you for reading, and I hope it is helpful to you!