
How to solve the problem of too many small files in Hive


This article explains in detail how to solve the problem of too many small files in Hive. The content is shared here for reference, and I hope you will have a good grasp of the relevant knowledge after reading it.

Causes of small files

Small files in Hive are produced when data is imported into a Hive table, so let's first look at the ways data gets imported into Hive.

Insert data directly into the table

insert into table A values (1), (2);

This method produces a new file on every insert, so inserting small amounts of data many times creates many small files. However, this method is rarely used in production, so in practice it is essentially never the cause.

Load data through load

load data local inpath '/export/score.csv' overwrite into table A;   -- import a file
load data local inpath '/export/score' overwrite into table A;       -- import a folder

load can import either a file or a folder. Importing a file leaves the Hive table with a single file; importing a folder leaves it with as many files as the folder contains.
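As a quick sanity check (a hedged sketch: the folder /export/score and its file count are hypothetical), you can confirm this behavior from the hive CLI:

hive (default)> load data local inpath '/export/score' overwrite into table A;   -- a local folder holding, say, 3 files
hive (default)> dfs -ls /user/hive/warehouse/A;   -- expect the same 3 files under table A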

Load data by query

insert overwrite table A select s_id, c_name, s_score from B;

This is the method most commonly used in production, and it is also the one that most easily produces small files.

When data is inserted this way, Hive starts an MR job, and the job writes one output file per reduce task.

So, number of files = number of ReduceTask * number of partitions

Many simple queries have no reduce phase at all, only a map phase; in that case

Number of files = number of MapTask * number of partitions

Every insert therefore creates at least one file in Hive, because an insert runs at least one MapTask.

For example, some businesses need to synchronize data to hive every 10 minutes, so there will be a lot of files.
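To make the arithmetic concrete (a hedged illustration with made-up numbers): if each sync is an insert that runs 10 ReduceTasks into a table with 100 dynamic partitions, a single run can leave up to

10 ReduceTasks * 100 partitions = 1,000 files

and at one sync every 10 minutes, that is 144 runs and up to 144,000 new files per day.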

The impact of too many small files

First, consider the underlying storage. HDFS itself is not suited to storing large numbers of small files: too many small files inflate the NameNode metadata, consume excessive memory, and seriously degrade HDFS performance.
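A rough back-of-the-envelope check (the ~150 bytes per object is the commonly cited rule of thumb for NameNode heap, not a figure from this article): 10,000,000 small files occupy about

(10,000,000 file objects + 10,000,000 block objects) * ~150 B ≈ 3 GB

of NameNode heap, and that memory is consumed whether each file holds 1 KB or 128 MB.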

For Hive, at query time each small file is treated as a block and handled by its own Map task. Starting and initializing a Map task takes far longer than its actual logical processing, so a flood of small files wastes a great deal of resources. On top of that, the number of Map tasks that can run concurrently is limited.

How to solve too many small files

1. Use Hive's built-in concatenate command to merge small files

How to use it:

# for a non-partitioned table
alter table A concatenate;
# for a partitioned table
alter table B partition(day=20201224) concatenate;

For example:

# insert data into table A
hive (default)> insert into table A values (1),(2);
hive (default)> insert into table A values (3),(4);
hive (default)> insert into table A values (5),(6);
# after the three statements above, table A holds three small files;
# check the number of files under table A from the hive command line
hive (default)> dfs -ls /user/hive/warehouse/A;
Found 3 items
-rwxr-xr-x   3 root supergroup        378 2020-12-24 14:46 /user/hive/warehouse/A/000000_0
-rwxr-xr-x   3 root supergroup        378 2020-12-24 14:47 /user/hive/warehouse/A/000000_0_copy_1
-rwxr-xr-x   3 root supergroup        378 2020-12-24 14:48 /user/hive/warehouse/A/000000_0_copy_2
# three small files, so merge them with concatenate
hive (default)> alter table A concatenate;
# check the number of files under table A again
hive (default)> dfs -ls /user/hive/warehouse/A;
Found 1 items
-rwxr-xr-x   3 root supergroup        778 2020-12-24 14:59 /user/hive/warehouse/A/000000_0
# the files have been merged into one

Note:

1. The concatenate command only supports RCFILE and ORC file types.

2. You cannot specify the number of merged files when using the concatenate command to merge small files, but you can execute the command multiple times.

3. If the number of files stops changing after running concatenate several times, this is related to the parameter mapreduce.input.fileinputformat.split.minsize=256mb, which sets the minimum size of each file; see the sketch below.
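Following note 3 (a hedged sketch; whether this actually unblocks the merge depends on your Hive/Hadoop versions), you can raise the minimum split size before running concatenate again:

# raise the minimum size a merged file should reach
set mapreduce.input.fileinputformat.split.minsize=256000000;   -- ~256MB
alter table A concatenate;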

2. Adjust parameters to reduce the number of Maps

Set map input parameters related to merging small files:

# merge small files before map execution
# CombineHiveInputFormat is based on Hadoop's CombineFileInputFormat;
# it combines multiple files into a single split as mapper input
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;   -- the default
# maximum input size per Map (this value determines the number of merged files)
set mapred.max.split.size=256000000;   -- 256M
# minimum split size on a node (determines whether files on multiple DataNodes are merged)
set mapred.min.split.size.per.node=100000000;   -- 100M
# minimum split size on a rack (determines whether files on multiple racks are merged)
set mapred.min.split.size.per.rack=100000000;   -- 100M

Set parameters for merging map output and reduce output:

# merge small files at the end of a map-only job; default is true
set hive.merge.mapfiles = true;
# merge small files at the end of a map-reduce job; default is false
set hive.merge.mapredfiles = true;
# target size of the merged files
set hive.merge.size.per.task = 256000000;   -- 256M
# when the average output file size is smaller than this value,
# start a separate MapReduce task to merge the files
set hive.merge.smallfiles.avgsize=16000000;   -- 16M

Enable compression

# compress hive's query result output
set hive.exec.compress.output=true;
# compress MapReduce job output
set mapreduce.output.fileoutputformat.compress=true;

3. Reduce the number of Reduces

The number of reduces determines the number of output files, so you can control the number of files in a Hive table by adjusting the number of reduces. Hive's distribute by clause maps directly onto the partitioning between reducers in MR, so by setting the number of reduces and distributing rows with it, the data can be spread evenly across the reducers.

There are two ways to set the number of reduces: set the count directly, or set the amount of data each reduce handles and let Hive estimate the count from the total data size.

# method 1: set the number of reduces directly
set mapreduce.job.reduces=10;
# method 2: set the size handled by each reduce; the default is 1G, here raised to 5G
set hive.exec.reducers.bytes.per.reducer=5120000000;

# execute the following statement to spread the data evenly across the reducers
insert overwrite table A partition(dt)
select * from B
distribute by rand();

Explanation: with the number of reduces set to 10, rand() generates a random number x for each row and the row is sent to reducer x % 10, so rows land in the reducers at random, which prevents some output files from being much larger or smaller than the others.

4. Use hadoop's archive to archive small files

Hadoop Archive (HAR for short) is a file archiving tool that packs small files into HDFS blocks efficiently. It can pack many small files into a single HAR file, reducing NameNode memory usage while still allowing transparent access to the files.

# control whether archiving is enabled
set hive.archive.enabled=true;
# tell hive whether the parent directory can be set when creating an archive
set hive.archive.har.parentdir.settable=true;
# control the size of the files being archived
set har.partfile.size=1099511627776;

# archive with the following command
ALTER TABLE A ARCHIVE PARTITION(dt='2020-12-24', hr='12');
# restore the archived partition to the original files
ALTER TABLE A UNARCHIVE PARTITION(dt='2020-12-24', hr='12');

Note:

An archived partition can still be queried, but it cannot be written with insert overwrite; you must unarchive it first.
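So rewriting an archived partition looks like this (a minimal sketch; table B and its columns are hypothetical):

# restore the original files first
ALTER TABLE A UNARCHIVE PARTITION(dt='2020-12-24', hr='12');
# only then can the partition be overwritten
insert overwrite table A partition(dt='2020-12-24', hr='12') select s_id, c_name, s_score from B;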

Finally

If the cluster is new and carries no historical baggage, it is recommended to use the ORC file format for Hive tables and enable LZO compression.

That way, even when too many small files accumulate, they can be merged quickly with Hive's native concatenate command.
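A minimal sketch of such a table definition (the table name and columns are made up, and which codecs your ORC build accepts varies; LZO is in the ORC spec, but older builds may only offer ZLIB/SNAPPY):

create table scores (
    s_id    int,
    c_name  string,
    s_score int
)
stored as orc
tblproperties ("orc.compress"="LZO");   -- assumption: LZO is available in your ORC build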

That covers how to solve the problem of too many small files in Hive. I hope the content above helps you learn something new. If you think the article is good, share it so more people can see it.
