Example Analysis of Merging Small Files in Hive

2025-04-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article walks through an example of merging small files in Hive, covering both the input side of the map stage and the output side of a job.

When Hive input consists of many small files, each small file starts its own map task. If the files are too small, a map task spends longer starting up and initializing than it spends on actual processing, which wastes resources and can even lead to OOM (out-of-memory) errors.

For this reason, when a job has only a small amount of input data but launches a large number of tasks, we should look at merging the input before the map stage.

Likewise, when we write data to a table, we also need to pay attention to the size of the output files.

1. Map input merge

Corresponding parameters:

-- Maximum input size per map task, in bytes (25,600,000 bytes, about 25.6 MB)
set mapred.max.split.size=25600000;

-- Minimum split size on a single node, in bytes
set mapred.min.split.size.per.node=1000000;

-- Minimum split size within a single rack (the nodes under one switch), in bytes
set mapred.min.split.size.per.rack=1000000;

-- Combine small files into larger splits before the map stage runs
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

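When org.apache.hadoop.hive.ql.io.CombineHiveInputFormat is set as the input format, small files on the same data node are combined into larger splits. mapred.max.split.size caps the size of each combined split, which in turn determines how many splits (and therefore map tasks) are produced.

mapred.min.split.size.per.node determines whether data left over on one node is merged with data from other nodes: leftovers smaller than this value are combined across nodes.

mapred.min.split.size.per.rack determines whether data left over within one rack (under one switch) is merged with data from other racks: leftovers smaller than this value are combined across racks.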
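The node-level behavior described above can be illustrated with a small Python model. This is only a simplified sketch, not Hive's or Hadoop's actual implementation: it assumes every file is a single block smaller than the maximum split size, it ignores rack-level merging, and the function name `combine_on_node` and the file sizes are made up for illustration.

```python
def combine_on_node(file_sizes, max_split_size, min_split_size_per_node):
    """Simplified model of combining small files stored on one node.

    Files are accumulated into a split until the running size reaches
    max_split_size. Whatever remains at the end becomes its own split only
    if it is at least min_split_size_per_node; otherwise it is returned as
    leftover, to be merged with data from other nodes.
    """
    splits = []
    current = []
    current_size = 0
    for size in file_sizes:
        current.append(size)
        current_size += size
        if current_size >= max_split_size:
            splits.append(current)
            current = []
            current_size = 0
    if current and current_size >= min_split_size_per_node:
        splits.append(current)
        current = []
    return splits, current


# Hypothetical input: 100 files of 1,000,000 bytes each on one node,
# using the values from the settings shown earlier.
files = [1_000_000] * 100
splits, leftover = combine_on_node(files, 25_600_000, 1_000_000)
print(len(files), "files ->", len(splits), "splits; leftover files:", len(leftover))
```

Under these assumptions, 100 small files that would otherwise start 100 map tasks are packed into 4 splits, which is the effect the input merge settings aim for.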

2. Output merge

-- Merge small files at the end of map-only jobs
set hive.merge.mapfiles=true;

-- Merge small files at the end of map-reduce jobs
set hive.merge.mapredfiles=true;

-- Target size of the merged files, in bytes (Hive does not evaluate arithmetic in set values, so the number is written out)
set hive.merge.size.per.task=256000000;

-- When the average output file size is below this value (in bytes), start a separate map-reduce job to merge the files
set hive.merge.smallfiles.avgsize=1600000;
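The decision rule behind the output merge settings can be sketched in Python as follows. This is a rough model under stated assumptions rather than Hive's actual code: the function names `needs_merge` and `estimate_merged_file_count` and the output file sizes are hypothetical, and the merged file count is only an estimate, since the real number depends on how the merge job computes its splits.

```python
def needs_merge(output_file_sizes, smallfiles_avgsize):
    """True when the average output file size is below the threshold
    (the role played by hive.merge.smallfiles.avgsize)."""
    if not output_file_sizes:
        return False
    average = sum(output_file_sizes) / len(output_file_sizes)
    return average < smallfiles_avgsize


def estimate_merged_file_count(output_file_sizes, size_per_task):
    """Rough estimate: total bytes divided by the target merged size
    (the role played by hive.merge.size.per.task), rounded up."""
    total = sum(output_file_sizes)
    return max(1, -(-total // size_per_task))


# Hypothetical job output: 200 files of 500,000 bytes each.
outputs = [500_000] * 200
print(needs_merge(outputs, 1_600_000))
print(estimate_merged_file_count(outputs, 256_000_000))
```

With these hypothetical numbers the average file size (500,000 bytes) is below the 1,600,000-byte threshold, so a merge job would be started, and the 100,000,000 bytes of output would fit in roughly one merged file.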
