

Scenarios that generate small files in HDFS and how to handle them



Impact:

1. The metadata of every file is stored in the NameNode, and each file's metadata takes roughly the same amount of space regardless of the file's size. Too many small files therefore occupy a large share of NameNode memory and restrict the growth of the cluster. (This is the main impact.)
2. When small files are processed, each small file corresponds to one map task, and each map task starts a JVM process; the repeated startup and teardown of these processes causes serious performance overhead. (JVM reuse mitigates this.)

Generation scenarios:

1. Real-time processing: for example, Spark Streaming receives data from an external data source and stores it in HDFS after ETL processing. Each batch job produces a large number of small files. (A sketch of one mitigation appears below.)
2. Hive inserts: each INSERT into a table forms a small file under the table directory. Remedy: create a table with the same structure, create table t_new as select * from t_old; and delete the old table as the situation allows.
3. A simple filter operation in Hive: the rows that match the filter are spread across many blocks, the query runs as a map-only job, and each map task writes its own small output file. Remedy: turn on merging/aggregation on the map side.
4. Normal execution of MapReduce jobs that produce small files. Remedies: write the output to HBase instead of directly to HDFS, or set map-side and reduce-side file merging.
5. The input data itself consists of small files. Remedy: merge the small files before computing. CombineFileInputFormat is an InputFormat that merges multiple files into a single split and takes the location of the data into account. (A driver sketch appears below.)

General solutions:

1. Hadoop Archive (HAR) is an efficient file archiving tool that packs small files into HDFS blocks; it can bundle many small files into a single HAR file, reducing NameNode memory usage while still allowing transparent access to the files. (An example command appears below.)
2. SequenceFile: a sequence file consists of a series of binary key/value pairs. Using a small file's name as the key and its contents as the value, a large number of small files can be merged into one large file. (A sketch appears below.)

Underlying processing schemes:

HDFS-8998: divide the DataNode into small-file areas dedicated to storing small files; a block is filled up before the next block is started.
HDFS-8286: move the NameNode's metadata out of memory into a third-party key-value (K-V) storage system.
HDFS-7240: Apache Hadoop Ozone, a Hadoop subproject created to scale HDFS.
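For generation scenario 1, a common mitigation is to collapse each batch to fewer partitions before writing, so every batch emits one file rather than one file per partition. Below is a minimal Java sketch of that idea; a socket source stands in for the real ETL pipeline, and the app name, source, and output path prefix are hypothetical, not from the article:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class BatchWriter {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("small-file-demo");
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.minutes(1));
            // Hypothetical source; substitute the real ETL pipeline here.
            JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
            lines.foreachRDD((rdd, time) -> {
                if (!rdd.isEmpty()) {
                    // Collapse each batch to one partition so it writes a
                    // single file instead of one file per partition.
                    rdd.coalesce(1).saveAsTextFile("/data/etl-out-" + time.milliseconds());
                }
            });
            ssc.start();
            ssc.awaitTermination();
        }
    }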
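For generation scenario 5, the driver below shows CombineTextInputFormat (a concrete text-oriented subclass of CombineFileInputFormat) packing many small files into each split. It is a sketch under assumptions: the identity mapper, the 128 MB split cap, and the command-line paths are illustrative, not from the article:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombineSmallFilesJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combine-small-files");
            job.setJarByClass(CombineSmallFilesJob.class);
            // Pack many small files into each split instead of one split per file.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Cap each combined split at roughly one HDFS block (128 MB).
            CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
            job.setMapperClass(Mapper.class); // identity mapper; plug in real logic here
            job.setNumReduceTasks(0);         // map-only pass-through
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Capping the combined split at roughly one HDFS block keeps each map task's input comparable to the normal one-block case.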
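For general solution 1, HAR files are built with the hadoop archive command-line tool. An example invocation with placeholder paths; the archive stays transparently readable through the har:// scheme:

    hadoop archive -archiveName small.har -p /data small-files /data/archived
    hdfs dfs -ls har:///data/archived/small.har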
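For general solution 2, here is a minimal sketch of merging a directory of small files into one SequenceFile, with the file name as key and the raw bytes as value; the class name and paths are assumptions for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFileMerger {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path srcDir = new Path("/data/small-files"); // hypothetical input dir
            Path outFile = new Path("/data/merged.seq"); // hypothetical output
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(outFile),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                for (FileStatus status : fs.listStatus(srcDir)) {
                    if (!status.isFile()) continue;
                    byte[] content = new byte[(int) status.getLen()];
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        IOUtils.readFully(in, content, 0, content.length);
                    }
                    // Key = original file name, value = raw file contents.
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(content));
                }
            }
        }
    }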
