How to solve the problem of small files in Spark Streaming

2025-04-05 Update From: SLTechnology News & Howtos

Shulou (Shulou.com) 06/01 Report --

In this issue, the editor looks at how to solve the problem of small files in Spark Streaming. The article analyzes the issue from a professional point of view; I hope you get something out of it.

When Spark Streaming writes real-time results to HDFS, a very large number of small files is generated by default. This is caused by Spark Streaming's micro-batch model combined with the distributed (partitioned) nature of DStreams (RDDs): Spark Streaming starts a separate task for each partition, and once that task's output file lands on HDFS, the file stream is closed; the next batch's partition task opens a brand-new file stream. So if a batch is 10 s and each output DStream has 32 partitions, the number of files produced in one hour reaches (3600 / 10) * 32 = 11,520. The consequence of so many small files is a huge amount of file metadata (file location, file size, block count, and so on) that the NameNode must maintain, putting it under enormous pressure. The format does not matter: parquet, text, JSON, and Avro all run into the same small-file problem. Below are several typical ways to deal with small files in Spark Streaming.

Increase the batch interval

This method is easy to understand: the larger the batch interval, the more events are received from outside and accumulated in memory per batch, so fewer files are written. For example, increasing the interval above from 10 s to 100 s cuts the hourly file count to 1,152. But don't celebrate too early: can the real-time business wait that long? Users who used to see results refresh every 10 s now have to wait nearly two minutes, and they will not be happy about it. So this method suits scenarios where messages arrive in real time but results are not needed quite so promptly, and where batches would otherwise sit idle waiting on output. (Notice how similar this is to Spark's internal pipelining, but mind the difference.)
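The file-count arithmetic behind both figures can be checked with a few lines of Python (a sketch; the 10 s / 100 s intervals and 32 partitions come from the examples in the article):

```python
def files_per_hour(batch_seconds: int, partitions: int) -> int:
    """Each batch writes one file per partition, so the hourly file
    count is (seconds per hour / batch interval) * partitions."""
    return (3600 // batch_seconds) * partitions

# 10 s batches with 32 partitions -> 11,520 files per hour
print(files_per_hour(10, 32))
# Stretching the interval to 100 s cuts that tenfold -> 1,152
print(files_per_hour(100, 32))
```

The formula makes the trade-off explicit: the only two levers are the batch interval and the partition count, which is exactly how the methods in this article are organized.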

Is coalesce any good?

As noted at the start of the article, the small-file count is batch_number * partition_number. The first method reduces batch_number; this one reduces partition_number, using the coalesce API, which shrinks the number of partitions. Anyone who has read the Spark source knows that for narrow dependencies a child RDD inherits its parent's partitioning, and for wide dependencies (the various *ByKey operations) the child also follows the parent's partition count unless one is explicitly specified. So however many partitions the source DStream starts with, the output has the same number, and the advantage of coalesce is that it reduces the partition count at the final output. The drawback is just as obvious: where 32 tasks used to write, say, 256 MB of data, now only 4 tasks write that same 256 MB, and the batch does not finish until the data is written. The processing delay of each batch therefore grows, and batches gradually back up. Use this method with great caution.

Merge the files outside Spark Streaming

Since we are writing the data to HDFS, we will presumably analyze it further with a "SQL on Hadoop" system such as Hive or Spark SQL, and those tables are generally partitioned by half hour, hour, or day. (Be careful not to confuse these table partitions, used for partition pruning, with Spark Streaming partitions.) We can therefore run a scheduled batch job outside Spark Streaming to merge the small files it produces. This method is less direct, but it is practical and cost-effective; the only thing to watch is the merge job's time cutoff, or it may go back and merge a small file that Spark Streaming is still writing.
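A minimal sketch of such a merge job, using the local filesystem as a stand-in for HDFS (a real job would go through Hive/Spark SQL or HDFS tooling, and must only touch table partitions Spark Streaming has finished writing; all names here are illustrative):

```python
import os
import tempfile

def merge_small_files(src_dir: str, dest_path: str) -> int:
    """Concatenate every small file in src_dir into one large file
    and delete the originals. Returns the number of files merged."""
    names = sorted(os.listdir(src_dir))          # snapshot before writing dest
    with open(dest_path, "wb") as dest:
        for name in names:
            path = os.path.join(src_dir, name)
            with open(path, "rb") as f:
                dest.write(f.read())
            os.remove(path)                      # drop the small file
    return len(names)

# demo: three small "part" files merged into one
d = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(d, f"part-{i:05d}"), "w") as f:
        f.write(f"batch {i}\n")
merged = merge_small_files(d, os.path.join(d, "merged"))
print(merged)  # 3
```

In production the cutoff matters: run the job only over, say, the previous hour's table partition, so every file in scope is guaranteed to be closed.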

Use foreachRDD and append the files yourself

Spark Streaming provides foreachRDD, an output operation that lets us customize how results are written. We can exploit it so that when each batch writes a file, it does not create a new file stream but reopens the previous file. As for feasibility: files on HDFS cannot be modified in place, but appending is widely supported, so each partition of each batch can map to one output file that is appended to on every batch, which also achieves the goal of reducing the file count. The one thing to watch is that you cannot append forever: once a file is judged to have reached a certain size threshold, a new file must be started for subsequent appends.
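A plain-Python sketch of the append-with-rollover idea (local files stand in for HDFS append; the class name, file-naming scheme, and 1 MB default threshold are all illustrative assumptions, not Spark APIs):

```python
import os
import tempfile

class RollingAppender:
    """Append each batch's partition output to an existing file,
    starting a new file once the current one passes a size threshold."""

    def __init__(self, base_path: str, max_bytes: int = 1024 * 1024):
        self.base_path = base_path
        self.max_bytes = max_bytes
        self.index = 0  # suffix of the file currently being appended to

    def _current(self) -> str:
        return f"{self.base_path}.{self.index}"

    def append(self, data: bytes) -> str:
        path = self._current()
        # cannot append indefinitely: roll over past the threshold
        if os.path.exists(path) and os.path.getsize(path) >= self.max_bytes:
            self.index += 1
            path = self._current()
        with open(path, "ab") as f:  # append-only, like HDFS: no in-place edits
            f.write(data)
        return path

# demo with a tiny 10-byte threshold
d = tempfile.mkdtemp()
ap = RollingAppender(os.path.join(d, "part"), max_bytes=10)
p1 = ap.append(b"12345678")  # creates part.0
p2 = ap.append(b"12345678")  # part.0 is 8 bytes (< 10): append again
p3 = ap.append(b"12345678")  # part.0 is 16 bytes (>= 10): roll to part.1
```

In a real foreachRDD handler the same logic would key one appender per partition, with the threshold tuned toward the HDFS block size.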

That is how the small-file problem in Spark Streaming can be solved. If you happen to have similar doubts, the analysis above may help. If you want to know more, you are welcome to follow the industry information channel.
