
Configuring Spark Storage Performance-Related Parameters


This article walks through the configuration of Spark's storage-related performance parameters in a concise, easy-to-follow way. I hope the detailed introduction gives you something to take away.

As Spark has gradually matured, more and more configurable parameters have been added to it. By explaining the working principles and configuration considerations behind some of these parameters, this article discusses how to tune a Spark configuration to fit your actual workload.

Storage-related configuration parameters

spark.local.dir

This one looks simple: it is the location where Spark writes intermediate data such as RDD cache spills, shuffle output, and spill files. So what is there to pay attention to?

First, the basics: as is well known, you can configure multiple paths (separated by commas) on multiple disks to increase overall I/O bandwidth.

Second, in the current implementation, Spark hashes file names to distribute files across the directories under these paths. If your storage devices vary in speed, say a mix of SSD and HDD, you can configure more directory paths on the SSD to increase the proportion of files Spark places there, making better use of the SSD's I/O bandwidth. This is only a workaround, of course; the ultimate solution should follow the direction HDFS is currently taking, letting Spark detect the specific type of each storage device and use it accordingly.

Note that as of Spark 1.0, the SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variable overrides this configuration. For example, with Spark on YARN, the local paths of a Spark Executor are determined by YARN's configuration, not by this parameter.
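As a minimal sketch of the SSD-weighting trick (the mount points below are hypothetical, and remember that in cluster modes the environment variables above take precedence over this setting):

```scala
import org.apache.spark.SparkConf

// Three directories on the SSD mount versus one on the HDD mount:
// the hash-based file placement then lands roughly three quarters
// of the intermediate files on the faster device.
val conf = new SparkConf()
  .setAppName("local-dir-example")
  .set("spark.local.dir", "/ssd/spark1,/ssd/spark2,/ssd/spark3,/hdd/spark1")
```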

spark.executor.memory

Executor memory size is not itself directly a performance parameter, but almost everything related to runtime performance is at least indirectly tied to memory size. This parameter ultimately becomes the heap size of the Executor's JVM, corresponding to the -Xmx and -Xms values.

In theory, the more Executor memory the better, but in practice, machine configuration, the runtime environment, resource sharing, JVM GC efficiency, and other factors mean you may need to settle on a reasonable size. How big is reasonable depends on the actual situation.

An Executor's memory is essentially shared by all tasks running inside it, and the number of concurrent tasks an Executor can support depends on the number of CPU cores it manages. So you need to understand the data size handled by each task in order to work out how much memory each Executor needs to meet its basic requirements.
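As a rough, hypothetical illustration of how these knobs relate (the 4g and 4-core figures below are made up, not a recommendation):

```scala
import org.apache.spark.SparkConf

// With 4 cores per executor, up to 4 tasks run concurrently and share
// the 4g heap, i.e. roughly 1g per task before any per-task temporary
// overhead is accounted for.
val conf = new SparkConf()
  .setAppName("executor-memory-example")
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "4")
```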

How much memory each task needs is hard to measure in a uniform way, because besides the footprint of the dataset itself, it includes whatever temporary memory the algorithm requires, and that temporary overhead varies with the specific code. Still, the size of the dataset itself is a useful reference point for the final memory requirement.

Generally speaking, the in-memory size of each partition's dataset may be several times the size of its source data on disk (even leaving compression of the source data aside, Java objects carry additional overhead from the data structures used to manage them, compared with the raw bytes). If you need to know the exact size, you can cache the RDD in memory and read each cached partition's size from the BlockManager log output (this is itself an estimate and not perfectly accurate).

For example: BlockManagerInfo: Added rdd_0_1 on disk on sr438:41134 (size: 495.3 MB)
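A minimal sketch of how you might trigger such a log line (the input path is hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("cache-size-example"))

// Cache the RDD and force materialization; the BlockManager then logs
// an estimated size for each cached partition (the same figures show
// up in the Storage tab of the Spark UI).
val rdd = sc.textFile("hdfs:///data/input") // hypothetical path
rdd.cache()
rdd.count() // action that materializes the cache
```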

On the other hand, if the number of Executors and their memory size are fixed by the machines' physical configuration, then you need to plan the data size of each partition's task sensibly, for example by using more partitions: increasing the number of tasks reduces the amount of data each task must process (which in turn means more waves of tasks to complete the whole computation).
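A hedged sketch of that trade-off (the path and the partition count of 400 are arbitrary and workload-dependent):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("repartition-example"))

// With executor memory fixed, raising the partition count shrinks the
// data each task holds at once; the job then runs in more task waves.
val rdd = sc.textFile("hdfs:///data/input") // hypothetical path
val repartitioned = rdd.repartition(400)    // arbitrary example count
```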

spark.storage.memoryFraction

As mentioned earlier, spark.executor.memory determines how much memory each Executor has available, while spark.storage.memoryFraction determines what share of that memory the MemoryStore can use to hold RDD cache data; the rest is reserved for the various other memory needs of running tasks.

The default value of spark.storage.memoryFraction is 0.6, and the official documentation recommends that this fraction not exceed the fraction of the heap occupied by the JVM's Old Gen region. This is easy to understand: RDD cache data usually resides in memory for a long time, which in theory means it will eventually be promoted to the Old Gen region (unless the RDD is unpersisted first). If this portion of data is allowed to grow too large, it will inevitably squeeze the Old Gen region and cause frequent full GCs.

How to adjust this fraction depends on your application's data usage pattern and scale. Roughly speaking, if full GCs occur frequently, consider lowering it: the memory available for the RDD cache shrinks (cache data that no longer fits must be written to disk through the DiskStore), which costs some performance, but it frees more memory for task execution, and reducing the number of full GCs may improve the program's overall performance.
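A minimal sketch of lowering the fraction (0.4 is an arbitrary example value; note this parameter belongs to the legacy memory model described here, which later Spark releases replaced with unified memory management):

```scala
import org.apache.spark.SparkConf

// Shrink the RDD-cache share of the heap from the 0.6 default; cache
// blocks that no longer fit spill to disk via the DiskStore.
val conf = new SparkConf()
  .setAppName("memory-fraction-example")
  .set("spark.storage.memoryFraction", "0.4") // arbitrary example value
```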

spark.streaming.blockInterval

This parameter sets the interval at which a Spark Streaming receiver generates blocks. The default is 200ms. Concretely, at each interval the data the receiver has accumulated in its buffer is turned into a StreamBlock and placed on a queue, where it waits to be stored in the BlockManager for subsequent computation. In theory, for the data in each streaming batch to be evenly distributed, this interval should divide the batch interval exactly. Generally speaking, if memory is sufficient and the streaming data can be processed in time, the exact block interval does not matter much. However, if the cache level is Memory+Ser, that is, the data is serialized, then the block interval determines the size of each serialized block and so has some impact on the JVM's GC behavior.

In addition, spark.streaming.blockQueueSize caps how many StreamBlocks the queue can hold before they are stored in the BlockManager. The default is 10; since the queue is polled every 100ms, this should not be a problem unless the CPU is exceptionally busy.
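A hedged sketch of pairing the block interval with a batch interval it divides evenly (the 2-second batch is arbitrary; the duration-string form "200ms" is accepted by newer releases, while very old ones took a bare millisecond count):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A 200ms block interval divides the 2s batch exactly, yielding 10
// blocks (and hence up to 10 tasks) per batch per receiver.
val conf = new SparkConf()
  .setAppName("block-interval-example")
  .set("spark.streaming.blockInterval", "200ms")
val ssc = new StreamingContext(conf, Seconds(2))
```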

That covers the configuration of Spark's storage-related performance parameters. I hope you picked up some useful knowledge or techniques along the way.
