

How should Spark tuning be done?

2025-01-15 Update From: SLTechnology News&Howtos


Today I'd like to talk about how Spark tuning should be done. Many people may not know much about it, so to help you understand better, I have summarized the following; I hope you get something out of this article.

A deadlock problem was solved by analyzing the deadlock log file; the root cause was a misunderstanding of index locking. I had always assumed that a condition on two non-unique indexes pins down a single record, but in fact MySQL may, for performance reasons, lock more than one index entry.

Consider a simple update such as UPDATE t SET ... WHERE id1 = 1 AND id2 = 2, where id1 and id2 are both non-unique indexes. Because of optimizer choices, one session may lock entries around one secondary index and then wait for locks on the clustered (built-in unique) index, while another session running the same statement locks the clustered index first and then waits for the locks on id1 or id2. Each waits for the other to release its locks, and the result is a deadlock.

The takeaway: in the future I still need to read more database books and understand the underlying principles better.

What are the roles of the main processes in Spark?

Driver process: responsible for distributing tasks and collecting results.

Executor process: responsible for the execution of specific tasks.

Master process: the main process of Spark resource management, responsible for resource scheduling.

Worker process: the slave process in Spark resource management. Worker nodes mainly run Executors, as the sketch below illustrates.
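
To make these roles concrete, here is a minimal sketch of driver-side code for a standalone cluster. The master URL, memory size, and core count below are assumptions for illustration, not values from the article:

```scala
import org.apache.spark.sql.SparkSession

// This main() runs inside the Driver process. The standalone Master schedules
// the requested cores/memory onto Worker nodes, and each Worker launches the
// Executor processes that actually execute the tasks.
object ProcessRolesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("process-roles-sketch")
      .master("spark://master-host:7077")       // standalone Master (assumed URL)
      .config("spark.executor.memory", "4g")    // per-Executor memory on a Worker
      .config("spark.cores.max", "8")           // total cores granted to this app
      .getOrCreate()

    // The Driver splits this job into tasks, ships them to Executors,
    // and collects the result.
    val total = spark.sparkContext.parallelize(1 to 1000000).map(_ * 2L).reduce(_ + _)
    println(s"total = $total")
    spark.stop()
  }
}
```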

How to choose the most appropriate persistence strategy?

By default, the highest-performance level is of course MEMORY_ONLY, but only if your memory is large enough to hold the entire RDD. Because there are no serialization and deserialization steps, that overhead is avoided; subsequent operators on the RDD work on pure in-memory data without reading from disk files, so performance is high; and no copy of the data needs to be replicated to other nodes. Note, however, that in real production environments the scenarios where this strategy can be used directly are limited. If the RDD contains a lot of data (for example, billions of records), caching at this level can cause a JVM OOM (out-of-memory) error.
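
A minimal sketch of this level; the input path is hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("memory-only-sketch"))

// MEMORY_ONLY keeps deserialized Java objects in memory; partitions that do
// not fit are simply not cached and get recomputed when next needed.
val rdd = sc.textFile("hdfs:///path/to/input")   // hypothetical input path
rdd.persist(StorageLevel.MEMORY_ONLY)            // equivalent to rdd.cache()
rdd.count()   // the first action materializes the cache; later actions reuse it
```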

If a memory overflow occurs at the MEMORY_ONLY level, try the MEMORY_ONLY_SER level. This level serializes the RDD data before keeping it in memory, so each partition is just one byte array, which greatly reduces the number of objects and the memory footprint. The extra performance cost over MEMORY_ONLY is serialization and deserialization. Subsequent operators still run on pure in-memory data, so overall performance remains relatively high. The caveat is the same as above: if the RDD holds too much data, an OOM error is still possible.
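
A sketch of switching to the serialized level. Pairing it with Kryo is a common optional refinement, and MyRecord is a hypothetical class standing in for your own data type:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

case class MyRecord(id: Long, payload: String)  // placeholder data type

// Kryo is optional but commonly paired with the _SER levels: it produces
// smaller, faster serialized bytes than Java serialization.
val conf = new SparkConf()
  .setAppName("ser-cache-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))
val sc = new SparkContext(conf)

// Each cached partition becomes one byte array: far fewer objects, less GC
// pressure, at the cost of serialize/deserialize work on access.
val records = sc.parallelize(1L to 1000000L).map(i => MyRecord(i, s"row-$i"))
records.persist(StorageLevel.MEMORY_ONLY_SER)
println(records.count())
```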

If neither pure-memory level works, use the MEMORY_AND_DISK_SER strategy in preference to MEMORY_AND_DISK. Reaching this point means the RDD is too large to fit entirely in memory. Serialized data is smaller, saving both memory and disk space, and this level still tries to cache as much data in memory as possible, writing to disk only the partitions that do not fit.
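
A one-line sketch of this fallback, where rdd stands in for the RDD being cached:

```scala
import org.apache.spark.storage.StorageLevel

// Serialized in memory first; partitions that do not fit spill to disk
// instead of being dropped and recomputed. The replicated variant discussed
// below would be StorageLevel.MEMORY_AND_DISK_SER_2.
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
```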

DISK_ONLY and the levels with the _2 suffix are generally not recommended. Reading and writing entirely from disk files causes a sharp performance drop; sometimes recomputing the whole RDD is faster. With the _2 levels, every piece of data is replicated and sent to another node, and that replication plus network transfer adds considerable overhead; avoid these levels unless the job requires high availability.

After reading the above, do you have a better understanding of how Spark tuning should be done? If you want to learn more, please follow the industry information channel. Thank you for your support.
