What is the method of Flink task tuning

2025-07-06 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to tune Flink tasks in four steps: look at the metrics to locate the problem, check whether resources are sufficient, look at throughput and backpressure, and check the JVM for out-of-memory (OOM) errors.

First, look at the metrics to locate the problem.

Flink's Metrics system collects metrics from inside Flink, giving developers a clearer picture of the state of jobs and clusters. Once a cluster is running, it is hard to see what is actually happening inside it: whether a job is running slowly or quickly, whether something is abnormal, and so on. Developers cannot watch every Task log in real time, especially when a job is very large or there are many jobs. In those situations, Metrics let developers understand the current state of a job.
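As a small illustration of reading such metrics, the sketch below turns periodic samples of a record counter (Flink exposes counters such as numRecordsIn) into a records-per-second series. The sample values and the function name are made up for this example; how the samples are collected (Web UI, a metrics reporter, and so on) is left out.

```python
def throughput(samples):
    """Return records/second between consecutive (timestamp_s, counter) samples."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        # Counter delta divided by elapsed time gives the average rate.
        rates.append((c1 - c0) / (t1 - t0))
    return rates


# Hypothetical samples of a numRecordsIn counter taken every 10 seconds.
samples = [(0, 0), (10, 50_000), (20, 100_000), (30, 110_000)]
rates = throughput(samples)

# A sharp drop in the last interval is the kind of signal worth investigating.
slowed_down = rates[-1] < 0.5 * rates[0]
```

A drop like the one in the last interval does not say why the job slowed down, only where to start looking.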

Second, check whether the resources are sufficient.

Once the metrics above have pointed to a problem, latency and throughput metrics let us judge the performance of the task and find the code location where the problem occurs. In general, problems at these locations come down to one of the following:

The parallelism of an Operator is set unreasonably.

The CPU (core) setting is unreasonable.

Parameters such as heap memory (heap_memory) are set unreasonably.

The parallelism setting is unreasonable.

The State setting is unreasonable.

The checkpoint setting is unreasonable.

When setting these parameters, we should pay attention to:

Parallelism: make sure there is enough parallelism, but higher is not always better. Too much parallelism increases the pressure of data transfer between slots and TaskManagers, including serialization and deserialization.

CPU: CPU resources are shared by the slots on a TaskManager, so monitor CPU usage.

Memory: memory is isolated per slot. Make sure enough memory is allocated when storing large state.

Network: big data processing involves a lot of data transfer between Flink nodes, so servers should use 10-gigabit network cards where possible.
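One way to reason about "enough but not too much" parallelism is a back-of-the-envelope estimate. The sketch below is a rule of thumb, not something taken from Flink or from this article: it assumes you have measured how many records per second one subtask can handle, and the function names and the 20% headroom are hypothetical. The slot check relies on Flink's default slot sharing, where a job needs as many slots as its highest operator parallelism.

```python
def estimate_parallelism(target_rps, per_subtask_rps, headroom_pct=20):
    """Smallest parallelism that covers the target rate plus some headroom."""
    needed = target_rps * (100 + headroom_pct)
    capacity = per_subtask_rps * 100
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-needed // capacity)


def fits_in_cluster(parallelism, task_managers, slots_per_task_manager):
    """With default slot sharing, the job needs `parallelism` slots in total."""
    return parallelism <= task_managers * slots_per_task_manager


# Hypothetical numbers: 100,000 records/s target, 8,000 records/s per subtask.
parallelism = estimate_parallelism(100_000, 8_000)
enough_slots = fits_in_cluster(parallelism, task_managers=4, slots_per_task_manager=4)
```

If the estimate is far above what the measured load needs, the extra subtasks mostly add shuffle and serialization cost, which is the point made in the Parallelism item above.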

Third, look at throughput and check for backpressure.

Internally, Flink passes messages using a producer-consumer model, and its backpressure design is based on the same model. Flink uses efficient bounded distributed blocking queues, much like Java's generic blocking queue (BlockingQueue). If a downstream consumer consumes slowly, the upstream producer is blocked.
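The blocking behaviour can be illustrated with a single bounded queue. This is only an analogy using Python's standard queue.Queue, not Flink's network stack; the helper name try_send and the buffer size of 2 are chosen for the example.

```python
import queue


def try_send(buffer, record, timeout=0.05):
    """Upstream side: wait up to `timeout` seconds for free space in the buffer."""
    try:
        buffer.put(record, timeout=timeout)
        return True
    except queue.Full:
        # The consumer has not freed a slot in time: the producer is held back.
        return False


# A bounded buffer between an upstream producer and a downstream consumer.
buffer = queue.Queue(maxsize=2)

# The consumer is not reading yet, so the third record cannot be enqueued.
sent = [try_send(buffer, record) for record in ("a", "b", "c")]

# Once the consumer takes one record, the producer can make progress again.
consumed = buffer.get()
retry_ok = try_send(buffer, "c")
```

In a real job the producer would simply stay blocked instead of timing out; the timeout is there so the sketch terminates.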

In practice, backpressure is often caused by data skew, which can be confirmed from the Records Sent and Records Received of each SubTask in the Web UI. In addition, the State size of each SubTask in the Checkpoint details is a useful indicator for analyzing data skew.
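A quick way to quantify skew from those per-SubTask numbers is to compare the busiest SubTask with the average. The sketch below assumes the Records Received values have been copied out of the Web UI by hand; the numbers and the threshold of 2x the average are hypothetical.

```python
from statistics import mean


def skew_ratio(records_per_subtask):
    """Ratio of the busiest subtask to the average; 1.0 means perfectly even."""
    return max(records_per_subtask) / mean(records_per_subtask)


def hot_subtasks(records_per_subtask, threshold=2.0):
    """Indexes of subtasks receiving more than `threshold` times the average."""
    avg = mean(records_per_subtask)
    return [i for i, n in enumerate(records_per_subtask) if n > threshold * avg]


# Hypothetical Records Received for four subtasks of one operator.
received = [100_000, 98_000, 102_000, 700_000]
ratio = skew_ratio(received)
hot = hot_subtasks(received)
```

The same calculation can be applied to the per-SubTask State sizes from the Checkpoint details.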

Flink 1.11 made some optimizations around the backpressure problem itself, such as Unaligned Checkpoints. Using them together with RocksDB as the state store, and generating checkpoints incrementally rather than in full each time, makes checkpointing faster.

Also pay attention to the execution efficiency of user code (frequent blocking or performance problems) and to TaskManager memory and GC problems.

Fourth, look at the JVM: is it running out of memory (OOM)?

The parameters given on the official website are as follows:

The most important of these are:

taskmanager.memory.process.size: 512m
taskmanager.memory.framework.heap.size: 64m
taskmanager.memory.framework.off-heap.size: 64m
taskmanager.memory.jvm-metaspace.size: 64m
taskmanager.memory.jvm-overhead.fraction: 0.2
taskmanager.memory.jvm-overhead.min: 16m
taskmanager.memory.jvm-overhead.max: 64m
taskmanager.memory.network.fraction: 0.1
taskmanager.memory.network.min: 1mb
taskmanager.memory.network.max: 256mb
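To see how the fraction, min and max options interact, the values above can be worked through by hand. The sketch below is a simplified reading of the TaskManager memory model, assuming the JVM overhead fraction applies to the total process size and the network fraction applies to the total Flink memory (process size minus metaspace and JVM overhead). Flink's real derivation handles more cases; the function names are made up for this example.

```python
def clamp(value, low, high):
    """Keep a fraction-derived size within its configured min/max bounds."""
    return max(low, min(value, high))


def derive_memory_mb(process, metaspace, framework_heap, framework_off_heap,
                     overhead_fraction, overhead_min, overhead_max,
                     network_fraction, network_min, network_max):
    """Simplified breakdown of TaskManager memory, all sizes in MB."""
    jvm_overhead = clamp(process * overhead_fraction, overhead_min, overhead_max)
    total_flink = process - metaspace - jvm_overhead
    network = clamp(total_flink * network_fraction, network_min, network_max)
    # What is left is shared by task heap, task off-heap and managed memory.
    remaining = total_flink - framework_heap - framework_off_heap - network
    return {
        "jvm_overhead": jvm_overhead,
        "total_flink": total_flink,
        "network": network,
        "remaining": remaining,
    }


# The values listed above, expressed in MB.
memory = derive_memory_mb(
    process=512, metaspace=64, framework_heap=64, framework_off_heap=64,
    overhead_fraction=0.2, overhead_min=16, overhead_max=64,
    network_fraction=0.1, network_min=1, network_max=256,
)
```

With these numbers the JVM overhead is capped by its max (0.2 x 512 MB would exceed 64 MB), which leaves roughly 218 MB for task heap, task off-heap and managed memory combined. A small process size is therefore consumed mostly by fixed costs, which is one way a job ends up with OOM errors.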

For the meaning of each parameter, consult the official documentation.

The main parameters of the JVM itself are the following:

Heap settings
-Xms: initial heap size
-Xmx: maximum heap size
-XX:NewSize=n: sets the young generation size
-XX:NewRatio=n: sets the ratio of the young generation to the old generation. For example, 3 means the ratio of young to old is 1:3, so the young generation accounts for 1/4 of the combined young and old generations
-XX:SurvivorRatio=n: the ratio of the Eden region to the two Survivor regions in the young generation. Note that there are two Survivor regions. For example, 3 means Eden:Survivor = 3:2, so one Survivor region occupies 1/5 of the whole young generation
-XX:MaxPermSize=n: sets the permanent generation size (applies to JDK 7 and earlier; the permanent generation was removed in JDK 8)

Collector settings
-XX:+UseSerialGC: use the serial collector
-XX:+UseParallelGC: use the parallel collector
-XX:+UseParallelOldGC: use the parallel old-generation collector
-XX:+UseConcMarkSweepGC: use the concurrent collector

Garbage collection statistics
-XX:+PrintHeapAtGC: print heap details at each GC
-XX:+PrintGCDetails: print GC details
-XX:+PrintGCTimeStamps: print GC time information
-XX:+PrintTenuringDistribution: print object age information
-XX:+HandlePromotionFailure: old-generation allocation guarantee (true or false)

Parallel collector settings
-XX:ParallelGCThreads=n: sets the number of CPUs used by the parallel collector, that is, the number of parallel collection threads
-XX:MaxGCPauseMillis=n: sets the maximum pause time for parallel collection
-XX:GCTimeRatio=n: sets the percentage of garbage collection time relative to program running time. The formula is 1 / (1 + n)

Concurrent collector settings
-XX:+CMSIncrementalMode: enables incremental mode. Suitable for single-CPU machines
-XX:ParallelGCThreads=n: sets the number of CPUs used when the concurrent collector collects the young generation in parallel, that is, the number of parallel collection threads
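The ratio-style options are easy to misread, so here is the arithmetic behind the examples given for NewRatio, SurvivorRatio and GCTimeRatio, written as a small sketch. The function names are made up for this illustration; the formulas follow the descriptions above.

```python
from fractions import Fraction


def young_fraction(new_ratio):
    """-XX:NewRatio=n gives young:old = 1:n, so young is 1/(n+1) of the two."""
    return Fraction(1, new_ratio + 1)


def survivor_fraction(survivor_ratio):
    """-XX:SurvivorRatio=n gives Eden:S0:S1 = n:1:1 within the young generation."""
    return Fraction(1, survivor_ratio + 2)


def gc_time_fraction(gc_time_ratio):
    """-XX:GCTimeRatio=n targets at most 1/(1+n) of total time spent in GC."""
    return Fraction(1, 1 + gc_time_ratio)


young = young_fraction(3)        # young generation share of young + old
survivor = survivor_fraction(3)  # one Survivor region's share of the young generation
gc_share = gc_time_fraction(19)  # GC time share for a ratio of 19
```

For NewRatio=3 and SurvivorRatio=3 this gives 1/4 and 1/5, matching the examples in the list. A GCTimeRatio of 19 (a value picked for this example) corresponds to a GC time target of 1/20, or 5%.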

Simple JVM log analysis tools can help show which of the JVM parameter settings are causing problems.

That covers the method of Flink task tuning. How to apply it in a specific case still needs to be verified in practice.
