Backpressure is a very common problem in the development of real-time computing applications, especially in streaming computing. Backpressure means that a node in the data pipeline has become a bottleneck: its processing rate cannot keep up with the rate at which upstream sends data, so the upstream has to be slowed down. Because real-time computing applications usually use message queues to decouple the producer side from the consumer side, and the consumer-side data source is pull-based, backpressure normally propagates all the way back to the data source and lowers its ingestion rate (for example, the Kafka consumer).

There are already many blog posts about Flink's backpressure mechanism; two Chinese articles in particular are recommended [1]. To put it simply, data between nodes (Tasks) in a Flink topology is transmitted through blocking queues. When a queue fills up because the downstream is not consuming fast enough, upstream production is blocked as well, which ultimately blocks the ingestion of the data source. This article builds on the official blog [4] to share the author's experience of analyzing and handling Flink backpressure in practice.

The influence of backpressure

Backpressure does not directly affect the availability of a job; it indicates that the job is in a sub-healthy state, has a potential performance bottleneck, and may cause larger data-processing latency. Generally speaking, for applications that are not latency-sensitive or that process little data, the impact of backpressure may not be obvious, but for large-scale Flink jobs backpressure can cause serious problems.

Because of Flink's checkpoint mechanism, backpressure also affects two metrics: checkpoint duration and state size. The former is affected because checkpoint barriers do not overtake regular data, so when data processing is blocked it takes longer for a barrier to flow through the whole data pipeline and the overall checkpoint time (End to End Duration) grows. The latter is affected because, to guarantee EOS (Exactly-Once Semantics), an Operator with two or more input channels needs to align the checkpoint barriers (Alignment): after the barrier from the faster input channel has been received, the data arriving behind it is cached but not processed until the barrier from the slower input channel arrives as well, and this cached data is put into state, making the checkpoint larger.

Both effects are dangerous for jobs in a production environment, because checkpoints are the key to guaranteeing data consistency: a longer checkpoint time can cause checkpoint timeout failures, and a larger state can slow checkpoints down further and even lead to OOM (with a heap-based StateBackend) or to physical memory usage exceeding the container's resources (with RocksDBStateBackend), i.e. stability problems. Therefore we should try our best to avoid backpressure in production. (Incidentally, to relieve the pressure that backpressure puts on checkpoints, the community proposed FLIP-76: Unaligned Checkpoints [4] to decouple backpressure from checkpointing.)
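For reference, unaligned checkpoints later shipped as an opt-in feature. The following is a minimal sketch of enabling them through the DataStream API, assuming Flink 1.11 or later where CheckpointConfig#enableUnalignedCheckpoints is available; the interval and timeout values are purely illustrative:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnalignedCheckpointExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 seconds with exactly-once semantics.
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig config = env.getCheckpointConfig();
        // Fail a checkpoint that cannot complete within 10 minutes,
        // e.g. because barriers are delayed by backpressure.
        config.setCheckpointTimeout(10 * 60_000L);
        // Unaligned checkpoints (FLIP-76): barriers can overtake buffered
        // in-flight data, decoupling checkpoint duration from backpressure.
        config.enableUnalignedCheckpoints(true);

        // ... build the rest of the job and call env.execute() ...
    }
}
```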
Locating the backpressure node

The first step in solving backpressure is to locate the node that causes it, and there are mainly two ways to do this: through the backpressure monitoring panel of the Flink Web UI, or through Flink Task Metrics. The former is easy to use and suitable for a quick analysis, while the latter provides richer information and is better suited for a monitoring system. Because backpressure propagates upstream, both approaches require us to check the nodes one by one from Source to Sink until we find the root cause of the backpressure [4]. The two approaches are described below.

The backpressure monitoring panel

The Flink Web UI provides backpressure monitoring at the SubTask level. Its principle is to periodically sample the stack traces of the Task threads and measure how often a thread is blocked requesting a Buffer (which means it is blocked by the downstream queue); that ratio decides whether the node is considered backpressured. By default, a ratio below 0.1 is shown as OK, 0.1 to 0.5 as LOW, and above 0.5 as HIGH.
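The same sampling-based result can also be fetched outside the Web UI. Below is a hedged sketch that queries Flink's monitoring REST API with Java 11's HttpClient, assuming the /jobs/{jobId}/vertices/{vertexId}/backpressure endpoint documented for recent Flink versions; host, job ID, and vertex ID are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BackpressureProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder values: replace with your JobManager address and the
        // job/vertex IDs shown in the Web UI or returned by the /jobs endpoint.
        String jobManager = "http://localhost:8081";
        String jobId = "your-job-id";
        String vertexId = "your-vertex-id";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create(jobManager + "/jobs/" + jobId
                        + "/vertices/" + vertexId + "/backpressure"))
                .GET()
                .build();

        // The JSON response reports an overall backpressure level (ok / low / high)
        // plus a per-subtask ratio, mirroring the Web UI thresholds
        // (< 0.1 OK, 0.1-0.5 LOW, > 0.5 HIGH).
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```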
Figure 1. Web UI backpressure panel of Flink 1.8 (from the official blog).

If a node is shown as being in a backpressure state, there are two possibilities: either the node's sending rate cannot keep up with the rate at which it produces data, which typically happens in an Operator that emits multiple output records per input record (such as flatMap); or the downstream node receives data slowly and, through the backpressure mechanism, limits this node's sending rate. In the first case the node is the root of the backpressure, i.e. the first backpressured node going from Source Task to Sink Task. In the second case we need to continue checking the downstream nodes.

It is worth noting that the root node of the backpressure does not necessarily show high backpressure on the backpressure panel, because the panel monitors the sending side: if a node is the performance bottleneck, it will not itself show high backpressure; instead it causes high backpressure on its upstream nodes. In general, once we have found the first backpressured node, the source of the backpressure is either that node itself or its immediate downstream node. How do we distinguish the two cases? Unfortunately this cannot be decided from the backpressure panel alone; we also need to combine it with Metrics or other monitoring means. In addition, if the job has many nodes or a high parallelism, the backpressure panel itself comes under heavy load because it has to sample the stack traces of all Tasks, and it may even become unusable.

Task Metrics

The Task Metrics provided by Flink are a better means of monitoring backpressure, but they also require more background knowledge. Let us first briefly review the network stack of Flink 1.5 and later; readers familiar with it can skip this part.

When a TaskManager transmits data, there are usually multiple Channels between two Subtasks on different TaskManagers, depending on the number of keys. These Channels reuse the same TaskManager-level TCP connection and share the receiver Subtask-level Buffer Pool. On the receiving side, each Channel is assigned a fixed number of Exclusive Buffers in the initial phase; these Buffers are used to store the received data and are released again after the Operator has consumed it. The number of free Buffers at the receiving end of a Channel is called Credit; the Credit is periodically synchronized to the sender, which uses it to decide how many Buffers of data to send. When traffic is high, a Channel's Exclusive Buffers may all be occupied, in which case Flink requests the remaining Floating Buffers from the Buffer Pool. These Floating Buffers serve as backup Buffers and go to whichever Channel needs them. On the sending side of a Channel, all Channels of a Subtask share the same Buffer Pool, so there is no distinction between Exclusive Buffers and Floating Buffers.

Figure 2. Flink Credit-Based network.
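To relate this to something tunable: the number of Exclusive Buffers per Channel and of Floating Buffers per input gate are regular Flink configuration options. The sketch below assumes the config keys taskmanager.network.memory.buffers-per-channel and taskmanager.network.memory.floating-buffers-per-gate (check the documentation of your Flink version) and uses a local environment purely for illustration; in a real deployment these values normally go into flink-conf.yaml:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NetworkBufferConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Exclusive Buffers assigned to each incoming Channel (assumed key; 2 is
        // commonly the default).
        conf.setString("taskmanager.network.memory.buffers-per-channel", "2");
        // Shared Floating Buffers per input gate, handed to whichever Channel
        // needs them (assumed key; 8 is commonly the default).
        conf.setString("taskmanager.network.memory.floating-buffers-per-gate", "8");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(4, conf);
        // ... build and execute the job with env ...
    }
}
```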
The Metrics we use when monitoring backpressure are mainly related to Channel Buffer usage. The most useful ones are the following:

- outPoolUsage: sender-side Buffer utilization
- inPoolUsage: receiver-side Buffer utilization
- floatingBuffersUsage (Flink 1.9 and above): receiver-side Floating Buffer utilization
- exclusiveBuffersUsage (Flink 1.9 and above): receiver-side Exclusive Buffer utilization

Here inPoolUsage equals the sum of floatingBuffersUsage and exclusiveBuffersUsage.

The general idea of the analysis is: if the sender-side Buffer utilization of a Subtask is very high, it indicates that the Subtask is being rate-limited by downstream backpressure; if the receiver-side Buffer utilization of a Subtask is very high, it indicates that the Subtask is propagating backpressure to its upstream. The backpressure situation can then be checked against the following table (picture from the official website):
Figure 3. Backpressure analysis table.

There should be no doubt about the cases where outPoolUsage and inPoolUsage are both low or both high: they mean, respectively, that the current Subtask is normal or that it is being backpressured by its downstream. The interesting cases are when outPoolUsage and inPoolUsage differ, which may be an intermediate state of backpressure propagation or may indicate that this Subtask is the source of the backpressure.

If a Subtask's outPoolUsage is high, it is usually being affected by its downstream Task, so we can rule it out as the root cause of the backpressure. If a Subtask's outPoolUsage is low but its inPoolUsage is high, it may well be the source of the backpressure: since backpressure usually propagates to its upstream and drives up the outPoolUsage of some upstream Subtasks, we can use that to judge further. It is also worth noting that backpressure is sometimes short-lived and harmless, for example a brief network delay on one Channel or an ordinary GC pause of a TaskManager; in such cases there is no need to act.
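To make this reading of the table concrete, here is a small self-contained sketch that encodes the rules above; the 0.5 threshold for treating a pool as "high" is illustrative, and the usage values would come from your metrics system:

```java
public class PoolUsageDiagnosis {

    // Illustrative threshold: treat a buffer pool as "high" above 50% usage.
    private static final double HIGH = 0.5;

    static String diagnose(double outPoolUsage, double inPoolUsage) {
        boolean outHigh = outPoolUsage > HIGH;
        boolean inHigh = inPoolUsage > HIGH;
        if (!outHigh && !inHigh) {
            return "Normal: no backpressure at this Subtask.";
        }
        if (outHigh && inHigh) {
            return "Backpressured by downstream and propagating backpressure upstream.";
        }
        if (outHigh) {
            return "Backpressured by downstream (may not have reached upstream yet); "
                    + "unlikely to be the root cause itself.";
        }
        return "Possible backpressure root cause: input filling up while output is free; "
                + "check whether upstream Subtasks show high outPoolUsage to confirm.";
    }

    public static void main(String[] args) {
        System.out.println(diagnose(0.2, 0.9));
        System.out.println(diagnose(0.9, 0.9));
    }
}
```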
For Flink 1.9 and above, in addition to the table above, we can further analyze the data transmission between a Subtask and its upstream Task based on floatingBuffersUsage and exclusiveBuffersUsage, together with the outPoolUsage of the upstream Subtasks.

Figure 4. Flink 1.9 backpressure analysis table.

Generally speaking, a high floatingBuffersUsage indicates that backpressure is being propagated upstream, while exclusiveBuffersUsage indicates whether the backpressure is skewed (high floatingBuffersUsage with low exclusiveBuffersUsage means it is skewed, because a small number of Channels occupy most of the Floating Buffers).

At this point we have a rich set of means to locate on which node the root of the backpressure lies, but we still have not found the concrete cause. Moreover, the network-based backpressure metrics cannot pinpoint a specific Operator, only a Task. In particular, for embarrassingly parallel jobs (where all Operators are chained into a single Task, so there is only one node), the backpressure metrics are of no use.

Analyzing the specific cause and handling it

Once the backpressure node has been located, the way to analyze the cause is very similar to how we analyze the performance bottleneck of an ordinary program, and perhaps a little simpler, because what we mainly need to observe is the Task Thread.

In practice, backpressure is in many cases caused by data skew, which can be confirmed through the Records Sent and Records Received of each SubTask in the Web UI. In addition, the State size of the different SubTasks in the Checkpoint detail view is also a useful indicator for spotting data skew.

Apart from that, the most common problem is probably the execution efficiency of the user code (frequent blocking or plain performance problems). The most useful approach is to take a CPU profile of the TaskManager, from which we can see whether the Task Thread is saturating a CPU core. If it is, we analyze which functions the CPU time is mainly spent in; for example, we occasionally run into user functions stuck in regular expressions (ReDoS) in our production environment. If it is not, what matters is where the Task Thread is blocked: the user function itself may make blocking synchronous calls, or system activities such as checkpointing or GC may cause temporary pauses. Of course, the profiling result may also look perfectly normal, and the job is simply under-provisioned, which causes the backpressure; this usually calls for increasing the parallelism. It is worth mentioning that future versions of Flink will provide JVM CPU flame graphs directly in the Web UI, which will greatly simplify the analysis of performance bottlenecks.

In addition, memory and GC problems of the TaskManager may also cause backpressure, including frequent Full GCs or even heartbeat loss caused by unreasonable sizing of the memory areas of the TaskManager JVM. It is recommended to optimize GC by enabling the G1 garbage collector for the TaskManager and to add -XX:+PrintGCDetails to print GC logs, so that GC problems can be observed.

Summary

Backpressure is a common problem in operating and maintaining Flink applications; it not only indicates a performance bottleneck but may also destabilize the job. Locating the backpressure node can start from both the Web UI backpressure monitoring panel and Task Metrics: the former is convenient for a quick analysis, the latter is suited for in-depth digging. Once the backpressure node is located, we can further analyze the concrete cause behind it and optimize, using data distribution, CPU profiles, and GC metrics and logs.
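For reference on the GC recommendation above, a minimal flink-conf.yaml sketch, assuming the standard env.java.opts.taskmanager option (verify the exact key name and supported JVM flags for your Flink and JDK versions):

```yaml
# flink-conf.yaml (sketch): enable G1 and GC logging for TaskManagers only.
env.java.opts.taskmanager: "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
```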