This article explains how to understand the determinism of Flink real-time applications; interested readers may find it a helpful reference.
Determinism is a very important property in computer science: a deterministic algorithm always produces the same output for the same input. In the field of distributed real-time computing, determinism has long been a hard problem for the industry. As a result, the Lambda architecture, in which offline computation corrects the results of real-time computation, has been the mainstream big data architecture over the past decade.
In recent years, with the publication of Google's Dataflow Model, the relationship between real-time and offline computing has become increasingly clear, and providing the same determinism in real-time computing as in offline computing has become possible. Taking the popular real-time computing engine Apache Flink as an example, this article works out what conditions a deterministic real-time application must satisfy.
Determinism and accuracy
Accuracy is probably a more familiar term than determinism, and in most contexts the two can be used interchangeably, but they differ slightly: an accurate result must be deterministic, while a deterministic result is not necessarily 100% accurate. In the big data field, many algorithms can trade accuracy for cost as needed. For example, the HyperLogLog deduplication algorithm gives results with some error (so it is not accurate), yet it is deterministic (recomputation yields the same result).
The reason for distinguishing determinism from accuracy is that accuracy is tightly coupled with specific business logic and hard to evaluate, whereas determinism is a general requirement (apart from a few scenarios where users deliberately use non-deterministic algorithms). When a Flink real-time application provides determinism, it can produce the same results as an offline job both after automatic retries in failure scenarios and after manual data replays, which greatly increases user trust.
Factors affecting the determinism of Flink applications
Delivery semantics
There are three common delivery semantics: At-Most-Once, At-Least-Once and Exactly-Once. Strictly speaking, only Exactly-Once satisfies determinism, but if the entire business logic is idempotent, deterministic results can also be achieved on top of At-Least-Once.
In real-time computing, Exactly-Once usually means end-to-end Exactly-Once: the data written to the downstream system is consistent with the upstream data, with no duplicate computation and no data loss. Achieving this requires Exactly-Once when reading the data source (the Source side), Exactly-Once in the computation itself, and Exactly-Once when writing to the downstream system (the Sink side).
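As a minimal sketch (the checkpoint interval is an arbitrary example value), the computation-side Exactly-Once is switched on through Flink's checkpointing mode:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Take a checkpoint every 60 seconds; EXACTLY_ONCE aligns checkpoint barriers
// so that each record affects the internal state exactly once.
env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
```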
The first two are relatively easy to guarantee: if a Flink application fails, the pull-based Source automatically resets to the positions recorded in the last successful checkpoint, and Flink's internal computation state automatically rolls back to the same snapshot point. The problem lies in the push-based Sink side. Whether the Sink can be rolled back cleanly depends on the characteristics of the external system, which generally needs to support transactions; however, many big data components support transactions poorly. Even Kafka, the component most commonly used with real-time computing, did not support transactions until version 0.11 in 2017, and many other components have to rely on assorted tricks to achieve Exactly-Once in specific scenarios.
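For instance, with a transaction-capable Kafka (0.11+), an Exactly-Once Sink can be configured roughly as follows. This is a hedged sketch using the KafkaSink builder from newer Flink releases; the broker address, topic and transactional-id prefix are placeholders:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

KafkaSink<String> sink = KafkaSink.<String>builder()
        .setBootstrapServers("kafka:9092")                 // placeholder address
        .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                .setTopic("output-topic")                  // placeholder topic
                .setValueSerializationSchema(new SimpleStringSchema())
                .build())
        // EXACTLY_ONCE wraps writes in Kafka transactions that are committed
        // only after the corresponding Flink checkpoint succeeds.
        .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
        .setTransactionalIdPrefix("flink-app")             // placeholder prefix
        .build();
```

Note that downstream consumers must read with isolation.level=read_committed, otherwise they will still see uncommitted duplicates.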
Broadly, these tricks fall into two categories:
1. Relying on the idempotency of writes. For example, KV stores such as HBase provide no cross-row transactions, but Exactly-Once can still be achieved through idempotent writes combined with primary-key-based Upsert operations. However, since an Upsert cannot express a Delete, this pattern does not suit business scenarios that involve deletions. (A minimal idempotent-write sketch follows this list.)
2. Write-ahead log (WAL, Write-Ahead Log). The write-ahead log is a technique widely used in transaction mechanisms; transactions in mature relational databases such as MySQL and PostgreSQL are built on it. Its basic principle is to first write changes to a buffer and apply them all when the transaction commits. File systems such as HDFS/S3 provide no transactions themselves, so the burden of implementing a write-ahead log falls on their users, such as Flink: Flink's FileSystem Connector achieves Exactly-Once by writing temporary files/objects and committing them only after a Flink checkpoint succeeds. Note that a write-ahead log only guarantees the atomicity and durability of a transaction, not consistency or isolation; the FileSystem Connector therefore provides isolation by writing the log as hidden files, while consistency (such as the cleanup of leftover temporary files) is not guaranteed.
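As a hedged sketch of the idempotent-write pattern (the table, column family and qualifier names are made up for illustration), an HBase Put keyed by the record's primary key can be replayed safely, because rewriting the same cell with the same value converges to the same final state:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

void upsert(Connection conn, String rowKey, long value) throws Exception {
    try (Table table = conn.getTable(TableName.valueOf("metrics"))) {
        Put put = new Put(Bytes.toBytes(rowKey));
        // Same row key + deterministic value: replaying this write after a
        // Flink restart overwrites the cell rather than duplicating data.
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(value));
        table.put(put);
    }
}
```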
To keep a Flink application deterministic, when choosing among the official Connectors, especially Sink Connectors, users should pay attention to the delivery semantics described in the official documentation [3]. When implementing a custom Sink Connector, it is equally necessary to be explicit about which delivery semantics it provides; the three approaches above can serve as references for achieving Exactly-Once: external-system transactions, idempotent writes, or a write-ahead log.
Function side effects
Function side effects are unexpected impacts that user functions have on systems outside the computing framework. A typical example is writing intermediate results to a database inside a Map function: if the Flink job restarts after a failure, that data may be written twice, producing non-deterministic results. For this case, Flink provides checkpoint-based two-phase-commit hooks (CheckpointedFunction and CheckpointListener) with which users can implement transactions and eliminate the non-determinism of side effects. Another common mistake is storing temporary data in local files, which are likely to be lost when a Task is rescheduled. There are many other possible scenarios, but in short: if a user function changes the state of an external system, make sure Flink is aware of those actions (for example, record the state with the State API and set checkpoint hooks).
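A hedged sketch of that two-phase-commit pattern follows: side-effect writes are buffered in Flink state during processing (phase one) and pushed to the external system only after the checkpoint containing them completes (phase two). ExternalClient and its send() method are hypothetical stand-ins for whatever system is being written to:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.CheckpointListener;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class BufferingSink extends RichSinkFunction<String>
        implements CheckpointedFunction, CheckpointListener {

    private final List<String> pending = new ArrayList<>();
    private transient ListState<String> checkpointed;

    @Override
    public void invoke(String value, Context ctx) {
        pending.add(value);                      // phase 1: buffer, no side effect yet
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        checkpointed.update(pending);            // persist the buffer with the checkpoint
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        pending.forEach(ExternalClient::send);   // phase 2: commit after checkpoint success
        pending.clear();                         // (ExternalClient is hypothetical)
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        checkpointed = ctx.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("pending", String.class));
        if (ctx.isRestored()) {
            for (String v : checkpointed.get()) {
                pending.add(v);                  // replay uncommitted writes after recovery
            }
        }
    }
}
```

Note that notifyCheckpointComplete can itself fail midway through the commit, so for strict Exactly-Once the external write should additionally be idempotent or transactional.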
Processing Time
Taking the current time as an input to an algorithm is a common operation, but referencing the current system time, i.e. Processing Time, is the most common and hardest-to-avoid source of non-determinism in real-time computing. References to Processing Time can be obvious and well documented, such as Flink's Time Characteristic, but they can also be entirely unexpected, hiding in everyday techniques such as caching. Several common sources of Processing Time references are summarized below:
1. Flink's Time Characteristic. The Time Characteristic affects all time-related operators. For example, under Processing Time, window aggregation uses the current system time to assign windows and trigger computation, which introduces non-determinism; processing-time Timers have a similar effect.
2. Direct access to external storage inside a function. Such an access reads the external store's state as of a particular Processing Time instant, and that state may well have changed by the next access, leading to non-determinism. To obtain a deterministic result, instead of simply querying the external store's state at a point in time, we should obtain the history of its state changes and look up the state corresponding to the current Event Time. This is also the implementation principle of Temporal Table Join in Flink SQL [1].
3. Caching of external data. When processing high-traffic data, users often add a cache to reduce the load on external storage, but the cache may return inconsistent query results, and this inconsistency is non-deterministic. Whether the eviction strategy is directly tied to system time, such as a timeout threshold or LRU (Least Recently Used), or not, such as FIFO (First In First Out) or LFU (Least Frequently Used), the result of a cache access generally depends on the arrival order of messages, which is hard to guarantee downstream of a shuffle (Embarrassingly Parallel jobs without shuffles being the exception).
4. Flink's StateTTL. StateTTL is Flink's built-in mechanism for automatically cleaning up state based on time, and it currently supports only Processing Time, regardless of whether the job uses Processing Time or Event Time as its Time Characteristic; Event Time support can be tracked via FLINK-12005 [2]. A minimal configuration sketch follows this list.
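As a minimal sketch (the state name and the 24-hour TTL are arbitrary example values), this is how StateTTL is attached to a state descriptor; note that the expiry clock is wall-clock Processing Time, which is exactly the non-determinism discussed above:

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.hours(24))   // expire an entry 24h after its last write
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .build();

ValueStateDescriptor<Long> desc = new ValueStateDescriptor<>("last-value", Long.class);
desc.enableTimeToLive(ttlConfig);     // the TTL clock is Processing Time
```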
In general, it is very hard to avoid the impact of Processing Time completely, but slight non-determinism is usually acceptable to the business. What we should do, then, is anticipate the possible impact in advance and keep the non-determinism under control.
Watermark
As the mechanism that drives Event Time computation, one of the most important uses of the Watermark is deciding when to emit real-time results, playing a role similar to the end-of-file marker (EOF) in offline batch computing. However, late data may still arrive after results have been emitted; this is known as the window completeness problem (Window Completeness).
The window completeness problem is unavoidable; the only options are to update the emitted results later or to discard the late data. Offline scenarios tolerate high latency, so an offline job can simply wait for a while and include as much late data as possible in the computation. Real-time scenarios demand low latency, so the common practice is to keep window state for a period after emitting the results, continuously updating the results with late data during that period (the Allowed Lateness, configured as sketched below), and discarding the data once it expires. Given this positioning, real-time computing naturally discards more late data than offline computing does, and how much is discarded depends closely on the Watermark generation algorithm.
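As a hedged sketch (the window size, lateness and field names are made up; MyEvent and lateTag are hypothetical), Allowed Lateness and a side output for still-discarded records look like this in the DataStream API:

```java
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

// Anonymous subclass so the generic type is captured at runtime.
OutputTag<MyEvent> lateTag = new OutputTag<MyEvent>("too-late") {};

stream.keyBy(MyEvent::getKey)
      .window(TumblingEventTimeWindows.of(Time.minutes(10)))
      .allowedLateness(Time.minutes(5))   // keep window state 5 extra minutes
      .sideOutputLateData(lateTag)        // capture what is discarded anyway
      .sum("count");
```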
Although Watermarks are generated continuously as the stream flows, they are emitted at discrete points. Flink has two Watermark emission strategies, Periodic and Punctuated: the former fires on a Processing Time interval, the latter on special marker messages in the data stream.
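As a hedged sketch of the two strategies using the WatermarkGenerator interface from newer Flink releases (MyEvent and its getEventTime() and isWatermarkMarker() methods are hypothetical):

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

// Periodic: watermarks are emitted on a processing-time interval, so replaying
// history compresses many more events into each interval (the effect in Figure 1).
WatermarkStrategy<MyEvent> periodic =
        WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((event, ts) -> event.getEventTime());

// Punctuated: watermarks are emitted only when a marker record appears in the
// stream, so the emitted watermarks are the same live or on replay.
class PunctuatedGenerator implements WatermarkGenerator<MyEvent> {
    @Override
    public void onEvent(MyEvent event, long eventTimestamp, WatermarkOutput output) {
        if (event.isWatermarkMarker()) {
            output.emitWatermark(new Watermark(eventTimestamp));
        }
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        // intentionally empty: no processing-time-driven emission
    }
}
```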
Figure 1. Periodic Watermark under normal operation and during data replay
The Processing-Time-based Periodic Watermark is non-deterministic. When traffic is steady, the Watermark advances in a staircase pattern (see Figure 1(a)). However, when replaying historical data, far more data is processed within the same span of system time (see Figure 1(b)), often accompanied by Event Time skew (that is, the Event Time of some Sources advances noticeably faster or slower than others, so the overall Watermark, which takes the minimum, is held back by the slowest one). As a result, data that would originally have been discarded as late may now fall within the Allowed Lateness (see the red elements in Figure 1).
Figure 2. Punctuated Watermark under normal operation and during data replay
By contrast, the Punctuated Watermark is more stable: the emitted Watermarks are consistent both under normal operation (see Figure 2(a)) and during data replay (see Figure 2(b)). However, the risk of Event Time skew remains. For this, the Flink community drafted FLIP-27 [4], whose basic idea is to let the Source node selectively consume or block a partition/shard so that the Event Times across partitions stay close to each other.
Besides the non-determinism of Watermark emission, another problem is that Watermarks are not yet included in Checkpoint snapshots. This means that after a job recovers from a Checkpoint, the Watermark is recomputed from scratch, which makes it non-deterministic. This problem is tracked in FLINK-5601 [5], which currently only covers the Watermark of Window operators; once StateTTL supports Event Time, every operator may need to record its own Watermark.
To sum up, it is currently hard to make Watermarks fully deterministic. However, Watermark non-determinism only leads to non-deterministic results through the discarding of late data; as long as no late data is discarded, the final result is the same no matter how the intermediate Watermarks vary.
Summary
The lack of determinism has been the main obstacle to applying real-time computing in critical business scenarios, but the industry now has the theoretical foundation to solve it; what remains are iterations of the computing frameworks and engineering practice. When developing Flink real-time applications, we need to watch for the non-determinism introduced by delivery semantics, function side effects, Processing Time and Watermarks.