This article walks through some common myths about stream processing and explains how Apache Flink addresses them. I hope you find it useful.
Myth 1: There's no streaming without batch (the Lambda architecture)
The "Lambda architecture" was a useful design pattern in the early days of Apache Storm and other stream processing projects. It consists of a "speed layer" and a "batch layer".
Two separate layers were used because stream processing in the Lambda architecture could only compute approximate results (that is, if anything went wrong along the way, the result could not be trusted) and could only handle relatively small volumes of events.
Early versions of Storm had these problems, but today many open source stream processing frameworks are fault tolerant: they produce accurate results in the presence of failures and offer high throughput. So there is no need to maintain a multi-layer architecture just to get "fast" and "accurate" results from separate layers. Modern stream processors such as Flink deliver both at once.
Fortunately, people talk about the Lambda architecture less and less, a sign that stream processing is maturing.
Myth 2: Latency and throughput: you can only pick one
Early open source stream processing frameworks were either "high throughput" or "low latency"; "massive and fast" was never a label an open source stream processing framework could claim.
But Flink (and possibly other frameworks) offers both high throughput and low latency, as benchmark results have shown.
Let's analyze this from the bottom up, starting at the hardware layer, together with a stream processing pipeline whose bottleneck is the network (many pipelines built with Flink have this bottleneck). There are no tradeoffs to make at the hardware layer, so the network is the main factor affecting both throughput and latency.
A well-designed software system should make full use of what the network can deliver without introducing bottlenecks of its own. For Flink, there is always room for optimization to bring it closer to the performance the hardware can provide. With a cluster of 10 nodes, Flink can process millions of events per second, and when scaled to 1000 nodes its latency can drop to tens of milliseconds. In our view, this level is already far beyond many existing solutions.
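Flink even exposes this latency/throughput tradeoff as a single knob. Here is a minimal sketch (our own illustration, not from the original article) using the network buffer timeout: how long a partially filled network buffer may wait before being flushed downstream. A small timeout favors latency; a large one favors throughput. The values are made up.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BufferTimeoutExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Flush network buffers at most every 5 ms: lower latency,
        // at the cost of sending smaller (less efficient) buffers.
        env.setBufferTimeout(5);

        // A timeout of -1 would flush only when a buffer is full:
        // maximum throughput, latency bounded by how fast buffers fill.
        // env.setBufferTimeout(-1);

        env.fromElements(1, 2, 3)
           .map(x -> x * 2)
           .print();

        env.execute("buffer-timeout-demo");
    }
}
```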
Myth 3: Micro-batching means better throughput
We can discuss performance from another perspective, but first let's clarify two concepts that are easy to confuse:
Micro-batching
Micro-batching builds on traditional batching and is an execution or programming model for processing data: with this technique, a process or task treats a stream as a sequence of small batches or blocks of data.
Buffering
Buffering is a technique for optimizing access to the network, disk, and cache. Wikipedia defines a buffer as "a region of memory used to temporarily store data while it is being moved from one place to another".
The third myth, then, claims that a data processing framework using micro-batches can achieve higher throughput than a framework that processes one event at a time, because micro-batches travel more efficiently over the network.
This claim ignores the fact that streaming frameworks do not rely on any batching at the programming-model level; they only use buffering at the physical level.
Flink does buffer data, meaning it sends groups of processed records over the network rather than one record at a time. Not buffering would be undesirable for performance, since sending records one by one over the network brings no performance benefit. So we have to admit that at the physical level there is no such thing as one record at a time.
However, buffering is purely a performance optimization. Therefore, buffering:
Is invisible to the user
Should not affect the semantics of the system
Should not impose artificial boundaries
Should not limit the functionality of the system
So for Flink users, the programs they write handle each record individually, because Flink hides the details of using buffers to improve performance.
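As a concrete illustration of this record-at-a-time programming model, here is a minimal sketch (our own, not from the original article) of a Flink job whose user code sees exactly one record per invocation, while the runtime buffers records on the wire underneath:

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class RecordAtATime {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("to be or not to be", "that is the question")
           // flatMap is invoked once per record; no micro-batch is
           // visible to user code, even though Flink buffers records
           // on the network underneath.
           .flatMap(new FlatMapFunction<String, String>() {
               @Override
               public void flatMap(String line, Collector<String> out) {
                   for (String word : line.split(" ")) {
                       out.collect(word);
                   }
               }
           })
           .print();

        env.execute("record-at-a-time-demo");
    }
}
```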
In fact, using micro-batches for task scheduling adds overhead, and if you shrink the micro-batches to reduce latency, that cost only grows! A stream processor knows how to take advantage of buffering without incurring task-scheduling overhead.
Myth 4: Exactly once? Totally impossible.
This myth comes in several variants:
Fundamentally, exactly once is impossible
End-to-end exactly once is impossible
Exactly once has never been a real-world requirement
Exactly once comes at the expense of performance
For what it's worth, we don't mind that the notion of "exactly once" exists. "Exactly once" used to mean "exactly-once delivery", but the term is now used so loosely in streaming that it has become confusing and lost its original meaning. The related concepts, however, are still very important, and we are not going to skip them.
To be as precise as possible, we treat "exactly-once state" and "exactly-once delivery" as two different concepts, because the way these terms were used in the past led to confusion between them. Apache Storm used "at least once" to describe delivery (Storm did not support state), while Apache Samza used "at least once" to describe application state.
Exactly-once state means that after a failure the application behaves as if the failure had never happened. For example, suppose we maintain a counter application; after a failure it must count neither more nor less. The term "exactly once" applies here because, as far as the application state is concerned, each message was processed exactly once.
Exactly-once delivery means that the receiver (a system outside the application) receives processed events after a failure as if the failure had never happened.
No stream processing framework can guarantee exactly-once delivery under all circumstances, but exactly-once state is achievable. Flink achieves exactly-once state without a significant performance impact, and it can also provide exactly-once delivery to sinks that participate in Flink's checkpointing.
Flink checkpoints are snapshots of the application's state, and Flink takes these snapshots asynchronously at regular intervals. This is why Flink can still guarantee exactly-once state in the event of a failure: Flink periodically records the read position of the input stream together with the state of each operator. On failure, Flink rolls back to the previous checkpoint and restarts the computation. So although records may be reprocessed, the result is as if each record had been processed only once.
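As a minimal sketch of how this looks in user code (our own illustration; the interval and pipeline are made up), enabling periodic checkpoints in exactly-once mode is a couple of lines:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take an asynchronous snapshot of all operator state and
        // input read positions every 10 seconds.
        env.enableCheckpointing(10_000);

        // Exactly-once state is the default; spelled out for clarity.
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        // A toy pipeline so the job is runnable end to end.
        env.fromElements(1, 2, 3)
           .map(x -> x + 1)
           .print();

        env.execute("checkpointing-demo");
    }
}
```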
What about end-to-end exactly-once processing? It is achievable when checkpointing is combined with a transaction coordination mechanism in an appropriate way, in other words, when the source and sink operations participate in checkpointing. Within the framework, results are exactly once; end to end they are exactly once, or "effectively exactly once". For example, when Flink is used with Kafka as the data source and rolling files on HDFS as the sink, the pipeline from Kafka to HDFS is end-to-end exactly once. Similarly, with Kafka as the Flink source and Cassandra as the Flink sink, end-to-end exactly-once processing can be achieved if the updates to Cassandra are idempotent.
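As one concrete illustration with newer Flink releases (our own sketch, not from the original article; the broker address and topic name are placeholders), the Kafka sink can be asked for transactional, exactly-once delivery that commits with Flink's checkpoints:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

public class ExactlyOnceKafkaSink {
    // Builds a Kafka sink that participates in Flink's checkpoints:
    // records are written inside Kafka transactions that commit when
    // a checkpoint completes, giving end-to-end exactly once for
    // downstream readers using isolation.level=read_committed.
    static KafkaSink<String> build() {
        return KafkaSink.<String>builder()
                .setBootstrapServers("broker:9092")        // placeholder
                .setRecordSerializer(
                        KafkaRecordSerializationSchema.builder()
                                .setTopic("output-topic")  // placeholder
                                .setValueSerializationSchema(new SimpleStringSchema())
                                .build())
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                // Required for EXACTLY_ONCE: prefix for the Kafka
                // transactional ids reused across restarts.
                .setTransactionalIdPrefix("flink-demo")
                .build();
    }
}
```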
It is worth mentioning that with Flink savepoints, checkpoints also gain a mechanism for versioning state. Using savepoints, you can "move through time" while keeping state consistent, which makes it easy to upgrade code, do maintenance, migrate, debug, and run simulations.
Myth 5: Streams can only be used in "real-time" scenarios
This myth includes several points:
"I don't have a low-latency application, so I don't need a stream processor."
"Stream processing is only related to transitional data before persistence."
"We need batch processors to perform clunky offline calculations."
It is time to think about the relationship between the type of dataset and the processing model.
First, there are two kinds of datasets:
Unbounded: data that is produced continuously, with no predefined end
Bounded: finite, complete data
Many real-world datasets are unbounded, whether they live in files, in HDFS directories, or in systems such as Kafka. Some examples:
User interactions from mobile devices or websites
Metrics provided by physical sensors
Financial market data
Machine log data
In fact, truly bounded datasets are hard to find in the real world. The locations of all of a company's buildings might be bounded, but even that changes as the business grows.
Second, there are two processing models:
Stream: processing continues for as long as data is being produced
Batch: processing ends in finite time, and resources are released
Let's go one step further and distinguish two kinds of unbounded datasets: continuous streams and intermittent streams.
It is entirely possible to apply either model to either kind of dataset, although it is not always good practice. For example, the batch processing model has long been applied to unbounded datasets, especially intermittent ones. The reality is that most "batch" jobs are run by a scheduler and process only a small slice of an unbounded dataset at a time. This means the unbounded nature of the data gets pushed onto somebody, namely, whoever operates the ingestion pipeline.
Batches are supposedly stateless, their output depending only on their input. In reality, batch jobs do keep state internally (reducers, for example, often keep state), but that state is confined to the batch boundary and does not flow between batches.
This "state within batch boundaries" becomes important when someone tries to implement something like an event-timestamp time window, a common tool when dealing with unbounded datasets.
A batch processor working on an unbounded dataset will inevitably run into late events (due to upstream delays), which can leave the data in a batch incomplete. Note that we assume time windows based on event timestamps, because event timestamps are the most accurate model of reality. Late data is a problem when executing batches, and even simple time-window fixes (such as tumbling or sliding windows) don't solve it, especially if session windows are used.
Because the data needed to complete a computation will not all sit in the same batch, it is hard to guarantee correct results when processing an unbounded dataset in batches. At minimum, it takes extra overhead to handle late data and to carry state across batches (either waiting until all the data has arrived before processing, or reprocessing batches).
Flink has a built-in mechanism for dealing with late data: late data is treated as a normal property of real-world unbounded data, so Flink's stream processor was designed to handle it.
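To make this concrete, here is a minimal sketch (our own; the keys, timestamps, and durations are made up) of an event-time window in Flink that tolerates out-of-order and late events via watermarks and allowed lateness:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class LateDataExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (key, epoch-millis timestamp) pairs; in practice this would
        // come from Kafka or another unbounded source.
        DataStream<Tuple2<String, Long>> events = env.fromElements(
                Tuple2.of("sensor-1", 1_000L),
                Tuple2.of("sensor-1", 61_000L),
                Tuple2.of("sensor-1", 2_000L));   // an out-of-order event

        events
            // Watermarks: tolerate events up to 5 seconds out of order.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((e, ts) -> e.f1))
            .keyBy(e -> e.f0)
            // One-minute tumbling windows in event time.
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            // Keep window state around and re-fire for events that
            // arrive up to 30 seconds after the watermark has passed.
            .allowedLateness(Time.seconds(30))
            .sum(1)
            .print();

        env.execute("late-data-demo");
    }
}
```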
A stateful stream processor is the better fit for unbounded datasets, regardless of whether the data is produced continuously or intermittently. Low latency is just the icing on the cake.
Myth 6: Streams are complicated anyway
This is the last myth. You might be thinking: "The theory is all well and good, but I still won't adopt streaming, because...":
Streaming frameworks are hard to master
With streams, time windows, event timestamps, and triggers are hard to get right
Streams need to be combined with batches, and I already know how to use batches, so why bother with streams?
We never set out to talk you into using streams, even though we think streaming is a cool thing. We believe that whether to use streams depends entirely on the characteristics of your data and your code.
Before deciding, ask yourself: "What kind of dataset am I dealing with?"
Unbounded (user activity data, logs, sensor data)
Bounded
Then ask another question: "which part changes most frequently?"
Code changes more frequently than data
Data changes more frequently than code
When the data changes more frequently than the code, for example when a relatively fixed query runs over an ever-changing dataset, you have a streaming problem.
So before concluding that streaming is "complicated", you may have been solving a streaming problem without realizing it! Perhaps you have relied on hourly batch job scheduling, with others on the team creating and operating those batches. (In that case your results may be inaccurate, and you may not realize that this comes from the batch-boundary and state issues described earlier.)
The Flink community has been working for a long time to provide a set of APIs that encapsulate this time and state complexity. Event timestamps are easy to handle in Flink: define a time window and a function that extracts timestamps and watermarks (called only once per stream). Handling state is also simple, akin to defining a Java variable and registering it with Flink. And with Flink's StreamSQL you can run SQL queries over a never-ending stream.
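As a minimal sketch of that "register a variable with Flink" idea (our own example; the names are made up), here is a per-key counter kept in Flink-managed state, which the checkpointing mechanism discussed earlier makes exactly once:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class CountPerKey extends KeyedProcessFunction<String, String, Long> {

    // Like an ordinary Java field, but registered with Flink, so it
    // is scoped per key, checkpointed, and restored after a failure.
    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<Long> out) throws Exception {
        Long current = count.value();          // null on first access
        long next = (current == null ? 0L : current) + 1;
        count.update(next);
        out.collect(next);
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000);       // makes the state exactly once

        env.fromElements("a", "b", "a", "a", "b")
           .keyBy(s -> s)
           .process(new CountPerKey())
           .print();

        env.execute("count-per-key-demo");
    }
}
```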
One more thing: what if the code changes more frequently than the data? In that case, we think you have an exploration problem, and iterating with a notebook or a similar tool may be the right way to tackle it.
Once the code has stabilized, you will still face a streaming problem. We recommend adopting a long-term solution to that streaming problem from the start.
The future of stream processing
As stream processing matures and these myths fade, we see streaming expanding into areas beyond analytical applications. As discussed, the real world produces data continuously.
The traditional approach interrupts this continuous data, because it has to be aggregated into a central location or chopped into batches for applications to consume.
Stream processing patterns such as CQRS are growing in popularity: applications can be built directly on continuous data streams, keeping state locally, isolating applications and teams better, and handling time-based data better.
As Flink continues to evolve and is adopted by more and more enterprises, we believe it will not only simplify analytics pipelines but also give us more powerful computing models.