2025-01-16 · SLTechnology News & Howtos
Author: Zhang Xinyu
From the perspectives of data transmission and data reliability, this article compares the stream-processing performance of Storm and Flink, analyzes the test results, and offers suggestions for improving performance when using Flink.
Apache Storm, Apache Spark, and Apache Flink are all active distributed computing platforms in the open-source community, and many companies use two or even all three of them at the same time. For real-time computing, the underlying engines of Storm and Flink are stream-based: they essentially process records one at a time in a pipeline mode, meaning all processing stages run concurrently and data flows between them. Spark, by contrast, processes data in small batches; the processing logic does not run until a batch of data is ready. In this article we test and compare Storm and Flink, the two stream-based engines.
Before testing, we surveyed existing big-data platform performance reports, such as Yahoo's streaming-benchmarks and Intel's HiBench, as well as a number of papers that test distributed computing platforms from different angles. Although each benchmark has a different emphasis, they all use the same two metrics: throughput and latency. Throughput is the amount of data processed per unit time and can be raised by increasing parallelism. Latency is the time required to process a single record and generally trades off against throughput.
When designing the computation logic, we first consider the computational model of stream processing. The figure above shows a simple stream-computing model: data is emitted by the Source, sent downstream to a Task, processed in the Task, and finally output. Under this model, latency has three components: data transmission time, Task computation time, and data queuing time. We assume resources are sufficient so that data does not queue; latency then consists only of transmission time and computation time. The time spent inside a Task depends heavily on user logic, so for a computing platform, data transmission time better reflects the platform's own capability. Therefore, to better expose data-transmission capability, our test cases put no computation logic in the Task.
For the data source, we mainly generate data directly inside the process, as many earlier benchmarks do, because in-process generation is not constrained by the performance of an external data-source system. But since most of our production real-time data comes from Kafka, we also added tests that read data from Kafka.
There are two types of data transfer: inter-process data transfer and intra-process data transfer.
Inter-process data transfer means the data goes through three steps: serialization, network transmission, and deserialization. In Flink, when two processing logics are deployed on different TaskManagers, the transfer between them is inter-process; Flink's network transport is built on Netty. In Storm, inter-process transfer is the transfer between workers; earlier Storm versions used ZeroMQ for network transport, and current versions use Netty.
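The three inter-process steps can be sketched with plain Java serialization. This is a toy model, not Flink's or Storm's actual wire format, and `Event` is a hypothetical record type invented for illustration; the byte array stands in for what would cross the network.

```java
import java.io.*;

// Toy model of the three inter-process steps: serialize -> (network) -> deserialize.
public class InterProcessTransfer {
    static class Event implements Serializable {
        final long id;
        final String payload;
        Event(long id, String payload) { this.id = id; this.payload = payload; }
    }

    static byte[] serialize(Event e) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(e);                    // step 1: serialization
        }
        return bos.toByteArray();                  // step 2: these bytes cross the network
    }

    static Event deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Event) ois.readObject();       // step 3: deserialization
        }
    }

    public static void main(String[] args) throws Exception {
        Event sent = new Event(42, "hello");
        Event received = deserialize(serialize(sent)); // round trip, minus the wire
        System.out.println(received.id + ":" + received.payload);
    }
}
```

Every record pays this serialize/deserialize cost in the inter-process case, which is exactly the overhead the intra-process path avoids.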
Intra-process data transfer happens when the two processing logics live in the same process. In Flink, the two logics are chained together and data is passed by direct method calls within a single thread. In Storm, the two logics run as two threads that exchange data through a shared queue.
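The two intra-process styles described above can be contrasted in a few lines of plain Java. This is a simplified model, not real Flink or Storm code: the chained style is a direct method call in one thread, while the Storm style hands the record between two threads through a shared queue.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Simplified model: Flink chains operators into one thread (plain method call);
// Storm runs the two logics as two threads joined by a shared queue.
public class InProcessTransfer {
    static String mapFunction(String in) { return in.toUpperCase(); }

    public static void main(String[] args) throws Exception {
        // Flink-style: chained operators, one thread, a direct method call.
        String chained = mapFunction("record-1");

        // Storm-style: producer and consumer threads with a shared queue.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        Thread producer = new Thread(() -> queue.offer("record-1"));
        producer.start();
        producer.join();
        String queued = mapFunction(queue.take());

        System.out.println(chained + " " + queued);
    }
}
```

Both paths produce the same result, but the queue hand-off adds synchronization and a thread switch per record, which is part of why chained transfer is faster.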
Storm and Flink each have their own reliability mechanism. Storm uses an ACK mechanism to guarantee data reliability; Flink uses its checkpoint mechanism, which is derived from the Chandy-Lamport algorithm.
In practice, an exactly-once guarantee also depends on the processing logic and on how results are output. For example, if results are written to Kafka and the output cannot be rolled back, exactly-once cannot be guaranteed. We therefore test two reliability strategies: at-least-once and no guarantee.
The figure above shows the test environment and the version of each platform.
The figure above shows Flink's throughput with self-generated data for different transfer modes and reliability settings: intra-process + unreliable, intra-process + reliable, inter-process + unreliable, and inter-process + reliable. Intra-process transfer achieves 3.8 times the throughput of inter-process transfer, while enabling checkpointing has little effect on throughput. So when using Flink, prefer intra-process transfer: chain operators together whenever possible.
Next, let's look at why chaining performs so much better, and how to write Flink code so that operators chain up and use intra-process data transfer.
As you know, Flink code must create an env, and calling env's disableOperatorChaining() method makes all operators unchained. We usually call this method during debugging, since it makes problems easier to isolate.
If chaining is allowed, the Source and MapFunction in the figure above are chained together and placed in one Task to compute; otherwise, they are placed in two Tasks.
When two operators are not chained, they sit in two different Tasks, and data transfer between them works like this: the SourceFunction produces a record and serializes it into memory, the bytes are transmitted over the network to the process where the MapFunction runs, and that process deserializes the data before using it.
When two operators are chained, they sit in the same Task: the SourceFunction produces a record, makes a deep copy of it, and the MapFunction takes the deep-copied object as its input.
Although Flink has heavily optimized serialization, performance is still much worse than intra-process transfer, which needs neither serialization nor network transmission. So we chain operators together whenever possible.
Not every pair of operators can be chained; several conditions must all hold:
First, the downstream operator accepts exactly one upstream data stream; for example, a Map cannot be chained onto the result of a union.
Second, upstream and downstream have the same parallelism.
Third, both operators use the same resource group (by default both are "default", so this holds unless changed).
Fourth, disableOperatorChaining() has not been called on env.
Finally, the upstream operator sends data with the Forward strategy; that is, no rebalance(), keyBy(), broadcast(), etc. is called between the two operators.
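The conditions above can be summed up as a single predicate. The helper below is hypothetical, written only to mirror the list; the real decision lives inside Flink's JobGraph generation, not in user code.

```java
// Hypothetical predicate mirroring the chaining conditions listed above.
public class ChainCheck {
    static boolean canChain(int downstreamInputs, int upstreamParallelism,
                            int downstreamParallelism, String upstreamSlotGroup,
                            String downstreamSlotGroup, boolean chainingDisabled,
                            boolean forwardPartitioner) {
        return downstreamInputs == 1                            // no union feeding the downstream operator
            && upstreamParallelism == downstreamParallelism     // same parallelism up and down
            && upstreamSlotGroup.equals(downstreamSlotGroup)    // same resource group
            && !chainingDisabled                                // disableOperatorChaining() not called
            && forwardPartitioner;                              // Forward, not rebalance/keyBy/broadcast
    }

    public static void main(String[] args) {
        // Source -> Map, parallelism 4 on both sides, defaults everywhere: chains.
        System.out.println(canChain(1, 4, 4, "default", "default", false, true));
        // Same pipeline but parallelism 4 -> 8: cannot chain.
        System.out.println(canChain(1, 4, 8, "default", "default", false, true));
    }
}
```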
Now compare the throughput of Flink and Storm with self-generated data, intra-process communication, and no reliability guarantee. In this setting Flink outperforms Storm by 15x, with Flink's throughput reaching 20.6 million records/s. Moreover, if env.getConfig().enableObjectReuse() is called at development time, Flink's throughput can reach 40.9 million records/s.
When enableObjectReuse() is called, Flink skips all the intermediate deep copies and feeds the object produced by the SourceFunction directly into the MapFunction. Note, however, that this method cannot be used casually: you must ensure either that there is only one downstream Function, or that no downstream Function mutates the object; otherwise thread-safety and correctness issues may arise.
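The hazard can be shown without Flink at all. In the sketch below (names are illustrative, not Flink API), a "source" reuses one mutable object instead of copying it per record; any downstream consumer that stores a reference ends up seeing only the last value written.

```java
import java.util.ArrayList;
import java.util.List;

// Why object reuse is unsafe when downstream keeps or mutates the record:
// with reuse, every consumer sees the same object instance.
public class ObjectReuseHazard {
    static class Event { long value; Event(long v) { value = v; } }

    public static void main(String[] args) {
        Event reused = new Event(0);
        List<Event> sink = new ArrayList<>();

        // The "source" emits three records, but with reuse it rewrites one object.
        for (long v = 1; v <= 3; v++) {
            reused.value = v;      // reuse: overwrite in place instead of copying
            sink.add(reused);      // downstream keeps a reference, not a copy
        }

        // All stored records now show the last value -- the corruption.
        System.out.println(sink.get(0).value + " " + sink.get(1).value + " " + sink.get(2).value);
    }
}
```

With per-record deep copies (Flink's default), the sink would hold 1, 2, 3 as expected; object reuse trades that safety for the copy's cost.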
Comparing Flink and Storm under the different reliability strategies, we find that guaranteeing reliability has very little impact on Flink but a very large impact on Storm. Overall, Flink's single-concurrency throughput is 15 times Storm's when reliability is guaranteed, and 66 times Storm's when it is not. This is primarily because Flink and Storm use different mechanisms to ensure data reliability.
Storm's ACK mechanism costs more to ensure data reliability.
The diagram on the left shows Storm's ACK mechanism. Each time the Spout sends a record to a Bolt, it also sends an ACK message to the ACKer; when the Bolt finishes processing the record, it sends an ACK to the ACKer as well; and once the ACKer has received all the ACKs for that record, it replies to the Spout with a final ACK. So in a topology consisting of just one Spout and one Bolt, three ACK messages are transmitted for every record sent. These ACK messages are the overhead required to guarantee reliability.
The diagram on the right shows Flink's checkpoint mechanism. In Flink, checkpoints are initiated by the JobManager. Unlike Storm, there is no per-message ACK overhead; the overhead is paid per unit of time instead. Users set the checkpoint frequency, for example one checkpoint every 10 seconds, and each checkpoint costs only one checkpoint message flowing from Source to map (the checkpoint messages sent by the JobManager travel on the control flow, separate from the data flow). Flink's reliability overhead is therefore much lower than Storm's, which is why guaranteeing reliability affects Flink's performance far less.
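The scale difference between the two mechanisms is easy to put in numbers. The sketch below counts control messages over one 10-second window for the Spout + Bolt topology described above; the 1 million records/s rate is an illustrative figure, not a measurement from the tests.

```java
// Back-of-the-envelope: control-message volume over a 10-second window.
// Storm pays 3 ACK messages per record (Spout->ACKer, Bolt->ACKer, ACKer->Spout);
// Flink pays one checkpoint message per interval flowing Source -> map.
public class ReliabilityOverhead {
    public static void main(String[] args) {
        long recordsPerSecond = 1_000_000L;  // illustrative throughput
        long windowSeconds = 10;             // one checkpoint interval

        long stormAcks = recordsPerSecond * windowSeconds * 3;
        long flinkCheckpointMsgs = 1;

        System.out.println(stormAcks + " vs " + flinkCheckpointMsgs);
    }
}
```

At this rate Storm sends tens of millions of ACKs in the time Flink sends a single checkpoint message, which matches the observation that reliability barely moves Flink's throughput but heavily dents Storm's.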
The final comparison on self-generated data covers inter-process transfer. With inter-process data transfer and no reliability guarantee, Flink's single-concurrency throughput is 4.7 times Storm's; with reliability guaranteed, it is 14 times Storm's.
The figure above shows the throughput of Storm and Flink when consuming data from Kafka. Since the data comes from Kafka, throughput is inevitably affected by Kafka itself. We found the performance bottleneck was in the SourceFunction, so we increased the partition count of the topic and the parallelism of the source threads reading the data, while keeping the MapFunction parallelism at 1. In this setting, Flink's bottleneck moved downstream to where the data is deserialized; Storm's bottleneck likewise lies in downstream deserialization.
The analysis above evaluated the Flink and Storm computing platforms purely from the perspectives of data transmission and data reliability. In real use, however, tasks contain computation logic, which inevitably brings CPU, memory, and other resource concerns into play. In the future, we plan to build an intelligent analysis platform that examines the performance of users' jobs: using the collected metrics, it will identify a job's bottleneck and give optimization suggestions.