A Comparison of the Big Data Processing Engines Spark and Flink

2025-01-28 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article compares the big data processing engines Spark and Flink. The content is concise and easy to understand, and I hope you can learn something from the detailed introduction below.

The Next-Generation Big Data Computing Engine

Once the demand for data processing exceeded what traditional databases could handle effectively, a variety of massive-data processing systems based on MapReduce, such as Hadoop, emerged as the times required. Since Google published the MapReduce paper in 2004, after nearly ten years of development, massive data processing based on the Hadoop open-source ecosystem or corresponding systems has become a basic need of the industry.

However, many organizations find that they face a series of problems when building their own data processing systems, and that extracting value from the data requires much more investment than expected. Common problems include:

A very steep learning curve. Newcomers to this field are often stunned by the number of technologies they need to learn. Unlike database systems, which have matured over decades, a system in the big data ecosystem such as Hadoop is typically good at some data processing scenarios, passable at others, and entirely unable to meet the needs of still others. The result is that several systems are needed to handle different scenarios.

(source: https://mapr.com/developercentral/lambda-architecture/)

The figure above shows a typical lambda architecture. It covers only the batch and stream processing scenarios, yet already involves at least four or five technologies, not counting the alternatives for each. Add real-time query, interactive analysis, machine learning, and other scenarios, each with several technologies to choose from and with overlapping coverage, and a business often ends up using four or five or more technologies to support a complete data processing pipeline. Counting the work of researching and selecting among them, there is far more to learn.

The picture below is a panoramic view of the big data field. Dizzy yet?

2018 big data and AI Panorama (Source: http://mattturck.com/bigdata2018/)

Low development and operational efficiency. Because multiple systems are involved, each with its own development language and tools, development efficiency suffers. Because multiple systems are used, data must be transferred between them, incurring extra development and operational costs, and data consistency is hard to guarantee. In many organizations, more than half of the development effort is actually spent moving data between systems.

Complex operations and maintenance. Each of the multiple systems needs its own operations and maintenance, which not only raises costs but also increases the chance of system problems.

Data quality is hard to guarantee. Data problems are difficult to track down and resolve.

Finally, there are people problems. In many organizations, because of the complexity of the system, the support and use of each subsystem falls to different departments.

With these issues in mind, it is easier to understand why Spark became so popular around 2014. At the time, Spark not only delivered tens to hundreds of times the performance of Hadoop MapReduce in some scenarios, but also proposed a unified engine supporting batch processing, streaming, interactive query, machine learning, and other common data processing scenarios. Having seen a Spark demo complete all of the above scenarios in a single notebook, compared with the previous style of data pipeline development, many developers found the choice easy. After several years of development, Spark is now seen as having completely replaced the MapReduce engine in Hadoop.

While Spark was developing rapidly, around 2016, Flink entered public view and gradually became widely known. Why? After people started using Spark, they found that although it supports a variety of common scenarios, not all of them work equally well; real-time processing of data streams is one of its weaker areas. With a better stream processing engine that also supports the various other processing scenarios, Flink became a powerful challenger to Spark.

How do Spark and Flink achieve this, and what are the similarities and differences between them? Let's take a detailed look.

Engine Technology of Spark and Flink

This section focuses on the architecture of the Spark and Flink engines, with emphasis on the potential and limitations that each architecture brings. The maturity and limitations of the current implementations will be discussed in the later section on ecosystems.

Data model and processing model

To understand the engine features of Spark and Flink, start with the data model.

Spark's data model is the resilient distributed dataset, RDD (Resilient Distributed Datasets). Compared with the file-based model of MapReduce, RDD is a more abstract model that relies on lineage to ensure recoverability. In many cases this lets RDDs be implemented as distributed shared memory or be fully virtualized (that is, some intermediate-result RDDs can be optimized away entirely when downstream processing is fully local). This saves a lot of unnecessary I/O and was the main reason for early Spark's performance advantage.

Spark describes data processing as transformations (operators) on RDDs. Each operator, such as map, filter, or join, generates a new RDD, and all the operators together form a directed acyclic graph (DAG). Spark simply divides the edges into wide dependencies and narrow dependencies: where upstream and downstream data need no shuffle, the dependency is narrow, and the upstream and downstream operators can be processed continuously and locally within one stage; in this case the upstream result RDD can be omitted. The following figure shows the related basic concepts. More detailed introductions are easy to find online, so I won't spend more space on them here.
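To make the stage division concrete, here is a minimal sketch (not Spark's actual scheduler code) that groups a simplified, linear chain of operators into stages: narrow dependencies are pipelined together, and each wide (shuffle) dependency starts a new stage. The operator classification and `split_into_stages` helper are illustrative assumptions.

```python
# Conceptual sketch: how a DAG scheduler splits operators into stages
# at shuffle boundaries. Simplified to a linear chain of operators.

NARROW = {"map", "filter", "union"}            # no shuffle needed
WIDE = {"groupByKey", "reduceByKey", "join"}   # shuffle boundary

def split_into_stages(operators):
    """Pipeline narrow dependencies into one stage; start a new
    stage whenever a wide dependency (shuffle) is reached."""
    stages, current = [], []
    for op in operators:
        if op in WIDE and current:
            stages.append(current)   # close the stage before the shuffle
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

pipeline = ["map", "filter", "reduceByKey", "map", "join", "filter"]
print(split_into_stages(pipeline))
# [['map', 'filter'], ['reduceByKey', 'map'], ['join', 'filter']]
```

Within each resulting stage, the operators can run locally one after another, which is why the intermediate RDDs between them can be omitted.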

Spark DAG (Source: http://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/)

Flink's basic data model is the data stream, a sequence of events (Event). The data stream may be less intuitively familiar as a basic model than tables or data blocks, but it can be shown to be completely equivalent. A stream can be an unbounded stream, which is stream processing in the usual sense, or a finite, bounded stream, which is batch processing.

Flink describes data processing as transformations (operators) on data streams, with each operator generating a new stream. In terms of operators, the DAG, and the chaining of upstream and downstream operators, it is roughly equivalent to Spark. A vertex in Flink is roughly equivalent to a stage in Spark, and the division is basically the same as in the Spark DAG figure above.
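The stream model can be sketched with Python generators (an illustration, not Flink's API): each operator consumes events from its upstream and yields results one at a time, so the same pipeline logic handles a bounded stream (a batch) and an unbounded one. The operator names here are made up for the example.

```python
# Conceptual sketch: event-at-a-time stream processing modeled with
# generators. Each operator yields results as soon as an event arrives.

def source(events):
    for e in events:
        yield e

def map_op(stream, fn):
    for e in stream:
        yield fn(e)

def filter_op(stream, pred):
    for e in stream:
        if pred(e):
            yield e

# Chain operators; events flow through one at a time when pulled.
events = source([1, 2, 3, 4, 5])   # a finite (bounded) stream
pipeline = filter_op(map_op(events, lambda x: x * 10), lambda x: x > 20)
print(list(pipeline))  # [30, 40, 50]
```

Feeding the same pipeline an endless generator instead of a list would give stream processing in the usual sense, with no change to the operator logic.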

Flink Task Diagram (Source: https://ci.apache.org/projects/flink/flink-docs-release-1.5/concepts/runtime.html)

There is a significant difference between Spark and Flink in how the DAG is executed. In Flink's stream execution mode, the output of an event, once processed by one node, can be sent to the next node for immediate processing, so the execution engine introduces no additional latency; correspondingly, all nodes must run at the same time. Spark's micro-batches work like ordinary batch execution: the upstream stage is processed and its output collected before the downstream stage starts.

In Flink's stream execution mode, multiple events can also be transmitted or computed together to improve efficiency. But this is purely an execution-time optimization that each operator can decide on independently; it does not have to be bound to dataset boundaries as in batch models such as RDD, so it allows more flexible optimization while still meeting low-latency requirements.

Flink uses an asynchronous checkpoint mechanism to make task state recoverable and keep processing consistent, so the main processing path between the data source and the output does not need to persist data to disk, achieving higher performance and lower latency.
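The recovery idea can be sketched as follows. This is a deliberately simplified toy (not Flink's barrier-based algorithm): a running count periodically snapshots its state together with the source offset, and after a failure it restores the last snapshot and replays only the events after it. All names here (`run`, `checkpoint_every`, `crash_at`) are invented for the example.

```python
# Conceptual sketch: checkpoint (offset, state) periodically so a
# failed job can restore the snapshot and replay later events.

def run(events, checkpoint_every=3, crash_at=None, snapshot=None):
    """Sum events, snapshotting (offset, sum) every few events.
    Returns (final_sum, last_snapshot); final_sum is None on 'crash'."""
    offset, total = snapshot if snapshot else (0, 0)
    last_snapshot = (offset, total)
    for i in range(offset, len(events)):
        if crash_at is not None and i == crash_at:
            return None, last_snapshot        # simulate a failure
        total += events[i]
        if (i + 1) % checkpoint_every == 0:
            last_snapshot = (i + 1, total)    # durable snapshot
    return total, last_snapshot

events = [1, 1, 1, 1, 1, 1, 1]
result, snap = run(events, crash_at=5)        # crash mid-stream
assert result is None and snap == (3, 3)      # restart point
result, _ = run(events, snapshot=snap)        # recover and replay
print(result)  # 7 -- same answer as a failure-free run
```

The real mechanism injects checkpoint barriers into the stream and snapshots asynchronously, but the essential point is the same: state plus a source position is enough to recover consistently without persisting every intermediate result.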

Data processing scenarios

In addition to batch processing, Spark also supports real-time data stream processing, interactive query, machine learning, graph computing, and so on.

(source: https://databricks.com/spark/about)

The main difference between real-time stream processing and batch processing is the low-latency requirement. Because Spark's RDDs are memory-based, they can easily be cut into smaller chunks for processing; if these small chunks are processed fast enough, the effect is low latency.

For interactive query scenarios, if all the data fits in memory and is processed fast enough, interactive queries can be supported.

Machine learning and graph computing are, from the engine's point of view, just different types of RDD operators than in the previous scenarios. Spark provides libraries supporting the common operations, and users or third-party libraries can extend them. It is worth mentioning that Spark's RDD model fits well with the iterative computation of machine learning model training, which brought significant performance improvements in some scenarios from the very beginning.

As you can see, Spark is essentially memory-based batch processing that is faster than Hadoop MapReduce, and it implements the various scenarios by making batches fast enough.
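The "stream as small batches" idea can be sketched like this (an illustration, not Spark Streaming's code): the same batch function is applied to the stream chopped into small slices, and smaller slices mean lower latency. The helper names are invented for the example.

```python
# Conceptual sketch: micro-batching. A batch computation (sum per key)
# is reused for streaming by applying it to small slices of the stream.

def batch_sum(pairs):
    """Ordinary batch logic: sum values per key."""
    out = {}
    for k, v in pairs:
        out[k] = out.get(k, 0) + v
    return out

def micro_batches(stream, size):
    """Chop the stream into slices of `size` and run the batch
    logic on each, yielding one result per micro-batch."""
    for i in range(0, len(stream), size):
        yield batch_sum(stream[i:i + size])

stream = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
print(list(micro_batches(stream, size=2)))
# [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
```

Note that each micro-batch result stands alone; combining results across batches is exactly the state problem discussed in the stateful processing section below.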

(source: https://www.slideshare.net/ParisCarbone/state-management-in-apache-flink-consistent-stateful-distributed-stream-processing)

As mentioned earlier, in Flink, if the input data stream is bounded, the effect of batch processing follows naturally. The difference between stream and batch is purely logical, and the logic the user implements is exactly the same for both, which arguably makes for a cleaner abstraction. A more in-depth discussion follows later in the detailed comparison of stream computing.

Flink also provides libraries supporting machine learning, graph computing, and other scenarios; in this respect it is not much different from Spark.

One interesting point is that Flink's lower-level API makes it possible to implement some data-driven distributed services using only a Flink cluster. Some companies use Flink clusters to implement social networks, web crawlers, and other services. This reflects Flink's versatility as a computing engine and is helped by Flink's built-in flexible state support.

In general, both Spark and Flink aim to support most data processing scenarios on a single execution engine, and both should be able to do so. The main differences lie where the architecture itself imposes limits in certain scenarios, the most prominent being Spark Streaming's micro-batch execution mode. The Spark community is aware of this and has recently begun work on a continuous execution model (continuous processing); details follow later.

Stateful processing (Stateful Processing)

Another very distinctive feature of Flink is managed state, built into the engine. To understand managed state, start with stateful processing: if the result of processing an event (or a piece of data) depends only on the content of the event itself, that is stateless processing; conversely, if the result also depends on previously processed events, that is stateful processing. Any slightly more complex data processing, such as basic aggregation, is stateful. Flink has long held that stream processing cannot be done well without good state support, so it introduced managed state and exposed it through an API.
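The distinction can be shown in a few lines (an illustration only; the names are invented for the example): a stateless operator depends only on the current event, while a stateful one, such as a running sum, also depends on everything seen before and therefore must keep state between events.

```python
# Conceptual sketch: stateless vs stateful processing.

def stateless(event):
    return event * 2            # depends on this event only

class RunningSum:
    """A stateful operator: its output depends on all prior events."""
    def __init__(self):
        self.state = 0          # state the engine (or user) must manage
    def process(self, event):
        self.state += event
        return self.state

events = [1, 2, 3]
print([stateless(e) for e in events])      # [2, 4, 6]
op = RunningSum()
print([op.process(e) for e in events])     # [1, 3, 6]
```

For recoverability, it is exactly `self.state` here that an engine with managed state snapshots and restores on failure, instead of leaving that burden to user code.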

Status support in Flink (source: https://www.slideshare.net/ParisCarbone/state-management-in-apache-flink-consistent-stateful-distributed-stream-processing)

Generally speaking, stateful processing gets more attention in stream processing, but on closer inspection batch processing is affected too. Take the common case of window aggregation: if the batch period is larger than the window, the state can be ignored, and user logic often does ignore it. But when the batch period becomes smaller than the window, the result of one batch actually depends on previously processed batches. Because batch engines generally do not anticipate this requirement and lack good built-in support, maintaining the state becomes a problem the user must solve. In the window aggregation case, for example, the user needs to add an intermediate results table to remember the windows that are not yet complete. Users thus find that the logic gets more complex as the batch period shrinks. This was a problem early Spark Streaming users often encountered, and it was not alleviated until Structured Streaming arrived.
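The intermediate-results-table workaround described above can be sketched as follows (a simplified illustration with invented names; late data is ignored): with a window of 4 time units and batches of 2 events, a window's partial sum must be carried across batches in user-maintained state and emitted only once later data shows the window is complete.

```python
# Conceptual sketch: window aggregation when the batch period is
# smaller than the window. `pending` is the "intermediate results
# table" the user must keep between batches.

def process_batch(batch, pending, window=4):
    """Fold (timestamp, value) events into per-window partial sums.
    Emit a window only once a later window's data has arrived
    (simplified: assumes no late data)."""
    results = {}
    for ts, v in batch:
        win = ts // window                    # window this event falls in
        pending[win] = pending.get(win, 0) + v
    if pending:
        newest = max(pending)
        for win in sorted(w for w in pending if w < newest):
            results[win] = pending.pop(win)   # window assumed complete
    return results

pending = {}                                  # state carried across batches
batches = [[(0, 1), (1, 1)], [(2, 1), (4, 5)], [(5, 5), (8, 9)]]
for b in batches:
    emitted = process_batch(b, pending)
    if emitted:
        print(emitted)
# window 0 (sum 3) is emitted only after window 1's data arrives
```

Every complication glossed over here (late events, when a window is really complete, recovering `pending` after a failure) is precisely what managed state and watermarks handle for the user in a stream engine.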

An engine like Flink, which takes stream processing as its basic model, cannot sidestep this problem in the first place, so it introduced managed state as a general solution. Compared with ad-hoc solutions implemented by users, it not only makes development easier but also performs better, and most importantly, it better guarantees the consistency of processing results.

To put it simply, some data processing logic is inherent: in batch processing it can easily be ignored or simplified while still getting usable results, whereas in stream processing the problems are exposed and must be solved. A stream computing engine that processes batches as finite streams is therefore logically rigorous and gets correctness naturally, needing only some implementation differences to optimize performance. Simulating streams with smaller batches, by contrast, surfaces problems that did not exist before, and when the computing engine has no general solution, users have to solve them themselves. Similar problems include changes in dimension tables (such as updates to user information), the boundaries of batch data, late data, and so on.

Programming model

API status at Spark 1.6

One of Spark's original goals was to meet users' needs with a unified programming model, and it has put great effort into this. Initially, the RDD-based API could handle all kinds of data processing. Later, to simplify development, the higher-level DataFrame (adding named columns to RDDs, making them structured data) and Dataset (adding types to DataFrame columns) were introduced and unified in Spark 2.0 (DataFrame = Dataset[Row]). Spark SQL support was also introduced fairly early. Together with the continual improvement of scenario-specific APIs such as Structured Streaming, and the integration with machine learning and deep learning, Spark's APIs are very easy to use today, and they are one of Spark's strengths.

Spark 2.0 API (Source: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)

Flink's APIs have similar goals and have followed a similar development path. The core APIs of Flink and Spark correspond closely. Today Spark's APIs are generally more complete overall, for example thanks to the heavy investment in machine learning and deep learning integration over the last year or two, while Flink remains ahead in stream processing, for example in its support for watermarks, windows, and triggers.

Flink API (Source: https://ci.apache.org/projects/flink/flink-docs-release-1.5/concepts/programming-model.html)

Summary

Both Spark and Flink are general-purpose computing engines that support very large-scale data processing and all kinds of processing types, and there are many more aspects of both systems worth exploring, such as SQL optimization and machine learning integration. This article has mainly tried to compare the two systems at the most basic level of architecture and design. Upper-layer functionality can be borrowed between systems to some extent, so with sufficient investment both should be able to do it well; changing the basic design, on the other hand, is far more disruptive and difficult.

The biggest difference arising from the different execution models of Spark and Flink is in their support for stream computation. Spark Streaming's early approach to stream computation was too simplistic and ran into many problems with more complex computations. Structured Streaming, introduced in Spark 2.0, reworked the semantics of stream computation to support event-time processing and end-to-end consistency. Although many functional limitations remain, it is great progress over what came before. However, the problems caused by the micro-batch implementation still exist, and in particular the performance problems become more prominent at scale. Recently, driven by a number of application scenarios, Spark has also begun to develop a continuous execution model; the experimental release in 2.3 supports only simple map-like operations.

The above is how the big data processing engines Spark and Flink compare. Have you learned any knowledge or skills? If you want to learn more skills or enrich your knowledge, you are welcome to follow the industry information channel.
