Getting Started with Apache Flink (1): An Analysis of Basic Concepts


Authors: Chen Shouyuan, Dai Zili

I. The definition, architecture, and principles of Apache Flink

Apache Flink is a distributed big data processing engine that can perform stateful or stateless computations over bounded and unbounded data streams. It can be deployed in a variety of cluster environments and performs fast computations on data of any scale.

1. Flink Application

To understand Flink application development, you first need to understand Flink's basic processing semantics, namely Streams, State, and Time, as well as Flink's multi-layered APIs, which balance flexibility and convenience.

Streams: streams are divided into bounded (finite) and unbounded (infinite) data streams. An unbounded stream has no defined beginning or end; its data keeps growing as time advances, the computation runs continuously, and there is no terminal state. A bounded stream has a fixed, finite amount of data; its computation eventually completes and reaches a terminal state.

State: state is the data produced and maintained during computation, and it plays an important role in fault-tolerant recovery and Checkpoints. Stream computing is essentially incremental processing, so state must be continuously queried and maintained; in addition, to guarantee Exactly-once semantics, data must be written into state; and persisting state ensures that the job can be recovered even if the whole distributed system fails or crashes, which is another value of state.

Time: time is divided into Event time, Ingestion time, and Processing time. Flink's unbounded data streams are continuous processes, and time is an important basis for judging whether the business state is lagging and whether data is being processed in a timely manner.

API: Flink's APIs are usually divided into three layers. From top to bottom they are the SQL / Table API, the DataStream API, and ProcessFunction. The APIs are strong in both expressiveness and the ability to abstract business logic. The closer to the SQL layer, the weaker the expressiveness but the stronger the abstraction; conversely, the ProcessFunction layer is highly expressive and supports all kinds of flexible, convenient operations, but offers relatively little abstraction.
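As a minimal sketch (not from the original article) of what the middle layer, the DataStream API, looks like in practice, the pipeline and values below are purely illustrative:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ApiLayersSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Middle layer (DataStream API): concise, declarative-style operators.
        env.fromElements(1, 2, 3, 4, 5, 6)
           .filter(n -> n % 2 == 0)   // keep even numbers
           .map(n -> n * 10)          // illustrative transformation
           .print();

        // The same logic could be written one layer up as a SQL / Table API query
        // (more abstract, less flexible), or one layer down with a ProcessFunction
        // (more flexible, less abstract).
        env.execute("API layers sketch");
    }
}
```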

2. Flink Architecture

The architecture can be summarized in the following four points:


First, Flink provides a unified framework capable of handling both bounded and unbounded data streams.

Second, deployment is flexible. Flink supports a variety of underlying resource schedulers, including YARN and Kubernetes, and Flink's own Standalone scheduler is also very flexible to deploy.

Third, extremely high scalability. Scalability is very important for distributed systems. The Alibaba Double 11 real-time dashboard uses Flink to process massive amounts of data, with peaks reaching 1.7 billion records per second.

Fourth, extreme stream processing performance. Compared with Storm, the most important advantage of Flink is that it fully abstracts state semantics into the framework and supports reading state locally, which avoids a large amount of network I/O and can greatly improve the performance of state access.

3. Flink Operations

There will be a dedicated course on this later; here we briefly cover Flink's support for operations and business monitoring:

Flink provides a 24/7 highly available SOA (service-oriented architecture), because Flink's implementation offers consistent Checkpoints. Checkpoints are the core of Flink's fault-tolerance mechanism: Flink periodically records the state of operators during computation and persists them as snapshots. When a Flink job crashes, it can selectively recover from a Checkpoint, which guarantees the consistency of the computation.
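As a minimal sketch of how checkpointing is typically enabled in a Flink job (the interval and limits below are illustrative choices, not values from the article):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a consistent snapshot every 60 seconds, with exactly-once semantics
        // (illustrative interval).
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        // Require checkpoints to finish within 10 minutes, and keep at most one
        // checkpoint in flight at a time (illustrative settings).
        env.getCheckpointConfig().setCheckpointTimeout(600_000L);
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);

        // Trivial placeholder pipeline so the job can run as-is.
        env.fromElements(1, 2, 3).print();
        env.execute("Checkpointing sketch");
    }
}
```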

Flink itself provides monitoring, operations, and maintenance functions and interfaces, and has a built-in Web UI that shows the DAG and various metrics of running jobs, helping users manage job status.

4. Flink application scenarios

4.1 Flink application scenario: Data Pipeline

The core scenario of Data Pipeline is similar to moving data around while performing some cleaning or processing along the way. In the business architecture diagram, the left side is periodic ETL, while Data Pipeline provides streaming or real-time ETL: it can subscribe to messages in a message queue, process them, and write the cleaned results in real time to a downstream Database or File system. Example scenarios:

Real-time data warehouse

When building a real-time data warehouse downstream, the upstream may need real-time Stream ETL. This process cleans or enriches the data in real time and writes it into the downstream real-time data warehouse after cleaning. This guarantees the timeliness of data queries and forms a full link of real-time data ingestion, real-time data processing, and real-time downstream queries.
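As a minimal, hypothetical sketch of such a streaming ETL step (the source, cleaning rule, and sink below are placeholders and not part of the original article), a Flink job might look like this:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamEtlSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source: a real pipeline would subscribe to a message queue
        // (for example a Kafka connector) delivering raw events.
        DataStream<String> rawEvents = env.socketTextStream("localhost", 9999);

        DataStream<String> cleaned = rawEvents
            // "Cleaning": drop empty lines (stand-in for real validation rules).
            .filter(line -> !line.trim().isEmpty())
            // "Enrichment": normalize to lower case (stand-in for real enrichment).
            .map(String::toLowerCase);

        // Placeholder sink: a real job would write to a database, file system,
        // or the real-time data warehouse instead of printing.
        cleaned.print();

        env.execute("Streaming ETL sketch");
    }
}
```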

Search engine recommendation

Take Taobao as an example: when a seller puts a new product online, a message stream is generated in the background in real time. As the stream passes through the Flink system, the messages are processed and enriched, and the processed, enriched data is then built into a real-time index and written to the search engine. In this way, when a Taobao seller launches a new product, it becomes searchable in the search engine within seconds or minutes.

4.2 Flink application scenario: Data Analytics

As shown in the figure, Data Analytics has Batch Analytics on the left and Streaming Analytics on the right. Batch Analytics is the traditional approach: tools such as MapReduce, Hive, and Spark Batch analyze and process jobs and generate offline reports. Streaming Analytics uses streaming analysis engines such as Storm and Flink to process and analyze data in real time, and is mostly applied to real-time dashboards and real-time reports.

4.3 Flink application scenario: Data Driven

To some extent, all real-time or streaming data processing is Data Driven, and stream computing is essentially Data Driven computing. Consider applications such as risk control systems: when the risk control system needs to handle a variety of complex rules, the rules and logic are written with the DataStream API or the ProcessFunction API, the logic is abstracted into the Flink engine, and when external data or events arrive the corresponding rules are triggered. This is the principle of Data Driven. After a rule is triggered, the system processes the event or raises an alert, which is sent downstream to generate a business notification. This is the Data Driven application scenario; Data Driven is mostly applied to complex event processing.
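As a minimal, hypothetical sketch of such a rule expressed against the ProcessFunction API (the event type, threshold, and alert format are illustrative assumptions, not from the article):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class DataDrivenRuleSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Illustrative event stream of transaction amounts.
        env.fromElements(120.0, 8000.0, 35.5, 12000.0)
           // The "rule": flag any transaction above an illustrative threshold.
           .process(new ProcessFunction<Double, String>() {
               private static final double THRESHOLD = 5000.0; // hypothetical rule parameter

               @Override
               public void processElement(Double amount, Context ctx, Collector<String> out) {
                   if (amount > THRESHOLD) {
                       out.collect("ALERT: suspicious amount " + amount);
                   }
               }
           })
           .print();

        env.execute("Data Driven rule sketch");
    }
}
```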

II. Analysis of the concept of "stateful streaming"

1. Traditional batch processing

The traditional batch processing approach is to continuously collect data, divide it into batches based on time, and then periodically run batch computations. However, suppose we need to count event conversions per hour: if a conversion spans the defined time boundary, traditional batch processing has to carry the intermediate result over into the next batch; moreover, when events arrive out of order, traditional batch processing still carries intermediate state into the next batch's result, and the outcome is unsatisfactory.

2. The ideal method

First, an ideal method requires the engine to be able to accumulate and maintain state. The accumulated state represents all events received in the past and affects the output.

Second, time: the engine needs a mechanism to reason about data completeness, so that results are output only once all the data has been fully received.

Third, the ideal method needs to produce results in real time; more importantly, it needs a new model of continuous data processing for handling real-time data, which best matches the characteristics of continuous data.

3. Stream processing

Simply put, stream processing means continuously receiving data from an endless data source, with the code serving as the basic data-processing logic: data from the source flows through the code, results are produced, and the results are then emitted. This is the basic principle of stream processing.

4. Distributed streaming processing

Suppose the Input Streams contain many users, each with their own ID. If we want to count the number of occurrences of each user, events for the same user must be routed to the same processing code, which is the same idea as the group by needed in batch processing. So, like batch, the stream must be partitioned: we define a key, and events with the same key flow to the same computation instance for the same operation.
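A minimal sketch of this key-based partitioning with the DataStream API (the user IDs and counting logic here are illustrative, not from the article):

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KeyByCountSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Illustrative stream of user IDs; a real job would read these from a source.
        env.fromElements("user-a", "user-b", "user-a", "user-c", "user-a")
           // Map each user ID to (userId, 1).
           .map(id -> Tuple2.of(id, 1))
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           // keyBy routes all events with the same user ID to the same instance.
           .keyBy(t -> t.f0)
           // Running count of occurrences per user.
           .sum(1)
           .print();

        env.execute("keyBy count sketch");
    }
}
```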

5. Stateful distributed streaming

As shown in the figure, a variable X is defined in the code above and is read and written during data processing. When the final result is emitted, the output can depend on the variable X; in other words, the state X affects the final output. In this process, the first key point is that state is co-partitioned by key: events with the same key flow to the same computation instance, following the same principle as the user-count example. This is what we call state, and it must be accumulated in the same computation instance together with events of the same key.

This is equivalent to partitioning the state according to the key of the input stream, so that the state accumulated by the stream becomes co-partitioned with it. The second key point is the embedded local state backend. In a stateful distributed streaming engine, state may accumulate to a very large size; when there are many keys, the state may exceed the memory of a single node, so the state must be maintained by a state backend. This state backend can maintain the state in memory.
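A minimal sketch, assuming a keyed stream of user IDs, of how per-key state (the "X" above) is declared and updated with Flink's keyed state API (the state name and counting logic are illustrative):

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class KeyedStateSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("user-a", "user-b", "user-a")
           .keyBy(id -> id) // co-partition events (and their state) by user ID
           .process(new KeyedProcessFunction<String, String, String>() {
               // Per-key state: the running count for this user ("X" in the text).
               private transient ValueState<Long> count;

               @Override
               public void open(Configuration parameters) {
                   count = getRuntimeContext().getState(
                       new ValueStateDescriptor<>("count", Long.class));
               }

               @Override
               public void processElement(String userId, Context ctx, Collector<String> out) throws Exception {
                   Long current = count.value();
                   long updated = (current == null ? 0L : current) + 1;
                   count.update(updated);
                   out.collect(userId + " seen " + updated + " times");
               }
           })
           .print();

        env.execute("Keyed state sketch");
    }
}
```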

III. The advantages of Apache Flink

1. State fault tolerance

When we think about state fault tolerance, we inevitably think of exactly-once state fault tolerance: as the application accumulates state during computation, each input event should be reflected in the state exactly once. If the state is modified more than once, the results produced by the engine are unreliable.

How do we ensure the state has an exactly-once (Exactly-once guarantee) fault-tolerance guarantee?

In a distributed scenario with multiple operators each holding local state, how do we generate a globally consistent snapshot (Global consistent snapshot)?

More importantly, how do we generate snapshots without interrupting the computation?

1.1 Exactly-once fault tolerance in a simple scenario

Take the user-occurrence count again: if the count of a user's occurrences is not accurate, not exactly once, then the result cannot be used as a reference. Before considering exactly-once fault tolerance in general, let us first consider the simplest scenario: an unbounded stream of input data, followed by a single Process operator that accumulates state with each computation. In this case, to guarantee that the Process provides exactly-once state fault tolerance, a snapshot is taken after each piece of data is processed and the state is updated; the snapshot records the position in the queue together with the corresponding state. With such consistent snapshots, exactly-once can be guaranteed.

1.2 distributed state fault tolerance

Flink, as a distributed processing engine, runs many operators with local state in a distributed setting, yet it must produce a single globally consistent snapshot. Generating a globally consistent snapshot without interrupting the computation is the problem of distributed state fault tolerance.

Global consistent snapshot

Regarding the Global consistent snapshot: when operators run on different nodes in a distributed environment, the simplest way to produce a globally consistent snapshot is to treat the snapshot point of each piece of data as continuous: the record flows through all operators, and after it has updated all of them, the state of every operator together with the record's position can be captured; this is a consistent snapshot. Of course, the Global consistent snapshot is an extension of the simple scenario above.

Fault tolerant recovery

First, let's look at Checkpoints. As mentioned above, each operator maintains its local state in a state backend, and every time a Checkpoint is generated it is written to a shared DFS. When any process dies, the states of all operators can be restored directly from a complete Checkpoint and reset to the corresponding positions. The existence of Checkpoints allows the whole process to achieve Exactly-once in a distributed environment.
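A minimal sketch of pointing checkpoints at a persistent location and configuring automatic restart on failure (the path and values below are illustrative assumptions; in production the storage path would typically be a shared DFS such as an hdfs:// directory):

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointStorageSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot the state periodically (illustrative interval).
        env.enableCheckpointing(60_000L);

        // Persist checkpoints to durable storage so any node can recover from them
        // (local path for the sketch; a shared DFS path would be used in production).
        env.getCheckpointConfig().setCheckpointStorage("file:///tmp/flink-checkpoints");

        // On failure, restart a limited number of times and recover from the
        // latest complete checkpoint (illustrative restart policy).
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 10_000L));

        env.fromElements("a", "b", "c").print();
        env.execute("Checkpoint storage sketch");
    }
}
```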

1.3 The distributed snapshot (Distributed Snapshots) method

How does Flink continuously generate globally consistent snapshots without interrupting the computation? Its approach is an extension of the Chandy-Lamport algorithm. The key element is the Checkpoint barrier: Flink continuously inserts Checkpoint barriers into the DataStream, numbered N-1, N, N+1, and so on. Checkpoint barrier N means that all data within its range belongs to Checkpoint N.

For example, suppose Checkpoint barrier N needs to be generated now. In Flink, it is actually the JobManager that triggers the Checkpoint, and once triggered, the Checkpoint barrier is generated starting from the data source. When the job starts Checkpoint barrier N, it can be understood as Checkpoint barrier N gradually filling in the table in the lower-left corner of the figure.

As shown in the figure, the events marked red, together with the red Checkpoint barrier N, indicate the data and events that Checkpoint barrier N is responsible for; the data and events in the white part do not belong to Checkpoint barrier N.

Building on this, when the data source receives Checkpoint barrier N, it first saves its own state. Taking reading from Kafka as an example, the data source's state is its current offset in the Kafka partitions, and this state is also written into the table mentioned above. Downstream, Operator 1 starts processing the data belonging to Checkpoint barrier N; when Checkpoint barrier N flows to Operator 1 along with this data, Operator 1 reflects all the data belonging to Checkpoint barrier N in its state, and when it receives Checkpoint barrier N it takes its Checkpoint snapshot directly.

After the snapshot completes and the barrier continues downstream, Operator 2 likewise receives all the data, reflects the data belonging to Checkpoint barrier N in its state, and when it receives Checkpoint barrier N it also writes its state directly into Checkpoint N. At this point, Checkpoint barrier N has filled out a complete table, which is the Distributed Snapshot. Distributed snapshots can be used for state fault tolerance: when any node dies, it can be recovered from the previous Checkpoint. Continuing this process, multiple Checkpoints can be in progress at the same time: once Checkpoint barrier N has flowed to the JobManager, the JobManager can trigger further Checkpoints, such as Checkpoint N+1, Checkpoint N+2, and so on. With this mechanism, Checkpoints can be generated continuously without blocking the computation.

2. State maintenance

State maintenance means maintaining state values locally in your code; when the state becomes very large, it needs the support of a local state backend.

As shown in the figure, in a Flink program you can register state with the getRuntimeContext().getState(desc) API. Flink offers several state backends; after state is registered through the API, it is read and written through the state backend. Flink has two different kinds of state values, and two different state backends:

The JVM Heap state backend is suitable for a small amount of state. When the state is small, the JVM Heap state backend can be used; every time an operator needs to read state, it does so via ordinary Java object reads/writes, which is not expensive. However, when a Checkpoint needs to place each operator's local state into the distributed snapshot, the state must be serialized.

The RocksDB state backend is an out-of-core state backend. When the local state backend at runtime serves the user's state reads, it goes through disk; in effect, the state is maintained on disk. The corresponding cost is that every state access has to go through serialization and deserialization. When a snapshot is needed, only serialization is required, and the serialized data is transferred directly to the central shared DFS.

Flink currently supports these two kinds of state backends: a pure in-memory state backend and a state backend backed by local disk. When maintaining state, you can choose the appropriate state backend according to the amount of state.
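A minimal sketch of selecting a state backend in code (class names as in Flink 1.13+; the choice shown here is illustrative, and the RocksDB alternative requires the flink-statebackend-rocksdb dependency):

```java
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Pure in-memory backend: state lives on the JVM heap (small state).
        env.setStateBackend(new HashMapStateBackend());

        // For very large state, the disk-backed alternative would be
        // org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend
        // (requires the flink-statebackend-rocksdb dependency).

        env.fromElements(1, 2, 3).print();
        env.execute("State backend sketch");
    }
}
```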

3. Event-Time

3.1 different types of time

Before Flink and other advanced streaming engines appeared, big data processing engines only supported Processing-time. Suppose we define a one-hour window for an operation, settling results every hour. When computing with Processing-time, the engine settles whatever data it received between 3:00 and 4:00. But when producing a report or analyzing results, what we usually want to know is the result for the data that was actually generated between 3:00 and 4:00 in the real world, and for that we must use Event-Time.

As shown in the figure, Event-Time is the timestamp attached to an event when it is generated at the data source, and subsequent operations are carried out using this timestamp. As shown in the graph, the initial queue receives the data and divides it into hourly batches by event time; that is what the Event-Time process does.

3.2 Event-Time processing

Event-Time processing uses the actual timestamp generated by the event to do re-bucketing: data whose event time falls between 3:00 and 4:00 is put into the 3:00-4:00 bucket, and the bucket then produces the result. This is how Event-Time compares with Processing-time.

The importance of Event-Time is that it determines when the engine outputs the result of an operation. Simply put, a streaming engine runs and collects data continuously, 24 hours a day. Suppose a window operator in the pipeline produces a result every hour: deciding when the window's result can be emitted is the essence of Event-Time processing, since it indicates that all the data that should have arrived has arrived.

3.3 Watermarks

Flink actually implements Event-Time with watermarks. A watermark is a special event in Flink; in essence, when an operator receives a watermark with timestamp "T", it can assume that it will not receive any more data with timestamps before T. The advantage of watermarks is that they let you estimate the cut-off for receiving data. For example, suppose the allowed gap between the expected event time and the output time is 5 minutes: all window operators in Flink collecting data from 3:00 to 4:00 will, because of this delay, wait 5 more minutes until the data up to 4:05 has been collected, at which point the 4:00 data collection is considered complete and the result for 3:00 to 4:00 is emitted. The result for this time period corresponds to the watermark.
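A minimal sketch of an event-time window with a bounded-out-of-orderness watermark, matching the hourly window and 5-minute delay described above (the event type, timestamps, and source are illustrative assumptions):

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeWindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Illustrative events: (userId, epoch-millis event timestamp).
        env.fromElements(
                Tuple2.of("user-a", 1_700_000_000_000L),
                Tuple2.of("user-b", 1_700_000_600_000L),
                Tuple2.of("user-a", 1_700_003_600_000L))
           // Watermarks lag the highest seen event time by 5 minutes, so a window
           // is only closed once data up to windowEnd + 5 min has been observed.
           .assignTimestampsAndWatermarks(
               WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                   .withTimestampAssigner((event, ts) -> event.f1))
           .map(e -> Tuple2.of(e.f0, 1))
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(e -> e.f0)
           // Hourly buckets based on event time, as in the 3:00-4:00 example.
           .window(TumblingEventTimeWindows.of(Time.hours(1)))
           .sum(1)
           .print();

        env.execute("Event-time window sketch");
    }
}
```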

4. State saving and migration

Streaming applications are running all the time, and there are several important considerations in operation and maintenance:

When the application logic is changed or a bug is fixed, how do we migrate the state of the previous execution to the new execution?

How do we redefine the parallelism of the job?

How do we upgrade the version of the cluster running the job?

Checkpoints can meet all of the above requirements, but Flink has another term, Savepoint: a manually triggered Checkpoint is called a Savepoint. The difference between a Savepoint and a Checkpoint is that Checkpoints are generated continuously and periodically by Flink for a stateful application using distributed snapshots, whereas a Savepoint is a manually generated Checkpoint that records the state of all operators in the streaming application.

As shown in the figure with Savepoint A and Savepoint B: whether you are changing the underlying code logic, fixing a bug, upgrading the Flink version, or redefining the application's parallelism, the first thing to do is to generate a Savepoint.

The principle of a Savepoint is that a Checkpoint barrier is manually inserted into every pipeline to produce a distributed snapshot; that snapshot is the Savepoint. A Savepoint can be stored in any location, and once the changes are complete, the job can be restored and executed directly from the Savepoint.

Note that while the application is being changed, time keeps passing before execution resumes from the Savepoint. For example, Kafka keeps collecting data during that period; when recovering from the Savepoint, the Savepoint holds the time the Checkpoint was generated and the corresponding Kafka offsets, so the job needs to catch up to the latest data. Regardless of the operation, using Event-Time guarantees that the results are exactly the same.

Suppose the recovered recomputation used Processing-time with the window set to 1 hour: the recomputation might finish within 10 minutes and squeeze all the results into a single window. If Event-Time is used instead, it is like bucketing: no matter how many times the recomputation runs, the final timing and the results produced by the windows must be exactly the same.

IV. Summary

This article starts from the definition, architecture, and basic principles of Apache Flink, clarifies the basic concepts related to stream computing, then briefly reviews the historical evolution of big data processing methods and the principles of stateful streaming data processing, and finally, starting from the challenges faced by stateful stream processing, analyzes the natural advantages of Apache Flink as one of the best stream computing engines in the industry. We hope it helps you clarify the basic concepts behind stream processing engines and makes it easier to use Flink.
