With the wide adoption of big data technology across industries, the demand for real-time processing of massive data keeps growing, while the business logic of that processing keeps getting more complex. Traditional batch processing and the early streaming frameworks find it increasingly difficult to meet ever-stricter business requirements for latency, throughput, fault tolerance, and ease of use.
Against this backdrop, the new streaming framework Flink greatly improves on earlier streaming frameworks by creatively applying modern massively parallel processing technology to stream processing. On the evening of March 13th, Mr. Kuang Donglin, a senior big data technical architect, was invited to share Flink's innovations and its distinctive capabilities in an online live broadcast.
The talk is organized into the following parts:
I. Background of stream processing
The traditional way of processing big data is batch-oriented: the data collected today is computed tomorrow and only then made available. In many situations, however, the timeliness of the data is critical to the success or failure of the business.
1. Background of stream processing: necessity
Take intrusion detection as an example: the result we want is that as soon as an intrusion occurs, we can respond in time. With the traditional batch approach, detecting it in real time is impossible. Likewise, in cloud computing we need to monitor the running state of every virtual machine in real time and raise early warnings when errors occur, which again requires monitoring the data in real time and acting immediately on the alarms it generates. The necessity of stream processing is therefore beyond doubt.
2. Background of stream processing: basic architecture
Let's take a look at the basic architecture of a streaming system.
It is divided into six parts: event producers, collection, a queuing system (typically Kafka, whose main purpose is to buffer data temporarily at peak load so none is lost), data transformation (the stream processing itself), long-term storage, and reports / actions.
3. Background of stream processing: evaluation criteria
There are many streaming frameworks in the industry today. Among so many frameworks, how do we evaluate a streaming framework's capability, and by which indicators? Generally speaking, we assess a streaming framework along the following dimensions.
Among them, "the guarantee of data transmission" refers to whether the data can be processed and arrived at the destination. It has three possibilities: guarantee at least once, at most once, and accurate once. In most cases, "guarantee at least once" can meet business requirements, except for specific scenarios that require high data accuracy.
"processing delay", in most cases, the lower the delay of streaming processing, the better, but in many cases, the lower the delay, the higher the corresponding cost. "throughput" and "processing delay" is a contradiction. If the throughput is high, the corresponding latency will be low, the throughput will be low, and the corresponding latency will be high.
"State management", in the process of real-time transformation, we should have interaction with the outside, such as * detection, in order to protect the environment and data security.
"Fault tolerance" and "fault tolerance load" require that when streaming is in normal progress, even if some machines are down, the system can still operate normally, and the whole streaming framework will not be affected.
"flow control", that is, flow control, in the process of data transmission, there may be a sudden increase in data, in order to ensure that the system will not collapse due to overload, it is necessary to control the data density.
"programming complexity", relatively speaking, the more advanced the API design, the lower the programming burden.
4. Background of stream processing: selection
Having understood the criteria for assessing streaming frameworks, why did we choose Flink? What are Flink's advantages?
"ensuring accurate primary semantics with state computation" is very necessary for some specific calculations.
Generally, a streaming framework can process data in one of two ways: by processing time or by event time; "event-time semantics" is the more complex of the two to support (the sketch after this subsection illustrates the difference).
Flink's API is very high-level, which makes it more efficient to express the business logic of streaming data.
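As a side note, here is a minimal sketch of what the processing-time / event-time distinction looks like in Flink's DataStream API, assuming a Flink 1.x setup; the (key, timestamp) tuples and the 5-second out-of-orderness bound are invented for illustration, not taken from the talk.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Switch from the default (processing time) to event time.
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // (key, event-time millis) pairs, invented for illustration.
        DataStream<Tuple2<String, Long>> events = env.fromElements(
                Tuple2.of("a", 1_000L), Tuple2.of("a", 4_000L), Tuple2.of("a", 12_000L));

        events
            // Tell Flink where each record's event time lives, tolerating
            // records that arrive up to 5 seconds out of order.
            .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(5)) {
                    @Override
                    public long extractTimestamp(Tuple2<String, Long> e) {
                        return e.f1;
                    }
                })
            .keyBy(0)
            .timeWindow(Time.seconds(10)) // a 10-second *event-time* window
            .sum(1)
            .print();

        env.execute("event-time-demo");
    }
}
```

With processing time, the 10-second windows would be cut by the wall clock of the machine doing the work; with event time, they are cut by the timestamps carried in the records themselves, which is why it is harder to support but more faithful to the data.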
II. The principles of Flink
After understanding the background of Flink, let's take a look at how it works.
1. Overview
Flink's component stack is similar to Spark's: at its core is a distributed streaming engine, and on top of the core sit two sets of APIs, the DataSet API for batch processing and the DataStream API for stream processing.
As the figure shows, higher-level libraries are built on top of the two APIs, and the whole stack can be deployed locally, on a cluster, or in the cloud.
2. Basic architecture
The overall architecture of Flink is very similar to that of Spark, with three main parts.
The first part is the client that submits jobs, the Flink Program. The second is the job manager, JobManager, which is mainly responsible for task scheduling and status monitoring, and for recovery when parts of the cluster fail. The last is the task manager, TaskManager, which carries out the execution of the business logic: it runs the tasks it receives and then outputs the results externally or exchanges them with the outside.
Throughout this process, the JobManager itself never executes tasks.
3. Programming model
Let's take a look at the concrete structure of Flink's programming model.
The first statement sets up the entire Flink runtime environment, similar to creating a context in Spark. The main business logic is then determined by three parts: specifying the data source, specifying the transformation logic, and specifying the output.
The data source is specified with env.addSource, which says where our data comes from; in this design it reads from Kafka. The transformation of the stream here is relatively simple: each row of data is parsed, and the parsed records form another stream, the DataStream named events.
On this stream we do a grouped, windowed aggregation: keyBy("id").timeWindow(Time.seconds(10)).apply(new MyWindowAggregationFunction()). Once the whole stream has been processed we obtain a stream of statistics and specify its output.
This is roughly the business logic of the entire dataflow, and below the arrow is the dataflow graph.
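To make this concrete, here is a minimal runnable sketch of such a pipeline, assuming Flink 1.x with the Kafka 0.10 connector; the topic name, the Event POJO, and the counting window function are illustrative stand-ins, not the exact code shown in the talk.

```java
import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
// Note: the package of SimpleStringSchema varied across 1.x releases.
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
import org.apache.flink.util.Collector;

public class StreamingJob {

    // Illustrative event type: each Kafka record is parsed into one of these.
    public static class Event {
        public String id;
        public Event() {}
        public Event(String id) { this.id = id; }
    }

    public static void main(String[] args) throws Exception {
        // 1. Set up the runtime environment (the "context" of the job).
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "demo");

        // 2. Specify the data source: read raw lines from a Kafka topic.
        DataStream<String> lines = env.addSource(
                new FlinkKafkaConsumer010<>("events", new SimpleStringSchema(), props));

        // 3. Transformation: parse each line into an Event (placeholder logic),
        //    then group by id over 10-second windows and count per key.
        DataStream<Event> events = lines.map(new MapFunction<String, Event>() {
            @Override
            public Event map(String line) {
                return new Event(line.trim());
            }
        });

        DataStream<String> counts = events
                .keyBy("id")
                .timeWindow(Time.seconds(10))
                .apply(new WindowFunction<Event, String, Tuple, TimeWindow>() {
                    @Override
                    public void apply(Tuple key, TimeWindow window,
                                      Iterable<Event> input, Collector<String> out) {
                        long n = 0;
                        for (Event ignored : input) n++;
                        out.collect(key.toString() + ": " + n);
                    }
                });

        // 4. Specify the output, then submit the dataflow to the framework.
        counts.print();
        env.execute("windowed-count");
    }
}
```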
The example shows only part of the API. Besides the operations above there are many others; let's look at the following picture.
"map" is to do some mapping, for example, we merge two strings into one string and split a string into two or three strings.
"flatMap" is similar to splitting a record into two, three, or even four records.
"Filter" is similar to filtering.
"keyBy" is equivalent to group by in SQL.
"reduce" is similar to reduce in MapReduce.
The "join" operation is somewhat similar to the join in our database.
"aggregate" is an aggregation operation, such as counting, summing, averaging, and so on.
The "connect" implementation connects two streams into one stream.
The "project" operation is similar to the snacks in SQL.
"repartition" is a repartition operation.
4. Execution mechanism
Once you know Flink's programming model, how does Flink actually run this business logic? Here is its execution mechanism.
The figure above is an execution graph for the business logic. Flink's execution mode is pipeline-like: it borrows implementation principles from databases and realizes a distinctive execution mode of its own.
5. State and fault tolerance
Flink's fault-tolerance mechanism is quite distinctive. Let's take a look at it.
When Flink processes a data stream, two kinds of records flow through it: the records generated by the business itself, and records that Flink injects into the stream. Each injected record is called a barrier, literally a fence; we can think of it as a progress marker that labels the state of the overall processing, and it is emitted from the source. As the figure shows, every stream carries checkpoint barriers.
When an operator has received the barrier, it stores its state, sends the special record on to the next operator, and reports its state to the Master. This proceeds step by step; if any step fails, Flink rolls back to the last completed checkpoint and reruns from there, so no data is lost or processed incorrectly.
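The following toy program models barrier alignment in a deliberately simplified form; it is an assumption-laden illustration of the idea, not Flink's actual internals (a real engine would also buffer the records arriving on channels already blocked by a barrier).

```java
import java.util.HashSet;
import java.util.Set;

public class BarrierAlignment {
    interface Record {}
    static class DataRecord implements Record {
        final String payload;
        DataRecord(String p) { payload = p; }
    }
    static class Barrier implements Record {
        final long checkpointId;
        Barrier(long id) { checkpointId = id; }
    }

    static class Operator {
        private final int numInputs;
        private final Set<Integer> blockedChannels = new HashSet<>();
        private long count = 0; // the operator's state (here: a simple counter)

        Operator(int numInputs) { this.numInputs = numInputs; }

        void onRecord(int channel, Record r) {
            if (r instanceof Barrier) {
                blockedChannels.add(channel);
                if (blockedChannels.size() == numInputs) {
                    // All inputs have delivered the barrier: snapshot the state,
                    // report it to the master, and forward the barrier downstream.
                    System.out.println("checkpoint " + ((Barrier) r).checkpointId
                            + " complete, state=" + count);
                    blockedChannels.clear();
                }
            } else if (!blockedChannels.contains(channel)) {
                count++; // process records only on channels not yet blocked
            }
            // (A real engine would buffer records on blocked channels.)
        }
    }

    public static void main(String[] args) {
        Operator op = new Operator(2);
        op.onRecord(0, new DataRecord("a"));
        op.onRecord(0, new Barrier(1));
        op.onRecord(1, new DataRecord("b"));
        op.onRecord(1, new Barrier(1)); // the second barrier triggers the snapshot
    }
}
```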
III. Flink in practice
1. Example
Let's take a look at a concrete example.
The first step is to initialize the framework's runtime environment. The second step is to specify the source of the data stream, here a FlinkKafkaConsumer010(...). The third step implements the transformation logic of the stream: a record is split into multiple records with flatMap, filtered with filter, then grouped by domain name with a specified window length, and finally the statistic, a count, is computed. The fourth step specifies the output of the statistics stream, and the last step compiles the transformation logic and submits it to the framework to run. A hedged reconstruction of these steps is sketched below.
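Here is one plausible reconstruction of those five steps, assuming Flink 1.x and the Kafka 0.10 connector; the topic name, the whitespace-separated log format, and the one-minute window are invented for illustration.

```java
import java.util.Properties;

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
import org.apache.flink.util.Collector;

public class DomainCountJob {
    public static void main(String[] args) throws Exception {
        // Step 1: initialize the runtime environment.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "domain-count");

        // Step 2: the data source, a Kafka 0.10 topic of raw log lines.
        env.addSource(new FlinkKafkaConsumer010<>("access-log", new SimpleStringSchema(), props))
           // Step 3a: flatMap splits one line into several (domain, 1) records.
           .flatMap(new FlatMapFunction<String, Tuple2<String, Long>>() {
               @Override
               public void flatMap(String line, Collector<Tuple2<String, Long>> out) {
                   for (String url : line.split("\\s+")) {
                       out.collect(new Tuple2<>(extractDomain(url), 1L));
                   }
               }
           })
           // Step 3b: filter out records whose domain could not be parsed.
           .filter(new FilterFunction<Tuple2<String, Long>>() {
               @Override
               public boolean filter(Tuple2<String, Long> r) {
                   return !r.f0.isEmpty();
               }
           })
           // Step 3c: group by domain, apply a window, count by summing the 1s.
           .keyBy(0)
           .timeWindow(Time.minutes(1))
           .sum(1)
           // Step 4: the output of the statistics stream.
           .print();

        // Step 5: compile the dataflow and submit it to the framework.
        env.execute("domain-count");
    }

    // Placeholder domain extraction; real parsing would be more careful.
    private static String extractDomain(String url) {
        int i = url.indexOf("://");
        if (i < 0) return "";
        String rest = url.substring(i + 3);
        int j = rest.indexOf('/');
        return j < 0 ? rest : rest.substring(0, j);
    }
}
```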
2. Monitoring
After starting the program, we can open Flink's monitoring page. Here is some of the monitoring information.
We can see that the running Flink cluster has 80 Task Managers, 80 task slots, and 1 free slot. Clicking inside the red box brings up the image below.
There are many monitoring indicators.
IV. Summary and outlook
Finally, a brief summary. The above is only a short introduction to Flink; details of Flink's memory management, deployment, internal execution mechanism, and more can be found through the following sites.
Apache Flink is the official website of the Flink open-source project.
The Flink Forward website mainly presents the experiences of major companies using Flink, together with material on proposals for Flink's own development.
Data Artisans is the commercial company behind Flink, out of which Flink grew. Its blog contains many introductions to Flink as well as in-depth articles.
AthenaX is a site mainly concerned with forward-looking work around Flink.
The four parts above are the main content of teacher Kuang Donglin's online live broadcast. Now let's look at the questions raised in the Q&A session.
1. How does Flink compare with the latest version of Spark?
Teacher Kuang: Spark Streaming and Flink are competitors, and both frameworks are widely used in stream processing. Flink's biggest advantages lie in guaranteeing low latency at high throughput, in state management for complex stateful flows, and in support for very flexible windows.
2. Does the new version of Spark adopt Timeline DB technology?
Teacher Kuang: No. Timeline DB differs from Spark in its implementation: Spark Streaming is a typical micro-batch streaming framework, whereas most of the others are built on pipeline-style architectures.
We believe this live broadcast has given you a better understanding of Flink stream processing, and we thank teacher Kuang Donglin again for sharing. Those who want more details can follow the service account FMI Pegasus and click Pegasus Live in the menu bar to learn more.