This article introduces the principle and usage of Apache Flink, and aims to resolve the most common doubts about how Flink works and how to use it.
Among streaming frameworks, only Apache Flink supports low latency, high throughput, and exactly-once semantics at the same time.
1. What is Apache Flink
1.1 What is Apache Flink?
Real-time data processing is becoming more and more important, and stream processing offers higher processing efficiency and better cost control. "Flink" means fast and agile in German, reflecting the speed and flexibility of a streaming data processor. Apache Flink supports low latency, high throughput, and exactly-once semantics within one streaming framework, and provides batch processing capability on top of its streaming engine, truly unifying batch and stream processing. Meanwhile, with Alibaba open-sourcing Blink, Flink's support for batch computing has been greatly enhanced.
1.2 Evolution of data architecture
The traditional relational data storage architecture has gradually evolved into a distributed processing and storage architecture.
1.2.1 Traditional data infrastructure
Storage is mainly based on centralized relational databases, and most architectures are split into a computing layer and a storage layer. The microservice architecture splits the system's data sources and solves the problem of business system scaling, but business data becomes scattered across different systems, making centralized data management difficult. Applications such as in-house data analysis or data mining need to periodically synchronize data from the various databases into a data warehouse, where it is extracted, transformed, and loaded (ETL) to build the data sets and applications that serve the business systems.
1.2.2 Big data architecture
Figure-big data real-time processing architecture
The Lambda architecture supports processing different types of data by integrating batch and stream computing in one platform: a Batch Layer handles batch computation and a Speed Layer handles real-time computation. However, because it involves too many frameworks, this architecture still suffers from platform complexity and high operation and maintenance costs.
1.2.3 Stateful stream computing architecture
Figure-stateful stream computing architecture
The enterprise maintains the state of all computations over real-time streaming data. The so-called state is the intermediate result produced during computation: each time new data enters the streaming system, it is computed against the intermediate state, producing a new, correct intermediate result. The biggest advantage of stateful computing is that raw data does not need to be fetched again from external systems, avoiding full recomputation. Compared with batch computing, real-time computing delivers results in a very short time.
1.2.4 Why Flink?
It can be seen that stateful stream computing will gradually become the architectural pattern for enterprise data platforms. By implementing the Google Dataflow streaming computation model, Flink achieves a high-throughput, low-latency, high-performance real-time streaming framework and supports highly fault-tolerant state management.
Apache Flink also supports the following features:
Simultaneous support for high throughput, low latency, and high performance; Flink is currently the only framework that delivers all three (Storm, for example, cannot meet high-throughput requirements).
Support for the concept of event time (Event Time): computing with the time at which events were generated lets the streaming system produce correct results even when events arrive out of order, preserving the original order of events and minimizing the influence of network transmission or hardware.
Stateful streaming computing is supported, and the intermediate results of operators are saved in memory or file system, which greatly improves system performance and reduces resource consumption.
Support for a highly flexible window (Window) mechanism, performing aggregate computations over a certain range of stream data by means of windows.
Fault tolerance based on lightweight distributed snapshots (Snapshot): Checkpoints built on distributed snapshot technology persist state information during execution, support automatic recovery when a task fails, and ensure data consistency during processing (a minimal configuration sketch follows this list).
Independent memory management implemented on top of the JVM: custom serialization/deserialization reduces the data storage size and the performance impact of GC.
Support for savepoints (Save Points), which persist snapshots of task execution to storage media for better management, operation, and maintenance of streaming applications.
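As a minimal sketch of the checkpointing feature above (assuming the Java DataStream API; the 10-second interval is an arbitrary example value):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a lightweight distributed snapshot (checkpoint) every 10 seconds
        // with exactly-once guarantees, so state can be restored automatically
        // if a task fails.
        env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);

        // A trivial pipeline so the job has something to checkpoint.
        env.fromElements(1, 2, 3).print();

        env.execute("Checkpointing example");
    }
}
```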
1.3 Flink application scenarios
Real-time intelligent recommendation: build a more real-time recommendation system with Flink stream computing, computing user behavior metrics in real time, updating the model in real time, predicting user metrics in real time, and pushing the predictions to Web/App clients.
Complex event processing with the help of Flink CEP (Complex Event Processing)
Real-time fraud detection
Real-time data warehouse and ETL
Stream data analysis
Real-time report analysis, such as the Tmall Double 11 live dashboard
1.4 Flink basic architecture
Figure-Flink hierarchical Architecture
2. Environment preparation
3. Flink programming model
3.1 Dataset types
Bounded data sets have time boundaries and are handled by batch processing; unbounded data sets have no boundaries and continuously generate new data, and are handled by streaming data processing. The two are relative concepts, distinguished mainly by time range: an unbounded data set restricted to a period of time can be treated as a bounded data set, and bounded data can be turned into unbounded data by certain methods. Since bounded and unbounded data can be converted into each other, different data types can be processed uniformly; both Apache Spark and Flink support stream computing and batch computing at the same time.
3.2 Flink programming interface
The core data processing interfaces are the DataSet API, which supports batch computing, and the DataStream API, which supports stream computing.
Figure-Flink interface layering and abstraction
3.3 Flink program structure
Set up the Flink execution environment, create and load the data set, specify the transformation logic to apply to the data set, specify the output location of the results, and call the execute method to trigger program execution (a sketch follows the figure below).
Figure-sample Flink program WordCount
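As a concrete illustration of these five steps, here is a minimal Java sketch of the WordCount program referenced in the figure, assuming a text source on local port 9999 (start one with `nc -lk 9999`):

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // 1. Set up the Flink execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Create and load the data set (a socket text stream; port 9999 is an example)
        env.socketTextStream("localhost", 9999)
            // 3. Transformation logic: split lines into (word, 1) pairs, then count per word
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(Tuple2.of(word, 1));
                        }
                    }
                }
            })
            .keyBy(t -> t.f0) // group by the word itself
            .sum(1)           // running count in field 1
            // 4. Specify the output location (standard output here)
            .print();

        // 5. Trigger program execution
        env.execute("Streaming WordCount");
    }
}
```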
3.4 Flink data types
3.4.1 Data type support
The description information of data type is defined by TypeInformation, and the most commonly used ones are BasicTypeInfo, TupleTypeInfo, CaseClassTypeInfo and PojoTypeInfo.
BasicTypeInfo: supports Java primitive data types (boxed) and String; arrays of these are covered by BasicArrayTypeInfo
Java Tuple types: fixed length and fixed field types; null values are not supported
POJO types: definition of complex data structures
Flink Value types: types that implement their own serialization and deserialization logic
Special data types
3.4.2 TypeInformation information acquisition
In general, Flink can determine data types automatically and select the appropriate serializers and comparators, but in some cases it cannot, for example because of JVM generic type erasure.
Flink's reflection mechanism reconstructs type information as far as possible; where that fails, type hints (TypeHints) can be used, with a TypeHint specifying the output type of a function (see the sketch below).
Custom TypeInformation
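A minimal sketch of the type-hint approach, assuming the Java DataStream API; the lambda's Tuple2 type parameters are erased at compile time, so a TypeHint restores them:

```java
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TypeHintExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Generic type erasure removes Tuple2's parameters from the lambda,
        // so we supply the output type explicitly via a type hint.
        env.fromElements("a", "b", "a")
           .map(word -> Tuple2.of(word, 1))
           .returns(new TypeHint<Tuple2<String, Integer>>() {})
           .print();

        env.execute("TypeHint example");
    }
}
```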
4. DataStream API
4.1 DataStream programming model
Based on the Dataflow model proposed by Google, Flink implements a computing engine that supports native data stream processing. The API is divided into three main parts:
DataSource module, which provides data input, connecting various external data sources to the Flink system and converting the incoming data into corresponding DataStream data sets
Transformation module, which defines various conversion operations for DataStream datasets, such as map, reduce, windows, etc.
DataSink module, which writes the result data to an external storage medium, such as a file or Kafka middleware
4.1.1 DataSources data entry
Built-in data sources include files, Socket network ports, and collection types. Third-party data sources define the logic of data interaction between Flink and external systems, including data read/write interfaces; Flink ships with a rich set of third-party connectors (Connector), such as the Kafka Connector and ES Connector, and also allows custom third-party source connectors.
Built-in file data source
Built-in Socket data source
The built-in collection data source, the collection class Collection, distributes data from the local collection to nodes that are executed in parallel at the remote end
External data source connectors, such as Kafka
External custom data source connectors, implemented via SourceFunction and similar interfaces (see the sketch below)
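A minimal sketch of a custom source implementing SourceFunction, assuming the Java DataStream API; the CounterSource class and its once-per-second emission are hypothetical examples:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

// A hypothetical custom source that emits an incrementing counter once per second.
public class CounterSource implements SourceFunction<Long> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        long counter = 0L;
        while (running) {
            // Emit elements into the stream through the SourceContext
            ctx.collect(counter++);
            Thread.sleep(1000L);
        }
    }

    @Override
    public void cancel() {
        // Called by Flink when the job is cancelled
        running = false;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(new CounterSource()).print();
        env.execute("Custom source example");
    }
}
```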
4.1.2 DataStream conversion operation
Transformation is the process of generating a new DataStream from one or more DataStreams. Each operation type in the conversion process is defined as a different Operator, and Flink can chain multiple Transformations into a DataFlow topology. DataStream conversion operations fall into three types: Single-DataStream, Multi-DataStream, and physical partitioning (a combined sketch follows the operator lists below).
Single-DataStream
Map (DataStream -> DataStream), FlatMap (DataStream -> DataStream), Filter (DataStream -> DataStream), KeyBy (DataStream -> KeyedStream), Reduce (KeyedStream -> DataStream), Aggregations (KeyedStream -> DataStream)
Multi-DataStream
Union (DataStream -> DataStream), Connect/CoMap/CoFlatMap (DataStream -> DataStream), Split (DataStream -> SplitStream), Select (SplitStream -> DataStream), Iterate (DataStream -> IterativeStream -> DataStream)
Physical partition operation
Redistributes data to task instances on different nodes according to a specified partitioning strategy, such as random partitioning, balanced partitioning (rebalance), or proportional partitioning (rescale).
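A combined sketch of the three operation types, assuming the Java DataStream API; the concrete streams and values are arbitrary examples:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TransformationExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Integer> first = env.fromElements(1, 2, 3, 4);
        DataStream<Integer> second = env.fromElements(10, 20, 30);

        // Single-DataStream operations: Map then Filter
        DataStream<Integer> doubled = first.map(n -> n * 2);    // 2, 4, 6, 8
        DataStream<Integer> large = doubled.filter(n -> n > 4); // keeps 6, 8

        // Multi-DataStream operation: Union merges streams of the same type
        DataStream<Integer> merged = large.union(second);

        // Physical partitioning: rebalance() redistributes elements round-robin
        merged.rebalance().print();

        env.execute("Transformation example");
    }
}
```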
4.1.3 DataSinks data output
Basic data output
File output, console (client) output, and Socket network ports
Third-party data output
Examples include Kafka, Cassandra, Kinesis, ES, HDFS, NiFi, and so on. DataSink operators specifically handle data output, and any data output can be defined by implementing SinkFunction, as FlinkKafkaProducer does (see the sketch below).
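A minimal sketch of a third-party sink, assuming the flink-connector-kafka dependency (with the legacy FlinkKafkaProducer API) and a broker at localhost:9092; the topic name "words" is a hypothetical example:

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class KafkaSinkExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker address

        env.fromElements("alpha", "beta", "gamma")
           .addSink(new FlinkKafkaProducer<>(
                   "words",                  // hypothetical target topic
                   new SimpleStringSchema(), // serialize each String record
                   props));

        env.execute("Kafka sink example");
    }
}
```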
4.2 Time concepts and Watermark
Three concepts of time:
Event time (Event Time), ingestion time (Ingestion Time), and processing time (Processing Time); a sketch of using event time follows.
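A minimal sketch of using event time, assuming the WatermarkStrategy API available since Flink 1.11; the 5-second out-of-orderness bound and the sample elements are arbitrary:

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventTimeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Each element carries its own event timestamp (milliseconds) in field f1;
        // note "a"/2000 arrives after "b"/3000, i.e. out of order.
        env.fromElements(
                Tuple2.of("a", 1_000L), Tuple2.of("b", 3_000L), Tuple2.of("a", 2_000L))
           // Event time: read the timestamp from each record and emit watermarks
           // that tolerate up to 5 seconds of out-of-orderness.
           .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner((event, ts) -> event.f1))
           .print();

        env.execute("Event time example");
    }
}
```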
This concludes the study of the principle and usage of Flink. Pairing theory with practice is the best way to learn, so go and try it!