This article introduces the principle and usage of Apache Flink, and aims to resolve the most common doubts about how Flink works and how to use it.
Among streaming frameworks, only Apache Flink supports low latency, high throughput, and exactly-once semantics at the same time.
1. What is Apache Flink
1.1 What is Apache Flink?
Real-time data processing is becoming more and more important, and stream processing offers higher processing efficiency and better cost control. "Flink" means fast and agile in German, reflecting the speed and flexibility of a streaming data processor. Apache Flink supports low latency, high throughput, and exactly-once semantics within one streaming framework, and provides batch processing capability on top of its streaming engine, truly unifying batch and stream processing. Meanwhile, with Alibaba open-sourcing Blink, Flink's support for batch computing has been greatly enhanced.
1.2 Evolution of data architecture
The traditional relational data storage architecture has gradually evolved into a distributed processing and storage architecture.
1.2.1 Traditional data infrastructure
Storage is mainly based on centralized relational databases, and most architectures are split into a computing layer and a storage layer. The microservice architecture splits the system's data sources and solves the problem of business system scaling, but business data becomes scattered across different systems, making centralized data management difficult. Applications such as in-house data analysis or data mining need to periodically synchronize data from the various databases into a data warehouse, where it is extracted, transformed, and loaded (ETL) to build the data sets and applications that serve the business systems.
1.2.2 Big data architecture
Figure-big data real-time processing architecture
The Lambda architecture supports processing different types of data by integrating batch and stream computing in one platform: a Batch Layer handles batch computation and a Speed Layer handles real-time computation. However, because it involves too many frameworks, this architecture still suffers from platform complexity and high operation and maintenance costs.
1.2.3 Stateful stream computing architecture
Figure-stateful stream computing architecture
The enterprise maintains the state of all computations over real-time streaming data. The so-called state is the intermediate result produced during computation: each time new data enters the streaming system, it is computed against the intermediate state, producing a new, correct intermediate result. The biggest advantage of stateful computing is that raw data does not need to be fetched again from external systems, avoiding full recomputation. Compared with batch computing, real-time computing delivers results in a very short time.
1.2.4 Why Flink?
It can be seen that stateful stream computing will gradually become the architectural pattern for enterprise data platforms. By implementing the Google Dataflow streaming computation model, Flink achieves a high-throughput, low-latency, high-performance real-time streaming framework and supports highly fault-tolerant state management.
Apache Flink also supports the following features:
Simultaneous support for high throughput, low latency, and high performance; Flink is currently the only framework that delivers all three (Storm, for example, cannot meet high-throughput requirements).
Support for the concept of event time (Event Time): computing with the time at which events were generated lets the streaming system produce correct results even when events arrive out of order, preserving the original order of events and minimizing the influence of network transmission or hardware.
Stateful streaming computing is supported, and the intermediate results of operators are saved in memory or file system, which greatly improves system performance and reduces resource consumption.
Support for a highly flexible window (Window) mechanism, performing aggregate computations over a certain range of stream data by means of windows.
Fault tolerance based on lightweight distributed snapshots (Snapshot): Checkpoints built on distributed snapshot technology persist state information during execution, support automatic recovery when a task fails, and ensure data consistency during processing (a minimal configuration sketch follows this list).
Independent memory management implemented on top of the JVM: custom serialization/deserialization reduces the data storage size and the performance impact of GC.
Support for savepoints (Save Points), which persist snapshots of task execution to storage media for better management, operation, and maintenance of streaming applications.
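As a minimal sketch of the checkpointing feature above (assuming the Java DataStream API; the 10-second interval is an arbitrary example value):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a lightweight distributed snapshot (checkpoint) every 10 seconds
        // with exactly-once guarantees, so state can be restored automatically
        // if a task fails.
        env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);

        // A trivial pipeline so the job has something to checkpoint.
        env.fromElements(1, 2, 3).print();

        env.execute("Checkpointing example");
    }
}
```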
1.3 Flink application scenarios
Real-time intelligent recommendation: build a more real-time recommendation system with Flink stream computing, computing user behavior metrics in real time, updating the model in real time, predicting user metrics in real time, and pushing the predictions to Web/App clients.
Complex event processing with the help of Flink CEP (Complex Event Processing)
Real-time fraud detection
Real-time data warehouse and ETL
Stream data analysis
Real-time report analysis, such as the Tmall Double 11 live dashboard
1.4 Flink basic architecture
Figure-Flink hierarchical Architecture
2. Environment preparation
3. Flink programming model
3.1 Dataset types
Bounded data sets have time boundaries and are handled by batch processing; unbounded data sets have no boundaries and continuously generate new data, and are handled by streaming data processing. The two are relative concepts, distinguished mainly by time range: an unbounded data set restricted to a period of time can be treated as a bounded data set, and bounded data can be turned into unbounded data by certain methods. Since bounded and unbounded data can be converted into each other, different data types can be processed uniformly; both Apache Spark and Flink support stream computing and batch computing at the same time.
3.2 Flink programming interface
The core data processing interfaces are the DataSet API, which supports batch computing, and the DataStream API, which supports stream computing.
Figure-Flink interface layering and abstraction
3.3 Flink program structure
Set up the Flink execution environment, create and load the data set, specify the transformation logic to apply to the data set, specify the output location of the results, and call the execute method to trigger program execution (a sketch follows the figure below).
Figure-sample Flink program WordCount
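As a concrete illustration of these five steps, here is a minimal Java sketch of the WordCount program referenced in the figure, assuming a text source on local port 9999 (start one with `nc -lk 9999`):

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // 1. Set up the Flink execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Create and load the data set (a socket text stream; port 9999 is an example)
        env.socketTextStream("localhost", 9999)
            // 3. Transformation logic: split lines into (word, 1) pairs, then count per word
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(Tuple2.of(word, 1));
                        }
                    }
                }
            })
            .keyBy(t -> t.f0) // group by the word itself
            .sum(1)           // running count in field 1
            // 4. Specify the output location (standard output here)
            .print();

        // 5. Trigger program execution
        env.execute("Streaming WordCount");
    }
}
```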
3.4 Flink data types
3.4.1 Data type support
The description information of data type is defined by TypeInformation, and the most commonly used ones are BasicTypeInfo, TupleTypeInfo, CaseClassTypeInfo and PojoTypeInfo.
BasicTypeInfo: supports Java primitive data types (boxed) and String; arrays of these are covered by BasicArrayTypeInfo
Java Tuple types: fixed length and fixed field types; null values are not supported
POJO types: definition of complex data structures
Flink Value types: types that implement their own serialization and deserialization logic
Special data types
3.4.2 TypeInformation information acquisition
In general, Flink can determine data types automatically and select the appropriate serializers and comparators, but in some cases it cannot, for example because of JVM generic type erasure.
Flink's reflection mechanism reconstructs type information as far as possible; where that fails, type hints (TypeHints) can be used, with a TypeHint specifying the output type of a function (see the sketch below).
Custom TypeInformation
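A minimal sketch of the type-hint approach, assuming the Java DataStream API; the lambda's Tuple2 type parameters are erased at compile time, so a TypeHint restores them:

```java
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TypeHintExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Generic type erasure removes Tuple2's parameters from the lambda,
        // so we supply the output type explicitly via a type hint.
        env.fromElements("a", "b", "a")
           .map(word -> Tuple2.of(word, 1))
           .returns(new TypeHint<Tuple2<String, Integer>>() {})
           .print();

        env.execute("TypeHint example");
    }
}
```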
4. DataStream API
4.1 DataStream programming model
Based on the Dataflow model proposed by Google, Flink implements a computing engine that supports native data stream processing. The API is divided into three main parts:
DataSource module, which provides data input, connecting various external data sources to the Flink system and converting the incoming data into corresponding DataStream data sets
Transformation module, which defines various conversion operations for DataStream datasets, such as map, reduce, windows, etc.
DataSink module, which writes the result data to an external storage medium, such as a file or Kafka middleware
4.1.1 DataSources data entry
Built-in data sources include files, Socket network ports, and collection types. Third-party data sources define the logic of data interaction between Flink and external systems, including data read/write interfaces; Flink ships with a rich set of third-party connectors (Connector), such as the Kafka Connector and ES Connector, and also allows custom third-party source connectors.
Built-in file data source
Built-in Socket data source
The built-in collection data source, the collection class Collection, distributes data from the local collection to nodes that are executed in parallel at the remote end
External data source connectors, such as Kafka
External custom data source connectors, implemented via SourceFunction and similar interfaces (see the sketch below)
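A minimal sketch of a custom source implementing SourceFunction, assuming the Java DataStream API; the CounterSource class and its once-per-second emission are hypothetical examples:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

// A hypothetical custom source that emits an incrementing counter once per second.
public class CounterSource implements SourceFunction<Long> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        long counter = 0L;
        while (running) {
            // Emit elements into the stream through the SourceContext
            ctx.collect(counter++);
            Thread.sleep(1000L);
        }
    }

    @Override
    public void cancel() {
        // Called by Flink when the job is cancelled
        running = false;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(new CounterSource()).print();
        env.execute("Custom source example");
    }
}
```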
4.1.2 DataStream conversion operation
Transformation is the process of generating a new DataStream from one or more DataStreams. Each operation type in the conversion process is defined as a different Operator, and Flink can chain multiple Transformations into a DataFlow topology. DataStream conversion operations fall into three types: Single-DataStream, Multi-DataStream, and physical partitioning (a combined sketch follows the operator lists below).
Single-DataStream
Map (DataStream -> DataStream), FlatMap (DataStream -> DataStream), Filter (DataStream -> DataStream), KeyBy (DataStream -> KeyedStream), Reduce (KeyedStream -> DataStream), Aggregations (KeyedStream -> DataStream)
Multi-DataStream
Union (DataStream -> DataStream), Connect/CoMap/CoFlatMap (DataStream -> DataStream), Split (DataStream -> SplitStream), Select (SplitStream -> DataStream), Iterate (DataStream -> IterativeStream -> DataStream)
Physical partition operation
Redistributes data to task instances on different nodes according to a specified partitioning strategy, such as random partitioning, balanced partitioning (rebalance), or proportional partitioning (rescale).
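A combined sketch of the three operation types, assuming the Java DataStream API; the concrete streams and values are arbitrary examples:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TransformationExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Integer> first = env.fromElements(1, 2, 3, 4);
        DataStream<Integer> second = env.fromElements(10, 20, 30);

        // Single-DataStream operations: Map then Filter
        DataStream<Integer> doubled = first.map(n -> n * 2);    // 2, 4, 6, 8
        DataStream<Integer> large = doubled.filter(n -> n > 4); // keeps 6, 8

        // Multi-DataStream operation: Union merges streams of the same type
        DataStream<Integer> merged = large.union(second);

        // Physical partitioning: rebalance() redistributes elements round-robin
        merged.rebalance().print();

        env.execute("Transformation example");
    }
}
```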
4.1.3 DataSinks data output
Basic data output
File output, console (client) output, and Socket network ports
Third-party data output
Examples include Kafka, Cassandra, Kinesis, ES, HDFS, NiFi, and so on. DataSink operators specifically handle data output, and any data output can be defined by implementing SinkFunction, as FlinkKafkaProducer does (see the sketch below).
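A minimal sketch of a third-party sink, assuming the flink-connector-kafka dependency (with the legacy FlinkKafkaProducer API) and a broker at localhost:9092; the topic name "words" is a hypothetical example:

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class KafkaSinkExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker address

        env.fromElements("alpha", "beta", "gamma")
           .addSink(new FlinkKafkaProducer<>(
                   "words",                  // hypothetical target topic
                   new SimpleStringSchema(), // serialize each String record
                   props));

        env.execute("Kafka sink example");
    }
}
```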
4.2 Time concepts and Watermark
Three concepts of time:
Event time (Event Time), ingestion time (Ingestion Time), and processing time (Processing Time); a sketch of using event time follows.
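A minimal sketch of using event time, assuming the WatermarkStrategy API available since Flink 1.11; the 5-second out-of-orderness bound and the sample elements are arbitrary:

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventTimeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Each element carries its own event timestamp (milliseconds) in field f1;
        // note "a"/2000 arrives after "b"/3000, i.e. out of order.
        env.fromElements(
                Tuple2.of("a", 1_000L), Tuple2.of("b", 3_000L), Tuple2.of("a", 2_000L))
           // Event time: read the timestamp from each record and emit watermarks
           // that tolerate up to 5 seconds of out-of-orderness.
           .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner((event, ts) -> event.f1))
           .print();

        env.execute("Event time example");
    }
}
```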
This concludes the study of the principle and usage of Flink. Pairing theory with practice is the best way to learn, so go and try it!