What are the basic knowledge points of Flink


This article walks through the basic knowledge points of Flink. If you have questions about Flink fundamentals, the sections below collect simple, practical explanations that should help answer them.

Basic features of Flink

Streaming computation is a pain point of big data processing. Storm, the first-generation real-time compute engine, has weak support for Exactly-Once semantics and windows, is limited in its usable scenarios, and cannot sustain high-throughput computation. Spark Streaming simulates streaming with "micro-batches", which hits performance bottlenecks when windows are small. Spark itself has been working on a continuous execution mode (Continuous Processing), but progress has been slow.

Flink is a low-latency, high-throughput real-time compute engine that uses distributed consistent snapshots for checkpoint-based fault tolerance and better state management. Flink can process hundreds of millions of messages or events per second at millisecond latency while providing Exactly-Once semantics, guaranteeing data correctness and enabling financial-grade data processing. Its advanced features can be summarized as CSTW: CheckPoint, State, Time, Windows.

Comparison of Flink and Spark design ideas

Spark's technical approach is to simulate streaming on top of batches; micro-batch processing has high latency (it cannot be optimized below the second level) and cannot support event_time-based time windows for aggregation logic. Flink, in contrast, simulates batch computation on top of stream computation, which better matches how data is actually produced and scales better technically.

State management

Stream processing tasks need to compute over the data, for example Sum, Count, Min, Max. These values must be stored because they are continuously updated; such values or variables can be understood as state. If the data source reads from Kafka or RocketMQ, you may also want to record how far you have read as an Offset, and these Offset variables are state as well.

Flink provides built-in state management, so these states can be stored inside Flink rather than in an external system. The benefits are:

① It reduces the computing engine's dependence on external systems and simplifies deployment, making operation and maintenance easier.

② It brings a large performance improvement: accessing Redis or HBase externally costs network and RPC resources, whereas accessing state inside Flink only touches variables in the local process.

At the same time, Flink periodically persists these states via Checkpoints and stores the Checkpoints in a distributed persistent system such as HDFS. When a Flink task fails for any reason, it recovers the state of the whole stream from the most recent Checkpoint and continues processing, with no data impact visible to users.
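A minimal sketch of how this is typically wired up in Java (the HDFS path and checkpoint interval are illustrative placeholders, not values from the article):

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a distributed snapshot every 10 seconds with exactly-once guarantees.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // Persist checkpoints to a distributed file system (path is a placeholder).
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

        // ... build and execute the job here ...
    }
}
```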

A first look at Flink's design architecture

Flink is a layered architecture; the components in each layer provide specific abstractions that serve the components above them. Flink has four layers: the Deploy layer, the Core (Runtime) layer, the API layer and the Libraries layer. The Deploy layer covers Flink's deployment modes and how it interacts with resource schedulers; the Core layer provides all the core implementations that support Flink computation; the API layer and Libraries layer provide Flink's API interfaces and the application-specific computing frameworks built on top of them.

Deploy layer: this layer covers Flink's deployment models. Flink supports several: local, Standalone/YARN, and GCE/EC2. The Standalone deployment model is similar to Spark's.

Runtime layer: the Runtime layer provides all the core implementations that support Flink computation, such as distributed stream processing, the mapping from JobGraph to ExecutionGraph, and scheduling, providing basic services for the API layer above.

API layer: the API layer implements stream processing for unbounded streams and batch processing for bounded data; stream processing corresponds to the DataStream API and batch processing to the DataSet API.

Libraries layer: this layer can also be called the Flink application framework layer. Mirroring the API layer, the real-time computing frameworks built on top of it for specific applications also fall into two categories: stream-oriented and batch-oriented. Stream-oriented support includes CEP (complex event processing) and SQL-like operations (Table-based relational operations); batch-oriented support includes FlinkML (machine learning library) and Gelly (graph processing).

Flink on YARN

Flink supports incremental iteration and can self-optimize iterative tasks, so jobs submitted on YARN perform slightly better than Spark. Flink provides two ways to submit tasks on YARN: starting a long-running YARN session (detached mode) and running a single Flink job on YARN (client mode).

Detached (session) mode: starting via the yarn-session.sh command essentially starts a Flink cluster inside the YARN cluster. YARN allocates a number of containers to the Flink cluster in advance. The YARN UI shows only a single application, "Flink session with X TaskManagers", and there is only one Flink web UI, reachable from the Application Master link in YARN.

Client mode: each job is started with bin/flink run -m yarn-cluster, which essentially starts a dedicated cluster per Flink job. YARN launches the JobManager (corresponding to YARN's AM) and the TaskManagers when the job is submitted. If a job requests n TaskManagers (-yn n), n+1 containers are started, one of which is the JobManager; publishing m applications yields m Flink web UIs. Different jobs never share a container (JVM), so resources are isolated between jobs.

Run ./yarn-session.sh -help under Flink's bin directory to verify that YARN is configured correctly. Use ./yarn-session.sh -q to display the resources of all of YARN's NodeManager nodes. To deploy Flink in on-YARN mode you only need to adjust the configuration in conf/flink-conf.yaml. For more parameters, see the official documentation: general configuration (Configuration) and HA configuration (High Availability (HA)).

Start a Flink YARN session in detached mode; after submission YARN reports that the application was submitted successfully and returns an application id. Use yarn application -kill application_id to stop a job submitted on YARN.

yarn-session.sh -n 3 -jm 700 -tm 700 -s 8 -nm FlinkOnYarnSession -d -st

You can submit the bundled word-count example directly to verify that the on-YARN mode is configured correctly:

~/bin/flink run -m yarn-cluster -yn 4 -yjm 2048 -ytm 2048 ~/flink/examples/batch/WordCount.jar

Process analysis

Detached (session) mode: the cluster is started first via yarn-session.sh and jobs are then submitted to it. A block of resources is requested from YARN up front and stays constant; once the resources are full, the next job cannot be submitted until one of the jobs on YARN finishes and releases resources. All jobs share the Dispatcher, the ResourceManager and the resources; this suits small jobs with short execution times.

Client mode: a job is submitted with bin/flink run -m yarn-cluster; each submitted job requests its own resources from YARN according to its needs and holds them until the job finishes. The failure of one job does not affect the normal submission and operation of the next, so this mode suits large-scale, long-running jobs.

DataStream

DataStream is Flink's lower-level API for real-time data processing. Its programming model has three parts: DataSource, Transformation and Sink.

DataSource

A source is where the program reads its input data; you add a source to the program with StreamExecutionEnvironment.addSource(sourceFunction). Flink ships with many pre-implemented source functions; you can also write a custom non-parallel source by implementing the SourceFunction interface, or a parallel source by implementing ParallelSourceFunction or extending RichParallelSourceFunction.

Several predefined stream data sources are accessible from StreamExecutionEnvironment:

File-based:

readTextFile(path) # reads the text file line by line (the file follows TextInputFormat) and returns each line as a string.

readFile(fileInputFormat, path) # reads the file at the given path using the specified file input format (fileInputFormat).

readFile(fileInputFormat, path, watchType, interval, pathFilter) # the method the previous two call internally. Reads the file at the given path with the given file input format. Depending on watchType it either periodically watches the path for new data (FileProcessingMode.PROCESS_CONTINUOUSLY) or processes the data currently in the path once and exits (FileProcessingMode.PROCESS_ONCE); pathFilter can further exclude files from processing.

Socket-based: socketTextStream reads from a socket; elements can be separated by a delimiter.

Collection-based:

fromCollection(Seq) # creates a data stream from a Java.util.Collection; all elements in the collection must be of the same type.

fromCollection(Iterator) # creates a data stream from an iterator; the data type of the elements returned by the iterator must be specified.

fromElements(elements: _*) # creates a data stream from the given sequence of objects; all objects must be of the same type.

fromParallelCollection(SplittableIterator) # creates a data stream from an iterator, in parallel; the data type of the elements returned by the iterator must be specified.

generateSequence(from, to) # generates the sequence of numbers in the given interval, in parallel.

Custom: addSource attaches a new source function. For example, to read from Apache Kafka you can use addSource(new FlinkKafkaConsumer08<>(...)). See the connectors documentation for details.
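A minimal sketch combining a few of these sources in Java (the topic name, brokers, host and port are illustrative placeholders, not from the article):

```java
import java.util.Properties;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class SourceExamples {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Collection-based source: every element must have the same type.
        DataStream<String> fromElements = env.fromElements("flink", "storm", "spark");

        // Socket-based source: read lines from a socket.
        DataStream<String> fromSocket = env.socketTextStream("localhost", 9999);

        // Custom source: a Kafka consumer added via addSource (placeholder topic/brokers).
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("zookeeper.connect", "localhost:2181");
        props.setProperty("group.id", "demo");
        DataStream<String> fromKafka =
                env.addSource(new FlinkKafkaConsumer08<>("demo-topic", new SimpleStringSchema(), props));

        fromElements.print();
        env.execute("source-examples");
    }
}
```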

Transformation

A Transformation converts one or more DataStreams into a new DataStream, and multiple transformations can be combined into a complex dataflow topology. Through different Transformation operations a DataStream is transformed, filtered, grouped and aggregated into other streams, fulfilling the business requirements.

Map: DataStream -> DataStream, one input element produces one output element. Doubling the elements of the input stream: dataStream.map { x => x * 2 }

FlatMap: DataStream -> DataStream, one input element produces zero or more output elements. Splitting sentences into words:

dataStream.flatMap { str => str.split(" ") }

Filter: DataStream -> DataStream, evaluates a boolean function for each element and keeps only those for which the function returns true. A filter that drops zero values:

dataStream.filter { _ != 0 }

KeyBy: DataStream -> KeyedStream, partitions the stream into disjoint partitions; all records with the same key end up in the same partition. Specifying the key:

dataStream.keyBy("someKey") // Key by field "someKey"

dataStream.keyBy(0) // Key by the first element of a Tuple

Reduce: KeyedStream -> DataStream, a rolling reduce on a KeyedStream: combines the current element with the last reduced value and emits the new value. Summing values per key: keyedStream.reduce { _ + _ }

Aggregations: KeyedStream -> DataStream, rolling aggregations applied on a KeyedStream.

Window: KeyedStream -> WindowedStream, windows can be defined on an already partitioned KeyedStream. Windows group the data for each key according to some characteristic, for example the data that arrived within the last 5 seconds. See the Windows section for details.

dataStream.keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(5)))

WindowAll: DataStream -> AllWindowedStream, windows can also be defined on a plain DataStream. In many cases this is a non-parallel transformation: all records are collected in one task of the windowAll operator.

dataStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(5)))

Window Apply: WindowedStream -> DataStream or AllWindowedStream -> DataStream, applies a function to the window as a whole. Summing the data in a window:

windowedStream.apply { WindowFunction }

allWindowedStream.apply { AllWindowFunction }

Window Reduce: WindowedStream -> DataStream, applies a reduce function to the window and returns the reduced value. windowedStream.reduce { _ + _ }

Aggregations on windows: WindowedStream -> DataStream, aggregates the contents of a window.

Union: DataStream* -> DataStream, merges two or more data streams into a new stream containing all the elements from all streams. If you union a data stream with itself, each element appears twice in the resulting stream.

dataStream.union(otherStream1, otherStream2, ...)

Window Join: DataStream, DataStream -> DataStream, joins two streams on a given key over a common window.

dataStream.join(otherStream)
    .where(...).equalTo(...)
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply { ... }

Window CoGroup: DataStream, DataStream -> DataStream, co-groups two streams on a given key over a common window.

dataStream.coGroup(otherStream)
    .where(0).equalTo(1)
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply {}

The difference between CoGroup and Join: CoGroup also emits unmatched data, while Join emits only matching pairs.

Connect: DataStream, DataStream -> ConnectedStreams, connects two data streams while keeping their types. Connecting allows state to be shared between the two streams.

someStream: DataStream[Int] = ...

otherStream: DataStream[String] = ...

val connectedStreams = someStream.connect(otherStream)

This can be used, for example, to associate a data stream with a configuration stream.

CoMap, CoFlatMap: ConnectedStreams -> DataStream, the equivalents of map and flatMap on a connected data stream.

Split: DataStream -> SplitStream, splits the data stream into two or more streams.

Select: SplitStream -> DataStream, selects one or more streams from a SplitStream.

val even = split.select("even")

val odd = split.select("odd")

val all = split.select("even", "odd")

Iterate: DataStream -> IterativeStream -> DataStream, redirects the output of one operator back to an earlier operator, creating a feedback loop in the flow. This is particularly useful for algorithms that continuously update a model. The following code starts with a stream and applies the iteration body continuously: elements greater than 0 are sent back into the feedback loop, and the rest are forwarded downstream.
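The original snippet was not reproduced in the source, so here is a hedged Java reconstruction of what the description says:

```java
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.IterativeStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IterateExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Long> someIntegers = env.generateSequence(0, 1000);

        // Open the feedback loop.
        IterativeStream<Long> iteration = someIntegers.iterate();

        // Iteration body: decrement each element by one.
        DataStream<Long> minusOne = iteration.map(new MapFunction<Long, Long>() {
            @Override
            public Long map(Long value) {
                return value - 1;
            }
        });

        // Elements still greater than 0 are fed back into the loop...
        DataStream<Long> stillGreaterThanZero = minusOne.filter(new FilterFunction<Long>() {
            @Override
            public boolean filter(Long value) {
                return value > 0;
            }
        });
        iteration.closeWith(stillGreaterThanZero);

        // ...and the rest are forwarded downstream.
        DataStream<Long> lessThanZero = minusOne.filter(new FilterFunction<Long>() {
            @Override
            public boolean filter(Long value) {
                return value <= 0;
            }
        });

        lessThanZero.print();
        env.execute("iterate-example");
    }
}
```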

Project: DataStream -> DataStream, a transformation on tuple streams that selects a subset of the tuple's fields.

DataStream in = // [...]

DataStream out = in.project(2);

Sink

A Data Sink consumes a DataStream and forwards it to files, sockets, external systems, or prints it. Flink ships with a variety of built-in output formats, wrapped in operations on DataStreams:

writeAsText() / TextOutputFormat: writes elements line by line as strings. The string is obtained by calling each element's toString() method.

writeAsCsv(...) / CsvOutputFormat: writes tuples to a file as comma-separated values. Row and field delimiters are configurable. The value of each field comes from the object's toString() method.

print() / printToErr(): prints each element's toString() value to standard output / standard error. An output prefix can be defined to help distinguish between different print calls. If the parallelism is greater than 1, the output also contains the identifier of the task that produced it.

writeUsingOutputFormat() / FileOutputFormat: method and base class for custom file output. Supports custom object-to-bytes conversion.

writeToSocket: writes elements to a socket, serialized with a SerializationSchema.

addSink: calls a custom sink function. See the connectors documentation for details.

The write*() methods on DataStream are mainly intended for debugging. They do not participate in Flink's checkpointing, which means these functions usually only have at-least-once semantics. Whether data is flushed to the target system depends on the OutputFormat implementation, so not all elements sent to the OutputFormat appear in the target system immediately, and in case of failure these records may be lost.

To deliver a stream to the file system reliably and exactly once, use flink-connector-filesystem. With the .addSink(...) method and checkpointing enabled, exactly-once semantics can be achieved.

Time

The most important characteristic of streaming data is that it carries a time attribute. Based on where the time is generated, Flink distinguishes three notions of time: the time an event was produced (Event_time), the time an event entered Flink (Ingestion_time), and the time an event is processed (Processing_time). Users choose which notion to use as the time attribute of the stream according to their needs, which greatly increases the flexibility and accuracy of data processing.

Event_time: the time at which an event occurred on the device that produced it, usually embedded in the data before it reaches Flink, so the time order depends on where the event occurred and has nothing to do with downstream processing systems. In Flink you need to specify the event's time attribute or configure a timestamp extractor to pull out the event time.

Processing_time: the clock time of the host executing the operator. When Processing_time is selected, all time-based operators use the system time of their own host directly. Programs using Processing_time have relatively high performance and low latency, because none of their operations need any time-based comparison or coordination.

Ingestion_time: the time at which data enters the Flink system, taken from the system clock of the host running the Source operator.

In general, choosing event_time as the event timestamp is closest to how the data was actually produced, but because of data delay and out-of-order arrival, processing_time is used in many cases.
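A minimal sketch of selecting a time characteristic on the execution environment (the choice of event time here is illustrative):

```java
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TimeCharacteristicSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Use event time; the alternatives are ProcessingTime and IngestionTime.
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    }
}
```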

Window: definition and classification

In streaming computation, data flows continuously into the compute engine, so a window is needed to bound the scope of a computation, for example "the last 2 minutes" in a monitoring scenario or an exact calculation every 2 minutes. The window defines that range and makes bounded processing over unbounded data possible.

Flink's DataStream API abstracts windows as standalone operators and supports many window operators. Each window operator comprises attributes such as the Window Assigner, Window Function, trigger, evictor and lateness setting; the Window Assigner and the Window Function are mandatory.

The Window Assigner determines which window(s) an element is assigned to; the Trigger decides when a window can be computed or purged, and every window has its own Trigger.

After the Trigger fires and before the window function runs, the Evictor (if one is configured) removes unwanted elements from the window, acting essentially as a filter.

Flink supports several window types. They can be divided into time-driven Time Windows (e.g. every 30 seconds) and data-driven Count Windows (e.g. every 100 events). By how the window moves, they can be further divided into tumbling windows (Tumbling Window, no overlap), sliding windows (Sliding Window, overlapping) and session windows (Session Window, separated by gaps of inactivity).

A Time Window groups the data stream by time. The window mechanism is completely decoupled from the time type, so when the time type needs to change (among the three notions above) no window-related code has to change. Common Time Windows are the Tumbling Time Window and the Sliding Time Window.

A Count Window groups the data stream by the number of elements, and likewise comes as a Tumbling Count Window and a Sliding Count Window.
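A brief sketch of how these window types are declared on a keyed stream in Java (the sizes, slides and sample elements are illustrative):

```java
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowVarieties {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        KeyedStream<Tuple2<String, Long>, Tuple> keyed =
                env.fromElements(Tuple2.of("a", 1L), Tuple2.of("b", 2L)).keyBy(0);

        // Tumbling time window: non-overlapping 30-second windows.
        keyed.window(TumblingEventTimeWindows.of(Time.seconds(30)));

        // Sliding time window: 30-second windows evaluated every 10 seconds (overlapping).
        keyed.window(SlidingEventTimeWindows.of(Time.seconds(30), Time.seconds(10)));

        // Session window: a per-key window closes after 1 minute of inactivity.
        keyed.window(EventTimeSessionWindows.withGap(Time.minutes(1)));

        // Tumbling and sliding count windows: driven by element counts, not time.
        keyed.countWindow(100);
        keyed.countWindow(100, 10);
    }
}
```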

Windows implementation

All of the components just described live inside a single window operator. As data flows into the operator, each arriving element is handed to the WindowAssigner, which decides which window or windows the element is placed in. A Window itself is just an identifier; it may carry some metadata, such as the start and end time of a TimeWindow, but it does not store the window's elements. The elements are actually stored in Key/Value state, where the key is the Window and the value is the collection of elements (or an aggregated value). To make windows fault-tolerant, this implementation relies on Flink's State mechanism.

Every window has its own Trigger, which carries a timer that decides when the window can be computed or purged. The Trigger is invoked whenever an element is added to the window or a previously registered timer fires. A Trigger call can return continue (do nothing), fire (process the window's data), purge (remove the window and its data), or fire + purge. If a call returns only fire, the window is evaluated but kept as it is: its data stays unchanged and waits to be evaluated again the next time the Trigger fires. A window can be evaluated repeatedly until it is purged, and it occupies memory until the purge happens.

When the Trigger fires, the collection of elements in the window is handed to the Evictor (if one was specified). The Evictor traverses the window's element list and decides how many of the elements that entered the window first should be removed. The remaining elements are passed to the user-specified function that evaluates the window. If no Evictor is configured, all elements in the window go to the function.

The evaluation function receives the window's elements (possibly already filtered by the Evictor), computes the window's result value(s) and sends them downstream; a window can produce one or several result values. The DataStream API accepts different kinds of evaluation functions, including the predefined sum(), min(), max(), as well as ReduceFunction, FoldFunction and WindowFunction. WindowFunction is the most general one; most of the predefined functions are implemented on top of it.

Flink optimizes window evaluation for aggregation-style functions such as sum and min, because aggregations do not need to keep all the data in the window, only a single result value. Every element entering the window runs the aggregate function and updates that value, which greatly reduces memory consumption and improves performance. However, if the user configures an Evictor, this aggregation optimization is disabled, because the Evictor has to traverse all elements in the window, so all of them must be kept.

Windows Function

When computing with windows, Flink distinguishes whether the dataset is a KeyedStream (i.e. whether the data has been partitioned by key). If the upstream data has been keyed, calling the window() method to specify a Window Assigner means the data is computed in parallel, per key, in different Task instances, finally producing a result for each key. For a non-keyed stream, calling the windowAll() method to specify the Window Assigner routes all data to a single Task in the window operator, producing a global result.

After defining the window assigner you need to specify the computation logic for each window, the Window Function. Flink provides four kinds of window functions: ReduceFunction, AggregateFunction, FoldFunction and ProcessWindowFunction, of which FoldFunction is being phased out. They split into incremental aggregation functions (ReduceFunction, AggregateFunction, FoldFunction) and full-window functions (ProcessWindowFunction).

Incremental aggregation functions have high performance and a small storage footprint, because they only need to maintain the window's intermediate result and do not need to cache the raw data. Full-window functions cost more and perform worse, because the operator has to cache all the data that entered the window and only summarizes the raw data once the window is triggered; if the amount of data is large or the window is long, performance can degrade noticeably.

ReduceFunction is similar to AggregateFunction, except that its output type must equal its input type (for example aggregating one field of a tuple). AggregateFunction is more flexible and exposes separate methods to override: add() defines how a record is added to the accumulator, getResult() defines how the final result is derived from the accumulator, and merge() defines how two accumulators are merged.

ProcessWindowFunction supports more complex operators: it can compute based on all the elements of the window, and it is required when the operator needs the window's metadata or state, or when the operation is not commutative and associative (for example computing the median or mode of all elements). Its Context object exposes the window's metadata and operable window state, including GlobalState and WindowState.

In most cases incremental and full-window computation are combined: incremental aggregation improves window performance to some extent, but it is not as flexible as ProcessWindowFunction. Used together, you get both the incrementally aggregated result and the window metadata (window start and end time, etc.). For example, when computing a Top N, the per-product click totals aggregated by product ID inside each window are needed together with the window information.
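A hedged sketch of this combination in Java, pairing an incremental ReduceFunction with a ProcessWindowFunction that adds the window's start and end time to the output (the field layout, types and window size are illustrative assumptions):

```java
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class ClickCountPerWindow {

    // (productId, clicks) -> (productId, windowStart, windowEnd, totalClicks)
    public static DataStream<Tuple4<String, Long, Long, Long>> clicksPerWindow(
            DataStream<Tuple2<String, Long>> clicks) {

        return clicks
                .keyBy(new KeySelector<Tuple2<String, Long>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Long> value) {
                        return value.f0;
                    }
                })
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                // Incremental part: keep only one running sum per key and window.
                .reduce(new ReduceFunction<Tuple2<String, Long>>() {
                    @Override
                    public Tuple2<String, Long> reduce(Tuple2<String, Long> a, Tuple2<String, Long> b) {
                        return Tuple2.of(a.f0, a.f1 + b.f1);
                    }
                },
                // Full-window part: receives only the pre-aggregated value plus window metadata.
                new ProcessWindowFunction<Tuple2<String, Long>, Tuple4<String, Long, Long, Long>, String, TimeWindow>() {
                    @Override
                    public void process(String key,
                                        Context ctx,
                                        Iterable<Tuple2<String, Long>> aggregated,
                                        Collector<Tuple4<String, Long, Long, Long>> out) {
                        long total = aggregated.iterator().next().f1;
                        out.collect(Tuple4.of(key, ctx.window().getStart(), ctx.window().getEnd(), total));
                    }
                });
    }
}
```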

Watermark

Because of network or system issues, event data may not reach Flink in time, causing out-of-order and delayed data, so a mechanism is needed to control the progress of data processing. Once windows based on event_time are created, how do we know that all elements belonging to a window have arrived? If we knew all data had arrived, we could compute over the window (aggregate, group). If data is still missing we would keep waiting, but we cannot wait indefinitely; we need a mechanism that guarantees the window is triggered after a specific point. That is what the watermark does: when a watermark is reached, it asserts that all data earlier than the watermark has already arrived (even if late data shows up afterwards). Watermarks are a mechanism for event-time window computation; a watermark is essentially a timestamp, and it can be assigned when reading the source or before a transformation, using a custom watermark generator if required.

In the ideal case, streaming data arrives in order.

In practice, data is out of order and there are late elements. The watermark mechanism then asserts that all data with timestamps up to the current watermark has arrived and nothing earlier than the watermark is still outstanding, so the computation can be triggered.

Flink can generate watermarks in two ways: Periodic Watermarks and Punctuated Watermarks. The former assumes that all data up to the current timestamp minus a fixed delay has arrived, while the latter emits a watermark when a specific event indicates it.

An example of how Periodic Watermarks work: assume a 10 s window and, ideally, no message delay, so eventTime equals the current system time. If the watermark is set equal to eventTime, then when watermark = 00:00:10 the computation of window W1 is triggered; since there is no delay, every message before the watermark (eventTime < 00:00:10) has already fallen into the window, so W1 is computed over the full data. Now suppose a message with eventTime 00:00:01, which belongs to W1, actually arrives at 00:00:11. Because the watermark equals the current time, it is already 00:00:11 and W1 has been computed, so the message is discarded and never included, which is a problem. This explains why a constant is subtracted when generating the watermark in code: if the extracted eventTime is reduced by 2 s, then when that late message arrives at 00:00:11 the watermark is only 00:00:09, W1 has not been triggered yet, and the message is still added to W1. Subtracting a constant therefore provides tolerance for delayed messages.
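A hedged sketch of such a periodic watermark assigner in Java, using the built-in bounded-out-of-orderness extractor with a 2-second delay (the element type and timestamp field are illustrative assumptions):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class PeriodicWatermarkExample {

    // Assigns event timestamps and emits watermarks lagging 2 seconds behind the max timestamp seen.
    public static DataStream<Tuple2<String, Long>> withWatermarks(DataStream<Tuple2<String, Long>> events) {
        return events.assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(2)) {
                    @Override
                    public long extractTimestamp(Tuple2<String, Long> element) {
                        // f1 is assumed to hold the event time in epoch milliseconds.
                        return element.f1;
                    }
                });
    }
}
```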

Punctuated Watermarks generate the watermark based on a custom condition, for example a status field of a data element or a value in a tuple: if the status in the incoming event is 0 a watermark is emitted, otherwise it is not. You implement this by overriding the extractTimestamp and checkAndGetNextWatermark methods.
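A hedged sketch of a punctuated assigner following that description (the element layout with a status field is an assumption for illustration):

```java
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

// Element layout assumed: (id, status, eventTimeMillis)
public class StatusPunctuatedAssigner
        implements AssignerWithPunctuatedWatermarks<Tuple3<String, Integer, Long>> {

    @Override
    public long extractTimestamp(Tuple3<String, Integer, Long> element, long previousElementTimestamp) {
        return element.f2;
    }

    @Override
    public Watermark checkAndGetNextWatermark(Tuple3<String, Integer, Long> lastElement, long extractedTimestamp) {
        // Emit a watermark only when the status field is 0; otherwise emit nothing.
        return lastElement.f1 == 0 ? new Watermark(extractedTimestamp) : null;
    }
}
```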

Flink also lets you predefine timestamp extractors (Timestamp Extractors) that extract the timestamp directly when reading the source.

Delayed data

Although event_time-based window computation can tolerate some delay through the watermark mechanism, that only alleviates the problem to a certain degree and cannot cope with severely late data. By default Flink drops late data, but users can customize how late data is handled; this requires the Allowed Lateness mechanism to process the late data separately.

The DataStream API provides the allowedLateness method to specify whether late data should still be processed. Its parameter is a Time interval, the maximum allowed lateness. In Flink's window computation, the window's end time plus this interval becomes the point P at which the window is finally released. Data whose event time is within the window and which arrives before P still triggers the window computation even though the watermark has already passed the window's end time; data arriving after P is discarded.

Usually you use the sideOutputLateData method to tag late data, and then getSideOutput() to retrieve the tagged late data and analyze why it was delayed.
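A hedged sketch of wiring these two methods together in Java (the OutputTag name, window size and lateness are illustrative):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LateDataExample {

    public static void handleLateData(DataStream<Tuple2<String, Long>> events) {
        // Anonymous subclass so the generic type information is retained.
        OutputTag<Tuple2<String, Long>> lateTag = new OutputTag<Tuple2<String, Long>>("late-data") {};

        SingleOutputStreamOperator<Tuple2<String, Long>> result = events
                .keyBy(0)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                .allowedLateness(Time.seconds(30))   // keep the window open 30 s past its end time
                .sideOutputLateData(lateTag)         // route anything later than that to a side output
                .sum(1);

        // Retrieve the tagged late data for separate analysis.
        DataStream<Tuple2<String, Long>> lateStream = result.getSideOutput(lateTag);
        lateStream.print();
    }
}
```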

Multi-stream merging and joining

Connect: Flink's connect method merges two streams into a ConnectedStreams while keeping their types. Different processing logic can be applied to the data of each stream, and state (for example a count) can be shared between the two. The map() and flatMap() provided by ConnectedStreams require a CoMapFunction or CoFlatMapFunction to process the two input DataStreams respectively.
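A hedged sketch of connecting a data stream with a control/configuration stream and processing both with a CoFlatMapFunction (the filtering rule and types are illustrative):

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.flink.streaming.api.datastream.ConnectedStreams;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.util.Collector;

public class ConnectExample {

    public static DataStream<String> filterByControlStream(
            DataStream<String> data, DataStream<String> blockedWords) {

        ConnectedStreams<String, String> connected = data.connect(blockedWords);

        return connected.flatMap(new CoFlatMapFunction<String, String, String>() {
            // Per-subtask state updated by the control stream.
            private final Set<String> blocked = new HashSet<>();

            @Override
            public void flatMap1(String value, Collector<String> out) {
                // Data stream: emit only words that are not currently blocked.
                if (!blocked.contains(value)) {
                    out.collect(value);
                }
            }

            @Override
            public void flatMap2(String value, Collector<String> out) {
                // Control stream: update the blocked set, emit nothing.
                blocked.add(value);
            }
        });
    }
}
```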

Union: the union operator merges two or more input streams into one; the streams must have the same type, and the output stream contains exactly the elements of the inputs.

Association

Flink supports windowed multi-stream joins, that is, joining multiple input streams on one window according to the same condition. The input streams must be built on the same window and keyed by the same type of key used as the join condition.

The dataset inputStream1 forms a JoinedStreams through the join method; where() specifies the key of inputStream1, equalTo() specifies the join key of inputStream2, window() specifies the Window Assigner, and finally a user-defined JoinFunction or FlatJoinFunction passed to apply() computes over the windowed elements.

All joins in the windowed-join process are inner joins: an element is emitted only if both streams have elements with the same key inside the same window.
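A hedged sketch of such a windowed inner join in Java (stream element types, key fields and the window size are illustrative):

```java
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowJoinExample {

    public static DataStream<String> joinStreams(
            DataStream<Tuple2<String, Integer>> inputStream1,
            DataStream<Tuple2<String, Double>> inputStream2) {

        return inputStream1
                .join(inputStream2)
                .where(new KeySelector<Tuple2<String, Integer>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Integer> value) {
                        return value.f0;
                    }
                })
                .equalTo(new KeySelector<Tuple2<String, Double>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Double> value) {
                        return value.f0;
                    }
                })
                .window(TumblingEventTimeWindows.of(Time.seconds(3)))
                .apply(new JoinFunction<Tuple2<String, Integer>, Tuple2<String, Double>, String>() {
                    @Override
                    public String join(Tuple2<String, Integer> left, Tuple2<String, Double> right) {
                        // Only pairs with the same key inside the same window reach this point.
                        return left.f0 + ": " + left.f1 + " / " + right.f1;
                    }
                });
    }
}
```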

State and fault tolerance

Stateful computation is one of Flink's important features: intermediate results produced during computation are stored internally and made available to subsequent Functions or operators. State data is kept in local storage, either in Flink's heap or off-heap memory, or in third-party storage media. Compared with the Storm + Redis/HBase pattern, Flink's complete state management reduces the dependence on external systems and lowers maintenance costs.

State and type

Depending on whether the dataset is partitioned by key, Flink divides state into two kinds: Keyed State and Operator State. Keyed State can only be used in Functions and operators applied to a KeyedStream dataset; it can be seen as a special case of Operator State.

Operator State is bound only to a parallel operator instance and is independent of any key in the data elements; it supports automatic redistribution of state when the operator's parallelism changes.

Both Keyed State and Operator State exist in two forms: managed state and raw state. For managed state, Flink Runtime controls and manages the state data, converting it into in-memory hash tables or RocksDB object storage; for raw state the operator manages its own data structures. When a Checkpoint is triggered, Flink does not know the internal structure of raw state; it simply turns the data into bytes and stores them in the Checkpoint, and when recovering a task from a Checkpoint the operator itself deserializes the bytes back into its state data structure.
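A hedged sketch of managed Keyed State in Java: a per-key counter kept in a ValueState (the state name and element type are illustrative):

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Counts how many events have been seen per key, emitting (key, countSoFar) for each event.
public class PerKeyCounter extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        // Managed keyed state: Flink scopes this value to the current key automatically.
        count = getRuntimeContext().getState(new ValueStateDescriptor<>("event-count", Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> value, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = count.value();
        long updated = (current == null) ? 1L : current + 1L;
        count.update(updated);
        out.collect(Tuple2.of(value.f0, updated));
    }
}
```

It would typically be applied after a keyBy, for example stream.keyBy(0).flatMap(new PerKeyCounter()), so that the counter is scoped per key.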

CheckPoint and SavePoint

Flink provides a CheckPoint mechanism based on a lightweight distributed snapshot algorithm. A distributed snapshot captures the state of all Tasks/Operators globally at the same point in time, including both Keyed State and Operator State.

A Savepoint is a special kind of checkpoint. It uses the CheckPoint mechanism underneath, but is triggered by a manual command and persists the result to a specified storage path. Its main purpose is to let users save the system's state while upgrading or maintaining the cluster, so that the application state known at a normal, deliberate shutdown is not lost and can be restored afterwards.

That concludes this tour of the basic knowledge points of Flink. Hopefully it has cleared up some doubts; pairing the theory above with hands-on practice is the best way to make it stick, so go and try it out.
