2025-04-06 Update From: SLTechnology News&Howtos
Author: Qiu Congxian
1. Introduction to Window & Time
Apache Flink (hereinafter referred to as Flink) is a distributed computing framework with native support for processing unbounded streams. In Flink, Window is the core component that cuts an unbounded stream into bounded slices that can actually be computed over. A Window in Flink can be time-driven (Time Window) or data-driven (Count Window).
Both kinds of Window are used through the same DataStream Window API.
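To make the two styles concrete, here is a minimal, self-contained Python sketch (not the Flink API; the stream, keys, and timestamps are hypothetical) of how a time-driven tumbling window and a data-driven count window group the same stream:

```python
from collections import defaultdict

# Hypothetical event stream: (key, timestamp_in_seconds, value)
events = [
    ("user-a", 0, 1), ("user-a", 2, 5), ("user-a", 7, 3),
    ("user-a", 11, 9), ("user-a", 12, 4),
]

def tumbling_time_windows(events, size):
    """Time-driven: bucket each event by its timestamp into
    fixed, non-overlapping windows of `size` seconds."""
    windows = defaultdict(list)
    for key, ts, value in events:
        start = ts - (ts % size)  # start of the window this event falls into
        windows[(key, start, start + size)].append(value)
    return dict(windows)

def count_windows(events, size):
    """Data-driven: close a window after every `size` elements per key."""
    windows, buf = [], defaultdict(list)
    for key, _, value in events:
        buf[key].append(value)
        if len(buf[key]) == size:
            windows.append((key, buf.pop(key)))
    return windows

by_time = tumbling_time_windows(events, 5)
# windows [0,5): [1, 5]   [5,10): [3]   [10,15): [9, 4]
by_count = count_windows(events, 2)
# windows of 2 elements each: [1, 5] and [3, 9]; the trailing 4 stays buffered
```

The same stream, grouped two different ways: by when the events happened versus by how many have arrived.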
2. Use of Window API
We already know the basic concepts of Window and its related API from the first part, so let's look at a practical example of how the Window API is used.
The example code comes from flink-examples. It first extracts a timestamp from each record, then calls keyBy(), followed by window(), evictor(), trigger(), and maxBy(). Below we focus on the window(), evictor(), and trigger() methods.
2.1 WindowAssigner, Evictor and Trigger
The window() method takes a WindowAssigner, which is responsible for assigning each input element to the correct window (one element may be assigned to several windows at the same time). Flink provides several general-purpose WindowAssigners: tumbling windows (elements are never shared between windows), sliding windows (elements may be shared between windows), session windows, and global windows. If you need a custom distribution policy, you can implement your own class that extends WindowAssigner.
Tumbling Window
Sliding Window
Session Window
Global Window
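The difference between tumbling and sliding assignment comes down to the window-start arithmetic an assigner performs. A simplified Python model of that logic (not the actual Flink API; timestamps and sizes are illustrative):

```python
def assign_tumbling(ts, size):
    """A tumbling assigner maps each timestamp to exactly one window,
    represented here as a (start, end) pair."""
    start = ts - (ts % size)
    return [(start, start + size)]

def assign_sliding(ts, size, slide):
    """A sliding assigner may map one timestamp to several overlapping
    windows (size/slide of them when size is a multiple of slide)."""
    windows = []
    start = ts - (ts % slide)  # latest window that contains ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return windows

# Timestamp 13 lands in exactly one tumbling window of size 5...
assert assign_tumbling(13, 5) == [(10, 15)]
# ...but in two overlapping sliding windows of size 10, slide 5.
assert assign_sliding(13, 10, 5) == [(10, 20), (5, 15)]
```

This is why elements are never duplicated across tumbling windows but may be duplicated across sliding windows.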
Evictor is mainly used to evict elements from the window, either before or after the user function runs. For a more detailed description, refer to the evictBefore and evictAfter methods of org.apache.flink.streaming.api.windowing.evictors.Evictor. Flink provides the following three general-purpose evictors:
CountEvictor keeps a specified number of elements and evicts the rest.
DeltaEvictor decides whether to evict an element by evaluating a user-supplied DeltaFunction against a preset threshold.
TimeEvictor takes an interval threshold and evicts all elements that no longer fall within max_ts - interval, where max_ts is the largest timestamp in the window.
The Evictor is optional; if none is specified, there is no default evictor.
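As a rough Python model of the eviction semantics just described (not the Flink Evictor interface; the element layout is assumed):

```python
def count_evict(elements, max_count):
    """CountEvictor semantics: keep only the last `max_count` elements,
    evicting the oldest ones from the front of the window."""
    excess = len(elements) - max_count
    return elements[excess:] if excess > 0 else elements

def time_evict(elements_with_ts, interval):
    """TimeEvictor semantics: drop elements whose timestamp no longer
    falls within (max_ts - interval, max_ts], where max_ts is the
    largest timestamp currently in the window."""
    max_ts = max(ts for ts, _ in elements_with_ts)
    return [(ts, v) for ts, v in elements_with_ts if ts > max_ts - interval]

assert count_evict([1, 2, 3, 4, 5], 3) == [3, 4, 5]
assert time_evict([(1, "a"), (5, "b"), (9, "c")], 5) == [(5, "b"), (9, "c")]
```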
Trigger is used to decide whether a window needs to be fired. Each WindowAssigner comes with a default Trigger; if the default does not meet your needs, you can define your own class that extends Trigger. The Trigger API and the meaning of each method:
onElement() is called for each element added to the window.
onEventTime() is called when a registered event-time timer fires.
onProcessingTime() is called when a registered processing-time timer fires.
onMerge() merges the state of two triggers, e.g. when session windows merge.
clear() is called when the window is destroyed.
The first three of these methods return a TriggerResult, which can take one of the following values:
CONTINUE does nothing.
FIRE triggers the window computation.
PURGE clears the elements of the window and destroys the window.
FIRE_AND_PURGE triggers the window computation, then destroys the window.
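To illustrate how a trigger's return values drive the window, here is a simplified count-based trigger in Python. Note that Flink's built-in CountTrigger fires without purging; this sketch returns FIRE_AND_PURGE purely for illustration, so the count restarts after each firing:

```python
from enum import Enum

class TriggerResult(Enum):
    CONTINUE = "continue"
    FIRE = "fire"
    PURGE = "purge"
    FIRE_AND_PURGE = "fire_and_purge"

class CountTrigger:
    """Fires once the window has accumulated `max_count` elements,
    then purges so that counting starts over."""
    def __init__(self, max_count):
        self.max_count = max_count
        self.count = 0  # trigger state; in Flink this is kept per window

    def on_element(self, element):
        self.count += 1
        if self.count >= self.max_count:
            self.count = 0
            return TriggerResult.FIRE_AND_PURGE
        return TriggerResult.CONTINUE

trigger = CountTrigger(3)
results = [trigger.on_element(e) for e in range(5)]
# the third element fires and purges; the last two accumulate again
```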
2.2 Time & Watermark
After understanding the above, we still have two concepts to clarify for time-driven windows: Time and Watermark.
We know that Time is a very important concept in a distributed environment. In Flink, Time can be divided into three types: Event-Time, Processing-Time, and Ingestion-Time. The relationship between the three is shown in the figure below:
(Figure: Event Time, Ingestion Time, Processing Time)
Event-Time is the time at which the event occurred, Processing-Time is the time at which the message is processed (wall-clock time), and Ingestion-Time is the time at which the message enters the system.
We can set the Time type in Flink as follows:
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime); // set to use ProcessingTime
After learning about Time, we also need to know the concepts related to Watermark.
Consider an example: an App records every user click and sends the log back to the server (if the network is bad, it saves the log locally and resends it later). User A operates the App at 11:02 and user B at 11:03, but user A's network is unstable and the log upload is delayed, so the server first receives B's 11:03 message and only then A's 11:02 message: the messages arrive out of order.
So how can we make sure, when an event-time window is destroyed, that all of its data has already been processed? This is what watermarks do. A watermark carries a monotonically increasing timestamp t; watermark(t) asserts that all data with a timestamp not greater than t has arrived, and that no data with a timestamp less than or equal to t will arrive in the future, so the window can safely be triggered and destroyed. The figure below gives an example of watermarks in an out-of-order stream.
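A common approximate watermark strategy is "bounded out-of-orderness": trail the largest timestamp seen so far by a fixed delay. A minimal Python sketch of that idea (the delay value and the timestamps, written here as HHMM integers, are illustrative, not Flink's API):

```python
class BoundedOutOfOrdernessWatermark:
    """Emit watermark = max_timestamp_seen - max_delay, so the watermark
    only ever moves forward and asserts 'no more elements <= watermark'."""
    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_ts = float("-inf")

    def on_event(self, ts):
        self.max_ts = max(self.max_ts, ts)  # late events cannot move it back
        return self.max_ts - self.max_delay  # current watermark

wm = BoundedOutOfOrdernessWatermark(max_delay=2)
# out-of-order stream from the example: B's 11:03 arrives before A's 11:02
marks = [wm.on_event(ts) for ts in [1103, 1102, 1105]]
# marks == [1101, 1101, 1103]: A's 11:02 is still above the watermark,
# so it is not considered late and can be placed in its window
```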
2.3 Late Data
The watermark above lets us handle out-of-order data, but in the real world a perfect watermark is unobtainable: either it cannot be computed at all, or computing it is too expensive. In practice we use an approximate watermark, so after watermark(t) is generated there is still a small probability of receiving data with timestamps before t. Flink calls such data "late elements", and we can specify the maximum allowed lateness on a window (the default is 0) via allowedLateness().
After allowedLateness is set, late data can still trigger the window and produce output; using Flink's side-output mechanism, we can additionally capture the late data that arrives beyond the allowed lateness.
Note that after allowedLateness is set, late data may trigger windows again, and for session windows this may cause windows to merge, producing unexpected behavior.
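How an element is routed relative to the watermark, the window end, and the allowed lateness can be modeled roughly like this (a simplification of the checks Flink's WindowOperator performs; the function name is made up for illustration):

```python
def route_element(current_watermark, window_end, allowed_lateness):
    """Decide what happens to an element belonging to a window that
    ends at `window_end`, given where the watermark currently stands."""
    if current_watermark < window_end:
        return "on-time"            # normal window processing
    if current_watermark < window_end + allowed_lateness:
        return "late-but-accepted"  # may re-trigger the window
    return "side-output"            # only if side output is configured;
                                    # otherwise the element is dropped

assert route_element(current_watermark=8,  window_end=10, allowed_lateness=5) == "on-time"
assert route_element(current_watermark=12, window_end=10, allowed_lateness=5) == "late-but-accepted"
assert route_element(current_watermark=20, window_end=10, allowed_lateness=5) == "side-output"
```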
3. Internal implementation of Window
Before discussing the internal implementation of Window, let's review the life cycle of a Window with the figure below.
When each element arrives, the WindowAssigner assigns it to the corresponding window(s). When a window is triggered, its contents are handed to the Evictor (skipped if no Evictor is set) and then processed by the UserFunction. WindowAssigner, Trigger, and Evictor were discussed above; the UserFunction is the user-written window computation.
There is one more issue to discuss throughout the process: state storage in Window. We know that Flink supports Exactly Once processing semantics, so what's the difference between state storage in Window and normal state storage?
First, the short answer: at the interface level there is no difference, but each Window belongs to its own namespace, whereas non-Window state belongs to VoidNamespace; State plus Checkpoint then guarantees the Exactly-Once semantics of the data. This can be seen in org.apache.flink.streaming.runtime.operators.windowing.WindowOperator, where the window contents are accessed with the window itself as the state namespace.
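The namespace idea can be sketched with a toy keyed state store in Python: window state uses the window itself as the namespace, while ordinary keyed state all lives under a single void namespace (the class and names here are illustrative, not Flink's actual classes):

```python
VOID_NAMESPACE = "VoidNamespace"  # stand-in for Flink's VoidNamespace

class KeyedStateStore:
    """Toy model: state is addressed by (key, namespace, state name).
    Window contents for the same key stay separate because each window
    is its own namespace."""
    def __init__(self):
        self._state = {}

    def add(self, key, namespace, name, value):
        self._state.setdefault((key, namespace, name), []).append(value)

    def get(self, key, namespace, name):
        return self._state.get((key, namespace, name), [])

store = KeyedStateStore()
# window contents: same key, different window namespaces (start, end)
store.add("user-a", (0, 5), "window-contents", 1)
store.add("user-a", (5, 10), "window-contents", 3)
# ordinary (non-window) keyed state lives under the void namespace
store.add("user-a", VOID_NAMESPACE, "counter", 42)
```

Because the window is part of the state address, firing or purging one window cannot touch the state of another window or of ordinary keyed state.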
As we can see from the above, the elements in Window are also maintained through state, and then the Exactly Once semantics are guaranteed by the Checkpoint mechanism.
So far, everything related to Time and Window has been covered: why Window exists; the three core components of Window (WindowAssigner, Trigger, and Evictor); how out-of-order data is handled, whether lateness is allowed, and how late data is dealt with; and finally the whole data flow inside Window and how Exactly-Once semantics are guaranteed.
For more information, please visit the Apache Flink Chinese Community website.