2025-04-16 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/01 Report
This article introduces the Flink client, window time, and watermarks in detail. It should be a useful reference; interested readers are encouraged to read on.
Flink client
Flink Client:
Scala shell
SQL Client
Command line
RESTful API
Web
Command line description:
Standalone mode
```shell
# View the full command description
flink -h
# View the parameters of the run command
flink run -h
# Start a standalone cluster
bin/start-cluster.sh
# Run a job (detached)
flink run -d examples/streaming/TopSpeedWindowing.jar
# View the task list
flink list -m 127.0.0.1:8081
# Stop the specified task (its sources must implement StoppableFunction)
flink stop -m 127.0.0.1:8081 d67420e52bd051fae2fddbaa79e046bb
# Cancel the specified task; if state.savepoints.dir is configured in
# conf/flink-conf.yaml a savepoint is saved, otherwise no savepoint is taken
flink cancel -m 127.0.0.1:8081 5e20cb6b0f357591171dfcca2eea09de
# Trigger a savepoint
flink savepoint -m 127.0.0.1:8081 ec53edcfaeb96b2a5dadbfbe5ff62bbb /tmp/savepoint
# Launch from the specified savepoint
flink run -d -s /tmp/savepoint/savepoint-f049ff-24ec0d3e0dc7
# View the JSON execution plan (StreamGraph) of a job
flink info examples/streaming/TopSpeedWindowing.jar
# Copy-paste the JSON output into http://flink.apache.org/visualizer/
```
Yarn Per-Job mode (one Flink cluster per job)
```shell
# Single-task attached mode: the client waits until the job ends before exiting;
# YARN shows the task as "Flink session cluster"
flink run -m yarn-cluster ./examples/batch/WordCount.jar
# Single-task detached mode: the client exits right after submitting the job;
# YARN shows the task as "Flink per-job cluster"
flink run -yd -m yarn-cluster ./examples/streaming/TopSpeedWindowing.jar
```
Yarn Session mode (multiple jobs run in one Flink cluster)
```shell
# Start a session: each TaskManager gets 2 GB of memory and 3 slots.
# Attached mode by default; add -d for detached mode.
# YARN shows it as "Flink session cluster"
yarn-session.sh -tm 2048 -s 3
# Submit a task; it goes to the newly started session according to the
# contents of /tmp/.yarn-properties-admin
flink run ./examples/batch/WordCount.jar
# Submit to a specific session via the -yid parameter
flink run -d -p 30 -m yarn-cluster -yid application_1532332183347_0708 ./examples/streaming/TopSpeedWindowing.jar
```
The difference between Savepoint and Checkpoint:
A checkpoint is incremental, so each one is fast and involves little data. Once enabled in the program it is triggered automatically; the user does not need to be aware of it. Checkpoints are also used automatically on job failover, without the user specifying one.
A savepoint is a full snapshot, so each one takes longer and involves more data, and it must be triggered explicitly by the user. Savepoints are typically used for program version upgrades, bug fixes, A/B tests, and similar scenarios, and the user must specify which savepoint to restore from.
REST API submission: https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/rest_api.html
Flink Window&Time
A window cuts an infinite stream into finite slices and is the core component for processing unbounded streams: the stream is split into buckets, and computation is performed per bucket.
Window in Flink is divided into two types: time-driven (Time Window) and data-driven (Count Window).
The window method takes a WindowAssigner parameter. The WindowAssigner is responsible for distributing each input element to the correct window (one element may be assigned to several windows at once). Flink provides several general WindowAssigners: tumbling windows (no overlap between windows), sliding windows (windows may overlap), session windows, and global windows. To customize the distribution policy, implement a class that inherits from WindowAssigner.
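As an illustration only, the assignment logic of the two time-based assigners can be sketched in plain Python. The function names here are invented for this sketch; Flink's real assigners are Java classes such as TumblingEventTimeWindows and SlidingEventTimeWindows.

```python
def tumbling_window(ts_ms: int, size_ms: int) -> tuple:
    """Map a timestamp to the single tumbling window [start, end) containing it."""
    start = ts_ms - (ts_ms % size_ms)
    return (start, start + size_ms)

def sliding_windows(ts_ms: int, size_ms: int, slide_ms: int) -> list:
    """Map a timestamp to every sliding window [start, end) containing it.
    Each element belongs to size/slide windows, so windows overlap."""
    windows = []
    start = ts_ms - (ts_ms % slide_ms)   # start of the latest matching window
    while start > ts_ms - size_ms:
        windows.append((start, start + size_ms))
        start -= slide_ms
    return windows

# An event at t = 7 s with 5 s tumbling windows lands in [5 s, 10 s).
print(tumbling_window(7_000, 5_000))          # (5000, 10000)
# With a 10 s size and 5 s slide it belongs to two overlapping windows.
print(sliding_windows(7_000, 10_000, 5_000))  # [(5000, 15000), (0, 10000)]
```

Note how the tumbling case returns exactly one window while the sliding case returns several, which matches the "elements may be duplicated between windows" distinction above.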
Window life cycle
In short, a window is created as soon as the first element belonging to it arrives, and it is deleted completely once the time (event time or processing time) passes its end timestamp plus the user-specified allowed lateness.
Window component
Window assigner: used to determine which window an element is assigned to
Trigger: determines when a window is ready to be computed or removed. A trigger policy might be "when the number of window elements is greater than 4" or "when the watermark passes the end of the window".
Evictor: removes elements from the window after the trigger fires and before and/or after the window function is applied.
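To make the division of labor concrete, here is a minimal, hypothetical sketch in Python. The names CountTrigger and DropOldestEvictor are invented for illustration; real Flink triggers and evictors implement the Trigger and Evictor interfaces in Java/Scala.

```python
class CountTrigger:
    """Fire once a window holds more than `threshold` elements."""
    def __init__(self, threshold: int):
        self.threshold = threshold

    def should_fire(self, window_elements: list) -> bool:
        return len(window_elements) > self.threshold

class DropOldestEvictor:
    """After the trigger fires, keep only the newest `keep` elements
    before the window function is applied."""
    def __init__(self, keep: int):
        self.keep = keep

    def evict(self, window_elements: list) -> list:
        return window_elements[-self.keep:]

window = [1, 2, 3, 4, 5]          # elements the assigner put in this window
trigger = CountTrigger(4)
evictor = DropOldestEvictor(3)
if trigger.should_fire(window):
    # The window function (here: sum) only sees the surviving elements [3, 4, 5].
    print(sum(evictor.evict(window)))  # 12
```

The pipeline order matters: the assigner fills the window, the trigger decides when to fire, and the evictor filters the contents just around the window function call.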
Window classification
Tumbling window (no overlap between windows)
Sliding window (windows may overlap)
Session window (defined by a gap of inactivity)
Watermark
The watermark is a mechanism proposed by Apache Flink for event-time window computation, and it is essentially a kind of timestamp. It is used to handle out-of-order or delayed data, usually by combining the watermark mechanism with windows (the watermark is what triggers the window computation).
Window trigger condition
1. For out-of-order and ordered (normal) data:
Watermark timestamp >= window end time
Data exists in [window_start_time, window_end_time)
2. For late elements (too much late data):
Event time > watermark timestamp
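Condition 1 above can be sketched as a small Python predicate. This is a hypothetical illustration of the rule, not Flink's internal code.

```python
def window_fires(watermark_ms: int, window_start_ms: int,
                 window_end_ms: int, element_timestamps: list) -> bool:
    """A window fires once the watermark passes its end time AND the
    window actually contains data in [window_start, window_end)."""
    has_data = any(window_start_ms <= t < window_end_ms
                   for t in element_timestamps)
    return watermark_ms >= window_end_ms and has_data

# Window [0 s, 10 s): watermark only at 9 s, so it does not fire yet.
print(window_fires(9_000, 0, 10_000, [1_000, 4_000]))   # False
# Watermark reaches 10 s: the window fires.
print(window_fires(10_000, 0, 10_000, [1_000, 4_000]))  # True
```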
Watermark generation methods
1. Punctuated watermark
A punctuated watermark is generated when special marker events appear in the data stream. With this approach, window triggering has nothing to do with time; it depends on when the marker events are received.
In production, the punctuated mode generates a large number of watermarks in high-TPS scenarios, which puts some pressure on downstream operators, so it is chosen only when real-time requirements are very strict.
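The idea can be sketched in a few lines of Python. This is a hypothetical illustration; in Flink this role is played by a punctuated WatermarkGenerator (or the older AssignerWithPunctuatedWatermarks interface).

```python
def punctuated_watermarks(events):
    """events: iterable of (timestamp_ms, is_marker) pairs.
    Emit a watermark timestamp whenever a marker event is seen."""
    for ts, is_marker in events:
        if is_marker:
            yield ts

stream = [(1_000, False), (2_000, True), (3_500, False), (4_000, True)]
print(list(punctuated_watermarks(stream)))  # [2000, 4000]
```

With one marker per event in a high-TPS stream, this would emit one watermark per record, which is exactly the downstream-pressure problem described above.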
2. Periodic watermark
A watermark is produced periodically (after a time interval or a certain number of records). The interval at which the watermark advances is set by the user; messages keep flowing in between two watermark emissions, and the new watermark can be computed from that data.
In production, the periodic mode must combine both dimensions, elapsed time and accumulated record count, when generating watermarks, otherwise extreme cases can cause very large delays.
For example, the simplest watermark algorithm takes the largest event time seen so far, but this approach is crude: it has little tolerance for out-of-order events and easily produces many late events.
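A common refinement of "take the max so far" is to let the watermark trail the maximum event time by a fixed out-of-orderness bound. Below is a minimal Python sketch with an invented class name; Flink ships this strategy as WatermarkStrategy.forBoundedOutOfOrderness.

```python
class PeriodicWatermark:
    """Watermark = max event time seen so far minus a fixed bound.
    A bound of 0 degenerates to the naive 'max so far' strategy."""
    def __init__(self, max_out_of_orderness_ms: int):
        self.bound = max_out_of_orderness_ms
        self.max_ts = float("-inf")

    def on_event(self, ts_ms: int):
        self.max_ts = max(self.max_ts, ts_ms)

    def current_watermark(self):
        # In real Flink, the framework calls this periodically.
        return self.max_ts - self.bound

wm = PeriodicWatermark(2_000)
for ts in [1_000, 5_000, 3_000]:   # 3_000 arrives out of order
    wm.on_event(ts)
print(wm.current_watermark())      # 3000: events up to 3 s are considered complete
```

The larger the bound, the more out-of-orderness is tolerated, at the cost of later window firing.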
Late event
Lateness is inevitable: the window may already be closed by the time an element arrives.
1. There are three common ways to handle it:
Reactivate the closed window and recalculate to correct the results.
Collect the late events and deal with them separately.
Treat the late event as an error message and discard it.
PS: Flink uses the third approach (discarding) by default, and also supports side output and allowed lateness.
2. Side Output
The side output mechanism puts late events into a separate branch of the data stream, which serves as a by-product of the window's result so that users can fetch and handle those events specially.
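The routing decision can be sketched like this (a hypothetical helper; in Flink the late stream is obtained via sideOutputLateData with an OutputTag):

```python
def route(event_timestamps, watermark_ms):
    """Send events older than the watermark to a late-data branch
    instead of dropping them."""
    main, late = [], []
    for ts in event_timestamps:
        (late if ts < watermark_ms else main).append(ts)
    return main, late

main, late = route([9_000, 12_000, 8_000], watermark_ms=10_000)
print(main)  # [12000]
print(late)  # [9000, 8000]
```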
3. Allowed lateness
The allowed-lateness mechanism lets the user set a maximum allowed lateness. Flink keeps the window state after the window closes, until the allowed-lateness period expires; during that period, late events are not discarded but by default trigger a recomputation of the window. Because keeping window state costs extra memory, and because with the ProcessWindowFunction API every late event may trigger a full recomputation of the window, the allowed lateness should not be set too long and late events should not be too numerous; otherwise, consider slowing the advance of the watermark or adjusting the algorithm.
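A hypothetical sketch of the retention rule described above (illustration only; in Flink the knob is allowedLateness on the windowed stream):

```python
def state_alive(watermark_ms, window_end_ms, allowed_lateness_ms):
    """Window state is kept until the watermark passes end + allowed lateness."""
    return watermark_ms <= window_end_ms + allowed_lateness_ms

def handle_late_event(watermark_ms, window_end_ms, allowed_lateness_ms):
    if state_alive(watermark_ms, window_end_ms, allowed_lateness_ms):
        return "recompute window"    # state still held, fire the window again
    return "drop or side output"     # state discarded, the event is truly late

# Window [0 s, 10 s) with 5 s allowed lateness; watermark at 12 s.
print(handle_late_event(12_000, 10_000, 5_000))  # recompute window
# Watermark at 16 s is past 10 s + 5 s, so the state is gone.
print(handle_late_event(16_000, 10_000, 5_000))  # drop or side output
```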
Summary
The window's job is to collect the stream into buckets so data can be processed periodically.
The watermark's job is to tolerate out-of-order data: within event time we can never be certain that all the expected data has arrived.
AllowedLateness delays the closing of the window for a further period of time.
SideOutput is the last line of defense: once the specified window has completely closed, any remaining late data is put into the side output stream.
The above is the full content of "Flink Client, Window Time & WaterMarker". Thank you for reading, and I hope it helps.