Architecture
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale.
Here, we explain the important aspects of Flink's architecture.

Processing unbounded and bounded data
Any kind of data is produced as a stream of events. Credit card transactions, sensor measurements, machine logs, and user interactions on websites or mobile applications: all of this data is generated as a stream.

Data can be processed as an unbounded or a bounded stream.
Unbounded streams have a start but no defined end. They do not terminate and keep providing data as it is generated. Unbounded streams must be processed continuously, i.e., events must be handled promptly after they are ingested. It is not possible to wait for all input data to arrive, because the input is unbounded and will not be complete at any point in time. Processing unbounded data often requires that events are ingested in a specific order, such as the order in which they occurred, so that the completeness of results can be reasoned about.

Bounded streams have a defined start and end. A bounded stream can be processed by ingesting all of its data before performing any computations. Ordered ingestion is not required, because a bounded data set can always be sorted. Processing of bounded streams is also known as batch processing.
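As a minimal sketch (not part of the original text), both kinds of stream are consumed through the same API; the host, port, and file path below are placeholder assumptions:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedVsUnbounded {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Unbounded stream: a socket keeps producing events with no defined end.
        DataStream<String> unbounded = env.socketTextStream("localhost", 9999);

        // Bounded stream: a file has a defined start and end.
        DataStream<String> bounded = env.readTextFile("file:///tmp/events.txt");

        unbounded.print();
        bounded.print();
        env.execute("bounded-vs-unbounded");
    }
}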
Apache Flink excels at processing both unbounded and bounded data sets. Precise control of time and state enables Flink's runtime to run any kind of application on unbounded streams. Bounded streams are internally processed by algorithms and data structures that are specifically designed for fixed-sized data sets, yielding excellent performance.
Deploy applications anywhere

Apache Flink is a distributed system and requires compute resources in order to execute applications. Flink integrates with all common cluster resource managers, such as Hadoop YARN, Apache Mesos, and Kubernetes, but can also be run as a standalone cluster.
Flink is designed to work well with each of the previously listed resource managers. This is achieved by resource-manager-specific deployment modes that allow Flink to interact with each resource manager in its idiomatic way.
When deploying a Flink application, Flink automatically identifies the required resources based on the application's configured parallelism and requests them from the resource manager. In case of a failure, Flink replaces the failed container by requesting new resources. All communication to submit or control an application happens via REST calls, which eases the integration of Flink into many environments.
Run applications at any scale
Flink is designed to run stateful streaming applications at any scale. Applications are parallelized into possibly thousands of tasks that are distributed and concurrently executed in a cluster. Therefore, an application can leverage virtually unlimited amounts of CPU, main memory, disk, and network IO. Moreover, Flink easily maintains very large application state. Its asynchronous and incremental checkpointing algorithms ensure minimal impact on processing latency while guaranteeing exactly-once state consistency.
Users have reported impressive scalability numbers for Flink applications running in their production environments, such as:

- applications processing multiple trillions of events per day,
- applications maintaining multiple terabytes of state, and
- applications running on thousands of CPU cores.
Stateful Flink applications are optimized for local state access. Task state is always maintained in memory or, if the state size exceeds the available memory, in access-efficient on-disk data structures. Hence, tasks perform all computations by accessing local, usually in-memory, state, which yields very low processing latency. Flink guarantees exactly-once state consistency in case of failures by periodically and asynchronously checkpointing the local state to durable storage.
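As a hedged sketch of how this is typically configured in the Flink 1.x DataStream API (the checkpoint interval and the HDFS checkpoint path are placeholder assumptions, and the RocksDB backend requires the flink-statebackend-rocksdb dependency):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Draw a consistent, asynchronous checkpoint of the state every 60 seconds.
        env.enableCheckpointing(60_000);

        // Keep state in RocksDB on local disk; 'true' enables incremental checkpoints,
        // so only state changes since the last checkpoint are persisted.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));
    }
}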
Applications
Apache Flink is a framework for stateful computations over unbounded and bounded data streams. Flink provides multiple APIs at different levels of abstraction and offers dedicated libraries for common use cases.
Here, we introduce Flink's easy-to-use and expressive APIs and libraries.
Building blocks for streaming applications
The kinds of applications that can be built with and executed by a stream processing framework are defined by how well the framework controls streams, state, and time. In the following sections, we describe these building blocks of streaming applications and explain how Flink handles them.
Streams
Obviously, streams are a fundamental aspect of stream processing. However, streams can have different characteristics that affect how a stream can and should be processed. Flink is a versatile processing framework that can handle any kind of stream.
- Bounded and unbounded streams: streams can be unbounded or bounded, i.e., fixed-sized data sets. Flink has sophisticated features to process unbounded streams, but also dedicated operators to efficiently process bounded streams.
- Real-time and recorded streams: all data is generated as a stream, and there are two ways to process it. It can be processed in real time as it is generated, or the stream can be persisted to a storage system, such as a file system or object store, and processed later. Flink applications can process recorded or real-time streams.
State

Every non-trivial streaming application is stateful. Only applications that apply transformations to individual events do not require state. Any application that runs basic business logic needs to remember events or intermediate results in order to access them at a later point in time, for example when the next event is received or after a specific duration.
Application state is a first-class citizen in Flink. You can see this by looking at all the features that Flink provides in the context of state handling.
- Multiple state primitives: Flink provides state primitives for different data structures, such as atomic values, lists, or maps. Developers can choose the state primitive that is most efficient for the access pattern of their function (see the sketch after this list).
- Pluggable state backends: application state is managed in and checkpointed by a pluggable state backend. Flink features different state backends that store state in memory or in RocksDB, an efficient embedded on-disk key-value store. Custom state backends can be plugged in as well.
- Exactly-once state consistency: Flink's checkpointing and recovery algorithms guarantee the consistency of application state in case of a failure. Hence, failures are handled transparently and do not affect the correctness of an application.
- Very large state: Flink is able to maintain application state of several terabytes due to its asynchronous and incremental checkpoint algorithm.
- Scalable applications: Flink supports scaling of stateful applications by redistributing the state to more or fewer worker nodes.
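A minimal sketch of the three state primitives inside a rich function (the state names and types are illustrative assumptions, not from the original text):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class StatePrimitivesExample extends RichFlatMapFunction<String, String> {

    private ValueState<Long> count;        // one atomic value per key
    private ListState<String> recent;      // a list of elements per key
    private MapState<String, Long> byType; // a map per key

    @Override
    public void open(Configuration conf) {
        count = getRuntimeContext().getState(
            new ValueStateDescriptor<>("count", Long.class));
        recent = getRuntimeContext().getListState(
            new ListStateDescriptor<>("recent", String.class));
        byType = getRuntimeContext().getMapState(
            new MapStateDescriptor<>("byType", String.class, Long.class));
    }

    @Override
    public void flatMap(String value, Collector<String> out) throws Exception {
        Long c = count.value();
        count.update(c == null ? 1L : c + 1); // update the atomic value
        recent.add(value);                    // append to the list
        byType.put(value, count.value());     // upsert into the map
        out.collect(value);
    }
}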
Time

Time is another important ingredient of streaming applications. Most event streams have inherent time semantics because each event is produced at a specific point in time. Moreover, many common stream computations are based on time, such as window aggregations, sessionization, pattern detection, and time-based joins. An important aspect of stream processing is how an application measures time, i.e., the difference between event time and processing time.
Flink provides a rich set of time-related features.
- Event-time mode: applications that process streams with event-time semantics compute results based on the timestamps of the events. Thereby, event-time processing yields accurate and consistent results regardless of whether recorded or real-time streams are processed.
- Watermark support: Flink employs watermarks to reason about time in event-time applications. Watermarks are also a flexible mechanism to trade off the latency and completeness of results (a sketch follows this list).
- Late data handling: when processing streams in event-time mode with watermarks, it can happen that a computation completes before all associated events have arrived. Such events are called late events. Flink features multiple options to handle late events, such as rerouting them via side outputs and updating previously completed results.
- Processing-time mode: in addition to its event-time mode, Flink also supports processing-time semantics, where computations are triggered by the wall-clock (system) time of the processing machine. The processing-time mode is suitable for applications with strict low-latency requirements that can tolerate approximate results.
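A brief sketch of enabling event time with watermarks in the Flink 1.x DataStream API (the five-second out-of-orderness bound and the Click type with its clicktime field are illustrative assumptions):

import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Use event-time semantics instead of the default processing time.
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

DataStream<Click> clicks = ...; // some source of Click events

// Generate watermarks that allow events to arrive up to 5 seconds out of order;
// events arriving later than that are treated as late events.
DataStream<Click> withTimestamps = clicks.assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor<Click>(Time.seconds(5)) {
        @Override
        public long extractTimestamp(Click click) {
            return click.clicktime; // event timestamp in epoch milliseconds
        }
    });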
Layered APIs

Flink provides three layered APIs. Each API offers a different trade-off between conciseness and expressiveness and targets different use cases.
Let's briefly introduce each API, discuss its applications, and show a code example.
ProcessFunctions
ProcessFunctions are the most expressive function interfaces that Flink offers. Flink provides ProcessFunctions to process individual events from one or two input streams, or events that were grouped into a window. ProcessFunctions provide fine-grained control over time and state. A ProcessFunction can arbitrarily modify its state and register timers that will trigger a callback function in the future. Hence, ProcessFunctions can implement the complex per-event business logic that many stateful event-driven applications require.
The following example shows a KeyedProcessFunction that operates on a KeyedStream and matches START and END events. When a START event is received, the function remembers its timestamp in state and registers a timer for four hours in the future. If an END event is received before the timer fires, the function computes the duration between the END and START events, clears the state, and returns the value. Otherwise, the timer simply fires and clears the state.
package com.longyun.flink.processfuncs;

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/**
 * @author lynnyuan
 * @ClassName com.longyun.flink.processfuncs.StartEndDuration
 * @Description K key, IN input, OUT output
 * @Date 2018-12-03 15:02
 * @Version 1.0
 */
public class StartEndDuration
        extends KeyedProcessFunction<String, Tuple2<String, String>, Tuple2<String, Long>> {

    private ValueState<Long> startTime;

    @Override
    public void open(Configuration configuration) throws Exception {
        // obtain state handle
        startTime = getRuntimeContext().getState(
            new ValueStateDescriptor<>("startTime", Long.class));
    }

    /** Called for each processed event. */
    @Override
    public void processElement(
            Tuple2<String, String> in,
            Context context,
            Collector<Tuple2<String, Long>> out) throws Exception {
        switch (in.f1) {
            case "START":
                // set the start time if we receive a start event
                startTime.update(context.timestamp());
                // register a timer for four hours from the start event
                context.timerService()
                    .registerEventTimeTimer(context.timestamp() + 4 * 60 * 60 * 1000);
                break;
            case "END":
                // emit the duration between start and end event
                Long sTime = startTime.value();
                if (sTime != null) {
                    out.collect(Tuple2.of(in.f0, context.timestamp() - sTime));
                    // clear the state
                    startTime.clear();
                }
                break;
            default:
                break;
        }
    }

    /** Called when a timer fires. */
    @Override
    public void onTimer(
            long timestamp,
            OnTimerContext ctx,
            Collector<Tuple2<String, Long>> out) throws Exception {
        // timeout interval exceeded, clean up the state
        startTime.clear();
    }
}
This example illustrates the expressive power of the KeyedProcessFunction, but also highlights that it is a rather verbose interface.
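For context, a sketch of how such a function might be wired into a pipeline (the events stream of (key, "START"/"END") pairs is an illustrative assumption):

DataStream<Tuple2<String, String>> events = ...;

DataStream<Tuple2<String, Long>> durations = events
    .keyBy(e -> e.f0)                 // key the stream, as a KeyedProcessFunction requires
    .process(new StartEndDuration()); // apply the function defined above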
DataStream API
The DataStream API provides primitives for many common stream processing operations, such as windowing, record-at-a-time transformations, and enriching events by querying an external data store. The DataStream API is available for Java and Scala and is based on functions such as map(), reduce(), and aggregate(). Functions can be defined by extending an interface or as lambda functions.
The following example shows how to sessionize a clickstream and count the number of clicks per session.
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// a stream of website clicks
DataStream<Click> clicks = ...;

DataStream<Tuple2<String, Long>> result = clicks
    // project clicks to userId and add a 1 for counting
    .map(
        // define function by implementing the MapFunction interface
        new MapFunction<Click, Tuple2<String, Long>>() {
            @Override
            public Tuple2<String, Long> map(Click click) {
                return Tuple2.of(click.userId, 1L);
            }
        })
    // key by userId (field 0)
    .keyBy(0)
    // define session window with 30 minute gap
    .window(EventTimeSessionWindows.withGap(Time.minutes(30L)))
    // count clicks per session; define function as a lambda function
    .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));
SQL & Table API

Flink features two relational APIs, the Table API and SQL. Both APIs are unified APIs for batch and stream processing, i.e., queries are executed with the same semantics on unbounded real-time streams or bounded recorded streams and produce the same results. The Table API and SQL leverage Apache Calcite for parsing, validation, and query optimization. They can be seamlessly integrated with the DataStream and DataSet APIs and support user-defined scalar, aggregate, and table-valued functions.
Flink's relational APIs are designed to ease the definition of data analytics, data pipelining, and ETL applications.
The following example shows the SQL query to sessionize a clickstream and count the number of clicks per session. It is the same use case as in the DataStream API example.
SELECT userId, COUNT(*)
FROM clicks
GROUP BY SESSION(clicktime, INTERVAL '30' MINUTE), userId
Libraries

Flink features several libraries for common data processing use cases. The libraries are typically embedded in an API and not fully self-contained. Hence, they can benefit from all the features of the API and be integrated with other libraries.
- Complex event processing (CEP): pattern detection is a very common use case for event stream processing. Flink's CEP library provides an API to specify patterns of events (think of regular expressions or state machines). The CEP library is integrated with Flink's DataStream API, so that patterns are evaluated on DataStreams (see the sketch after this list). Applications of the CEP library include network intrusion detection, business process monitoring, and fraud detection.
- DataSet API: the DataSet API is Flink's core API for batch processing applications. Its primitives include map, reduce, (outer) join, co-group, and iterate. All operations are backed by algorithms and data structures that operate on serialized data in memory and spill to disk if the data size exceeds the memory budget. The data processing algorithms of the DataSet API are inspired by traditional database operators, such as hybrid hash-join and external merge-sort.
- Gelly: Gelly is a library for scalable graph processing and analysis. Gelly is implemented on top of, and integrated with, the DataSet API. Hence, it benefits from the DataSet API's scalable and robust operators. Gelly features built-in algorithms, such as label propagation, triangle enumeration, and PageRank, but also provides a Graph API that eases the implementation of custom graph algorithms.
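A hedged sketch of the CEP pattern API (the Event type with its type field and the four-hour window are illustrative assumptions):

import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

DataStream<Event> events = ...; // some stream of events

// A START event followed by an END event within four hours.
Pattern<Event, ?> pattern = Pattern.<Event>begin("start")
    .where(new SimpleCondition<Event>() {
        @Override
        public boolean filter(Event e) {
            return "START".equals(e.type);
        }
    })
    .followedBy("end")
    .where(new SimpleCondition<Event>() {
        @Override
        public boolean filter(Event e) {
            return "END".equals(e.type);
        }
    })
    .within(Time.hours(4));

// Evaluate the pattern on the stream; matches can then be selected and processed.
PatternStream<Event> matches = CEP.pattern(events, pattern);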
Operations

Apache Flink is a framework for stateful computations over unbounded and bounded data streams. Since many streaming applications are designed to run continuously with minimal downtime, a stream processor must provide excellent failure recovery, as well as tooling to monitor and maintain applications while they are running.
Apache Flink puts a strong focus on the operational aspects of stream processing. Here, we explain Flink's failure recovery mechanism and present its features to manage and supervise running applications.
Run applications around the clock
Machine and process failures are ubiquitous in distributed systems. A distributed stream processor like Flink must recover from failures in order to run streaming applications around the clock. Obviously, this means not only restarting an application after a failure, but also ensuring that its internal state remains consistent, so that the application can continue processing as if the failure had never happened.
Flink provides several features to ensure that applications keep running and remain consistent:
- Consistent checkpoints: Flink's recovery mechanism is based on consistent checkpoints of the application's state. In case of a failure, the application is restarted and its state is loaded from the latest checkpoint. In combination with resettable stream sources, this feature can guarantee exactly-once state consistency.
- Efficient checkpoints: checkpointing the state of an application can be quite expensive if the application maintains terabytes of state. Flink can perform asynchronous and incremental checkpoints in order to keep the impact of checkpoints on the application's latency SLAs very small (a configuration sketch follows this list).
- End-to-end exactly-once: Flink features transactional sinks for specific storage systems that guarantee data is written out exactly once, even in case of failures.
- Integration with cluster managers: Flink is tightly integrated with cluster managers, such as Hadoop YARN, Mesos, or Kubernetes. When a process fails, a new process is automatically started to take over its work.
- High-availability setup: Flink features a high-availability mode that eliminates all single points of failure. The HA mode is based on Apache ZooKeeper, a battle-proven service for reliable distributed coordination.
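A hedged configuration sketch for these guarantees in the Flink 1.x DataStream API (the interval, retry count, and delay values are illustrative assumptions):

import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Draw a consistent checkpoint of the application state every 60 seconds.
env.enableCheckpointing(60_000);

// Guarantee exactly-once state consistency on recovery.
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

// On failure, restart up to 3 times, waiting 10 seconds between attempts.
env.setRestartStrategy(
    RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));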
Update, migrate, pause, and resume your applications

Streaming applications that power business-critical services need to be maintained. Bugs need to be fixed, and improvements or new features need to be implemented. However, updating a stateful streaming application is not trivial. In general, one cannot simply stop the application and restart a fixed or improved version, because one cannot afford to lose the state of the application.
Flink's savepoints are a unique and powerful feature that solves the issue of updating stateful applications, along with many other related challenges. A savepoint is a consistent snapshot of an application's state and is therefore very similar to a checkpoint. In contrast to checkpoints, however, savepoints need to be manually triggered and are not automatically removed when an application is stopped. A savepoint can be used to start a state-compatible application and initialize its state. Savepoints enable the following features:
- Application evolution: savepoints can be used to evolve applications. A fixed or improved version of an application can be restarted from a savepoint that was taken from a previous version. It is also possible to start the application from an earlier point in time (given such a savepoint exists) to repair incorrect results produced by a flawed version.
- Cluster migration: using savepoints, applications can be migrated (or cloned) to different clusters.
- Flink version updates: an application can be migrated via a savepoint to run on a new Flink version.
- Application scaling: savepoints can be used to increase or decrease the parallelism of an application.
- A/B testing and what-if scenarios: the performance or quality of two (or more) different versions of an application can be compared by starting both from the same savepoint.
- Pause and resume: an application can be paused by taking a savepoint and then stopping it. At any later point in time, the application can be resumed from the savepoint.
- Archiving: savepoints can be archived to be able to reset the state of an application to an earlier point in time.
Monitor and control your applications

Like any other service, continuously running streaming applications need to be supervised and integrated into an organization's operations infrastructure, i.e., its monitoring and logging services. Monitoring helps to anticipate problems and react ahead of time. Logging enables root-cause analysis of failures. Finally, easily accessible interfaces to control running applications are an important feature as well.
Flink integrates well with many common logging and monitoring services and provides a REST API to control applications and query information.
- Web UI: Flink features a web UI to inspect, monitor, and debug running applications. It can also be used to submit executions or cancel them.
- Logging: Flink implements the popular SLF4J logging interface and integrates with the logging frameworks log4j and logback.
- Metrics: Flink features a sophisticated metrics system to collect and report system and user-defined metrics (a sketch of a user-defined metric follows this list). Metrics can be exported to several reporters, including JMX, Ganglia, Graphite, Prometheus, StatsD, Datadog, and SLF4J.
- REST API: Flink exposes a REST API to submit a new application, take a savepoint of a running application, or cancel an application. The REST API also exposes metadata and collected metrics of running or completed applications.
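A minimal sketch of a user-defined metric (the counter name and the function's input/output types are illustrative assumptions):

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

public class CountingMapper extends RichMapFunction<String, String> {

    private transient Counter eventsProcessed;

    @Override
    public void open(Configuration config) {
        // Register a user-defined counter with Flink's metrics system.
        eventsProcessed = getRuntimeContext()
            .getMetricGroup()
            .counter("eventsProcessed");
    }

    @Override
    public String map(String value) {
        eventsProcessed.inc(); // incremented once per processed record
        return value;
    }
}

Such a metric is then reported alongside Flink's system metrics, for example in the web UI or via any configured metrics reporter.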