What is the concept of Flink streaming


Today I will talk about the concepts behind Flink streaming, which many people may not know much about. To help you understand them better, I have summarized the following content; I hope you get something out of this article.

One, Levels of Abstraction

Flink provides different levels of abstraction to develop streaming / batch applications.

1. Stateful Streaming

At the bottom sits stateful stream processing. It is embedded in the DataStream API through the Process Function. It allows users to freely process events from one or more streams while using consistent, fault-tolerant state. In addition, users can register event-time and processing-time callbacks, allowing programs to implement sophisticated computations.
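A minimal sketch of this lowest level, assuming a Flink 1.12-era Java dependency (the class name and the one-minute timeout are made up for illustration): a KeyedProcessFunction that keeps fault-tolerant per-key state and registers an event-time callback.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Counts events per key and emits the count one minute (event time) after each event.
public class CountWithTimeout extends KeyedProcessFunction<String, String, String> {

    private transient ValueState<Long> count; // consistent, fault-tolerant state

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        Long current = count.value();
        count.update(current == null ? 1L : current + 1);
        // Event-time callback; assumes timestamps/watermarks are assigned upstream.
        ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 60_000);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        out.collect(ctx.getCurrentKey() + " -> " + count.value());
    }
}
```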

2. Core APIs

In practice, most applications do not need the low-level abstraction described above and instead program against the Core APIs: the DataStream API (bounded/unbounded streams) and the DataSet API (bounded datasets). These fluent APIs provide common building blocks for data processing, such as user-specified transformations, joins, aggregations, windows, and state. The data types processed by these APIs are represented as classes in the respective programming languages. The integration of the low-level Process Function with the DataStream API makes it possible to drop down to the lower-level abstraction for specific operations only. The DataSet API additionally offers primitives for bounded datasets, such as loops/iterations.
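For instance, a minimal DataStream sketch (the values are made up; the statements are assumed to run inside a main method that throws Exception):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.fromElements(1, 2, 3, 4)   // a bounded stream, just for illustration
   .map(n -> n * 2)            // user-specified transformation
   .filter(n -> n > 4)         // another common building block
   .print();                   // a simple sink

env.execute("core-api-sketch");
```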

3. Table API

The Table API is a declarative DSL centered on tables, which may be dynamically changing (when representing streams). The Table API follows the (extended) relational model: a Table has a schema attached (similar to tables in relational databases), and the API offers comparable operations such as select, project, join, group-by, aggregate, and so on. Table API programs declaratively define what logical operation should be done rather than specifying exactly how the code for the operation looks. Although the Table API can be extended with various types of user-defined functions, it is less expressive than the Core APIs, but more concise to use (much less code to write).

In addition, Table API programs go through an optimizer that applies optimization rules before execution.

Tables can be seamlessly converted to and from DataStream/DataSet, allowing programs to mix the Table API with the DataStream and DataSet APIs.
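A small sketch of the declarative style (the table name "Orders" and its columns are assumptions made for illustration):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import static org.apache.flink.table.api.Expressions.$;

TableEnvironment tEnv = TableEnvironment.create(
        EnvironmentSettings.newInstance().inStreamingMode().build());

// Declares WHAT to compute; the planner decides how to execute it.
Table orders = tEnv.from("Orders");  // assumes a table "Orders" was registered beforehand
Table totals = orders
        .groupBy($("userId"))
        .select($("userId"), $("amount").sum().as("total"));
```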

4. SQL

The highest level of abstraction is SQL. This abstraction is very similar to the Table API in both semantics and expressiveness, but represents programs as SQL query expressions. The SQL abstraction interacts closely with the Table API, and SQL queries can be executed over tables defined in the Table API.
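Continuing the sketch above, the same aggregation expressed at the SQL level (names remain illustrative):

```java
// Register the Table API table under a name, then query it with SQL.
tEnv.createTemporaryView("OrdersView", orders);
Table sqlTotals = tEnv.sqlQuery(
        "SELECT userId, SUM(amount) AS total FROM OrdersView GROUP BY userId");
```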

Two, Programs and Dataflows

The basic building blocks of Flink programs are streams and transformations. Conceptually, a stream is a (potentially never-ending) flow of data records, and a transformation is an operation that takes one or more streams as input and produces one or more output streams as a result.

When executed, Flink programs are mapped to streaming dataflows consisting of streams and transformation operators. Each dataflow starts with one or more sources and ends in one or more sinks. Dataflows resemble arbitrary directed acyclic graphs (DAGs). Although special forms of cycles are permitted via iteration constructs, for simplicity we treat most dataflows as DAGs.

Often there is a one-to-one correspondence between the transformations in a program and the operators in its dataflow. Sometimes, however, one transformation may consist of multiple transformation operators.
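A sketch of how program structure maps to a dataflow: two sources feed one transformation, which fans out to two sinks, forming a small DAG (reusing the env from the earlier sketch; the values are made up):

```java
import org.apache.flink.streaming.api.datastream.DataStream;

DataStream<String> left  = env.fromElements("a", "b");   // source 1
DataStream<String> right = env.fromElements("c");        // source 2

// One transformation consumes both sources; its result feeds two sinks.
DataStream<String> upper = left.union(right).map(String::toUpperCase);
upper.print("sink-1");
upper.print("sink-2");
```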

Three, Parallel Dataflows

Programs in Flink are inherently parallel and distributed. During execution, a stream has one or more stream partitions, and each operator has one or more operator subtasks. The operator subtasks are independent of one another and execute in different threads, possibly on different machines or containers.

The number of operator subtasks is the parallelism of that particular operator. The parallelism of a stream is always the parallelism of its producing operator. Different operators of the same program may have different levels of parallelism.
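A sketch of setting parallelism per job and overriding it per operator (reusing env; the numbers are arbitrary):

```java
env.setParallelism(2);  // default parallelism for all operators of this job

env.fromElements("a", "b", "c")   // fromElements sources always run with parallelism 1
   .map(String::toUpperCase)
   .setParallelism(4)             // this map operator runs as 4 parallel subtasks
   .print();
```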

Streams can transport data between two operators in a one-to-one (or forwarding) pattern, or in a redistributing pattern:

1) One-to-one

One-to-one streams (for example, between a source and a map() operator) preserve the partitioning and ordering of elements. This means that subtask[1] of the map() operator will see the same elements, in the same order, as they were produced by subtask[1] of the source operator.

2) Redistributing

Redistributing streams (for example, between map() and keyBy/window, or between keyBy/window and a sink) change the partitioning of the stream. Each operator subtask sends data to different target subtasks, depending on the selected transformation.

keyBy() repartitions by the hash of the key, while rebalance() repartitions randomly. In a redistributing exchange, ordering among elements is preserved only within each pair of sending and receiving subtasks (for example, subtask[1] of map() and subtask[2] of keyBy/window). So in such an exchange, only the ordering of elements with the same key is guaranteed.
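A sketch of the two exchange patterns (reusing env; the tuples are made up):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;

DataStream<Tuple2<String, Integer>> events =
        env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3));

// keyBy: hash-partitions on the key, so records with equal keys reach the
// same downstream subtask, and per-key ordering is preserved.
events.keyBy(t -> t.f0).sum(1).print();

// rebalance: round-robin redistribution that ignores keys, useful for
// spreading load evenly after a skewed or non-parallel source.
events.rebalance().map(t -> t.f1).print();
```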

Four, Windows

Aggregating events (for example, counts and sums) works differently on streams than in batch processing. For instance, it is impossible to count all elements in a stream, because streams are generally infinite (unbounded). Instead, aggregates on streams (counts, sums, and so on) are scoped by windows, such as "count over the last 5 minutes" or "sum of the last 100 elements."

Windows can be time-driven (for example, every 30 seconds) or data-driven (for example, every 100 elements). One typically distinguishes different types of windows, such as tumbling windows (no overlap), sliding windows (with overlap), and session windows (punctuated by a gap of inactivity).
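Reusing the events stream from the previous sketch, the window types above look like this (the window sizes are arbitrary):

```java
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Time-driven, no overlap: one result per key every 30 seconds.
events.keyBy(t -> t.f0)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(30)))
      .sum(1);

// Time-driven, with overlap: a 1-minute window sliding every 30 seconds.
events.keyBy(t -> t.f0)
      .window(SlidingProcessingTimeWindows.of(Time.minutes(1), Time.seconds(30)))
      .sum(1);

// Data-driven: fires after every 100 elements per key.
events.keyBy(t -> t.f0)
      .countWindow(100)
      .sum(1);

// Session window: closes after 5 minutes of inactivity per key.
events.keyBy(t -> t.f0)
      .window(EventTimeSessionWindows.withGap(Time.minutes(5)))
      .sum(1);
```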

Five, Time

When referring to time in a streaming program (for example, to define windows), one can mean different notions of time:

1. Event Time

Event Time is the time at which an event was created. It is usually described by a timestamp in the event itself. Flink accesses event timestamps via timestamp assigners; a later article will cover timestamp assigners in detail.

2. Ingestion Time

Ingestion time is the time at which an event enters the Flink dataflow.

3. Processing Time

Processing Time is the local time of the machine on which each time-based operator executes.
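A sketch of selecting a time notion and assigning event timestamps (Flink 1.12-era API; MyEvent, its timestampMillis field, and the rawEvents stream are made up for illustration):

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.TimeCharacteristic;

// Choose which notion of time windows and timers use (deprecated in newer
// releases, where event time is the default).
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);      // time the event was created
// TimeCharacteristic.IngestionTime: time the event entered the Flink dataflow
// TimeCharacteristic.ProcessingTime: local clock of the executing operator

// With event time, a timestamp assigner extracts the timestamp and a
// watermark strategy bounds how out-of-order events may be.
DataStream<MyEvent> stamped = rawEvents.assignTimestampsAndWatermarks(
        WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((event, previousTimestamp) -> event.timestampMillis));
```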

Six, Stateful Operations

While many operations in a dataflow simply look at one individual event at a time (an event parser, for example), some operations remember information across multiple events (window operators, for example). These operations are called stateful. The state of a stateful operator is kept in what can be thought of as an embedded key/value store. The state is partitioned and distributed strictly together with the streams that the stateful operator reads. Hence, key/value state is accessible only on keyed streams, i.e. after a keyBy() function, and is restricted to the values associated with the current event's key. Aligning the keys of the stream and the state ensures that all state updates are local operations, guaranteeing consistency without transaction overhead. This alignment also allows Flink to redistribute the state and adjust the stream partitioning transparently.
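A sketch of keyed state (the class and field names are made up): the running sum below is reachable only after keyBy(), and each update touches only the state of the current key.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Maintains a per-key running sum in the embedded key/value store.
public class RunningSum extends RichFlatMapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>> {

    private transient ValueState<Integer> sum;

    @Override
    public void open(Configuration parameters) {
        sum = getRuntimeContext().getState(new ValueStateDescriptor<>("sum", Integer.class));
    }

    @Override
    public void flatMap(Tuple2<String, Integer> in, Collector<Tuple2<String, Integer>> out) throws Exception {
        int current = (sum.value() == null ? 0 : sum.value()) + in.f1;
        sum.update(current);  // a purely local update: no transactions involved
        out.collect(Tuple2.of(in.f0, current));
    }
}
// usage: events.keyBy(t -> t.f0).flatMap(new RunningSum());
```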

Seven, Checkpoints

Flink implements fault tolerance using a combination of stream replay and checkpointing. A checkpoint is associated with a specific point in each input stream along with the corresponding state of each operator. A streaming dataflow can be resumed from a checkpoint while maintaining consistency (exactly-once processing semantics) by restoring the state of the operators and replaying the events from the point of the checkpoint.

The checkpoint interval is a means of trading off fault-tolerance overhead during execution against recovery time (the number of events that must be replayed).
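A sketch of the corresponding configuration (reusing env; the values are arbitrary):

```java
import org.apache.flink.streaming.api.CheckpointingMode;

env.enableCheckpointing(10_000);  // checkpoint every 10 seconds
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);  // limit steady-state overhead
```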

Eight, Batch on Streaming

Flink executes batch programs as a special case of streaming programs, where the streams are bounded (a finite number of elements). A DataSet is treated internally as a stream of data. The concepts above thus apply to batch programs in the same way they apply to streaming programs, with a few exceptions (a minimal DataSet sketch follows the list):

1. Fault tolerance for batch programs does not use checkpointing. Recovery happens by fully replaying the streams. That is possible because the inputs are bounded. This pushes the cost more toward recovery but makes regular processing cheaper, because checkpoints are avoided.

2. Stateful operations in the DataSet API use simplified in-memory/out-of-core data structures rather than key/value indexes.

3. The DataSet API introduces special synchronized (superstep-based) iterations, which are only possible on bounded streams. They will be introduced in a later article.
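The promised minimal DataSet sketch (the API is soft-deprecated in recent Flink releases; the values are made up):

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

// Batch as a special case of streaming: a bounded input, no checkpoints,
// recovery by full re-execution.
ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Integer> numbers = batchEnv.fromElements(1, 2, 3, 4);
numbers.map(n -> n * n).print();  // print() also triggers execution in the DataSet API
```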

Nine, Tasks and Operator Chains

For distributed execution, Flink chains operator subtasks together into tasks. Each task is executed by one thread. Chaining operators together into tasks is a useful optimization: it reduces the overhead of thread-to-thread handover and buffering, and it increases overall throughput while decreasing latency. The chaining behavior can be configured (detailed in a later article).

For example, a dataflow whose operators chain into five subtasks executes with five parallel threads.
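Chaining is enabled by default; a sketch of the per-operator overrides (reusing env):

```java
env.fromElements(" a ", "b", " ")
   .map(String::trim)
   .startNewChain()     // begin a new chain starting at the map operator
   .filter(s -> !s.isEmpty())
   .disableChaining()   // keep the filter operator out of any chain
   .print();
```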

Ten, The Roles of the Flink Runtime

The Flink runtime consists of two types of processes:

1) The JobManager (also called master) coordinates the distributed execution: it schedules tasks, coordinates checkpoints, coordinates recovery from failures, and so on. There is always at least one JobManager. A high-availability setup may launch multiple JobManagers, one of which is elected leader while the others stand by.

2) The TaskManager (also called worker) executes the specific tasks of a dataflow, and buffers and exchanges the data streams. There is always at least one TaskManager.

JobManagers and TaskManagers can be started in various ways: directly as a standalone cluster, or managed by YARN or Mesos. TaskManagers connect to the JobManager, announce themselves as available, and accept assigned work.

The client is not part of the runtime and program execution; it is used to prepare a dataflow and send it to the JobManager.

After that, the client can disconnect, or stay connected to receive progress reports. The client runs either as part of the Java/Scala program that triggers the execution, or in the command-line process ./bin/flink run ...

Eleven, Task Slots and Resources

Each worker (TaskManager) is a JVM process that may execute one or more subtasks in separate threads. To control how many tasks a worker accepts, the worker has so-called task slots (at least one).

Each task slot represents a fixed subset of the TaskManager's resources. A TaskManager with three slots, for example, dedicates one third of its managed memory to each slot. Slotting the resources means that a subtask does not compete with subtasks from other jobs for managed memory, but instead works with a reserved amount of it. Note that no CPU isolation happens here; slots currently isolate only the managed memory.

By adjusting the number of task slots, users define how subtasks are isolated from one another. Having one slot per TaskManager means that each task group runs in a separate JVM (which can be started in a separate container, for example). Having multiple slots means that more subtasks share the same JVM. Tasks in the same JVM share TCP connections (via multiplexing) and heartbeat messages. They may also share data sets and data structures, reducing the per-task overhead.

By default, Flink allows subtasks to share slots even if they are subtasks of different tasks, as long as they belong to the same job. The result is that one slot may hold an entire pipeline of the job. Allowing this slot sharing has two main benefits:

1) A Flink cluster needs exactly as many task slots as the highest parallelism used in the job. There is no need to calculate how many tasks (with varying parallelism) a program contains in total.

2) It is easier to get better resource utilization. Without slot sharing, the non-intensive source/map() subtasks would block as many resources as the resource-intensive window subtasks. With slot sharing, increasing the base parallelism in the example from two to six yields full utilization of the slotted resources, while still ensuring that the heavy subtasks are fairly distributed among the TaskManagers.

There is also a resource-group mechanism that can be used to prevent undesired slot sharing, as sketched below.

As a rule of thumb, a good default number of task slots is the number of CPU cores.
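A sketch of that mechanism, isolating a heavy window stage in its own slot sharing group (reusing events and the window imports from earlier; the group name is arbitrary):

```java
events.keyBy(t -> t.f0)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(30)))
      .sum(1)
      .slotSharingGroup("heavy")  // no longer shares slots with the "default" group
      .print();
```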

Twelve, State Backends

The exact data structures in which the key/value indexes are stored depend on the chosen state backend (there are currently three: MemoryStateBackend, FsStateBackend, and RocksDBStateBackend). One state backend stores data in an in-memory hash map, another uses RocksDB, and another uses files. In addition to defining the data structure that holds the state, the state backends also implement the logic to take a point-in-time snapshot of the key/value state and store that snapshot as part of a checkpoint.
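A sketch of choosing a backend (Flink 1.12-era classes; newer releases replace them with HashMapStateBackend plus a separate checkpoint-storage setting, so treat the names below as version-specific):

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;

// Heap state with snapshots written to a file system:
env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

// Alternatives:
//   new MemoryStateBackend()  -- in-memory hash map, snapshots kept on the JobManager heap
//   new RocksDBStateBackend("hdfs:///flink/checkpoints")
//     -- on-disk RocksDB; separate flink-statebackend-rocksdb dependency,
//        and its constructor may throw IOException
```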

Thirteen, Savepoints

Programs written in the DataStream API can resume execution from a savepoint. Savepoints allow updating both the program and the Flink cluster without losing any state.

Savepoints are manually triggered checkpoints: they take a snapshot of the program and write it to the state backend. For this they rely on the regular checkpointing mechanism: during execution, programs are periodically snapshotted on the worker nodes to produce checkpoints. For recovery only the last completed checkpoint is needed, and older checkpoints can be safely discarded as soon as a new one completes. Savepoints are similar to these periodic checkpoints, except that they are triggered by the user and do not automatically expire when newer checkpoints complete. Savepoints can be triggered from the command line or via the REST API when cancelling a job.

Fourteen, Summary

As a streaming framework, Flink is widely used for real-time computation. A Flink application consists of three parts:

1) Data source: the input data that Flink processes

2) Transformations: the processing steps in which Flink modifies the incoming data

3) Data sink: where Flink sends the processed data
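Tying the three parts together, a complete word-count sketch (the input lines are made up):

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCountJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("to be", "or not", "to be")        // 1) data source
           .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
               for (String word : line.split(" ")) {
                   out.collect(Tuple2.of(word, 1));         // 2) transformations
               }
           })
           .returns(Types.TUPLE(Types.STRING, Types.INT))   // type hint: lambdas lose generics
           .keyBy(t -> t.f0)
           .sum(1)
           .print();                                        // 3) data sink

        env.execute("word-count-sketch");
    }
}
```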


After reading the above, do you have a better understanding of the concepts behind Flink streaming? If you want to learn more, please follow our industry information channel. Thank you for your support.
