What is Flink 1.10 state management like?

2025-04-18 Update. From: SLTechnology News&Howtos (shulou)

Shulou(Shulou.com)06/01 Report--

This article looks at how state management works in Flink 1.10: what state is, how Flink classifies it, and the forms it can take.

I. Overview

Let's start with the first sentence of the official Flink documentation:

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.

To be clear, stream processing in Flink can be stateful or stateless. A stateless task computes on each record independently. The simplest example receives records from the source and writes them straight to a sink, such as printing them to the console. Processing one record does not depend on any other record, so no state is involved: each record is handled on its own.

Most streaming applications, however, are stateful. When Flink executes a job, there are usually many operators between the source and the sink, and they hold intermediate state along the way. If one task of the job dies, its in-memory state is lost. If we do not store intermediate state, we have to recompute from scratch. If we do store it, we can return to the saved intermediate state and continue computing from there. Flink provides a mechanism for saving the intermediate state of task execution, and that is its state management mechanism.

For example, in the classic word count program, a task continuously receives records from the source. For each word, the task reads the word's current count from state, adds 1, writes the new count back to state, and emits the updated result.

[Figure: how a Flink task interacts with state.]
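The read, add, update, emit cycle described above can be sketched in plain Python. This is only an illustration, not Flink code: the function name `word_count` and the dictionary used as state are assumptions made for the example.

```python
def word_count(words, state=None):
    """Process words one at a time, keeping a running count per word in `state`."""
    state = {} if state is None else state
    outputs = []
    for word in words:
        current = state.get(word, 0)     # read the current count from state
        updated = current + 1            # add 1
        state[word] = updated            # write the new count back to state
        outputs.append((word, updated))  # emit the updated result downstream
    return outputs, state


outputs, state = word_count(["flink", "state", "flink"])
print(outputs)
print(state)
```

Passing the returned `state` into a later call continues counting from where the previous call stopped, which is the property a stateless task lacks.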

Which scenarios require stateful computation? Here are some typical ones:

a. Incremental statistics over data

b. Aggregation operations

c. Machine learning training, where the current model is saved during iterative computation

d. Job failure and restart, where the job needs to recover from its previous state

e. Record deduplication

f. Comparison against historical data
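As one illustration of these scenarios, record deduplication only works if the task remembers which records it has already seen. The sketch below is plain Python rather than Flink code; the `deduplicate` function and the `(record_id, payload)` record shape are assumptions for the example.

```python
def deduplicate(records, seen=None):
    """Return records whose id has not been seen before; `seen` is the state."""
    seen = set() if seen is None else seen
    unique = []
    for record_id, payload in records:
        if record_id in seen:   # state says this id was already processed
            continue
        seen.add(record_id)     # update state before emitting
        unique.append((record_id, payload))
    return unique, seen


records = [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (2, "b")]
unique, seen = deduplicate(records)
print(unique)
```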

II. State classification

Before explaining how Flink classifies state, we need to distinguish a few concepts:

1) State

State generally refers to the state of a specific task or operator. So that data can be recovered if an exception occurs during computation, Flink stores intermediate results, and those intermediate results are the state. With the default backend in Flink 1.10 (MemoryStateBackend), working state is held on the TaskManager's JVM heap and checkpoints are stored in the JobManager's memory. State can also be kept on the TaskManager's local disk (the RocksDB backend) or checkpointed to a distributed file system such as HDFS.

2) State backend

How state is stored, accessed, and maintained is determined by a pluggable component called the state backend. A state backend is responsible for two things: managing local state, and checkpointing that state to external storage.

3) Checkpoint

A checkpoint is a global snapshot of the entire job at a specific point in time. When the job fails or restarts, its state can be restored from that snapshot.
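The relationship between the three concepts can be sketched with a toy model. This is not Flink's API: `ToyStateBackend`, its `checkpoints` dictionary (standing in for external storage), and the method names are all assumptions for illustration.

```python
import copy


class ToyStateBackend:
    """Toy model: holds local state and copies it to 'external' storage."""

    def __init__(self):
        self.local_state = {}   # working state used while processing records
        self.checkpoints = {}   # stands in for external storage such as HDFS

    def checkpoint(self, checkpoint_id):
        # Snapshot the local state at this point in time.
        self.checkpoints[checkpoint_id] = copy.deepcopy(self.local_state)

    def restore(self, checkpoint_id):
        # Replace the local state with the snapshot taken earlier.
        self.local_state = copy.deepcopy(self.checkpoints[checkpoint_id])


backend = ToyStateBackend()
for word in ["a", "b", "a"]:
    backend.local_state[word] = backend.local_state.get(word, 0) + 1
backend.checkpoint(1)

backend.local_state["a"] += 1   # progress made after the checkpoint
backend.local_state = {}        # simulated failure: in-memory state is lost
backend.restore(1)
print(backend.local_state)
```

After the restore, the counts are those recorded at checkpoint 1. The increment made after the checkpoint is gone, which is why a real system also has to re-read the input from the position that matches the checkpoint.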

Depending on how state is partitioned and scoped, Flink has two types of state: operator state and keyed state.

1. Operator state

The scope of operator state is limited to the operator task: all records processed by the same parallel subtask can access the same state.

In other words, each parallel subtask holds one state instance, shared by everything that subtask processes.

Operator state cannot be accessed by any other task, whether it belongs to a different operator or is another parallel task of the same operator.

Operator state provides three primitives:

List state

Represents state as a list of entries.

Union list state

Also represents state as a list, but it differs from regular list state in how it is restored after a failure or when an application starts from a checkpoint. Regular list state is split evenly across the parallel tasks, while union list state hands every task the complete list.

Broadcast state

Used in the special case where every task of an operator holds the same state. Flink can take advantage of this property when taking checkpoints and when rescaling the operator.
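The difference between the three primitives shows up when state is redistributed, for example when the parallelism changes on restore. The following plain Python sketch models that step only. The function names are hypothetical, and the round-robin split is an assumption about how an even split might be dealt out.

```python
def redistribute_list_state(per_task_lists, new_parallelism):
    """Even split: merge all entries, then deal them out across the new tasks."""
    merged = [entry for task in per_task_lists for entry in task]
    return [merged[i::new_parallelism] for i in range(new_parallelism)]


def redistribute_union_list_state(per_task_lists, new_parallelism):
    """Union: every new task receives the complete merged list."""
    merged = [entry for task in per_task_lists for entry in task]
    return [list(merged) for _ in range(new_parallelism)]


def redistribute_broadcast_state(per_task_copies, new_parallelism):
    """Broadcast: all copies are identical, so any one copy seeds each new task."""
    return [dict(per_task_copies[0]) for _ in range(new_parallelism)]


old = [[1, 2], [3, 4], [5]]   # state of three parallel tasks
even = redistribute_list_state(old, 2)
union = redistribute_union_list_state(old, 2)
broadcast = redistribute_broadcast_state([{"rule": "x"}, {"rule": "x"}], 3)
print(even, union, broadcast)
```

With union list state, each task typically has to pick out the entries it actually needs from the full list, since every task receives everything.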

2. Keyed state

Keyed state is state on a KeyedStream, that is, state used by the operators that follow a keyBy.

Keyed state is maintained and accessed per key, where the key is defined on the records of the input stream.

Flink maintains one state instance for each key and routes all records with the same key to the same operator task, which maintains and processes the state for that key.

When a task processes a record, it automatically limits state access to the key of the current record.
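A rough model of key routing and key scoping, in plain Python. `KeyedCounterTask` and `subtask_for` are hypothetical names, and `zlib.crc32` is only a deterministic stand-in for whatever hash function the real system uses to assign keys to tasks.

```python
import zlib


class KeyedCounterTask:
    """One parallel task; its state is a separate entry per key."""

    def __init__(self):
        self.state_by_key = {}
        self.current_key = None

    def process(self, key):
        self.current_key = key   # state access is scoped to this key
        count = self.state_by_key.get(self.current_key, 0) + 1
        self.state_by_key[self.current_key] = count
        return count


def subtask_for(key, parallelism):
    # Same key -> same hash -> same task, so one task owns each key's state.
    return zlib.crc32(key.encode("utf-8")) % parallelism


parallelism = 2
tasks = [KeyedCounterTask() for _ in range(parallelism)]
for key in ["user-a", "user-b", "user-a", "user-c", "user-a"]:
    tasks[subtask_for(key, parallelism)].process(key)

for index, task in enumerate(tasks):
    print(index, task.state_by_key)
```

Whichever task ends up owning `user-a`, it is the only one holding a count for that key.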

Keyed state primitives include:

Value state

Stores a single value of any type for each key. Complex data structures can also be stored as value state.

List state

Stores a list of values for each key. The list elements can be of any type.

Map state

Stores a key-value map for each key. The keys and values in the map can be of any type.
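The three primitives can be modeled as small classes that all look up their data under the current key. This is a toy sketch, not the Flink API: the `Toy*` classes and `KeyContext` are assumptions, and the method names only loosely mirror those of Flink's `ValueState`, `ListState`, and `MapState` interfaces.

```python
class KeyContext:
    """Holds the key of the record currently being processed."""

    def __init__(self):
        self.current_key = None


class ToyValueState:
    def __init__(self, ctx):
        self._ctx, self._data = ctx, {}

    def value(self):
        return self._data.get(self._ctx.current_key)

    def update(self, new_value):
        self._data[self._ctx.current_key] = new_value


class ToyListState:
    def __init__(self, ctx):
        self._ctx, self._data = ctx, {}

    def add(self, item):
        self._data.setdefault(self._ctx.current_key, []).append(item)

    def get(self):
        return list(self._data.get(self._ctx.current_key, []))


class ToyMapState:
    def __init__(self, ctx):
        self._ctx, self._data = ctx, {}

    def put(self, map_key, map_value):
        self._data.setdefault(self._ctx.current_key, {})[map_key] = map_value

    def get(self, map_key):
        return self._data.get(self._ctx.current_key, {}).get(map_key)


ctx = KeyContext()
last_seen, events, totals = ToyValueState(ctx), ToyListState(ctx), ToyMapState(ctx)

ctx.current_key = "alice"
last_seen.update(10)
events.add("login")
events.add("click")
totals.put("clicks", 1)

ctx.current_key = "bob"   # switching key hides alice's state
bob_last_seen = last_seen.value()
events.add("login")

ctx.current_key = "alice"
alice_events = events.get()
print(bob_last_seen, alice_events, totals.get("clicks"))
```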

III. Forms of state

Keyed state and operator state can each exist in two forms: raw state and managed state.

With managed state, state is handled by the state management framework that Flink provides, and state values are updated and managed through the framework's interfaces. This includes the data structures for storing state and ready-made wrapper classes.

With raw state, the concrete data structure of the state is managed by the user. When the framework takes a checkpoint (the mechanism Flink uses to persist state data), it reads and writes the state content as byte[] and knows nothing about its internal structure.

For ordinary DataStream programs, managed state is recommended; raw state is used when implementing a user-defined operator. In general, managed state is used more often.
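The contrast can be sketched in plain Python. None of this is Flink code: the `framework_*` and `user_*` function names are hypothetical, JSON stands in for whatever serializer a framework would choose, and the 12-byte binary layout is an arbitrary user-defined format.

```python
import json
import struct


def framework_snapshot_managed(counts):
    """Managed: the framework knows this is a str -> int map and serializes it."""
    return json.dumps(counts, sort_keys=True).encode("utf-8")


def framework_restore_managed(blob):
    return json.loads(blob.decode("utf-8"))


def framework_snapshot_raw(blob):
    """Raw: the framework stores opaque bytes and cannot interpret them."""
    if not isinstance(blob, bytes):
        raise TypeError("raw state must be handed over as bytes")
    return blob


def user_encode(count, total):
    # User-defined layout: big-endian 4-byte int followed by 8-byte int.
    return struct.pack(">iq", count, total)


def user_decode(blob):
    return struct.unpack(">iq", blob)


managed = framework_restore_managed(framework_snapshot_managed({"flink": 2}))
raw_blob = framework_snapshot_raw(user_encode(3, 4200))
print(managed, len(raw_blob), user_decode(raw_blob))
```

In the raw case only `user_decode` can make sense of the bytes, which is one reason managed state tends to be the more convenient choice.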
