
How Does Flink Handle Serialization?


This article explains how Flink handles serialization, together with related Flink fundamentals, in question-and-answer form. The material is simple and practical, so let's take a look.

1. How does Flink unify batch and stream processing?

Flink's developers regard batch processing as a special case of stream processing: a batch is simply a finite stream. Flink therefore uses a single engine that supports both the DataSet API and the DataStream API.
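For illustration, here is a minimal sketch (not from the original article) of the DataStream API consuming a finite source, which Flink can treat exactly like a batch job:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedStreamDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A finite (bounded) source: the stream ends, so the job behaves like batch processing.
        env.fromElements(1, 2, 3, 4, 5)
           .map(x -> x * 2)
           .returns(Types.INT) // lambda return types are erased, so declare the output type
           .print();

        env.execute("bounded-stream-as-batch");
    }
}
```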

2. How does Flink exchange data efficiently?

In a Flink job, data must be exchanged between different tasks, and the entire exchange is handled by the TaskManager, whose network component first collects records in buffers and then sends them. Records are not sent one by one; instead, a batch is accumulated and then sent. Batching makes more efficient use of network resources.

3. How does Flink achieve fault tolerance?

Flink's fault tolerance relies mainly on its CheckPoint mechanism and State mechanism. Checkpoint is responsible for periodically taking distributed snapshots to back up the program's state; State stores the intermediate state of the computation.
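As a hedged illustration of the two mechanisms working together, the sketch below enables periodic checkpointing and keeps a per-key running count in ValueState; the class name and sample elements are invented for the example:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class CheckpointedCountDemo {

    // Keyed state holding a running count; it is included in every checkpoint snapshot.
    public static class CountPerKey extends RichFlatMapFunction<String, Long> {
        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void flatMap(String value, Collector<Long> out) throws Exception {
            long next = (count.value() == null ? 0L : count.value()) + 1;
            count.update(next);
            out.collect(next);
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000); // take a distributed snapshot every 10 seconds

        env.fromElements("a", "b", "a", "a")
           .keyBy(v -> v)
           .flatMap(new CountPerKey())
           .print();

        env.execute("checkpoint-and-state-demo");
    }
}
```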

4. What is the principle of Flink's distributed snapshots?

Flink's distributed snapshots are adapted from the Chandy-Lamport algorithm. Simply put, Flink continuously creates consistent snapshots of the distributed data stream and its state.

The core idea is to insert barriers at the input sources and control barrier alignment downstream, which enables both snapshot backups and exactly-once semantics.

5. How does Flink guarantee exactly-once semantics?

Flink achieves end-to-end exactly-once semantics by combining a two-phase commit protocol with state preservation.

Taking a file sink as the running example, the protocol is divided into the following steps (a code sketch follows the list):

Begin transaction: create a temporary folder and write the data into it.

Pre-commit: flush the data cached in memory to the file and close the file.

Commit: move the previously written temporary files into the target directory. This means the final data becomes visible with some delay.

Abort: discard the temporary files.

If a failure occurs after a successful pre-commit but before the formal commit, the pre-committed data can either be committed or deleted, depending on the recovered state.
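The sketch below maps these four steps onto the hooks of Flink's TwoPhaseCommitSinkFunction. It is a hypothetical file sink written for illustration, not Flink's built-in connector; the Txn class and the /data/out target directory are assumptions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

// Hypothetical file sink illustrating the four steps above.
public class TempFileSink extends TwoPhaseCommitSinkFunction<String, TempFileSink.Txn, Void> {

    public static class Txn { public String tmpPath; } // path of the temporary file

    public TempFileSink() {
        super(new KryoSerializer<>(Txn.class, new ExecutionConfig()), VoidSerializer.INSTANCE);
    }

    @Override
    protected Txn beginTransaction() throws IOException {
        Txn txn = new Txn();                                           // step 1: temp file
        txn.tmpPath = Files.createTempFile("flink-2pc-", ".tmp").toString();
        return txn;
    }

    @Override
    protected void invoke(Txn txn, String value, Context ctx) throws IOException {
        Files.write(Paths.get(txn.tmpPath), (value + "\n").getBytes(),
                StandardOpenOption.APPEND);                            // buffer records
    }

    @Override
    protected void preCommit(Txn txn) { /* step 2: flush and close the file */ }

    @Override
    protected void commit(Txn txn) {
        try {   // step 3: atomically move the temp file into the target directory
            Files.move(Paths.get(txn.tmpPath),
                    Paths.get("/data/out/" + Paths.get(txn.tmpPath).getFileName()));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    protected void abort(Txn txn) {
        try {
            Files.deleteIfExists(Paths.get(txn.tmpPath));              // step 4: discard
        } catch (IOException ignored) {}
    }
}
```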

6. What is special about Flink's Kafka connector?

The Flink source code has a separate connector module on which all other connectors depend. With the new Kafka connector released in version 1.9, Flink abandoned the previous practice of requiring a different connector version for each Kafka cluster version; a single connector dependency now suffices.
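For reference, a minimal sketch of using the single unversioned connector class; the broker address, group id, and topic name are placeholders:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class UniversalKafkaDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092"); // placeholder address
        props.setProperty("group.id", "demo-group");           // placeholder group id

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // One unversioned connector class, regardless of the Kafka cluster version.
        env.addSource(new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), props))
           .print();

        env.execute("universal-kafka-connector");
    }
}
```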

7. How does Flink's memory management work?

Instead of storing large numbers of objects on the heap, Flink serializes them onto a pre-allocated block of memory. In addition, Flink makes heavy use of off-heap memory. If the data to be processed exceeds the memory limit, part of the data is spilled to disk.

Flink implements its own serialization framework for direct manipulation of binary data.

Flink's memory management is conceptually divided into three parts:

Network Buffers: allocated when the TaskManager starts; a group of memory blocks used to buffer network data. Each block is 32 KB, and 2048 blocks are allocated by default. The count can be changed via taskmanager.network.numberOfBuffers.

Memory Manager pool: a large number of MemorySegment blocks used by runtime algorithms (sort/join/shuffle, etc.), allocated at startup. How much memory is allocated, and whether it is on-heap or off-heap, is calculated from various parameters in the configuration file. Allocation supports both pre-allocation and lazy loading; lazy loading is the default.

User Code: the memory outside the Memory Manager, used by user code and the TaskManager's own data structures.
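As a configuration sketch using the legacy option keys mentioned above (newer Flink releases replace them with the taskmanager.memory.* options), the buffer count and pre-allocation behavior can be set like this:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MemoryConfigDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Legacy keys from the era described above; newer Flink versions
        // use the taskmanager.memory.* option set instead.
        conf.setInteger("taskmanager.network.numberOfBuffers", 4096); // 32 KB network buffers
        conf.setBoolean("taskmanager.memory.preallocate", true);      // pre-allocate, not lazy load

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(1, conf);
        // ... build and execute the job with env ...
    }
}
```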

8. How does Flink's serialization work?

Java has built-in serialization and deserialization, but its auxiliary information takes up a lot of space, and too much class metadata is recorded when serializing objects.

Apache Flink abandons Java's native serialization and handles data types and serialization in its own way, with its own type descriptors, generic type extraction, and a type serialization framework.

TypeInformation is the base class of all type descriptors. It exposes basic properties of the type and can generate serializers. TypeInformation supports the following types:

BasicTypeInfo: Any Java primitive type or String type

BasicArrayTypeInfo: Any Java primitive type array or String array

WritableTypeInfo: any implementation class of Hadoop's Writable interface

TupleTypeInfo: any Flink Tuple type (Tuple1 through Tuple25 are supported). Flink tuples are a fixed-length, fixed-type Java tuple implementation

CaseClassTypeInfo: any Scala case class (including Scala tuples)

PojoTypeInfo: any POJO (Java or Scala), i.e., an object whose member variables are all either declared public or accessible via getter/setter methods

GenericTypeInfo: any class that does not match any of the previous types

For the first six categories, Flink can automatically generate the corresponding TypeSerializers, which serialize and deserialize datasets very efficiently.
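A small sketch of this machinery: a TypeHint captures a full generic type (generic type extraction), and the resulting TypeInformation generates the serializer that works directly on binary data. The names here are illustrative only:

```java
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.api.java.tuple.Tuple2;

public class TypeInfoDemo {
    public static void main(String[] args) {
        // Generic type extraction: the anonymous TypeHint subclass preserves the generic type.
        TypeInformation<Tuple2<String, Integer>> info =
                TypeInformation.of(new TypeHint<Tuple2<String, Integer>>() {});

        // The descriptor generates the serializer that operates on binary data.
        TypeSerializer<Tuple2<String, Integer>> serializer =
                info.createSerializer(new ExecutionConfig());

        System.out.println(info);                                   // the type descriptor
        System.out.println(serializer.getClass().getSimpleName()); // the generated serializer
    }
}
```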

9. A Window in Flink has data skew. What solutions do you have?

Window data skew means that vastly different amounts of data pile up in different windows. The root cause is that data sources send data at different rates. It is usually addressed in one of two ways:

Pre-aggregate the data before it enters the window

Redesign the window aggregation key

10. How do you resolve data hot spots when using aggregation operations such as GroupBy, Distinct, or KeyBy in Flink?

Data skew and data hotspots are problems that no big data framework can avoid. There are three main ways to handle them:

Avoid the problem at the business level

For example, suppose that in an order scenario the order volume of Beijing and Shanghai increases dozens of times while the data volume of other cities stays unchanged. When we aggregate, data will pile up on the Beijing and Shanghai keys, so we can process Beijing and Shanghai separately.

Key design

Split the hot keys. For Beijing and Shanghai in the example above, you can split the key by district and aggregate in two phases (see the salting sketch after this list).

Parameter settings

An important performance improvement in Flink 1.9.0 SQL (Blink planner) is the MiniBatch model: buffer a certain amount of input and then trigger processing once, which reduces accesses to State, improving throughput and reducing the amount of output data.
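To make the key-splitting idea concrete, here is a hedged sketch of two-phase "salted" aggregation: each hot key is fanned out across several sub-keys, partially aggregated per window, then recombined under the original key. The sample data and the SALTS fan-out factor are assumptions:

```java
import java.util.concurrent.ThreadLocalRandom;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class HotKeySplitDemo {
    private static final int SALTS = 8; // fan-out per hot key, an assumed value

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> cities = env.fromElements("beijing", "shanghai", "beijing", "chengdu");

        // Phase 1: salt the key so one hot city spreads across SALTS subtasks.
        DataStream<Tuple2<String, Long>> partial = cities
                .map(city -> Tuple2.of(city + "#" + ThreadLocalRandom.current().nextInt(SALTS), 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .sum(1);

        // Phase 2: strip the salt and combine the partial counts per original key.
        // With a real (unbounded) source, each window emits its counts every minute.
        partial.map(t -> Tuple2.of(t.f0.split("#")[0], t.f1))
               .returns(Types.TUPLE(Types.STRING, Types.LONG))
               .keyBy(t -> t.f0)
               .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
               .sum(1)
               .print();

        env.execute("hot-key-splitting");
    }
}
```

And a sketch of turning on MiniBatch through the Blink planner's table options; the threshold values below are examples, not recommendations:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MiniBatchConfigDemo {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build());

        // Buffer input for up to 5 s or 5000 rows before touching state.
        tEnv.getConfig().getConfiguration().setString("table.exec.mini-batch.enabled", "true");
        tEnv.getConfig().getConfiguration().setString("table.exec.mini-batch.allow-latency", "5 s");
        tEnv.getConfig().getConfiguration().setString("table.exec.mini-batch.size", "5000");
    }
}
```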

11. A Flink job has high latency. How would you go about solving the problem?

In Flink's web UI for job management, we can see which of the job's operators and tasks are under backpressure. The main remedies are resource tuning and operator tuning. Resource tuning means adjusting parameters such as the parallelism, CPU cores, and heap memory of the job's operators. Job parameter tuning includes parallelism settings, state settings, and checkpoint settings.

12. How does Flink handle backpressure?

Internally, Flink passes messages based on the producer-consumer model, and its backpressure design follows the same model. Flink uses efficient bounded distributed blocking queues, much like Java's BlockingQueue: when downstream consumers slow down, upstream producers are blocked.
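The blocking-queue analogy can be demonstrated with plain Java. This toy demo (not Flink code) shows a fast producer that blocks on a bounded queue whenever the slow consumer falls behind:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BackpressureDemo {
    public static void main(String[] args) {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(8); // bounded, like Flink's buffers

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 100; i++) {
                    buffer.put(i); // blocks when the queue is full -> natural backpressure
                    System.out.println("produced " + i);
                }
            } catch (InterruptedException ignored) {}
        });

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    Thread.sleep(100); // slow consumer
                    System.out.println("consumed " + buffer.take());
                }
            } catch (InterruptedException ignored) {}
        });

        producer.start();
        consumer.start();
    }
}
```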

13. What are the differences between Flink's backpressure and Storm's?

Storm monitors the receive-queue load in each Bolt. If it exceeds the high-water mark, the backpressure information is written to ZooKeeper; a ZooKeeper watch then notifies all workers in the topology to enter the backpressure state, and finally the Spout stops sending tuples.

Backpressure in Flink uses efficient bounded distributed blocking queues: slow downstream consumption causes the sender to block.

The biggest difference between the two is that Flink's backpressure propagates backwards step by step, while Storm directly slows down at the source.

14. Do you understand the concept of operator chains?

For more efficient distributed execution, Flink chains subtasks of operators together as much as possible to form tasks. Each task is executed in a thread. Chaining operators into tasks is a very effective optimization: it reduces switching between threads, reduces serialization/deserialization of messages, reduces data swapping in buffers, reduces latency and improves overall throughput. This is what we call a chain of operators.

15. Under what circumstances will Flink chain two operators together into an operator chain?

Two operators are chained together when all of the following conditions hold (a sketch showing how to control chaining from user code follows the list):

The parallelism of upstream and downstream is consistent

The downstream node has an in-degree of 1 (that is, it has no input from other nodes)

Both upstream and downstream nodes are in the same slot group

The chaining policy of the downstream node is ALWAYS (it may chain with both upstream and downstream nodes; map, flatMap, and filter default to ALWAYS)

The chaining policy of the upstream node is ALWAYS or HEAD (HEAD chains only with downstream, not upstream; Source defaults to HEAD)

The data partitioning between the two nodes is forward

The user has not disabled chaining
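The last two conditions are under the user's control. A minimal sketch with made-up operators:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ChainingDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // env.disableOperatorChaining(); // would disable chaining for the whole job

        env.fromElements("a", "b", "c")
           .map(String::toUpperCase)
           .returns(Types.STRING)
           .startNewChain()              // chaining policy HEAD: chain with downstream only
           .filter(s -> !s.isEmpty())
           .disableChaining()            // this operator is never chained
           .print();

        env.execute("chaining-demo");
    }
}
```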

16. What are the new features of Flink 1.9?

Hive reading and writing support, UDF support

Flink SQL TopN and GroupBy optimizations

Checkpoint and savepoint are optimized for real business scenarios

Flink state query

17. How do you handle dirty data when consuming Kafka data?

A filter operator can be added before processing to filter out data that does not conform to the rules.
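A hedged sketch of such a filter, assuming hypothetical CSV records of the form "id,city,amount"; malformed lines are dropped before any further processing:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DirtyDataFilterDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for a Kafka source; assume CSV records of the form "id,city,amount".
        DataStream<String> raw = env.fromElements("1,beijing,20.5", "garbage", "2,shanghai,3.0");

        // Drop records that do not match the expected schema.
        raw.filter(line -> {
               if (line == null) return false;
               String[] fields = line.split(",");
               if (fields.length != 3) return false;
               try { Double.parseDouble(fields[2]); return true; }
               catch (NumberFormatException e) { return false; }
           })
           .print();

        env.execute("dirty-data-filter");
    }
}
```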

At this point, you should have a deeper understanding of how Flink handles serialization, so go ahead and try it out in practice, and keep learning!
