What are the diagnostic ideas for common problems in Flink?

2025-02-24 Update From: SLTechnology News&Howtos


What are the diagnostic approaches for common Flink problems? This article analyzes and answers the question in detail, in the hope of helping readers find a simple, practical way to troubleshoot such issues.

1. Common operation and maintenance problems

1.1 Job running environment

The job running environment described in this article is mainly that of Alibaba Group: a Flink cluster built on the Hadoop ecosystem, including Yarn, HDFS, ZooKeeper and other components, with jobs submitted in Yarn per-job detached mode.

Step 1: the Flink Yarn Client uploads the user's job code and the compiled jar package to HDFS.

Step 2: the Flink Client communicates with the Yarn ResourceManager to apply for the required Container resources.

Step 3: after receiving the request, the ResourceManager allocates a Container on one of the cluster's NodeManagers and starts the AppMaster in it. The AppMaster contains the Flink JobManager module and a ResourceManager module that communicates with Yarn.

Step 4: inside the JobManager, the ExecutionGraph is generated from the job's JobGraph; the ResourceManager module communicates with Yarn's ResourceManager to request the container resources needed by the TaskManagers, and Yarn's NodeManagers pull up the containers. Each NodeManager downloads resources from HDFS, launches the Container (TaskManager), and registers with the JobManager; the JobManager then deploys the individual tasks to the TaskManagers for execution.

■ Ways to request resources

Specify resource size

When submitting the job, specify how much memory and CPU each TaskManager and the JobManager should use.

Fine-grained resource control

Alibaba Group mainly uses ResourceSpec to specify the resources required by each operator; the per-task requirements are then aggregated by parallelism into container resources and requested from Yarn.

■ High availability of the environment

JobManager high availability: if the AppMaster (JobManager) fails, high availability is guaranteed through Yarn's app attempt mechanism together with ZooKeeper.

Data high availability: when the job takes a checkpoint, the TaskManager writes to local disk first while asynchronously writing to HDFS. When the job is restarted, it can resume the stream from the last checkpoint stored on HDFS.
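As a minimal sketch of this setup (the directory paths and interval are illustrative assumptions, not values from the article), checkpoints can be directed to HDFS in flink-conf.yaml:

```yaml
# Keep working state in RocksDB on local disk, checkpoint to HDFS.
state.backend: rocksdb
state.backend.incremental: true
# Illustrative HDFS path for durable checkpoint data.
state.checkpoints.dir: hdfs:///flink/checkpoints
# Illustrative checkpoint interval.
execution.checkpointing.interval: 60 s
```

On restart, the job can then be resumed from the latest completed checkpoint under that HDFS directory.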

1.2 Why is my job delayed?

■ Time types

Processing time

Processing time refers to the system time of the machine on which the task processes the data

Event time

Event time refers to a time carried in a field of the data itself.

Ingestion time

Ingestion time refers to the system time at which the Flink source node receives the data.

■ Delay definitions

Add Gauge-type metrics to the custom source's parsing logic, and report the following:

Record the event time of the latest record; when reporting the metric, use the current system time minus that event time.

Record the system time at which the data was read minus the event time in the data, and report the difference directly.

delay = current system time - event time of the data

Meaning: reflects the progress of data processing.

fetch_delay = system time when the data was read - event time of the data

Meaning: reflects the actual processing capacity of the real-time computation.
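The two definitions above can be sketched with hypothetical timestamps (all values are illustrative epoch milliseconds, not from the article):

```shell
# Minimal sketch of the two delay metrics defined above.
event_time=1000        # event time carried in the data
read_time=1500         # system time when the source read the record
now=2000               # current system time when the metric is reported

delay=$(( now - event_time ))              # reflects processing progress
fetch_delay=$(( read_time - event_time ))  # reflects actual processing capacity
echo "delay=${delay} fetch_delay=${fetch_delay}"
```

A large delay with a small fetch_delay points to sparse upstream data; both being large points to insufficient job performance.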

■ Delay analysis

Starting from the upstream source, check the parallelism of each source.

Is the delay caused by sparsity of the upstream data?

Or is it a job performance problem?

1.3 Why did my job fail over?

■ Job failovers fall into two main categories.

Flink failovers come in two main types: JobManager failover and TaskManager failover.

1.4 The job cannot be submitted, or stops abnormally

■ Cannot be submitted

Yarn problems - resource limits

HDFS problems - the jar package is too large, or an HDFS exception

JobManager resources are insufficient to respond to TaskManager registration

An exception occurs during TaskManager startup

■ Abnormal stop - cases that metrics monitoring cannot cover

The restart strategy is misconfigured.

The number of restarts reached the limit.
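Both causes trace back to the restart strategy configuration. A sketch of a fixed-delay strategy in flink-conf.yaml (the attempt count and delay are illustrative values, not the article's recommendation):

```yaml
# Restart up to 3 times, waiting 10 s between attempts;
# once the limit is exhausted the job stops for good.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

If the strategy is set to `none`, or the attempt limit is reached, the job transitions to FAILED rather than restarting, which shows up as an abnormal stop.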

2. Handling methods

2.1 Ways to handle delay problems

Use delay and fetch_delay to determine whether the delay is caused by sparse upstream data or by poor job performance.

Once a genuine delay is confirmed, locate the backpressure node through backpressure analysis.

Analyze the metrics of the backpressure node.

Analyze the JVM process or its stack information.

Check logs such as the TaskManager logs.

■ Delay and throughput

Observe the correlation between the delay and TPS metrics, to see whether an abnormal rise in TPS has pushed the job beyond its performance limits and caused the delay.

■ Backpressure

Find the source of the backpressure.

Check the data transfer mode between nodes: shuffle / rebalance / hash.

Check the per-parallelism throughput of each node, to see whether the backpressure is caused by data skew.

In the business logic, check for regular expressions, external system access, and so on; IO or CPU bottlenecks lead to insufficient node performance.

■ Metrics

How long does each GC take?

Are there many GCs within a short time?

IO of the local disk that stores state

External system access latency, etc.

■ Stack

On the node where the TaskManager runs, check each thread's TID and CPU usage to determine whether the bottleneck is CPU or IO.

ps H -p ${javapid} -o user,pid,ppid,tid,time,%cpu,cmd

# convert the busy TID to hexadecimal (printf '%x' ${tid}), then look it up in the thread dump
jstack ${javapid} > jstack.log

■ Common handling methods

Increase the parallelism of the backpressure node.

Adjust the node's resources: increase CPU and memory.

Split the node: break the resource-heavy operator out of the operator chain.

Optimize the job or cluster: tune primary-key sharding, data deduplication, data skew, GC parameters, JobManager parameters, and so on.
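As one illustrative example of GC-parameter tuning (the specific flags are assumptions for the sketch, not the article's recommendation), JVM options can be passed to the Flink processes through flink-conf.yaml:

```yaml
# Hypothetical GC tuning: use G1 with a pause-time target.
env.java.opts: -XX:+UseG1GC -XX:MaxGCPauseMillis=100
```

Which flags actually help depends on the GC metrics gathered above (pause duration, GC frequency).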

2.2 Job failover analysis

Check the log messages printed when the job fails over.

Check the failed subtask to find its TaskManager node.

Combine this with the Job and TaskManager logs.

Combine related logs from Yarn and the OS.

3. Job life cycle

3.1 Job status change-JobStatus

The figure above shows the full state transition of a job: the whole life cycle from creation through running, failure, restart, and completion.

Note the RECONCILING status here: it indicates that the AppMaster in Yarn has been restarted and the JobManager module inside it is being restored. The job goes from CREATED to RECONCILING, waits for the TaskManagers to report in, recovers from the JobManager failover, and then moves from RECONCILING back to normal RUNNING.

3.2 Task status change-ExecutionState

The figure above shows the task state transitions of a job. Note that when the job status is RUNNING, it does not necessarily mean the job is actually consuming data; in stream computing, the job is only truly running once all of its tasks are in the RUNNING state.

By recording the state changes at each stage of a job and assembling them into a life cycle, we can show clearly when a job started running and when it failed, along with key events such as TaskManager failovers, and further analyze how many jobs in the cluster are running normally, forming an SLA standard.

4. Tooling experience

4.1 Metrics

How do we measure whether a job is healthy?

Delay and throughput

For a Flink job, the most critical metrics are delay and throughput: at what TPS water level does the job begin to fall behind?

External system call

Time-consumption statistics for external system calls can also be built from metrics, such as how long a dimension-table join takes, or how long the sink takes to write to the external system; this helps us rule out anomalies in external systems.

Baseline management

Establish baselines for the metrics. For example, when there is no delay, how long does a state access take? Roughly how much data does each checkpoint contain? Knowing what normal looks like helps us troubleshoot Flink jobs when things go wrong.

4.2 Log

Error log

Alert on keywords and error logs from the JobManager or TaskManager.

Event log

Turn the state changes of the JobManager and TaskManager into a record of key events.

History log collection

After the job has finished, analyzing a problem requires looking up historical information from Yarn's History Server or from a log system that has already collected the logs.

Log analysis

With the JobManager and TaskManager logs, you can cluster the common failover types and label the frequent ones, such as OOM or common upstream and downstream access errors.

4.3 Association analysis

Job metrics / events - TaskManager, JobManager

Yarn events - resource preemption, NodeManager decommission

Machine exceptions - downtime, replacement

Failover log clustering

After processing these metrics and logs, you can correlate events across components; for example, a TaskManager failover may be caused by a machine exception. You can also parse Yarn events for Flink jobs, associating a job with container resource preemption, NodeManager decommission events, and so on.

That concludes this discussion of diagnostic approaches for common Flink problems; hopefully the content above is of some help.
