2025-03-29 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)05/31 Report--
This article explains how Flink Metrics work. The approach introduced here is simple, fast, and practical, so let's walk through it.
What is Metrics?
Flink's Metrics system collects metrics from inside Flink so that developers can better understand the status of a job or cluster. Once a cluster is running, it is hard to see what is actually happening inside: is a job running slowly or quickly, is anything abnormal? Developers cannot watch every Task's logs in real time, especially when jobs are very large or numerous. In these situations, Metrics help developers understand the current state of a job.
Metric Types
The types of Metrics are as follows:
First, the commonly used Counter. Developers who have written MapReduce jobs will be familiar with Counters; the meaning here is the same: an accumulating count, for example the running total of records or bytes processed.
Second, the Gauge. A Gauge is the simplest metric: it reports a single current value. For example, to see how much Java heap memory is in use, you can expose a Gauge whose value is the current heap usage; the value is read fresh each time the Gauge is queried.
Third, the Meter. A Meter measures throughput: the number of "events" occurring per unit of time. It is effectively a rate, i.e. the number of events divided by the elapsed time.
Fourth, the Histogram. A Histogram is more complex and less commonly used. It records the distribution of a set of values, exposing statistics such as quantiles, mean, standard deviation, max, and min.
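To make the semantics of the four types concrete, here is a minimal, stdlib-only Java sketch. This is an illustration of the concepts only, not Flink's actual implementation (Flink's real interfaces live in org.apache.flink.metrics):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

public class MetricTypesSketch {
    // Counter: a monotonically increasing accumulator.
    static class Counter {
        private long count;
        void inc() { count++; }
        void inc(long n) { count += n; }
        long getCount() { return count; }
    }

    // Gauge: reports the current value of something each time it is read.
    static class Gauge<T> {
        private final Supplier<T> supplier;
        Gauge(Supplier<T> supplier) { this.supplier = supplier; }
        T getValue() { return supplier.get(); }
    }

    // Meter: a rate -- events divided by the elapsed time in seconds.
    static class Meter {
        private long events;
        private final long startMillis = System.currentTimeMillis();
        void markEvent() { events++; }
        double getRate() {
            double seconds = Math.max(1.0, (System.currentTimeMillis() - startMillis) / 1000.0);
            return events / seconds;
        }
    }

    // Histogram: records values and exposes distribution statistics.
    static class Histogram {
        private final List<Long> values = new ArrayList<>();
        void update(long v) { values.add(v); }
        long getMax() { return Collections.max(values); }
        long getMin() { return Collections.min(values); }
        double getMean() {
            return values.stream().mapToLong(Long::longValue).average().orElse(0);
        }
    }
}
```

In Flink, the same four shapes appear as interfaces you register against a MetricGroup, as shown later in this article.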
Metric Group
In Flink, metrics have a multi-layer structure and are organized into Groups rather than in a flat namespace. Metric Group + Metric Name is the unique identifier of a metric.
The MetricGroup hierarchy starts at TaskManagerMetricGroup, which contains a TaskManagerJobMetricGroup per job; each task of a job has its own TaskMetricGroup, which in turn is divided into TaskIOMetricGroup and OperatorMetricGroup. Under the Operator level there are further IO statistics and metrics. The overall hierarchy is roughly as shown below. Metrics in different groups do not interfere with each other, and Flink lets you add your own Groups with their own hierarchy.
TaskManagerMetricGroup
  TaskManagerJobMetricGroup
    TaskMetricGroup
      TaskIOMetricGroup
      OperatorMetricGroup
        ${User-defined Group} / ${User-defined Metrics}
        OperatorIOMetricGroup
JobManagerMetricGroup
  JobManagerJobMetricGroup
The JobManagerMetricGroup, which corresponds to the Master, is relatively simple and has fewer levels.
Defining metrics is straightforward: you collect and compute the indicator values yourself, and those values can then be viewed in an external system and aggregated and computed there.
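As a rough illustration of how Group + Name forms a unique identifier: reporters flatten the group hierarchy into a scoped name. The delimiter and scope components below are simplified assumptions (in real Flink the scope format is configurable via the metrics.scope.* options):

```java
import java.util.List;

public class MetricScopeSketch {
    // Join the group path and the metric name with a delimiter, the way a
    // reporter flattens the group hierarchy into one identifier string.
    static String identifier(List<String> groupPath, String metricName) {
        return String.join(".", groupPath) + "." + metricName;
    }
}
```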
How to use Metrics?
System Metrics
System Metrics cover the status of the entire cluster in great detail. They include the following aspects:
Master-level and Worker-level JVM parameters, such as load and time; Memory, broken down into heap usage, non-heap usage, direct memory usage, and mapped memory usage; Threads, which shows how many threads exist; and the very useful Garbage Collection metrics.
Network metrics are widely used and are very helpful when diagnosing performance problems. A Flink job is not just network transport; it is a directed acyclic graph in which each upstream-downstream pair forms a simple producer-consumer model, connected through the network by what is effectively a bounded queue. When evaluating performance, the state of these intermediate queues quickly narrows the scope of the problem and helps locate the bottleneck.
For more information, see: CPU, Memory, Threads, Garbage Collection, Network, Classloader, Cluster, Availability, Checkpointing, StateBackend, IO.
People who operate clusters care most about Cluster-related information. If a job is large, pay close attention to Checkpointing, because potential problems there may not show up in conventional indicators. For example, if checkpointing has not succeeded for a long time but the data flow shows no delay, the job may appear perfectly normal; yet after a failover restart, because no recent checkpoint exists, the job may roll back to a state from long ago, and all the work since then may be lost.
RocksDB is a commonly used state backend in production environments. If the amount of state is large enough, watch the RocksDB metrics closely, because its performance may degrade as the amount of data grows.
User-defined Metrics
In addition to the system's Metrics, Flink supports custom, user-defined Metrics. Everything above concerns the system framework; you can also use Metrics to expose indicators from your own business logic for monitoring.
The discussion of user-defined Metrics below focuses on the DataStream API; Table and SQL jobs may require some extra context, but if you write a UDF the approach is much the same.
In the DataStream API, your function must extend RichFunction; only a RichFunction has access to the Metrics interface. Through RichFunction you can call getRuntimeContext().getMetricGroup().addGroup(...), which is the entry point for user-defined Metrics and lets you create your own metric groups. To define a concrete metric, call getRuntimeContext().getMetricGroup().counter/gauge/meter/histogram(...); each method has a corresponding constructor through which you can register your own metric type.
Extend RichFunction
Register a user-defined MetricGroup: getRuntimeContext().getMetricGroup().addGroup(...)
Register a user-defined Metric: getRuntimeContext().getMetricGroup().counter/gauge/meter/histogram(...)
User-defined Metrics Example
Here is a simple example of how to use Metrics. When you define a Counter, you pass a name; the default Counter type is SimpleCounter (an implementation built into Flink). You can then call inc() on the Counter directly in your code.
Meter works similarly. Flink's built-in implementation is MeterView; since a Meter records how many events occur over a period of time, it needs a window length. You typically call markEvent() on the Meter to record each event, and getRate() divides the number of events in the window by the window length to compute the rate.
Gauge is even simpler. To expose the current time, pass System::currentTimeMillis directly as the lambda expression; the gauge reads the current system time each time it is queried.
Histogram is a little more complicated. Flink provides two implementations; whichever you choose, you specify a window size, and you pass a value each time you call update().
These Metrics are generally not thread-safe. If you access them from multiple threads, add your own synchronization. For more details, see the link below.
Counter processedCount = getRuntimeContext().getMetricGroup().counter("processed_count");
processedCount.inc();

Meter processRate = getRuntimeContext().getMetricGroup().meter("rate", new MeterView(60));
processRate.markEvent();

getRuntimeContext().getMetricGroup().gauge("current_timestamp", System::currentTimeMillis);

Histogram histogram = getRuntimeContext().getMetricGroup().histogram("histogram", new DescriptiveStatisticsHistogram(1000));
histogram.update(1024);

For more details, see: https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/metrics.html#metric-types
Get Metrics
There are three ways to obtain Metrics. First, you can view them in the WebUI. Second, you can fetch them through the RESTful API, which is more program-friendly: for automated scripts or programs, automated operations, and testing, the JSON returned by the RESTful API is easy to parse. Finally, metrics can be obtained through a Metric Reporter, which is the mechanism primarily used for monitoring.
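For example, the RESTful API lets you fetch specific metrics by name via a get query parameter on the job metrics endpoint (see the Flink 1.8 monitoring REST documentation; the host, port, and job ID below are placeholders). A minimal sketch of building such a request URL:

```java
import java.util.List;

public class MetricsRestUrl {
    // Build a Flink REST API URL that fetches the named metrics for a job,
    // e.g. GET http://<jobmanager>:8081/jobs/<jobid>/metrics?get=a,b
    static String jobMetricsUrl(String host, int port, String jobId, List<String> metricNames) {
        return "http://" + host + ":" + port + "/jobs/" + jobId
                + "/metrics?get=" + String.join(",", metricNames);
    }
}
```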
How are these ways of getting Metrics implemented in the physical architecture?
Understanding the background and principles leads to a deeper understanding of the usage. The WebUI and the RESTful API work by having a centralized node periodically query (fetch) the metrics from each component. The fetch is not necessarily real-time; the default interval is 10 seconds, so the data shown in the WebUI or returned by the RESTful API may not be the real-time values you expect. Fetches may also be out of sync across components: one component's values may update while another's do not, for example because a fetch timed out and the values could not be pulled. The fetch is best-effort, so the indicators you see may be delayed, and the values will catch up on a later fetch.
The pull paths (shown in red in the figure) go through the MetricFetcher: a central node aggregates the metrics and displays them. The MetricReporter works differently: each node reports its metrics directly, with no central node doing aggregation. If you want aggregation, you must do it in a third-party system, such as a typical TSDB. The decentralized structure is also an advantage: it avoids the problems of a centralized node, such as memory pressure from internal storage. And because the MetricReporter ships the raw data, processing built on that raw data can be much more powerful.
Metric Reporter
Flink ships with many built-in Reporters, which you can use as a reference when selecting an external system. For example, JMX comes with Java itself and is not strictly a third-party system. There are also InfluxDB, Prometheus, Slf4j (writing directly to the log, which is handy when debugging: the output goes through Flink's own log system and is included in the Flink framework package), and more. For details, see:

https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/metrics.html#reporter

Metric Reporter configuration example:

metrics.reporters: your_monitor,jmx
metrics.reporter.jmx.class: org.apache.flink.metrics.jmx.JMXReporter
metrics.reporter.jmx.port: 1025-10000
metrics.reporter.your_monitor.class: com.your_company.YourMonitorClass
metrics.reporter.your_monitor.interval: 10 SECONDS
metrics.reporter.your_monitor.config.a: your_a_value
metrics.reporter.your_monitor.config.b: your_b_value
How is a Metric Reporter configured? As shown above: first, metrics.reporters lists the reporter names, separated by commas. Flink then finds each reporter by reflection on its class name, e.g. metrics.reporter.jmx.class. Some reporters also need configuration such as metrics.reporter.jmx.port: third-party systems usually receive data over the network, so an IP address and port are commonly required. For a custom reporter, metrics.reporter.your_monitor.class is mandatory; you can optionally define the reporting interval, which Flink parses itself, and you can add your own config keys, which Flink passes through without interpreting them.
In practice: monitoring with Metrics
Metrics are commonly used for automated operations and performance analysis.
Automated operations

What does automated operations involve?
First, collect some key Metrics as a basis for decisions, using a Metric Reporter to ship them to a storage/analysis system (such as a TSDB), or fetching them directly through the RESTful API.
Once you have the data, you can define monitoring rules around the key metrics: Failover, Checkpoint, and business Delay information. Custom rules are most commonly used for alerting, which saves a great deal of manual work; for example, you can define how many failovers warrant human intervention.
When a problem occurs, notification tools such as DingTalk alerts, email alerts, SMS alerts, and phone alerts can be triggered.
The advantage of automated operations is that dashboards and reports present the data clearly: dashboards give an at-a-glance view of a job's overall state at any time, and report analysis supports optimization.
Performance analysis
Performance analysis generally follows the following process:
It starts with discovering the problem: with a Metrics system plus monitoring and alerting, problems can be located quickly. Then comes analyzing the problem, where dashboards make things easier: use specific System Metrics to narrow the scope, verify hypotheses, and find the bottleneck; then analyze the cause across dimensions such as business logic, the JVM, the operating system, State, and data distribution. If the cause still cannot be found, fall back to profiling tools.
In practice: "my task is slow, what should I do?"
"What should I do if my task is slow?" could be called one of the ultimate unanswerable questions.
The reason is that the question frames the whole system as the problem, like telling a doctor "I don't feel well" and expecting a diagnosis. A doctor narrows things down and pinpoints the problem through a series of tests; by the same token, a slow-task question requires multiple rounds of analysis before a clear answer emerges.
Beyond unfamiliarity with Flink's mechanisms, the problem for most people is that the whole system runs as a black box: they do not know how the system is running and lack the information needed to understand its state. An effective strategy is to turn to Metrics to see what is happening inside the system, as the following concrete examples illustrate.
Find a problem
For example, the figure shows a production job where one of the failover indicators is nonzero while all the others are zero; at that point, the problem has been found.
As another example, the figure shows the Input indicator normally sitting around 4 to 5 million, then suddenly dropping to zero; that also signals a problem.
A business delay problem is shown in the figure below: comparing the timestamp of the data being processed with the current time reveals that the job is processing data from an hour ago, when normally it processes data from one second ago; that too is a problem.
Narrow the scope and locate the bottleneck
Suppose something is slow but you do not know where. In the red part of the figure below, the OUT_Q usage has reached 100%, while everything else is normal or even excellent. This points to the producer-consumer model: the producer's OUT_Q is full and the consumer's IN_Q is full. The figure shows that node 4 is already too slow; it cannot keep up with the data node 1 produces, while node 5 performs normally. This means the queue between node 1 and node 4 is blocked, so we can focus on nodes 1 and 4, narrowing the scope of the problem.
When a metric has many parallel subtasks (256 parallelism in this example), it is impossible to inspect every point one by one, so tag each concurrent instance with its subtask index when aggregating; grouping by that label then shows which subtask is at 100%. In the figure, the two highest lines, subtask 324 and subtask 115, can be singled out, further narrowing the scope.
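The idea of aggregating by label can be sketched in plain Java (a hypothetical illustration; in practice this grouping is usually done in your TSDB or dashboard system):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LabelAggregationSketch {
    // A metric sample tagged with its subtask index.
    record Sample(int subtaskIndex, double value) {}

    // Group samples by the subtask label and keep the max value per subtask,
    // so the busiest subtasks (e.g. 100% OUT_Q usage) stand out.
    static Map<Integer, Double> maxBySubtask(List<Sample> samples) {
        return samples.stream().collect(Collectors.toMap(
                Sample::subtaskIndex,
                Sample::value,
                Math::max));
    }
}
```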
Metrics can also narrow the scope via Checkpoint Alignment time, as shown in the figure below, though this method is used less often.
Multidimensional analysis
Why is a task sometimes so slow?
Once you have located a particularly slow Task, you need to analyze what is making it slow. The candidate factors can be checked in priority order, from the business layer down to the underlying system, because most problems appear at the business level. Business-level factors include whether the parallelism is reasonable, data peaks and troughs, and data skew. Next, look at Garbage Collection, Checkpoint Alignment, and State Backend performance. Finally, examine system-level performance: CPU, memory, swap, disk IO, throughput and capacity, network IO and bandwidth, and so on.
Q&A
Metrics monitor the system internally; can they serve as the output when analyzing another system's logs with Flink?
Yes, but it is unnecessary. If you are already using Flink to process another system's logs, the results or alerts can simply be emitted as a sink. Metrics are for statistics about internal state; log records are ordinary input data, so you can output the results directly.
Does each Reporter have its own thread?
Each Reporter has its own separate thread. There are actually quite a lot of threads inside Flink; if you run a job and use jstack on the TaskManager, you can see the thread details.
At this point, you should have a deeper understanding of how Metrics work. Try it out in practice, and follow us to keep learning about related topics!
© 2024 shulou.com SLNews company. All rights reserved.