Many newcomers are not clear about Flink metrics, monitoring, and alerting. To help with that, this article explains them in detail; readers who need this background should find something to take away.
Why do we care about metric monitoring?
Let's take the weather as an example.
Metrics: a way to measure and describe an object.
- Quantifiable: the weather has been very hot recently, but is today hotter than yesterday? Is Beijing hotter than Shanghai? There is no way to judge without a number, so temperature is the metric that quantifies how hot the weather is.
- Standardized: the temperature we usually talk about is in Celsius. If someone quotes Fahrenheit and says it is 77 degrees today, you will wonder why it is so high. Metrics therefore need a unified standard.
- Multi-dimensional: someone in the south feels 35 degrees is too stuffy to breathe, while someone in the north finds 35 degrees tolerable, because besides temperature, air humidity also affects how the body feels. Measuring the weather requires combining metrics from several dimensions.
Monitoring: observing and acting on metrics.
- Real-time: a weather forecast that is updated in real time is the kind of monitoring we need.
- Easy to use: compared with the weather report broadcast at fixed times on TV, a mobile weather app is an easy-to-use monitoring tool.
- Historical queries: for example, if it has been raining somewhere for days and the rivers are running fast, that history may affect my travel plans.
Today's sharing covers the following four aspects:
1. The monitoring and alerting pipeline, based on the practice of the Meituan Dianping real-time computing platform.
2. Commonly used monitoring items: which metrics can effectively measure my job.
3. How metrics are aggregated: like the mountain that is "a ridge seen head-on, a peak seen from the side", the same metric looks different from different angles.
4. Applications of metric monitoring: common practices for reference.
1. The monitoring and alerting pipeline
1.1 The monitoring and alerting pipeline
Meituan Dianping's metric monitoring and alerting pipeline is shown in the figure below. First, logs and metrics are collected in a unified, centralized way. A Reporter (covered in lessons 2.8 and 3.1) writes the metrics of Flink jobs out as logs, which are picked up by log collection and received into Kafka. Real-time jobs then parse and aggregate them, and the resulting metrics are written back to Kafka as a real-time data source.
Downstream, the data is processed and displayed differently according to different needs. Log data goes to ES for querying, and keyword matching plus real-time jobs turn it into log-based alerts; numeric metrics go to OpenTSDB for querying, and alerts on all kinds of metrics are supported as well. Finally, all of this is brought together in our real-time computing platform and presented to users in a unified way.
The pipeline has three key parts.
- Log collection: logs and metrics must first be collected in a unified, centralized way. As the previous two lecturers mentioned, Flink currently offers three ways to get at metrics: viewing a job's metrics directly in the Flink UI, fetching them from the job (for example via the REST API), and plugging in one of the various third-party Reporters. Meituan builds on the Slf4j reporter, adding its own dimension information and formatting before sending the metrics downstream.
- Parsing and display: Flink jobs on the platform parse and aggregate the metric data of all jobs, which is then displayed to users and made available to downstream consumers.
- Monitoring and alerting: personalized, configurable alert rules are applied to the aggregated metrics.
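To make the reporter step more concrete, here is a minimal sketch of the idea. This is not Meituan's actual reporter; it assumes a log agent ships the resulting log lines to Kafka, and the extraDims configuration key is invented for illustration. The reporter implements Flink's MetricReporter and Scheduled interfaces and periodically writes each counter as a structured log line carrying extra dimension tags.

```java
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Metric;
import org.apache.flink.metrics.MetricConfig;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.metrics.reporter.MetricReporter;
import org.apache.flink.metrics.reporter.Scheduled;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Periodically writes counters as log lines so a log agent can ship them to Kafka. */
public class LogLineReporter implements MetricReporter, Scheduled {

    private static final Logger LOG = LoggerFactory.getLogger(LogLineReporter.class);

    private final Map<String, Counter> counters = new ConcurrentHashMap<>();
    private String extraDims; // e.g. cluster/region tags, read from the reporter config

    @Override
    public void open(MetricConfig config) {
        extraDims = config.getString("extraDims", "");
    }

    @Override
    public void close() {}

    @Override
    public void notifyOfAddedMetric(Metric metric, String name, MetricGroup group) {
        if (metric instanceof Counter) {
            counters.put(group.getMetricIdentifier(name), (Counter) metric);
        }
    }

    @Override
    public void notifyOfRemovedMetric(Metric metric, String name, MetricGroup group) {
        if (metric instanceof Counter) {
            counters.remove(group.getMetricIdentifier(name));
        }
    }

    @Override
    public void report() {
        long ts = System.currentTimeMillis();
        for (Map.Entry<String, Counter> e : counters.entrySet()) {
            // One line per metric; the downstream parsing job splits on these fields.
            LOG.info("metric={} value={} ts={} {}", e.getKey(), e.getValue().getCount(), ts, extraDims);
        }
    }
}
```

Such a reporter is typically wired in through Flink's metrics.reporter.* configuration keys rather than in job code.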
1.2 Metric display: Grafana
Grafana supports many data source formats, such as ES and OpenTSDB, and its variable feature lets you look at the metrics of a single job or compare several jobs side by side.
Compared with building your own metric display tool, Grafana's configuration interface is more convenient and saves time, effort, and cost. If you simply want to display the metrics of all jobs, Grafana is a good choice, and it can also be embedded in other pages. However, its chart types are relatively limited, which can be restrictive in practice.
2. Commonly used monitoring items
Let's look at which metrics are commonly used to measure how a job is running.
2.1 Commonly used metrics
■ System metrics
The system metrics are described on the official Flink website.
The most common concern among system metrics is job availability, such as uptime (how long the job has been running) and fullRestarts (how many times the job has restarted).
The second concern is job traffic. Metrics such as numRecordsIn and numBytesInLocal let you track how many messages the job processes per day and what the traffic looks like at peak times, so you can see whether the job's throughput is normal.
Then there are CPU (e.g. CPU.Load), memory (e.g. Heap.Used), GC (e.g. GarbageCollector.Count, GarbageCollector.Time) and network (inputQueueLength, outputQueueLength) metrics, which are generally used to troubleshoot job problems.
There is also checkpoint-related information, such as the duration of the most recent checkpoint (lastCheckpointDuration), the size of the most recent checkpoint (lastCheckpointSize), whether the job can recover after a failure (lastCheckpointRestoreTimestamp), the number of successful and failed checkpoints (numberOfCompletedCheckpoints, numberOfFailedCheckpoints), and the barrier alignment time in exactly-once mode (checkpointAlignmentTime).
There are also connector metrics, such as those of the commonly used Kafka connector. Kafka itself exposes metrics that help us see the latest offsets the job has consumed, whether the job is lagging, and so on.
■ Custom metrics
Custom metrics are instrumentation points users add to their own job logic so that they can monitor their business logic.
As other speakers have mentioned, a Flink job today is more like a micro-service: we not only care whether the job has processed all the data, but also expect it to run 24/7. So metrics that matter in the business logic matter in Flink as well.
One example is the time spent in the processing logic. For a business system containing complex logic, you can instrument before and after that logic to see how long each message takes to process, as in the sketch below.
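A rough sketch of timing the processing logic with a Flink Histogram. It uses the Dropwizard wrapper from the flink-metrics-dropwizard module; the function name, metric name, and placeholder business logic are made up for illustration.

```java
import com.codahale.metrics.SlidingWindowReservoir;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.dropwizard.metrics.DropwizardHistogramWrapper;
import org.apache.flink.metrics.Histogram;

/** Records how long the business logic takes for each message, in microseconds. */
public class TimedEnrichFunction extends RichMapFunction<String, String> {

    private transient Histogram processLatency;

    @Override
    public void open(Configuration parameters) {
        processLatency = getRuntimeContext()
                .getMetricGroup()
                .histogram("processLatencyUs", new DropwizardHistogramWrapper(
                        new com.codahale.metrics.Histogram(new SlidingWindowReservoir(500))));
    }

    @Override
    public String map(String value) {
        long start = System.nanoTime();
        String result = doBusinessLogic(value);   // the actual processing logic
        processLatency.update((System.nanoTime() - start) / 1_000);
        return result;
    }

    private String doBusinessLogic(String value) {
        // Placeholder for the real transformation.
        return value.trim();
    }
}
```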
Another is the performance of external service calls. A Flink job may need to access external storage such as Redis, and you can track the latency and success rate of those requests.
There is also cache hit rate. Sometimes the dataset is too large, so we only access hot data and cache part of it in memory. We can monitor the cache hit rate: if it is high, the cache is effective; if it is low and the job keeps going to external storage, the cache design may need rethinking.
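A hedged sketch of how cache hit rate might be exposed: two counters for hits and misses, plus a gauge that reports the ratio. The class, cache, and lookup names here are illustrative, not from the talk; the same counter-plus-gauge pattern also works for external-call success rates.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Gauge;

import java.util.HashMap;
import java.util.Map;

/** Looks up a dimension value, serving hot keys from an in-memory cache. */
public class CachedLookupFunction extends RichMapFunction<String, String> {

    private transient Counter cacheHits;
    private transient Counter cacheMisses;
    private transient Map<String, String> cache;

    @Override
    public void open(Configuration parameters) {
        cache = new HashMap<>();
        cacheHits = getRuntimeContext().getMetricGroup().counter("cacheHits");
        cacheMisses = getRuntimeContext().getMetricGroup().counter("cacheMisses");
        // Gauge derived from the two counters: current hit rate in [0, 1].
        getRuntimeContext().getMetricGroup().gauge("cacheHitRate", (Gauge<Double>) () -> {
            long hits = cacheHits.getCount();
            long total = hits + cacheMisses.getCount();
            return total == 0 ? 0.0 : (double) hits / total;
        });
    }

    @Override
    public String map(String key) {
        String value = cache.get(key);
        if (value != null) {
            cacheHits.inc();
        } else {
            cacheMisses.inc();
            value = lookupExternalStore(key);  // e.g. a Redis or HBase read
            cache.put(key, value);
        }
        return key + "," + value;
    }

    private String lookupExternalStore(String key) {
        // Placeholder for the real external lookup.
        return "value-of-" + key;
    }
}
```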
In addition, a few things can happen inside the processing logic. If the processing logic throws an exception the job will fullRestart, so such exceptions are generally caught; if complex computation is involved, a retry mechanism can try a few times, and if the retries still fail the data is discarded. The proportion of data handled this way, or some characteristics of it, can be reported as metrics so that we can watch for anomalies in data processing. For example, the proportion of records removed by a filter tells you whether the filter logic is behaving normally, and time-related operators such as windows should monitor the proportion of records dropped, such as late data.
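As a short illustration of the filter-ratio idea (names invented; the filtering condition is a placeholder), two counters let you plot what fraction of records a filter drops:

```java
import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

/** Counts total and filtered-out records so the filter ratio can be charted. */
public class CountingFilter extends RichFilterFunction<String> {

    private transient Counter recordsIn;
    private transient Counter recordsDropped;

    @Override
    public void open(Configuration parameters) {
        recordsIn = getRuntimeContext().getMetricGroup().counter("filterRecordsIn");
        recordsDropped = getRuntimeContext().getMetricGroup().counter("filterRecordsDropped");
    }

    @Override
    public boolean filter(String value) {
        recordsIn.inc();
        boolean keep = !value.isEmpty();   // the real filtering condition goes here
        if (!keep) {
            recordsDropped.inc();
        }
        return keep;
    }
}
```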
2.2 How to decide which metrics deserve attention?
The first point relates to job status: whether the job has failed, whether it is alive, whether it runs stably, and the risk factors that affect its availability (such as whether the last checkpoint succeeded and when the last successful checkpoint was).
The second point relates to job performance, such as processing latency, data skew, and performance bottlenecks (for example, external access).
The third point relates to business logic, such as upstream data quality, whether newly added logic has problems, and whether data is being lost (is loss acceptable under exactly-once semantics?).
3. How metrics are aggregated
The previous section introduced commonly used metrics; next is how to look at them. The same metric can be viewed from the perspective of a machine or of a job, and different angles give different pictures.
The first question is the aggregation dimension: fine-grained dimensions such as task and operator, and coarser ones such as job, machine, cluster, or business dimensions (for example, per region). When hunting for a problem, start from coarse granularity and drill down to fine granularity; to see the overall picture, a relatively coarse granularity is enough. The raw metrics can be reported and aggregated differently for different scenarios, and fine-grained queries, such as task granularity, are needed for performance tuning.
The second question is the aggregation method: sum, mean, maximum, minimum, rate of change, and so on. Take care to remove statistical noise, for example by using a moving average or a fixed-window average to smooth the curve. Differences are useful too, such as the gap between upstream and downstream data volumes, or between the latest offset and the consumed offset. Percentiles such as the 99th can be used to measure rates and latencies. Finally, one pitfall we have hit in practice is missing metrics: without metrics the job becomes a black box, so we also need to monitor whether metric collection itself is healthy, and whether a single metric or all of a job's metrics have gone missing.
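To make the smoothing and percentile ideas concrete (a self-contained sketch, not code from the platform): given a series of metric samples, a moving average flattens spikes, and the 99th percentile summarizes the tail.

```java
import java.util.Arrays;

/** Small helpers for smoothing a metric series and computing a percentile. */
public final class MetricAggregation {

    /** Moving average over a fixed window; early points average whatever is available. */
    public static double[] movingAverage(double[] samples, int window) {
        double[] smoothed = new double[samples.length];
        double sum = 0;
        for (int i = 0; i < samples.length; i++) {
            sum += samples[i];
            if (i >= window) {
                sum -= samples[i - window];
            }
            smoothed[i] = sum / Math.min(i + 1, window);
        }
        return smoothed;
    }

    /** Nearest-rank percentile, e.g. p = 0.99 for the "99th line". */
    public static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(rank, sorted.length - 1))];
    }

    public static void main(String[] args) {
        double[] latenciesMs = {12, 15, 11, 240, 13, 14, 16, 12, 11, 300};
        System.out.println(Arrays.toString(movingAverage(latenciesMs, 3)));
        System.out.println("p99 = " + percentile(latenciesMs, 0.99));
    }
}
```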
When observing metrics you may also need complex aggregate queries over several metrics. A common one is timeline comparison: if a previously healthy job shows backpressure today, you can check how much today's data volume has grown compared with yesterday's. Different businesses also have different patterns, such as the peak hours of food delivery, which can be measured by comparing the peak-period data volume against the previous period. Duration matters as well: if a job's data delay stays high for a long time, the job may be abnormal. And if a metric changes periodically, consider whether a time window is the cause.
You may also need to bring in external systems for the calculation. For example, when consuming from Kafka, besides the current consumption you may also want the upstream data volume. In the figure, the blue line is the data volume of the upstream Kafka topic and the red line is the output volume of the job's source operator. The two curves basically match, and the upstream data grows during the midday and evening peaks, so although there is backpressure at peak time, it is mainly caused by the growth of upstream data rather than by the job lacking processing capacity. If there are several upstream operators, their data volumes can be summed, which is one reason we use a self-developed front end in addition to Grafana: it can display metrics more flexibly.
4. Applications of metric monitoring
4.1 Alerting on abnormal jobs
- Abnormal job status: abnormal task states such as failing, as well as anomalies in metrics like uptime.
- Jobs reporting no metrics: an alert is sent to the job owner; once the number of such jobs reaches a set level, an alert goes directly to the platform administrator.
- A metric reaching a threshold: the most commonly used alert type, for example processing throughput dropping to zero, consumption lag (behind by a certain amount for a certain duration), failure rate, loss rate, and so on (see the sketch after the note below).
- Personalized alerts: the real-time computing platform runs many kinds of jobs, and different jobs have different characteristics. For example:
  - Alert time window: different time periods may need different thresholds, or different alert channels (company IM, phone call, etc.).
  - Aggregation: different businesses may aggregate alerts in different ways, which also needs to be accommodated as far as possible.
  - Error logs and keyword logs: trigger an alert when the volume of error logs reaches a certain level or when a keyword appears in the logs.
Note: the stability of the alerting system itself must come first. Avoid false alarms, missed alarms, and delays; otherwise the business side's judgment will be thrown off.
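As an illustration of how the threshold-type alerts above might be evaluated (a simplified sketch, not the platform's actual rule engine; the rule fields and sample values are invented):

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Fires when a metric stays beyond a threshold for a sustained period,
 * e.g. "consumption lag above 10,000 for 5 consecutive checks".
 */
public class ThresholdAlertRule {

    private final String metricName;
    private final double threshold;
    private final int requiredConsecutiveBreaches;
    private final Deque<Double> recentValues = new ArrayDeque<>();

    public ThresholdAlertRule(String metricName, double threshold, int requiredConsecutiveBreaches) {
        this.metricName = metricName;
        this.threshold = threshold;
        this.requiredConsecutiveBreaches = requiredConsecutiveBreaches;
    }

    /** Called once per evaluation interval with the latest aggregated value. */
    public boolean evaluate(double latestValue) {
        recentValues.addLast(latestValue);
        if (recentValues.size() > requiredConsecutiveBreaches) {
            recentValues.removeFirst();
        }
        boolean allBreached = recentValues.size() == requiredConsecutiveBreaches
                && recentValues.stream().allMatch(v -> v > threshold);
        if (allBreached) {
            // Hand off to the notification channel (IM, phone call, ...) chosen for this time period.
            System.out.printf("ALERT: %s > %.0f for %d consecutive checks%n",
                    metricName, threshold, requiredConsecutiveBreaches);
        }
        return allBreached;
    }

    public static void main(String[] args) {
        ThresholdAlertRule lagRule = new ThresholdAlertRule("kafkaConsumerLag", 10_000, 5);
        double[] lagSamples = {2_000, 12_000, 15_000, 18_000, 20_000, 25_000};
        for (double lag : lagSamples) {
            lagRule.evaluate(lag);
        }
    }
}
```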
4.2 Metrics reflect the overall state of the platform
Highlight outliers, compare multi-dimensional aggregations and timelines, find faults in time and locate them quickly, identify directions in which the platform can be optimized, and support overall resource allocation.
4.3 Stages of automated operations:
- No operability: there are no metrics, the job is a black box, and when something goes wrong a crowd of people digs through the code.
- Manual operations: restarting, scaling out, rolling back, migrating, degrading, fixing buggy code, optimizing processing logic. As the people on call put it: whatever you are doing, when the alert fires you have to pull out your laptop and phone to troubleshoot.
- Assisted operations: once enough manual operations have been done and the metrics of business jobs have been standardized, reference values can be derived and the experience distilled into suggestions for other engineers, so that even newcomers can handle issues quickly with these tools and the learning cost drops.
- Intelligent operations: a mode in which failures are handled automatically, without human involvement. If the machine running the job dies, the job is brought back up and restarted automatically; if resources are insufficient, capacity is scaled out automatically; if an online job has a problem, traffic switches to a standby job automatically. Of course, what can be done today only solves part of the problem; some code bugs still need people to fix them.
Q&A
Q1: When building a complete metric system, how do you maintain the metric database? Does it require code-level changes to the program, or just configuration changes?
A: Since we want to build a complete monitoring system, we naturally want the metrics to be as adaptable as possible, so what does that require? When designing the architecture of the whole system we need a degree of compatibility; we cannot focus on only one kind of metric. At the design stage, think about what types of metrics exist, what characteristics each type has, along which dimensions they may be aggregated, and how to aggregate them, and build a model from that. First extract the common features of the metrics, design around those features, and finally build a compatible system, so that for known metric types you only need to change configuration in order to extend it.
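As a hypothetical illustration of what such a model could look like (the names and fields are invented; the platform's actual model is not described in the talk), a generic metric record might carry a name, dimension tags, a value, a timestamp, and the aggregation it supports, so that new metric types need only new configuration rather than new code:

```java
import java.util.Collections;
import java.util.Map;

/** A generic metric record: new metric types differ only in name, tags, and aggregation. */
public final class MetricRecord {

    /** How values of this metric are meant to be aggregated across dimensions. */
    public enum Aggregation { SUM, AVG, MAX, MIN, LAST }

    private final String name;                  // e.g. "numRecordsIn"
    private final Map<String, String> tags;     // e.g. job, operator, task, host, region
    private final double value;
    private final long timestampMillis;
    private final Aggregation aggregation;

    public MetricRecord(String name, Map<String, String> tags, double value,
                        long timestampMillis, Aggregation aggregation) {
        this.name = name;
        this.tags = Collections.unmodifiableMap(tags);
        this.value = value;
        this.timestampMillis = timestampMillis;
        this.aggregation = aggregation;
    }

    public String getName() { return name; }
    public Map<String, String> getTags() { return tags; }
    public double getValue() { return value; }
    public long getTimestampMillis() { return timestampMillis; }
    public Aggregation getAggregation() { return aggregation; }
}
```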
Q2: Grafana's display is very good, but its alerting is not so friendly; is there a more mature tool for alerting?
A: You can take a look at Prometheus; its alerting is fairly mature. Alerting needs more customization than metric aggregation, and if you want very complete functionality you may need to consider building your own.
Q3: Can TaskManager metrics be obtained inside an operator?
A: Get them through the REST API; doing it inside an operator is not recommended. Metrics themselves should not affect the job's processing logic; monitoring should remain a peripheral concern.
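A minimal sketch of fetching TaskManager metrics from outside the job via Flink's REST API. It assumes the JobManager REST endpoint is at localhost:8081 and that you substitute a real TaskManager id (e.g. obtained from GET /taskmanagers); the placeholder id and chosen metric names are just examples.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

/** Fetches selected TaskManager metrics from the Flink REST API. */
public class TaskManagerMetricsFetcher {

    public static void main(String[] args) throws Exception {
        // Assumed JobManager REST address and TaskManager id; replace with your own.
        String restBase = "http://localhost:8081";
        String tmId = args.length > 0 ? args[0] : "<taskmanager-id>";

        // Ask only for the metrics we care about, e.g. CPU load and heap usage.
        String url = restBase + "/taskmanagers/" + tmId
                + "/metrics?get=Status.JVM.CPU.Load,Status.JVM.Memory.Heap.Used";

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line);
            }
            // Response is a JSON array of {"id": ..., "value": ...}; parse as needed.
            System.out.println(body);
        } finally {
            conn.disconnect();
        }
    }
}
```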
Q4: How do you find the root cause of a job problem from the metrics?
A: Work through the metrics from coarse-grained to fine-grained; you can refer to the tutorials in lessons 2.8 and 3.1.
Q5: The volume of metric data is fairly large; how should we choose storage?
A: OpenTSDB is one choice, and so are other time-series databases, or engines such as Hive or other OLAP systems. Metrics are a kind of time-series data, and there are many mature options to choose from.