Best practices for building a three-dimensional monitoring system

2025-07-06 update, from SLTechnology News&Howtos (Shulou.com)

Author: Yunqi Community

Original: https://www.aliyun.com/aliware/news/monitoringsolution

Abstract: This article starts from the complexity of calls in today's distributed systems, analyzes three usage scenarios of the call chain and its best practices, and briefly describes how to make the call chain the core of troubleshooting, so that all kinds of data can be associated through it to improve troubleshooting ability. [News flash] EDAS adds online method tracing, covering the "last kilometer" of application diagnosis.

1. The current state of distributed call systems

As Internet architectures expand, distributed systems are becoming more and more complex, and more and more components are being distributed: microservices, messaging, distributed databases, distributed caches, distributed object storage, and cross-domain invocation. Together these components form a complex distributed network.

As shown on the right side of the figure above, when Application A makes a request, dozens of services or more may be called behind it. If the distributed system is compared to a highway network, each front-end request is a vehicle driving on the highway, and each application that processes the request is a toll station. Every toll station records the vehicle's passage in a log, including the time, license plate, station, highway, and price. If the logs of all toll stations are brought together, the vehicle's complete route can be reconstructed from its unique license plate number. Distributed call tracing and monitoring follows the same idea: track each request, then work out which applications it passed through, how long each took, and so on.

Alibaba's implementation of distributed call tracing: EagleEye

Alibaba's distributed call tracing is implemented by the EagleEye system, a log-based distributed call tracing system. Its core is the call chain: a globally unique ID (TraceId) is generated for each request, and through it the "isolated" call information of different systems is associated to recover more valuable data.
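The idea of associating "isolated" log records through one ID can be sketched as follows. This is a minimal illustration, not EagleEye's implementation: `uuid4` stands in for whatever TraceId format EagleEye actually uses, and the record fields (`trace_id`, `app`, `ts`, `cost_ms`) are hypothetical.

```python
import uuid
from collections import defaultdict


def new_trace_id() -> str:
    """Return a globally unique ID (uuid4 is a stand-in, not EagleEye's format)."""
    return uuid.uuid4().hex


def group_by_trace(records):
    """Group log records from any number of applications by trace_id,
    ordering each group by timestamp."""
    chains = defaultdict(list)
    for rec in records:
        chains[rec["trace_id"]].append(rec)
    for recs in chains.values():
        recs.sort(key=lambda r: r["ts"])
    return dict(chains)


trace_a = new_trace_id()
trace_b = new_trace_id()

# Hypothetical log records, as if collected from three separate applications.
logs = [
    {"trace_id": trace_a, "app": "inventoryplatform", "ts": 3, "cost_ms": 5},
    {"trace_id": trace_b, "app": "buy", "ts": 2, "cost_ms": 40},
    {"trace_id": trace_a, "app": "buy", "ts": 1, "cost_ms": 120},
    {"trace_id": trace_a, "app": "delivery", "ts": 2, "cost_ms": 30},
]

chains = group_by_trace(logs)
apps_on_trace_a = [r["app"] for r in chains[trace_a]]
print(apps_on_trace_a)
```

Like the license plate in the toll-station analogy, the ID is the only thing the applications need to share for their logs to be joined afterwards.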

The figure above shows a call chain from the production environment. The application name column lists the applications the request passes through. The Buy application comes first, followed by calls to delivery, tee, inventoryplatform, and so on, forming a call tree (indentation in the tree indicates nesting). From the call tree, it is easy to see the complete processing of the front-end request.
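A call tree of this kind can be rendered from flat span records. The sketch below assumes, purely for illustration, that each span carries a dotted hierarchical `rpc_id` such as `0.1.1`, where each extra segment means one more level of nesting; the article does not describe the real RpcId format, and the nesting and timings in the sample data are made up.

```python
def render_call_tree(spans):
    """Return one indented line per span.

    Depth is the number of dots in the (assumed) dotted rpc_id, and sorting
    by the numeric segments yields parent-before-child order.
    """
    def sort_key(span):
        return [int(part) for part in span["rpc_id"].split(".")]

    lines = []
    for span in sorted(spans, key=sort_key):
        depth = span["rpc_id"].count(".")
        lines.append("  " * depth + f'{span["app"]} ({span["cost_ms"]} ms)')
    return lines


# Hypothetical spans of one request, in arbitrary arrival order.
spans = [
    {"rpc_id": "0.2", "app": "inventoryplatform", "cost_ms": 5},
    {"rpc_id": "0", "app": "buy", "cost_ms": 120},
    {"rpc_id": "0.1.1", "app": "tee", "cost_ms": 8},
    {"rpc_id": "0.1", "app": "delivery", "cost_ms": 30},
]

tree_lines = render_call_tree(spans)
print("\n".join(tree_lines))
```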

Also note that the figure above uses a white background and a blue background. The blue background indicates that the call chain has passed through a message and continues on an asynchronous message channel, so all subsequent processing is asynchronous; the white background indicates synchronous processing. In general, the time a front-end user waits does not include the time spent in the blue area; only the synchronous time counts.
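The distinction between synchronous and asynchronous time can be made concrete with a small calculation. This is a sketch under assumptions: the span fields (`start_ms`, `end_ms`, `is_async`) and the sample values are hypothetical, and user-visible latency is approximated as the time window covered by the synchronous spans only.

```python
def user_visible_latency_ms(spans):
    """Latency the front-end user waits for: the window covered by
    synchronous spans, ignoring everything after a message hand-off."""
    sync = [s for s in spans if not s["is_async"]]
    return max(s["end_ms"] for s in sync) - min(s["start_ms"] for s in sync)


def total_processing_window_ms(spans):
    """Window covered by all spans, synchronous and asynchronous."""
    return max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)


# Hypothetical spans: "notify" runs asynchronously after a message is sent.
spans = [
    {"app": "buy", "start_ms": 0, "end_ms": 120, "is_async": False},
    {"app": "delivery", "start_ms": 10, "end_ms": 40, "is_async": False},
    {"app": "notify", "start_ms": 50, "end_ms": 900, "is_async": True},
]

visible = user_visible_latency_ms(spans)
total = total_processing_window_ms(spans)
print(visible, total)
```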

The page in the figure above also shows clearly how long each application takes to process the request, which makes slow spots easy to locate. Status information is another point of interest: if an error occurs during a call, an exception is reported (marked in red in the figure), and clicking the status code shows the details of the error.

EagleEye was launched internally at Alibaba in 2013 and currently supports Alibaba Group's e-commerce, Gaode, Youku, and other businesses. On the technical side it covers the front-end gateway access layer, the remote procedure call (RPC) framework, message queues, databases, distributed caches, and custom components (such as payment, search SDKs, and local method instrumentation). In 2016 it was released in EDAS, Aliyun's middleware cloud product, to serve external users. EagleEye also supports proprietary cloud deployments.

2. Usage scenarios

Let's take a look at the specific usage scenarios of the call chain.

Locating exceptions and slow calls

You can find the TraceId in the error messages of the business exception log (such as TraceId=ac18287913742691251746923 in the figure). Entering that TraceId in the EagleEye system shows the corresponding call chain, where the problem can be located more intuitively (as shown in the figure above) and pinned down by layer-by-layer troubleshooting.
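Pulling the TraceId out of an exception log is a simple text match. The sketch below assumes the ID appears as `TraceId=<hex digits>`, which fits the example value above; the surrounding log line is invented for illustration.

```python
import re

# Assumption: TraceId values consist of hexadecimal characters only.
TRACE_ID_PATTERN = re.compile(r"TraceId=([0-9a-fA-F]+)")


def extract_trace_ids(log_text):
    """Return every TraceId found in the given log text, in order."""
    return TRACE_ID_PATTERN.findall(log_text)


# Hypothetical exception log line containing the TraceId from the article.
log_line = (
    "ERROR OrderService - create order failed, "
    "TraceId=ac18287913742691251746923, cause: timeout"
)

found = extract_trace_ids(log_line)
print(found)
```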

Monitoring reports with call chain drill-down

A distributed call tracing system provides more than the call chain itself. Because it instruments every middleware call, everything that happens in the middleware can be monitored, and a detailed call monitoring report is produced as the call chains are formed. Unlike other monitoring, this report supports drilling up and down. Since the call chain is detailed, low-level data, the report dimensions that can be built on top of it are very rich. In the call report shown in the figure above, you can see not only the state of a service but also drill down to the services it invokes. You can also drill down from the monitoring report to the call chain to view the full call chain information.
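One way to picture such a report is as an aggregation over individual call records that keeps the underlying TraceIds, so each row can be drilled down to the call chains behind it. The field names (`caller`, `callee`, `ok`, `cost_ms`, `trace_id`) and the sample data are hypothetical.

```python
from collections import defaultdict


def build_report(calls):
    """Aggregate call records per (caller, callee) pair and keep the
    trace IDs so each row can be drilled down to its call chains."""
    report = defaultdict(
        lambda: {"count": 0, "errors": 0, "total_ms": 0, "trace_ids": []}
    )
    for call in calls:
        row = report[(call["caller"], call["callee"])]
        row["count"] += 1
        row["errors"] += 0 if call["ok"] else 1
        row["total_ms"] += call["cost_ms"]
        row["trace_ids"].append(call["trace_id"])
    return dict(report)


def drill_down(report, caller):
    """Return the rows for the services that `caller` invokes."""
    return {callee: row for (src, callee), row in report.items() if src == caller}


# Hypothetical call records extracted from three call chains.
calls = [
    {"trace_id": "t1", "caller": "buy", "callee": "delivery", "ok": True, "cost_ms": 30},
    {"trace_id": "t2", "caller": "buy", "callee": "delivery", "ok": False, "cost_ms": 50},
    {"trace_id": "t1", "caller": "buy", "callee": "inventoryplatform", "ok": True, "cost_ms": 5},
    {"trace_id": "t3", "caller": "delivery", "callee": "tee", "ok": True, "cost_ms": 8},
]

report = build_report(calls)
buy_rows = drill_down(report, "buy")
print(buy_rows["delivery"])
```

Here `buy_rows["delivery"]["trace_ids"]` is the last drill-down step: the list of call chains that contributed to that report row.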

Link analysis

A link is different from a call chain: a link is a statistical concept, whereas a call chain is the process of a single call. The value of link analysis lies mainly in the following points:

1. Topology analysis: analyze call sources and destinations, and identify unreasonable sources

2. Dependency mapping: identify vulnerable points, performance bottlenecks, strong dependencies, etc.

3. Capacity estimation: evaluate capacity based on link call ratio and peak QPS

Let's look at these three points in detail.

a. Topology analysis: analyze call sources and destinations, and identify unreasonable sources

The figure above is the global call topology. It shows the complex call relationships between applications, and lets you view the call relationships between one application and the others as well as the call frequency. A red dot in the figure indicates that an error occurred during a call. Through the topology diagram, an architect can clearly observe the calls across the system. Clicking a node in the global call topology drills down to the single-application link topology shown in the following figure.

The single-application link topology is centered on one application and shows its specific call relationships with the applications upstream and downstream of it.
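Both views can be derived from the same call records: the global topology is a count of calls per edge, and the single-application view filters those edges around one node. This is a minimal sketch with hypothetical application names and record fields.

```python
from collections import Counter


def global_topology(calls):
    """Count calls per (caller, callee) edge across all call chains."""
    return Counter((c["caller"], c["callee"]) for c in calls)


def app_topology(edges, app):
    """Return (upstream, downstream) call counts centered on one application."""
    upstream = {src: n for (src, dst), n in edges.items() if dst == app}
    downstream = {dst: n for (src, dst), n in edges.items() if src == app}
    return upstream, downstream


# Hypothetical call records; "cart" and "tee" are illustrative neighbors.
calls = [
    {"caller": "buy", "callee": "delivery"},
    {"caller": "buy", "callee": "delivery"},
    {"caller": "cart", "callee": "delivery"},
    {"caller": "delivery", "callee": "tee"},
    {"caller": "buy", "callee": "inventoryplatform"},
]

edges = global_topology(calls)
upstream, downstream = app_topology(edges, "delivery")
print(upstream, downstream)
```

An unexpected entry in `upstream` is the kind of "unreasonable source" that topology analysis is meant to surface.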

b. Dependency mapping and capacity estimation

Besides topology analysis, link analysis can also map dependencies, identifying vulnerable points, performance bottlenecks, strong dependencies, and similar problems. It can also evaluate capacity from the link call ratio and peak QPS.

The figure above is a single-link report: the invocation relationship formed by superimposing the call chains of the same HTTP entry, including all dependencies. The blurred part on the left of the figure is a call tree showing the dependencies between applications. Unlike in a call chain, these dependencies are statistical, so the report contains statistical data such as QPS. When estimating capacity, this makes it easy to analyze the pressure that upstream applications put on downstream ones.
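A simple reading of "call ratio and peak QPS" is that the expected peak load on a downstream application equals the entry's peak QPS multiplied by the average number of downstream calls per entry request. The sketch below works through that arithmetic; the formula is an interpretation of the text, and all numbers are made up.

```python
def call_ratio(entry_requests, downstream_calls):
    """Average number of downstream calls caused by one entry request."""
    return downstream_calls / entry_requests


def estimate_downstream_qps(entry_peak_qps, ratios):
    """Estimate the peak QPS each downstream application must sustain."""
    return {app: entry_peak_qps * ratio for app, ratio in ratios.items()}


# Hypothetical statistics from a single-link report for one HTTP entry.
entry_requests = 1000
downstream_calls = {"delivery": 1000, "inventoryplatform": 2500}

ratios = {
    app: call_ratio(entry_requests, n) for app, n in downstream_calls.items()
}
estimate = estimate_downstream_qps(entry_peak_qps=2000, ratios=ratios)
print(estimate)
```

With these figures, an entry peak of 2000 QPS implies about 5000 QPS on inventoryplatform, because each entry request calls it 2.5 times on average.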

The same report can be used for dependency mapping: fault points can be identified from the error rate, places with strong dependencies and blocking errors are potential failure points, and performance optimization can be prioritized according to each call's share of the total time.

3. Best practices

As the core of troubleshooting, the call chain can associate all kinds of data and improve troubleshooting ability. Let's look at the best practice for the call chain: holographic troubleshooting.

Holographic troubleshooting

The problems shown in the figure above come up often in real troubleshooting, and each has a clear business meaning. Although they do not seem to have anything to do with the call chain, they can be solved well with it. As shown on the right side of the figure above, the calls carried by the five nodes A through E on the call chain are in fact specific pieces of business. For example, node A handles an HTTP request that represents seller abc clicking to place an order; the call to B is in fact calculating seller xyz's freight on this route, and so on. When troubleshooting, the most valuable entry point is the business problem; from there the problem is narrowed down in the call chain.

We can look up the call chain in reverse from a business event ID to find more upstream and downstream business information. For example, suppose a problem is found in a transaction order (2135897412389123). From the order number we can find the TraceId bound to it, and from the TraceId we can see not only system call events but also business events, such as the order placed by the user and the current inventory. In other words, starting from the transaction ID, you can view transaction, inventory, payment, and other information on the call chain, which greatly speeds up troubleshooting.

Let's return to the three questions mentioned above. Finding which order operation caused a call exception is a matter of associating TraceId with OrderId. Finding whether an abnormal order was caused by the seller's abnormal operation on the freight template of the item it belongs to means going from OrderId to ItemId and finally to TraceId. For the third question, UserId is usually associated with TraceId and then with MyBizId.

From these problems and their solutions, we can see that the key to holographic troubleshooting is the two-way binding between business event IDs and TraceId/RpcId. There are three common ways to implement it:

1. Put the business event ID in the Tags or UserData of the call chain, establishing the association between the call chain and the business event ID.

2. Attach the TraceId to data changes in the database, establishing the association between the call chain and each data change.

3. Record the TraceId, business event ID, and other information in the business log, establishing the association between the call chain and the business event log.

Aliyun ARMS currently integrates all three two-way binding approaches, and users can configure them easily in the product.
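Whichever of the three approaches supplies the data, the result is an index that can be queried in both directions. The class below is a minimal in-memory sketch of that two-way binding; `TraceBizIndex` and its methods are hypothetical names, not part of EagleEye, EDAS, or ARMS.

```python
from collections import defaultdict


class TraceBizIndex:
    """Two-way index between trace IDs and (type, value) business IDs."""

    def __init__(self):
        self._biz_by_trace = defaultdict(set)
        self._trace_by_biz = defaultdict(set)

    def bind(self, trace_id, biz_type, biz_id):
        """Record that a business ID was seen on the given call chain."""
        key = (biz_type, biz_id)
        self._biz_by_trace[trace_id].add(key)
        self._trace_by_biz[key].add(trace_id)

    def traces_for(self, biz_type, biz_id):
        """Business ID -> trace IDs (reverse lookup of the call chain)."""
        return set(self._trace_by_biz.get((biz_type, biz_id), set()))

    def biz_ids_for(self, trace_id):
        """Trace ID -> business IDs seen on that call chain."""
        return set(self._biz_by_trace.get(trace_id, set()))


index = TraceBizIndex()
# The order number comes from the article; trace IDs and ItemId are made up.
index.bind("t1", "OrderId", "2135897412389123")
index.bind("t1", "ItemId", "i9")
index.bind("t2", "ItemId", "i9")

order_traces = index.traces_for("OrderId", "2135897412389123")
item_traces = index.traces_for("ItemId", "i9")
print(order_traces, item_traces, index.biz_ids_for("t1"))
```

The OrderId -> ItemId -> TraceId path from the second question corresponds to calling `traces_for`, then `biz_ids_for`, then `traces_for` again.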

Holographic troubleshooting panorama

The figure above is the panorama of holographic troubleshooting inside Alibaba. At its core are the back-end systems that EagleEye covered from the start, including services, messages, and caches. At the front-end level, user access logs can be associated with a TraceId, and the mobile side has the same ability. At the database level, the TraceId is passed into the database binlog through the SQL statement, so the association between each data change record and its TraceId is readily available during data replication and data distribution. In addition, the business can print the TraceId in its own business logs and exception stacks.

In this way, all the components of the business layer, the mobile side, the front end, and the data layer are associated with TraceId, and through it with the order number, user number, commodity number, logistics order number, and transaction number of the business, forming a very powerful ecosystem. From a call chain you can see the related upstream and downstream orders and user details, and look up the business IDs related to an order. From those business IDs you can extend to more related IDs and even to other TraceIds, finally forming a network of TraceId -> business ID -> new TraceId. Troubleshooting becomes a matter of finding the needed information in that network.
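Walking the TraceId -> business ID -> new TraceId network can be sketched as a breadth-first expansion over two lookup tables. The tables, the `max_hops` limit, and the IDs below are all hypothetical; a real system would query its own stores instead of in-memory dictionaries.

```python
from collections import deque


def expand(start_trace, biz_by_trace, traces_by_biz, max_hops=2):
    """Breadth-first walk of the TraceId -> business ID -> TraceId network.

    One hop goes from a trace to its business IDs and on to the other
    traces that share them. Returns (trace IDs seen, business IDs seen).
    """
    seen_traces = {start_trace}
    seen_biz = set()
    queue = deque([(start_trace, 0)])
    while queue:
        trace_id, hops = queue.popleft()
        if hops >= max_hops:
            continue  # do not expand traces beyond the hop limit
        for biz in biz_by_trace.get(trace_id, ()):
            seen_biz.add(biz)
            for other in traces_by_biz.get(biz, ()):
                if other not in seen_traces:
                    seen_traces.add(other)
                    queue.append((other, hops + 1))
    return seen_traces, seen_biz


# Hypothetical bindings: t1 and t2 share an item, t2 and t3 share a user.
biz_by_trace = {
    "t1": {"order:1", "item:9"},
    "t2": {"item:9", "user:7"},
    "t3": {"user:7"},
}
traces_by_biz = {
    "order:1": {"t1"},
    "item:9": {"t1", "t2"},
    "user:7": {"t2", "t3"},
}

one_hop = expand("t1", biz_by_trace, traces_by_biz, max_hops=1)
two_hops = expand("t1", biz_by_trace, traces_by_biz, max_hops=2)
print(one_hop, two_hops)
```

Bounding the number of hops keeps the search focused, since popular business IDs can connect a very large number of call chains.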

Three-dimensional monitoring system built with EDAS + ARMS

A three-dimensional monitoring system can currently be built by combining EDAS with ARMS, both provided by Aliyun. EDAS works at the application management and control level, covering links and applications, while ARMS focuses on the business operation level, such as e-commerce transactions, Internet of Vehicles, and retail. Monitoring needs all-round attention to business, links, applications, and systems; ARMS and EDAS complement each other to form a three-dimensional monitoring system.

[News flash] New EDAS online method tracing feature covers the "last kilometer" of application diagnosis

EDAS method tracing helps users quickly troubleshoot problems while an application is running.

Typical scenarios

1. While the application is running, a piece of business logic suddenly takes a long time to execute. You want a way to measure the time spent in each part of the running code and find where the time goes.

2. The application runs fine and the business is smooth in most cases, but in one particular case, when the XXX parameter is passed, the response is very slow. You want a way to observe code execution for specific method input parameters.

3. A method contains complex business logic, and at run time it is impossible to tell which logic is invoked and in what order. You want a way to show the specific logic and sequence of the method's execution in detail.

In all of these scenarios, the tracing should be non-intrusive to the code and should locate the problem while the application is running, without downtime.

EDAS method tracing uses JVM bytecode enhancement to add timing and call-sequence recording to every method called by the selected method, so that the actual execution sequence can be observed as it runs.
