This article is excerpted from the book A Different Double 11 Technology: Cloud Native Practice in the Alibaba Economy.
Authors:
Zhou Xiaofan, Senior Technical Expert, Aliyun Middleware Technology Department
Wang Huafeng (Shuifeng), Technical Expert, Aliyun Middleware Technology Department
Xu Tong (Shaokuan), Technical Expert, Aliyun Middleware Technology Department
Xia Ming (Yahai), Technical Expert, Aliyun Middleware Technology Department
Introduction: As a team that has worked on distributed tracing (Tracing) and application performance management (APM) services for many years, the engineers of Alibaba's middleware Hawkeye team have witnessed many upgrades of Alibaba's infrastructure, each of which posed great challenges to system observability. What new challenges does this "cloud native" upgrade bring us?
Cloud native and observability
During Double 11 in 2019 we once again witnessed a technological miracle: this time, we spent a whole year moving Alibaba's core e-commerce business onto the cloud and used Aliyun's technical infrastructure to withstand a peak of 540,000 transactions per second at midnight. Our R&D and operations model has also officially entered the fully cloud-based era.
The new paradigm advocated by cloud native has a great impact on the traditional R&D and operations model: microservices and DevOps make development more efficient, but a massive number of microservices makes troubleshooting and fault localization much harder; containerization and the maturing of orchestration technologies such as Kubernetes make delivering software at scale easy, but the challenge becomes how to evaluate capacity and schedule resources more accurately to strike the best balance between cost and stability.
The new technologies Alibaba explored this year, such as Serverless and Service Mesh, will in the future completely take over the operation of middleware and the IaaS layer from users, which poses an even greater challenge to infrastructure automation.
Infrastructure automation is a prerequisite for fully realizing the dividends of cloud native, and observability is the cornerstone of all automated decisions.
If the execution efficiency and success or failure of every interface can be accurately counted, the context of every user request can be completely traced, and the dependencies between applications and underlying resources can be sorted out automatically, then based on this information we can automatically determine the cause of a business anomaly, decide whether the underlying resources affecting the business should be migrated, scaled out, or taken offline, and, given the expected Double 11 peak, calculate whether the resources prepared for each application are sufficient without being wasteful.
Observability ≠ monitoring
Many people ask whether "observability" is just another way of saying "monitoring". In fact, the industry's definitions of the two differ considerably.
Monitoring focuses more on problem discovery and early warning, while the ultimate goal of observability is to give a reasonable explanation of everything that happens in a complex distributed system. Monitoring is concerned mainly with the software delivery process and what comes after delivery (Day 1 & Day 2), what we often call "during and after the event", whereas observability covers the entire R&D and operations life cycle.
Returning to observability itself, it is still made up of the familiar trio of Tracing, Metrics, and Logging. But how do these three integrate with cloud infrastructure? How can they be better correlated and combined with each other? And how can they better connect with online business in the cloud era? These are the directions our team has been exploring over the past year or two.
What did we do this year?
For this year's Double 11, Hawkeye engineers explored four new technical directions, providing a strong guarantee for moving the group's business fully onto the cloud, automating Double 11 preparation, and ensuring overall stability:
Scenario-oriented business observability
As Alibaba's e-commerce formats have become more complex and diverse, preparation for the big promotion has also become more refined and scenario-oriented.
In the past, the owner of each microservice system fought alone, based on the state of that system and its immediate upstream and downstream. Although this divide-and-conquer approach is efficient, omissions are inevitable, and the root cause is the mismatch between middle-platform applications and actual business scenarios. Take the trading system as an example: a single trading system simultaneously carries many kinds of business, such as Tmall, Hema, Damai, and Fliggy, and each business has a different expected transaction volume and different downstream dependency paths. It is difficult for the owner of the trading system alone to sort out how the detailed upstream and downstream logic of every business affects their own system.
This year the Hawkeye team launched a scenario-based link capability that combines a business metadata dictionary with non-intrusive automatic tagging to color traffic, map real traffic onto business scenarios, and break down the data silos between the business and downstream middleware. This shifts the previous application-centric view to a business-scenario-centric view that is much closer to the real model of the big promotion.
As shown in the figure above, consider a product query example. Four systems A, B, C, and D provide the query capabilities for "product details", "product category", "price details", and "discount details" respectively, and the entry application A exposes a product query interface S1. Through Hawkeye we can quickly see that applications B, C, and D are dependencies of application A and are also downstream of interface S1. For system stability governance, such link data is sufficient.
But this view by itself does not provide business observability, because this dependency structure actually contains two business scenarios whose links are completely different: the link for category A products is A -> B -> C -> D, while the link for category B products is A -> B -> C. Suppose the ratio of these two product categories is 1:1 on ordinary days but 1:9 during the big promotion; then combing the links from either a purely system view or a purely business view cannot yield a reasonable traffic forecasting model.
Therefore, if we color the two kinds of traffic by tagging them at the system layer, we can easily sort out the links corresponding to the two business scenarios. Such a finer-grained view is particularly important for guaranteeing business stability and for configuring rate limiting and degradation strategies on dependencies more sensibly.
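The article does not describe how the coloring is implemented, so the following is only a minimal sketch of the general idea rather than Hawkeye's actual mechanism: using the open source OpenTracing baggage API, a business-scenario tag attached at the entry application is propagated to every downstream span in the same trace. The scenario names and the resolveScenario helper are assumptions made up for the example.

```java
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

public class ScenarioColoring {

    // Hypothetical helper: decide the business scenario from the request itself
    // (e.g. the product category being queried at entry application A).
    static String resolveScenario(String productCategory) {
        return "A".equals(productCategory) ? "category-A" : "category-B";
    }

    // Called at the entry interface S1: color the whole trace with the scenario.
    public static void tagEntryRequest(String productCategory) {
        Tracer tracer = GlobalTracer.get();
        Span span = tracer.activeSpan();
        if (span != null) {
            String scenario = resolveScenario(productCategory);
            // Baggage is propagated to every downstream span in the same trace,
            // so category-A and category-B links can be separated later.
            span.setBaggageItem("biz.scenario", scenario);
            // Also tag the local span so it can be filtered in the tracing backend.
            span.setTag("biz.scenario", scenario);
        }
    }

    // Called anywhere downstream (applications B, C, D): read the color back.
    public static String currentScenario() {
        Span span = GlobalTracer.get().activeSpan();
        return span == null ? "unknown" : span.getBaggageItem("biz.scenario");
    }
}
```

With such a tag in place, traces can be grouped by scenario, which is the basic ingredient for the scenario-level traffic models described above.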
This scenario capability proved very valuable in this year's Double 11 preparation: many business systems combed out their core business links based on it and could prepare more calmly and without omission. At the same time, a series of service governance tools powered by Hawkeye received a comprehensive scenario upgrade, such as scenario-based traffic recording and playback, scenario-based fault drills, and scenario-based precise regression testing. With these governance tools better aligned with business scenarios, the observable granularity of the entire Double 11 preparation entered the "high-definition era".
Intelligent root cause localization based on observability data
In the cloud native era, with the introduction of microservices and the growth of business scale, the number of application instances keeps increasing and the dependencies of core business keep becoming more complex. On the one hand we enjoy the dividend of an exponential improvement in development efficiency; on the other we suffer from the high cost of fault localization. Especially when something goes wrong with the business, quickly finding the problem and stopping the bleeding becomes very difficult. For the Hawkeye team, the "guardian" of application performance within the group, helping users quickly complete fault localization became a new challenge this year.
To locate a fault, you must first answer: what counts as a fault? Behind this question, operations personnel need a deep understanding of the business. Many operators like to exhaustively cover every observability metric and configure every kind of alarm, which seems to provide a "sense of security". In reality, when a fault arrives, a screen full of abnormal metrics and an ever-growing pile of alarm messages make this kind of "observability" look powerful while actually being counterproductive.
The team carefully combed through the group's faults over the years and found that core applications usually experience four types of faults (excluding business logic problems): resource, traffic, latency, and error.
Breaking these down further (a rough classification sketch follows the list):
- Resource: for example CPU, load, memory, thread count, and connection pools;
- Traffic: business traffic dropping to zero, or rising or falling abruptly; middleware traffic (such as message traffic) dropping to zero;
- Latency: the latency of services the system provides, or of services it depends on, suddenly spiking, which is almost always a sign that the system is in trouble;
- Error: the total number of errors returned by services the system provides, or the success rate of services it depends on.
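As a loose illustration of this four-way classification (the thresholds, field names, and the FaultClassifier class below are invented for the example and are not Hawkeye's real rules), a first-cut classifier might look like this:

```java
import java.util.ArrayList;
import java.util.List;

public class FaultClassifier {

    enum FaultClass { RESOURCE, TRAFFIC, LATENCY, ERROR }

    // Hypothetical snapshot of the metrics mentioned above; field names are assumptions.
    static class MetricsSnapshot {
        double cpuUsage;        // 0.0 - 1.0
        double load;            // system load average
        double qps;             // current request rate
        double baselineQps;     // expected rate for this time of day
        double p99LatencyMs;    // 99th percentile latency
        double baselineP99Ms;
        double errorRate;       // 0.0 - 1.0
    }

    // Map a snapshot to the fault classes it matches; one incident can match several.
    static List<FaultClass> classify(MetricsSnapshot m) {
        List<FaultClass> result = new ArrayList<>();
        if (m.cpuUsage > 0.9 || m.load > 10) {
            result.add(FaultClass.RESOURCE);   // resource exhaustion
        }
        if (m.qps == 0 || m.qps > 3 * m.baselineQps || m.qps < 0.3 * m.baselineQps) {
            result.add(FaultClass.TRAFFIC);    // traffic drops to zero or swings sharply
        }
        if (m.p99LatencyMs > 2 * m.baselineP99Ms) {
            result.add(FaultClass.LATENCY);    // latency suddenly spikes
        }
        if (m.errorRate > 0.01) {
            result.add(FaultClass.ERROR);      // error count / success rate degrades
        }
        return result;
    }
}
```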
With this fault classification as a handle, what we need to do is "follow the vine to find the melon". Unfortunately, as the business grows more complex, the "vine" keeps getting longer. Take a sudden latency increase as an example: there are many possible root causes. An upstream promotion may have caused a surge in requests, frequent GC in the application itself may have slowed everything down, an overloaded downstream database may be responding slowly, and there are countless other possibilities.
In the past, Hawkeye only provided the metrics themselves. Operators had to inspect call chains one at a time, scrolling through several screens just to read one complete trace, not to mention switching back and forth between multiple systems to troubleshoot a problem.
The essence of fault localization is a process of repeatedly investigating, ruling out, and investigating again, of "eliminating all the impossibilities so that whatever remains must be the truth". Think about it: an enumerable set of possibilities plus an iterative loop is exactly what computers are good at. The intelligent fault localization project was born against this background.
When it comes to "intelligence", many people's first reaction is to think of algorithms and to over-mystify them. In fact, anyone familiar with machine learning knows that data quality comes first, the model second, and the algorithm last. The reliability and completeness of data collection and the modeling of the domain are the core competitiveness; only when data collection is accurate does intelligence become possible.
The evolution of intelligent fault localization followed exactly this line of thinking, but before anything else we had to guarantee data quality. Thanks to the Hawkeye team's years of deep work on big data processing, data reliability has been maintained to a very high standard; otherwise, whenever a fault occurred we would first have to doubt our own metrics.
Next come data completeness and the construction of the diagnosis model, which together form the cornerstone of intelligent diagnosis and determine the ceiling of fault localization; the two also complement each other. Building the diagnosis model reveals gaps in the observability metrics so they can be filled, and filling those metrics in turn deepens the diagnosis model.
The model is improved mainly through a combination of three approaches:
- Historical fault replay: a historical fault is like an exam paper whose standard answer is already known. An initial diagnosis model is built from part of the historical faults plus human experience and then iteratively refined against the remaining ones; the model that comes out of this step alone, however, is prone to overfitting.
- Chaos engineering: common anomalies are simulated to continuously correct the model.
- Online manual labeling: used to keep completing the observability metrics and revising the diagnosis model.
After these three stages, the cornerstone is basically in place. The next step is efficiency: the model iterated out of the steps above is not the most efficient, because human experience and reasoning are linear. The team therefore did two things on top of the existing model: edge diagnosis and intelligent pruning. Part of the localization process is pushed down to each agent node, which automatically preserves key information about the incident scene and reports key events for phenomena that may affect the system; the diagnosis system then automatically adjusts the localization path according to the weight of each event.
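Edge diagnosis and intelligent pruning are only described at a high level here; the sketch below is one simplified interpretation, in which events reported by edge agents carry weights and the diagnosis system examines the highest-weight suspects first within a fixed budget. All class, field, and event names are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class WeightedDiagnosis {

    // A key event reported by an edge agent (e.g. "frequent GC", "DB slow query").
    static class DiagnosticEvent {
        final String node;         // application / machine that reported it
        final String description;
        final double weight;       // how strongly it correlates with the current symptom

        DiagnosticEvent(String node, String description, double weight) {
            this.node = node;
            this.description = description;
            this.weight = weight;
        }
    }

    // Explore candidate events in descending weight order and stop at a budget,
    // instead of walking the whole dependency tree linearly.
    static List<DiagnosticEvent> prunedPath(List<DiagnosticEvent> reported, int budget) {
        PriorityQueue<DiagnosticEvent> queue = new PriorityQueue<>(
                Comparator.comparingDouble((DiagnosticEvent e) -> e.weight).reversed());
        queue.addAll(reported);

        List<DiagnosticEvent> path = new ArrayList<>();
        while (!queue.isEmpty() && path.size() < budget) {
            path.add(queue.poll());   // highest-weight suspect first
        }
        return path;
    }

    public static void main(String[] args) {
        List<DiagnosticEvent> events = List.of(
                new DiagnosticEvent("db-order", "slow query spike", 0.9),
                new DiagnosticEvent("app-trade", "frequent full GC", 0.8),
                new DiagnosticEvent("app-item", "thread pool near limit", 0.4));
        prunedPath(events, 2).forEach(e ->
                System.out.println(e.node + ": " + e.description + " (weight " + e.weight + ")"));
    }
}
```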
Since going online, intelligent root cause localization has helped thousands of applications locate fault root causes, with high user satisfaction. With root cause conclusions built on the cornerstone of observability, the automation capability of the infrastructure will improve greatly. During this year's Double 11 preparation, this rapid fault localization capability gave application stability owners a more automated means of response. We also believe that in the cloud native era, the dynamic balance of quality, cost, and efficiency that enterprise applications pursue is no longer out of reach.
"Last kilometer" problem localization
What is "last kilometer" problem localization? What are its characteristics? And why the "last kilometer" rather than the "last hundred meters" or the "last meter"?
First, let's align on the concept. What is the "last kilometer"? In daily life it has the following characteristics:
- It is a bit too far to walk, yet too close to take a bus; the distance is neither long nor short.
- The road conditions of the last kilometer are complex: it may be a wide avenue, a rugged path, or even a maze-like indoor route (delivery riders know this best).
So what is the "last kilometer" in the field of distributed problem diagnosis, and what are its characteristics?
- In the diagnosis process, we are already not far from the root cause: the problem has basically been narrowed down to a specific application, service, or node, but the exact abnormal code fragment has not yet been determined.
- The data that can pin down the root cause comes in many forms: it may be a memory usage analysis, a CPU usage analysis, a specific business log or error code, or even just the symptom itself combined with diagnostic experience to reach a quick conclusion.
With the analysis above we now share some consensus on the concept of the last kilometer. Next, we explain in detail how last-kilometer problem localization is achieved.
First, we need a way to accurately reach the starting point of the last kilometer, that is, the application, service, or machine node where the root cause lies. This avoids wasted analysis in the wrong place, much like a delivery sent to the wrong address. So how do we accurately delimit the root cause range in a complex link? Here we rely on the distributed tracing (Tracing) capability that is common in the APM field: tracing can accurately identify the abnormal application, service, or machine and point the direction for last-kilometer localization.
Then, by associating more detailed information with the trace data, such as local method stacks, business logs, machine state, and SQL parameters, we achieve last-kilometer problem localization, as shown below (a code sketch follows the list):
- Core interface instrumentation: by instrumenting before and after interface execution, record the basic trace information, including TraceId, RpcId (SpanId), timing, status, IP, interface name, and so on. This information is enough to restore the basic shape of the link.
- Automatically associated data: information that can be recorded automatically during the call life cycle, including SQL statements, request input and output parameters, exception stacks, and so on. This information does not affect the shape of the link, but in some scenarios it is a necessary condition for accurately locating the problem.
- Actively associated data: data that must be recorded actively during the call life cycle, usually business data such as business logs and business identifiers. Because business data is highly specific, it cannot be configured uniformly; but once it is actively associated with the trace data, the efficiency of diagnosing business problems improves greatly.
- Local method stacks: for performance and cost reasons, trace instrumentation cannot be added to every method. In such cases, method sampling or on-the-fly instrumentation can be used to accurately locate slow local methods.
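To make the list above concrete, here is a minimal hand-written sketch (not Hawkeye's agent, which instruments automatically) showing the same kinds of data being attached to one span with the open source OpenTracing API; the method, tag, and parameter names are illustrative only.

```java
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.tag.Tags;
import io.opentracing.util.GlobalTracer;

public class LastMileInstrumentation {

    // Wrap one interface call with the kinds of data listed above.
    public static String queryItem(String itemId, String userId) {
        Tracer tracer = GlobalTracer.get();
        Span span = tracer.buildSpan("ItemService.queryItem").start();   // core interface instrumentation
        try {
            // Actively associated business data: business identifiers such as the user id.
            span.setTag("biz.user_id", userId);

            String sql = "SELECT * FROM item WHERE id = ?";
            // Automatically associated data: the SQL statement and input parameters.
            Tags.DB_STATEMENT.set(span, sql);
            span.setTag("db.param.id", itemId);

            return executeQuery(sql, itemId);
        } catch (RuntimeException e) {
            // Automatically associated data: error flag and exception summary for this span.
            Tags.ERROR.set(span, true);
            span.log(e.toString());
            throw e;
        } finally {
            span.finish();
        }
    }

    private static String executeQuery(String sql, String itemId) {
        return "item-" + itemId;   // stand-in for the real data access layer
    }
}
```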
With last-kilometer problem localization, we can probe for hidden system risks both in daily operation and during the promotion itself, and quickly locate root causes. Here are two practical examples:
One application had occasional RPC call timeouts at traffic peaks. By analyzing the automatically recorded snapshots of local method stacks, we found that the time was actually being spent in log output statements, because Logback versions below 1.2.x are prone to "hot lock" contention under highly concurrent synchronous logging. Upgrading the version or switching to asynchronous log output solved the problem completely.
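The article does not show the asynchronous configuration. Assuming Logback with SLF4J on the classpath, one possible mitigation is to wrap the synchronous file appender in Logback's AsyncAppender, sketched programmatically below (file name, queue size, and pattern are arbitrary); the equivalent change can also be made in logback.xml. The point is that business threads then only enqueue log events instead of contending on the appender lock.

```java
import ch.qos.logback.classic.AsyncAppender;
import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.LoggerContext;
import ch.qos.logback.classic.encoder.PatternLayoutEncoder;
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.FileAppender;
import org.slf4j.LoggerFactory;

public class AsyncLogSetup {
    public static void main(String[] args) {
        LoggerContext ctx = (LoggerContext) LoggerFactory.getILoggerFactory();

        PatternLayoutEncoder encoder = new PatternLayoutEncoder();
        encoder.setContext(ctx);
        encoder.setPattern("%d %level %logger - %msg%n");
        encoder.start();

        FileAppender<ILoggingEvent> file = new FileAppender<>();
        file.setContext(ctx);
        file.setFile("app.log");
        file.setEncoder(encoder);
        file.start();

        // Wrap the blocking file appender so callers only enqueue events.
        AsyncAppender async = new AsyncAppender();
        async.setContext(ctx);
        async.setQueueSize(8192);          // buffer for pending events
        async.setDiscardingThreshold(0);   // 0 = do not drop events under pressure
        async.addAppender(file);
        async.start();

        Logger root = ctx.getLogger(org.slf4j.Logger.ROOT_LOGGER_NAME);
        root.addAppender(async);
    }
}
```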
When a user reported an order exception, the business engineer first retrieved the business log at the order entry point using the user's UserId, then used the trace identifier TraceId found in that log to line up all the dependent business processes, states, and events in their actual call order, and quickly located the cause of the order exception (the UserId cannot be automatically propagated to every downstream link, but the TraceId can).
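A rough sketch of that workflow is shown below; the log format ("userId=... traceId=..."), the SpanRecord shape, and the idea of fetching spans from a trace store are all assumptions made for the example, not actual Hawkeye interfaces.

```java
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class OrderDiagnosis {

    // One span of a distributed trace, as returned by some trace store (hypothetical shape).
    record SpanRecord(String traceId, String rpcId, long startMillis, String service, String status) {}

    // Step 1: find the TraceId in the entry application's business log for this user.
    // Assumes the log line embeds the trace id as "traceId=<id>".
    static String extractTraceId(List<String> entryLogLines, String userId) {
        Pattern p = Pattern.compile("traceId=(\\S+)");
        return entryLogLines.stream()
                .filter(line -> line.contains("userId=" + userId))
                .map(p::matcher)
                .filter(Matcher::find)
                .map(m -> m.group(1))
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no trace found for user " + userId));
    }

    // Step 2: pull every span of that trace and order it by actual call time,
    // reconstructing the downstream business flow the order went through.
    static List<SpanRecord> orderedCallChain(List<SpanRecord> spansOfTrace) {
        return spansOfTrace.stream()
                .sorted(Comparator.comparingLong(SpanRecord::startMillis))
                .collect(Collectors.toList());
    }
}
```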
Monitoring alarms often reflect only the surface of a problem; in the end, the root cause has to be traced deep into the source code. This year Hawkeye made a major breakthrough in the "fine-grained sampling" of diagnostic data, greatly improving the precision and value of the data needed for last-kilometer localization without inflating cost. Throughout the long Double 11 preparation period, it helped users eliminate one system risk after another, ensuring the "silky smoothness" of the day itself.
Fully embracing cloud native open source technology
Over the past year, the Hawkeye team has embraced open source and integrated with the industry's mainstream observability technology frameworks. We released the Tracing Analysis service on Aliyun, which is compatible with mainstream open source tracing frameworks such as Jaeger (OpenTracing), Zipkin, and SkyWalking. Applications already using these frameworks do not need to modify a single line of code; they only need to change the data reporting address in their configuration to obtain link analysis capabilities far beyond open source tracing products, at a cost much lower than a self-built open source setup.
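As an illustration of "only the reporting address changes", the sketch below builds a tracer with the open source Jaeger Java client; the collector endpoint is a placeholder rather than a real Tracing Analysis address, and the service name is made up.

```java
import io.jaegertracing.Configuration;
import io.jaegertracing.Configuration.ReporterConfiguration;
import io.jaegertracing.Configuration.SamplerConfiguration;
import io.jaegertracing.Configuration.SenderConfiguration;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

public class TracerBootstrap {

    public static Tracer buildTracer() {
        // Sample every request; tune for production volumes.
        SamplerConfiguration sampler = new SamplerConfiguration()
                .withType("const")
                .withParam(1);

        // The only change versus a self-hosted Jaeger setup: the collector endpoint.
        // "<tracing-endpoint>" is a placeholder for whatever address the managed
        // service assigns; the application code itself stays untouched.
        SenderConfiguration sender = new SenderConfiguration()
                .withEndpoint("https://<tracing-endpoint>/api/traces");

        ReporterConfiguration reporter = new ReporterConfiguration()
                .withSender(sender)
                .withLogSpans(true);

        return new Configuration("my-service")
                .withSampler(sampler)
                .withReporter(reporter)
                .getTracer();
    }

    public static void main(String[] args) {
        GlobalTracer.registerIfAbsent(buildTracer());
    }
}
```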
The Hawkeye team also released a fully managed Prometheus service, which addresses the open source version's heavy deployment resource usage and its write performance problems when too many nodes are monitored, and optimizes slow queries over long time ranges and many dimensions. The optimized managed Prometheus clusters fully support Service Mesh monitoring inside Alibaba as well as several major Aliyun customers, and we have contributed many of these optimizations back to the community. The managed version is likewise fully compatible with the open source version and can be migrated to with one click on Aliyun's container service.
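Compatibility with open source Prometheus means that standard client-side instrumentation keeps working unchanged; the minimal sketch below uses the official Prometheus Java simpleclient (the metric names and port are made up), and the resulting /metrics endpoint can be scraped by a self-hosted or a managed Prometheus alike.

```java
import io.prometheus.client.Counter;
import io.prometheus.client.Histogram;
import io.prometheus.client.exporter.HTTPServer;
import io.prometheus.client.hotspot.DefaultExports;

public class MetricsExporter {

    // Example application metrics; names and labels are arbitrary for the sketch.
    static final Counter ORDERS = Counter.build()
            .name("orders_total")
            .help("Total number of orders processed.")
            .labelNames("status")
            .register();

    static final Histogram LATENCY = Histogram.build()
            .name("order_latency_seconds")
            .help("Order processing latency.")
            .register();

    public static void main(String[] args) throws Exception {
        DefaultExports.initialize();              // JVM metrics (GC, memory, threads)
        HTTPServer server = new HTTPServer(9090); // expose /metrics for scraping

        // Simulated business work being measured.
        Histogram.Timer timer = LATENCY.startTimer();
        try {
            ORDERS.labels("success").inc();
        } finally {
            timer.observeDuration();
        }
    }
}
```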
Observability and stability are inseparable. This year, Hawkeye engineers collected the articles and tools on observability and stability engineering accumulated over the years and published them on GitHub. Everyone is welcome to join in and contribute.
Highlights of this book
- A detailed account of the problems encountered, and their solutions, in operating the ultra-large-scale Kubernetes clusters behind Double 11.
- The best combination for going cloud native: Kubernetes + containers + Shenlong (X-Dragon) servers, and the technical details of moving 100% of the core systems onto the cloud.
- The ultra-large-scale Service Mesh deployment solution for Double 11.
"Alibaba Cloud Native focus on micro-services, Serverless, containers, Service Mesh and other technology areas, focus on cloud native popular technology trends, cloud native large-scale landing practice, to be the best understanding of cloud native developers of the technology circle."