Gao Chitao, Chief Architect of Cloud Intelligence
According to Cloud Intelligence's statistics, the APM performance data collected from clients can amount to roughly 50% of an enterprise's business data. It is not easy for an enterprise to accurately collect all of the data involved in the full link from request to response, and to stitch it together effectively to achieve true end-to-end visibility.
So how does Cloud Intelligence sample APM data while still satisfying users' need for high-performance analysis of business data in "end-to-end" application performance management? At the APM session of the Global Operation and Maintenance Conference in September 2016, Gao Chitao, Chief Architect of Cloud Intelligence, revealed the big data story behind APM.
Neeke Gao (Gao Chitao), Chief Architect of Cloud Intelligence, is a member of the PHP/PECL development team and the author of PECL/SeasLog, PECL/JsonNet, GoCrab, and other projects. He has more than 10 years of R&D and management experience: he built large-scale enterprise information architecture early in his career, moved into Internet digital marketing in 2009, and has researched architecture and performance optimization in depth. He joined Cloud Intelligence in 2014 and is dedicated to APM product architecture and development, advocating agility, efficiency, and Getting Real.
The following is Gao Chitao's talk:
Today is the APM track, so I believe everyone already has some understanding of APM. I will share our work on APM data sampling and end-to-end tracing, which is also the practical result of Cloud Intelligence serving customers and solving their needs over the past few years.
APM and Big Data
A very obvious characteristic of APM is that the amount of data that can be collected is enormous, larger than you might imagine. Look at the data center above: who can say exactly how much data flows through it every day, and this is just a few cabinets. We have run statistics on customer data: on the Internet, the data APM collects from clients can account for more than 50% of an enterprise's business data. That means if the collected data is very detailed, it may well be larger than the original business data. If business data requires 2T of bandwidth, supporting APM would require another 2T; if 300 servers support the business, at least 150 additional servers are needed to support APM. This is a big challenge in data processing. For most enterprises APM is not the core business, yet it consumes a great deal of computing and storage resources. That is what happens when the data is not sampled.
What is APM (Application Performance Management)? Literally, "application + performance + management". The first two speakers talked about the scope of APM, and the core of their talks was application performance: the focus is not the business but the performance. The word that follows in APM is management, which means understanding this performance data from a business perspective, for example how many users are affected by a crash or a freeze, and how much loss those affected users cause the enterprise. This is where APM's business value lies, and it is the direction we are working toward and practicing.
Why do we use APM? Since we have guests from Tencent today, take CF (CrossFire) on mobile as an example. A player has recently been getting killed because the application keeps stuttering; even after buying a good gun and good equipment, he still cannot beat other players, so he inevitably complains. After the complaint, customer service asks a long list of questions from the system's knowledge base and then promises the player that operations will inspect the system right away, which usually goes nowhere. In the process of serving users, business staff often lack a tool, or a platform, to discover user problems promptly and accurately, let alone locate the specific user, the specific SQL, and the specific critical code.
APM has two major advantages. One is improving work efficiency by reducing the time spent in unproductive communication with users. The other is discovering and accurately locating problems in time: for a business system running on the Internet, users are often the first to perceive a failure, and if you only start finding and solving the problem after receiving user feedback, the business loss caused by the failure grows greatly. A simple example: one Cloud Intelligence customer had a production failure that suspended service for two hours and caused losses in the tens of millions. The operations team's solution was very simple: cut off the service, start a new cluster, switch the business over, preserve the scene, and then spend a week discovering that it was actually a memory leak. It took them a week to find the problem. Later, with the help of Cloud Intelligence Insight, they reproduced the problem directly on the test system and accurately located the memory leak within 10 minutes. Using APM can effectively shorten the time to discover and solve a problem, and prevent similar problems from happening again.
Why is APM big data? We know that big data has four well-defined "V" characteristics:
Volume: the amount of data is large. One of our typical users generates more than 500G of data in the APM system every day.
Variety: there are many kinds of data. For example, there are more than 300 mobile APM indicators that we currently track, and even more dimensions.
Velocity: data is generated and consumed at a frightening speed.
Value: the value of a single data point is low; a large amount of data must be combined for multi-dimensional analysis before the state and trend of the data can be obtained.
This is typical of big data, and APM data fits right in.
The Gains and Losses of Apdex
How do we handle such a large amount of data? The most direct and effective way is sampling. Why sample? First, it effectively reduces the data volume. From the perspective of data value we would rather not miss a single record, but when the data pours in at that scale, it can take several days just to describe one day's worth of data, which means it can never be described accurately and in time.
How do you handle it, then? Look at this scatter plot of JMeter requests, densely marked with dots, one dot per request, continuously plotted on the canvas by time and response time. With a chart like this it is very hard to describe any single precise point; it just records events objectively, like a log, but it can describe neither the application as a whole nor what the application actually looks like.
From this scatter plot we can build a two-dimensional heat map: the same data now has area, height, and time, and these dimensions intersect to form a two-dimensional chart. On the right, several indicators drawn from a large volume of data in different dimensions are fused by the Apdex algorithm into a single Apdex value.
Apdex is an application performance index and a standard commonly followed in the APM field. The algorithm is not limited to application performance; it can be used whenever we want to describe a large amount of data with a single indicator. Let us first look at why Apdex is used. The figure on the left is a Gaussian distribution, that is, a normal distribution. You can plot the scattered points of an indicator to form such a curve: the higher the peak of the curve, the worse the performance; the flatter it is, the better. However, this description has an obvious drawback: it tends to ignore the extremes. The extremes of this graph are the fastest and slowest responses, while the Gaussian distribution focuses on the middle. If we treat the non-middle data as anomalies, this description in effect discards the very good and very bad states and keeps only the middle.
Apdex is an improvement on that Gaussian view. The bar here is a scale whose top is 1.00, and T is the Apdex target threshold; Apdex describes an indicator on a scale from 0 to 1. For example, take an application's response times over one day, suppose there are forty requests in total, and set the threshold T to two seconds. Requests faster than T score one: the ten requests from zero to two seconds each count as one. The twenty requests from two seconds to eight seconds (between T and 4T) each count as 0.5. The remaining ten requests slower than eight seconds count as zero. The Apdex score is (1×10 + 0.5×20 + 0×10) divided by 40, which is 0.5. So the response-time indicator describing the application that day is 0.5, which sits right at the edge of acceptable on this scale; anything below 0.5 is completely unacceptable and probably indicates a fault. This is the Apdex algorithm: a single value that describes the overall state of a large number of samples from an application over a period of time.
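As a concrete illustration, here is a minimal sketch of that calculation in Python. The two-second threshold and the 10/20/10 split come from the example above; the function is simply the standard Apdex formula, not Cloud Intelligence's implementation.

```python
# Standard Apdex: satisfied (<= T) weighs 1, tolerating (T..4T) weighs 0.5,
# frustrated (> 4T) weighs 0. Numbers follow the worked example above.
def apdex(response_times_s, t=2.0):
    satisfied = sum(1 for rt in response_times_s if rt <= t)
    tolerating = sum(1 for rt in response_times_s if t < rt <= 4 * t)
    return (satisfied + 0.5 * tolerating) / len(response_times_s)

samples = [1.0] * 10 + [5.0] * 20 + [10.0] * 10   # 10 fast, 20 tolerable, 10 slow
print(apdex(samples))                              # 0.5
```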
What is wrong with Apdex? Take blood pressure as an example, with a systolic reading of 120 as the benchmark and 40 people measured. Like the example just now, ten are excellent, twenty are moderate, and ten have blood pressure that is far too high. Describing the health of these 40 people yields a seemingly passable score of 0.5. Here lies a very frightening problem: the score may be fine for describing the crowd as a whole, but the tail of the data beyond four times the threshold is simply ignored. In other words, the ten people who are about to collapse are ignored entirely. That is the biggest problem with Apdex. Another problem is the loss of raw data and end-to-end data: because Apdex saves a great deal of storage by computing the index directly as the data flows through, not only is the raw data lost, the end-to-end data is discarded as well.
Here is another, more direct example. Suppose the application's database connection pool has a problem. Every request the application receives may quickly throw an exception once it determines the connection pool is broken, and the front end responds with a static page. The whole application then responds very quickly and the Apdex value looks ideal, yet the application is actually performing terribly, because every normal service has been interrupted.
True end-to-end and APM sampling
True end-to-end means data that can connect each request from the client to the whole link behind it: the network, the database, the physical layer, external services, and file operations. The data must not exist as isolated islands; only when it can be chained together through an ID or along the time dimension do you have real end-to-end.
The middle layer of this diagram is the end-to-end link. Implementing end-to-end means collecting data at every point across all of these services and components, while passing a unique identifier along with the data collected on each of them. Then, while analyzing client-side user behavior, a single client API call lets you trace directly to the SQL and code stack executed by the corresponding back-end API, as well as the CPU, memory, network, I/O, and other system states of the server at that same moment. The biggest problem is sampling: we want to use Apdex and achieve end-to-end at the same time, which is actually a contradiction. We need to describe the state of the application accurately, keep the description cheap, and not lose a single piece of data. That is a very big challenge.
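To make "a unique identifier is passed among the data collected on each service component" concrete, here is a hedged sketch of trace-ID propagation. The header name X-Trace-Id, the URL, and the use of the requests library are illustrative assumptions, not Cloud Intelligence's actual agent behavior.

```python
import uuid
import requests  # third-party HTTP client, assumed available

def handle_client_request(incoming_headers):
    # Reuse the trace ID if an upstream tier already assigned one;
    # otherwise start a new trace at this entry point.
    trace_id = incoming_headers.get("X-Trace-Id", str(uuid.uuid4()))

    # Every downstream call carries the same ID, so the client action, the
    # back-end API, the SQL it triggers, and the host metrics sampled at that
    # moment can all be joined on trace_id later.
    backend = requests.get(
        "http://backend.example.com/api/article",   # placeholder URL
        headers={"X-Trace-Id": trace_id},
        timeout=2,
    )
    return {"trace_id": trace_id, "status": backend.status_code}
```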
There are many ways to do this. This graph shows real machine load data from testing the solution for a customer: TPS dropped by only 4%, and the extra CPU utilization stayed below 5%. How was this achieved? We do a lot of work at the points where data is transmitted and collected. For example, if there is a problem with the system or with an interface, where is the problem likely to be? R&D and operations experience says it is very likely an operation or a network request; another possibility is memory or CPU resources. Knowing this, we can collect data in a targeted way instead of collecting everything. Covering an application with millions of page views per day without losing data is still another big challenge.
This is the end-to-end data collection schematic of Cloud Intelligence. Our goal is full collection, and we need to pay attention to various response thresholds: thresholds on response time, thresholds on CPU and memory, and errors and exceptions. Why errors and exceptions? Because Apdex, in its usual form, only scores the response-time indicator against a prescribed description.
For example, suppose you access a news item through an interface or a page: you make a request and get an article back. If the response time is within 100 milliseconds, that is very good. It is quite possible that in those 100 milliseconds the request connects once and then writes a cache or records a click before returning; that is normal business. It is also quite possible that the connection fails, an error or exception is thrown, and the response time is 90 milliseconds. Can we say the 90-millisecond failed request is better than the 100-millisecond successful one? So measuring performance by response time alone is problematic. We should watch the error indicators alongside the response-time indicators; abnormal indicators matter more than normal ones, and we must take them into account when computing Apdex.
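One simple way to take that into account, sketched below under my own assumptions rather than as Cloud Intelligence's formula, is to score any errored request as frustrated regardless of how quickly it failed.

```python
# Error-aware variant of the Apdex scoring above: a 90 ms failure is not
# "better" than a 100 ms success, so errors always score 0.
def score_request(response_time_ms, had_error, t_ms=100.0):
    if had_error:
        return 0.0                        # failed fast is still failed
    if response_time_ms <= t_ms:
        return 1.0                        # satisfied
    if response_time_ms <= 4 * t_ms:
        return 0.5                        # tolerating
    return 0.0                            # frustrated

requests_seen = [(100.0, False), (90.0, True)]
print(sum(score_request(rt, err) for rt, err in requests_seen) / len(requests_seen))  # 0.5
```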
Finally, an APM application example. This is a before-and-after comparison from Monitoring Treasure. The upper right corner shows the response-time ratio, and below it is the access time. You can see that the ×× part in the upper right corner is slow responses; in fact the application has many problems, with the slow proportion greater than 90%, and this is the errors-and-exceptions indicator. The optimized data is shown in green, and the query response time drops noticeably. This is an indicator built from the intersection of response time and errors. Through transaction snapshots you can also view the code runtime stack, SQL, API requests, and request parameters of each specific request, and quickly inspect the details of any errors or exceptions.
Thank you!
Q: Is Apdex the standard in the APM industry, or is it a summary of Cloud Intelligence's own experience over the years?
Gao Chitao: Apdex came along with APM; it has existed since APM emerged, and it is a standard proposed by many professional analysts. The four-times threshold I just mentioned, where results between T and 4T count as 0.5 and results over 4T count as zero, is not a formal convention, but everyone has always done it this way, so it is an unwritten convention.
As I said just now, we pay attention to several kinds of samples: the response time and the response thresholds, including access time, which is an indicator we care about. When collecting, you can first look at connections: no matter how fast or slow a connection is, even if it produces no error, it must be collected, because it is an unknown and very critical operation. Key operations must always be collected. For normal operations, with no errors, no exceptions, and normal CPU and memory, if the response time is below the threshold, say under one millisecond, we discard that method call.
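A hedged sketch of that kind of collection rule follows. The field names and the one-millisecond threshold are illustrative, drawn from the description above rather than from Cloud Intelligence's agent.

```python
def should_collect(span):
    # Abnormal data is always collected.
    if span.get("error") or span.get("exception"):
        return True
    # Key operations (e.g. external connections, critical writes) are always collected.
    if span.get("is_key_operation"):
        return True
    # Resource thresholds exceeded: collect.
    if span.get("cpu_abnormal") or span.get("memory_abnormal"):
        return True
    # Healthy and faster than the response-time threshold: safe to discard.
    return span.get("response_time_ms", 0.0) >= 1.0
```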
Everything Cloud Intelligence designs requires that users not change a single line of code: no engineering intrusion. If you had to write code to obtain the data, there would be no need for a third-party platform, since developers could easily implement it themselves. Cloud Intelligence's agents can be cold-deployed from scratch and also support hot deployment, suspension, and unloading.
Cloud Intelligence is a provider of business operations solutions. Its products Monitoring Treasure (www.jiankongbao.com), Perspective Treasure (www.toushibao.com), and Pressure Test Treasure (www.yacebao.com) have provided one-stop application performance monitoring, management, and testing services to hundreds of thousands of users in e-commerce, mobile Internet, advertising and media, online gaming, education and healthcare, finance and securities, and government and enterprise industries.