Back-end service performance stress testing in practice
Labels: performance stress testing, back-end services, stress testing practice
Author: Wang Qingpei (Plen wang)
Contents: background; environment check; checking the load generator and stress testing tools; Linux open files limit settings; check for peripheral dependencies; empty interface pressure test; throughput calculation in the aggregate report; pressure testing and performance troubleshooting methods; following logs at different dimensions; common Linux commands; two directions for performance troubleshooting (top-down, bottom-up); summary
Over the past six months I have been responsible for performance stress testing work twice. Doing something once may not be enough to draw conclusions from, but after two rounds some common problems do emerge, so here I summarize the general practice of performance stress testing. There are certainly more problems than the ones listed here, and deeper issues are still waiting to be discovered; I will keep summarizing and sharing them one by one.
I will not dwell on why performance stress testing is needed; generally it has to be done at two points in time: before a major promotion and before a new system goes live. The purpose is to keep the system's processing capacity and stability within a known, acceptable range.
Looking at the industry as a whole, apart from some large companies, fully automated performance stress testing environments are still rare. Building one involves at least CI/CD, an independent and isolated stress testing environment, automated stress testing tools, daily performance alarms, performance report analysis, and a process for troubleshooting and resolving performance problems. Only then can performance stress testing become routine; once it stops being routine, the code and middleware configuration lag behind the production environment, and over a long period that is tantamount to building and verifying the stress testing environment all over again.
If the environment is fully automated, performance stress testing can become a routine task in the development process: execution efficiency is high, the testing itself is relatively easy, and the benefits are obvious.
Most of the time, however, we still have to start a performance stress test from scratch, since the cost of building such an environment is huge for most companies. Performance stress tests are sensitive to the environment, so independent deployment and isolation units must be set up if the stress test reports are to be read intuitively in later routine runs.
As an aside, even if we do have an automated stress testing environment, we still need to understand the basic structure of the whole environment. After all, it is not the real production environment, and we need to be able to tell whether its behavior is normal or abnormal.
Environment check
When we need to do performance stress testing, the first thing we face is the environment, which includes several common points:
1. Machine issues (physical or virtual machine, CPU, memory, inbound and outbound bandwidth of the network adapter, disk size, whether the disk is SSD, basic kernel parameter configuration)
2. Network issues (whether there are cross-segment problems, whether network segments are isolated, whether machines on different segments can reach each other, and whether there is a bandwidth limit across segments)
3. Middleware issues (whether all middleware the program depends on is deployed, whether the middleware configuration has been initialized, what the middleware cluster topology is, whether the middleware itself has been stress tested, and in what dimension: a benchmark, or a stress test for a specific business scenario)
Checking these environment issues is a bit tiring the first time, but once you have mastered some methods, tools, and processes, what is left is routine, though with a fair amount of manual work.
Some of the items above are simple to verify and are not covered here, such as the basic machine configuration. Others only need to be pushed forward and accepted by following the relevant process, such as network segment isolation.
The unclear part is the middleware, which may look available yet simply refuse to take load, so you need to run simple stress tests against it yourself, such as single-table inserts into the db, concurrent cache reads, mq persistence writes, and so on. This requires a certain depth of understanding of these middleware components and knowledge of their internal mechanisms; otherwise it is genuinely hard to troubleshoot anomalies.
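For example, a quick sanity check of the cache and db middleware can be done with standard benchmarking tools. The sketch below is only an illustration, not the author's exact procedure; hosts, credentials, and request counts are placeholders.
# quick baseline for a redis cache node
redis-benchmark -h <cache-host> -p 6379 -n 100000 -c 50 -t set,get
# quick auto-generated insert/select baseline for a mysql node
mysqlslap --host=<db-host> --user=<user> --password --concurrency=50 --iterations=3 --auto-generate-sql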
In practice no one can be familiar with every middleware on the market; each one is complex, and it is impossible to master every detail of any of them. But we do need to master the commonly used ones, or at least know their general internal structure, so that problems can be tracked down.
In reality there will always be components you are not familiar with. In that case, ask others for help, then explore and dig up material on your own. What we have never encountered, someone else probably has; learning technology is exactly this kind of process.
Checking the load generator and stress testing tools
We first need to understand the load generator (the press machine) and the stress testing tools. The tools we mainly use are locust, jmeter, and ab. The first two are used mainly by the stress testing colleagues for release acceptance testing, while the latter two are used by developers for self-testing and debugging before handing the system over for the formal stress test. It should be emphasized that ab really does benchmarking, which is a different role from jmeter's.
You need to know whether the load generator is on the same network segment as the machine under test and whether there is a bandwidth limit between segments, and whether the stress testing tool on the load generator is itself a bottleneck; for jmeter in particular, check its basic java configuration.
In general, if the load generator is a fixed machine that has been in use all along, there will be no problem, because the stress testing colleagues maintain it and the parameters of the tools they use have already been configured and verified.
When running a long stress test with jmeter, remember to turn off the Listener -> Graph Results panel: if rendering takes too long the UI essentially freezes, which looks like a memory problem but is actually a rendering problem.
When developers run benchmark tests, one problem is the bandwidth between the office network and the stress test server network; too much load can disrupt the office network, so you need to pick an off-peak time slot.
After this rough review we can use some tools to check whether the basic configuration is normal, for example ethtool for network adapter information and nload for traffic; there are of course many other excellent tools for viewing configuration that are not listed here.
Before using ethtool to view network adapter information, determine how many network adapters the current machine has; the easiest way is to use ifconfig to find the adapter that is actually in use.
Excluding the 127.0.0.1 loopback adapter there are three other adapters, and only the first, bond0, is the one in use. Then use ethtool to view the details of bond0, focusing on the Speed field, which shows the bandwidth of the adapter.
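For example (bond0 follows the adapter name above; on another machine the name will differ):
ifconfig                 # list adapters and find the one actually carrying traffic
ethtool bond0            # show adapter details; the Speed field is the link bandwidth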
Even if the adapter configuration looks fine, you still need to ask the operations colleagues to confirm that the network as a whole is healthy; there may still be a rate limit somewhere.
To be sure there is no bandwidth problem, we also need a real-time network traffic monitor; here we use nload to watch inbound and outbound traffic.
This tool is very handy: during the stress test you can watch the inbound and outbound traffic, and in particular spot gaps and jitter.
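A minimal invocation, assuming the same bond0 adapter as above:
nload bond0              # live view of incoming and outgoing traffic on bond0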
If inbound traffic stays normal but outbound traffic drops, the system may be slowing down on an external call: the downstream call may be blocking while the request thread pool is not yet exhausted, or the service may be purely async inside so the request threads never saturate, or the load the stress testing tool itself generates may be the problem, and so on. But at least we know something is wrong at the external call boundary of our own system.
Linux open files limit settings
In a normal working environment we do not need to set the limit on the number of open file handles in linux ourselves; the initial values are generally set by the OPS colleagues according to a unified operations standard. But sometimes the maximum number of connections depends on the usage scenario of the back-end system.
Just in case, we still need to check whether these limits meet the stress test requirements of the current system.
In Linux everything is a file, and a socket is also a file, so you need to check the machine's limit on open file handles: look at the open files field of ulimit -a, or run ulimit -n directly.
If you feel the parameters need to be adjusted, you can edit the /etc/security/limits.conf configuration file.
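For example (the 65535 below is only an illustrative number, not a recommendation from the article):
ulimit -n                          # current limit on open file handles for this shell
ulimit -a | grep "open files"      # the same value in the full limits listing
# to raise it persistently, add lines like these to /etc/security/limits.conf:
#   *   soft   nofile   65535
#   *   hard   nofile   65535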
Check for peripheral dependencies
Before stress testing a service you need to sort out its peripheral dependencies, because the services it depends on may not be in a state that allows stress testing. Not every system is being stress tested in the same time window, so while you are running your test, other teams' services may not need or want stress traffic.
There are similar middleware issues. For example, if we depend on a middleware cache, is there also a local cache layer? If so, perhaps we do not need to depend too heavily on the stress-test environment's cache middleware. If we depend on mq, we can cut the dependency on mq at the business level, since we are not stress testing mq itself, and the services we depend on do not care about our load fluctuations.
After sorting this out it is best to draw a sketch, then create a separate performance test branch (for example with git checkout -b) and adjust the code dependencies according to the sketch. During the stress test, watch the traffic and the trend of the data to confirm they follow the route we mapped out.
Empty interface pressure test
A simple way to quickly verify the service under test is to stress an empty interface, which checks that the whole network path is clear and that the parameters are roughly normal.
Generally, any back-end service has an endpoint similar to __health_check__. For convenience you can directly pick an API with no downstream dependencies for the stress test; this kind of API is mainly used to verify whether the server is online or offline.
If the current service has no empty interface like __health_check__, it is worth adding one; production experience shows that a service really needs such an interface, and when necessary it helps troubleshoot the calling link.
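As a sketch, assuming such an endpoint is exposed at /__health_check__ (path, host, and port are placeholders):
# confirm the endpoint responds before applying load
curl -s -o /dev/null -w "%{http_code}\n" http://<host>:<port>/__health_check__
# quick empty-interface baseline with ab: 10000 requests at concurrency 50
ab -n 10000 -c 50 http://<host>:<port>/__health_check__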
"publish! Software Design and deployment "Jolt Award Book Chapter 17 Transparency introduces the transparent design role of the architecture.
Throughput calculation in the aggregate report
When using jmeter for stress testing, we need a shared understanding of the Throughput column in the aggregate report.
Normally you watch the throughput column closely during a jmeter stress test, but if you do not understand how throughput is calculated you may wrongly conclude that tps/qps is merely dropping when in fact the remote server has stopped responding altogether.
Throughput = samples / elapsed test time
Throughput is the number of requests processed per unit time, usually per second. For a write-type interface it corresponds to the tps metric; for a read-type interface it corresponds to the qps metric. The two are different things and should not be confused.
tps = 1000 (writes) / 5 (s) = 200 (throughput)
qps = 2000 (reads) / 2 (s) = 1000 (throughput)
When we see throughput gradually coming down, we have to factor in the time dimension.
In other words, the service may have stopped responding entirely, but because the elapsed stress test time keeps accumulating, the computed throughput only declines slowly, and a sharp failure like this goes unnoticed.
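A small illustration of this masking effect (the numbers are hypothetical):
# 60000 samples complete in the first 60 s, then the server hangs completely
echo $((60000 / 60))      # 1000 req/s reported at the 60 s mark
echo $((60000 / 120))     # still 500 req/s reported at the 120 s mark, not 0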
This is particularly noticeable in the jmeter GUI, where the effect just looks like a gradual slowdown. The non-GUI (command line) jmeter is better here, since it prints results at fixed intervals.
Being unclear on this point badly affects how we judge the stress test results, so we must have monitoring reports during the test to know whether the server-side metrics became abnormal at any point during the run.
Much of the time we also use apache ab for basic stress tests, mainly to compare with jmeter and check whether the results of the two tools diverge; this helps correct artificially high numbers.
apache ab and jmeter have different emphases: ab drives a fixed number of requests, while jmeter can drive load for a fixed duration, so keep that difference in mind when comparing the final numbers. ab does not seem to report or stop on request errors, while jmeter has error reporting and assertions along various dimensions.
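For example, comparing the two tools against the same endpoint (test plan, path, host, and port are placeholders):
# ab: drive a fixed number of requests
ab -n 20000 -c 100 http://<host>:<port>/<path>
# jmeter non-GUI mode: drive load as defined in the test plan, writing results to a file
jmeter -n -t plan.jmx -l result.jtl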
When using a stress testing tool, a general understanding of how the tool works internally helps you use it accurately.
Pressure testing and performance troubleshooting methods
Earlier we talked about checking peripheral dependencies as part of the environment check. That step is necessary for the stress test to go smoothly, because after the analysis we have a basic roadmap of the system's dependencies.
Based on this dependency roadmap we carry out the performance stress test, locate problems, and optimize performance.
In a reasonable architecture the upper layer depends on the lower layer, and there is no way to determine the performance bottleneck of an upstream system without first determining the performance of the downstream systems it depends on.
Therefore the stress test should proceed from the bottom up as much as possible, to avoid pointless troubleshooting of performance problems that are really caused by insufficient downstream throughput. The further downstream a system is, the higher its performance requirements, because the performance bottleneck of the upstream systems depends directly on it.
For example, if the front-facing interface v1/product/{productid} of a product system has a throughput of 8000 qps, then 8000 is the throughput ceiling on that code path for every upstream service that depends on the interface, and the ceiling on the code path is the same whether the metric is tps or qps.
Upper-layer services can use async calls to improve request concurrency, but that cannot raise the throughput of the code path through the v1/product/{productid} service.
We must not confuse concurrency with throughput: being able to withstand high concurrency does not mean throughput is high. There are many ways to increase concurrency, such as enlarging the threadpool, c10k-style socket handling, nio event-driven I/O, and so on.
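As an illustration of the distinction, raising the load generator's concurrency does not raise the ceiling of a downstream-bound code path (endpoint and numbers are placeholders):
ab -n 20000 -c 50  http://<host>:<port>/v1/product/<productid>   # note Requests per second
ab -n 20000 -c 200 http://<host>:<port>/v1/product/<productid>   # higher -c, similar throughput if the downstream is the bottleneck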
Following logs at different dimensions
A relatively cost-effective way to locate performance problems during a stress test is through logs: request processing logs, request timing logs, and external API call timing logs. These can generally locate most of the obvious problems. The middleware we use also outputs corresponding execution logs.
The development framework we use supports logging at many dimensions, which is very convenient when troubleshooting problems.
Recording slow logs of the slow.log type is very necessary, not only during stress testing but also in production.
If we use various middleware, we also need their processing logs, such as mq.log, cache.log, search.log, and so on.
Besides these logs, we also need to pay attention to the runtime gc log.
We mainly use the Java platform, so watching the gc log during a stress test is standard practice. Even for non-Java programs, any vm-based language needs attention to its gc log; the output differs depending on the jvm gc collector configuration.
For typical e-commerce services, where response time is the priority, gc usually uses cms + parnew. Watch the frequency of full gc, the execution time of the cms phases (initial mark, concurrent mark, remark, concurrent sweep), the real time reported by gc, the amount of memory reclaimed by parnew collections, and so on.
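For reference, a minimal set of JDK 8 style startup flags for this kind of setup might look like the following; heap sizes, the log path, and the jar name are placeholders, not the author's actual configuration.
java -Xms4g -Xmx4g \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime \
     -Xloggc:/path/to/gc.log \
     -jar app.jar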
Java gc is complex and involves a great deal, and interpreting the gc log also depends on the current size of each generation and a whole series of gc-related settings.
Java Performance, the authoritative guide to Java performance optimization recommended by James Gosling, the father of Java, is worth studying and learning from over a long period.
Common Linux commands
During the stress test, to observe the system's resource consumption we need a variety of tools covering network, memory, processor, and traffic.
netstat
It is mainly used to view all kinds of network-related information.
For example, during the stress test, use netstat piped to wc to check whether the number of tcp connections matches the server's threadpool setting.
netstat -tnlp | grep ip | wc -l
If the server's threadpool is set to 50, you should see about 50 tcp connections; then use jstack on the server and count whether the number of request threads in running (RUNNABLE) status is >= 50.
The exact count of request threads may differ depending on the nio framework used.
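Under these assumptions, a rough check might look like the following (port and pid are placeholders; note that counting established connections needs netstat without the -l flag, which lists only listening sockets):
# count established tcp connections on the service port
netstat -tnp | grep ":<port>" | grep ESTABLISHED | wc -l
# count request threads currently runnable in the jvm (states come from the jstack dump)
jstack <pid> | grep -c "java.lang.Thread.State: RUNNABLE"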
netstat is also most frequently used to check which ports the system is listening on after startup, and whether tcp connections are in ESTABLISHED or LISTEN status.
netstat -tnlp
Then combine it with the ps command to check the startup status of the process. This is generally used to confirm whether the program actually started and, if it did, whether the port it listens on matches the port specified in the configuration.
ps aux | grep ecm-placeorder
The netstat command is powerful and has many functions, and if we need to see other features of the command, we can use man netstat to look through the help documentation.
vmstat
It is mainly used to monitor the processor run queue statistics, along with memory and swap.
vmstat 1
During the stress test you can print every 1 or 2 seconds to see whether the processor load is too high. The r subcolumn of the procs column is the current processor run queue; if this value is consistently higher than the number of cpu cores, the processor load is too high. It can be used together with the top command described below.
At the same time this command shows whether memory is sufficient when processor load is high: whether there is heavy memory swapping, how much is swapped in (si) and swapped out (so), whether the context switches per second (cs) are very high, whether user-mode cpu time (us) is very low, whether io wait is very high, and so on.
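A minimal invocation and the columns worth watching (the sample count is arbitrary):
vmstat 1 5       # print stats every second, five samples
# watch: r (run queue) against the number of cores, si/so (swap in/out),
#        cs (context switches), us/sy (user/system cpu), wa (io wait)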
There have been many excellent articles on this command online, so I won't waste time repeating it here. You can also use the man vmstat command to view various uses.
mpstat
It is mainly used to monitor per-processor statistics.
mpstat -P ALL 1
This is a 32-core stress test server; with mpstat you can monitor the load on each virtual processor, or view the overall processor load.
mpstat 1
You can see %idle (percentage of cpu that is idle), %user (cpu used by user-mode tasks), %sys (cpu used by the kernel in system mode), %soft (cpu used by soft interrupts), %nice (cpu used by tasks with adjusted priority), and so on.
iostat
It is mainly used to monitor io statistics.
iostat 1
If we have a large number of io operations, we can monitor the amount of data written and read through iostat, and also see the average cpu load when the io load is especially high.
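For finer per-device detail, the extended view can help (a sketch; device names and exact column names depend on the machine and sysstat version):
iostat -x 1      # per-device extended stats every second
# watch %util (device saturation), await (or r_await/w_await, average io latency), r/s and w/s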
top
Monitors the overall performance of the system.
top is the command we use most often day to day; it lets us know the current system state at a glance: processor load, memory consumption, and which task is consuming the most cpu or memory.
top
top is rich in features; for example, it can sort by %MEM or by %CPU.
The load average field shows the cpu load; the three numbers are the average load over the last 1, 5, and 15 minutes respectively. This value should not exceed the number of cpu cores; if it does, the cpu load is already too high. Check whether the number of threads is set too high, and consider whether tasks take too long to process; the appropriate thread count is directly related to how long a task takes.
The Tasks field shows the number of tasks: total, running, sleeping, stopped, and zombie.
The Swap field shows the system swap area. During the stress test, watch whether used keeps rising; if it does, physical memory has been exhausted and the system has started swapping memory pages.
free
View the memory usage of the current system.
free -m
total is the total memory size; used is memory already allocated; free is memory currently available; shared is memory shared between tasks; buffers is memory the system has allocated for storing file metadata; cached is memory the system has allocated for storing file content data.
-/+ buffers/cache
In this row, used has buffers/cached subtracted, meaning applications are not really using that much memory; part of it sits in buffers/cache.
free has buffers/cached added back, meaning the memory in buffers/cache can still be treated as available.
The Swap row shows swap area statistics: total swap size, used swap, and free swap. You mainly need to watch the used value; if any swap is being used, memory has hit a bottleneck.
"in-depth understanding of the LINUX Kernel" and "LINUX Kernel Design and implementation" can be used as a reference manual to flip through problems.
Two directions for performance troubleshooting (top-down, bottom-up)
When the system has a performance problem, we can troubleshoot from two directions, top-down or bottom-up, or combine the two; during the stress test, information at both levels can be watched at the same time.
Keep top and free open to observe system-level cpu and memory consumption, while using tools such as jstack and jstat to look at the internal state of the application runtime.
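A minimal combination during a run might look like this (the pid is a placeholder):
top                                   # system level: load average, per-task cpu and memory
free -m                               # system level: memory and swap usage
jstat -gcutil <pid> 1000              # jvm level: gc occupancy and counts, sampled every second
jstack <pid> > /tmp/thread_dump.txt   # jvm level: thread dump for blocked/runnable analysis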
Summary
This article is only intended to start the discussion: it organizes the common problems, troubleshooting methods, and workflow of a general performance stress test, without many advanced techniques.
Once a performance problem appears, it is rarely simple. It takes a lot of effort to troubleshoot, using various tools and commands step by step, and the information these tools output reflects the underlying principles of the system, which need to be understood and verified one by one. There is no silver bullet that solves everything.