0. Experience summary:
If the load never ramps up during a stress test, consider two possibilities: the load generator cannot produce the intended concurrency, or the server under test cannot handle the requests. In either case, check step by step: is it a CPU bottleneck? A memory bottleneck? A shortage of available ports? For protocols such as gRPC that require an RPC connection, the size of the ephemeral port range limits how many connections can be established concurrently. When locating problems in a long call chain, start the stress test with the shortest chain and extend it step by step until the whole chain is covered. And pay attention to any abnormal behaviour during the test; even if it looks minor, it may hide a bug.
I. Background
This week we stress-tested several new interfaces in the project. Along the way we hit some pitfalls we had not met before and took a few detours, so here we summarize and review the experience. We also hope it gives readers of this article a few simple ideas.
First, a brief introduction to the project structure. The service entry is a gateway module that exposes a gRPC interface using the unary data mode; the gateway module interacts with the other business modules through Dubbo interfaces.
The architecture overview of the service is as follows:
The server that hosts the business interface has the same configuration as the server that hosts the MySQL component: 4 cores, 8 GB of memory, and a 50 GB ordinary hard disk, both on the same private network segment. We estimated that performance should reach 300 concurrent users and 600 TPS.
During the stress test we focus on metrics such as TPS, GC counts, CPU usage, and interface response time.
II. The testing process
After deploying the project, we wrote the JMeter test script, setting the load to 300 concurrent threads, all started within 10 seconds, running for 15 minutes, and then launched the JMeter script.
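For reference, a minimal sketch of how such a plan can be run from the command line in JMeter non-GUI mode (the plan file name and output paths are assumptions, not taken from the article; the thread group inside the plan carries the 300 threads / 10-second ramp-up / 15-minute duration settings):
# Run the test plan without the GUI and log raw samples
jmeter -n -t grpc_gateway_stress.jmx -l result.jtl
# Optionally generate an HTML dashboard from the results afterwards
jmeter -g result.jtl -o ./report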
1. The first stress test
(1) JVM configuration
Garbage collection strategy: the CMS collector is used for the old generation and ParNew for the young generation; objects are promoted to the old generation after surviving at most 15 minor GCs; explicit full GCs are handled concurrently by CMS; and CMS parallel remark is enabled.
Based on experience from previous stress tests, we set the initial heap memory to 2048 MB, because a small heap is easily exhausted under load.
JVM memory allocation: maximum/minimum heap memory 2048 MB, Eden-to-Survivor ratio 8:2, and an intended young-to-old ratio of 1:2. Since the server runs JDK 8, the permanent-generation settings are dropped.
The JVM configuration parameters are as follows:
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/$MODULE/gc.log -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:MaxTenuringThreshold=15 -XX:+ExplicitGCInvokesConcurrent -XX:+CMSParallelRemarkEnabled -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/$MODULE -Xmx2048m -Xms2048m -XX:SurvivorRatio=8
(2) Performance indicator monitoring
We use the top command to observe the CPU utilization of the Java process (us is user time, sy is system time) and run jstat -gcutil PID 1000 to sample the JVM's GC statistics every second.
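A minimal monitoring sketch based on those commands (the module name used to look up the PID is hypothetical):
# Find the PID of the service under test
PID=$(jps -l | grep gateway-module | awk '{print $1}')
# Watch CPU usage of that process only; us/sy in the header are user/system time
top -p "$PID"
# Sample GC statistics every 1000 ms; YGC/FGC columns count young/full collections
jstat -gcutil "$PID" 1000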
When everything is ready, we start to run the stress test script and check the performance monitoring indicators.
However, the expected load never materialized: when the concurrency reached a certain value, the load seemed to cut off abruptly and then reappear after a pause of 1-2 seconds. During these pauses none of the metrics on the interface server looked abnormal. Clearly something was wrong with the concurrency itself!
(3) Troubleshooting and resolution
The phenomena observed during the test fall into two categories:
1. Concurrency never reached the expected level and hovered at a low value: after 2 minutes of testing, the interface side had received only about 30,000 requests, while the load generator had actually attempted about 70,000.
2. The load kept being interrupted during the run, and, not coincidentally, the assertions in the stress test script began reporting errors: connection exceptions.
The exception information is as follows:
Because a gRPC call needs an RPC connection between client and server, each side must dedicate a port to the connection. Based on this, we suspected two possible causes for the first phenomenon:
either the ephemeral port range on the load generator is too small, so it cannot start enough threads to initiate requests, or the port range on the server under test is too small, so it cannot accept everything the load generator sends, and connection requests get dropped.
The second phenomenon is most likely caused by the same port limit: new connection threads cannot start until previously used ports are released, which produces the on/off pattern in the load.
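One quick way to check this hypothesis, not shown in the original run and offered only as a suggestion, is to count connection states on the load generator; a large pile of TIME-WAIT sockets means ports are tied up waiting to be released:
# Socket summary
ss -s
# Count TCP connections grouped by state (ESTAB, TIME-WAIT, ...)
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn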
So we checked the ephemeral port range on both machines with the command: cat /proc/sys/net/ipv4/ip_local_port_range
The result was: 32768 60999
In other words, only the ports between 32768 and 60999 were open, roughly 28,000 in total, which lines up with the ~30,000 requests the interface side actually received.
We then widened the ephemeral port range on the server under test with the command:
Echo "10000 65535" > / proc/sys/net/ipv4/ip_local_port_range
The result now showed the expanded range: 10000 65535
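Note that writing to /proc only lasts until the next reboot. A sketch of the equivalent, persistent sysctl configuration (standard Linux usage, not taken from the article):
# Apply the range at runtime
sysctl -w net.ipv4.ip_local_port_range="10000 65535"
# Persist it across reboots
echo 'net.ipv4.ip_local_port_range = 10000 65535' >> /etc/sysctl.conf
sysctl -p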
2. The second stress test
After the adjustments from the first round, we ran the test again to verify their effect.
(1) JVM configuration
The JVM configuration has not changed.
(2) Performance indicator monitoring
Checking the GC and CPU usage of the server under test showed no significant change, but the number of requests increased slightly, indicating that the port expansion helped, though not by much.
On the load generator, the assertions on the API calls still reported errors; the error rate had dropped (because the server under test was now receiving more requests), but the message was still a connection exception.
This told us the fix was heading in the right direction, so we widened the ephemeral port range on the load generator as well and started the third round of testing.
3. The third stress test
(1) JVM configuration
The JVM configuration has not changed.
(2) Performance indicator monitoring
With the port ranges widened on both the load generator and the server under test, the gRPC connection requests behaved normally. At 100 and 300 concurrency the number of requests initiated in 2 minutes differed very little, indicating that we were close to the processing limit of the two servers.
(3) A new problem is exposed
We thought we were done, but then the GC behaviour of the server under test started to look abnormal: the YGC count kept ticking up, but FGC was also occurring frequently.
After the test had run for a while, FGC was occurring at an average rate of 2-3 times per second.
FGC is usually triggered when the old generation is full, so we checked the old-generation heap of the process:
jstat -gcold PID
Oh my God! The old generation was only 64 KB! As soon as the young generation filled up, objects were squeezed into this tiny old generation, so FGC was triggered constantly.
Looking back at the JVM parameter configuration, we found that the old-generation sizing was missing.
The old-generation size can be set with -XX:NewRatio=3, which sets the ratio of the old generation to the young generation to 3:1. With a 2 GB heap, that gives the old generation 1.5 GB and the young generation 0.5 GB.
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/$MODULE/gc.log -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:MaxTenuringThreshold=15 -XX:+ExplicitGCInvokesConcurrent -XX:+CMSParallelRemarkEnabled -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/$MODULE -Xmx2048m -Xms2048m -XX:SurvivorRatio=8 -XX:NewRatio=3
Once configured, we restarted the module under test and checked the heap allocation of the process:
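The original shows a screenshot at this point; as a sketch, the generation sizes can be verified with standard JDK tools (the PID lookup by module name is an assumption):
PID=$(jps -l | grep gateway-module | awk '{print $1}')
# Capacities of the young and old generations, in KB
jstat -gccapacity "$PID"
# Overall heap layout of the running JVM (JDK 8)
jmap -heap "$PID"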