
A Tomcat pressure test tuning record


1. Preface

This Tomcat web application serves the group's login and registration pages and has specific performance requirements. Since I had little prior load-testing experience (only one Dubbo service before), this round of tuning was a struggle, so I am writing down the process.

2. Tuning process

At the start we gave the ops team no Tomcat configuration requirements, and we did not bother to confirm what configuration they had deployed, which led to all kinds of strange problems during the subsequent pressure tests.

a. At the beginning of the pressure test, after about 10 minutes of continuous requests no new requests could get in. netstat showed a large number of CLOSE_WAIT connections on the server running Tomcat.

CLOSE_WAIT connections are usually caused by our own program failing to close connections, but reading through the code I could not find anywhere a connection was left open, and most of the CLOSE_WAIT sockets were HTTP connections from the browser side. After the ops team investigated, it turned out to be a bug in CentOS itself, which was fixed by upgrading to centos-release-6-6.el6.centos.12.2.x86_64.

At first I did not understand how the CLOSE_WAIT and TIME_WAIT states arise or how to deal with them; only after going through the TCP four-way close in detail did the whole process become clear. (The original post included a diagram of the four-way close, taken from the Internet.)

For example, if the client application initiates the close, it sends a FIN; the server enters CLOSE_WAIT after receiving it and replies with an ACK. When the server-side program then closes its end, it sends its own FIN; the client enters TIME_WAIT after receiving it and replies with an ACK, and the server moves to CLOSED once that ACK arrives. The client has to wait in TIME_WAIT until the timeout expires before it too reaches CLOSED.

Given this, a large number of CLOSE_WAIT connections on the server side is not a normal state. First confirm which peer IPs the CLOSE_WAIT connections belong to, then check whether the code that talks to those IPs is missing a connection close.

A large number of TIME_WAIT connections, on the other hand, is not a big deal as long as they do not exhaust the available handles. If they do, you can try tuning the kernel's TCP timeout and enabling TIME_WAIT reuse; a rough sketch of how to check and tune this is shown below.
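For reference, a minimal sketch (assuming a Linux host where netstat is available) of how to check the connection states and, if TIME_WAIT really does exhaust handles, the usual kernel knobs; the values below are illustrative, not a recommendation:

# Count TCP connections grouped by state (ESTABLISHED, CLOSE_WAIT, TIME_WAIT, ...)
netstat -ant | awk 'NR>2 {print $6}' | sort | uniq -c | sort -rn

# List the peer IPs stuck in CLOSE_WAIT, to see which component is not closing its connections
netstat -ant | awk '$6=="CLOSE_WAIT" {split($5,a,":"); print a[1]}' | sort | uniq -c | sort -rn

# If TIME_WAIT piles up, shorten the FIN timeout and allow TIME_WAIT reuse (example values)
sysctl -w net.ipv4.tcp_fin_timeout=30
sysctl -w net.ipv4.tcp_tw_reuse=1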

b. Next, testing with 500 concurrent users produced connection timeouts and read timeouts. This usually means the number of requests exceeds some configured limit. First we asked ops to rule out rate limits on nginx and the VM, then checked the Tomcat limits and found that the maximum number of threads had never been configured: by default maxThreads is 200 and the wait (accept) queue is 100. So we modified Tomcat's server.xml.


We switched the connector protocol to NIO, raised maxThreads to 2048 for the pressure test, set minSpareThreads to 100, and set the accept request queue (acceptCount) to 512; a reconstructed snippet is shown below.
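The original server.xml snippet did not survive the formatting, but based on the description above the connector probably looked roughly like this (the port and the attributes other than protocol, maxThreads, minSpareThreads and acceptCount are assumptions):

<!-- server.xml HTTP connector tuned for the pressure test (reconstructed from the text above) -->
<Connector port="8080"
           protocol="org.apache.coyote.http11.Http11NioProtocol"
           maxThreads="2048"
           minSpareThreads="100"
           acceptCount="512"
           connectionTimeout="20000"
           redirectPort="8443" />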

After that, pressing with 1000 or 2000 concurrent users no longer produced rejected requests.

To distinguish the two timeouts: a connection timeout happens before the connection is established; if it is not a network problem, then maxThreads + acceptCount is probably too small for the concurrency, and you can try raising those two values. A read timeout happens after the connection is established, either because the request waited in the queue too long or because the handler itself ran too long; you can log per-request execution time through the access log (see the sketch below).
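As a reference, a minimal sketch of an access-log valve in server.xml that records per-request processing time; %D (time to process the request, in milliseconds) is a standard AccessLogValve pattern code, while the directory and file prefix here are just placeholders:

<!-- Inside <Host> in server.xml: log each request together with its processing time (%D, in ms) -->
<Valve className="org.apache.catalina.valves.AccessLogValve"
       directory="logs" prefix="access_log" suffix=".txt"
       pattern="%h %l %u %t &quot;%r&quot; %s %b %D" />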

c. After pressing for a while, responses became slow and new requests could not get in; most failures at this point were connection refused (the thread pool had already been tuned and ruled out). Checking the GC log showed the JVM doing full GC continuously, with old-generation memory that could not be released. Since no GC collector had been specified, we first assumed GC itself was the problem and switched the configuration to CMS.

Flags to write GC logs and dump the heap when the service falls over:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/vol/logs/heap.bin -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/vol/logs/gc.log

Flags to switch the collector to CMS:

-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=10 -XX:+CMSClassUnloadingEnabled -XX:+CMSParallelRemarkEnabled -XX:MaxTenuringThreshold=15 -XX:CMSInitiatingOccupancyFraction=70

My understanding of CMS is not deep; roughly, the default collector stops the application while collecting, whereas CMS does concurrent mark-sweep. I need to go back to the books for the details. TODO

d. Switching the GC collector brought some relief, but within 30 minutes of pressure the throughput dropped again and responses became very slow;

It was roughly clear that many objects were not being released. Using jmap to view the 20 classes occupying the most memory (./jmap -histo <pid> | head -n 20), things looked fine at first: the biggest consumer was the session, but only about 300 MB (presumably because the check was done outside the pressure window, after memory had been released);

Then I dumped the whole heap when the slowdown appeared, but the dump was too big (5 GB) and took too long to download;

After shrinking the heap to 1 GB and pressing until the slowdown reappeared, the dump again showed sessions as the largest consumer, about 80% of the heap. At that point I still did not realize it was a session problem; I thought 1 GB was simply too small to expose the real issue;

Then I raised the heap to 2 GB and pressed until the slowdown reappeared; this time the dump showed sessions occupying 1.5 GB. That finally triggered the flash of insight (a rather late flash): our request sessions are wrapped in Redis, so why were the local Tomcat sessions this large? Setting the Tomcat session timeout to one minute made the frequent full GCs disappear. Looking at the code, the session wrapper had thoughtlessly called super.getSession(true), which by default creates a new session for every request, and with the default 30-minute timeout each one sits in memory for 30 minutes before it is destroyed!! A sketch of the fix is given below.
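For illustration only, a minimal sketch of what the wrapper fix might look like; the class name is hypothetical, and the real wrapper presumably returns a Redis-backed session, but the key point is not calling super.getSession(true) on every request:

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletRequestWrapper;
import javax.servlet.http.HttpSession;

// Hypothetical session wrapper around the servlet request.
public class SessionWrapper extends HttpServletRequestWrapper {

    public SessionWrapper(HttpServletRequest request) {
        super(request);
    }

    @Override
    public HttpSession getSession() {
        // Bug: super.getSession(true) creates a new local Tomcat session on every request,
        // and each one stays in memory for the default 30-minute timeout.
        // return super.getSession(true);

        // Fix (sketch): never create a local session implicitly; the real session
        // state is kept in Redis by the application's own session layer.
        return super.getSession(false);
    }
}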

In hindsight these problems look obvious... but tracking down their root cause inside a stress-test environment is genuinely tedious.

e. The last stretch, with the light not far ahead. The pressure test result was still not ideal, TPS hovering around 4800, but at least it stayed stable under prolonged load. The next step was to use JProfiler to see where the CPU time goes and break down where a single request spends its time.

To connect JProfiler to a remote machine, you need to start the agent on that machine: run bin/jpenable, answer Yes to GUI access, and fill in the port.

f. Finally, a record of the problems encountered while optimizing further once the pressure test was stable:

While trying to raise TPS we found that none of the monitored resources looked saturated, yet throughput would not go up (some unnoticed resource must have been the bottleneck): the Tomcat thread pool was less than half used, CPU floated around 70%, heap usage was under 1 GB, disk IO was about 8%, bandwidth was very low (around 80 KB/s), and the TCP connection count was over 10,000.

Troubleshooting this, we first suspected that a backend HTTP service we depend on could not keep up, but mocking that service out brought no obvious improvement.

Next we suspected Tomcat itself was misconfigured, so we pressed an empty test.jsp: TPS was over 20,000.

Then we suspected Redis was being used incorrectly, so we put the Redis session filter in front of the empty test.jsp, effectively pressing only Redis: TPS was still in the 18,000 to 20,000+ range.

At this point all the peripheral services had been ruled out, confirming the problem was in our own code. Looking at the most time-consuming methods in JProfiler showed that User-Agent parsing was slow; caching the parsing result in a ThreadLocal gave a small bump in TPS, from 4800 to 5000+. A rough sketch of that kind of cache is shown below.
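For illustration, a minimal sketch of a per-thread cache for parsed User-Agent results; the parser and result type here are placeholders (the project's actual UA parser is not shown in the post), and in practice you would bound the per-thread map or rely on the parsing library's own cache:

import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: cache parsed User-Agent results per worker thread so that
// repeated requests carrying the same UA string skip the expensive parse.
public final class UserAgentCache {

    // Each Tomcat worker thread gets its own small map: no locking, no contention.
    private static final ThreadLocal<Map<String, ParsedUserAgent>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    private UserAgentCache() { }

    public static ParsedUserAgent get(String userAgentHeader) {
        return CACHE.get().computeIfAbsent(userAgentHeader, UserAgentCache::parse);
    }

    // Placeholder for the real (slow) UA parsing logic used by the application.
    private static ParsedUserAgent parse(String ua) {
        return new ParsedUserAgent(ua);
    }

    // Placeholder result type.
    public static final class ParsedUserAgent {
        public final String raw;
        ParsedUserAgent(String raw) { this.raw = raw; }
    }
}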

Then we found many threads blocking inside log4j. After raising the log4j level to ERROR, TPS immediately reached 7000+, though we still could not explain it at the time. A colleague later suggested it might be the limit on disk IO operations; we did not pay attention to that point then, and the environment has since been torn down.
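For reference, the level change amounts to something like this in log4j.properties (the appender name, file path and layout are placeholders; an asynchronous appender would be another way to keep logging off the request threads):

# log4j.properties: only ERROR and above reach the (synchronous) file appender
log4j.rootLogger=ERROR, FILE
log4j.appender.FILE=org.apache.log4j.RollingFileAppender
log4j.appender.FILE.File=/vol/logs/app.log
log4j.appender.FILE.MaxFileSize=100MB
log4j.appender.FILE.MaxBackupIndex=10
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{ISO8601} [%t] %-5p %c - %m%n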

3. Tips

3.1 Preparation before the pressure test

Define the business scenarios to be tested: for example, a user completing login, from loading the login page through submitting account and password

Prepare the pressure test environment: servers with the same configuration as production, a clean environment with a realistic data volume, and no resources shared with other systems

Set pressure test targets according to the production requirements: e.g. TPS > 1000, average response time
