
Notes on performance optimization from the 5.28 stress test: thread-pool issues

2025-01-21 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

Table of contents:

1. Environment introduction
2. Symptoms
3. Diagnosis
4. Conclusion
5. Solution
6. Comparison with the Java implementation

Without further ado, this article shares a performance problem the blogger solved during the 5.28 stress test. It was interesting enough to be worth summarizing and sharing.

The blogger's department runs a shared business platform that supports all the upper-level business systems (2C, UGC, live streaming, etc.). Among the platform's core services are the order-domain services: order creation, order query, and payment callback. For the time being we are also responsible for the checkout page, which handles order creation, settlement, and the redirect to the payment center. Before every major promotion the platform runs a routine stress test, to make sure we know exactly where we stand.

In the first half of the stress test we knocked out a series of unsurprising problems one after another, each located within the planned time. The order creation service, order query service, and checkout page all passed smoothly. But when we got to the payment callback service, a strange problem appeared.

1. Environment introduction

We run two major promotions a year, 5.28 and Double 12, only about half a year apart, so each stress test is fairly low-key — mostly a check-up, since in any given half year there usually aren't big performance regressions. This is because we run a performance stress test with every major release, so stress testing has gradually become routine and automated, and few performance problems slip through. The real point is that performance indicators should be watched continuously, not crammed for when a big promotion is approaching — by then it is already too late, and all you can do is rob Peter to pay Paul.

Application server configuration: physical machines, 32 cores, 168 GB of RAM, gigabit NICs, gigabit bandwidth on the stress-test network, IIS 7.5, .NET 4.0. A fairly beefy box for a stress test.

We use JMeter locally to reproduce and troubleshoot problems. Since this article is not about how to run a performance stress test, details with little bearing on it — isolation of the stress-test network, load-generator configuration, number of nodes, and so on — are omitted.

Our requirement is that top-level services average no more than 50 ms response time at 200 concurrent users, with TPS around 3,000. The bar for first-level (that is, lowest-level) services is higher: the commodity, promotion, and coupon systems are only acceptable within an average of 20 ms, because the response time of the first-level services directly determines that of the services above them (setting aside some other invocation overhead).

2. Symptoms

The symptoms of this performance problem were strange. The setup: 200 concurrent threads, 2,000 loops each, roughly 400,000 calls in total. For the first few seconds things were fast — TPS reached about 2,500 and server CPU sat around 60%, which is quite normal. But after a few seconds processing slowed sharply and TPS kept sliding. Server monitoring showed CPU at 0%. Scary — why did it suddenly stop processing? TPS fell to just over 100 and was clearly still falling. After a little under four minutes, CPU came back and TPS recovered to around 2,000.

Look closer: JMeter computes throughput from the average response time of your requests, so what looks here like TPS gradually slowing is really processing having essentially stopped — the reported number just decays as the average response time stretches out. Given an average response time (say 20 ms), throughput per unit time is basically a fixed arithmetic consequence.
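As a back-of-the-envelope check of that relationship (a sketch of my own, not part of the original toolchain; the numbers are illustrative), Little's Law ties the quantities together: throughput ≈ concurrency / average response time.

```csharp
using System;

class ThroughputEstimate
{
    // Little's Law: max throughput ≈ concurrency / average response time.
    internal static double EstimateTps(int concurrency, double avgResponseMs)
        => concurrency * 1000.0 / avgResponseMs;

    static void Main()
    {
        // 200 concurrent users at 20 ms per request:
        Console.WriteLine(EstimateTps(200, 20));   // 10000 requests/sec
        // The same 200 users once latency balloons to 2 s:
        Console.WriteLine(EstimateTps(200, 2000)); // 100 requests/sec
    }
}
```

This is why the falling TPS curve is an artifact of rising latency rather than a gentle slowdown.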

This is the main symptom, and we will diagnose it next.

3. Diagnosis

Start by walking through the code to see if you can find anything.

This is the payment callback service. There is not much business logic around the code: authentication checks, updating the order's payment status, firing the payment-completed event, triggering fulfilment, and notifying peripheral systems (some of which must stay compatible with old code and old interfaces). We looked first at the external dependencies, found code that reads and writes redis, commented part of the redis code out, and re-ran the stress test — and everything was suddenly normal. Strange: this redis instance is shared with our other stress-tested services, so why was there no problem before? We didn't dwell on it; perhaps the code paths execute in a different order, which is plausible in the land of concurrency.

Next we logged the time spent in the redis calls to see how long processing took. The results were uneven: very fast at first, then 5-6 seconds per call later on — uneven, but with a pattern.

So we all assumed it was a redis-related problem and dived in. We turned on Wireshark to watch the TCP connections, and checked the link and the redis server's Slowlog for processing times. We read the source of the redis client library (the client wraps the native StackExchange.Redis in two layers of encapsulation, three layers in all), focusing on where locks are taken and where threads wait. In parallel we ruled out network problems, pinging the redis server during the test to look for latency. (It was about 21:00 by then — everyone knows what the brain is like at that hour.)

We did a carpet search like this thinking it would surely locate the problem. But we had ignored the layered structure of the code, plunged straight into detail, and lost sight of the overall architecture (the development architecture, that is — we had not written this code and knew little about its surroundings).

First, the connection to the redis server: the TCP capture showed connections established normally, no packet loss, and good speed. Redis-side processing was fine too — Slowlog showed a basic GET taking under 1 millisecond. (Note that the time we logged on the client also includes queueing: Slowlog only shows redis's own processing time, not blocking time, which includes the time a command spends queued in the client.)

So the slow redis times we printed were not just the redis server's processing time; there were several other links in the chain to check.

After all that tossing and turning the problem was still not located. It was late at night, energy was seriously depleted, and the last subway train was about to leave — I couldn't push any further. I left work and made the last train with less than three minutes to spare.

We reorganized our thinking and continued the investigation the next day.

Once we focused on the redis client's connection setup, we tried warming it up — establishing the connection when the application starts, in Application_Start — and performance was suddenly normal.
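A minimal sketch of that warm-up, assuming StackExchange.Redis and an illustrative host name (the real project wraps the client in two more layers, which this sketch omits):

```csharp
using System;
using System.Web;
using StackExchange.Redis;

public class Global : HttpApplication
{
    // Shared multiplexer reused by the whole app; field name and
    // connection string here are illustrative, not the real config.
    public static ConnectionMultiplexer Redis;

    protected void Application_Start(object sender, EventArgs e)
    {
        // Pay the TCP/handshake cost once at startup, rather than on the
        // first (or first two hundred concurrent) user requests.
        Redis = ConnectionMultiplexer.Connect("redis-host:6379");
        Redis.GetDatabase().Ping(); // force one round trip to warm the link
    }
}
```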

The scope narrowed further: the problem was in the connection. And here we paused to reflect again (a night's sleep clears the head): why had previous stress tests never hit this? Being stubborn about technology, we couldn't let it go — the symptom was worked around, but the clues still didn't join up, and that is always uncomfortable. (It was nearly evening on the second day by then, after a short break. A technician needs this urge to conquer the problem completely.)

We restored the scene and brought out the big gun: dumping the process at several points in time and pulling the dump files down for local analysis.

First, the thread picture: !runaway showed that most threads had been executing for quite a while. Switching to individual threads with ~Ns and dumping their call stacks, we found them waiting on a monitor lock. We switched to several other threads to see whether they were waiting on a lock too. They were all waiting on the same one.

In short, about half of the threads were waiting on this Monitor lock, and over time more and more of them ended up waiting on it. Odd.

That lock was a lock taken in the third (outermost) layer of the redis wrapper to obtain the redis connection. We commented it out, stress-tested again, dumped again — and found another Monitor, this time inside StackExchange.Redis itself. There was no time to digest that code properly; we only skimmed the relevant method and its immediate surroundings rather than the whole picture, because time was tight. Trusting the third-party library for the moment, we tried tuning the redis connection-string parameters — timeouts, connection pool size, and so on. Still not solved.

Back to the dumps. Checking the CLR thread pool with !threadpool, we saw the problem at once.

Looking through the other dump files confirmed it: Idle was 0, meaning the CLR thread pool had no idle threads left to process requests — or at least the pool's thread-creation rate could not keep up with the arrival rate.
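The dump inspection described above corresponds roughly to this WinDbg/SOS sequence (all standard SOS commands; `$$` marks comments, and the output is omitted here):

```
.loadby sos clr    $$ load the SOS extension for the CLR in the dump
!runaway           $$ per-thread CPU time: spot long-running threads
~5s                $$ switch to, e.g., thread 5
!clrstack          $$ its managed stack: shows the Monitor wait
!syncblk           $$ which thread owns the contended monitor lock
!threadpool        $$ pool stats: CPU %, worker/IOCP counts, Idle
```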

Beyond its minimum, the CLR thread pool grows at a rate of roughly one to two threads per second (whether there is a sliding window in the growth rate, we are not sure). The pool size can be configured in C:\Windows\Microsoft.NET\Framework64\v4.0.30319\Config\machine.config, and is auto-configured by default; the minimum thread count generally defaults to the machine's CPU core count. You can also set it in code with ThreadPool.SetMaxThreads() and ThreadPool.SetMinThreads().
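A sketch of reading and raising the pool floor in code (the multiplier is illustrative, not a recommendation — raising the minimum only delays thread exhaustion, it doesn't fix it):

```csharp
using System;
using System.Threading;

class PoolConfig
{
    static void Main()
    {
        ThreadPool.GetMinThreads(out int worker, out int iocp);
        Console.WriteLine($"default min: {worker} worker / {iocp} IOCP");

        // Raise the floor so the pool does not have to ramp up at
        // ~1-2 threads/sec when a burst of requests arrives.
        if (ThreadPool.SetMinThreads(worker * 4, iocp * 4))
        {
            ThreadPool.GetMinThreads(out worker, out iocp);
            Console.WriteLine($"raised min: {worker} worker / {iocp} IOCP");
        }
    }
}
```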

Then we went back to the code and found the spot where an Action delegate is used to run asynchronous work — and the redis reads and writes mentioned above all happen inside that Action. All at once everything made sense: the clues connected.

4. Conclusion

The .NET CLR thread pool is shared: ASP.NET requests, delegates, and Tasks all draw from the same pool behind the scenes. It has two parts: the worker thread pool (which also serves requests) and the IOCP (I/O completion port) thread pool.

Let's take a look at the clue:

1. The slowly decreasing throughput in the JMeter stress test was an illusion: processing had in fact stopped completely, and the server's CPU was at 0%. It only looked gradual to the naked eye because request latency had ballooned.

2. The TCP link to redis was fine — Wireshark showed no anomalies and Slowlog showed no problem; redis commands only looked slow because the client side was blocked.

3. All the other services passed their stress tests because they call redis synchronously, and once the first TCP connection is established, subsequent calls are fast.

4. The Action looks fast, but every Action runs on a thread from the CLR thread pool; it only looks fast while the CLR thread pool is not yet the bottleneck.

```csharp
Action asyncAction = () =>
{
    // read/write redis
    // send email
    // ...
};
asyncAction();
```

5. The JMeter test had no ramp-up delay and the program was not warmed up beforehand, so everything needing initialization — IIS, .NET, connections, and so on — initialized under full load. All of this makes things look fast at first and then slowly decline.

Summary: the first TCP connection takes time to establish, and under high concurrency every thread that hits it waits; after the wait, all of those threads get context-switched back in by the CPU, which is itself part of the overhead. Once the CLR thread pool's threads are exhausted, throughput falls off a cliff. Each call actually occupies two threads — one handling the request, and one running the Action delegate. Just when you think there are plenty of threads, the pool is already full.
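The two-threads-per-call pattern, sketched (the names are mine, and ThreadPool.QueueUserWorkItem stands in for however the real code dispatched the delegate — delegate.BeginInvoke in .NET 4.0 has the same effect on the pool):

```csharp
using System;
using System.Threading;

class PaymentCallbackSketch
{
    // The request itself runs on one CLR pool thread; the fire-and-forget
    // delegate consumes a second pool thread for the same logical request.
    internal static void HandlePaymentCallback(ManualResetEvent done)
    {
        // ... synchronous request work: runs on pool thread #1 ...

        Action asyncAction = () =>
        {
            // read/write redis, send email, notify peripheral systems ...
            done.Set();
        };
        // Scheduling the delegate onto the shared CLR pool takes pool thread #2.
        ThreadPool.QueueUserWorkItem(_ => asyncAction());
    }

    static void Main()
    {
        var done = new ManualResetEvent(false);
        HandlePaymentCallback(done);
        Console.WriteLine(done.WaitOne(5000)
            ? "delegate ran on a pool thread"
            : "timed out");
    }
}
```

Under load, every in-flight request therefore doubles its draw on the same shared pool.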

5. Solution

We solved this by queueing the work: essentially a work queue abstracted on top of the CLR thread pool, with the queue's consumer threads capped at a fixed number. It starts with one thread by default and exposes an interface to create up to 6, to be called when the queue's consumers can't keep up. The approximate code follows (suitably modified, not the source's exact appearance; for reference only):

Service section:

```csharp
private static readonly ConcurrentQueue<PayNoticeParamEntity> AsyncNotifyPayQueue =
    new ConcurrentQueue<PayNoticeParamEntity>();
private static int _workThread;

static ChangeOrderService()
{
    StartWorkThread();
}

public static int GetPayNoticQueueCount()
{
    return AsyncNotifyPayQueue.Count;
}

public static int StartWorkThread()
{
    if (_workThread > 5) return _workThread;

    ThreadPool.QueueUserWorkItem(WaitCallbackImpl);
    _workThread += 1;
    return _workThread;
}

public static void WaitCallbackImpl(object state)
{
    while (true)
    {
        try
        {
            PayNoticeParamEntity payParam;
            AsyncNotifyPayQueue.TryDequeue(out payParam);
            if (payParam == null)
            {
                Thread.Sleep(5000);
                continue;
            }

            // get order details
            // carry-over apportionment
            // send text messages
            // send messages
            // ship
        }
        catch (Exception exception)
        {
            // log
        }
    }
}
```

The original call site was changed to simply enqueue:

```csharp
private void AsyncNotifyPayCompleted(PayNoticeParamEntity payNoticeParam)
{
    AsyncNotifyPayQueue.Enqueue(payNoticeParam);
}
```

Controller Code:

```csharp
public class WorkQueueController : ApiController
{
    [Route("worker/server_work_queue")]
    [HttpGet]
    public HttpResponseMessage GetServerWorkQueue()
    {
        var payNoticCount = ChangeOrderService.GetPayNoticQueueCount();

        var result = new HttpResponseMessage()
        {
            Content = new StringContent(payNoticCount.ToString(), Encoding.UTF8, "application/json")
        };
        return result;
    }

    [Route("worker/start-work-thread")]
    [HttpGet]
    public HttpResponseMessage StartWorkThread()
    {
        var count = ChangeOrderService.StartWorkThread();

        var result = new HttpResponseMessage()
        {
            Content = new StringContent(count.ToString(), Encoding.UTF8, "application/json")
        };
        return result;
    }
}
```

The code above is not abstracted or encapsulated; it is for reference only. The idea is the same regardless: maximize thread utilization, let tasks queue up instead of occupying threads, and separate CPU-bound from IO-bound work — decouple the stages whose speeds don't match.

After the optimization, TPS reached 7,000 — nearly three times the original.

6. Comparison with the Java implementation

This problem would be less likely to arise in Java: Java's thread-pool facilities are powerful and its concurrency library is rich. The equivalent takes a line or two:

```java
ExecutorService fixedExecutorService = Executors.newFixedThreadPool(threadCount);
```

This directly constructs a pool with a fixed number of threads. You can also configure the pool's queue type and size, including the rejection policy for when both the queue and the pool are full. All quite convenient to use.
