This article walks through the steps used to troubleshoot a Dubbo thread pool exhaustion incident. If you have run into the same "thread pool is exhausted" problem, I hope the investigation process below helps clear up your doubts. Let's work through it together.
Problem
One morning my phone suddenly received a service alert SMS from the company: thread pool exhausted? On the way to the office all sorts of guesses went through my head: the company had been running promotional activities recently, so was it a sudden traffic spike? Was someone releasing a system in the morning? Or had someone broken my system while I was away?
With all of those thoughts I checked the team group chat for updates after arriving, and every one of the guesses above was refuted!
The following is the warning message at that time:
RejectedExecutionException: Thread pool is EXHAUSTED! Thread Name: DubboServerHandler-xx.xx.xxx:2201, Pool Size: 300 (active: 283, core: 300, max: 300, largest: 300)
Q: How did this problem arise? Can it be solved simply by enlarging the thread pool? What is the default thread pool implementation in Dubbo?
A: When investigating a problem, keep one principle in mind: find the cause-and-effect relationship, because only then do we really improve. With that mindset we can unravel the mystery step by step.
With these questions in hand, the next step is to check our Dubbo configuration and understand Dubbo's underlying implementation; only by understanding the underlying implementation can we locate and handle the problem accurately, and improve ourselves along the way.
First let's look at our code configuration:
From the exception thrown we can see that the thread pool size has been set to 300.
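For reference, a pool of 300 like this would normally be set on the provider's protocol configuration. A minimal sketch, assuming an XML-based provider config (the port is taken from the alert above; adjust to your own setup):

<dubbo:protocol name="dubbo" port="2201" threadpool="fixed" threads="300" />

In a Spring Boot style configuration the equivalent would be roughly dubbo.protocol.threads=300.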
One point worth explaining here: a bigger thread pool is not always better. The right size depends on system-level limits and on the JVM parameters we configure. We usually keep the default of 200, and the sizing can be reasoned about roughly from these aspects:
1. JVM parameters: -Xms (initial heap size), -Xmx (maximum heap size), -Xss (stack size per thread).
2. System level: the maximum number of threads the machine can create, roughly estimated as: number of threads = (memory available on the machine - (heap allocated to the JVM + JVM metaspace)) / Xss.
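A quick worked example with purely illustrative numbers: on a machine with 8 GB of RAM, -Xmx4g, roughly 512 MB of metaspace and other JVM overhead, and -Xss1m, the rough upper bound is (8192 - (4096 + 512)) / 1 ≈ 3584 threads, and the practical limit is lower once the OS and other processes are taken into account.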
Anyway, let's start looking at the source code. Here we take version 2.7.19 as an example.
We can see that the ThreadPool interface is an SPI extension point whose default implementation is "fixed", and it exposes a getExecutor method decorated with the @Adaptive annotation.
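In 2.7.x the extension point looks roughly like the sketch below (a simplified view of org.apache.dubbo.common.threadpool.ThreadPool; imports omitted and the parameter-key constant written out as a plain string):

@SPI("fixed")
public interface ThreadPool {

    // @Adaptive means the concrete implementation is chosen at runtime from the
    // "threadpool" parameter on the URL; when it is absent, the SPI default "fixed" is used.
    @Adaptive({"threadpool"})
    Executor getExecutor(URL url);
}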
ThreadPool has four implementation classes in Dubbo:
1. CachedThreadPool: a cached thread pool; idle threads are removed after keepAliveTime and created again when needed.
2. FixedThreadPool: a thread pool with a fixed number of threads; once created, the threads are kept for good.
3. LimitedThreadPool: a scalable thread pool whose thread count only grows and never shrinks.
4. EagerThreadPool: when all core threads are busy it creates new threads first instead of putting tasks into the blocking queue; it uses its own TaskQueue.
Here we look directly at the default implementation, FixedThreadPool.
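Its getExecutor method is roughly the following (a simplified sketch of the 2.7.x source, with imports omitted and the constants inlined as their default values):

public class FixedThreadPool implements ThreadPool {

    @Override
    public Executor getExecutor(URL url) {
        // Thread name, thread count and queue length all come from URL parameters;
        // the defaults are "Dubbo", 200 threads and a queue length of 0.
        String name = url.getParameter("threadname", "Dubbo");
        int threads = url.getParameter("threads", 200);
        int queues = url.getParameter("queues", 0);
        return new ThreadPoolExecutor(threads, threads, 0, TimeUnit.MILLISECONDS,
                queues == 0 ? new SynchronousQueue<Runnable>() :
                        (queues < 0 ? new LinkedBlockingQueue<Runnable>()
                                : new LinkedBlockingQueue<Runnable>(queues)),
                new NamedInternalThreadFactory(name, true),
                new AbortPolicyWithReport(name, url));
    }
}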
Exception handling mechanism
From the code we can find:
The Dubbo thread pool uses the JDK's ThreadPoolExecutor; the default number of threads is 200 and a SynchronousQueue is used by default. If the queue length configured by the user is greater than 0, a LinkedBlockingQueue is used instead.
If a thread ends because of an execution exception, a new thread is added to the pool to replace it.
So by default Dubbo uses SynchronousQueue, a direct hand-off queue: every task is handed straight to a worker thread in the pool, and if no thread is available the task is rejected, which throws exactly the exception we are seeing.
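The rejection handler wired in above, AbortPolicyWithReport, is where our alert text comes from. Heavily condensed, it behaves roughly like this (the real class also logs a warning, dumps the JVM thread stacks, and includes the provider URL in the message):

public class AbortPolicyWithReport extends ThreadPoolExecutor.AbortPolicy {

    private final String threadName;

    public AbortPolicyWithReport(String threadName, URL url) {
        this.threadName = threadName;
    }

    @Override
    public void rejectedExecution(Runnable r, ThreadPoolExecutor e) {
        // Build the "Thread pool is EXHAUSTED!" message from the pool statistics,
        // then reject the task with a RejectedExecutionException.
        String msg = String.format(
                "Thread pool is EXHAUSTED! Thread Name: %s, Pool Size: %d (active: %d, core: %d, max: %d, largest: %d) ...",
                threadName, e.getPoolSize(), e.getActiveCount(),
                e.getCorePoolSize(), e.getMaximumPoolSize(), e.getLargestPoolSize());
        throw new RejectedExecutionException(msg);
    }
}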
Here are the necessary parameters for creating a thread pool:
corePoolSize - the number of threads kept in the pool, including idle threads.
maximumPoolSize - the maximum number of threads allowed in the pool.
keepAliveTime - when the number of threads exceeds the core size, the maximum time that excess idle threads wait for new tasks before terminating.
unit - the time unit of the keepAliveTime parameter.
workQueue - the queue used to hold tasks before they are executed; it holds only the Runnable tasks submitted through the execute method.
threadFactory - the factory the executor uses to create new threads.
handler - the handler invoked when execution is rejected because the thread bounds and queue capacity have been exceeded. ThreadPoolExecutor is the underlying implementation behind the Executors factory class.
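These are simply the parameters of the full JDK constructor:

public ThreadPoolExecutor(int corePoolSize,
                          int maximumPoolSize,
                          long keepAliveTime,
                          TimeUnit unit,
                          BlockingQueue<Runnable> workQueue,
                          ThreadFactory threadFactory,
                          RejectedExecutionHandler handler)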
Well, after reading this much source code we understand where the exception comes from. But what was the actual reason for the thread pool exhaustion I ran into?
The first stage of investigation:
Because the thread pool on machine 10.33.xx.xxx was exhausted, and the exhaustion was accompanied by a fairly long young GC (YGC), the first suspicion was that the long YGC had made machine resources tight and dragged down the thread pool.
2021-03-26T10:14:45.476+0800: 64922.846: [GC (Allocation Failure) 2021-02-24T11:17:45.477+0800: 64922.847: [ParNew: 1708298K->39822K(1887488K), 3.9215459 secs] 4189242K->2521094K(5033216K), 3.9225545 secs] [Times: user=12.77 sys=0.00, real=3.92 secs]
So I kept thinking about what could have made the YGC take so long.
Here is a brief explanation of why high IO can make GC pauses long:
1. To record its behavior, the JVM GC writes a GC log entry via the write() system call.
2. The write() call can be blocked by background disk IO.
3. Writing the GC log is part of the JVM pause, so the time spent in write() is counted in the JVM's stop-the-world (STW) pause time.
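For reference, the GC log shown above is the kind produced by the usual JDK 8 logging flags; each of those log lines is written synchronously inside the pause. An illustrative flag set (not our exact startup command, and the log path is a placeholder):

-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/path/to/gc.log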
From the GC log we can see that the young generation collection paused for 3.92 s; for a young generation of roughly 1.8 GB this is clearly abnormal. The ParNew collector works in the following steps:
(1) Mark live objects -> (2) copy live objects from the eden area to a survivor area -> (3) clean the eden area
In theory the third step takes a roughly constant amount of time, so the long pause must come from either the first step or the second step.
If the marking time in the first step is too long, it means that before this GC there were a huge number of small objects in the eden area (since the size of eden is fixed), probably dozens of times the normal object count.
If the second step is too long, there are the following possibilities:
1. After marking, a large number of live objects remain in the eden area (meaning the young generation still occupies a lot of memory after collection); this can be ruled out from the GC log (the young generation is only 39 MB after collection).
2. After marking, a large number of small, fragmented live objects remain.
3. The YGC triggered a full GC, but we did not see it in the logs.
At this point everything pointed to one possibility: a large number of small, fragmented objects in the young generation.
To verify this theory the only option was to analyze a heap snapshot, but we had just restarted the machine at the time, so we could not confirm it.
While we were still unable to verify it, a second machine hit the same problem. Just as we were getting ready to dump the heap again, we looked back at its GC log and found that GC was completely normal. So the stage-one theory was overturned.
Here are some of the server commands we use:
top: the most commonly used command and also the most comprehensive; it shows load, memory, CPU usage and much more.
For example, a common analysis workflow with top (concrete commands are sketched below the list):
1. Use top -Hp on the process to see which thread is consuming the most CPU.
2. Use printf to convert that thread id to hexadecimal.
3. Use jstack to see which method that thread is currently executing.
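Concretely, the three steps look roughly like this (4321 and 4567 are placeholder process and thread ids):

top -Hp 4321                           # list the threads of process 4321, sorted by CPU usage
printf '%x\n' 4567                     # convert the busiest thread id to hex, here 11d7
jstack 4321 | grep -A 20 'nid=0x11d7'  # locate that thread's stack in the thread dump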
Use jmap to check memory and analyze for memory leaks.
jmap -heap 3331: View java heap usage
jmap -histo 3331: View the number and size of objects in heap memory (histogram)
jmap -histo:live 3331: the JVM triggers a GC first and then collects the statistics
jmap -dump:format=b,file=heapDump 3331: Output details of memory usage to a file
Of course, there are many other commands such as jstack, jinfo, uptime, etc.
The second stage of investigation:
The second occurrence of the exception was accompanied by all kinds of Redis query timeouts, so every query fell through to the DB, which greatly increased database pressure and triggered slow-SQL alarms, among other things.
So, again: what caused our Redis query timeouts?
The first step was to check whether the Redis service's performance metrics were abnormal. They showed little change, so the problem had to be on our own servers.
Checking our own machine's metrics:
During the alarm window the cpu and cpu_iowait metrics spiked and disk IO showed a continuous sawtooth pattern, while the other metrics barely fluctuated.
But what caused the CPU fluctuation? And what is the causal chain between the CPU fluctuation and the thread pool exhaustion?
So the answer starts to emerge: disk IO was high, which pushed cpu_iowait up; threads waiting on IO could not get their CPU time slices promptly, and response time (RT) started to jitter.
Finally, we found that our high disk IO was related to our log collection system. I believe many companies now use a TRACE system for end-to-end tracing across the whole service.
Trace logs buffered in memory are flushed asynchronously to disk in one batch, which is what causes the IO jitter.
Let's see how trace actually works.
Trace is a dynamic tracing tool similar to strace and belongs to the family of problem diagnosis and debugging tools.
1. Trace consists of two parts: collecting trace logs and parsing trace logs.
2. The trace logs it collects are essentially control-flow information recorded while the traced module runs.
Key point: the trace module organizes this information into a binary format and writes it into memory; only when tracing is stopped is the in-memory information exported as a binary file.
This is only meant to offer a line of thinking; everyone's problems and scenarios are different and will lead to different conclusions, so please take it purely as a reference.