
How to troubleshoot a soaring CPU problem in production


This article shares how to troubleshoot and resolve a soaring CPU problem in production. The walkthrough is quite practical, so it is shared here as a reference.

Some time ago we launched a new application. Because traffic had been light, the cluster QPS was only about 5, and the RT of the write interface was about 30 ms.

Recently a new business was onboarded, and the numbers given by the business side were a daily QPS of around 2,000, with peaks possibly reaching 10,000.

So, to evaluate the water level, we ran a stress test. During the test we found that once single-machine QPS reached about 200, the interface RT did not change noticeably, but CPU utilization rose sharply until it was maxed out.

After the stress test stopped, CPU utilization dropped back down immediately.

So we began to investigate what was driving the CPU so high.

Problem troubleshooting and solution

While the stress test was running, we logged in to the machine and started troubleshooting.

The investigation in this case was carried out with Alibaba's open-source Arthas tool; if you do not have Arthas, the commands that ship with the JDK can also do the job.

Before digging in, take a look at overall CPU usage. The easiest way is to check directly with the top command:

top - 10:32:38 up 11 days, 17:56,  0 users,  load average: 0.84, 0.33, 0.18
Tasks:  23 total,   1 running,  21 sleeping,   0 stopped,   1 zombie
%Cpu(s): 95.5 us,  2.2 sy,  0.0 ni, 76.3 id,  0.0 wa,  0.0 hi,  0.0 si,  6.1 st
KiB Mem :  8388608 total,  4378768 free,  3605932 used,   403908 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  4378768 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  3480 admin     20   0 7565624   2.9g   8976 S 241.2 35.8 649:07.23 java
  1502 root      20   0  401768  40228   9084 S   1.0  0.5  39:21.65 ilogtail
181964 root      20   0 3756408 104392   8464 S   0.7  1.2            java
   496 root      20   0 2344224  14108   4396 S   0.3  0.2  52:22.25 staragentd
  1400 admin     20   0 2176952 229156   5940 S   0.3  2.7  31:13.13 java
235514 root      39  19 2204632  15704   6844 S   0.3  0.2  55:34.43 argusagent
236226 root      20   0   55836   9304   6888 S   0.3  0.1  12:01.91 systemd-journ

As you can see, the Java process with PID 3480 is consuming a lot of CPU, so we can conclude that the CPU is being burned executing application code. The next step is to find out which thread, and which code, is responsible.
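(Without Arthas, roughly the same information can be obtained with the JDK's built-in tools mentioned earlier. A minimal sketch, in which the thread ID 3519 and its hex form 0xdbf are made-up examples:)

top -Hp 3480                          # list per-thread CPU usage inside the Java process
printf '%x\n' 3519                    # convert the busiest thread ID to hex (3519 is hypothetical)
jstack 3480 | grep -A 30 'nid=0xdbf'  # find that thread's stack in the thread dump

Arthas simply automates these steps, so that is what we used here.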

First, download and install Arthas:

curl -L http://start.alibaba-inc.com/install.sh | sh

Then start it:

./as.sh

Use the Arthas command thread -n 3 -i 1000 to view the three currently "busiest" (most CPU-consuming) threads; -n 3 limits the output to the top three threads and -i 1000 samples CPU usage over a 1000 ms interval:

From the stack information above, we can see that the CPU-hungry threads are mostly stuck on the TCP socket read at the bottom of JDBC. Running the command several times in a row showed many threads blocked in the same place.

By analysing the call chain, we found that this spot was the database insert in our code, which uses TDDL (Alibaba's internal distributed database middleware) to create a sequence, and creating the sequence requires interacting with the database.

However, from what we know of TDDL, each time it queries the sequence from the database it fetches 1,000 values by default and caches them locally; only after those are used up does it fetch the next 1,000 from the database.
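As a mental model of that behaviour, here is a minimal sketch of step-based sequence caching. This is not TDDL's actual implementation; the AtomicLong merely stands in for the sequence row that TDDL would update in the database.

import java.util.concurrent.atomic.AtomicLong;

public class StepCachedSequence {
    private static final int STEP = 1000;                  // values reserved per "DB" round trip
    private final AtomicLong dbCounter = new AtomicLong(); // stand-in for the sequence table
    private long current;
    private long max;                                      // exclusive upper bound of the cached range

    public synchronized long nextValue() {
        if (current >= max) {
            // only here do we pay the cost of a round trip; the next 999 calls are served locally
            long rangeStart = dbCounter.getAndAdd(STEP);
            current = rangeStart;
            max = rangeStart + STEP;
        }
        return current++;
    }

    public static void main(String[] args) {
        StepCachedSequence seq = new StepCachedSequence();
        for (int i = 0; i < 5; i++) {
            System.out.println(seq.nextValue());           // 0..4, all from the local range
        }
    }
}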

In theory, with a stress-test QPS of only around 300, the code should not have been interacting with the database anywhere near this often. Yet repeated Arthas sampling kept showing most of the CPU being burned here.

So we started examining the code, and eventually found a rather silly problem in how the sequence was created and used:

public Long insert(T dataObject) {
    if (dataObject.getId() == null) {
        Long id = next();
        dataObject.setId(id);
    }
    if (sqlSession.insert(getNamespace() + ".insert", dataObject) > 0) {
        return dataObject.getId();
    } else {
        return null;
    }
}

public Sequence sequence() {
    return SequenceBuilder.create()
            .name(getTableName())
            .sequenceDao(sequenceDao)
            .build();
}

/**
 * Get the next primary key ID.
 */
protected Long next() {
    try {
        // builds a brand-new Sequence on every call, then asks it for a value
        return sequence().nextValue();
    } catch (SequenceException e) {
        throw new RuntimeException(e);
    }
}

Because a new Sequence was rebuilt inside every insert, the local cache was thrown away each time: every insert pulled 1,000 values from the database, used exactly one, and then did it all over again on the next insert. At a few hundred inserts per second, that is roughly a few hundred sequence-range queries per second per machine, instead of roughly one every few seconds.

So the code was adjusted so that the Sequence instance is built once when the application starts. After that, obtaining a value no longer hits the database every time: the local cache is consulted first, and the database is only queried for a new range once the local cache is exhausted.

public abstract class BaseMybatisDAO implements InitializingBean {

    @Override
    public void afterPropertiesSet() throws Exception {
        sequence = SequenceBuilder.create()
                .name(getTableName())
                .sequenceDao(sequenceDao)
                .build();
    }
}

The Sequence is initialized in this method by implementing InitializingBean and overriding afterPropertiesSet(), so it is built exactly once when the application starts.
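With that change, next() can simply reuse the instance held in the sequence field. A sketch under the same assumptions as the snippets above (i.e. TDDL's nextValue() API):

protected Long next() {
    try {
        // reuses the Sequence built once in afterPropertiesSet(), so the locally
        // cached range of 1,000 values survives across inserts
        return sequence.nextValue();
    } catch (SequenceException e) {
        throw new RuntimeException(e);
    }
}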

After changing the code, we submitted it for verification. The monitoring data showed that after the optimization the database read RT dropped significantly:

The QPS of sequence write operations also dropped significantly:

So we started a new round of stress testing, only to find that CPU utilization was still very high and the stress-test QPS still could not be pushed up, so we used Arthas again to look at the threads.

This time a new CPU-consuming thread stack showed up. It was caused by a joint-debugging tool we use: in the pre-release environment this tool enables TDDL log collection by default (the official documentation says collection is off by default in pre-release, but in fact it is on).

The tool desensitizes data while printing logs, and its desensitization framework uses Google's re2j library for regular-expression matching.

Because our code performs many TDDL operations, a large volume of TDDL logs was being collected and desensitized by default, and that was what was really eating the CPU.
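For a sense of what that per-log-line work looks like, here is a hypothetical desensitization pass using re2j (the masking rule and the log line are invented for illustration and are not the tool's actual code; re2j mirrors the java.util.regex API):

import com.google.re2j.Pattern;

public class DesensitizeSketch {
    // hypothetical rule: mask anything that looks like an 11-digit phone number
    private static final Pattern PHONE = Pattern.compile("\\d{11}");

    static String mask(String logLine) {
        // one regex scan per collected log line; at thousands of lines per second
        // this adds up to real CPU time
        return PHONE.matcher(logLine).replaceAll("***");
    }

    public static void main(String[] args) {
        System.out.println(mask("insert into t_user (phone) values ('13800000000')"));
    }
}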

So this problem was solved by turning off TDDL collection by DP in the pre-release environment.

Thank you for reading! That concludes this walkthrough of troubleshooting a soaring CPU problem in production. Hopefully the content above is of some help; if you found the article useful, feel free to share it with others.
