

Methods and Steps of a One-Time Troubleshooting and Tuning Exercise

2025-03-30 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

This article explains the methods and steps of a one-time troubleshooting and tuning exercise. Interested readers may wish to take a look. The approach described here is simple, fast, and practical. Now let the editor walk you through the methods and steps of this troubleshooting and tuning exercise.

(1) Preface

Recently, a very troublesome thing happened. A system that had been online and running smoothly for half a year came close to going down during the morning rush hour after the New Year holiday. Access was very slow, and a large number of TIME_WAIT connections were found in the background. After days and nights of troubleshooting, from the code level to the architecture level to the network level, we finally reached a conclusion.

(2) Architecture and problem description

First, a brief description of the system architecture. The public IP behind the public domain name points to Huawei Cloud's ELB elastic load balancing service. Behind the ELB are two active Nginx servers; behind Nginx, four application servers in a private-network cluster are reached via Direct Connect. The MySQL servers use read-write separation, and the Redis servers are configured as master and slave.

When the outage occurred, the load average on the four cluster servers soared to around 200; on an 8-core server, a load average above 40 already deserves attention. At the same time, the number of TCP connections on Nginx was abnormal, with a large number of TIME_WAIT connections.
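
For reference, both symptoms can be confirmed quickly from a shell on the affected hosts. The commands below are a minimal sketch using standard Linux tools; thresholds and interfaces depend on the actual machines:

    # Show the 1/5/15-minute load averages
    uptime
    # Count TCP connections currently in TIME_WAIT (subtract 1 for the header line)
    ss -tan state time-wait | wc -l
    # Or summarize connection counts by state
    ss -tan | awk 'NR>1 {c[$1]++} END {for (s in c) print s, c[s]}'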

(3) Troubleshooting

The direct cause of the problem is that the system could not withstand the sudden burst of high concurrency. Based on the data above, the first task was to find a way to bring the load average down. When incoming requests exceed the current processing capacity, they queue up, which drives the load average higher. The investigation therefore proceeded along the following lines:

3.1 Check the Nginx configuration

First, the operations staff checked the Nginx configuration to see whether a misconfiguration was producing the large number of TIME_WAIT connections; the configuration turned out to be fine.
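
As an aside, a quick way to review the Nginx settings that most commonly influence connection reuse and TIME_WAIT buildup is to dump the effective configuration. The directive names below are standard Nginx, though which ones matter depends on the deployment:

    # Check that the configuration is syntactically valid
    nginx -t
    # Dump the full effective configuration and look for connection-reuse directives
    nginx -T | grep -nE 'keepalive|proxy_http_version|proxy_set_header Connection'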

3.2 Capture the network requests

The operations staff helped capture the application servers' network packets under high concurrency. Analysis showed that requests to a single interface accounted for 40% of all requests, which raised the question of why that interface was being called so many times.
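
A sketch of how such an analysis might be carried out; the interface, port, and log path are placeholders for whatever the actual deployment uses:

    # Capture traffic on the application port during the peak for offline analysis (port is an example)
    tcpdump -i eth0 -s 0 -w /tmp/peak.pcap 'tcp port 8080'
    # Rank request URIs by frequency from the Nginx access log (combined log format, URI is field 7)
    awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20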

3.3 Check the thread stack information

Because there were so many requests, the next step was to look at the stacks of the threads consuming the most CPU. It turned out that the garbage collection threads were taking up more than 30% of the CPU, which suggested there was some problem with garbage collection.
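
One common way to map the busiest threads to Java stack traces, assuming a HotSpot JVM; the PID and thread id below are placeholders:

    # Show per-thread CPU usage inside the Java process
    top -H -p <java_pid>
    # jstack reports native thread ids (nid) in hex, so convert the busy thread id first
    printf '%x\n' <busy_thread_id>
    # Dump all stacks and find the thread by its nid; GC worker threads are clearly labeled in the dump
    jstack <java_pid> | grep -A 20 'nid=0x<hex_tid>'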

3.4 Check the database connection pool

We checked whether the database connection pool was behaving normally, looking for deadlocks, exhausted connections, and similar problems, and found that everything was fine.
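
For completeness, a few server-side checks that can rule out connection exhaustion or lock contention on MySQL itself (credentials omitted; the pool-side metrics depend on which connection pool the application uses):

    # List current client connections and the statements they are running
    mysql -e "SHOW FULL PROCESSLIST;"
    # Compare the live connection count with the configured ceiling
    mysql -e "SHOW STATUS LIKE 'Threads_connected'; SHOW VARIABLES LIKE 'max_connections';"
    # Look for lock waits and deadlocks reported by InnoDB
    mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -A 15 'DEADLOCK'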

3.5 Check garbage collection

Using the jstat -gcutil command to observe garbage collection, we found that young GC (YGC) was very frequent, about once per second. The following figure shows the garbage collection situation after the peak, when YGC was down to about 8 times a minute. This reinforced the suspicion that there was a problem in the business code.
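
This observation can be reproduced with standard JDK tooling; the PID, interval, and sample count below are placeholders:

    # Print GC statistics every 1000 ms, 60 samples; the YGC column is the cumulative young-GC count,
    # so the difference between consecutive samples gives the young-GC frequency
    jstat -gcutil <java_pid> 1000 60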

(4) Locate the business code

From the information above, we initially suspected the interface that was being called so frequently. Reading the code, we found that this interface was called 7 times on every page refresh, which is far too often. At the same time, the code performed a large amount of JSON conversion and serialization. After consulting some references, we confirmed that this behavior can indeed lead to frequent YGC.

So we optimized the code logic: the interface is now called once per page refresh and returns all the data to the front end in a single response instead of multiple calls, and the unnecessary JSON conversion code was removed. In theory, performance should improve considerably. We packaged, deployed, and kept observing.

(5) The problem still exists

Everything ran smoothly for four days, and then during Friday's morning rush hour the slow responses and high load came back. We were dumbfounded: was there somewhere we still had not checked?

At this point a question came to mind: why did the problem appear only after these servers had been running smoothly for half a year? Thinking about it from another angle, perhaps something was wrong on one of the links from the public IP to the application servers. So we cut off the public network link and pressure-tested the applications over the intranet, and to our surprise the earlier problem did not appear.
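
A minimal sketch of such an intranet pressure test, assuming a tool like wrk is available; the private address, port, and endpoint are placeholders:

    # 8 threads, 200 concurrent connections, 60 seconds against the private-network address
    wrk -t8 -c200 -d60s http://10.0.0.10:8080/api/example-endpoint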

So we checked the Direct Connect traffic from the public network to the private network and found that the outbound bandwidth was reaching only about 200 Mbps during the morning peak, even though the Direct Connect purchased at the time was rated at 500 Mbps. The suspicion, therefore, was that the saturated link was leaving requests blocked on the public-network Nginx servers; users, finding the system slow, refreshed repeatedly, producing the large number of TCP connections in TIME_WAIT. We quickly contacted the service provider to test the link, and sure enough the 500 Mbps bandwidth had been reduced; the relevant people were contacted to deal with the problem.
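
Link saturation like this can also be watched from the host side; a simple approach, assuming the sysstat package is installed and eth0 is the relevant interface:

    # Report per-interface throughput once a second; rxkB/s and txkB/s show how close the link is to its cap
    sar -n DEV 1
    # Alternative without sysstat: read the kernel counters directly (diff two readings to get throughput)
    cat /proc/net/dev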

In addition, some static resources were being served from the private-cloud servers, which also generated heavy traffic. We moved these static resources onto the public-cloud Nginx servers to reduce the network pressure on the Direct Connect.

At this point, I believe you have a deeper understanding of the methods and steps of this one-time troubleshooting and tuning exercise. You might as well try them out in practice. More related content is available on this site; enter the relevant channels to look it up, follow us, and keep learning!

