This article is about how we tracked down and fixed a production problem caused by Redis scan. The approach described here is simple, direct, and practical; if you're interested, read on.
Restart
By the time I got involved, some colleagues were already trying to pin the problem down. As the saying goes, a restart fixes 80% of problems, and if it doesn't, you just haven't restarted enough times. Joking aside: a restart did not fix this one, so we really had to dig in.
Sure enough, a load test after the restart showed no improvement: at 1,000 concurrent users the average response time was still 3-4 seconds, across several consecutive runs.
Upgrade configuration
Since restarting didn't work, we entered stage two: upgrading the hardware. Two 4-core/8 GB instances became six 8-core/16 GB instances, and the database configuration was doubled as well. If money can solve a problem, we generally prefer not to throw too much manpower at it!
As it turned out, the bigger configuration didn't help either: at 1,000 concurrent users, the load test's average response time was still 3-4 seconds.
Now it was getting interesting. At this point, I stepped in.
View Monitoring
After joining, I checked the monitoring: CPU, memory, disk, network I/O, and JVM heap usage on the instances all looked normal. Which was a real headache.
Local pressure test
We split into two groups: one prepared a local load test while the other continued the analysis. The local test showed that in the local environment a single instance handled 1,000 concurrent users without any trouble, with average response times of a few hundred milliseconds.
So the service itself seemed fine.
Code check
With nothing else to go on, we pulled up the code and a group of us read through it together while the developer explained the business logic. He did, of course, get an earful from the bosses about the state of the code.
In fact, they had already changed the code once before I got involved: in one place the Redis scan command had been replaced with keys *. That planted a trap, but it isn't the immediate problem; we'll come back to it later.
Reading further, we found a lot of Redis operations, including a for loop issuing Redis get commands one at a time; the rest were ordinary database operations, all hitting indexes.
So the preliminary conclusion was that the database was probably fine, and the problem was more likely concentrated in Redis: it was simply being called too often.
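As an aside, a for loop of single get calls means one network round trip per key. The team's eventual fix went in a different direction, but a minimal Jedis sketch of batching such reads with mget (the key names here are placeholders, not the project's code) looks like this:

```java
import java.util.List;

import redis.clients.jedis.Jedis;

public class BatchGetSketch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String[] keys = {"user:1", "user:2", "user:3"}; // placeholder keys

            // Anti-pattern: one network round trip per key.
            for (String key : keys) {
                System.out.println(jedis.get(key));
            }

            // Better: fetch the same keys in a single MGET round trip.
            List<String> values = jedis.mget(keys);
            System.out.println(values);
        }
    }
}
```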
Add log
Apart from the scan-to-keys * change (which I still didn't know about at that point), the code review turned up basically nothing, so we added logging, section by section, restarted the service, and ran another round of load testing.
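The added logging was essentially timing instrumentation around the Redis calls; a minimal sketch of the idea (the class name and log format are my own placeholders, not the project's code) might look like:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import redis.clients.jedis.Jedis;

public class TimedRedisGet {
    private static final Logger log = LoggerFactory.getLogger(TimedRedisGet.class);

    // Wrap a Redis GET and log how long it took, so slow calls stand out in the logs.
    public static String timedGet(Jedis jedis, String key) {
        long start = System.currentTimeMillis();
        String value = jedis.get(key);
        log.info("redis get key={} cost={}ms", key, System.currentTimeMillis() - start);
        return value;
    }
}
```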
As expected, the results didn't change, so we turned to analyzing the logs.
From the logs we could see that Redis calls were sometimes fast and sometimes slow, which looked like connection-pool exhaustion: one batch of requests grabs the connections first, while the next batch waits for an idle Redis connection.
Modify the number of redis connections
We checked the Redis setup: standalone mode, 1 GB of memory, and a default connection pool size of 8, on a fairly old Jedis client. We switched to Spring Boot's default Lettuce client, raised the pool size to 50 for a start, restarted the service, and ran another load test.
The average response time dropped from 3-4 seconds to 2-3 seconds, which is not much of an improvement. We kept going: with 1,000 concurrent requests, each performing many Redis operations, there was bound to be waiting, so this time we cranked the pool straight up to 1,000 connections, restarted the service, and tested again.
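The article doesn't show how the pool was configured, but one way to set up a pooled Lettuce connection factory in Spring Data Redis looks roughly like the sketch below (host, port, and pool sizes are placeholders; it assumes spring-data-redis and commons-pool2 are on the classpath):

```java
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisStandaloneConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.data.redis.connection.lettuce.LettucePoolingClientConfiguration;

@Configuration
public class RedisPoolConfig {

    @Bean
    public LettuceConnectionFactory redisConnectionFactory() {
        // Placeholder host/port; in practice these come from application properties.
        RedisStandaloneConfiguration server = new RedisStandaloneConfiguration("localhost", 6379);

        // Commons-pool settings backing the Lettuce connection pool.
        GenericObjectPoolConfig<Object> pool = new GenericObjectPoolConfig<>();
        pool.setMaxTotal(1000); // maximum pooled connections, matching the value we tried
        pool.setMaxIdle(1000);
        pool.setMinIdle(50);

        LettucePoolingClientConfiguration client = LettucePoolingClientConfiguration.builder()
                .poolConfig(pool)
                .build();

        return new LettuceConnectionFactory(server, client);
    }
}
```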
Once again, there was no significant improvement.
Check the log again
With no better ideas, we went back to the logs and looked at the timing of the Redis operations. 99% of the get calls returned quickly, mostly within 0-5 milliseconds, but there were always a few that only came back after about 800 milliseconds.
So Redis itself seemed fine, yet no matter how many more load tests we ran, the response times wouldn't improve.
Completely stuck, at past three in the morning, the leader spoke up: get the XX Cloud people on the phone.
Cloud troubleshooting
So we finally pulled in the XX Cloud staff to help troubleshoot. They weren't thrilled about it, but hey, we're the ones paying!
The person in charge at XX Cloud brought in their Redis experts to look at the Redis metrics, and they finally found that the instance's bandwidth was maxed out, which had triggered its throttling mechanism.
They temporarily tripled the Redis bandwidth and we ran another load test. Holy cow, the average response time immediately dropped to 200-300 milliseconds!
Seriously, that's a nasty trap: throttle us if you must, but at least raise an alert when the bandwidth is maxed out.
What a pain. Still, at this point we thought the problem was solved, and the leaders went off to bed!
On to production
With the cause identified, it was time to load-test production. We asked the XX Cloud experts to triple the bandwidth of the production Redis as well.
We pulled a hotfix branch from the production commit, disabled signature verification, restarted the service, and ran a test. Disaster: production was even worse, with an average response time of 5-6 seconds.
In the test environment we had changed the connection pool configuration, but production was still on Jedis, so we made the same change and tested again. It made no real difference; the response time was still 5-6 seconds.
Utterly frustrating.
View Monitoring
We checked the Redis monitoring in the XX Cloud console; this time the bandwidth and throttling metrics were normal.
What was abnormal this time was the CPU: during the load test, the Redis CPU shot straight up to 100%, which is what made the application respond so slowly.
Wake up the XX Cloud Redis experts again
It was past four in the morning and we were out of ideas. XX Cloud Redis expert, time to get up again!
We woke up the XX Cloud Redis expert once more to analyze things on the backend, and found that 140,000 scan operations had been executed within 10 minutes.
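If you don't have a cloud vendor's backend to lean on, one rough way to check this yourself is Redis's INFO commandstats section, which reports per-command call counts; a small Jedis sketch (host and port are placeholders) is below:

```java
import redis.clients.jedis.Jedis;

public class CommandStats {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // INFO commandstats returns lines such as
            // "cmdstat_scan:calls=...,usec=...,usec_per_call=..."
            String stats = jedis.info("commandstats");
            for (String line : stats.split("\r\n")) {
                if (line.startsWith("cmdstat_scan") || line.startsWith("cmdstat_keys")) {
                    System.out.println(line);
                }
            }
        }
    }
}
```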
Evil scan
We asked the developers where scan was used (it was the spot they had changed earlier, which I didn't know about at the time) and found that every request starts by calling scan to fetch all keys under a certain prefix, 1,000 keys per iteration; checking the total number of keys in Redis showed about 110,000.
In other words, each request has to run scan more than 100 times, and with 1,000 concurrent requests that adds up to over 100,000 scan calls.
We know that both scan (iterated to completion) and keys * have to walk the entire keyspace, which is very CPU-intensive; 140,000 scans sent the Redis CPU straight through the roof.
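The per-request pattern was, roughly, a full cursor iteration over the keyspace. A simplified Jedis 3.x-style sketch of that kind of loop (the prefix is a placeholder, not the project's actual key layout) is below:

```java
import java.util.ArrayList;
import java.util.List;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.ScanParams;
import redis.clients.jedis.ScanResult;

public class PrefixScan {
    // Collect every key matching a prefix by iterating the SCAN cursor to the end.
    public static List<String> keysByPrefix(Jedis jedis, String prefix) {
        List<String> keys = new ArrayList<>();
        ScanParams params = new ScanParams().match(prefix + "*").count(1000);
        String cursor = ScanParams.SCAN_POINTER_START; // "0"
        do {
            ScanResult<String> result = jedis.scan(cursor, params);
            keys.addAll(result.getResult());
            cursor = result.getCursor();
            // Each iteration examines roughly COUNT keys server-side, so ~110,000 keys
            // means ~110 round trips and a full keyspace walk on every request.
        } while (!"0".equals(cursor));
        return keys;
    }
}
```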
Why didn't the test environment CPU go up?
Comparing the total number of keys in the test and production Redis instances: the test environment had only about 900 keys, so each request needed just one scan (or keys *) call, which is no problem at all.
Why were there so many keys in production?
We asked the developers why there were so many keys in production and why no expiration time had been set on them.
The developer said the expiration was supposed to be set, in code written by another colleague. Opening that code up revealed a truly magical piece of logic. It isn't convenient for me to post the actual code, but it decides whether to set the expiration time based on a condition, and our analysis showed that in most cases the expiration never actually got set.
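The real code isn't shown here, but purely as a hypothetical illustration of this class of bug, a write path that only sometimes applies a TTL might look like the sketch below (every name and the condition are made up):

```java
import redis.clients.jedis.Jedis;

public class ConditionalExpireBug {
    // Hypothetical illustration only; this is not the project's actual code.
    public static void cache(Jedis jedis, String key, String value, boolean hotData) {
        jedis.set(key, value);
        if (hotData) {
            // The TTL is applied on only one branch. Every call where the condition
            // is false leaves a key with no expiration, so keys pile up indefinitely.
            jedis.expire(key, 3600);
        }
    }
}
```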
Current solution
By then it was already 4:30 in the morning, and although everyone was still wired, the leader decided we would hold off for the time being.
After all, system A had suspended its calls to system B for the moment, so system B was effectively receiving almost no traffic.
We would fix the problem in two steps during the day:
First, clean up the data in the production Redis, keeping only the small portion that is actually needed.
Second, move the prefix-keyed data that scan was fetching into hash storage, so that any scanning is confined to a single hash rather than the whole keyspace (see the sketch below).
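A minimal sketch of that second step, assuming the prefixed string keys become fields of a single hash (the hash key and field names are placeholders):

```java
import java.util.Map;

import redis.clients.jedis.Jedis;

public class HashStorageSketch {
    private static final String HASH_KEY = "prefix:index"; // placeholder hash key

    // Write: what used to be a "prefix:<id>" string key becomes a field of one hash.
    public static void put(Jedis jedis, String id, String value) {
        jedis.hset(HASH_KEY, id, value);
    }

    // Read: fetch every entry under the prefix with a single HGETALL,
    // instead of walking the whole keyspace with repeated SCAN calls.
    public static Map<String, String> getAll(Jedis jedis) {
        return jedis.hgetAll(HASH_KEY);
    }
}
```

If the hash itself grows very large, HSCAN on that one key keeps the iteration bounded to the hash rather than the entire keyspace.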
And that wraps up this production incident investigation.