2025-01-19 Update | From: SLTechnology News & Howtos
This article explains how to troubleshoot Redis latency problems. The content is straightforward and easy to follow.
Using complex commands
Suppose you notice a sudden increase in access latency when using Redis. How do you troubleshoot it?
As a first step, I suggest checking Redis's slow log. Redis can record statistics on slow commands, and with the following settings we can see which commands take a long time to execute.
First, set the slow log threshold; only commands that exceed it are recorded. The unit is microseconds. Here we set the threshold to 5 milliseconds and keep only the latest 1000 slow log entries:
# Record commands slower than 5 milliseconds
CONFIG SET slowlog-log-slower-than 5000
# Keep only the latest 1000 slow log entries
CONFIG SET slowlog-max-len 1000
After this is set, every command that takes more than 5 milliseconds to execute will be recorded. We can run SLOWLOG get 5 to query the latest 5 slow log entries:
127.0.0.1:6379> SLOWLOG get 5
1) 1) (integer) 32693          # slow log ID
   2) (integer) 1593763337     # execution timestamp
   3) (integer) 5299           # execution time (microseconds)
   4) 1) "LRANGE"              # command and arguments
      2) "user_list_2000"
      3) "0"
      4) "-1"
2) 1) (integer) 32692
   2) (integer) 1593763337
   3) (integer) 5044
   4) 1) "GET"
      2) "book_price_1000"
...
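If you collect slow log entries programmatically, each entry arrives at the protocol level as [id, timestamp, microseconds, [command, args...]]. The helper below is purely illustrative (not from the article) and turns such entries into readable lines:

```python
# Format raw SLOWLOG GET entries as returned at the protocol level:
# [id, unix_timestamp, duration_usec, [command, *args]].
def format_slowlog(entries):
    lines = []
    for entry_id, ts, usec, args in entries:
        cmd = " ".join(a if isinstance(a, str) else a.decode() for a in args)
        lines.append(f"#{entry_id} at {ts}: {usec} us -- {cmd}")
    return lines

# Sample entry mirroring the LRANGE record shown above
sample = [[32693, 1593763337, 5299, ["LRANGE", "user_list_2000", "0", "-1"]]]
print(format_slowlog(sample)[0])
# -> #32693 at 1593763337: 5299 us -- LRANGE user_list_2000 0 -1
```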
From the slow log we can see exactly when the time-consuming commands ran and what they were. If your business frequently uses commands with complexity O(n) or above, such as SORT, SUNION, or ZUNIONSTORE, Redis will spend a lot of time processing data while executing them.
If your request volume is small but the CPU utilization of the Redis instance is high, it is most likely caused by such complex commands.
The solution is to avoid these complex commands and avoid fetching too much data in one call; operate on a small amount of data at a time so that Redis can process it and return promptly.
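For example, instead of a single LRANGE key 0 -1 over a huge list, fetch it in bounded batches. This sketch (batch size and index arithmetic are illustrative, not from the article) computes the start/stop pairs you would pass to successive LRANGE calls:

```python
def range_batches(total_len, batch_size=100):
    """Yield (start, stop) inclusive index pairs for LRANGE key start stop."""
    for start in range(0, total_len, batch_size):
        yield start, min(start + batch_size, total_len) - 1

# A 250-element list is fetched in three bounded calls instead of one O(n) call
print(list(range_batches(250, 100)))
# -> [(0, 99), (100, 199), (200, 249)]
```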
Storing large keys
If the slow log shows commands that are not inherently complex, for example SET and DEL operations, you should suspect that large keys are being written to Redis.
When Redis writes data, it must allocate memory for it, and when data is deleted, the corresponding memory is freed.
If a key holds a very large value, allocating memory for it takes time, and so does freeing that memory when the key is deleted.
You need to check your business code for places that write large keys, evaluate the amount of data written, and avoid storing too much data under a single key.
So is there a way to scan an instance for large keys?
Redis provides one:
redis-cli -h $host -p $port --bigkeys -i 0.01
With this command you can scan the key-size distribution of the whole instance, reported per data type.
Note that scanning for large keys on a production instance causes a sudden rise in QPS. To reduce the impact on Redis, control the scan rate with the -i parameter, which specifies the pause between scan batches, in seconds.
Internally, the command runs SCAN to iterate over all keys, then calls STRLEN, LLEN, HLEN, SCARD, or ZCARD depending on the type to obtain the string length or, for container types (list/hash/set/zset), the number of elements.
For container types, only the key with the most elements is reported, and the key with the most elements does not necessarily occupy the most memory, which is worth keeping in mind. Still, the command gives a good general picture of the key distribution across the instance.
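The per-type bookkeeping that --bigkeys performs can be sketched as follows. The data below is a stand-in for a live instance, since a real scan would issue SCAN plus the type-specific length commands:

```python
def biggest_per_type(keys):
    """keys: iterable of (name, type, size). Return {type: (name, size)}."""
    best = {}
    for name, ktype, size in keys:
        if ktype not in best or size > best[ktype][1]:
            best[ktype] = (name, size)
    return best

# Sizes are STRLEN for strings, element counts for container types
sample = [("price:1", "string", 12), ("user_list", "list", 5000), ("blob:9", "string", 900)]
print(biggest_per_type(sample))
# -> {'string': ('blob:9', 900), 'list': ('user_list', 5000)}
```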
To mitigate the large-key problem, Redis 4.0 officially introduced the lazy-free mechanism, which frees the memory of large keys asynchronously to reduce the impact on performance.
Even so, we do not recommend using large keys: during cluster migration they also hurt migration performance, which will be covered in more detail in later articles on clusters.
Centralized expiration
Sometimes Redis shows no large latency in general, but at certain points in time there is a sudden wave of slowness, and the slow periods are very regular, for example on the hour or at fixed intervals.
If this happens, consider whether a large number of keys are set to expire at the same time.
If many keys expire at a fixed point in time, access latency may rise at exactly that moment.
Redis uses two expiration strategies, active expiration and lazy expiration:
Active expiration: Redis maintains a scheduled task that by default runs every 100 ms, randomly samples 20 keys from the expiry dictionary, and deletes the expired ones. If more than 25% of the sampled keys were expired, it samples another 20 keys and repeats, until the expired fraction drops to 25% or below, or the task has run for more than 25 milliseconds.
Lazy expiration: a key is checked for expiry only when it is accessed; if it has expired, it is deleted from the instance.
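The active-expiration loop above can be modeled in a few lines. This is a toy simulation, not Redis's implementation (sample size and threshold follow the description; the 25 ms time budget is omitted for brevity); it shows why a big backlog of expired keys keeps the loop, and hence the main thread, busy:

```python
import random

def active_expire(expiry, now, sample_size=20, threshold=0.25):
    """expiry: {key: expire_timestamp}. Returns how many keys were deleted."""
    deleted = 0
    while expiry:
        sample = random.sample(list(expiry), min(sample_size, len(expiry)))
        expired = [k for k in sample if expiry[k] <= now]
        for k in expired:
            del expiry[k]
            deleted += 1
        # Stop once 25% or fewer of the sampled keys were expired
        if len(expired) <= threshold * len(sample):
            break
    return deleted

# 100 keys all expired at once: the loop runs until every one is deleted
backlog = {f"key:{i}": 0 for i in range(100)}
print(active_expire(backlog, now=1))
# -> 100
```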
Note that the active-expiration task runs in Redis's main thread. If a large batch of expired keys must be deleted, business requests have to wait until the expiration task finishes, so access latency rises, by up to 25 milliseconds.
Moreover, this latency is not recorded in the slow log, which only records the time spent executing a command. Active expiration runs before the command itself, so if the command's own execution time stays below the slow log threshold, nothing is logged, even though the business observes increased latency.
At this point, check your business code for keys that are set to expire in bulk at the same moment. Such code typically uses the EXPIREAT or PEXPIREAT commands, so searching for those keywords is enough.
If your business genuinely needs a batch of keys to expire around the same time, how can you avoid the resulting Redis jitter?
The fix is to add a random offset when setting the expiry, spreading out the moments at which the keys expire.
The pseudo code can be written as follows:
# Expire at a random point within 5 minutes after the nominal expiry time
redis.expireat(key, expire_time + random(300))
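A runnable version of this pseudo code, assuming a redis-py style client for the commented call (the 300-second window follows the example above):

```python
import random

def jittered_expiry(expire_time, window=300):
    """Return a timestamp at a random offset within `window` seconds after expire_time."""
    return expire_time + random.randint(0, window)

# With a real client this would be: r.expireat(key, jittered_expiry(ts))
ts = 1_700_000_000
t = jittered_expiry(ts)
print(ts <= t <= ts + 300)
# -> True
```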
This way, Redis no longer has to delete a large batch of keys at a single instant, so the main thread is not blocked by the deletion pressure.
Besides being careful about this in business code, you should also catch the situation through operations monitoring.
The approach is to monitor Redis's runtime metrics, all of which can be obtained with the INFO command. The one to watch here is expired_keys, the cumulative number of expired keys the instance has deleted so far.
Alert when this metric jumps sharply within a short period, then compare the alert time with the business's slow periods; if they match, the latency increase can be attributed to centralized expiration.
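The alerting rule amounts to comparing successive expired_keys readings; the threshold below is an illustrative value you would tune for your own instance:

```python
def expired_keys_spike(prev, curr, threshold=10_000):
    """prev, curr: successive cumulative expired_keys readings from INFO."""
    return (curr - prev) > threshold

# 50k keys expired between two samples: worth an alert
print(expired_keys_spike(1_000_000, 1_050_000))
# -> True
```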
The instance's memory reaches its limit
When we use Redis as a pure cache, we often set a maxmemory limit on the instance and enable an LRU eviction policy.
Once the instance's memory reaches maxmemory, you may find that each write of new data gets slower.
The reason is that when memory reaches maxmemory, Redis must first evict some data before writing new data, to keep memory usage below the limit.
This eviction of old data also takes time, and how long depends on the configured eviction policy:
allkeys-lru: evict the least recently accessed keys, whether or not they have an expiry set
volatile-lru: evict the least recently accessed keys among those with an expiry set
allkeys-random: evict random keys, whether or not they have an expiry set
volatile-random: evict random keys among those with an expiry set
volatile-ttl: evict keys with an expiry set, those closest to expiring first
noeviction: evict nothing; once memory is full, writes return an error
allkeys-lfu: evict the least frequently accessed keys, whether or not they have an expiry set (4.0+)
volatile-lfu: evict the least frequently accessed keys among those with an expiry set (4.0+)
The specific strategy to use needs to be decided according to the business scenario.
The most commonly used policies are allkeys-lru and volatile-lru. Their logic is: randomly sample a batch of keys (the sample size is configurable), evict the least recently accessed one, keep the remaining candidates in a small pool, then sample another batch and evict the least recently accessed key among the new samples and the pooled candidates, looping until memory drops below maxmemory.
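This sampling logic can be modeled roughly as below. It is a toy illustration of approximate LRU, not Redis's actual implementation; idle_times stands in for the per-key idle clock, and the pool cap of 16 mirrors Redis's small fixed-size eviction pool:

```python
import random

def evict_one(idle_times, sample_size=5, pool=None):
    """idle_times: {key: seconds since last access}. Evicts and returns one key."""
    pool = [] if pool is None else pool
    sample = random.sample(list(idle_times), min(sample_size, len(idle_times)))
    pool.extend((idle_times[k], k) for k in sample)
    pool.sort(reverse=True)   # coldest candidates (largest idle time) first
    del pool[16:]             # keep the candidate pool small
    _, victim = pool.pop(0)
    idle_times.pop(victim, None)
    return victim

# With only 3 keys the whole keyspace is sampled, so the coldest key is evicted
data = {"hot": 1, "warm": 50, "cold": 500}
print(evict_one(data))
# -> cold
```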
The allkeys-random and volatile-random policies are much faster: because eviction is random, no time is spent comparing access recency; a batch of keys is sampled and evicted directly, so these policies are quicker than the LRU ones above.
But all of this logic runs before the actual command is executed, which means it delays the command we are issuing.
Furthermore, if the instance holds large keys, evicting a large key to free memory takes longer still, which deserves special attention.
If your traffic is very high, you must cap instance memory with maxmemory, and you are facing eviction-induced latency, then besides avoiding large keys and using a random eviction policy, consider splitting the instance. Splitting spreads one instance's eviction pressure across several instances, which can reduce latency to some extent.
Fork takes a lot of time
If automatic RDB snapshots and AOF rewrites are enabled, Redis access latency may rise while an RDB snapshot or AOF rewrite is being generated in the background, and the latency disappears once these tasks complete.
This kind of latency is usually caused by the RDB-generation and AOF-rewrite tasks themselves.
To generate an RDB file or rewrite the AOF, the parent process forks a child process to persist the data. During the fork, the parent must copy its memory page tables to the child. If the instance occupies a lot of memory, copying the page tables takes time and consumes significant CPU; until the fork completes, the whole instance is blocked and cannot serve any request. If CPU is tight at that moment, the fork takes even longer, possibly reaching the level of seconds, which seriously hurts Redis performance.
For the underlying details, see the earlier article on how Redis persistence works and the comparison of RDB and AOF.
You can run the INFO command and check latest_fork_usec, the duration of the most recent fork, in microseconds. This is the time during which the whole instance was blocked and could not process requests.
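A small parser for this field, applied here to a fabricated INFO excerpt (real INFO output contains many more lines):

```python
def latest_fork_usec(info_text):
    """Extract latest_fork_usec from raw INFO output; None if absent."""
    for line in info_text.splitlines():
        if line.startswith("latest_fork_usec:"):
            return int(line.split(":", 1)[1])
    return None

sample = "loading:0\r\nlatest_fork_usec:5342\r\nmigrate_cached_sockets:0"
print(latest_fork_usec(sample))  # 5342 us, i.e. about 5.3 ms of blocking
# -> 5342
```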
Besides RDB generation for backups, when a master and a slave first establish data synchronization, the master also generates an RDB file for the slave's full sync, which likewise affects Redis performance.
To avoid this, plan your backup schedule carefully: run backups on a slave node, preferably during low-traffic periods. If the business is not sensitive to data loss, consider not enabling AOF and AOF rewriting at all.
In addition, fork time depends on the underlying system: it is longer if Redis runs on a virtual machine. So it is advisable to deploy Redis on physical machines to reduce the impact of fork.
Binding CPUs
When deploying services on multi-CPU machines, we often pin a process to a CPU to improve performance and reduce the cost of context switching.
With Redis, however, we advise against this, for the following reason.
If Redis is bound to a CPU, the child process created by fork for persistence inherits the parent's CPU affinity. The child then consumes a lot of CPU for persistence and competes with the main process for the same CPU, starving the main process of CPU resources and increasing its access latency.
So when deploying a Redis process that needs RDB or AOF rewriting enabled, never bind it to a CPU!
Enabling AOF
As mentioned above, AOF rewriting increases latency because of the time-consuming fork. Beyond that, enabling AOF with an ill-chosen flush policy can also cause performance problems.
With AOF enabled, Redis appends write commands to the AOF file, but file writes go to an in-memory buffer first; the buffered content is flushed to disk only when it exceeds a threshold or after a certain time.
To control the safety of flushing to disk, AOF offers three flush policies:
appendfsync always: fsync after every write; the largest performance impact and highest disk IO, but the best data safety.
appendfsync everysec: fsync once per second; relatively small performance impact, with up to 1 second of data lost if the node goes down.
appendfsync no: leave flushing to the operating system; the smallest performance impact but the weakest safety; how much data is lost on a crash depends on the OS flush timing.
With the first policy, appendfsync always, Redis flushes the command to disk every time it processes a write, and this happens in the main thread.
Flushing to disk on every write puts a heavy IO load on the disk, and disk operations are far more expensive than memory operations. Under heavy write traffic, disk IO spikes and drags down Redis performance, so we do not recommend this policy.
appendfsync everysec flushes once per second, while appendfsync no depends on the OS's flush timing and offers weak safety. We therefore recommend appendfsync everysec: in the worst case only 1 second of data is lost, while access performance remains good.
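In redis.conf these recommendations would look like the fragment below (a sketch; adjust to your durability requirements):

```conf
# redis.conf fragment: enable AOF with the per-second flush policy
appendonly yes
appendfsync everysec
```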
Of course, for some business scenarios, it is not sensitive to data loss, or you can turn off AOF.
Using Swap
If Redis suddenly becomes very slow, with each access taking hundreds of milliseconds or even seconds, check whether the Redis process is using Swap; if so, Redis is essentially unable to provide high-performance service.
The operating system's Swap mechanism moves part of the data in memory to disk when memory runs short, to relieve memory pressure.
But once data has been swapped to disk, accessing it requires reading from disk, which is far slower than memory.
For a high-performance in-memory database like Redis, whose users are extremely latency-sensitive, operation times after memory has been swapped to disk are unacceptable.
Check the machine's memory usage to confirm whether Swap is being used because memory is insufficient.
If Swap is indeed in use, free up memory promptly, leave enough memory for Redis, and then release Redis's Swap so that Redis works from RAM again.
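On Linux, the per-process check can be done by summing the Swap lines of /proc/&lt;pid&gt;/smaps. The pgrep lookup in the comment is illustrative; the demo below runs the same awk over sample smaps lines so it is self-contained:

```shell
# On a live host (illustrative pid lookup):
#   awk '/^Swap:/ {s += $2} END {print s " kB swapped"}' /proc/"$(pgrep -x redis-server)"/smaps
# Self-contained demo over sample smaps lines:
printf 'Swap: 0 kB\nSwap: 4 kB\nSwap: 128 kB\n' \
  | awk '/^Swap:/ {s += $2} END {print s " kB swapped"}'
# -> 132 kB swapped
```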
Releasing a Redis process's Swap usually requires restarting the instance. To avoid impacting the business, perform a master-slave switchover first, release the old master's Swap and restart its service, and switch back after data synchronization completes.
As you can see, once Redis is using Swap its high performance is effectively gone, so this situation must be prevented in advance.
Monitor the memory and Swap usage of the machines running Redis, alert promptly when memory runs short or Swap is used, and handle it in time.
Network card overload
Suppose you have avoided all the scenarios above, and Redis ran stably for a long time, but after a certain point in time access started slowing down and has stayed slow ever since. What causes this?
We have run into this before: the symptom is that things slow down after some point in time and stay slow. In that case, check the machine's network card traffic to see whether the NIC bandwidth is saturated.
When the NIC is overloaded, data transmission delays and packet loss occur at the network and TCP layers. Apart from memory, Redis's high performance rests on network IO, so a sudden surge of requests can saturate the NIC.
If this happens, identify which Redis instance on the machine is generating the excess traffic and filling the bandwidth, then confirm whether the surge is legitimate business traffic. If it is, scale out or migrate the instance promptly to keep the machine's other instances from being affected.
At the operations level, monitor the machine's metrics, including network traffic, alert ahead of time when thresholds are reached, and confirm with the business and scale out in good time.
Thank you for reading. That covers how to troubleshoot Redis latency problems; you should now have a deeper understanding of the topic, and the specifics still need to be verified in practice.