This article describes how to troubleshoot key expiration problems in Redis. The content is fairly detailed; interested readers can use it for reference, and I hope it will be helpful to you.
Preliminary investigation
The affected team and the cache team began a preliminary investigation. We found that the increase in latency coincided with the key evictions that had started to occur. When Redis receives a write but has no memory left to store it, it interrupts what it is doing, evicts a key, and then stores the new one. We still needed to find out what was causing the increase in memory usage that triggered these new evictions.
We suspected that memory was full of keys that had expired but had not yet been deleted. One suggestion was to run a scan, which reads every key and in doing so causes keys that have already expired to be deleted.
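As an illustration of that suggestion (not the tooling we actually used), the sketch below uses the hiredis C client to walk the keyspace with SCAN and touch each returned key with EXISTS; reading a key is enough to make the server check its TTL and passively delete it if it has expired. It assumes a local instance on the default port and simple text-only key names.

#include <stdio.h>
#include <string.h>
#include <hiredis/hiredis.h>

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) { fprintf(stderr, "connect failed\n"); return 1; }

    char cursor[64] = "0";
    do {
        /* SCAN returns a new cursor plus a batch of key names. */
        redisReply *r = redisCommand(c, "SCAN %s COUNT 1000", cursor);
        if (r == NULL || r->type != REDIS_REPLY_ARRAY) { if (r) freeReplyObject(r); break; }
        snprintf(cursor, sizeof(cursor), "%s", r->element[0]->str);

        /* Touch each key with a cheap read; the server checks the TTL on
         * access and deletes the key if it has already expired. */
        redisReply *keys = r->element[1];
        for (size_t i = 0; i < keys->elements; i++) {
            redisReply *e = redisCommand(c, "EXISTS %s", keys->element[i]->str);
            if (e) freeReplyObject(e);
        }
        freeReplyObject(r);
    } while (strcmp(cursor, "0") != 0);

    redisFree(c);
    return 0;
}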
In Redis, a key can expire in two ways: actively or passively. A scan triggers passive expiration: when a key is read, its TTL is checked, and if the TTL has elapsed, the key is deleted and nothing is returned. Active expiration in version 3.2 is described in the Redis documentation. It starts in a function called activeExpireCycle, which runs several times a second on an internal timer known as the cron. The activeExpireCycle function walks through each keyspace, checks random keys that have a TTL set, and, if the percentage of them that turn out to be expired crosses a threshold, repeats the process until a time limit is reached.
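The following is a heavily simplified sketch of those two paths. The names are modeled on the Redis source (expireIfNeeded for the passive path, activeExpireCycle for the active one), but the real code handles replication, clustering, and time bookkeeping that is omitted or stubbed here (overTimeLimit is a placeholder), and the active side is shown in its simplest form, with every database checked, before the 3.2 changes discussed below.

/* Passive path: run whenever a key is looked up. */
int expireIfNeeded(redisDb *db, robj *key) {
    long long when = getExpire(db, key);        /* stored expiry time, -1 if no TTL */
    if (when < 0) return 0;                     /* no TTL set                        */
    if (mstime() < when) return 0;              /* TTL not reached yet               */
    dbDelete(db, key);                          /* expired: delete on access         */
    return 1;
}

/* Active path: called from the cron timer several times per second. */
void activeExpireCycle(void) {
    for (int j = 0; j < server.dbnum; j++) {
        redisDb *db = server.db + j;
        int expired;
        do {
            /* Sample a handful of random keys that have a TTL set. */
            expired = 0;
            for (int i = 0; i < ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP; i++) {
                dictEntry *de = dictGetRandomKey(db->expires);
                if (de == NULL) break;
                if (activeExpireCycleTryExpire(db, de, mstime())) expired++;
            }
            /* Stop early if we run past the per-call time budget. */
            if (overTimeLimit()) return;
            /* Keep sampling this database while a large fraction of the
             * sampled keys turn out to be expired. */
        } while (expired > ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP / 4);
    }
}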
Scanning all of the keys in this way was effective: when the scan completed, memory usage dropped. It seemed that Redis was no longer expiring keys efficiently. However, the solution at the time was to increase the size of the cluster and add more hardware, so that keys would be spread more thinly and more memory would be available. This was disappointing, because the Redis upgrade project mentioned earlier was meant to reduce the size and cost of running these clusters by making them more efficient.
Redis versions: what changed?
The implementation of activeExpireCycle changed between Redis 2.4 and 3.2. In Redis 2.4, every database was checked on every run; in Redis 3.2, there is a maximum number of databases that can be checked per run. Version 3.2 also introduced a "fast" variant of the check: the "slow" cycle runs on the timer, while the "fast" one runs before events on the event loop. The fast expiration cycle returns early under certain conditions, and it also has a lower timeout and exit threshold. The time limit is also checked more frequently. In total, about 100 lines of code were added to this function.
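A rough sketch of the shape of the 3.2-era function, keeping only the parts discussed here: the per-call database cap and the fast and slow entry points with their early exits. The constant names follow the Redis source; fastCycleRanRecently and slowCycleTimeLimit are placeholders for logic left out of the sketch.

#define ACTIVE_EXPIRE_CYCLE_SLOW 0     /* invoked from the periodic cron timer        */
#define ACTIVE_EXPIRE_CYCLE_FAST 1     /* invoked just before entering the event loop */

void activeExpireCycle(int type) {
    static int timelimit_exit = 0;     /* did the previous call hit its time limit? */

    if (type == ACTIVE_EXPIRE_CYCLE_FAST) {
        /* Fast cycles bail out early: only run if the last call ran out of
         * time, and not if another fast cycle finished very recently. */
        if (!timelimit_exit) return;
        if (fastCycleRanRecently()) return;
    }

    /* Cap the number of databases inspected per call, unless there are
     * fewer databases than the cap or the previous call was cut short. */
    int dbs_per_call = CRON_DBS_PER_CALL;
    if (dbs_per_call > server.dbnum || timelimit_exit)
        dbs_per_call = server.dbnum;

    /* The fast cycle also gets a much smaller time budget, and the limit
     * is checked more often inside the sampling loops. */
    long long timelimit = (type == ACTIVE_EXPIRE_CYCLE_FAST)
                        ? ACTIVE_EXPIRE_CYCLE_FAST_DURATION
                        : slowCycleTimeLimit();

    for (int j = 0; j < dbs_per_call; j++) {
        /* ... sample keys with a TTL, delete the expired ones, and set
         *     timelimit_exit if 'timelimit' is exceeded ... */
    }
}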
Further investigation
We recently had time to go back and re-examine this memory usage issue. We wanted to understand why the regression appeared, and then see how we could implement key expiration better. Our first thought was that with so many keys in Redis, sampling only 20 at a time is not enough. The other thing we wanted to study was the impact of the database limit introduced in Redis 3.2.
The way we scale and handle sharding makes running Redis at Twitter unusual. We have keyspaces that contain millions of keys, which is uncommon among Redis users. Shards are represented as keyspaces, so each Redis instance can hold multiple shards, and our Redis instances have a large number of keyspaces. Sharding combined with Twitter's scale creates dense back ends with a very large number of keys and databases.
Testing expiration improvements
The number of keys sampled on each loop is set by the configuration variable ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP. I decided to test three values, run them in one of the problematic clusters, then run a scan and measure the difference in memory usage before and after. A large before/after difference indicates a large amount of expired data waiting to be collected. The test initially showed positive results for memory use.
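As a sketch of the before/after measurement (illustrative only, not our actual test harness), the program below reads used_memory from INFO memory, would run the full-keyspace scan from the earlier sketch in between, and prints how much memory the scan reclaimed. It assumes a local instance on the default port.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <hiredis/hiredis.h>

/* Read used_memory (bytes) out of the INFO memory reply. */
static long long usedMemory(redisContext *c) {
    redisReply *r = redisCommand(c, "INFO memory");
    long long bytes = -1;
    if (r && r->type == REDIS_REPLY_STRING) {
        char *p = strstr(r->str, "used_memory:");
        if (p) bytes = atoll(p + strlen("used_memory:"));
    }
    if (r) freeReplyObject(r);
    return bytes;
}

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) return 1;
    long long before = usedMemory(c);
    /* ... run the full-keyspace scan from the earlier sketch here ... */
    long long after = usedMemory(c);
    printf("reclaimed: %lld bytes\n", before - after);
    redisFree(c);
    return 0;
}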
The test had one control and three test instances that sampled more keys. The values 500 and 200 were arbitrary; the value 300 came from a statistical sample-size calculator, using the total number of keys as the population size. In the chart above, even looking only at the initial numbers for the test instances, it is clear that they perform better. Comparing this with the figures from running a scan suggests that the overhead of expired keys is about 25%.
Although sampling more keys helps us find more expired keys, the resulting latency penalty was more than we could tolerate.
The figure above shows the 99.9th percentile latency in milliseconds. It shows that latency correlates with the number of keys sampled. Orange is a value of 500, green is 300, blue is 200, and the control is yellow. The lines match the colors in the table above.
After seeing that latency was affected by the sample size, I wondered whether the sample size could be adjusted automatically according to how many keys were expired. When many keys are expiring, latency takes a hit; but when there is little work left to do, we scan fewer keys and execute faster.
The idea basically works: memory usage is lower, latency is unaffected, and a metric tracking the sample size shows it rising and falling over time. However, we did not adopt this solution. It introduced some latency spikes that did not occur in our control instances. The code was also somewhat convoluted, hard to explain, and unintuitive. We would also have had to tune it for every misbehaving cluster, and we want to avoid adding operational complexity.
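For concreteness, here is a minimal sketch of the adaptive idea: bound the sample size and adjust it by the fraction of the last batch that turned out to be expired. The names, bounds, and thresholds are illustrative, not the code we actually ran.

/* Illustrative bounds; not the values from our experiment. */
#define SAMPLES_MIN   20
#define SAMPLES_MAX  500

static int samples_per_loop = SAMPLES_MIN;

/* Called after each sampling loop with how many of the sampled keys turned
 * out to be expired. Grow the sample size while most samples are expired
 * (lots of work to do); shrink it when few are (little work left). */
static void adjustSampleSize(int sampled, int expired) {
    if (sampled == 0) return;
    if (expired * 4 > sampled) {                 /* more than 25% expired */
        samples_per_loop *= 2;
        if (samples_per_loop > SAMPLES_MAX) samples_per_loop = SAMPLES_MAX;
    } else {
        samples_per_loop /= 2;
        if (samples_per_loop < SAMPLES_MIN) samples_per_loop = SAMPLES_MIN;
    }
}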
Investigating the differences between versions
We also wanted to investigate what changed between Redis versions. The newer version of Redis introduced a variable called CRON_DBS_PER_CALL, which sets the maximum number of databases checked on each run of the cron. To test the impact of this variable, we simply commented out these lines:
//if (dbs_per_call > server.dbnum || timelimit_exit)
//    dbs_per_call = server.dbnum;
This compares the effect of checking all databases on every run against checking only a limited number per run. Our benchmark results were very exciting. However, our test instance has only one database, so logically this line of code should make no difference between the modified and unmodified versions: the variable always ends up with the same value.
The 99.9th percentile latency, in microseconds: unmodified Redis above, modified Redis below.
We began to study why commenting out this line makes such a big difference. Since it is an if statement, the first thing we suspected was branch prediction. We used gcc's __builtin_expect to change how the code was compiled. However, this had no impact on performance.
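For reference, this is the kind of annotation we mean: gcc's __builtin_expect lets you tell the compiler which way a branch usually goes, and it is commonly wrapped in likely/unlikely macros. Applied to the line in question it would look roughly like this (illustrative; as noted, it made no measurable difference for us).

/* Common wrappers around gcc's branch-prediction hint. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hint that the cap is rarely applied, so the compiler can lay out the
 * generated code with the expected path falling straight through. */
if (unlikely(dbs_per_call > server.dbnum || timelimit_exit))
    dbs_per_call = server.dbnum;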
Next, we looked at the generated assembly to see exactly what was happening.
The if statement compiles down to three important instructions: mov, cmp, and jg. Mov loads some memory into a register, cmp compares two registers and sets the processor's flags based on the result, and jg performs a conditional jump based on those flags. The code jumped to is the code in the if block or the else block. I took out the if statement and put the compiled assembly into Redis, then tested the effect of each instruction by commenting out different lines. I tested the mov instruction to see whether there was a performance problem loading memory or in the cpu cache, but found no difference. I tested the cmp instruction and found no difference. When I ran the test with the jg instruction included, latency rose back to the unmodified level. Having found this, I tested whether it was any jump at all or specifically the jg instruction. I added an unconditional jump instruction, jmp, that jumped and then jumped back into the code, and it ran without any performance loss.
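To make the mapping concrete, the comments below show the role each of the three instructions plays for this branch; the exact registers, jump direction, and layout depend on the compiler and optimization flags, so this is only an illustration.

/* The branch in question, annotated with what the compiler emits for it: */
if (dbs_per_call > server.dbnum || timelimit_exit)  /* mov: load server.dbnum from memory into a register     */
                                                    /* cmp: compare it with dbs_per_call, setting the flags   */
    dbs_per_call = server.dbnum;                    /* jg:  jump into (or over) this block based on the flags */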
We spent some time looking at different performance metrics and tried some of the custom counters listed in the cpu manual, but reached no conclusion about why a single instruction would cause such a performance problem. We have some ideas related to instruction cache buffers and how the cpu behaves when executing a jump, but we ran out of time and may return to this in the future if possible.
Resolution
Now that we had a good understanding of the causes of the problem, we had to choose a solution. Our decision was to make a simple change: make the sample size configurable as a startup option. That way we can find a good balance between latency and memory usage for each cluster. Even though removing the if statement produced such a large improvement, it is hard to justify a change we cannot explain.
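A minimal sketch of that change, assuming a hypothetical startup option (here called active-expire-samples) that overrides the compile-time default; the option name and plumbing are illustrative, not the exact patch we shipped.

#include <stdlib.h>

/* Default stays at the upstream value. */
static int active_expire_samples = ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP;

/* At startup, let the operator override it, e.g. from a line such as
 * "active-expire-samples 300" in the config file (hypothetical option). */
void loadActiveExpireSamples(const char *value) {
    int v = atoi(value);
    if (v > 0) active_expire_samples = v;
}

/* activeExpireCycle() then samples 'active_expire_samples' keys per loop
 * instead of the hard-coded constant, so each cluster can pick its own
 * balance between latency and memory reclaimed. */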
That is all we have to share on solving key expiration in Redis. I hope the content above is helpful and teaches you something new. If you found the article useful, feel free to share it so more people can see it.