This article analyzes a real case of cache avalanche in a Redis architecture, then explains the related concepts of cache avalanche, cache penetration, and cache breakdown, together with practical solutions for each.
1 A real case
After an optimized release of the real-time information query feature of our cloud office system, the system went down (it hung and pages could not be loaded).
1.1 Background
One of the original features of our IM is that hovering the mouse over a user's profile picture displays that user's basic information. The data is simple: user name, nickname, gender, email, phone, and a few other basic fields.
This is a typical data query. When a user's basic information is accessed, we first look in Redis. On a miss, we load all roughly 20,000 user records from the database in one batch and store them in Redis. Because the basic information lives in a single, small table, this never caused a problem.
The original process worked as follows.
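A minimal sketch of that original flow, assuming a Jedis client and a Redis hash keyed by user ID (the class, key, and method names here are illustrative, not the actual code):

    import java.util.HashMap;
    import java.util.Map;
    import redis.clients.jedis.Jedis;

    public class BasicInfoCache {
        // Original flow: check Redis first; on a miss, load all ~20,000 users
        // from the single user table in one batch and cache them together.
        public static String getBasicInfo(Jedis jedis, String userId) {
            String cached = jedis.hget("user:basic", userId);
            if (cached != null) {
                return cached;
            }
            Map<String, String> allUsers = loadAllUsersFromDb(); // one batch query
            jedis.hset("user:basic", allUsers);                  // cache everyone at once
            return allUsers.get(userId);
        }

        private static Map<String, String> loadAllUsersFromDb() {
            return new HashMap<>(); // placeholder for the real single-table query
        }
    }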
After the feature was optimized, the query collected not only the user's basic information but also educational experience, work experience, medals, and so on.
This information lives in different tables, so assembling it requires a complex multi-table join; some of the underlying tables are large, and the query runs slowly.
Fetching every user at once and storing the result in one Redis node is clearly no longer workable: the batch query is slow on the database side, and the data set is too large for a single Redis node.
So the developers changed the approach: fetch the full profile of a single user at a time and cache it, building one cache entry per user.
1.2 Problem handling
This looked harmless. But after the release that night, the system stuttered between 10:00 and 11:00 the next morning and finally hung, with database memory and CPU usage spiking.
The first response was to degrade: the program was rolled back to the version that serves only basic information, and the front end displayed empty values for the rest by default. Analysis later confirmed that the cause was a cache avalanche.
In the newly released system, the cache pool was empty. At the 10:00 morning peak, large numbers of people opened the IM and the system began building everyone's cache entries for the first time. Masses of requests missed the cache and passed straight through to the database, causing an instantaneous explosion of DB requests. This is a typical cache avalanche.
In addition, because all entries were given similar expiration times (8 hours), another cache avalanche was waiting to happen when they all expired together.
Emergency plan: add proper cache-handling mechanisms, using a Bloom filter, a null initial value, and randomized cache expiration times to prevent cache breakdown and cache avalanche.
Final solution: revert to caching company-wide employee information as before, and optimize the SQL that fetches it based on the execution plan and the slow query log, removing unneeded fields and meaningless joins.
2 Cache avalanche
2.1 Concept
A cache avalanche occurs when a large number of keys are given the same expiration time, so all of them become invalid at once. The moment they do, requests flood the database, pressure spikes, and the system collapses like an avalanche.
In the case above, data accessed for the first time had simply never been cached, which has the same effect as simultaneous expiration: when the peak arrived, masses of requests missed the cache and fell straight through to the database, causing an instantaneous explosion of DB requests.
2.2 Solution analysis
2.2.1 Cache cluster + database cluster
When sizing the system, you should anticipate heavy traffic later on, so make the cache cluster highly available before an avalanche can happen. With Redis, master-slave replication plus Sentinel, or Redis Cluster, can prevent a total Redis outage.
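For reference, a minimal Sentinel configuration sketch (the master name, address, and quorum of 2 are assumptions for illustration):

    # sentinel.conf (minimal sketch; values are illustrative)
    sentinel monitor mymaster 127.0.0.1 6379 2
    sentinel down-after-milliseconds mymaster 5000
    sentinel failover-timeout mymaster 60000
    sentinel parallel-syncs mymaster 1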
The database needs the same high-availability protection, because once the cache fails, the real test is the database's resilience. A one-master, N-slave setup, or even a full database cluster, deserves attention here.
2.2.2 Appropriate rate limiting and degradation
You can use Hystrix for rate limiting plus degradation. In the case above, suppose 10,000 requests arrive at once but the system's capacity is only 5,000 TPS; the remaining 5,000 requests then follow the rate-limiting logic.
You can define default values and fall back to your own degradation logic, protecting MySQL from being crushed by the excess requests. Besides Hystrix, Alibaba's Sentinel and Google's Guava RateLimiter are good choices:
Sentinel: leaky bucket algorithm
RateLimiter: token bucket algorithm
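A minimal rate-limiting sketch with Guava's RateLimiter, assuming the 5,000 TPS capacity from the example above (the class and method names are illustrative):

    import com.google.common.util.concurrent.RateLimiter;

    public class RateLimitedQuery {
        // Token bucket sized to the assumed capacity of 5,000 requests per second.
        private static final RateLimiter LIMITER = RateLimiter.create(5000.0);

        public static String queryUserInfo(String userId) {
            // tryAcquire() returns immediately; false means no token is available.
            if (!LIMITER.tryAcquire()) {
                return null; // degrade: the caller shows a default/empty value
            }
            return loadFromCacheOrDb(userId); // normal path
        }

        private static String loadFromCacheOrDb(String userId) {
            return "info-of-" + userId; // placeholder for the real cache + DB lookup
        }
    }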
You can also consider a local cache as a buffer, to avoid a total crash when the Redis cluster is unavailable.
2.2.3 Random expiration time
When setting an expiration time, add a random component so that each key's expiration is spread out and keys do not all become invalid at the same moment.
Our team's practice for the random value is: n * 3/4 + random(0, n/4). For example, if you originally planned an 8-hour expiration for a cache entry, this becomes a fixed 6 hours plus a random 0 to 2 hours.
This spreads expirations uniformly between 6 and 8 hours.
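A sketch of that jitter formula (the helper and key names below are our own illustration):

    import java.util.concurrent.ThreadLocalRandom;
    import redis.clients.jedis.Jedis;

    public class CacheTtl {
        // n * 3/4 + random(0, n/4): a planned 8-hour TTL becomes a uniform 6-8 hours.
        public static int randomizedTtlSeconds(int plannedSeconds) {
            int base = plannedSeconds * 3 / 4;
            int jitter = ThreadLocalRandom.current().nextInt(plannedSeconds / 4 + 1);
            return base + jitter;
        }

        public static void cacheUserInfo(Jedis jedis, String userId, String info) {
            // Planned TTL of 8 hours, spread out so keys do not expire together.
            jedis.setex("user:info:" + userId, randomizedTtlSeconds(8 * 3600), info);
        }
    }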
2.2.4 Cache warm-up
In the case above, strictly speaking, the new feature's entries had not expired; the cache had never been built at all. Either way, the remedy is the same: build part of the cache before the peak period to avoid excessive instantaneous pressure.
If 10:00 is the peak, you can build up most of the cache gradually between 8:00 and 10:00 in advance.
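A warm-up sketch under those assumptions (the user-ID source, pacing, and TTL are illustrative):

    import java.util.List;
    import redis.clients.jedis.Jedis;

    public class CacheWarmUp {
        // Run between 8:00 and 10:00: pre-build most entries before the peak.
        public static void warmUp(Jedis jedis, List<String> userIds) throws InterruptedException {
            for (String userId : userIds) {
                String info = queryDb(userId);                  // assemble the full profile
                jedis.setex("user:info:" + userId, 8 * 3600, info);
                Thread.sleep(10);                               // pace writes to avoid a load spike
            }
        }

        private static String queryDb(String userId) {
            return "profile-of-" + userId; // placeholder for the real join query
        }
    }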
3 Cache penetration
3.1 Concept
Cache penetration means accessing a key that does not exist at all. The cache can never help, so every such request penetrates through to the DB, and a burst of such traffic can kill the database.
For example, when we query a user's information, the program looks it up in the cache by user number and, on a miss, searches the database. If a non-existent number such as XXXXXXXX is supplied, nothing ever matches, and every request passes through the cache into the database.
The risk is serious: if large numbers of non-existent numbers are queried for some reason, or forged numbers are used in a deliberate attack, the result is a disaster.
3.2 Solution analysis
3.2.1 Cache null values
Penetration happens because the cache holds no entry representing the empty result; the key simply does not exist there, so every query goes on to the database.
We can write such keys into the cache with a null value. When a request queries one of these keys, the cache returns the null directly; the decision is made at the cache layer, and the pressure never reaches the database.
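A sketch of caching null values with Jedis (the placeholder string and TTLs are assumptions; a short TTL on the null entry lets real data appear later):

    import redis.clients.jedis.Jedis;

    public class CacheNullValues {
        private static final String NULL_VALUE = "<null>"; // sentinel meaning "no such row"

        public static String getUser(Jedis jedis, String userId) {
            String key = "user:info:" + userId;
            String cached = jedis.get(key);
            if (cached != null) {
                // A cached sentinel means the DB was already checked and found nothing.
                return NULL_VALUE.equals(cached) ? null : cached;
            }
            String fromDb = queryDb(userId); // null for a non-existent number
            if (fromDb == null) {
                jedis.setex(key, 60, NULL_VALUE);   // cache the miss briefly
            } else {
                jedis.setex(key, 8 * 3600, fromDb); // cache the real data
            }
            return fromDb;
        }

        private static String queryDb(String userId) {
            return null; // placeholder for the real database lookup
        }
    }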
3.2.2 BloomFilter
A BloomFilter (Bloom filter) behaves like a space-efficient set: it answers whether an element (key) may exist in a collection.
Bloom filters are widely used in big data scenarios, for example in HBase to check whether data is on disk, and in crawlers to check whether a URL has already been crawled.
This scheme can be layered on top of the first one: put a BloomFilter in front of the cache and record every existing key in it. On a query, first ask the BloomFilter whether the key exists; if not, return immediately. Only if it might exist do we check the cache and then the database, which greatly reduces database pressure.
The flow is: BloomFilter first, then the cache, then the database.
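A sketch with Guava's BloomFilter (the expected size of 20,000 keys and the 1% false-positive rate are assumptions):

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    public class BloomFilterGate {
        // Sized for ~20,000 user keys with a 1% false-positive rate.
        private static final BloomFilter<String> KNOWN_KEYS =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 20_000, 0.01);

        public static void register(String key) {
            KNOWN_KEYS.put(key); // record every existing key, e.g. during warm-up
        }

        public static boolean mightExist(String key) {
            // false means "definitely absent": reject before touching cache or DB.
            return KNOWN_KEYS.mightContain(key);
        }
    }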
3.2.3 Choosing between the two schemes
As mentioned earlier, an attacker may fabricate a huge number of keys that do not exist. In that case, caching null values would fill the cache with a huge number of null entries for non-existent keys, which is clearly inappropriate; the second scheme can filter those keys out instead.
The basis for choosing is therefore:
For data with many distinct keys and a low repetition rate across requests, caching nulls is unnecessary; use a BloomFilter to filter them out directly.
For data with a limited set of keys and a high repetition rate of empty lookups, caching null values works well.
4 Cache breakdown
4.1 Concept
Cache breakdown occurs when a single existing key expires at a moment when a large number of requests for it arrive simultaneously. Those requests all break through to the DB at once, producing an instantaneous burst of DB requests and a sharp rise in pressure. (Note the difference from the two problems above: breakdown concerns one hot key.)
4.2 Solutions
4.2.1 Locking
In a distributed setting, before rebuilding the key, use SETNX (set if not exists) to set a separate short-lived key that locks access to the current key, and delete that lock key once the rebuild finishes.
The underlying phenomenon is many threads querying the same data from the database at the same time, so we let only the first request take a mutex and query the data.
Threads that arrive in the meantime wait; once the first thread has queried the data and cached it, later threads find the cache populated and read from it directly.
The downside of locking is that threads that fail to get the lock must wait, which lowers overall system throughput and hurts the user experience.
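A sketch of the SETNX-based mutex with Jedis (the lock TTL, retry interval, and key names are assumptions; production code would also guard against deleting a lock it no longer owns):

    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.params.SetParams;

    public class CacheMutex {
        public static String getWithLock(Jedis jedis, String key) throws InterruptedException {
            String value = jedis.get(key);
            if (value != null) {
                return value;
            }
            String lockKey = "lock:" + key;
            // SET ... NX EX 10: only one caller rebuilds; the lock expires on its own.
            String acquired = jedis.set(lockKey, "1", SetParams.setParams().nx().ex(10));
            if ("OK".equals(acquired)) {
                try {
                    value = queryDb(key);               // rebuild from the database
                    jedis.setex(key, 8 * 3600, value);
                } finally {
                    jedis.del(lockKey);                 // release the short-lived lock
                }
                return value;
            }
            Thread.sleep(50); // another thread is rebuilding: wait briefly, then retry
            return getWithLock(jedis, key);
        }

        private static String queryDb(String key) {
            return "db-value-for-" + key; // placeholder for the real query
        }
    }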
4.2.2 Null initial value
This is a form of temporary degradation.
When a cache entry becomes invalid and countless requests rush in, the first request enters the cache pool, finds nothing, judges null, queries the database, and finally writes the result back to the cache.
That window is dangerous: under ultra-high concurrency, even this short gap is enough to send tens of thousands of requests to the database. Worse, the query may be slow; if the whole round trip takes more than 2 seconds, the harm to the database is severe.
An industry practice called the null initial value handles this with a brief, local degradation that keeps the database from being broken through. Roughly: the first request writes a placeholder null into the cache before querying the database, so that concurrent requests read the placeholder and return a default value immediately.
In effect, requests A, B, C, and D arriving during the window are sacrificed; they get back a null or default value. But this local degradation ensures that the entire database is not punctured by the pile-up of requests.
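A sketch of the null initial value (the placeholder string and its 30-second TTL are assumptions; note this is a best-effort degrade, not a strict lock):

    import redis.clients.jedis.Jedis;

    public class NullInitialValue {
        private static final String PENDING = "<pending>"; // placeholder while the DB is queried

        public static String get(Jedis jedis, String key) {
            String cached = jedis.get(key);
            if (PENDING.equals(cached)) {
                return null; // degraded: another request is already rebuilding this entry
            }
            if (cached != null) {
                return cached;
            }
            // Write the placeholder first, so concurrent requests degrade instead of hitting the DB.
            jedis.setex(key, 30, PENDING);
            String fromDb = queryDb(key);       // possibly a slow (> 2 s) query
            jedis.setex(key, 8 * 3600, fromDb);
            return fromDb;
        }

        private static String queryDb(String key) {
            return "db-value-for-" + key; // placeholder for the real query
        }
    }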