Using Go to implement robust memory caching
This article covers common cache usage scenarios, how to choose a cache, and the pitfalls to watch out for.
Translated from: Implementing robust in-memory cache with Go
In-memory caching trades memory for application performance and resilience, at the cost of delayed data consistency. When using an in-memory cache, you need to pay attention to concurrent updates, error caching, failover, background updates, expiration jitter, and cache warm-up and transfer.
Origin
Caching is one of the most convenient ways to improve performance, but it is not a silver bullet: in some scenarios you cannot reuse the result of an earlier task because of transactional or consistency constraints. Cache invalidation is famously one of the two hard problems in computer science.
If you only operate on immutable data, cache invalidation is not a concern; caching then merely reduces network overhead. If you need to stay in sync with mutable data, however, cache invalidation deserves careful attention.
The simplest approach is to invalidate entries based on a TTL (time to live). Even though it may look inferior to event-driven invalidation, it is simple and highly portable; since event delivery is not guaranteed to be instant, in the worst case (for example, an event broker briefly going offline or becoming overloaded) events can be even less accurate than a TTL.
A short TTL is usually a good trade-off between performance and consistency: it acts as a barrier that shields the data source from the bulk of high traffic.
Demo application
Let's look at a simple demo application that receives a URL with query parameters and returns a JSON object built from those parameters. Because the data lives in a database, the whole interaction is fairly slow.
The application is stress-tested with a load-testing tool, plt (invoked below via ./cmd/cplt), whose relevant parameters are:
cardinality: the number of unique URLs generated, which affects the cache hit ratio
group: the number of requests for the same URL sent in a batch, simulating concurrent access to the same key
go run ./cmd/cplt --cardinality 10000 --group 100 --live-ui --duration 10h --rate-limit 5000 curl --concurrency 200 -X 'GET' 'http://127.0.0.1:8008/hello?name=World&locale=ru-RU' -H 'accept: application/json'
This command starts a client that loops over 10000 different URLs at up to 5000 requests per second, with at most 200 concurrent requests. Each URL is sent in batches of 100 requests to mimic concurrent access to a single resource. The live results look like this:
The demo application supports three modes of operation, selected via the CACHE environment variable:
none: no caching; every request hits the database
naive: a plain map with a 3-minute TTL
advanced: the github.com/bool64/cache library, which implements a number of features for performance and resilience; the TTL is also 3 minutes
The demo application's code lives at github.com/vearutop/cache-story; it can be started with make start-deps run.
Without a cache, the application tops out at roughly 500 RPS, and once concurrency reaches about 130 the DB starts blocking with "Too many connections". That result is not disastrous, but performance clearly needs to improve.
With the advanced cache, throughput increases roughly 60-fold, while request latency and DB pressure both drop:
go run ./cmd/cplt --cardinality 10000 --group 100 --live-ui --duration 10h curl --concurrency 100 -X 'GET' 'http://127.0.0.1:8008/hello?name=World&locale=ru-RU' -H 'accept: application/json'

Requests per second: 25064.03
Successful requests: 15692019
Time spent: 10m26.078s

Request latency percentiles:
99%: 28.22ms
95%: 13.87ms
90%: 9.77ms
50%: 2.29ms

Bytes vs. structures
Which is better?
Depending on the use case, a byte cache ([]byte) has the following advantages:
Data is immutable; every access requires decoding
It uses less memory thanks to lower memory fragmentation
It is GC-friendly, because there is nothing to traverse
It is easy to transfer over the wire
It allows precise memory limits
The biggest disadvantage of a byte cache is the encoding/decoding overhead, which can become significant in hot loops.
Advantages of a structure cache:
No encoding/decoding is needed to access the data
More expressive: it can cache items that cannot be serialized
Disadvantages of a structure cache:
Structures are easy to mutate, so cached items may be modified inadvertently
Structures are stored less compactly in memory
With a large number of long-lived structures, the GC spends time traversing them to confirm they are still in use, which puts pressure on the garbage collector
It is nearly impossible to cap the total memory of a cache instance, since its dynamically sized items live on the heap together with everything else
This article uses a structure cache.
Naive caching
The naive cache is a map protected by a mutex. To retrieve a value for a key, first check whether an entry exists in the cache and whether it has expired; if not, build the value from the data source, store it in the cache, and return it to the caller, as in the sketch below.
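Here is a minimal sketch of such a naive cache; the buildFn callback standing in for the real data-source call and the type names are illustrative, not the demo application's actual code.

package cache

import (
	"sync"
	"time"
)

type entry struct {
	value     interface{}
	expiresAt time.Time
}

// NaiveCache is a mutex-protected map with TTL-based expiration.
type NaiveCache struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[string]entry
}

func NewNaiveCache(ttl time.Duration) *NaiveCache {
	return &NaiveCache{ttl: ttl, m: map[string]entry{}}
}

// Get returns a cached value or builds it with buildFn on a miss.
func (c *NaiveCache) Get(key string, buildFn func() (interface{}, error)) (interface{}, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if e, ok := c.m[key]; ok && time.Now().Before(e.expiresAt) {
		return e.value, nil // Fresh value found in the cache.
	}

	// Miss or expired: build from the data source while holding the lock.
	// This is exactly the kind of defect discussed below.
	v, err := buildFn()
	if err != nil {
		return nil, err
	}

	c.m[key] = entry{value: v, expiresAt: time.Now().Add(c.ttl)}

	return v, nil
}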
The logic is simple, but it has defects that can lead to serious problems.
Concurrent update
When multiple callers miss the same key at the same time, they all try to build the value, which can lead to deadlocks or resource exhaustion caused by a cache stampede. Callers that perform the build also pay extra latency.
And if some builds fail, the calling requests fail as well, even though a valid (stale) value may still sit in the cache.
You can simulate this problem with a low cardinality and a high group value:
go run ./cmd/cplt --cardinality 100 --group 1000 --live-ui --duration 10h --rate-limit 5000 curl --concurrency 5000 -X 'GET' 'http://127.0.0.1:8008/hello?name=World&locale=ru-RU' -H 'accept: application/json'
The chart starts with the naive cache; the blue marker indicates a restart into advanced caching. You can see that lock contention seriously affects performance (Incoming Request Latency) and resource usage (DB Operation Rate).
One solution is to serialize builds so that only one build runs at a time, but with many concurrent callers requesting different keys this causes heavy lock contention.
A better approach is to lock each key's build separately, so that one caller acquires the lock and performs the build while the other callers for that key wait for the resulting value, as in the sketch below.
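A per-key build lock can be sketched with golang.org/x/sync/singleflight; this illustrates the idea and is not necessarily how github.com/bool64/cache implements it internally.

package cache

import (
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

type item struct {
	value     interface{}
	expiresAt time.Time
}

// KeyLockedCache collapses concurrent builds of the same key into a single
// in-flight call using singleflight.
type KeyLockedCache struct {
	mu    sync.RWMutex
	ttl   time.Duration
	data  map[string]item
	group singleflight.Group
}

func NewKeyLockedCache(ttl time.Duration) *KeyLockedCache {
	return &KeyLockedCache{ttl: ttl, data: map[string]item{}}
}

func (c *KeyLockedCache) Get(key string, buildFn func() (interface{}, error)) (interface{}, error) {
	c.mu.RLock()
	it, ok := c.data[key]
	c.mu.RUnlock()

	if ok && time.Now().Before(it.expiresAt) {
		return it.value, nil // Fast path: fresh value in cache.
	}

	// Only one goroutine per key runs buildFn; the others wait for its result.
	v, err, _ := c.group.Do(key, func() (interface{}, error) {
		val, err := buildFn()
		if err != nil {
			return nil, err
		}

		c.mu.Lock()
		c.data[key] = item{value: val, expiresAt: time.Now().Add(c.ttl)}
		c.mu.Unlock()

		return val, nil
	})

	return v, err
}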
Background update
When a cached entry expires, a new value is needed, and building it may be slow. Doing that synchronously hurts tail latency (p99 and above). For cache items in high demand, the new value can be built proactively, even before the data expires, and if stale data is tolerable it can keep serving requests in the meantime.
In this scenario, stale data serves the request while the value is rebuilt in the background. Note that if the build relies on the parent context, that context may already be canceled once the stale data has satisfied the parent HTTP request; using such a context to reach the data source then produces a context canceled error.
The solution is to detach the background update from the parent context so that the parent's cancellation is ignored, as in the sketch below.
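Since Go 1.21 the standard library provides context.WithoutCancel for exactly this; the small wrapper below shows the same idea and also works on older versions. The type and function names are illustrative.

package cache

import (
	"context"
	"time"
)

// detachedCtx keeps the parent's values (trace IDs, loggers, etc.) but drops
// its deadline and cancellation, so a background cache update survives the
// completion of the originating HTTP request.
type detachedCtx struct{ parent context.Context }

func (d detachedCtx) Deadline() (time.Time, bool) { return time.Time{}, false }
func (d detachedCtx) Done() <-chan struct{}       { return nil }
func (d detachedCtx) Err() error                  { return nil }
func (d detachedCtx) Value(key interface{}) interface{} {
	return d.parent.Value(key)
}

// Detach returns a context that is never canceled; on Go 1.21+ this is
// equivalent to context.WithoutCancel(parent).
func Detach(parent context.Context) context.Context {
	return detachedCtx{parent: parent}
}

// refreshInBackground is a usage example: the stale value has already been
// returned to the caller, and the rebuild continues with a detached context
// that outlives the request.
func refreshInBackground(ctx context.Context, buildFn func(context.Context) error) {
	detached := Detach(ctx)

	go func() {
		_ = buildFn(detached) // A real implementation would log the error.
	}()
}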
Another strategy is to proactively rebuild cache items that are about to expire even without a triggering request, but this can waste resources by endlessly refreshing entries that nobody cares about anymore.
Synchronized expiration
Suppose we start a single instance with a TTL cache. Because the cache is empty, every request misses and creates a value, so load on the data source spikes, and all the stored cache items end up with nearly identical expiration times. Once the TTL passes, most of the cache expires almost simultaneously, causing another load spike, and the refreshed values again share nearly the same expiration time, and so on.
This mostly affects hot cache items, which are rebuilt right after they expire and therefore keep updating in lockstep.
The solution to this problem is to add jitter to the expiration time.
With 10% expiration jitter, the effective expiration time lands between 0.95 * TTL and 1.05 * TTL. Even this small amount of jitter helps reduce the impact of synchronized expiration; a sketch of applying it follows.
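A minimal sketch of computing a jittered TTL when storing an entry; the function name and the 10% figure are just examples.

package cache

import (
	"math/rand"
	"time"
)

// jitteredTTL spreads expiration times around the nominal TTL. With
// jitter = 0.1 the result is uniformly distributed in [0.95*ttl, 1.05*ttl).
func jitteredTTL(ttl time.Duration, jitter float64) time.Duration {
	if jitter <= 0 {
		return ttl
	}

	// rand.Float64 returns a value in [0, 1); shift to [-0.5, 0.5) and scale.
	factor := 1 + jitter*(rand.Float64()-0.5)

	return time.Duration(float64(ttl) * factor)
}

// Usage when storing an entry:
//	expiresAt := time.Now().Add(jitteredTTL(3*time.Minute, 0.1))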
The following run simulates the situation with high cardinality and high concurrency: a large number of entries is requested in a short time, which builds up an expiration spike.
go run ./cmd/cplt --cardinality 10000 --group 1 --live-ui --duration 10h --rate-limit 5000 curl --concurrency 5000 -X 'GET' 'http://127.0.0.1:8008/hello?name=World&locale=ru-RU' -H 'accept: application/json'
As the chart shows, the naive cache cannot avoid the synchronized-expiration problem. The blue marker indicates a restart into the advanced cache with 10% jitter; the spikes become noticeably smaller and the service is more stable overall.
Caching errors
When building a value fails, the simplest option is to return the error to the caller, but this can cause serious problems.
For example, a service that comfortably handles 10K RPS with the help of its cache suddenly fails to build cache values (perhaps because of a brief database overload, a network issue, or a logic error such as failed validation), and all 10K RPS hit the data source directly, because there are no cached values to fall back on.
For high-load systems, caching errors with a short TTL is critical; a sketch follows.
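A minimal sketch of caching failed builds; the failTTL field (much shorter than the regular TTL) and the type names are illustrative.

package cache

import (
	"sync"
	"time"
)

type buildResult struct {
	value     interface{}
	err       error
	expiresAt time.Time
}

// ErrCachingCache remembers failed builds for a short time, so a broken data
// source is not hammered by every single request.
type ErrCachingCache struct {
	mu      sync.Mutex
	ttl     time.Duration // TTL for successful values, e.g. 3 minutes.
	failTTL time.Duration // Much shorter TTL for errors, e.g. a few seconds.
	data    map[string]buildResult
}

func NewErrCachingCache(ttl, failTTL time.Duration) *ErrCachingCache {
	return &ErrCachingCache{ttl: ttl, failTTL: failTTL, data: map[string]buildResult{}}
}

func (c *ErrCachingCache) Get(key string, buildFn func() (interface{}, error)) (interface{}, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if r, ok := c.data[key]; ok && time.Now().Before(r.expiresAt) {
		return r.value, r.err // A cached error is returned without touching the data source.
	}

	v, err := buildFn()

	ttl := c.ttl
	if err != nil {
		ttl = c.failTTL // Cache the failure briefly instead of retrying on every request.
	}

	c.data[key] = buildResult{value: v, err: err, expiresAt: time.Now().Add(ttl)}

	return v, err
}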
Failover mode
Sometimes serving stale data is better than returning an error, especially when the data has only just expired and is quite likely still equal to the freshly rebuilt value.
Failover serving trades accuracy for availability, a compromise that is common in distributed systems; a sketch follows.
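A sketch of the serve-stale-on-error idea; the maxStale window and the function signature are made up for illustration.

package cache

import "time"

// getWithFailover tries to build a fresh value; if the build fails and a
// stale copy that expired less than maxStale ago is available, the stale copy
// is served instead of the error.
func getWithFailover(
	staleValue interface{},
	staleExpiredAt time.Time,
	hasStale bool,
	maxStale time.Duration,
	buildFn func() (interface{}, error),
) (interface{}, error) {
	v, err := buildFn()
	if err == nil {
		return v, nil
	}

	if hasStale && time.Since(staleExpiredAt) < maxStale {
		// The value expired only recently; it is probably still close to fresh.
		return staleValue, nil
	}

	return nil, err
}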
Cache transfer
A cache works best when it is filled with relevant data.
When a new instance starts, its cache is empty, and until enough useful data has accumulated, cache efficiency is greatly reduced.
There are several ways to deal with this "cold" cache problem. One is cache warm-up: iterate over data that is likely to be useful and preload it.
For example, you could pull recently used rows from a database table and store them in the cache, but this approach is complicated and does not always work.
You could also hand-craft logic that decides which keys will be needed and rebuild those entries in the cache, but that puts extra pressure on the database.
A shared cache instance such as Redis or memcached also sidesteps the problem, but it introduces another one: reading data over the network is much slower than reading from a local in-memory cache, network bandwidth can become a bottleneck, and encoding/decoding data on the wire adds latency and resource consumption.
The simplest approach is to transfer the cache from an already-running instance to the newly started one.
The data cached by the active instance is highly relevant, because it was produced in response to real user requests.
Transferring the cache does not rebuild any data, so it does not abuse the data source.
Production systems usually run several application instances in parallel, and deployments restart them sequentially, so there is always an active instance with a high-quality cache to copy from.
Go ships with a binary serialization format, encoding/gob, which makes it possible to transfer data with minimal effort. The drawback is that it relies on reflection and requires exported fields. A sketch of dumping and restoring entries with gob follows.
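A minimal sketch of dumping cache entries with encoding/gob on one instance and restoring them on another; the Entry and Greeting types are illustrative, not the actual types used by bool64/cache or the demo application.

package cache

import (
	"encoding/gob"
	"io"
	"time"
)

// Greeting is a stand-in for the application's cached structure.
type Greeting struct {
	Message string
	Locale  string
}

// Entry is an illustrative cache entry; gob requires exported fields.
type Entry struct {
	Key       string
	Value     Greeting
	ExpiresAt time.Time
}

// DumpEntries writes all entries to w, for example an HTTP response body.
func DumpEntries(w io.Writer, entries []Entry) error {
	enc := gob.NewEncoder(w)

	for _, e := range entries {
		if err := enc.Encode(e); err != nil {
			return err
		}
	}

	return nil
}

// RestoreEntries reads entries from r until EOF, for example the body of a
// response from an active instance's transfer endpoint.
func RestoreEntries(r io.Reader) ([]Entry, error) {
	dec := gob.NewDecoder(r)

	var entries []Entry

	for {
		var e Entry

		if err := dec.Decode(&e); err != nil {
			if err == io.EOF {
				return entries, nil
			}

			return nil, err
		}

		entries = append(entries, e)
	}
}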
Another consideration for cache transfer is that different versions of the application may have incompatible data structures. To handle this, the cached structures are fingerprinted and the transfer is aborted if the fingerprints do not match.
Here is a simple implementation:
// RecursiveTypeHash hashes type of value recursively to ensure structural match.
func recursiveTypeHash(t reflect.Type, h hash.Hash64, met map[reflect.Type]bool) {
	for {
		if t.Kind() != reflect.Ptr {
			break
		}

		t = t.Elem()
	}

	if met[t] {
		return
	}

	met[t] = true

	switch t.Kind() {
	case reflect.Struct:
		for i := 0; i < t.NumField(); i++ {
			f := t.Field(i)

			// Skip unexported field.
			if f.Name != "" && (f.Name[0:1] == strings.ToLower(f.Name[0:1])) {
				continue
			}

			if !f.Anonymous {
				_, _ = h.Write([]byte(f.Name))
			}

			recursiveTypeHash(f.Type, h, met)
		}
	case reflect.Slice, reflect.Array:
		recursiveTypeHash(t.Elem(), h, met)
	case reflect.Map:
		recursiveTypeHash(t.Key(), h, met)
		recursiveTypeHash(t.Elem(), h, met)
	default:
		_, _ = h.Write([]byte(t.String()))
	}
}
The cached data can be transferred over HTTP or any other suitable protocol; here it is exposed over HTTP at /debug/transfer-cache. Note that the cache may contain sensitive information, so this endpoint must not be publicly accessible.
In this example, the transfer is performed between two instances of the same application listening on different ports:
CACHE_TRANSFER_URL=http://127.0.0.1:8008/debug/transfer-cache HTTP_LISTEN_ADDR=127.0.0.1:8009 go run main.go
2022-05-09T02:33:42.871+0200 INFO cache/http.go:282 cache restored {"processed": 10000, "elapsed": "12.963942ms", "speed": "39.564084 MB/s", "bytes": 537846}
2022-05-09T02:33:42.874+0200 INFO brick/http.go:66 starting server, Swagger UI at http://127.0.0.1:8009/docs
2022-05-09T02:34:01.162+0200 INFO cache/http.go:175 cache dump finished {"processed": 10000, "elapsed": "12.654621ms", "bytes": 537846, "speed": "40.530944 MB/s", "name": "greetings", "trace.id": "31aeeb8e9e622b3cd3e1aa29fa3334af", "transaction.id": "a0e8d90542325ab4"}
In the chart above, the blue markers indicate application restarts; the last two restarts use cache transfer. You can see that performance is unaffected when the cache is transferred, whereas without a transfer there is a severe warm-up penalty.
A less obvious benefit is that the cached data can be transferred to a local development machine to reproduce and debug production issues.
Lock contention and underlying performance
Nearly every cache implementation is a key-value map with some protection for concurrent access (mostly reads).
In most scenarios the low-level performance of that map can be ignored. For example, if an in-memory cache backs an HTTP API, the simplest mutex-protected map is sufficient, because I/O takes far longer than the in-memory operations. Keeping this in mind helps avoid premature optimization and unjustified complexity.
If the application that relies on the in-memory cache is CPU-bound, however, lock contention may start to affect overall performance.
Lock contention comes from the synchronization needed to avoid data races between concurrent reads and writes. With a single mutex, that synchronization allows only one operation at a time, so the extra cores of a multicore CPU stay idle.
For read-heavy workloads the standard sync.Map performs well, but its performance degrades as the share of writes grows. A faster alternative is the Map from github.com/puzpuzpuz/xsync, built on the Cache-Line Hash Table (CLHT) data structure.
Another common way to reduce lock contention is map sharding (used by fastcache, bigcache, and bool64/cache): values are spread across buckets based on their key, trading a little convenience for performance, as in the sketch below.
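A minimal sketch of a sharded map, hashing the key with FNV-1a to pick a bucket; the shard count and the type names are arbitrary choices for illustration.

package cache

import (
	"hash/fnv"
	"sync"
)

const shardCount = 128

type shard struct {
	mu sync.RWMutex
	m  map[string]interface{}
}

// ShardedMap spreads keys over independently locked buckets so that
// concurrent operations on different keys rarely contend on the same mutex.
type ShardedMap struct {
	shards [shardCount]shard
}

func NewShardedMap() *ShardedMap {
	s := &ShardedMap{}
	for i := range s.shards {
		s.shards[i].m = map[string]interface{}{}
	}

	return s
}

func (s *ShardedMap) shardFor(key string) *shard {
	h := fnv.New32a()
	_, _ = h.Write([]byte(key))

	return &s.shards[h.Sum32()%shardCount]
}

func (s *ShardedMap) Load(key string) (interface{}, bool) {
	sh := s.shardFor(key)

	sh.mu.RLock()
	defer sh.mu.RUnlock()

	v, ok := sh.m[key]

	return v, ok
}

func (s *ShardedMap) Store(key string, value interface{}) {
	sh := s.shardFor(key)

	sh.mu.Lock()
	defer sh.mu.Unlock()

	sh.m[key] = value
}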
Memory management
Memory is a limited resource, so the cache cannot grow indefinitely.
Expired entries need to be removed from the cache, either synchronously or by a background process. A background janitor does not block the application itself, and if it is configured to delay removal, the expired data remains available for failover.
If removing expired entries alone does not reclaim enough memory, consider an additional eviction strategy. Choosing one is a balance between CPU/memory usage and hit/miss rates: the purpose of eviction is to optimize the hit/miss rate within an acceptable performance budget, and that rate is the metric to watch when evaluating a strategy.
Commonly used eviction strategies:
Least frequently used (LFU): requires maintaining counters on every access
Least recently used (LRU): requires updating the timestamp or order of elements on every access
First in, first out (FIFO): uses the timestamp from when the entry was created, relatively lightweight
Random entry: best performance, needs no ordering at all, but least precise
That covers what to evict; the next question is when to evict and how many entries to evict.
For a []byte cache this is easy to answer, because most implementations provide precise memory control.
For a structure cache it is trickier: while the application is running, there is no reliable way to determine how much heap memory a particular structure occupies. The GC has that information, but the application itself does not. The following two metrics are imprecise but readily available:
Number of elements in the cache
Total memory used by the application
Because neither metric is linearly proportional to the memory held by the cache, it is impossible to compute exactly how many entries to remove. A more practical approach is to evict a portion of the entries (for example, entries accounting for roughly 10% of the used memory) whenever eviction is triggered, as in the sketch below.
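A sketch of triggering eviction from the application's heap usage; as a simplification of the "roughly 10% of memory" idea it drops a fixed fraction of the entries, and the heap threshold, fraction, and the choice to evict the entries closest to expiration are all illustrative.

package cache

import (
	"runtime"
	"sort"
	"sync"
	"time"
)

// maybeEvict drops a fraction of the entries (those closest to expiration)
// when the application heap grows beyond heapLimit bytes.
// expirations maps cache keys to their expiration times; in a real cache the
// value map would be pruned together with it.
func maybeEvict(mu *sync.Mutex, expirations map[string]time.Time, heapLimit uint64, fraction float64) {
	var ms runtime.MemStats

	runtime.ReadMemStats(&ms)

	if ms.HeapInuse < heapLimit {
		return // Enough headroom, nothing to evict.
	}

	mu.Lock()
	defer mu.Unlock()

	type kv struct {
		key       string
		expiresAt time.Time
	}

	entries := make([]kv, 0, len(expirations))
	for k, exp := range expirations {
		entries = append(entries, kv{key: k, expiresAt: exp})
	}

	// Evict the entries that would expire soonest.
	sort.Slice(entries, func(i, j int) bool {
		return entries[i].expiresAt.Before(entries[j].expiresAt)
	})

	n := int(float64(len(entries)) * fraction)
	for i := 0; i < n; i++ {
		delete(expirations, entries[i].key)
	}
}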
The heap impact of cached data also depends heavily on the map implementation. As the benchmark below shows, a map[string]struct{...} can take about four times more memory than the binary-serialized (uncompressed) form of the same data.
Benchmarks
The following benchmark stores 1M small structures (struct { int, bool, string }) under two workloads, one with 10% writes and one with 0.1% writes. Byte caches are benchmarked including the cost of encoding/decoding the structure.
goos: darwin
goarch: amd64
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz

name          MB/inuse    time/op (10%)   time/op (0.1%)
sync.Map      192 ± 0%    142ns ± 4%      29.8ns ± 10%   // Great for read-heavy workloads.
shardedMap    196 ± 0%    53.3ns ± 3%     28.4ns ± 11%
mutexMap      182 ± 0%    226ns ± 3%      207ns ± 1%
rwMutexMap    182 ± 0%    233ns ± 2%      67.8ns ± 2%    // RWMutex perf degrades with more writes.
shardedMapOf  181 ± 0%    50.3ns ± 3%     27.3ns ± 13%
ristretto     346 ± 0%    167ns ± 8%      54.1ns ± 4%    // Failed to keep the full working set, ~7-15% of items are evicted.
xsync.Map     380 ± 0%    31.4ns ± 9%     22.0ns ± 14%   // Fastest, but a bit hungry for memory.
patrickmn     184 ± 0%    373ns ± 1%      72.6ns ± 5%
bigcache      340 ± 0%    75.8ns ± 8%     72.9ns ± 3%    // Byte cache.
freecache     333 ± 0%    98.1ns ± 0%     77.8ns ± 2%    // Byte cache.
fastcache     44.9 ± 0%   60.6ns ± 8%     64.1ns ± 5%    // A true champion for memory usage, while having decent performance.
If your use case allows serialization, fastcache offers the best memory usage (it allocates memory dynamically, as needed).
For CPU-bound applications, xsync.Map is a good choice.
As the benchmark also shows, byte caching does not automatically mean efficient memory use; bigcache and freecache illustrate that.
Developer friendliness
Programs do not always behave the way we expect, and complex logic can produce unexpected problems that are hard to track down. Unfortunately, caching makes this worse, which is why it is important to make caching developer-friendly.
Caching can itself be the source of problems, so it should be possible to safely invalidate the relevant cache entries on short notice, for example by invalidating all cached items. Under high load, invalidation should not simply mean "delete everything at once": if the whole cache is dropped, the data source may fail under the resulting load. A more graceful approach is to mark all entries as expired and refresh them in the background, serving the old data while the update is in progress.
When someone investigates a problem with a specific data source, a stale cached item can be misleading. Being able to disable caching for an individual request removes any inaccuracy introduced by the cache; this can be implemented with a dedicated request header handled in middleware that marks the request context. Be careful not to expose this control to external users, as it could be abused for DoS attacks; a sketch follows.
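A minimal sketch of such middleware; the X-No-Cache header name and the context key are made up for illustration.

package cache

import (
	"context"
	"net/http"
)

type ctxKey struct{}

// SkipCache reports whether the current request asked to bypass the cache.
func SkipCache(ctx context.Context) bool {
	skip, _ := ctx.Value(ctxKey{}).(bool)

	return skip
}

// NoCacheMiddleware marks the request context when the X-No-Cache header is
// present. It should only be enabled for trusted (internal) traffic.
func NoCacheMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("X-No-Cache") != "" {
			r = r.WithContext(context.WithValue(r.Context(), ctxKey{}, true))
		}

		next.ServeHTTP(w, r)
	})
}

// Inside the cache, check SkipCache(ctx) before returning a cached value and
// rebuild from the data source when it returns true.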