Troubleshooting a Redis hang caused by enabling AOF

Shulou (Shulou.com), 06/01 report; updated 2025-07-04. From: SLTechnology News&Howtos, Database.

Problem description

The business team reported that an API that normally responds within 100 ms sometimes takes more than 10 seconds to return. Checking the Redis log around the times the business provided shows records like these:

8788:M 24 Aug 01:21:26.008 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
8788:M 24 Aug 01:21:... * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.

View the AOF-related Redis configuration:

127.0.0.1:6390> config get *append*
1) "no-appendfsync-on-rewrite"
2) "yes"
3) "appendfsync"
4) "everysec"
5) "appendonly"
6) "yes"

View the configuration of rdb:

127.0.0.1:6390> config get save
1) "save"
2) "

View the version of redis:

127.0.0.1:6390> info server
# Server
redis_version:3.2.4

Failure analysis

With AOF persistence turned on, Redis calls write(2) after handling each event to write the changes into the kernel buffer. If that write(2) blocks, Redis cannot handle the next event.

On Linux, a write(2) blocks if an fdatasync(2) on the same file is in progress (that is, the kernel buffer is being flushed to the physical disk) or if a system-wide sync is running. When write(2) blocks, the whole of Redis blocks with it.

If system I/O is busy, for example because other applications are writing to the disk, or because Redis itself is running an AOF rewrite or an RDB snapshot, fdatasync(2) may take a long time to complete. (During a rewrite or snapshot Redis is also writing a separate temporary file; each file is written sequentially, but switching between the two lengthens the disk head seek time.) The slow fdatasync(2) blocks write(2), which in turn blocks all of Redis.
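The difference between the two calls can be observed directly. The following is a minimal Python sketch, not anything Redis itself runs: it appends to a temporary file and times write(2) and fdatasync(2) separately. The function name time_write_and_sync and the "aof-demo-" file prefix are made up for illustration, and the numbers it prints depend entirely on the disk and on current I/O load.

```python
import os
import tempfile
import time


def time_write_and_sync(payload: bytes) -> tuple[float, float]:
    """Return (write_seconds, sync_seconds) for one append to a temp file."""
    # os.fdatasync is missing on some platforms; fall back to os.fsync there.
    sync = getattr(os, "fdatasync", os.fsync)
    fd, path = tempfile.mkstemp(prefix="aof-demo-")  # hypothetical prefix
    try:
        start = time.perf_counter()
        os.write(fd, payload)      # copies the data into the kernel buffer
        after_write = time.perf_counter()
        sync(fd)                   # forces the buffered data out to the disk
        after_sync = time.perf_counter()
    finally:
        os.close(fd)
        os.unlink(path)
    return after_write - start, after_sync - after_write


if __name__ == "__main__":
    w, s = time_write_and_sync(b"x" * (1 << 20))
    print(f"write: {w * 1000:.3f} ms, sync: {s * 1000:.3f} ms")
```

On an idle disk the write is usually much cheaper than the sync; running it while another process is writing heavily to the same disk gives a feel for how long a sync can stall.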

To see how long each fdatasync(2) call takes, you can run "strace -p (pid of redis-server) -T -f -e trace=fdatasync", but be aware that this affects system performance.
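With -T, strace appends the time spent in each call in angle brackets at the end of the line. As a rough sketch, the Python below pulls those durations out of captured strace output and keeps the slow ones. The sample lines are invented for illustration (they are not from the incident described here), and the function name and the 1-second threshold are assumptions.

```python
import re

# Matches completed fdatasync calls, including "<... fdatasync resumed>" lines,
# and captures the duration that strace -T prints as <seconds>.
FDATASYNC_RE = re.compile(r"fdatasync.*=\s*-?\d+.*<(\d+\.\d+)>\s*$")

# Illustrative sample only; real output will differ.
SAMPLE = """\
[pid  8788] fdatasync(7)                = 0 <0.004210>
[pid  8788] write(7, "...", 4)          = 4 <0.000031>
[pid  8788] fdatasync(7)                = 0 <2.731805>
"""


def slow_fdatasyncs(strace_output: str, threshold: float = 1.0) -> list[float]:
    """Return durations (seconds) of fdatasync calls at or above threshold."""
    durations = []
    for line in strace_output.splitlines():
        match = FDATASYNC_RE.search(line)
        if match:
            durations.append(float(match.group(1)))
    return [d for d in durations if d >= threshold]


if __name__ == "__main__":
    print(slow_fdatasyncs(SAMPLE))  # [2.731805]
```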

Redis has a self-protection mechanism. When it finds that an fdatasync(2) on the file is still in progress, it does not call write(2) right away; it keeps the data in its own buffer to avoid being blocked. If this situation lasts for more than two seconds, however, Redis forces the write(2) anyway, even though that may block it.

At that point it prints the telltale log message: "Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis."
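The postponement rule can be modeled in a few lines. The sketch below is a toy Python model written from the description above, not a copy of Redis's source: the class and method names are hypothetical, time is passed in as whole seconds, and the counter simply stands in for aof_delayed_fsync.

```python
from dataclasses import dataclass, field

WARNING = ("Asynchronous AOF fsync is taking too long (disk is busy?). "
           "Writing the AOF buffer without waiting for fsync to complete, "
           "this may slow down Redis.")


@dataclass
class EverysecFlushModel:
    """Toy model of the two-second write postponement (names are made up)."""
    postponed_start: int = 0   # 0 means no postponement is in progress
    delayed_fsync: int = 0     # stands in for the aof_delayed_fsync counter
    log: list = field(default_factory=list)

    def try_flush(self, now: int, fsync_in_progress: bool) -> bool:
        """Return True if write(2) would be issued at second `now`."""
        if fsync_in_progress:
            if self.postponed_start == 0:
                self.postponed_start = now  # first postponement: note the time
                return False
            if now - self.postponed_start < 2:
                return False                # still inside the two-second grace
            # Grace period used up: write anyway and accept the risk of blocking.
            self.delayed_fsync += 1
            self.log.append(WARNING)
        self.postponed_start = 0
        return True


if __name__ == "__main__":
    model = EverysecFlushModel()
    # An fsync that stays busy for three consecutive seconds.
    print([model.try_flush(t, True) for t in (100, 101, 102)])  # [False, False, True]
    print(model.delayed_fsync)                                  # 1
```

The write is skipped twice and forced on the third attempt, which is also the moment the warning is logged and the counter goes up.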

You can then run redis-cli INFO and see that the value of aof_delayed_fsync has increased by 1.
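To watch the counter over time, two INFO snapshots can be compared. This is a small Python sketch that parses the plain-text output of redis-cli INFO; the helper names are hypothetical and the snapshot values in the usage part are invented for illustration.

```python
def parse_info(text: str) -> dict[str, str]:
    """Parse `redis-cli INFO` output into a dict, skipping section headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def delayed_fsync_delta(before: str, after: str) -> int:
    """Number of delayed fsyncs that happened between two INFO snapshots."""
    first = int(parse_info(before).get("aof_delayed_fsync", 0))
    second = int(parse_info(after).get("aof_delayed_fsync", 0))
    return second - first


if __name__ == "__main__":
    # Invented snapshots; real INFO output contains many more fields.
    before = "# Persistence\r\naof_enabled:1\r\naof_delayed_fsync:3\r\n"
    after = "# Persistence\r\naof_enabled:1\r\naof_delayed_fsync:4\r\n"
    print(delayed_fsync_delta(before, after))  # 1
```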

Therefore, the most rigorous statement about possible data loss with appendfsync set to everysec is this: if fdatasync(2) has been running for a long time and Redis is shut down unexpectedly, the file will be missing no more than two seconds of data.

If fdatasync(2) is completing normally, an unexpected shutdown of Redis loses nothing; data is lost only if the operating system crashes, and then less than one second's worth.

Solutions

Method 1: turn off AOF

Whether this is feasible must be confirmed with the business. My view is that in Redis master-slave + Sentinel mode, if the master node goes down a slave node is promoted to master, and the old master resynchronizes its data once it recovers, so turning off AOF has little impact.

Method 2: modify the system configuration

It turns out that AOF rewrite simply keeps calling write(2) and leaves it to the system to trigger the sync. Red Hat Enterprise Linux 6 sets vm.dirty_background_ratio=10 by default, so the background flush starts only once dirty pages take up 10% of available memory, and my server has 8 GB of memory.

Flushing that much data at once obviously causes blocking, so in the end I set sysctl vm.dirty_bytes=33554432 (32 MB), which solved the problem.

An issue was later raised asking that AOF rewrite also call fdatasync periodically. antirez replied that in newer versions AOF rewrite actively calls fdatasync after every 32 MB written.
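The sizes involved are easy to check with a little arithmetic. The Python sketch below assumes, as an upper bound, that all 8 GB of memory counts as available (in practice the ratio applies to available memory, which is smaller), and the function name is made up for illustration.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def background_flush_threshold(available_bytes: int, ratio_percent: int) -> int:
    """Dirty-page volume at which the background flush starts, in bytes."""
    return available_bytes * ratio_percent // 100


if __name__ == "__main__":
    # Upper bound: treat the full 8 GiB as available memory (an assumption).
    upper_bound = background_flush_threshold(8 * GIB, 10)
    print(upper_bound // MIB)  # 819
    print(32 * MIB)            # 33554432
```

So under that assumption up to roughly 819 MB of dirty data could pile up before a flush begins, compared with a 32 MB limit after the change.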

Check the system kernel parameters:

> sysctl -a | grep dirty_background_ratio
vm.dirty_background_ratio = 10
> sysctl -a | grep vm.dirty_bytes
vm.dirty_bytes = 0

Modify the configuration file /etc/sysctl.conf and make the change take effect immediately:

echo "vm.dirty_bytes=33554432" >> /etc/sysctl.conf
sysctl -p

Verify that the modification succeeded:

> sysctl -a | grep vm.dirty_bytes
vm.dirty_bytes = 33554432
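The same check can be scripted, for example in a monitoring job. This Python sketch reads the value through the /proc/sys tree, where Linux exposes sysctl settings as files; the function name and the root parameter (there only so the function can be pointed at a test directory) are assumptions for illustration.

```python
from pathlib import Path


def read_sysctl(name: str, root: str = "/proc/sys") -> int:
    """Read an integer sysctl such as 'vm.dirty_bytes' from the /proc/sys tree."""
    # 'vm.dirty_bytes' maps to the file <root>/vm/dirty_bytes.
    path = Path(root).joinpath(*name.split("."))
    return int(path.read_text().strip())
```

On the server, read_sysctl("vm.dirty_bytes") should return 33554432 once the change has been applied.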

References:

https://ningyu1.github.io/site/post/32-redis-aof/

https://redis.io/topics/latency
