Abstract: As the leading KV database in the NoSQL world, Redis is favored by developers for its high performance, low latency and rich data structures. However, Redis is limited in horizontal scalability, and how to scale out transparently to the business is a problem many Redis users face. Codis, an open source distributed solution for Redis, fills this gap. This article explains how Codis stays invisible to the business, migrates data smoothly and with high performance, handles migration exceptions, and stays highly available, and it collects common Redis pitfalls to avoid. Although Codis is winding down inside the company as in-house NoSQL products mature, many readers are still interested in its principles, so this is a reorganized version of an earlier internal share. Outside the company, Codis is still quite widely used.
Contents
I. Background
II. Overview of Redis fundamentals
2.1 Introduction to Redis
2.2 Characteristics of Redis
2.3 Redis application scenarios
III. Comparison of Redis distributed solutions inside and outside the company
IV. The architecture design of Codis
4.1 Overall architecture design of Codis
4.2 Architecture design and implementation of Codis proxy
V. Data reliability & high availability & disaster recovery & failover & split-brain handling
5.1 Data reliability
5.2 High availability & disaster recovery & failover
VI. Details of horizontal expansion of Codis & handling of migration exceptions
6.1 Details of Codis expansion and migration
6.2 Migration exception handling
VII. Codis-related data
VIII. Operation and maintenance manual and pitfall-avoidance guide
IX. Reference materials
I. Background
With the arrival of the "first year of live streaming", live-streaming products have been springing up like bamboo shoots after rain. To boost revenue, product teams dream up all kinds of activities to stimulate users' desire to spend, and the basic form of most of these activities is a ranking list. In 2016 we built the rankings on cmem plus scans of a flow table. Starting in 2017 we refactored the original system and chose Redis as the underlying storage for the rankings. During the refactoring I was given the task of investigating Redis distributed solutions; after comparing the open source products in the industry I settled on Codis and dug into its details. While talking with the author of Codis, I was lucky to learn that simotang from the value-added products department had already been running Codis there for nearly 2 years, so I joined the operation and maintenance of Codis. The department currently operates 15 Codis clusters with about 2 TB of capacity and 10 billion visits per day. They have supported the interactive video products department's basic storage, operational activities and ranking business for more than 2 years, covering more than 100 activities and thousands of ranking lists. I would also like to thank spinlock, the author of Codis, for his guidance and help during the adoption; see spinlock's github and the codis repository.
II. Overview of Redis fundamentals
2.1 Introduction to Redis
Redis is a memory-based KV database with high performance, low latency and data persistence. The value can be a string, hash, list, set or sorted set.
Redis (Remote Dictionary Server)
Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, and sorted sets with range queries. Try it online: http://try.redis.io/
2.2 Features of Redis
1. Single-threaded asynchronous architecture (one thread receives, parses and executes commands and sends replies; file events are handled with multiplexed I/O)
2. K-V structure; the value supports rich data structures (string, hash, list, set, sorted set)
3. High performance and low latency: memory-based operation with 100k+ Get/Set per second; data reliability is guaranteed by RDB and AOF persistence
4. Rich features for caching and message queues, such as TTL expiration
5. Transactions are supported, and operations are atomic: either all commands are applied or none are
2.3 Redis application scenarios
String
Common scenarios: counters, user information (id) mapping, uniqueness checks (e.g. user eligibility), bitmaps
Hash
Common scenarios: storing attribute information of objects (user profile)
List
Common scenarios: comment storage, message queue
Set
Common scenarios: eligibility checks (e.g. whether a user may claim a reward), data de-duplication, etc.
Sorted set
Common scenarios: rankings/leaderboards, delay queues (a minimal leaderboard sketch follows the links below)
Other
Two articles are recommended on distributed lock design:
Is Redis-based distributed locking safe? (part I)
http://zhangtielei.com/posts/blog-redlock-reasoning.html
Is Redis-based distributed locking safe? (part II)
http://zhangtielei.com/posts/blog-redlock-reasoning-part2.html
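Since leaderboards are the scenario that motivated this article, here is a minimal leaderboard sketch built on a sorted set, using the go-redis client purely for illustration; the address, key name, members and scores are assumptions, not part of the original system.

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "127.0.0.1:6379"}) // assumed address

	// Add (or update) scores for a few anchors in today's ranking.
	rdb.ZAdd(ctx, "anchor_rank:day1",
		redis.Z{Score: 300, Member: "anchor:1001"},
		redis.Z{Score: 550, Member: "anchor:1002"},
		redis.Z{Score: 120, Member: "anchor:1003"},
	)

	// Read the top 3, highest score first.
	top, err := rdb.ZRevRangeWithScores(ctx, "anchor_rank:day1", 0, 2).Result()
	if err != nil {
		panic(err)
	}
	for i, z := range top {
		fmt.Printf("#%d %v score=%.0f\n", i+1, z.Member, z.Score)
	}
}
```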
IV. The architecture design of Codis
4.1 Overall architecture design of Codis
[Figure: Codis architecture diagram]
As shown in the figure above, Codis as a whole is a two-tier architecture, proxy + storage, which is relatively simple compared with ckv+'s proxy-less design. When the number of client connections keeps growing, there is no need to scale the data layer by copying data; only the proxy layer has to be expanded, so from this angle the cost is lower. When the number of connections is small, however, the separately deployed proxies make the cost higher.
In open source Codis, service registration and discovery for codis-proxy is implemented through zk; in our department it is based on L5.
From the overall architecture diagram, the architecture of Codis is fairly clear, and codis-proxy is the core of the distributed design: storage routing and shard migration are both inseparable from it. Let's look at the design and implementation of codis-proxy.
4.2 Architecture design and implementation of Codis proxy
The architecture of codis-proxy is presented in two parts: 4.2.1 route map details and 4.2.2 proxy request processing details.
4.2.1 Route map details
As shown in the figure below, this part covers the routing details of Codis: how a key is mapped to a specific physical node.
[Figure: route map details]
Related vocabulary:
Slot: sharding information; in Codis it is simply a number, the index of the shard. Each slot belongs to a specific Redis instance.
Group: a virtual node, a logical node made up of several Redis machines in a one-master-multi-slave setup.
To give you a deeper understanding of the proxy's route mapping, I have collected a few common questions about it.
Question 1: how does the proxy map a request to a specific Redis instance?
Codis computes crc32 over the key and takes the result modulo 1024 to obtain the slot, which is called the logical shard. Codis then maps the logical shard to a virtual node (group); each virtual node is made up of physical Redis nodes in a one-master-multi-slave setup. As for why crc32 is used, there is no deeper reason given; the author borrowed it from the Redis Cluster implementation. By introducing the logical storage node (group), the upper-layer mapping stays unchanged even when the underlying machines or instances change, which keeps the mapping transparent to the upper layer and makes shard management easy.
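The mapping itself is just a hash plus a modulo. Below is a minimal sketch of the idea in Go, assuming the default slot count of 1024; it is an illustration, not Codis' actual source.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

const numSlots = 1024 // assumed default slot count

// slotOf maps a key to its logical shard: crc32(key) % 1024.
func slotOf(key string) uint32 {
	return crc32.ChecksumIEEE([]byte(key)) % numSlots
}

func main() {
	for _, k := range []string{"anchor_rank_day1", "user:10086"} {
		fmt.Printf("key=%s -> slot=%d\n", k, slotOf(k))
	}
}
```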
Question 2: how does the proxy implement read/write separation?
As shown in the figure above, once a key is mapped to a specific virtual node, the master and slave instances of that node are known. The proxy layer recognizes the specific Redis command and determines whether it is a read or a write. If the cluster configuration enables read/write separation, reads are routed randomly across the master and slave instances; otherwise everything is routed to the master. A small sketch of this routing decision follows.
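A rough sketch of that decision, assuming a per-command read-only table and a random choice among the group's instances; the command list and the addresses are illustrative, not Codis' actual code.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Commands the sketch treats as read-only (illustrative subset).
var readOnly = map[string]bool{"GET": true, "MGET": true, "ZRANGE": true}

// pickNode routes writes to the master; when read/write splitting is enabled,
// reads go to a random instance of the group (master or one of its slaves).
func pickNode(cmd, master string, slaves []string, splitReads bool) string {
	if !splitReads || !readOnly[cmd] {
		return master
	}
	nodes := append([]string{master}, slaves...)
	return nodes[rand.Intn(len(nodes))]
}

func main() {
	fmt.Println(pickNode("GET", "10.0.0.1:6379", []string{"10.0.0.2:6379"}, true))
	fmt.Println(pickNode("SET", "10.0.0.1:6379", []string{"10.0.0.2:6379"}, true))
}
```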
Question 3: which commands does the proxy currently support, are batch commands supported, and how is atomicity guaranteed?
Command support links:
Unsupported commands
Semi-supported commands
Command support: the proxy divides commands into three kinds: unsupported, semi-supported and supported. Apart from the commands listed above, all other commands are supported by the proxy. The unsupported ones are mainly commands whose arguments contain no key, so the proxy cannot extract routing information and does not know which instance to route to. Semi-supported commands usually operate on multiple keys; Codis takes a simple approach and routes by the first key, so the business has to keep all the keys routed to the same slot, or accept the consequences itself. This is a weak check, whereas the company-level product ckv+ performs a strong check on multi-key operations and returns an error if the keys are not in the same slot.
Multi-key operations & atomicity: Redis itself is atomic for some multi-key operations such as mset, but in a distributed deployment the keys of a multi-key operation may be spread over multiple Redis instances, which would require distributed transactions. Codis simplifies this by splitting a multi-key operation into multiple single-key commands, so an mset over multiple keys in Codis has no atomic semantics.
Question 4: how do we make sure multiple keys land in one slot?
In some scenarios we want to use Lua or some of the semi-supported commands while keeping our operations atomic, so we need several keys to live in the same slot. Codis uses the same mechanism as Redis Cluster: hashtags. For example, if I want the keys of a 7-day anchor ranking to route to the same slot, {anchor_rank}day1, {anchor_rank}day2 and {anchor_rank}day3 will do it: when a key contains curly braces, Codis only hashes the string inside the braces.
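A sketch of the hashtag rule in Go, again assuming 1024 slots; the key names are just examples, not part of the original system.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"strings"
)

// hashTagSlot hashes only the substring inside "{...}" when a hashtag is
// present, so keys that share the tag always land in the same slot.
func hashTagSlot(key string) uint32 {
	if begin := strings.Index(key, "{"); begin >= 0 {
		if end := strings.Index(key[begin+1:], "}"); end > 0 {
			key = key[begin+1 : begin+1+end] // keep only the tag content
		}
	}
	return crc32.ChecksumIEEE([]byte(key)) % 1024
}

func main() {
	fmt.Println(hashTagSlot("{anchor_rank}day1"))
	fmt.Println(hashTagSlot("{anchor_rank}day2")) // same slot as day1
}
```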
4.2.2 Proxy request processing details
As shown in the figure below, this part covers the request processing details of the proxy: how a request is accepted and how the response packet is sent back.
[Figure: proxy request processing details]
Codis proxy is written in Go, a language with native support for coroutines (goroutines).
1) After the proxy accepts a connection from the client, it creates a new session and starts both a reader and a writer for that session. The reader receives and parses the client's request data, splits commands in multi-key scenarios, dispatches the request to a specific Redis instance through the router, and writes the result returned by Redis into a channel. The writer takes the result from the channel and writes it back to the client.
[Code: reader loop]
[Code: writer loop]
2) The router layer obtains the routing information of a key via crc32; from the hash source code you can also see that Codis does support hashtags.
[Code: hash function]
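To make the session model concrete, here is a heavily simplified sketch of a reader/writer pair connected by a channel; handleRequest is a hypothetical stand-in for the router plus backend round-trip, so this illustrates the concurrency shape only, not the proxy's actual code.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
)

// handleRequest is a hypothetical stand-in for routing the command to a
// backend Redis instance and collecting its reply.
func handleRequest(line string) string { return "+OK from backend for: " + line }

func serveSession(conn net.Conn) {
	results := make(chan string, 128) // per-session pipeline between reader and writer

	go func() { // reader loop: parse client requests, dispatch, push replies
		defer close(results)
		scanner := bufio.NewScanner(conn)
		for scanner.Scan() {
			results <- handleRequest(scanner.Text())
		}
	}()

	for r := range results { // writer loop: flush replies back to the client
		fmt.Fprintln(conn, r)
	}
	conn.Close()
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	fmt.Println("listening on", ln.Addr())
	for {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		go serveSession(conn)
	}
}
```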
That covers the proxy's route mapping and request processing; quite simple overall, isn't it?
V. Data reliability & high availability & disaster recovery & failover & split-brain handling
For a storage layer, data reliability and service availability are the core stability metrics, and they directly affect the stability of the core services above. This section focuses on these two metrics.
5.1 data reliability
As far as the Codis implementation goes, data reliability mainly relies on the capabilities of Redis itself. Data reliability at the storage layer is usually achieved through reliable single-machine persistence + remote hot backup of the data + periodic cold backup archives.
Single-machine data reliability relies on Redis' own persistence: RDB (periodic dump) and AOF (append-only log). See the two books in the references for details; of the two, AOF is safer. We have the AOF switch turned on online, which is described in detail at the end of this article.
Remote hot backup relies on Redis master-slave replication, with full and incremental synchronization, which gives Redis the ability to keep a hot standby in another location.
Periodic cold backup archiving: while the storage service is running, data may be lost through human error, machine-room network failures or hardware problems, so we need a backup plan. At present we keep the last 48 hours of backups on each machine and also use SNG's Liu Bei system for cold backup, so that data lost in an unexpected incident can be recovered quickly.
5.2 High availability & disaster recovery & failover
The architecture of Codis splits into the proxy cluster and the Redis cluster. High availability of the proxy cluster can be built on zk or L5 failover, while high availability of the Redis cluster is provided by the open source Redis Sentinel cluster. Since Codis is a component outside Redis, one of the problems it must solve is how to integrate the Redis Sentinel cluster. This section splits the problem into three parts: how the Sentinel cluster keeps Redis highly available, how codis-proxy senses the Sentinel cluster's failover actions, and how the Redis cluster reduces the probability of split-brain.
5.2.1 How does the Sentinel cluster ensure the high availability of Redis?
Sentinel is the high-availability solution for Redis: a Sentinel system made up of one or more Sentinel instances monitors any number of master servers and all of their slaves. When a monitored master goes offline, it automatically promotes one of that master's slaves to be the new master, which then handles command requests in place of the offline master.
Generally speaking, achieving high availability requires two things: fault detection and failover (that is, electing a new master and switching master/slave roles).
Fault detection
1) Every second, each Sentinel sends PING commands to the master, the slaves and the other Sentinel instances.
Valid replies: +PONG, -LOADING, -MASTERDOWN
Invalid reply: any reply other than the above three, or no reply within the configured time limit
sentinel.conf -> sentinel down-after-milliseconds master 50000
(when a Sentinel receives only invalid replies, or no reply at all, for 50 consecutive seconds, the master is marked as subjectively down)
2) After marking the master subjectively down, the Sentinel sends queries to the other Sentinels; once the number configured as the quorum agrees, the master is marked as objectively down.
sentinel monitor master xx.xx.xx.xx 2
Failover
Sentinel cluster failover:
1) A sentinel leader is elected to carry out the failover (a Raft-style election requiring more than half of the votes); in the Sentinel source the winner is discarded if it does not reach the vote threshold:
if (winner && (max_votes < voters_quorum || max_votes < master->quorum)) winner = NULL;
2) The leader sentinel picks one slave of the offline master and promotes it to be the new master.
3) All the other slaves of the offline master are reconfigured to replicate the new master.
4) The offline master is set as a slave of the new master, so that when it comes back online it replicates the new master.
Note: steps for picking the new master among the slaves:
1) Build the candidate list by removing all slaves that are offline or disconnected, all slaves that have not replied to the leader sentinel's INFO command within the last five seconds, and all slaves whose link to the offline master has been broken for more than down-after-milliseconds * 10 (ms).
2) From the remaining candidates, choose by slave priority (highest), then replication offset (largest), then run ID (smallest).
5.2.2 How does Codis perceive the failover actions of the Sentinel cluster?
Codis is split into the proxy cluster and the Redis cluster, and the high availability of the Redis cluster is guaranteed by the Sentinel cluster. So how does the proxy sense that a Redis master has failed and switch to the new master, so the service stays available?
As shown in the figure above, the proxy itself listens for the Sentinel cluster's +switch-master event, which indicates that a master in the Redis cluster has a problem and the Sentinel cluster has started electing and switching to a new master. After receiving the master-slave switch event, the proxy pulls, for every group, the master that each Sentinel currently believes in, and takes the master that more than half of the Sentinels agree on as the group's current master.
At this point you might notice a remaining problem: the configuration store. The configuration center still records the old master, so if the proxy restarted it would pull the failed master again. In fact the dashboard does the same thing as the proxy: after receiving the master-slave switch event, it persists the new master to the store (currently zk).
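The event itself arrives over Sentinel's pub/sub channel. The sketch below shows how a component could watch +switch-master, using the go-redis client purely for illustration (Codis has its own Sentinel-watching code); the Sentinel address is an assumption.

```go
package main

import (
	"context"
	"fmt"
	"strings"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	// Connect to a Sentinel instance (address assumed) and subscribe to the event.
	sentinel := redis.NewClient(&redis.Options{Addr: "127.0.0.1:26379"})
	sub := sentinel.Subscribe(ctx, "+switch-master")
	defer sub.Close()

	for msg := range sub.Channel() {
		// Payload format: "<master-name> <old-ip> <old-port> <new-ip> <new-port>"
		fields := strings.Fields(msg.Payload)
		if len(fields) == 5 {
			fmt.Printf("master %s moved from %s:%s to %s:%s; refresh routing and persist to zk\n",
				fields[0], fields[1], fields[2], fields[3], fields[4])
		}
	}
}
```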
5.2.3 Split-brain handling
Split-brain in a cluster is usually caused by some nodes in the cluster becoming unreachable from the others. When that happens, the separated partitions may each independently elect a master, leaving the original cluster with multiple masters; the result is a confused system and corrupted data.
On this topic simotang has already explained things very well in "Governance and practice of large-scale Codis clusters". Briefly: a Redis cluster cannot simply rely on a majority election, because a Redis master does not check its own health and demote itself, so we need an extra judgment of the master's health to assist the demotion. The concrete implementation is:
1) Reduce the probability of ending up with two masters by making the quorum judgment stricter and the offline judgment stricter. We deploy 5 Sentinel machines covering the IDCs of the major carriers, and a master is only considered down when 4 of them subjectively believe it is offline.
2) An isolated master demotes itself, based on checking a shared resource: an agent on the Redis server keeps checking whether zk is reachable. If the connection is lost, it sends a demotion instruction to the local Redis, which then refuses reads and writes, sacrificing availability to preserve consistency. A rough sketch of such an agent follows.
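A very rough sketch of that agent loop; checkZK and degradeRedis are hypothetical stand-ins for probing the ZooKeeper ensemble and telling the local Redis to stop serving, since the article does not spell out the exact commands used.

```go
package main

import (
	"log"
	"time"
)

// checkZK is a hypothetical probe of the zk ensemble.
func checkZK() bool { return true }

// degradeRedis is a hypothetical call that makes the local redis refuse reads and writes.
func degradeRedis() error { return nil }

func main() {
	failures := 0
	for range time.Tick(3 * time.Second) {
		if checkZK() {
			failures = 0
			continue
		}
		failures++
		if failures >= 3 { // several consecutive failures -> assume we are isolated
			log.Println("zk unreachable, degrading local redis to protect consistency")
			if err := degradeRedis(); err != nil {
				log.Println("degrade failed:", err)
			}
		}
	}
}
```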
VI. Details of horizontal expansion of Codis & handling of migration exceptions
Since Codis is a distributed solution for Redis, it inevitably faces horizontal expansion when a single Redis instance runs out of capacity. This section explains the details of Codis' horizontal expansion and of migration exception handling. Keep two questions in mind: first, how are reads and writes of a key that is being migrated handled during the migration; second, how are exceptions during the migration (such as failures and timeouts) handled?
6.1 Details of Codis expansion and migration
[Figure: migration process]
VIII. Operation and maintenance manual and pitfall-avoidance guide
Child process overhead
CPU
Monitoring and optimization:
Do not co-locate Redis with other CPU-intensive services, which would cause excessive CPU contention.
If multiple Redis instances are deployed on one machine, try to make sure only one child process is doing rewrite work at a time.
Forking costs roughly 20 ms per 1 GB of memory.
Memory
Background: the child process is created by fork and takes up as much memory as the parent process, so in theory twice the memory is needed to complete persistence. But Linux has a copy-on-write mechanism: parent and child share the same physical memory pages, and when the parent handles a write request it makes a copy of the page being modified, while the child sees the snapshot of the parent's memory taken at fork time.
Memory consumed by fork, as reported in the logs: "AOF rewrite: 53 MB of memory used by copy-on-write", "RDB: 5 MB of memory used by copy-on-write".
With transparent huge pages enabled, the copy-on-write page unit grows from 4 KB to 2 MB, which increases the burden of fork and slows down write operations, producing a large number of slow write queries, so disable it:
echo never > /sys/kernel/mm/transparent_hugepage/enabled (run as root)
Hard disk
Do not co-locate Redis with other services that put heavy load on the disk, such as storage services or message queues.
8.6 AOF persistence details
The common fsync strategy is everysec, which balances performance and data safety: Redis uses a separate thread to fsync the AOF to disk once per second. When the disk is busy, this can end up blocking the Redis main thread.
1) The main thread is responsible for writing the AOF buffer (source: flushAppendOnlyFile).
2) The AOF thread performs a disk sync once per second and records the time of the last sync.
3) The main thread compares the last AOF sync time:
If the last successful sync was within 2 seconds, the main thread returns directly.
If the last successful sync was more than 2 seconds ago, the main thread calls write(2) and blocks until the sync operation completes.
Note: with AOF persistence enabled, Redis calls write(2) after handling each event to append the change to the kernel buffer. If write(2) blocks, Redis cannot handle the next event. Linux specifies that while fdatasync(2) is running on the same file, flushing the kernel buffer to the physical disk, write(2) will be blocked, and the whole Redis is blocked with it.
Two problems follow from this AOF blocking behaviour:
1) The everysec configuration may lose up to 2 seconds of data, not 1 second.
2) If the system's fsync is slow, the Redis main thread blocks and throughput suffers.
Redis does try to save itself: when it finds that the file is still executing fdatasync(2), it skips the write(2) and only keeps the data in its buffer to avoid being blocked. But if this has gone on for more than two seconds, it forces the write(2) even though Redis will be blocked.
Log message: Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
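A toy model of that everysec flow, written from the description above rather than from the Redis source; it only illustrates why the data-loss window is 2 seconds rather than 1.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

var lastSync atomic.Int64 // unix nanoseconds of the last completed "fsync"

// aofThread stands in for the background thread that fsyncs once per second
// and records the time of the last successful sync.
func aofThread() {
	for range time.Tick(time.Second) {
		time.Sleep(10 * time.Millisecond) // pretend this is fsync(2); a busy disk makes it slow
		lastSync.Store(time.Now().UnixNano())
	}
}

// flushAppendOnlyBuffer mimics the main thread's check: if the last sync is
// recent enough (< 2s), just write the buffer and return; otherwise it would
// block on write(2) until the sync finishes.
func flushAppendOnlyBuffer() {
	age := time.Since(time.Unix(0, lastSync.Load()))
	if age < 2*time.Second {
		return // up to 2 seconds of buffered data can be lost on a crash
	}
	fmt.Println("fsync lagging by", age, "- main thread blocks on write(2)")
}

func main() {
	lastSync.Store(time.Now().UnixNano())
	go aofThread()
	for i := 0; i < 5; i++ {
		flushAppendOnlyBuffer()
		time.Sleep(500 * time.Millisecond)
	}
}
```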
8.7 Hand slipped and ran flushdb
If appendonly is set to no: quickly raise the RDB trigger thresholds so no new dump is written, then back up the rdb file; if the backup fails, run for it. If appendonly is set to yes: raise the AOF rewrite parameters auto-aof-rewrite-percentage and auto-aof-rewrite-min-size (or simply kill the process) so that Redis cannot trigger an automatic AOF rewrite, and refuse any manual bgrewriteaof. Back up the aof file, remove the flushdb command from the backed-up aof, and then restore from it. If it still cannot be restored, fall back to the cold backup.
8.8 Switching a live Redis from RDB mode to AOF mode
Never just edit the conf file and restart.
The correct way: back up the rdb file, enable AOF with config set, write the configuration back with config rewrite, then run bgrewriteaof to dump the in-memory data into the AOF file. A sketch of this sequence follows.
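The same sequence expressed with the go-redis client, purely as an illustration; the address is an assumption, and backing up the rdb file beforehand remains a manual step.

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "127.0.0.1:6379"}) // assumed address

	// Enable AOF at runtime without restarting.
	if err := rdb.ConfigSet(ctx, "appendonly", "yes").Err(); err != nil {
		panic(err)
	}
	// Persist the runtime change back into redis.conf.
	if err := rdb.ConfigRewrite(ctx).Err(); err != nil {
		fmt.Println("config rewrite failed (no config file?):", err)
	}
	// Rewrite the AOF so the current in-memory data set is written out to it.
	if err := rdb.BgRewriteAOF(ctx).Err(); err != nil {
		panic(err)
	}
	fmt.Println("AOF enabled and rewrite triggered")
}
```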
IX. Reference materials
Redis Development and Operations (Fu Lei)
The Design and Implementation of Redis (Huang Jianhong)
Governance and practice of large-scale Codis clusters (simotang)