Exploration and practice of distributed Cache Redis Cluster in Huatai Securities

This article is selected from the 30th issue of the Frontier of Trading Technology (March 2018)

Authors: Fan Jian, Chen Ying, Ge Baolei / Huatai Securities Co., Ltd.

Abstract: As the most popular open source distributed cache, what application scenarios does Redis Cluster have in the brokerage field? Starting from its current use at Huatai Securities, this article presents practical experience from running Redis Cluster at scale at Huatai Securities.

1. Introduction

Redis is an open source (BSD licensed) in-memory key-value storage system that can be used as a database, cache, and message broker. It supports many data structures, such as strings, hashes, lists, sets, and sorted sets with range queries. Redis has built-in replication, LRU eviction, transactions, and on-disk persistence, and provides high availability through Redis Sentinel (master-slave mode) and automatic partitioning (Redis Cluster mode).

Before the official Redis Cluster was released, the common open source Redis clustering solutions were Codis and Twemproxy, both of which distribute data through a proxy. Introducing a proxy layer hides the distribution of the underlying data and simplifies the client, but it makes the cluster architecture more complex and raises maintenance cost. Redis has supported automatic partitioning since version 3.0, implementing a Cluster mode with no central node. Accessing Redis Cluster requires no proxy: smart clients connect directly to each node in the cluster.

The advantages brought by Redis's introduction of Cluster mode are:

1. Reliability: partitioning, replication, and automatic failover mechanisms.

2. High performance: it scales linearly to around 1,000 nodes while keeping Redis's high throughput.

3. Scalability: automatic scale-out and scale-in based on partitions, with data migration that is transparent to clients.

Redis is now widely used in the Internet, finance, and traditional industries. As more financial business moves online, events, promotions, holidays, and breaking news can push traffic to sudden peaks of several times, or even dozens of times, the normal level. Redis Cluster is an effective way to absorb such bursts of access.

2. Basic principles and concepts

The overall design of Redis Cluster is relatively simple. The cluster has no central node, and its nodes exchange cluster state with one another through the Gossip protocol. Clients access the servers directly, without a proxy: the client computes the hash slot of a key with a hash function and then goes straight to the server node that owns that slot.

The topology of Redis Cluster is shown in the following figure:

Figure 1 Redis Cluster architecture diagram

Cluster construction:

Redis Cluster provides a set of commands for building and managing a cluster, such as CLUSTER INFO, CLUSTER NODES, and CLUSTER MEET. In practice, the command line tool redis-trib.rb makes it easy to build a cluster, rebalance hash slots, add and remove nodes, and so on.

Building a Redis Cluster takes only two steps: 1. Node preparation: deploy the compiled Redis to at least three servers, modify the configuration file, and start the Redis nodes. 2. Node handshake: run redis-trib.rb create host1:port1 ... hostN:portN to complete the node handshake and confirm the slot allocation. When a server hosts several Redis instances, take care to give each one its own port, working directory, and AOF and RDB file names. The number of replicas can be specified when the cluster is created, or slave nodes can be added to the cluster one by one afterwards.

Data distribution:

Redis Cluster shards data across 16384 hash slots, and each Master node is responsible for storing the data of a subset of those slots. Every key in Redis maps to exactly one hash slot; the cluster uses the formula CRC16(key) mod 16384 to compute the slot a key belongs to.
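
The slot formula can be sketched in a few lines of Python. This is a minimal illustration, not the Redis source: the function names are made up for this example, and it assumes the CRC16 variant is CRC16-CCITT (XMODEM), which Python's standard `binascii.crc_hqx` computes when started from 0. The hash tag rule (only the text inside the first non-empty `{...}` is hashed) comes from the Redis Cluster specification rather than from this article.

```python
from binascii import crc_hqx

HASH_SLOTS = 16384


def hashed_part(key: bytes) -> bytes:
    """Return the bytes that are actually hashed, applying the hash tag rule."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:  # a closing brace exists and the tag is not empty
            return key[start + 1:end]
    return key


def key_slot(key: str) -> int:
    """Compute CRC16(key) mod 16384 for a key given as text."""
    data = hashed_part(key.encode("utf-8"))
    # crc_hqx with an initial value of 0 is the CRC16-CCITT (XMODEM) variant.
    return crc_hqx(data, 0) % HASH_SLOTS


if __name__ == "__main__":
    for k in ("MD:XSHG:STOCKTYPE", "KLINE:1DAY:601688.SH", "TASK:QUEUE"):
        print(k, key_slot(k))
```

Keys that share a hash tag, for example `{user1000}.following` and `{user1000}.followers`, land in the same slot, which is what makes multi-key operations on them possible in Cluster mode.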

When a smart Redis client accesses the cluster, it first fetches and caches the mapping from hash slots to nodes, and then finds the node to contact by computing the hash slot of each key. To handle operations that change the slot mapping, such as cluster scaling and data migration, the Redis server has two redirection responses: MOVED and ASK. MOVED tells the client which node now owns the hash slot; ASK tells the client which node the hash slot is currently being migrated to.
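
The difference between the two redirections can be sketched as follows. This is a simplified model, not code from Jedis or any real client: `SlotCache` and `handle_redirect` are hypothetical names, and the error strings are assumed to have the form `MOVED <slot> <host:port>` or `ASK <slot> <host:port>`.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class SlotCache:
    """Client-side cache of hash slot -> "host:port" (hypothetical helper)."""
    slots: Dict[int, str] = field(default_factory=dict)

    def handle_redirect(self, error: str) -> Tuple[str, bool]:
        """Parse a redirection and return (target node, send ASKING first?)."""
        parts = error.split()
        if len(parts) != 3 or parts[0] not in ("MOVED", "ASK"):
            raise ValueError(f"not a redirection: {error!r}")
        kind, slot, node = parts
        if kind == "MOVED":
            # Permanent: the slot now lives on another node, so update the cache.
            self.slots[int(slot)] = node
            return node, False
        # ASK is temporary: retry this one request on the target node, preceded
        # by an ASKING command, and leave the cached mapping untouched.
        return node, True
```

After a MOVED the client sends later requests for that slot to the new node; after an ASK it keeps using the old node for later requests until the migration finishes and a MOVED arrives.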

Master-slave replication:

Redis Cluster nodes replicate asynchronously (Asynchronous Replication), which cannot guarantee strong consistency. Data can be lost in the following scenario: a write completes on the master node and the master replies success to the client, but before the change has been replicated to the slave node the master fails (crash, failover, etc.) and the slave is promoted to master. In addition, while the client's routing table is being updated, master and slave roles may switch twice in quick succession, so data may still be written to the wrong node and lost.

Although Redis master-slave replication cannot guarantee strong consistency, data inconsistencies are unlikely as long as no master-slave switch occurs. In production, master-slave switches are rare, but business systems should still be able to tolerate the loss of cached data.
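
The first loss scenario can be reproduced with a toy model. The sketch below is purely illustrative: the `Master` class, the key names, and the explicit `replicate_to` step are assumptions that stand in for Redis's real replication stream.

```python
class Master:
    """Toy master that acknowledges writes before replicating them."""

    def __init__(self):
        self.data = {}
        self.unreplicated = []  # changes acknowledged but not yet sent to the slave

    def set(self, key, value):
        self.data[key] = value
        self.unreplicated.append((key, value))
        return "OK"  # the client is acknowledged before replication happens

    def replicate_to(self, slave):
        for key, value in self.unreplicated:
            slave[key] = value
        self.unreplicated.clear()


master, slave = Master(), {}
master.set("order:1", "filled")
master.replicate_to(slave)               # this write reaches the slave
ack = master.set("order:2", "filled")    # acknowledged to the client...
# ...but the master fails here, before the next replication round.
promoted = slave                         # failover: the slave becomes the new master
lost = [key for key in master.data if key not in promoted]
print(ack, lost)
```

The client saw a successful reply for `order:2`, yet the promoted node has never heard of it, which is why the business system has to tolerate lost cache entries.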

Fault detection:

Each node in Redis Cluster keeps a set of flags for every other node it knows about. Two of these flags are used for failure detection: PFAIL and FAIL. When a node cannot reach another node for longer than NODE_TIMEOUT, it flags that node as PFAIL, meaning a possible failure; when a majority of master nodes confirm that the node is unreachable, it is flagged as FAIL, meaning it has failed.

Each node periodically sends Gossip messages to other nodes, including the state of some randomly chosen known nodes, so every node eventually learns the flags that other nodes have set. When a master node is flagged as FAIL, one of its slave nodes needs to be promoted to master.
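
The escalation from PFAIL to FAIL is essentially a majority count over master nodes. The sketch below is a simplified model under stated assumptions: `failure_flag` and the node names are hypothetical, reports are passed in as a ready-made set (the observer's own opinion included if it is a master), and the ageing out of old reports that real nodes perform is omitted.

```python
def failure_flag(timed_out_locally: bool, reports: set, masters: set) -> str:
    """Flag an observer would hold for a target node.

    timed_out_locally: the observer itself could not reach the target
                       within NODE_TIMEOUT.
    reports:           nodes known (via Gossip) to consider the target unreachable.
    masters:           all master nodes in the cluster.
    """
    if not timed_out_locally:
        return "OK"
    votes = len(reports & masters)  # only reports from masters count
    if votes > len(masters) // 2:   # strict majority of masters
        return "FAIL"
    return "PFAIL"


masters = {"m1", "m2", "m3"}
print(failure_flag(True, {"m1"}, masters))        # one master is not a majority
print(failure_flag(True, {"m1", "m2"}, masters))  # two of three masters agree
```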

Failover:

When a master node that serves at least one slot enters the FAIL state, its slave nodes can automatically start an election. A slave node that receives votes from a majority of master nodes is promoted to be the new master. In addition, Redis Cluster provides the manual failover command CLUSTER FAILOVER for use in operations and maintenance.

3. Background and current deployment of Redis Cluster at Huatai Securities

In 2015, as Huatai Securities invested heavily in in-house development of Internet finance and faced large numbers of concurrent users, it urgently needed a unified, service-oriented distributed cache platform.

After verifying and comparing open source Redis clustering solutions such as Redis Cluster, Codis, and Twemproxy, and weighing performance, ease of maintenance, and high availability, we chose the Cluster mode of Redis 3.2.0 as the company-wide cache solution. Redis Cluster enjoys continuous support from the open source community, and its features keep improving. By contrast, the Codis and Twemproxy communities are less active, their maintenance costs are relatively high, and their throughput is slightly lower than that of Redis Cluster.

Huatai Securities currently runs several Redis Cluster resource pools on more than 20 cluster servers. During trading hours, peak access exceeds 200,000 requests per second, serving more than 30 application systems, including Market Center, Zengle Fortune, and the Internet User Center. The clusters are used for caching, distributed locks, in-memory storage, task queues, and similar business scenarios.

4. Practical experience

(1) highly available multi-active architecture

As shown in Figure 2, Redis Cluster data nodes are deployed in three data centers in the same city: two data centers host an equal number of machines and the third hosts a single machine. To speed up full resynchronization of slave nodes, the hosts use 10-gigabit network cards. To keep cache access latency low enough, traffic between data centers runs over a dedicated 10-gigabit wavelength-division channel.

Figure 2 Redis Cluster deployment architecture diagram

In the actual deployment, the distribution of Master nodes has to be adjusted so that every data center holds fewer than half of the cluster's Master nodes. With this architecture, if a single data center fails, the other data centers take over automatically and the business systems switch over without noticing.
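
The placement rule is easy to check mechanically before a deployment change. The sketch below is an illustration only: the function name, node names, and data center labels are made up, and the input is assumed to be a mapping from each Master node to the data center that hosts it.

```python
from collections import Counter


def survives_single_dc_loss(master_locations: dict) -> bool:
    """True if every data center hosts fewer than half of the Master nodes."""
    total = len(master_locations)
    per_dc = Counter(master_locations.values())
    # count * 2 < total means "strictly fewer than half", so losing any one
    # data center still leaves a majority of masters to vote on failover.
    return all(count * 2 < total for count in per_dc.values())


layout = {"m1": "dc-a", "m2": "dc-a", "m3": "dc-b", "m4": "dc-b", "m5": "dc-c"}
print(survives_single_dc_loss(layout))
```

With five masters split 2/2/1, as in the three-data-center layout described above, any single data center can be lost while a majority of masters remains.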

(2) tuning at Java client level

1. Use Jedis 2.8.x or later, and disable testOnBorrow and testOnReturn.

2. Use Jedis's JedisPoolConfig, a GenericObjectPoolConfig with tuned default settings.

3. Use O(N) operations such as HGETALL and SMEMBERS sparingly.

(3) optimization at the server level

1. Rename time-consuming and dangerous commands such as KEYS, FLUSHALL, and FLUSHDB.

2. Moderately increase client-output-buffer-limit for the slave class to avoid unnecessary full resynchronization of slave nodes.

3. Moderately increase repl-backlog-size and repl-backlog-ttl. The larger the values, the longer a slave can stay disconnected and still resynchronize incrementally.

4. Enable AOF and disable RDB to reduce stalls caused by the server's fork operations.

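5. Depending on the actual scenario, set cluster-require-full-coverage to no to reduce the time the cluster is unavailable; with the default value yes, the whole cluster stops accepting queries as soon as any hash slot is uncovered.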

(4) functional limitations of Redis Cluster

Redis Cluster is a distributed Redis implementation with a degree of fault tolerance and linear scalability, at the cost of the following limitations:

1. The SELECT command cannot be used, and multi-key operations such as MSET and SUNION are not supported when the keys fall in different slots.

2. Publish/subscribe is not recommended: published messages are forwarded to every node, so the larger the cluster, the more network traffic is generated.

3. Applications written for Redis master-slave mode need minor changes to their client code before they can be upgraded to Cluster mode.
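
One common way to live with the multi-key restriction is to split a batch of keys by slot on the client and issue one command per group. The sketch below is an assumption-laden illustration, not part of any client library: `split_by_slot` is a hypothetical helper, the slot is computed as plain CRC16(key) mod 16384 with `binascii.crc_hqx`, and hash tags are ignored for brevity.

```python
from binascii import crc_hqx


def slot(key: str) -> int:
    # Plain CRC16(key) mod 16384; hash tags are ignored in this sketch.
    return crc_hqx(key.encode("utf-8"), 0) % 16384


def split_by_slot(keys):
    """Group keys so that each group can be sent as one multi-key command."""
    groups = {}
    for key in keys:
        groups.setdefault(slot(key), []).append(key)
    return groups


batches = split_by_slot(["foo", "bar", "foo"])
for slot_id, group in batches.items():
    print(slot_id, group)
```

A multi-key command is only safe to send as-is when `split_by_slot` returns a single group; otherwise each group becomes its own request to the node that owns the slot.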

(5) follow-up and version update

Open source middleware inevitably has bugs and performance problems. Solutions to most of them can be found in the open source community, so actively following community discussions is an effective way to avoid production incidents. In practice we ran into several Redis bugs, all of which already had fixes in the community. We have upgraded some Redis nodes in production to version 3.2.7, mainly because of the following problems:

1. After a slave node synchronizes a ziplist, the list index is updated incorrectly, causing the slave node to crash.

2. When the length of members in a ziplist grows, the list index is updated incorrectly, causing AOF rewriting to fail on both master and slave nodes and leaving behind a large number of temporary files.

(6) continuous follow-up

The PSYNC mechanism was introduced in Redis 2.8.0. PSYNC adds a backlog buffer that caches the incremental data changes made while a slave node is disconnected. When the slave node reconnects and the backlog has not overflowed, it can catch up from the backlog, which avoids the full SYNC that was previously required after every reconnection.
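
The decision between a partial and a full resynchronization comes down to whether the slave's offset still falls inside the backlog window. The sketch below is a simplified model, not Redis's implementation: the class and method names are invented, offsets are counted in bytes from 0, and `size` plays the role of repl-backlog-size.

```python
class ReplicationBacklog:
    """Fixed-size window over the master's replication stream (simplified)."""

    def __init__(self, size: int):
        self.size = size          # plays the role of repl-backlog-size, in bytes
        self.master_offset = 0    # total bytes of changes produced so far

    def feed(self, nbytes: int) -> None:
        """Record that the master produced nbytes of new changes."""
        self.master_offset += nbytes

    def first_available_offset(self) -> int:
        # Only the most recent `size` bytes are still held in the backlog.
        return max(0, self.master_offset - self.size)

    def resync_mode(self, slave_offset: int) -> str:
        """Return "partial" if the slave can continue from its offset, else "full"."""
        if self.first_available_offset() <= slave_offset <= self.master_offset:
            return "partial"
        return "full"


backlog = ReplicationBacklog(size=100)
backlog.feed(50)                    # the slave is in sync at offset 50, then disconnects
backlog.feed(80)                    # 80 bytes of changes while it is away
print(backlog.resync_mode(50))      # still inside the window
backlog.feed(100)                   # the window moves past offset 50
print(backlog.resync_mode(50))      # the backlog has overflowed for this slave
```

This is also why enlarging the backlog lets a slave stay disconnected longer: the window covers more of the change history before the slave's offset drops out of it.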

PSYNC effectively handles brief slave disconnections caused by network jitter, but it cannot avoid a full data synchronization when master and slave nodes successively suffer network disconnections, restarts, or process exits. Redis 4.0 introduces new features such as PSYNC2 to address these problems. Redis 4.0 is still in our verification stage and needs continued testing and close attention.

5. Typical scenarios

Compared with other open source in-memory key-value stores, Redis supports richer data types. Common value types include strings, hashes, lists, sets, and sorted sets, and Redis has built-in operations for these data structures. Redis is now widely used; common usage scenarios include caching hot data, counters, queues, distributed locks, leaderboards, news lists, and comments. Redis Cluster is also widely used in the newly built information systems of Huatai Securities. Some of these scenarios are as follows:

Market cross section

Some application scenarios may need to get the latest quotes of a market or multiple stocks, which can be achieved by using Redis's Hash structure. The sample code is as follows:

Add or update the price of a stock

HSET MD:XSHG:STOCKTYPE "601688.SH" 17.88

Get the latest quotes of several stocks

HMGET MD:XSHG:STOCKTYPE "601688.SH" "601689.SH"

Get the latest quotes of all stocks on an exchange. HGETALL is an O(N) operation, so calling it frequently is not recommended.

HGETALL MD:XSHG:STOCKTYPE

K line

The most common K lines are the daily K line and the minute K line. Taking the daily K line as an example, it can be implemented with a Redis sorted set. The example code is as follows:

Add the K line of a stock on March 1, 2018

ZADD KLINE:1DAY:601688.SH 20180301 kline_value

Get the K lines of a stock over a range of days

ZRANGEBYSCORE KLINE:1DAY:601688.SH 20180301 20180302

Task queue

The task queue is used to transfer tasks between the producers and consumers of the task, realizing the loose coupling between the generation of the task and the task execution module. The example code is as follows:

The producer generates a task task01

RPUSH TASK:QUEUE "task01"

The consumer blocks for up to 100 seconds waiting for a task. BLPOP is the blocking version of LPOP.

BLPOP TASK:QUEUE 100

6. Future planning

As the business keeps growing, Redis Cluster has become a core component inside Huatai Securities. Going forward, we will focus on building a PaaS platform, strengthening automated disaster recovery for the clusters, establishing a tiered service-assurance system, and managing key businesses separately. The latest Redis 4.0.x fixes a problem seen in Redis 3.2.x, where under severe network jitter the master and slave nodes of some shards occasionally both become unavailable. Verifying the stability of Redis 4.0.x as soon as possible and drawing up an effective upgrade plan will also be a priority of our future work.
