Case Analysis: vivo's High-Availability Architecture Based on Native RabbitMQ

2025-07-09 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article analyzes how vivo built a high-availability architecture on top of native RabbitMQ. It is offered here as a practical reference.

I. Background

vivo introduced RabbitMQ in 2016 and extended the open-source RabbitMQ to provide message middleware services to its businesses.

From 2016 to 2018, all businesses shared a single cluster. As business scale grew, the cluster load became heavier and heavier, and cluster failures occurred frequently.

In 2019, RabbitMQ entered the high-availability construction phase: the high-availability component MQ name service was completed, along with the same-city active-active deployment of the RabbitMQ clusters.

At the same time, the clusters used by businesses were physically split, and businesses were assigned to clusters and dynamically rebalanced strictly according to cluster load and business traffic.

Since the high-availability work in 2019, business traffic has grown tenfold and the clusters have had no serious failure.

RabbitMQ is an open-source message broker that implements the AMQP protocol, which originated in financial systems.

It has a rich set of features:

Message reliability. RabbitMQ ensures reliable delivery through publisher confirms, reliable storage within the cluster through clustering, message persistence, and mirrored queues, and reliable consumption through consumer acknowledgements.

RabbitMQ provides clients in multiple languages.

Multiple exchange types are provided; messages sent to the cluster are routed to specific queues by the exchange.

RabbitMQ provides a full-featured management UI and management API; the management API makes it quick to integrate with a self-built monitoring system.

Problems found with RabbitMQ in practice:

To ensure business high availability, multiple clusters are used for physical isolation, but there is no unified platform for managing them.

Native RabbitMQ clients connect by cluster address. With multiple clusters, businesses have to keep track of cluster addresses, which causes confusion.

Native RabbitMQ only has simple username/password verification and does not authenticate the business application using it. Different businesses can easily mix up exchange/queue information, causing business applications to malfunction.

Many business applications use the service, but no platform maintains the relationship between message senders and consumers, so after several version iterations the counterpart can no longer be identified.

The client has no rate limiting, so a sudden burst of abnormal business traffic can hit or even bring down the cluster.

The client has no resend policy for failed messages; users have to implement it themselves.

When a cluster is blocked because memory reaches its limit, traffic cannot be moved quickly and automatically to another available cluster.

With mirrored queues, a queue's master lands on one specific node. When the cluster has many queues, node load easily becomes unbalanced.

RabbitMQ cannot rebalance queues automatically, so node load easily becomes uneven when there are many queues.

II. Overall architecture

1. MQ-Portal: application onboarding requests

Previously, when a business team requested RabbitMQ, information such as the application's traffic and the applications it interfaced with was recorded offline in spreadsheets. The records were scattered and not updated in time, so it was impossible to know accurately how the business was really using RabbitMQ. A visual, platform-based onboarding process was therefore introduced to build up the metadata on application usage.

Through the MQ-Portal request process (as shown in the figure above), the sending application, the consuming application, the exchange/queue to use, the send traffic, and other information are determined, and the request is submitted to vivo's internal ticket workflow for approval.

Once the ticket is approved, the ticket's callback API assigns the cluster the application will use and creates the exchange/queue binding on that cluster.

Because the production environment uses physically isolated clusters to ensure business high availability, the cluster in use cannot be located simply from an exchange/queue name.

Each exchange/queue is associated with its cluster through a unique pair of rmq.topic.key and rmq.secret.key, so that the specific cluster can be located while the SDK starts up.

rmq.topic.key and rmq.secret.key are assigned in the ticket's callback API.

2. Overview of client SDK capabilities

The client SDK is a wrapper built on spring-messaging and spring-rabbit. On top of them it adds application authentication, cluster addressing, client rate limiting, producer/consumer reset, blocking transfer, and other capabilities.

2.1. Application usage authentication

Open-source RabbitMQ decides whether a client may connect to the cluster only by username and password; it does not verify whether the application is allowed to use a given exchange/queue.

To prevent different businesses from mixing up exchanges/queues, the application's usage must be authenticated.

Application authentication is done jointly by the SDK and MQ-NameServer.

When the application starts, it first reports its configured rmq.topic.key to MQ-NameServer, which checks whether the application matches the one named in the onboarding request. The SDK checks again when it sends a message.

```java
/**
 * Check before sending and return the producer factory that actually sends,
 * so that a service can declare several producers but use any one bean to
 * send all messages without causing an exception.
 *
 * @param exchange the exchange to check
 * @return the producer factory to send with
 */
public AbstractMessageProducerFactory beforeSend(String exchange) {
    if (closed || stopped) {
        // The context is already closed: refuse to send, so that less data
        // is sent while in this critical state.
        throw new RmqRuntimeException(String.format(
                "producer sending message to exchange %s has closed, can't send message",
                this.getExchange()));
    }
    if (exchange.equals(this.exchange)) {
        return this;
    }
    if (!VIVO_RMQ_AUTH.isAuth(exchange)) {
        throw new VivoRmqUnAuthException(String.format(
                "send topic check exception, do not send data to unauthorized exchange %s, send failed",
                exchange));
    }
    // Return the bean that really owns this exchange, to avoid mis-sending.
    return PRODUCERS.get(exchange);
}
```

2.2. Cluster addressing

As mentioned earlier, applications are assigned to clusters strictly according to cluster load and business traffic, so the different exchanges/queues used by one application may sit on different clusters.

To improve business development efficiency, the existence of multiple clusters needs to be hidden from the business, so the SDK automatically addresses the cluster according to the rmq.topic.key configured by the application.
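A minimal sketch of what such key-based addressing could look like, assuming an in-memory registry. The class `ClusterAddressResolver`, its methods, and the sample addresses are hypothetical and not taken from vivo's SDK; only the two key names come from the article.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry that maps an rmq.topic.key to the cluster hosting its exchange/queue. */
final class ClusterAddressResolver {

    /** What a name service might store for one topic key (assumed shape). */
    private static final class Route {
        final String secretKey;
        final String clusterAddresses;

        Route(String secretKey, String clusterAddresses) {
            this.secretKey = secretKey;
            this.clusterAddresses = clusterAddresses;
        }
    }

    private final Map<String, Route> routes = new ConcurrentHashMap<>();

    /** Records the key pair and cluster assigned to an exchange/queue. */
    void register(String topicKey, String secretKey, String clusterAddresses) {
        routes.put(topicKey, new Route(secretKey, clusterAddresses));
    }

    /** Returns the cluster addresses only if both keys match a registered route. */
    Optional<String> resolve(String topicKey, String secretKey) {
        Route route = routes.get(topicKey);
        if (route == null || !route.secretKey.equals(secretKey)) {
            return Optional.empty();
        }
        return Optional.of(route.clusterAddresses);
    }
}
```

With this shape, the business code only ever configures the key pair; the address string returned by `resolve` is what a connection factory would be pointed at.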

2.3. Client rate limiting

The native SDK client does not limit send traffic. When an application keeps sending messages to MQ abnormally, the MQ cluster may be brought down. A cluster is usually shared by several applications, so the impact of one application affects every application that uses the affected cluster.

The SDK therefore needs to provide client-side rate limiting, so that when necessary an application can be restricted from sending messages to the cluster, keeping the cluster stable.
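The article does not say which algorithm the SDK uses to limit sending. One common choice is a token bucket, sketched below under that assumption; `SendRateLimiter` and its parameters are hypothetical names. The clock is injected so the behaviour can be checked without real waiting.

```java
import java.util.function.LongSupplier;

/** Hypothetical token-bucket limiter a client could consult before each send. */
final class SendRateLimiter {
    private final long capacity;         // maximum burst size, in messages
    private final double refillPerNano;  // tokens added per elapsed nanosecond
    private final LongSupplier clock;    // nanosecond clock, e.g. System::nanoTime
    private double tokens;
    private long lastRefillNanos;

    SendRateLimiter(long permitsPerSecond, long capacity, LongSupplier clock) {
        this.capacity = capacity;
        this.refillPerNano = permitsPerSecond / 1_000_000_000.0;
        this.clock = clock;
        this.tokens = capacity;              // start with a full bucket
        this.lastRefillNanos = clock.getAsLong();
    }

    /** Returns true if one message may be sent now, false if the send should be rejected or delayed. */
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        // Refill in proportion to the elapsed time, never above the capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) * refillPerNano);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

A producer would call `tryAcquire()` before publishing and fail fast (or queue locally) when it returns false, so a misbehaving application is throttled on its own side instead of at the broker.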

2.4. Producer and consumer reset

(1) As business scale grows, cluster load keeps increasing, so the businesses on a cluster need to be split. To avoid restarting businesses during the split, a producer/consumer reset function is needed.

(2) If a cluster has an exception, consumers may go offline. Producer/consumer reset can then quickly bring business consumption back up.

To reset producers and consumers, the following steps are needed:

Reset the connection factory's connection parameters

Reset the connection

Establish a new connection

Restart producers and consumers

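```java
import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
import org.springframework.amqp.rabbit.core.RabbitAdmin;
import org.springframework.amqp.rabbit.core.RabbitTemplate;

// Point a connection factory at the new cluster address and drop any cached connection.
CachingConnectionFactory connectionFactory = new CachingConnectionFactory();
connectionFactory.setAddresses(address);
connectionFactory.resetConnection();
// Rebuild the admin and template objects on top of the connection factory.
rabbitAdmin = new RabbitAdmin(connectionFactory);
rabbitTemplate = new RabbitTemplate(connectionFactory);
```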

MQ-SDK also has a resend policy for failed messages, which avoids failed deliveries caused by a producer reset.
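The article does not describe the resend policy itself. As an illustration only, the sketch below assumes a bounded number of attempts with doubling backoff; `ResendPolicy` and its parameters are hypothetical.

```java
import java.util.function.Supplier;

/** Hypothetical resend policy: retry a failed send a bounded number of times with doubling backoff. */
final class ResendPolicy {
    private final int maxAttempts;
    private final long initialBackoffMillis;

    ResendPolicy(int maxAttempts, long initialBackoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.maxAttempts = maxAttempts;
        this.initialBackoffMillis = initialBackoffMillis;
    }

    /** Runs the send action and retries on a runtime exception; rethrows the last failure. */
    <T> T sendWithRetry(Supplier<T> sendAction) {
        RuntimeException last = null;
        long backoff = initialBackoffMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return sendAction.get();
            } catch (RuntimeException e) {
                last = e;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoff);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break; // stop retrying once the thread is interrupted
                }
                backoff *= 2;
            }
        }
        throw last;
    }
}
```

During a producer reset the first attempt may hit a connection that is being closed; a later attempt then goes out over the new connection. A real policy would also need to decide what happens to messages that still fail after the last attempt.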

2.5. Blocking transfer

RabbitMQ blocks publishers when memory usage exceeds the memory high watermark (40% here), or when free disk space falls below the configured limit.

Since the vivo middleware team has completed the same-city active-active build-out of RabbitMQ, when a cluster blocks sending, producers and consumers can be reset to the active-active peer cluster, which moves traffic away from the blocked cluster quickly.
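The switch itself can be pictured as a small state holder that triggers a reset whenever the target cluster changes. This is a sketch under assumptions: `BlockingFailover`, the `resetAction` hook, and the automatic switch back once the primary is unblocked are all hypothetical and not described in the article.

```java
import java.util.function.Consumer;

/** Hypothetical holder that moves traffic to the peer cluster while the primary is blocked. */
final class BlockingFailover {
    private final String primaryAddresses;
    private final String peerAddresses;
    private final Consumer<String> resetAction; // assumed hook: resets producers/consumers to the given addresses
    private String activeAddresses;

    BlockingFailover(String primaryAddresses, String peerAddresses, Consumer<String> resetAction) {
        this.primaryAddresses = primaryAddresses;
        this.peerAddresses = peerAddresses;
        this.resetAction = resetAction;
        this.activeAddresses = primaryAddresses;
    }

    /** Call when the client learns that the primary cluster became blocked or unblocked. */
    synchronized void onPrimaryBlocked(boolean blocked) {
        String target = blocked ? peerAddresses : primaryAddresses;
        if (!target.equals(activeAddresses)) {
            activeAddresses = target;
            resetAction.accept(target); // reset only when the target actually changes
        }
    }

    synchronized String activeAddresses() {
        return activeAddresses;
    }
}
```

Resetting only on a change of target keeps repeated "blocked" notifications from tearing down healthy connections to the peer cluster.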

2.6. Multi-cluster scheduling

As an application grows, a single cluster can no longer meet its traffic needs. Because all queues in a cluster are mirrored queues, a single cluster cannot scale the traffic it supports horizontally simply by adding nodes.

The SDK therefore needs to support multi-cluster scheduling, meeting the needs of high-traffic businesses by distributing traffic across multiple clusters.
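How the SDK distributes traffic is not specified in the article. The simplest possible policy is round-robin, sketched here as an assumption; `MultiClusterScheduler` is a hypothetical name, and a real scheduler might weight clusters by load instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical round-robin scheduler that spreads sends over several clusters. */
final class MultiClusterScheduler {
    private final List<String> clusters;
    private final AtomicLong counter = new AtomicLong();

    MultiClusterScheduler(List<String> clusters) {
        if (clusters.isEmpty()) {
            throw new IllegalArgumentException("at least one cluster is required");
        }
        this.clusters = new ArrayList<>(clusters); // defensive copy
    }

    /** Picks the cluster for the next message in round-robin order. */
    String next() {
        // floorMod keeps the index non-negative even if the counter overflows.
        int index = (int) Math.floorMod(counter.getAndIncrement(), (long) clusters.size());
        return clusters.get(index);
    }
}
```

Note that spreading one exchange's traffic over several clusters also requires consumers on every cluster, which this sketch does not cover.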

3. MQ-NameServer: fast failover support for MQ-SDK

MQ-NameServer is a stateless service made highly available through clustered deployment. It mainly solves the following problems:

Authenticating MQ-SDK at startup and locating the cluster an application uses.

Handling MQ-SDK's periodic metrics reports (number of messages sent, number of messages consumed) and returning the currently available cluster address, so that the SDK reconnects to the correct address when a cluster is abnormal.

Controlling MQ-SDK to reset producers and consumers.

4. MQ-Server high availability deployment practice

RabbitMQ clusters use a same-city active-active deployment architecture and rely on the cluster addressing and fast failover capabilities of MQ-SDK and MQ-NameServer to ensure availability.

4.1. Handling cluster split-brain

RabbitMQ officially offers three strategies for recovering from cluster split-brain (network partitions).

(1) ignore

Split-brain is ignored and nothing is done; manual intervention is needed to recover when it occurs. Because manual intervention is needed, some messages may be lost. This can be used when the network is very reliable.

(2) pause_minority

When a node finds that it can reach no more than half of the cluster's nodes (counting itself), it pauses automatically until it detects that it can communicate with a majority again. In extreme cases, all nodes in the cluster pause and the cluster becomes unavailable.

(3) autoheal

The nodes on the minority side restart automatically. This strategy prioritizes service availability over data reliability, because messages on the restarted nodes are lost.

Since all RabbitMQ clusters are deployed as same-city active-active pairs, even if a single cluster becomes abnormal, business traffic can be migrated automatically to the cluster in the peer data center. The pause_minority strategy was therefore chosen to avoid split-brain.

In 2018, network jitter caused cluster split-brain many times. After the recovery strategy was changed, split-brain no longer occurred.
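The majority rule behind pause_minority is simple arithmetic, and working it through shows why cluster size matters. The helper below is only an illustration of that rule; `PartitionMath` and its method names are made up for this sketch.

```java
/** Illustrative arithmetic for the pause_minority rule; not RabbitMQ code. */
final class PartitionMath {

    /**
     * A node keeps running only if it sees a strict majority.
     * reachableNodes counts the node itself plus every peer it can still reach.
     */
    static boolean staysRunning(int reachableNodes, int clusterSize) {
        return 2 * reachableNodes > clusterSize;
    }

    /** Largest number of nodes that can be lost while the rest still form a majority. */
    static int tolerableFailures(int clusterSize) {
        return (clusterSize - 1) / 2;
    }
}
```

For a 2-node cluster, a partition leaves each side with 1 of 2 nodes, which is not a majority, so both nodes pause. A 3-node cluster tolerates the loss of 1 node, a 5-node cluster 2, and a 7-node cluster 3; an even-sized cluster tolerates no more than the odd size just below it.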

4.2. Cluster high-availability scheme

RabbitMQ is deployed as clusters, and because the split-brain recovery strategy is pause_minority, each cluster needs at least 3 nodes.

A highly available cluster of 5 or 7 nodes is recommended, with the number of queues in the cluster kept under control.

All cluster queues are mirrored queues, so messages are replicated and node failures do not cause message loss.

Exchanges, queues, and messages are all set to be persistent, so that messages are not lost when a node restarts abnormally.

Queues are set to lazy queues to reduce fluctuations in node memory usage.

4.3. Same-city active-active deployment

Equivalent clusters are deployed in two data centers, and the two clusters are joined into a federated cluster through the Federation plugin.

Application machines preferentially connect to the MQ cluster in their own data center, to avoid application errors caused by jitter on the dedicated line between data centers.

The latest available cluster information is obtained through the MQ-NameServer heartbeat, and when an exception occurs the application reconnects to the active-active peer cluster so that application functions recover quickly.
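The "local data center first, peer on failure" rule could be sketched as follows. This is an assumption-laden illustration: `DataCenterAwareSelector`, the data center names, and the idea that the heartbeat reply boils down to a set of healthy data centers are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical selector: prefer the cluster in the local data center, else any healthy peer. */
final class DataCenterAwareSelector {
    private final String localDataCenter;
    private final Map<String, String> addressesByDataCenter;

    DataCenterAwareSelector(String localDataCenter, Map<String, String> addressesByDataCenter) {
        this.localDataCenter = localDataCenter;
        this.addressesByDataCenter = new LinkedHashMap<>(addressesByDataCenter);
    }

    /**
     * @param healthy data centers whose cluster the latest heartbeat reported as available
     * @return the addresses to connect to, or empty if no cluster is available
     */
    Optional<String> select(Set<String> healthy) {
        String local = addressesByDataCenter.get(localDataCenter);
        if (local != null && healthy.contains(localDataCenter)) {
            return Optional.of(local); // stay inside the local data center when possible
        }
        for (Map.Entry<String, String> entry : addressesByDataCenter.entrySet()) {
            if (healthy.contains(entry.getKey())) {
                return Optional.of(entry.getValue()); // fall back to a healthy peer
            }
        }
        return Optional.empty();
    }
}
```

Re-running `select` on every heartbeat reply gives the SDK a single place to decide when to stay local and when to reconnect to the peer.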

III. Challenges and outlook

At present, the enhancements to RabbitMQ are mainly on the MQ-SDK and MQ-NameServer side, and the SDK implementation is fairly complex. Later, we hope to build a proxy layer for the message middleware, which would simplify the SDK and allow finer-grained management of business traffic.
