This article explains in detail how to carry out a high-availability practice for disaster recovery with MSHA and Chaos. The content is practical, so the editor shares it here for your reference; I hope you will come away with a solid understanding of the topic.
Preface
Because of the complexity of the external environment and the unreliability of hardware, the high availability of Internet services faces great challenges, and there have been many cases in which the services of major Internet companies became unavailable because of network outages, power failures, and similar accidents. At the small end, an outage causes economic losses and damages a company's reputation; at the large end, for national-scale applications such as WeChat and Alipay, it affects the national economy and people's livelihood. In the face of unavoidable natural and man-made disasters, building a disaster recovery architecture has become an urgent need for digital enterprises.
In December 2020, the Alibaba Cloud application high-availability product AHAS (Application High Availability Service) released a new functional module, AHAS-MSHA, a multi-active disaster recovery architecture that evolved out of Alibaba's e-commerce business. In this article, we first introduce several important concepts in the disaster recovery field, and then use an e-commerce microservice case to share how AHAS's multi-active capability (AHAS-MSHA) and chaos engineering capability (AHAS-Chaos) can help a business build a disaster recovery architecture.
Disaster recovery and its evaluation metrics
1. What is disaster recovery?
Disaster recovery (disaster tolerance) means establishing two or more systems with the same functionality in geographically distant locations; these systems monitor each other's health and can switch roles between them. When one system stops working because of an accident (fire, flood, earthquake, human sabotage, and so on), the whole application can switch to the other site so that its functions continue to run normally.
2. How to evaluate disaster recovery capability?
The main purpose of a disaster recovery system is to keep the business running when a disaster strikes. How, then, do we evaluate and quantify disaster recovery capability? Here we need to introduce the evaluation metrics commonly used in the industry:
RPO (Recovery Point Objective)
That is, the recovery point objective for data, expressed in time: the point in time to which systems and data must be recovered after a disaster. RPO marks the maximum data loss the system can tolerate; the less data loss the business can tolerate, the smaller the RPO must be. For example, if data is replicated to the standby site every 5 minutes, up to 5 minutes of writes can be lost, so the achievable RPO is 5 minutes.
RTO (Recovery Time Objective)
That is, the recovery time objective, expressed in time: how long an information system or business function may remain down after a disaster before it must be restored. RTO marks the maximum service outage the system can tolerate; the more urgent the service, the smaller the RTO must be. For example, a service that must be back within 10 minutes of a disaster has an RTO of 10 minutes.
AHAS-MSHA
1. Introduction
MSHA (Multi-Site High Availability) is a multi-active disaster recovery architecture solution (solution = technical product + consulting services + ecosystem partners). It decouples business recovery from fault recovery, supports rapid business recovery in failure scenarios, and helps enterprises build stability through disaster recovery.
1) Product Architecture
MSHA adopts a multi-region multi-active disaster recovery architecture whose core idea is "isolation and redundancy". We call each redundant logical data center a unit. MSHA closes business traffic within a unit and isolates units from each other, confining the blast radius of a fault to a single unit. This not only solves the disaster recovery problem and improves business continuity, but also enables capacity expansion.
2) Comparison of mainstream disaster recovery architectures
2. Functional characteristics
Rapid fault recovery
Adhering to the principle of "recover first, locate later", MSHA provides disaster recovery traffic-switching capability that, on the premise of protecting data, decouples business recovery time from fault recovery time to ensure business continuity.
Cross-region capacity expansion
Rapid business growth is constrained by the limited resources of a single region and runs into problems such as database bottlenecks. With MSHA, you can quickly build business units in other regions and data centers to achieve rapid horizontal scaling.
Traffic distribution and error correction
MSHA verifies and corrects traffic layer by layer, from the access layer down to the application layer, re-forwarding calls that do not match the traffic routing rules and confining the blast radius of a fault to a single unit.
Protection against dirty writes
When multiple units accept writes, dirty writes and overwrites can occur. MSHA provides write-prohibition protection when traffic enters the wrong unit, and write-prohibition/update-prohibition protection during the data synchronization delay of a traffic switchover.
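To make the idea concrete, here is a minimal, hypothetical Java sketch of a write guard that rejects writes when a request's target unit does not match the unit the service runs in, or when writes are temporarily banned during a switchover. The unit names and the control-plane hook are assumptions for illustration, not MSHA's actual implementation.

import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical write guard illustrating the anti-dirty-write idea;
// unit names and the control-plane hook are assumptions, not MSHA APIs.
public class WriteGuard {
    private final String localUnit; // e.g. "beijing"
    private final AtomicBoolean writeBanned = new AtomicBoolean(false);

    public WriteGuard(String localUnit) {
        this.localUnit = localUnit;
    }

    // Called by a control plane during a switchover while data
    // synchronization is still catching up.
    public void setWriteBanned(boolean banned) {
        writeBanned.set(banned);
    }

    // Reject the write if the request was routed to the wrong unit,
    // or if writes are banned during the switchover window.
    public void checkWriteAllowed(String requestTargetUnit) {
        if (!localUnit.equals(requestTargetUnit)) {
            throw new IllegalStateException("Write rejected: request belongs to unit "
                    + requestTargetUnit + " but arrived at unit " + localUnit);
        }
        if (writeBanned.get()) {
            throw new IllegalStateException("Write rejected: writes are banned during switchover");
        }
    }
}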
3. Application scenarios
MSHA can be used to build a multi-active disaster recovery architecture for the following typical business scenarios:
Read-heavy, write-light business scenarios: typical examples are information and shopping-guide services (such as product browsing and news feeds). Data characteristics: reads dominate writes; the core is the read path, and temporary unavailability of the write path is acceptable.
Transactional document business scenarios: typical examples are e-commerce transactions and billing-record services (such as orders and call records). Data characteristics: the data can be sharded along some dimension, and eventual consistency is acceptable.
Business disaster recovery practice
Next, we walk through disaster recovery architecture construction for different scenarios using an e-commerce microservice case.
1. Business background of e-commerce
1) Business applications
Frontend, the portal WEB application, responsible for interacting with users
Cartservice, the shopping cart application, which records users' shopping cart data and uses self-built Redis
Productservice, the product application, which provides product and inventory services and uses RDS MySQL
Checkoutservice, the order application, which generates purchase orders from the shopping cart and uses RDS MySQL
2) Technology stack
SpringBoot
RPC framework: Spring Cloud; the registry is self-built Eureka
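As a point of reference, here is a minimal sketch of what one of these services' entry points might look like when registering with the self-built Eureka registry. The class name is hypothetical, and it assumes spring-cloud-starter-netflix-eureka-client is on the classpath with eureka.client.service-url.defaultZone pointing at the registry.

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.client.discovery.EnableDiscoveryClient;

// Hypothetical entry point for Productservice; names are illustrative.
@SpringBootApplication
@EnableDiscoveryClient // register with the self-built Eureka registry
public class ProductserviceApplication {
    public static void main(String[] args) {
        SpringApplication.run(ProductserviceApplication.class, args);
    }
}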
3) E-commerce application architecture 1.0
In the early days of the e-commerce business, like many Internet companies, we did not consider disaster recovery and deployed in a single region only.
2. Case 1: disaster recovery for a read-heavy, write-light business
1) A fault occurs
The e-commerce business grew rapidly in its initial stage, and the small-but-beautiful single-region deployment stayed unchanged until a fault in the product application paralyzed the e-commerce business, leaving pages inaccessible for a long time. The fault was eventually resolved, but the customer churn and reputation damage it caused dealt a heavy blow to the fast-growing business, forcing us to start building high-availability capabilities.
The e-commerce business mainly consists of shopping-guide, shopping-cart, and transaction scenarios, and the first to bear the brunt was the shopping guide. It is a typical read-heavy, write-light scenario: the core is displaying the shopping-guide page (the read path), while temporary unavailability of product listing (the write path) is usually acceptable. Combining this with our own disaster recovery needs, we first set a small improvement goal: multi-region read.
2) Multi-region read disaster recovery transformation
Based on MSHA, we transformed the shopping-guide business into a multi-region read architecture.
Multi-active transformation & MSHA access:
Partition dimension: use userId as the traffic-splitting identifier (see the routing sketch after this list).
Transformation scope: deploy the portal WEB application and the product application on the shopping-guide link in two regions.
Console configuration: in the MSHA console, configure the multi-active resources at each layer.
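To illustrate how unit routing by userId can work, here is a minimal Java sketch. The range boundaries and unit names are assumptions chosen to match the drill examples later (userId=6000 routed to Beijing, userId=50 to Hangzhou); the actual rules live in the MSHA console.

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical userId-range router illustrating unit routing and
// layer-by-layer error correction; not the MSHA implementation.
public class UnitRouter {
    // Routing rules: userId range -> unit. Boundaries are assumptions.
    private final Map<long[], String> rules = new LinkedHashMap<>();

    public UnitRouter() {
        rules.put(new long[]{0, 4999}, "hangzhou");   // e.g. userId=50
        rules.put(new long[]{5000, 9999}, "beijing"); // e.g. userId=6000
    }

    public String route(long userId) {
        for (Map.Entry<long[], String> e : rules.entrySet()) {
            if (userId >= e.getKey()[0] && userId <= e.getKey()[1]) {
                return e.getValue();
            }
        }
        throw new IllegalArgumentException("No routing rule for userId " + userId);
    }

    // Error correction: a request that lands in the wrong unit is
    // re-forwarded to the unit the rules demand.
    public boolean needsForward(long userId, String localUnit) {
        return !route(userId).equals(localUnit);
    }
}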
3) Fault reproduction
Completing the disaster recovery transformation is not the end; we still need to verify that the disaster recovery capability meets expectations. Next, we reproduce the historical fault and verify the capability by creating real faults.
[Drill preparation]
Business steady-state monitoring metrics: based on MSHA traffic monitoring or other monitoring capabilities, determine the business steady-state metrics, so that we can judge the impact when a fault occurs and confirm the actual recovery of the business afterwards.
Drill expectations:
The shopping-guide link depends weakly on the shopping cart application (the shopping-guide page shows how many items the user has put in the cart); a weak-dependency failure should not affect the business (see the degradation sketch after this list).
The shopping-guide link depends strongly on the product application; a strong-dependency failure makes the business unavailable, and the blast radius of the fault should be confined to one unit.
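The weak-dependency expectation can be made concrete with a small sketch of graceful degradation: if the cart call fails, the shopping-guide page renders without the cart count instead of failing. CartClient and its method are illustrative assumptions, not the case's real interfaces.

// Hypothetical graceful degradation for the weak dependency on the cart.
public class GuidePageService {
    private final CartClient cartClient;

    public GuidePageService(CartClient cartClient) {
        this.cartClient = cartClient;
    }

    public int cartItemCountOrZero(long userId) {
        try {
            return cartClient.getItemCount(userId);
        } catch (Exception e) {
            // Weak dependency: swallow the failure and show no count,
            // so a cart outage does not take down the guide page.
            return 0;
        }
    }
}

// Assumed client interface (stub for the sketch).
interface CartClient {
    int getItemCount(long userId);
}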
[Fault drill]
Using the AHAS-Chaos fault drill capability, we can conveniently rehearse a variety of fault scenarios.
Phase 1: weak-dependency fault drill
Fault injection: inject a fault into the shopping cart application. Expected: the shopping-guide business is not affected. Result: the shopping-guide page opens normally, as expected.
Phase 2: strong-dependency fault drill
The routing rules configured before the drill are as follows (the userId is matched against the routing ranges below):
Fault injection: inject a fault into the product application in the Beijing unit. Expected: users with userId=6000, routed to the Beijing unit, are affected by the fault. Result: access to the shopping-guide page fails, as expected.
Blast radius verification: verify that the blast radius is confined to the faulty unit. Expected: users with userId=50 are routed to the Hangzhou unit and are not affected by the Beijing unit's fault. Result: access to the shopping-guide page is normal, as expected.
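Under the same illustrative routing assumptions as the UnitRouter sketch above, the drill expectations reduce to simple checks (run with java -ea to enable assertions):

// Illustrative checks mirroring the drill expectations; unit names
// and ranges come from the hypothetical UnitRouter sketch above.
public class DrillExpectations {
    public static void main(String[] args) {
        UnitRouter router = new UnitRouter();
        // userId=6000 should route to the (faulty) Beijing unit.
        assert router.route(6000).equals("beijing");
        // userId=50 should route to Hangzhou, outside the blast radius.
        assert router.route(50).equals("hangzhou");
        System.out.println("Routing expectations hold");
    }
}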
4) Switchover recovery
In the fault scenario, use MSHA's traffic-switching capability to verify the disaster recovery switchover.
Disaster recovery switchover verification: switch userId=6000 to the Hangzhou unit. Expected: after the switch, the user is routed to the Hangzhou unit and is no longer affected by the Beijing unit's fault. Result: access to the shopping-guide page is normal (the call-chain figure of the shopping-guide request shows the actual routing), and the disaster recovery capability meets expectations.
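A minimal sketch of the switchover idea under the same assumptions: publish a new routing table atomically so in-flight requests see either the old rules or the new ones, never a mix; during the synchronization window, writes can be banned with the WriteGuard sketch above. This illustrates the concept, not MSHA's implementation.

import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical traffic switch: atomically replace the routing table.
public class TrafficSwitch {
    private final AtomicReference<Map<long[], String>> activeRules = new AtomicReference<>();

    public TrafficSwitch(Map<long[], String> initialRules) {
        activeRules.set(initialRules);
    }

    // Cut traffic over by publishing the new rule table in one step.
    public void switchTo(Map<long[], String> newRules) {
        activeRules.set(newRules);
    }

    public Map<long[], String> currentRules() {
        return activeRules.get();
    }
}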
Follow-up: fault cleanup
Terminate the fault injection
Record the drill results and the risks and problems identified by the drill
Switch traffic back
Check whether the business steady-state metrics have recovered
3. Case 2: disaster recovery for a transactional document business
1) A new fault
After the transformation above, the shopping-guide business could withstand region-level faults, but a large-scale fault of the order application became the last straw for the ordering business. As a result, building a highly available architecture for ordering was also put on the agenda.
Placing an order is a typical transactional document scenario; compared with the shopping guide, it is a more complex combination of reads and writes. Combining the business scenario with the disaster recovery requirements, we chose a construction plan suitable for the business: multi-region active-active.
2) Multi-region active-active disaster recovery transformation
Based on MSHA, we transformed the order business into a multi-region active-active architecture.
Note: the order link depends strongly on the shopping cart application; to complete the multi-active disaster recovery construction, the shopping cart application should subsequently also be transformed to multi-region active-active.
Multi-active transformation & MSHA access:
Transformation scope: deploy the order application and the order database in two regions.
MSHA access: install the Agent on the applications of the order link, which provides Spring Cloud RPC cross-unit routing and anti-dirty-write protection without code intrusion (a conceptual sketch follows).
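Conceptually, the Agent's cross-unit routing resembles a client-side interceptor that tags every RPC with the partition key so each layer can verify and, if needed, correct the route. Here is a hypothetical sketch using a standard Spring ClientHttpRequestInterceptor; the header name and the CurrentUser context holder are assumptions, and the real Agent injects this behavior without any such code.

import java.io.IOException;
import org.springframework.http.HttpRequest;
import org.springframework.http.client.ClientHttpRequestExecution;
import org.springframework.http.client.ClientHttpRequestInterceptor;
import org.springframework.http.client.ClientHttpResponse;

// Hypothetical interceptor propagating the routing identifier.
public class RoutingHeaderInterceptor implements ClientHttpRequestInterceptor {
    @Override
    public ClientHttpResponse intercept(HttpRequest request, byte[] body,
            ClientHttpRequestExecution execution) throws IOException {
        // Propagate the partition key so downstream layers can re-check
        // the route and re-forward mis-routed calls.
        request.getHeaders().add("x-routing-user-id", CurrentUser.id());
        return execution.execute(request, body);
    }
}

// Assumed request-scoped context holder (stub for the sketch).
class CurrentUser {
    static String id() { return "6000"; }
}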
Console configuration: configure the multi-active resources at each layer in the MSHA console, as in Case 1.
3) Fault reproduction
After the disaster recovery transformation, we again reproduce the historical fault and verify the disaster recovery capability by creating real faults.
[Drill preparation]
Business steady-state monitoring metrics: determine the business steady-state metrics based on MSHA traffic monitoring or other monitoring capabilities.
Drill expectation: the order link depends strongly on the order application; the fault makes the business unavailable, and the blast radius of the fault should be confined to one unit.
[Fault drill]
The routing rules configured before the drill are as follows (the userId is matched against the routing ranges below):
Fault injection: inject a fault into the order application in the Beijing unit. Expected: users with userId=6000, routed to the Beijing unit, are affected by the fault. Result: placing an order fails, as expected.
Blast radius verification: verify that the blast radius is confined to the faulty unit. Expected: users with userId=50 are routed to the Hangzhou unit and are not affected by the Beijing unit's fault. Result: orders are placed normally, as expected.
4) Switchover recovery
Use MSHA's traffic-switching capability to verify the disaster recovery switchover in the fault scenario.
Disaster recovery switchover verification: after switching userId=6000 to the Hangzhou unit, the user is expected to be routed there and no longer affected by the Beijing unit's fault. Result: orders are placed normally (the call-chain figure of the order request shows the actual routing), and the disaster recovery capability meets expectations.
That concludes this share on the high-availability practice of MSHA and Chaos disaster recovery. I hope the content above helps you and lets you learn something new. If you found the article useful, please share it with more people.