
How to understand MongoDB High availability

Updated 2025-07-06. Source: SLTechnology News&Howtos (Shulou.com), report of 06/01.

This article explains how MongoDB achieves high availability. It walks through the problem and the mechanisms that address it in detail, in the hope of giving readers who face this problem a simpler, workable approach.

Server disaster recovery is an unavoidable problem in operating cloud services. We often discuss how to restore the database on a failed machine, but we seldom consider the process by which a three-node replica set is restored after a machine fails.

How does MongoDB keep user services highly available even when a machine fails?

For a MongoDB database, the MongoDB kernel is like a car engine: it is the core of the whole database. Management and control, by contrast, is like assembling the car around the engine. The control layer is responsible for how the car runs, how efficiently it runs, whether it runs safely, and how it is repaired after a failure. Ensuring the high availability of users' business is the top priority of operations and maintenance.

So, what is high availability?

The MongoDB service uses a three-node replica set architecture. The three data nodes sit on different physical servers and synchronize data automatically. The replica set provides three roles: the Primary node (serves read and write requests), the Secondary node (serves read-only requests), and the Hidden node (acts as a standby node and is not accessible by default).
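As a rough illustration, a three-member replica set with one hidden member can be described by a configuration document like the one below. The `hidden` and `priority` member fields are standard MongoDB replica set settings (a hidden member must have priority 0). The replica set name, the host names, and the helper `build_replica_set_config` are hypothetical, and the article does not show the configuration Aliyun actually uses.

```python
def build_replica_set_config(name, hosts):
    """Build a replica set configuration document with one hidden member.

    `hosts` is a list of three "host:port" strings (hypothetical values).
    In stock MongoDB the Primary is chosen by election, so priorities only
    express a preference; they do not assign roles directly.
    """
    preferred_primary, secondary, hidden = hosts
    return {
        "_id": name,
        "members": [
            # Higher priority: preferred candidate in elections.
            {"_id": 0, "host": preferred_primary, "priority": 2},
            # Lower priority: normally serves read-only traffic.
            {"_id": 1, "host": secondary, "priority": 1},
            # Hidden standby: invisible to clients, never elected Primary.
            {"_id": 2, "host": hidden, "priority": 0, "hidden": True},
        ],
    }


config = build_replica_set_config(
    "rs0",
    ["node-a.example:27017", "node-b.example:27017", "node-c.example:27017"],
)
```

A document of this shape is what a tool would pass to the replica set initiation or reconfiguration commands; here it is only built in memory.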

High availability is the process of disaster recovery and failover for this service. The process is highly automated: the Primary, Secondary, and additional standby nodes together provide disaster recovery. When the Primary node fails, the system automatically elects a new Primary node. When the Secondary node becomes unavailable, the standby node takes over and restores service. Availability is thus protected in several ways. This is the high availability that MongoDB itself provides.

The highest level of high availability is when the user can say: "What does disaster recovery have to do with me? I just want my business to be OK." That is the most stable service a provider can offer. Users see only the Primary and Secondary nodes and the connection addresses exposed for them. On the server side, however, there is one more Secondary node in the Hidden state. It is normally used for data backup and performance optimization, and it switches to the Secondary role to keep serving the user when the Primary node fails.

Servers inevitably suffer hardware failures that are hard to troubleshoot, and in extreme cases they stop working altogether. Examples include memory ECC errors that cannot be repaired automatically, disk I/O read and write failures, RAID card status problems, battery power loss, and saturated network cards. To handle these fault types, Aliyun monitors every server that provides external services and uses the monitoring system to collect these metrics in real time. Once a problem is found, an alarm is raised promptly.

In addition, a server may need to go offline for many other reasons, such as replacement at the end of its warranty period or a warranty extension, an OS upgrade, or the patching of vulnerabilities in service programs.

While a server is being taken offline, user services keep running. Removing the machine for an offline upgrade without affecting the user's running business is the job of the MongoDB control team.
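The idea of automatic election can be illustrated with a deliberately simplified sketch. It assumes that an election needs a reachable majority of members and that the healthy member with the highest non-zero priority wins. The real MongoDB election protocol also takes voting terms and data freshness into account, which this sketch ignores; the `Member` class and `elect_primary` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Member:
    host: str
    priority: int          # 0 means the member can never become Primary
    healthy: bool = True


def elect_primary(members):
    """Return the member that would become Primary, or None if no election is possible."""
    healthy = [m for m in members if m.healthy]
    # Without a strict majority of reachable members, no Primary is elected.
    if len(healthy) * 2 <= len(members):
        return None
    candidates = [m for m in healthy if m.priority > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m.priority)


members = [
    Member("node-a", priority=2, healthy=False),  # failed Primary
    Member("node-b", priority=1),                 # Secondary
    Member("node-c", priority=0),                 # Hidden standby
]
new_primary = elect_primary(members)  # node-b takes over in this sketch
```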

What strategy does MongoDB use in response:

The high availability process for MongoDB is divided into the following three parts:

Fault detection: multiple detection systems check different items, and the systems are linked to one another.

Failover: how the business on a failed machine is moved off that machine after a failure.

Host offline: offline maintenance of the faulty machine and the follow-up process.

Fault detection:

The MongoDB service cluster contains many machines of different models, such as D13 and H43. Each server runs a detection program, and a large number of Monitors collect information from it. Whether a component belongs to Aliyun itself or is implemented by the user in the user's business, there is a corresponding interface. Aliyun obtains the status of the instance and the server either by push or by pulling it. Once it is determined that a machine needs to go offline, the resource manager marks the machine to confirm the exception, and the process moves to the next stage. Detection does not lead directly to failover; in practice there are many detailed checks between the two steps.
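One way to picture how reports from many Monitors could lead to a machine being marked is the sketch below. It assumes that a machine is marked only when at least `min_reports` distinct monitors report a fault; that threshold, the report format, and the function name `machines_to_mark` are illustrative assumptions, not details given in the article.

```python
from collections import defaultdict


def machines_to_mark(reports, min_reports=2):
    """Return machine IDs that enough distinct monitors consider faulty.

    `reports` is an iterable of (monitor_name, machine_id, is_faulty) tuples,
    whether they arrived by push or were pulled by the collector.
    """
    faulty_by_machine = defaultdict(set)
    for monitor, machine, is_faulty in reports:
        if is_faulty:
            # A set keeps repeated reports from one monitor from counting twice.
            faulty_by_machine[machine].add(monitor)
    return sorted(
        machine
        for machine, monitors in faulty_by_machine.items()
        if len(monitors) >= min_reports
    )


reports = [
    ("disk-monitor", "machine-1", True),
    ("raid-monitor", "machine-1", True),
    ("disk-monitor", "machine-2", True),
    ("net-monitor", "machine-2", False),
]
marked = machines_to_mark(reports)  # only machine-1 has two independent fault reports
```

Requiring agreement between monitors is one simple stand-in for the "detailed checks" that sit between detection and failover.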

Failover:

With the three-node replica set architecture provided by Aliyun, there are many cases, such as a machine reaching the end of its warranty period or a D13 RAID card failure, in which the machine hosting an instance's Primary node must be taken offline for maintenance. In these cases the resource manager marks the machines that need to go offline, and the marked machines are then actually taken offline. These machines often still have business running on them. To make sure that business is not affected, MongoDB uses its own mechanism to swap the Primary and Secondary roles, so that the marked node becomes a Secondary. The marked node is then changed from Secondary to Hidden, which hides it from service, and the original Hidden node takes over in its place as the disaster recovery node.

At this point the instance's data is still stored on the marked node, which is now Hidden, so the machine cannot simply be removed. The process therefore moves to node reconfiguration: another machine is selected from the resource pool to create a new Hidden node. Once the new node has joined the replica set and the three nodes have finished synchronizing, the marked machine is removed and formally enters the offline process. This usually takes some time. Moreover, the marked machine may host multiple instances: not only the Primary node of one instance, but also Secondary or Hidden nodes of others. The main flow is similar in each case. Every node on the marked machine is eventually switched to the Hidden state, until the machine no longer receives any user access requests.

So that users' normal use of cloud services is not affected, the whole switching process is carried out within the maintenance time window specified by the user.
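The role changes described above can be traced with a small in-memory sketch for a single replica set. It only models the order of the swaps (Primary to Secondary, Secondary to Hidden, then replacement by a new Hidden node); it does not talk to MongoDB, and the function `drain_marked_node` and the host names are hypothetical.

```python
def drain_marked_node(roles, marked, replacement):
    """Return (new_roles, steps) after moving the marked host out of a replica set.

    `roles` maps host -> "PRIMARY" | "SECONDARY" | "HIDDEN" for one replica set.
    """
    roles = dict(roles)  # work on a copy so the caller's mapping is untouched
    steps = []

    def swap(marked_role, other_role):
        # Find the peer currently holding the role the marked node should take.
        other = next(h for h, r in roles.items() if r == other_role)
        roles[marked], roles[other] = other_role, marked_role
        steps.append(f"{marked}: {marked_role} -> {other_role}; {other}: {other_role} -> {marked_role}")

    if roles[marked] == "PRIMARY":
        swap("PRIMARY", "SECONDARY")
    if roles[marked] == "SECONDARY":
        swap("SECONDARY", "HIDDEN")

    # The marked node is now Hidden; bring in a fresh Hidden node, then remove it.
    roles[replacement] = "HIDDEN"
    steps.append(f"{replacement}: joins as HIDDEN and syncs")
    del roles[marked]
    steps.append(f"{marked}: removed from replica set")
    return roles, steps


before = {"node-a": "PRIMARY", "node-b": "SECONDARY", "node-c": "HIDDEN"}
after, steps = drain_marked_node(before, marked="node-a", replacement="node-d")
```

In this example `node-b` ends up as Primary, `node-c` as Secondary, and the new `node-d` as Hidden, while the replica set keeps three members at the end of the process.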
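A check of whether a switch may start inside the user's window could look like the sketch below. It assumes the window is given as a daily start and end time and that it may wrap past midnight; the article does not describe how the window is stored, and the function name `in_maintenance_window` is hypothetical.

```python
from datetime import time


def in_maintenance_window(now, start, end):
    """Return True if `now` falls inside the daily window [start, end)."""
    if start <= end:
        # Window within a single day, e.g. 02:00 to 06:00.
        return start <= now < end
    # Window that wraps past midnight, e.g. 22:00 to 02:00.
    return now >= start or now < end


may_switch = in_maintenance_window(time(2, 30), start=time(2, 0), end=time(6, 0))
```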

Host offline:

An offline machine is not returned to the host resource pool right away; it first goes through a 24-hour holding period. During this period, the monitoring system checks whether the held machine still receives access requests or performs I/O reads and writes. Only when the check is complete and it is certain that the machine can go offline is it placed in the host resource pool. Machines in the resource pool enter a separate system for subsequent handling that is no longer related to the MongoDB business, and the machine is repaired through a dedicated IDCfree system. After we confirm that the machine has no problems, we put it back into the resource pool, and it rejoins the MongoDB cluster through the auto-online system. This part is handled by the automated resource control platform. Next, we use an actual failover scenario to illustrate the high availability process in more detail.
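Before moving on to that scenario, the 24-hour holding rule can be sketched as a simple check. It assumes that the monitoring system records a timestamp for every access request or I/O operation it observes on the held machine; that data shape and the function name `can_release` are assumptions made for illustration.

```python
from datetime import datetime, timedelta

HOLDING_PERIOD = timedelta(hours=24)  # the 24H period mentioned in the text


def can_release(offline_since, now, activity_times):
    """Return True if the machine may move to the host resource pool.

    `activity_times` holds timestamps of access requests or I/O operations
    observed on the machine (assumed to be supplied by monitoring).
    """
    if now - offline_since < HOLDING_PERIOD:
        return False  # holding period not finished yet
    # Any activity seen since the machine went offline blocks the release.
    return not any(t >= offline_since for t in activity_times)


offline_since = datetime(2025, 7, 1, 8, 0)
quiet = can_release(offline_since, datetime(2025, 7, 2, 9, 0), activity_times=[])
```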

Failover business scenario:

There is a problem with a replica set:

After the faulty machine has been marked and confirmed for transfer, the automatic operations system, named Robot, first obtains information on the instances on the machine and then starts the formal transfer within the OPS time set by the user (if the transfer falls outside the time set by the user, the user is still notified by SMS). If the node's Role is Primary, Robot first swaps the Primary and Secondary nodes. If the node is already a Secondary, Robot swaps roles between the Secondary and Hidden nodes. This step is carried out by dispatching a task flow; the switch on the backend is very fast, and the impact on users is negligible. Once all nodes on the failed machine have become Hidden nodes, the Hidden rebuild ("rematch Hidden") process is triggered and the newly created nodes are added to their replica sets. At this point the faulty machine no longer hosts any instances, and the automatic operations platform places the idle machine on the offline list and stops running immediate instance checks on it.
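The task flow for a machine that hosts nodes of several instances can be outlined as follows. This is a planning sketch only: the task strings, the instance IDs, and the function `plan_machine_transfer` are hypothetical, and nothing is known from the article about how Robot actually represents its tasks.

```python
def plan_machine_transfer(instance_roles):
    """Return an ordered task list for draining one faulty machine.

    `instance_roles` maps instance_id -> role of that instance's node on the
    faulty machine ("PRIMARY", "SECONDARY" or "HIDDEN").
    """
    tasks = []
    # Phase 1: demote every node on the machine until it is Hidden.
    for instance, role in sorted(instance_roles.items()):
        if role == "PRIMARY":
            tasks.append((instance, "swap PRIMARY with SECONDARY"))
            role = "SECONDARY"
        if role == "SECONDARY":
            tasks.append((instance, "swap SECONDARY with HIDDEN"))
    # Phase 2: only after all nodes are Hidden, rebuild them elsewhere.
    for instance in sorted(instance_roles):
        tasks.append((instance, "rebuild HIDDEN on a new machine"))
    # Phase 3: the machine now hosts no instances.
    tasks.append(("machine", "move to offline list"))
    return tasks


tasks = plan_machine_transfer({"inst-1": "PRIMARY", "inst-2": "HIDDEN"})
```

Keeping the rebuild in a separate phase mirrors the text: the Hidden rebuild is triggered only once every node on the failed machine has become Hidden.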

There is a saying for fault migration: weather the storm and stay calm.

This concludes the discussion of how to understand MongoDB high availability. I hope the above is of some help to you.
