
A Deep Dive into the Internal Mechanisms of MongoDB Replica Sets


How do MongoDB replica sets work internally? This article works through that question in detail, in the hope of helping readers who want to understand replica sets find clear and practical answers.

When a replica set fails over, how is the new primary elected? Can you manually intervene and take a primary node offline?

The official documentation recommends an odd number of replica set members. Why?

How does a MongoDB replica set synchronize data? What happens if synchronization falls behind? Can the data become inconsistent?

Does MongoDB failover ever happen automatically for no apparent reason? What conditions trigger it? Can frequent triggering increase the system load?

Bully algorithm. The failover capability of a MongoDB replica set comes from its election mechanism, which uses the Bully algorithm to pick a master node from among the distributed nodes. A distributed cluster architecture generally has a so-called master node, which serves many purposes, such as caching cluster metadata and acting as the access entry point for the cluster. If there is a master node anyway, why do we need the Bully algorithm? To answer that, let's look at two architectures:

Architecture with a designated master node: one node is declared the master and the others are slaves, as in our commonly used MySQL setups. But as we noted in the first section, with this architecture, if the master of the whole cluster dies, it takes manual work to bring up a new master or recover data from a slave, which is not very flexible.

Architecture without a designated master node: any node in the cluster can become the master. MongoDB uses this architecture; once the master node goes down, another slave node automatically takes over as master.

Here lies the problem: since all nodes are equal, once the primary node dies, how do we decide which node becomes the next primary? That is exactly the problem the Bully algorithm solves.

What is the Bully algorithm? The Bully algorithm is a coordinator (master) election algorithm. The main idea is that every member of the cluster can declare itself the master and notify the other nodes. The other nodes can either accept the claim or reject it and join the contest for master themselves. Only a node accepted by all the other nodes becomes the master. Nodes decide who should win according to some attribute, which can be a static ID or a fresher metric such as the most recent transaction ID (the most up-to-date node wins).
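To make the core idea concrete, here is a toy sketch in plain shell JavaScript: the live node with the highest attribute ("ordinal") wins. This is only an illustration of the selection rule, not the message-passing protocol itself and not MongoDB's actual implementation; all names and numbers are made up.

// Toy sketch: pick the live node with the highest ordinal (e.g. an ID or last op time)
var nodes = [
    { id: 0, ordinal: 105, alive: true },
    { id: 1, ordinal: 108, alive: true },
    { id: 2, ordinal: 112, alive: false }   // a dead node cannot win
];
function electPrimary(members) {
    var candidates = members.filter(function (n) { return n.alive; });
    if (candidates.length === 0) return null;                       // nobody left to elect
    candidates.sort(function (a, b) { return b.ordinal - a.ordinal; });
    return candidates[0];                                           // highest ordinal wins
}
print("elected: node " + electPrimary(nodes).id);                   // prints node 1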

Election. So how does MongoDB conduct the election? The official description goes like this:

We use a consensus protocol to pick a primary. Exact details will be spared here but that basic process is:

Get maxLocalOpOrdinal from each server.

If a majority of servers are not up (from this server's POV), remain in Secondary mode and stop.

If the last op time seems very old, stop and await human intervention.

Else, using a consensus protocol, pick the server with the highest maxLocalOpOrdinal as the Primary.

Roughly translated: a consensus protocol is used to pick the primary. The basic steps are:

Get the last operation timestamp of each server node. Every MongoDB node has an oplog that records local operations, which makes it easy to check how far a node's data lags behind the primary and can also be used for error recovery.

If the majority of servers in the cluster are down, the nodes that remain alive stay in the secondary state and the process stops; no election is held.

If the last synchronization time of the would-be primary, or of all the secondaries in the cluster, looks very old, stop the election and wait for a human to intervene.

If none of the above applies, elect the server node with the most recent last-operation timestamp as the primary (to ensure its data is the most up to date).

A consensus protocol (in fact, the Bully algorithm) is mentioned here. It is somewhat different from a database consistency protocol: a consensus protocol mainly emphasizes mechanisms for everyone to reach agreement, while a consistency protocol emphasizes the sequential consistency of operations, for example whether dirty data can appear when the same piece of data is read and written at the same time. A classic consensus algorithm in distributed systems is Paxos, which will be introduced later.

One question remains: what if the last operation times of all the secondaries are identical? In that case, whichever node declares itself first becomes the primary.
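If you want to see the last-operation timestamp each member would bring to an election, rs.status() reports it per member (exact field names vary slightly across MongoDB versions); a minimal check from the shell might look like this:

rs.status().members.forEach(function (m) {
    // stateStr: PRIMARY / SECONDARY / ...; optimeDate: the member's last applied op time
    print(m.name + "  " + m.stateStr + "  " + m.optimeDate);
});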

Election trigger conditions. Elections are not running all the time; they are triggered in the following situations:

When initializing a replica set.

The replica set is disconnected from the primary node, which may be a network problem.

The primary node is dead.

There is also a prerequisite for an election: the number of nodes able to participate must be more than half of the total number of nodes in the replica set. If fewer than half remain, all nodes stay read-only.

The log will then show:

Can't see a majority of the set, relinquishing primary

Can you manually intervene and take a primary node offline? The answer is yes.

You can take the primary node offline with the replSetStepDown command. Log in to the primary node and run:

db.adminCommand({replSetStepDown: 1})

If it will not step down, you can force the switch:

db.adminCommand({replSetStepDown: 1, force: true})

Or you can use rs.stepDown(120) for the same effect; the number specifies how many seconds the stepped-down node is barred from becoming primary again.

Setting a secondary node to a higher priority than the primary.

First check the priorities in the current cluster with the rs.conf() command. Note that the default priority of 1 is not displayed in the output.

rs.conf()
{
    "_id" : "rs0",
    "version" : 9,
    "members" : [
        {
            "_id" : 0,
            "host" : "192.168.1.136:27017"
        },
        {
            "_id" : 1,
            "host" : "192.168.1.137:27017"
        },
        {
            "_id" : 2,
            "host" : "192.168.1.138:27017"
        }
    ]
}

Let's configure it so that the member with _id 1 is preferred as the primary node.

cfg = rs.conf()
cfg.members[0].priority = 1
cfg.members[1].priority = 2
cfg.members[2].priority = 1
rs.reconfig(cfg)

Then run rs.conf() again; you can see that the priority has been set successfully, and a primary election will be triggered.

{
    "_id" : "rs0",
    "version" : 9,
    "members" : [
        {
            "_id" : 0,
            "host" : "192.168.1.136:27017"
        },
        {
            "_id" : 1,
            "host" : "192.168.1.137:27017",
            "priority" : 2
        },
        {
            "_id" : 2,
            "host" : "192.168.1.138:27017"
        }
    ]
}

What can I do if I do not want a secondary node to ever become the primary?

a. Use rs.freeze(120) to freeze the node so that it cannot be elected primary for the specified number of seconds.

b. Set the node to the Non-Voting type, as described in the previous article.

The primary also steps down when it cannot communicate with the majority of the secondaries; just unplug the primary's network cable, hey :)
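As a quick sketch of options a and b above (the member index is a placeholder for whichever member you mean in your own config):

// a. On the secondary you want to keep out of elections for a while:
rs.freeze(120)                  // cannot be elected primary for 120 seconds

// b. Make a member non-voting (assuming it is members[2] in your config):
cfg = rs.conf()
cfg.members[2].votes = 0
cfg.members[2].priority = 0     // non-voting members must also have priority 0
rs.reconfig(cfg)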

Priority can also be used in another way. What if we do not want to set up a hidden node, but just want a plain secondary as a backup node that never becomes the primary? Consider three nodes distributed across two data centers: the node in data center 2 has a priority of 0, so it cannot become the primary, but it can still participate in elections and replicate data. The architecture is still very flexible!
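Setting a member's priority to 0 is just another reconfig; a sketch, assuming the data center 2 node is members[2] in your config:

cfg = rs.conf()
cfg.members[2].priority = 0     // still votes and replicates, but never becomes primary
rs.reconfig(cfg)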

The officially recommended number of replica set members is odd, with a maximum of 12 members and up to 7 members participating in elections. The cap of 12 exists because there is rarely a need for so many copies of the data; too many backups increase network load and slow the cluster down. The cap of 7 voting members exists because with too many voters the internal election mechanism cannot pick a primary within about a minute; the number just has to be reasonable. The figures 12 and 7 come from the official performance testing; see the official document "MongoDB Limits and Thresholds" for the specific limits. What I had not understood was why the cluster size should be odd; testing shows that a cluster with an even number of members also runs fine (see http://www.itpub.net/thread-1740982-1-1.html). Then I came across an article on Stack Overflow and finally got it: MongoDB is designed as a distributed database that can span IDCs (data centers), so we should look at the question in that larger context.

Suppose four nodes are split across two IDCs, two machines in each. A problem arises if the network between the two IDCs is cut, which happens easily over a wide area network. As mentioned in the election section, as soon as the primary loses contact with the majority of the cluster, a new round of elections starts. But each side of the split now has only two of the four nodes, and elections require more than half of the nodes to participate, so neither side can elect a primary and the whole cluster becomes read-only. With an odd number of nodes this problem does not occur: with three nodes, as long as two are alive an election can succeed, and likewise three out of five and four out of seven.

To sum up, the whole cluster needs to keep communicating so that it knows which nodes are alive and which are dead. Each MongoDB node sends a ping packet to every other node in the replica set every two seconds; if a node does not respond within 10 seconds, it is marked as unreachable. Every node maintains an internal state map recording each member's current role, log timestamp, and other key information. A primary node, in addition to maintaining this map, also checks whether it can still communicate with the majority of the cluster, and if not, it demotes itself to a read-only secondary.
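The 10-second threshold corresponds to the replica set's heartbeat timeout, which recent MongoDB versions expose in the config as settings.heartbeatTimeoutSecs; a sketch for inspecting or adjusting it (treat the field as version-dependent) might look like this:

cfg = rs.conf()
cfg.settings = cfg.settings || {}        // older configs may not carry a settings object
printjson(cfg.settings)                  // heartbeatTimeoutSecs appears here when set
cfg.settings.heartbeatTimeoutSecs = 10   // the default; raise it on flaky networks
rs.reconfig(cfg)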

Synchronization. Replica set synchronization is divided into initial sync and ongoing ("keep") replication. Initial sync copies all data from the primary; if the primary holds a lot of data, this takes a long time. Keep replication is the real-time, generally incremental synchronization between nodes that happens after the initial sync. Initial sync is not only performed the first time; it can also be triggered in the following two situations:

A secondary joins the replica set for the first time; that one is obvious.

A secondary falls behind by more than the size of the oplog, in which case it is also fully re-copied.

So how large is the oplog? As mentioned earlier, the oplog records data operations; a secondary copies the oplog and replays the operations locally. The oplog is itself a MongoDB collection, stored in local.oplog.rs, but it is a capped collection, that is, a fixed-size collection in which new data overwrites the oldest once the size limit is reached. Therefore, pay particular attention to cross-IDC replication: set an appropriate oplog size to avoid frequent full resyncs in production. The size can be set with the --oplogSize option; on 64-bit Linux and Windows, the oplog defaults to 5% of the free disk space.
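A quick way to check the oplog size and how much replication history it currently covers is the shell helpers below (run on any member); the startup example in the comment is just an illustration of the --oplogSize option with a made-up value:

// Reports the configured oplog size, space used, and the time window it covers
db.printReplicationInfo()

// The same numbers, available programmatically
var info = db.getReplicationInfo()
print("oplog size (MB): " + info.logSizeMB)
print("oplog window (h): " + info.timeDiffHours)

// At startup the size can be set explicitly, e.g.: mongod --replSet rs0 --oplogSize 10240   (value in MB)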

Synchronization does not have to come from the primary. Suppose the cluster has three nodes: node 1 is the primary in IDC1, and nodes 2 and 3 are in IDC2. During initial sync, nodes 2 and 3 both copy data from node 1. Afterwards, nodes 2 and 3 choose their sync source within the replica set by proximity (the "nearest" principle), so only one of them needs to keep pulling data from node 1 in IDC1.
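If the automatically chosen sync source is not what you want, you can point a secondary at a specific member by hand with rs.syncFrom(), a wrapper around the replSetSyncFrom command; the host below is just the example address from the cluster above:

// Run on the secondary whose sync source you want to change
rs.syncFrom("192.168.1.137:27017")

// Equivalent command form
db.adminCommand({ replSetSyncFrom: "192.168.1.137:27017" })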

Also pay attention to the following points when setting up synchronization:

Secondaries do not sync data from delayed or hidden members.

For one member to sync from another, both must have the same buildIndexes setting, whether that is true or false. buildIndexes controls whether the node builds indexes so that its data can serve queries; the default is true. (A sketch of adding such a member follows these notes.)

If a sync source does not respond for 30 seconds, the node selects another member to sync from.
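As a sketch of the buildIndexes setting mentioned above (the _id and host are placeholders; buildIndexes: false also requires priority 0):

// Add a backup-only member that does not build indexes
rs.add({
    _id: 3,                          // hypothetical id for the new member
    host: "192.168.1.139:27017",     // placeholder host
    priority: 0,                     // required when buildIndexes is false
    buildIndexes: false
})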

At this point, all the questions raised in this chapter have been answered, and I have to say that MongoDB's design is really impressive!

We can now continue with the issues carried over from the previous section:

Can I switch connections automatically when the primary node is dead? Currently, manual switching is required.

How do we relieve excessive read and write pressure on the primary node?

There are also two problems to be solved in the future:

Each secondary above holds a full copy of the database; will the pressure on the secondaries become too great?

When the data grows so large that a single machine cannot handle it, can the cluster scale out automatically?

That is the answer to how a MongoDB cluster's replica set works internally. I hope the content above has been of some help; if you still have questions, the topics listed above are worth exploring further.
