How to Implement a Distributed Cluster in MongoDB

2025-07-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains how distributed clustering is implemented in MongoDB: the processes involved, the node roles, and how replica sets and sharding fit together.

Cluster overview

MongoDB-related processes fall into three categories:

mongo process: the shell client provided by MongoDB, used to send commands to the cluster and operate on it.

mongos process: the routing process, which accepts client connections, forwards client requests to the backend cluster, and hides the cluster's internal structure from clients.

mongod process: a MongoDB instance process, which serves data reads and writes.

To use a bank as an analogy, the mongo process is the customer, the mongos process is the teller at the counter, and the mongod process is the back-office staff who actually handle the business. Customers only talk to the teller and say what they need; the teller passes the request to the back office, which does the actual work.

As shown in the figure, the nodes of a MongoDB cluster fall into three categories:

mongos routing nodes: handle client connections, act as access routers, dispatch requests to the correct data nodes, and hide the distributed nature of the cluster from clients.

config nodes: the configuration service stores the cluster's metadata, such as the data range held by each shard and the list of chunks. A config node is also a mongod process, but the data it stores is cluster metadata.

shard nodes: the data storage nodes. The shard tier consists of several replica sets. Each replica set stores part of the data, and all replica sets together hold the full data set, while the members within one replica set store the same data for backup and high availability.

Continuing the bank analogy, suppose a customer wants to deposit an insurance policy for safekeeping:

The teller accepts the customer's request (the mongos routing node receives the client's operation request).

The teller looks up the filing directory to see which warehouse the policy should go to (the mongos node communicates with the config nodes and finds the shard that holds the relevant data).

Once the warehouse is known, the teller hands the policy to the warehouse keeper, who places it in the designated warehouse (the mongos node sends the request to the shard where the data resides, and that shard performs the read or write).
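The three steps above can be sketched as a toy model in plain JavaScript. This is only an illustration of the request path, not how mongos is implemented; the names configMetadata, shards, route, insert and find, as well as the key ranges, are all hypothetical.

```javascript
// Toy model of the request path: client -> mongos -> config metadata -> shard.
// Every name and range below is made up for illustration.

// Metadata the config nodes would hold: which key range lives on which shard.
const configMetadata = [
  { min: -Infinity, max: 100, shard: "shardA" },
  { min: 100, max: 200, shard: "shardB" },
  { min: 200, max: Infinity, shard: "shardC" },
];

// Each shard is just an in-memory Map here.
const shards = { shardA: new Map(), shardB: new Map(), shardC: new Map() };

// The "mongos" step: look up the owning shard for a shard-key value.
// Ranges are half-open: min inclusive, max exclusive.
function route(key) {
  const entry = configMetadata.find((r) => key >= r.min && key < r.max);
  return entry.shard;
}

// The client only calls insert/find; it never names a shard.
function insert(doc) {
  const shard = route(doc.x);
  shards[shard].set(doc.x, doc);
  return shard;
}

function find(x) {
  return shards[route(x)].get(x);
}

console.log(insert({ x: 42, name: "policy-1" }));  // shardA
console.log(insert({ x: 150, name: "policy-2" })); // shardB
console.log(find(150).name);                        // policy-2
```

The caller never learns which shard holds a document, which is the "shielding" role the text assigns to mongos.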

Mongos routing service

The mongos service acts like a gateway between the MongoDB cluster and the application and hides the cluster's internal structure. The application only needs to send requests to mongos and does not need to know about the shards and replicas inside the cluster.

mongos itself does not store data or index information; it obtains what it needs by querying the config service. You can therefore consider deploying mongos on the same server as the application: when that server goes down, its mongos goes down with it, so no mongos instance is left running idle.

A single mongos node is enough to run a cluster, but several are usually deployed for high availability. As with bank tellers, there can be more than one; there is no primary/standby relationship among them, and each handles requests independently.

Note that once sharding is enabled, the application should not connect directly to a shard node to modify data, because doing so can cause serious problems such as data inconsistency. All operations should go through a mongos node.
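Since the application talks only to mongos, and several mongos nodes can serve it, a connection string usually lists the routers rather than any shard member. The helper below is a minimal sketch that builds such a string in the standard mongodb:// format; the function name buildMongosUri and the hostnames are placeholders, not values from this article.

```javascript
// Build a connection string that lists several mongos routers.
// Hostnames and the helper name are placeholders for illustration.
function buildMongosUri(hosts, database) {
  if (hosts.length === 0) throw new Error("at least one mongos host is required");
  return `mongodb://${hosts.join(",")}/${database}`;
}

const uri = buildMongosUri(
  ["mongos1.example.net:27017", "mongos2.example.net:27017"],
  "mustone"
);
console.log(uri);
// mongodb://mongos1.example.net:27017,mongos2.example.net:27017/mustone
```

Because the routers have no primary/standby relationship, any host in the list can serve a request.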

Config configuration Service

The config service is essentially a replica set that stores the cluster metadata, such as the list of chunks, the data range on each shard, and authentication information. The config database is shown below; its collections hold the cluster's important metadata.

mongos> use config
switched to db config
mongos> show collections
changelog
chunks
collections
databases
lockpings
locks
migrations
mongos
shards
tags
transactions
version

In general, users should not modify the data in the config database directly, as doing so can have serious consequences.

Shard sharding service

Distributed storage solves two problems:

Scalability. As the business grows, so does the volume of data, and a single machine's storage is physically limited, so servers must be added to hold the growing data. In a distributed environment the data cannot all live on one node; it has to be split so that part of it sits on one node and part on another.

High availability. If a piece of data exists on only one node, it becomes unavailable when that node fails. The same data therefore has to be stored on multiple nodes.

In a MongoDB cluster, scalability is provided by sharding and high availability by replica sets.

As shown in the figure, the full data set is 1-6 and is divided into three shards: 1-2, 3-4, and 5-6. Each shard is stored on different nodes and has three copies, each of which is an independent mongod instance.

A replica set is thus a vertical concept: the same data is stored on multiple nodes. Sharding is a horizontal concept: the full data set is cut into separate shards that are stored independently, and each shard is itself stored as a replica set.
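The 1-6 example can be laid out in a few lines of JavaScript to make the horizontal and vertical dimensions concrete. This is a sketch of the layout only; buildLayout and the member names are hypothetical.

```javascript
// Lay out the example from the text: values 1..6, three shards, three copies each.
// buildLayout and the member names are made up for illustration.
function buildLayout(values, shardCount, replicasPerShard) {
  const perShard = Math.ceil(values.length / shardCount);
  const layout = [];
  for (let s = 0; s < shardCount; s++) {
    // Horizontal split: each shard gets its own slice of the data.
    const slice = values.slice(s * perShard, (s + 1) * perShard);
    // Vertical copy: every member of the replica set holds the same slice.
    const members = [];
    for (let r = 0; r < replicasPerShard; r++) {
      members.push({ name: `shard${s + 1}-member${r + 1}`, data: [...slice] });
    }
    layout.push({ shard: `shard${s + 1}`, members });
  }
  return layout;
}

const layout = buildLayout([1, 2, 3, 4, 5, 6], 3, 3);
for (const { shard, members } of layout) {
  console.log(shard, members.map((m) => `${m.name}=[${m.data}]`).join(" "));
}
```

With three shards and three copies each, the layout contains nine independent instances in total, matching the description above.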

Replica set

The replica set contains three roles:

Primary node (Primary)

Secondary node (Secondary)

Arbiter node (Arbiter)

A replica set consists of one primary node, one or more secondary nodes, and zero or more arbiters.

The primary and secondary nodes are data nodes. The primary handles writes; after data is written to the primary, it is replicated to the secondaries through the synchronization mechanism. Reads are also served by the primary by default, but you can set the read preference to read from secondaries first.

An arbiter is not a data node: it stores no data and serves no reads or writes. It exists only as a voter. When the primary fails and a switchover is needed, the arbiter can vote but cannot itself be elected. Arbiters make failover possible when resources are limited. For example, if only two nodes have disk capacity, an arbiter that takes no storage can be added to form a "one primary, one secondary, one arbiter" replica set, so that when the primary goes down the secondary is promoted automatically.

Nodes learn each other's state through heartbeats. When the primary becomes unavailable, the remaining voting nodes elect one of the eligible nodes as the new primary so the service stays available. The election follows the majority principle: a candidate needs the votes of more than half of all voting nodes to become primary. For this reason an even number of nodes is not recommended: if a network partition splits the set into two equal halves, neither half holds a majority and no new primary can be elected.
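The majority rule is simple arithmetic, and working it out shows why an even member count buys nothing. The two helper functions below are illustrative names, not MongoDB APIs.

```javascript
// Votes needed to elect a primary, and how many voting members may be lost
// while an election is still possible. Helper names are illustrative.
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

function faultTolerance(votingMembers) {
  return votingMembers - majority(votingMembers);
}

for (const n of [2, 3, 4, 5]) {
  console.log(`${n} voters: majority=${majority(n)}, tolerates ${faultTolerance(n)} failure(s)`);
}
// 2 voters: majority=2, tolerates 0 failure(s)
// 3 voters: majority=2, tolerates 1 failure(s)
// 4 voters: majority=3, tolerates 1 failure(s)
// 5 voters: majority=3, tolerates 2 failure(s)
```

Going from three voters to four raises the required majority without tolerating any additional failure, which is why odd member counts are preferred.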

You can inspect the replica set with rs.status(); see "how to build a mongodb Cluster quickly" for reference.

Sharding

Sharding divides the full data set into disjoint subsets according to certain rules. Each subset is a shard, and different shards are stored on different nodes. This raises a few questions:

What is the partitioning rule, that is, the sharding strategy?

How is the sharded data stored?

As the data keeps growing, how is the sharding adjusted dynamically?

Chunk

A chunk consists of multiple documents, and a shard contains multiple chunks. The chunk is the smallest unit of data migration between shards. The sharding strategy determines which chunk a document belongs to, and the chunk in turn is stored on a shard.

As shown in the figure, suppose documents are partitioned by the value of field x and placed in different chunks by value range; for example, values from 25 to 175 go to chunk 3.
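Finding the chunk for a given value of x amounts to locating it among the chunk boundaries. The sketch below uses a binary search over split points. Only the 25 to 175 range of chunk 3 comes from the text; the other split point (-75) and the function name chunkFor are assumptions made for the example.

```javascript
// Split points between chunks on shard key x. Only the 25..175 range of
// chunk 3 comes from the text; the other boundary is assumed.
const splitPoints = [-75, 25, 175];

// Returns the 1-based chunk number whose half-open range [min, max) contains x.
function chunkFor(x) {
  let lo = 0;
  let hi = splitPoints.length;
  // Binary search for the number of split points that are <= x.
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (splitPoints[mid] <= x) lo = mid + 1;
    else hi = mid;
  }
  return lo + 1;
}

console.log(chunkFor(100));  // 3
console.log(chunkFor(25));   // 3 (lower bound is inclusive)
console.log(chunkFor(175));  // 4 (upper bound is exclusive)
console.log(chunkFor(-500)); // 1
```

The first and last chunks are open-ended, so every possible value of x maps to exactly one chunk.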

Think of books as MongoDB documents, bookcases as chunks, and rooms as shards. Each book is placed on a bookcase according to certain rules, and a room holds many bookcases. When one room has too many bookcases, some are moved to a room with more space.

The chunk size defaults to 64 MB (128 MB starting with MongoDB 6.0) and can be customized. Chunks serve two purposes:

When a chunk exceeds the configured size, a chunk split is triggered.

When the number of chunks is unbalanced across shards, chunk migration is triggered.
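Both triggers can be imitated with a small simulation. This is a deliberately simplified model: it counts documents instead of bytes, and the limits MAX_DOCS_PER_CHUNK and MIGRATION_THRESHOLD are made-up numbers, not MongoDB's actual thresholds.

```javascript
// Toy model: a chunk splits when it holds too many documents, and a
// balancer moves chunks from the fullest shard to the emptiest one.
// MAX_DOCS_PER_CHUNK and MIGRATION_THRESHOLD are made-up numbers.
const MAX_DOCS_PER_CHUNK = 4;
const MIGRATION_THRESHOLD = 2;

// Split a chunk (sorted array of keys) in half if it is over the limit.
function maybeSplit(chunk) {
  if (chunk.length <= MAX_DOCS_PER_CHUNK) return [chunk];
  const mid = Math.floor(chunk.length / 2);
  return [chunk.slice(0, mid), chunk.slice(mid)];
}

// Move one chunk at a time until chunk counts differ by less than the threshold.
function balance(shards) {
  let moved = 0;
  for (;;) {
    const sorted = [...shards].sort((a, b) => a.chunks.length - b.chunks.length);
    const least = sorted[0];
    const most = sorted[sorted.length - 1];
    if (most.chunks.length - least.chunks.length < MIGRATION_THRESHOLD) return moved;
    least.chunks.push(most.chunks.pop());
    moved++;
  }
}

const shards = [
  { name: "shardA", chunks: [] },
  { name: "shardB", chunks: [] },
];

// Insert keys 1..10, starting from a single chunk and splitting as it grows.
let chunks = [[]];
for (let key = 1; key <= 10; key++) {
  chunks[chunks.length - 1].push(key); // increasing keys land in the last chunk
  chunks = chunks.flatMap(maybeSplit);
}
shards[0].chunks = chunks;

const moved = balance(shards);
console.log(shards.map((s) => `${s.name}: ${s.chunks.length} chunk(s)`), `moved=${moved}`);
```

In this run the ten inserts produce four chunks on shardA, and the balancer then moves two of them to shardB so that each shard ends up with two.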

Chunk migration is carried out by MongoDB's balancer, a background process that is enabled by default and can be turned off manually.

You can view the balancer status with the following command:

sh.getBalancerState()

The effect of chunk size on the cluster:

With a smaller chunk size, there are more chunks and the data is distributed more evenly, but chunks split and migrate more often.

With a larger chunk size, there are fewer chunks, the data is more likely to be unevenly distributed, and each migration moves more data over the network.

So consider a custom chunk size carefully before applying it; a poor choice can badly hurt the performance of both the cluster and the applications.
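The trade-off can be put into rough numbers. The sketch below estimates the chunk count and the largest amount of data a single migration would move for a few chunk sizes; the 1 TiB data set and the helper name chunkEstimate are assumptions chosen for illustration.

```javascript
// Rough arithmetic only: chunk count and worst-case bytes moved per migration
// for a given data size. The 1 TiB data set is a made-up figure.
const MiB = 1024 * 1024;

function chunkEstimate(dataBytes, chunkSizeMiB) {
  return {
    chunkSizeMiB,
    chunks: Math.ceil(dataBytes / (chunkSizeMiB * MiB)),
    maxBytesPerMigration: chunkSizeMiB * MiB,
  };
}

const dataBytes = 1024 * 1024 * MiB; // 1 TiB
for (const size of [16, 64, 256]) {
  console.log(chunkEstimate(dataBytes, size));
}
```

Quartering the chunk size quadruples the number of chunks to track, split and balance, while quadrupling it makes each migration four times heavier.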

Shard key

A MongoDB cluster does not shard data automatically. The client has to tell MongoDB which data to shard and by what rules.

Enable sharding for a database:

mongos> sh.enableSharding(<database>)

Set the sharding rules for the collection:

mongos> sh.shardCollection(<namespace>, <key>, <unique>, <options>)
// unique and options are optional parameters

For example, enable sharding on the database mustone and shard the documents in its myuser collection by the hash of the _id field:

sh.enableSharding("mustone")
sh.shardCollection("mustone.myuser", { _id: "hashed" })

This answers the first question above: the sharding strategy consists of the shard key and the sharding algorithm. The shard key is a field of the document, and it can also be a compound of several fields. There are two sharding algorithms:

Range-based. With id: 1, for example, the collection is sharded by ranges of the field id, and id is the shard key. A ranged shard key is declared with the value 1; a shard key pattern accepts only 1 or "hashed", so a descending key such as id: -1 is not supported. If the collection is empty when sharding is set up, a single chunk is initialized with the range (-∞, +∞). As data is inserted and the chunk reaches the chunk size limit, chunk splits and any necessary migrations take place.

Hash-based. In the example above, _id: "hashed" shards the collection by the hash of the field _id, and the shard key is _id. At initialization, several chunks are created according to the number of shard nodes; for example, 3 shard nodes get 6 initial chunks, 2 on each shard.
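The difference between the two algorithms shows up most clearly with auto-incrementing keys. The sketch below counts where 3000 new documents would land under each scheme, using 3 shards with 2 chunks each as in the example above. It is a simplified model: FNV-1a stands in for MongoDB's internal hash function, and the chunk boundaries and helper names are assumptions.

```javascript
// FNV-1a, a simple 32-bit hash used here only as a stand-in for
// MongoDB's internal hashed-index function.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

const SHARDS = 3;
const CHUNKS = 6; // 2 chunks per shard, as in the text's example

// Hashed: split the 32-bit hash space into CHUNKS equal ranges.
function hashedChunk(id) {
  return Math.floor((fnv1a(String(id)) / 2 ** 32) * CHUNKS);
}

// Ranged: assume boundaries at every 1000 ids, with the last chunk
// open-ended, so every id >= 5000 falls into chunk 5.
function rangedChunk(id) {
  return Math.min(Math.floor(id / 1000), CHUNKS - 1);
}

// Chunks 0-1 live on shard 0, chunks 2-3 on shard 1, chunks 4-5 on shard 2.
function writesPerShard(chunkOf, ids) {
  const counts = new Array(SHARDS).fill(0);
  for (const id of ids) counts[Math.floor(chunkOf(id) / 2)]++;
  return counts;
}

// 3000 new documents with auto-incrementing ids.
const newIds = Array.from({ length: 3000 }, (_, i) => 6000 + i);
console.log("ranged:", writesPerShard(rangedChunk, newIds)); // [ 0, 0, 3000 ]
console.log("hashed:", writesPerShard(hashedChunk, newIds));
```

With the ranged key every new write goes to the shard that owns the last chunk, whereas the hashed key spreads the same writes over the shards.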

Each database is assigned a primary shard. Initial chunks and collections that do not have sharding enabled are placed on this primary shard by default.

Choosing the sharding strategy matters a great deal, because changing it once there is a large amount of data is troublesome. Principles for choosing a strategy:

Even distribution. The goal of sharding is to spread the data evenly across the shards so that access load is spread across them as well. For example, with an auto-incrementing id as a ranged shard key, new data is always written to the last chunk, and the resulting chunk splits and migrations all fall on the shard holding that chunk, putting excessive pressure on it.

High cardinality. The cardinality of a shard key is the number of distinct values it can take. The higher the cardinality, the more finely the data can be divided; the lower it is, the fewer chunks are possible. Gender, for example, has only two values, so as a shard key it allows at most two chunks, and the cluster cannot scale out as the data grows.

Locality. As far as possible, the data touched by one query should sit in the same chunk, which improves disk read performance. Avoid meaningless random shard keys: they distribute data evenly, but every query has to span multiple chunks, which is inefficient.
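The cardinality and locality principles can be checked on sample data before committing to a shard key. The sketch below is one hypothetical way to do so: the sample documents, the chunk layout, and the multiply-and-modulo function standing in for a hash are all made up for the example.

```javascript
// Two quick checks on a candidate shard key, using made-up sample documents.
const docs = Array.from({ length: 1000 }, (_, i) => ({
  userId: i,
  gender: i % 2 === 0 ? "F" : "M",
}));

// Cardinality: the number of distinct key values caps the number of chunks.
function cardinality(documents, field) {
  return new Set(documents.map((d) => d[field])).size;
}

// Locality: how many chunks a range query on userId must visit when
// chunks are laid out by range versus by a scrambling function.
const CHUNKS = 10;
const rangedChunk = (id) => Math.floor(id / 100);
const scrambledChunk = (id) => (id * 7) % CHUNKS; // stand-in for a hash

function chunksTouched(chunkOf, from, to) {
  const touched = new Set();
  for (let id = from; id <= to; id++) touched.add(chunkOf(id));
  return touched.size;
}

console.log(cardinality(docs, "gender")); // 2
console.log(cardinality(docs, "userId")); // 1000
console.log(chunksTouched(rangedChunk, 200, 249));    // 1
console.log(chunksTouched(scrambledChunk, 200, 249)); // 10
```

A key like gender fails the cardinality check outright, while a scrambled key passes it but forces a 50-document range query to visit every chunk.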

That is how a distributed cluster is implemented in MongoDB. These are points you are likely to run into in day-to-day work, and I hope this article helps you understand them better.
