MongoDB Sharding Learning Theory

2025-07-06 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/01 Report--

MongoDB sharding addresses the problem that a single MongoDB instance cannot keep up as the data volume and the number of read and write requests grow. With sharding, MongoDB divides the data into multiple parts and distributes them across multiple shards, which reduces both the number of requests each shard has to process and the amount of data each shard has to store. As the cluster grows, the throughput and capacity of the whole cluster grow with it.

A sharded cluster has the following components: shards, query routers, and config servers.

Shards: store the data and provide high availability and data consistency for the sharded cluster. In a production environment, each shard is a replica set.

Query routers: mongos instances that interact with the application, forward requests to the back-end shards, and return the results to the client. A sharded cluster can have multiple query routers (mongos instances) to spread the client request load. If you use multiple mongos instances, you can put a proxy such as HAProxy or LVS in front of them to forward client requests to the back-end mongos. In that case you must configure client affinity so that requests from the same client are always forwarded to the same mongos. Mongos instances are typically deployed on the application servers.
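
As a rough illustration of client affinity, the sketch below picks a mongos from the client address, so the same client is always mapped to the same instance. The mongos addresses and the function name pick_mongos are hypothetical, and the CRC32 hashing is only a stand-in for whatever source-based balancing the proxy actually implements.

```python
import zlib

# Hypothetical mongos addresses sitting behind the proxy.
MONGOS_INSTANCES = ["mongos-a:27017", "mongos-b:27017", "mongos-c:27017"]

def pick_mongos(client_ip):
    """Choose a mongos from the client address so the choice is repeatable."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    bucket = zlib.crc32(client_ip.encode("utf-8")) % len(MONGOS_INSTANCES)
    return MONGOS_INSTANCES[bucket]

# The first and third lines printed name the same mongos (same client IP).
for ip in ("10.0.0.5", "10.0.0.6", "10.0.0.5"):
    print(ip, "->", pick_mongos(ip))
```

A real proxy would also have to handle a mongos going down, which this sketch ignores.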

Config servers: store the metadata of the sharded cluster. This metadata maps the data set of the entire cluster to the back-end shards. The query router uses this metadata to direct client requests to the corresponding back-end shards. A production sharded cluster has exactly three config servers. The data on the config servers is very important: if all config servers are down, the whole sharded cluster becomes unavailable. In a production environment, each config server must run on a different machine, and config servers cannot be shared between sharded clusters; each cluster must deploy its own.

MongoDB sharding distributes and stores data at the collection level: the data of a collection is distributed according to its shard key.

To shard a collection, you first need to select a shard key. A shard key is either an indexed field or an indexed compound field that exists in every document in the collection. MongoDB divides the shard key values into chunks and then distributes these chunks evenly across the back-end shards. To divide the shard key values, MongoDB uses either range based partitioning or hash based partitioning. Once a shard key is selected, it cannot be changed.
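
To make the choice of shard key more concrete, the sketch below builds the two admin command documents (enableSharding and shardCollection) that are used to shard a collection. The helper name sharding_commands and the database, collection, and field names are hypothetical, and nothing is sent to a server here.

```python
def sharding_commands(database, collection, key_field, hashed=False):
    """Build the admin commands that enable sharding and shard a collection."""
    namespace = f"{database}.{collection}"
    # 1 requests range based partitioning on the field; "hashed" requests
    # hash based partitioning.
    key_spec = {key_field: "hashed" if hashed else 1}
    return [
        {"enableSharding": database},
        {"shardCollection": namespace, "key": key_spec},
    ]

# Hypothetical names: database "app", collection "user", shard key "user_id".
for command in sharding_commands("app", "user", "user_id"):
    print(command)
```

With a running cluster and the pymongo driver installed, each document could be passed to the admin database of a mongos, for example with client.admin.command(command); the connection details would depend on your deployment.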

Range Based Sharding

Given a range based partitioning system, documents with "close" shard key values are likely to be in the same chunk, and therefore on the same shard.
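
A minimal sketch of this idea, assuming integer shard key values and three hypothetical split points: each value is assigned to the chunk whose range contains it, so neighbouring values end up together.

```python
from bisect import bisect_right

# Hypothetical split points: chunk 0 holds keys < 25, chunk 1 holds [25, 50),
# chunk 2 holds [50, 75), and chunk 3 holds keys >= 75.
SPLIT_POINTS = [25, 50, 75]

def range_chunk(shard_key_value):
    """Return the index of the chunk whose range contains the value."""
    return bisect_right(SPLIT_POINTS, shard_key_value)

print([range_chunk(v) for v in (10, 11, 12)])   # [0, 0, 0]
print([range_chunk(v) for v in (24, 25, 80)])   # [0, 1, 3]
```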

Hash Based Sharding

For hash based partitioning, MongoDB first computes a hash of a field's value, and then uses these hashes to create chunks.

With hash based partitioning, two documents with "close" shard key values are unlikely to be part of the same chunk. This ensures a more random distribution of a collection in the cluster.
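
The effect can be sketched as follows. MD5 is used here only as a convenient stable hash, and the 64-bit hash space is cut into four equal ranges; both choices are assumptions for illustration and are not claimed to match MongoDB's internal hashing.

```python
import hashlib

NUM_CHUNKS = 4  # hypothetical number of chunks

def hash_chunk(shard_key_value):
    """Map a shard key value to a chunk by hashing it first."""
    digest = hashlib.md5(str(shard_key_value).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned 64-bit integer.
    hashed = int.from_bytes(digest[:8], "big")
    # Each chunk owns an equal slice of the 64-bit hash space.
    return hashed * NUM_CHUNKS // 2**64

# Count how 1000 consecutive keys spread over the chunks.
counts = [0] * NUM_CHUNKS
for user_id in range(1000):
    counts[hash_chunk(user_id)] += 1
print(counts)
```

The counts come out roughly equal, even though the keys themselves are consecutive.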

Performance Distinctions between Range and Hash Based Partitioning

Range based partitioning supports more efficient range queries. Given a range query on the shard key, the query router can easily determine which chunks overlap that range and route the query only to the shards that contain those chunks.

Range based partitioning can lead to an uneven distribution of data, which can cancel out some of the benefits of sharding, for example when most of the requests are routed to the same shard.

Hash based partitioning ensures that the data is evenly distributed, but because hashed values are spread randomly across chunks and shards, a range query on the shard key cannot be targeted at a few shards and must instead query every shard.
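
The trade-off can be sketched with a small routing function. The chunk boundaries, the chunk owners, and the function name shards_for_range_query are hypothetical, and the sketch assumes integer shard key values.

```python
from bisect import bisect_right

# Hypothetical layout: chunk i covers [LOWER_BOUNDS[i], LOWER_BOUNDS[i + 1]),
# and the last chunk is open-ended.
LOWER_BOUNDS = [0, 100, 200, 300]
CHUNK_OWNER = ["shard1", "shard2", "shard3", "shard1"]
ALL_SHARDS = sorted(set(CHUNK_OWNER))

def shards_for_range_query(low, high, hashed):
    """Return the shards a query for low <= key <= high must contact."""
    if hashed:
        # Hashing destroys key order, so any shard may hold matching documents.
        return ALL_SHARDS
    first = max(bisect_right(LOWER_BOUNDS, low) - 1, 0)
    last = max(bisect_right(LOWER_BOUNDS, high) - 1, 0)
    return sorted(set(CHUNK_OWNER[first:last + 1]))

print(shards_for_range_query(120, 180, hashed=False))  # ['shard2']
print(shards_for_range_query(120, 180, hashed=True))   # ['shard1', 'shard2', 'shard3']
```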

Customized Data Distribution with Tag Aware Sharding

MongoDB allows you to use tag aware sharding: you create tags for ranges of the shard key and associate these tags with back-end shards. It is mainly used when the data of a single sharded cluster is distributed across multiple data centers.
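
A minimal sketch of the idea, assuming two data centers: shards carry a tag, ranges of the shard key carry a tag, and a chunk may only be placed on shards whose tag matches its range. All tag names, ranges, and shard names below are hypothetical.

```python
# Hypothetical tags: each shard is tagged with the data center it lives in.
SHARD_TAGS = {"shard1": "dc-east", "shard2": "dc-east", "shard3": "dc-west"}

# Hypothetical tag ranges on the shard key: (low, high, tag) covers [low, high).
TAG_RANGES = [(0, 5000, "dc-east"), (5000, 10000, "dc-west")]

def eligible_shards(shard_key_value):
    """Return the shards allowed to hold a chunk containing this key value."""
    for low, high, tag in TAG_RANGES:
        if low <= shard_key_value < high:
            return sorted(s for s, t in SHARD_TAGS.items() if t == tag)
    # Keys outside every tagged range may be placed on any shard.
    return sorted(SHARD_TAGS)

print(eligible_shards(1200))    # ['shard1', 'shard2']
print(eligible_shards(7500))    # ['shard3']
print(eligible_shards(20000))   # ['shard1', 'shard2', 'shard3']
```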

Maintaining a Balanced Data Distribution

As data grows or servers are added, the data distribution of the sharded cluster can become uneven. For example, one shard may hold many more chunks than the other shards, or one chunk may be much larger than the other chunks.

MongoDB uses two background processes to keep a sharded cluster balanced: splitting and the balancer.

Splitting

Splitting is a background process that keeps chunks from growing too large. When a chunk exceeds the specified chunk size, MongoDB splits the chunk in half. Both insert and update operations can trigger a split.
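
The behaviour can be sketched with a toy model. For simplicity the limit below counts documents rather than bytes, and the names MAX_CHUNK_DOCS and insert_key are hypothetical.

```python
MAX_CHUNK_DOCS = 4  # stand-in for the chunk size limit, counted in documents

def insert_key(chunks, key):
    """Insert key into the chunk covering it, splitting the chunk if too big.

    chunks is a list of sorted lists, ordered by the keys they hold.
    """
    # Find the last chunk whose smallest key is <= key (default: first chunk).
    target = 0
    for i, chunk in enumerate(chunks):
        if chunk and chunk[0] <= key:
            target = i
    chunk = chunks[target]
    chunk.append(key)
    chunk.sort()
    if len(chunk) > MAX_CHUNK_DOCS:
        middle = len(chunk) // 2
        # Replace the oversized chunk with its lower and upper halves.
        chunks[target:target + 1] = [chunk[:middle], chunk[middle:]]

chunks = [[]]
for key in [5, 1, 9, 3, 7, 2, 8]:
    insert_key(chunks, key)
print(chunks)   # [[1, 2, 3], [5, 7, 8, 9]]
```

The single starting chunk is split when the fifth key arrives, and later inserts go to whichever half covers them.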

Balancing

The balancer is a background process that manages chunk migrations.

When the data of a sharded collection is unevenly distributed across the cluster, the balancer migrates chunks from the shard with the most chunks to the shard with the fewest chunks until the collection is balanced. For example, if the collection user has 100 chunks on shard1 and 50 chunks on shard2, the balancer migrates chunks from shard1 to shard2 until the number of chunks on the two shards is balanced.
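
The example with 100 and 50 chunks can be worked through with a simplified model that moves one chunk at a time and tracks only chunk counts. The function name balance is hypothetical, and the real balancer applies additional rules that are ignored here.

```python
def balance(chunk_counts):
    """Move one chunk at a time from the fullest to the emptiest shard.

    chunk_counts maps shard name -> number of chunks; it is modified in place.
    Returns the number of migrations performed.
    """
    migrations = 0
    while True:
        fullest = max(chunk_counts, key=chunk_counts.get)
        emptiest = min(chunk_counts, key=chunk_counts.get)
        # Stop once no single move can make the distribution more even.
        if chunk_counts[fullest] - chunk_counts[emptiest] <= 1:
            return migrations
        chunk_counts[fullest] -= 1
        chunk_counts[emptiest] += 1
        migrations += 1

counts = {"shard1": 100, "shard2": 50}
print(balance(counts), counts)   # 25 {'shard1': 75, 'shard2': 75}
```

In this model 25 migrations are needed, after which both shards hold 75 chunks.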

Adding shards to or removing shards from a sharded cluster affects the balance of the entire cluster.

Primary shard

Each database has a primary shard that stores all the unsharded collections in that database.

Application scenarios of MongoDB Sharding technology:

a. The data set size is approaching or has already exceeded the storage capacity of a single MongoDB instance.

b. The size of the active working set will soon exceed the maximum physical memory.

c. A single MongoDB instance cannot keep up with frequent write operations.

If none of the three situations above applies, there is no need to deploy sharding; it will only make the system more complex. Even so, when designing the data model, you should take into account the possibility of sharding in the future.

Data Quantity Requirements

Sharding only pays off when there is a large amount of data. The default chunk size is 64MB. The balancer migrates data to other shards only when certain conditions are met, namely when the difference in the number of chunks between shards exceeds the migration threshold; until then, the data stays on a single shard.
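
A back-of-the-envelope calculation shows why small data sets never leave their first shard. The sketch below estimates the number of chunks from the 64MB default and looks up a migration threshold; the threshold values (2, 4, and 8) follow the table found in older MongoDB manuals and should be treated as an assumption, as should the function names.

```python
import math

CHUNK_SIZE_MB = 64  # default chunk size stated above

def estimated_chunks(data_size_mb):
    """Rough number of chunks for a collection of the given size."""
    return max(1, math.ceil(data_size_mb / CHUNK_SIZE_MB))

def migration_threshold(total_chunks):
    """Chunk-count difference that triggers balancing (assumed values)."""
    if total_chunks < 20:
        return 2
    if total_chunks < 80:
        return 4
    return 8

# Prints: 50 1 2, then 1024 16 2, then 10240 160 8
for size_mb in (50, 1024, 10240):
    chunks = estimated_chunks(size_mb)
    print(size_mb, chunks, migration_threshold(chunks))
```

A 50MB collection fits in a single chunk, so there is nothing for the balancer to move, while a 10GB collection produces about 160 chunks that can be spread across shards.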

Broadcast Operations and Targeted Operations

In general, a sharded cluster processes requests from clients in one of the following two ways:

Broadcast the operation to all shards in the cluster that hold documents of the collection.

Target the operation at a single shard or a small group of shards, based on the shard key.
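
The difference between the two can be sketched with a toy router that handles equality queries only. The shard key field, the shard names, and the placement rule in owner_of are hypothetical stand-ins for the real chunk metadata.

```python
ALL_SHARDS = ["shard1", "shard2", "shard3"]
SHARD_KEY = "user_id"  # hypothetical shard key field

def owner_of(value):
    """Hypothetical placement rule standing in for the chunk metadata."""
    return ALL_SHARDS[value % len(ALL_SHARDS)]

def route(query):
    """Return the routing mode and the shards to contact for a query."""
    if SHARD_KEY in query:
        # The shard key pins the query to the shard owning that value.
        return "targeted", [owner_of(query[SHARD_KEY])]
    # Without the shard key the router cannot rule out any shard.
    return "broadcast", list(ALL_SHARDS)

print(route({"user_id": 7}))      # ('targeted', ['shard2'])
print(route({"name": "alice"}))   # ('broadcast', ['shard1', 'shard2', 'shard3'])
```

Queries that include the shard key touch one shard, which is why the shard key should match the most frequent query patterns.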

Reference documentation:

Http://docs.mongodb.org/manual/sharding/

