Many newcomers are unclear about how a MongoDB cluster handling millions of highly concurrent requests per second can have its performance improved by tens of times. To help answer that question, this article explains the optimization process in detail; readers facing the same problem are welcome to follow along and will hopefully gain something from it.
Background
The peak TPS of an online cluster exceeds roughly 1 million/s (mostly write traffic; read traffic is very low). This peak has almost reached the upper limit of the cluster, and the average latency already exceeds 100ms. As read and write traffic grows further, latency jitter seriously affects business availability. The cluster uses MongoDB's native sharding architecture, with data evenly distributed across shards; shard keys were added to enable sharding and achieve good load balancing. The traffic monitoring of each node in the cluster is shown in the following figure:
As can be seen from the figure above, cluster traffic is quite high, peaking at more than 1.2 million/s. The deletes triggered by TTL expiration are not included in this total (TTL deletes are triggered on the primary but are not shown there; they only appear on the secondaries when the oplog is pulled and replayed). If the primary's delete traffic is counted, the total TPS exceeds 1.5 million/s.
Software optimization
Without adding any server resources, the following software-level optimizations were made and achieved a performance improvement of several times:
Business level optimization
Mongodb configuration optimization
Storage engine optimization
Business level optimization
The cluster holds nearly 10 billion documents in total. By default each document is kept for three days, and the business randomly hashes the expiration to some point in time three days after writing. Because of the huge number of documents, peak-time monitoring showed a large number of delete operations on the secondaries during the day; at some points the delete rate even exceeded the business read and write traffic. We therefore decided to move the TTL expiration work into the night. The expired (TTL) index is added as follows:
Db.collection.createIndex ({"expireAt": 1}, {expireAfterSeconds: 0})
In the TTL index above, expireAfterSeconds=0 means that a document in the collection expires exactly at the point in time stored in its expireAt field, for example:
// this document will be deleted at about 01:00 at night
db.collection.insert({"expireAt": new Date('July 22, 2019 01:00:00'), "logEvent": 2, "logMessage": "Success!"})
By randomly hashing expireAt across the early-morning hours three days later, the burst of deletes that TTL indexes would otherwise trigger during the daytime peak is avoided, which reduces cluster load during peak hours and ultimately lowers the average service latency and its jitter.
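As a minimal sketch of what the business side might do (the field names follow the examples above; the real application code is not shown in this article), the write path can randomize expireAt to a point between 01:00 and 05:00 three days later:
// compute 01:00 three days from now, then add a random offset within 4 hours
var base = new Date();
base.setDate(base.getDate() + 3);
base.setHours(1, 0, 0, 0);
var offsetMs = Math.floor(Math.random() * 4 * 60 * 60 * 1000);
db.collection.insert({
    "expireAt": new Date(base.getTime() + offsetMs), // the TTL index deletes the document around this time
    "logEvent": 2,
    "logMessage": "Success!"
})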
TTL delete Tips 1: the meaning of expireAfterSeconds
Case 1: the document expires at the absolute time stored in expireAt, i.e. at 02:01 on Dec 22 in the example below:
Db.collection.createIndex ({"expireAt": 1}, {expireAfterSeconds: 0}) db.log_events.insert ({"expireAt": new Date (Dec 22, 2019 02 expireAfterSeconds 01hand 00'), "logEvent": 2, "logMessage": "Success!"})
Case 2: the document expires expireAfterSeconds seconds after the time stored in the indexed field, i.e. 60 seconds after the insert time in the example below:
Db.log_events.insert ({"createdAt": new Date (), "logEvent": 2, "logMessage": "Success!"}) Db.collection.createIndex ({"expireAt": 1}, {expireAfterSeconds: 60})
TTL delete Tips 2: why does mongostat show delete operations only on the secondaries and not on the primary?
The reason is that TTL expiration is triggered only on the primary. When it fires, the primary deletes the documents by calling the corresponding WiredTiger storage engine API directly, without going through the normal client connection processing path, so no delete statistics appear on the primary.
The primary does, however, write a delete oplog entry for every TTL deletion. The secondary pulls the primary's oplog and replays it as a simulated client, which guarantees that data deleted on the primary is also deleted on the secondary, ensuring eventual consistency. Because this replay on the secondary goes through the normal client connection path, the delete counters are recorded there.
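As a hedged illustration (host and port are placeholders), watching a secondary with mongostat shows these replayed deletes in its delete column, while the same column on the primary stays near zero:
mongostat --host <secondary-host>:27017 1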
The official reference is as follows: https://docs.mongodb.com/manual/tutorial/expire-data/
Mongodb configuration optimization (network IO reuse, network IO and disk IO separation)
Because the cluster TPS is high and there are large bursts of pushes on the hour, concurrency at those moments is even higher. MongoDB's default one-thread-per-request model puts serious pressure on system load, and this default configuration is not suitable for read/write applications with high concurrency. The principle is described below:
Implementation principle of Mongodb Internal Network Thread Model
In the default network model, MongoDB creates one thread per client connection, and that thread handles all read and write requests and disk IO operations for that connection's fd.
MongoDB's default network threading model is not suitable for highly concurrent reads and writes, for the following reasons:
Under high concurrency, a large number of threads are created in an instant. In this online cluster, for example, the number of connections can jump to about 10,000, which means the operating system has to create 10,000 threads almost at once, and the system load rises sharply.
In addition, when requests have been processed and traffic enters a trough, the client connection pool reclaims connections and the mongodb server has to destroy the corresponding threads, which aggravates system load again and increases database jitter, especially with short-connection clients such as PHP, where threads are constantly created and destroyed. Moreover, since one connection corresponds to one thread, the same thread is responsible both for network send/receive and for writing data into the storage engine; putting all network I/O and disk I/O on one thread is a defect of this architecture.
Optimization method of Network Thread Model
To adapt to highly concurrent read and write scenarios, mongodb-3.6 introduced the serviceExecutor: adaptive configuration, which dynamically adjusts the number of network threads according to the request load and multiplexes network IO (built on the boost::asio network module), reducing the system load caused by thread creation and destruction. With this configuration, network IO and disk IO are also separated: connections share a multiplexed network IO layer, while the number of threads doing disk IO is bounded by mongodb's lock operations. This ultimately reduces the load caused by creating and destroying large numbers of threads and improves highly concurrent read/write performance.
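A minimal mongod.conf sketch of this setting (only the relevant key is shown; other required options such as storage paths and net.port are omitted):
net:
  serviceExecutor: adaptive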
Performance comparison of Network Thread Model before and after Optimization
After adding the serviceExecutor: adaptive configuration to this high-traffic cluster, thereby multiplexing network IO and separating network IO from disk IO, latency dropped significantly and system load and slow logs were also greatly reduced, as shown below:
Comparison of system load before and after optimization
Verification method:
The cluster has multiple shards; the load of one shard's primary with the optimized configuration is compared against that of a primary without the optimization at the same point in time:
Unoptimized configuration of load
Optimized configuration of load
Comparison of slow logs before and after optimization
Verification method:
The cluster has multiple shards; the number of slow logs on one shard's primary with the optimized configuration is compared against that of a primary without the optimization at the same point in time:
Statistics of slow logs at the same time:
Number of slow logs not optimized (19621):
Number of slow logs after optimized configuration (5222):
Comparison of average delay before and after optimization
Verification method:
The average latency of all nodes in the cluster with the network IO multiplexing configuration and the default configuration is compared as follows:
As can be seen from the figure above, latency drops to roughly one half to one third of its previous value after network IO multiplexing is enabled.
Wiredtiger storage engine optimization
From the previous section we can see that average latency has been reduced from 200ms to about 80ms, which is clearly still very high. How can performance be improved further and latency reduced more? Continuing to analyze the cluster, we found that disk IO would sit at 0 for a moment, run at 100% for a while, and then drop back to 0, as shown below:
As can be seen from the figure, about 2GB of IO is written in one burst and then blocks for the next several seconds, during which read and write IO drop completely to 0; avgqu-sz and await are huge and util stays at 100%. While IO is stuck at 0, the business-side TPS drops to 0 at the same time.
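For reference, avgqu-sz, await and util are columns of the extended iostat output (the same tool whose IO charts appear later in this article); they can be sampled once per second with, for example:
iostat -xm 1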
In addition, after a burst of IO writes, util stayed at 0% for a long time, as follows:
The overall IO load curve is as follows:
As can be seen from the figure, IO util stays at 0% for long stretches and then jumps to 100% for long stretches. When IO util reached 100%, log analysis found a large backlog of slow logs, and mongostat monitoring showed the following at the same time:
From the above, we can see that when mongostat periodically fetches a node's status, it often times out, and those timeouts coincide exactly with io util=100%, i.e. with the moments when the disk cannot keep up with the client write speed and requests block.
Given these phenomena, we can conclude that the problem is caused by IO failing to keep up with client writes. Having already optimized the mongodb service layer in the previous section, we now turn to the wiredtiger storage engine, mainly in the following three aspects:
Cachesize adjustment
Adjustment of the proportion of dirty data elimination
Checkpoint optimization
CacheSize tuning (why a larger cacheSize can mean worse performance)
Looking next at the mongod.conf configuration file, we find cacheSizeGB: 110G, and monitoring shows that the total amount of KV data held by the storage engine has almost reached 110G. Since dirty pages start to be flushed once they reach 5% of the cache, the larger the cacheSize at peak time, the more dirty data can accumulate in it; if disk IO capacity cannot keep up with the rate at which dirty data is produced, the disk easily becomes the bottleneck, with IO util pinned at 100%, which is also the cause of the IO dropping to 0 observed earlier.
Looking at the machine's memory, the total is 190G, of which about 110G is already used, almost all of it by mongod. This shrinks the kernel page cache, and when a large amount of data is written, the shortage of page cache causes kernel page faults and triggers large bursts of disk writes.
Solution: based on the above analysis, the problem is that under heavy writes, too much dirty data accumulates in the cache and is then flushed in one large burst. We therefore reduce cacheSizeGB to 50G, which limits how much dirty data can build up at once and avoids the disk IO being saturated by a single massive flush at peak time.
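A minimal mongod.conf sketch of this change (only the relevant key is shown):
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 50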
Optimization of dirty data elimination in storage engine dirty
Reducing cacheSize solved the 5s request timeouts and the corresponding alarms disappeared, but the problem was not completely gone: the 5s timeouts vanished, yet occasional 1s timeouts still occurred.
The key question therefore becomes how, with cacheSize already reduced, to further avoid bursty flushing; in other words, how to balance memory against IO. This requires a closer look at the storage engine internals, in particular wiredtiger's cache eviction policy. The relevant mongodb default settings for the storage engine are as follows:
Even after reducing cacheSize to 50G, once dirty data reaches 5% of the cache, in extreme cases the eviction speed may still fail to keep up with client writes, which can still saturate IO and eventually cause blocking.
Solution: the crux is how to further reduce bursty IO writes, i.e. how to balance disk against memory. As the table above shows, when dirty data or total cache usage reaches a certain fraction of memory, background threads start selecting pages to evict and write to disk; if the fraction rises further, user threads themselves start doing page eviction, which is a very dangerous state because user requests block while evicting. The way to balance cache and IO is therefore to adjust the eviction policy so that background threads evict data as early as possible, avoiding large bursts of flushing, and to keep usage below the thresholds at which user threads would be drafted into eviction and blocked. The adjusted storage engine settings are as follows:
eviction_target: 75%
eviction_trigger: 97%
eviction_dirty_target: 3%
eviction_dirty_trigger: 25%
evict.threads_min: 8
evict.threads_max: 12
The general idea is to let the background evict threads flush dirty pages to disk as early as possible, and to increase the number of evict threads so that dirty data is evicted faster. After this adjustment, the mongostat and client timeout symptoms were further alleviated.
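As a hedged sketch, one way such eviction parameters can be applied to a running mongod is through the wiredTigerEngineRuntimeConfig parameter (the article does not state which mechanism was actually used; the same values can also be placed in storage.wiredTiger.engineConfig.configString):
db.adminCommand({
    setParameter: 1,
    wiredTigerEngineRuntimeConfig: "eviction=(threads_min=8,threads_max=12),eviction_target=75,eviction_trigger=97,eviction_dirty_target=3,eviction_dirty_trigger=25"
})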
Storage engine checkpoint optimization tuning
A storage engine checkpoint is essentially a snapshot that persists all of the storage engine's current dirty data to disk. By default there are two conditions that trigger a checkpoint:
Take a checkpoint snapshot on a fixed cycle. Default is 60s.
The increment of redo log (i.e. journal log) reaches 2G
That is, a checkpoint is triggered when the journal grows by 2GB, or when it has not yet reached 2GB but 60s have passed since the last checkpoint. The fewer dirty pages the evict threads flush between two checkpoints, the more dirty data accumulates, and the more data the next checkpoint has to write, producing a large burst of disk IO during the checkpoint. If we shorten the checkpoint cycle, the dirty data accumulated between two checkpoints shrinks accordingly, and the time during which disk IO sits at 100% is shortened.
The adjusted checkpoint configuration is as follows:
checkpoint=(wait=25,log_size=1GB)
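A hedged sketch of applying this value, again assuming the runtime parameter is used rather than the engine configString:
db.adminCommand({
    setParameter: 1,
    wiredTigerEngineRuntimeConfig: "checkpoint=(wait=25,log_size=1GB)"
})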
Comparison of IO before and after Storage engine Optimization
After the three storage engine optimizations above, disk IO is spread much more evenly across time; the post-optimization IO load monitored by iostat is as follows:
As the io load chart shows, the earlier pattern of IO sitting at 0% for a while and then bursting has been alleviated, as shown in the following figure:
Delay comparison before and after Storage engine Optimization
The latency before and after optimization is compared below (note: several services share this cluster; their latencies before and after optimization are as follows):
As can be seen from the figure above, after the storage engine optimization the latency is further reduced and more stable, falling from an average of 80ms to an average of 20ms, but it is still not perfect and some jitter remains.
Resolving the server disk IO hardware problem
Server IO hardware problem background
As described earlier, whenever wiredtiger evicts a large amount of data and the disk write rate exceeds 500MB/s, util stays at 100% for the next several seconds and then drops to 0, so we suspected a defect in the disk hardware.
From the figure above, the disk is an NVMe SSD. Looking up its specifications, the IO performance of this model is very good, supporting 2GB/s writes and up to 25,000 IOPS, yet our online disks can only sustain about 500MB/s of writes.
Performance comparison of server IO after hardware problem resolution
So we migrated all the primary nodes of the sharded cluster to another type of server whose disk IO can sustain 2GB/s writes (note: only the primaries were migrated; the secondaries remained on the previous SSD servers). After the migration, performance improved further and latency dropped to 2-4ms. The latency monitoring seen at three different business levels is shown below:
As can be seen from the delay in the figure above, after migrating the master node to a machine with better IO capability, the delay is further reduced to an average of 2-4ms.
Although latency has dropped to an average of 2-4ms, there are still many spikes of dozens of ms. The reasons will be shared in the next installment, where we eventually keep all latencies within 5ms and eliminate the dozens-of-ms spikes.
In addition, after further analysis the NVMe SSD IO bottleneck was finally traced to a mismatched linux kernel version. If you hit the same problem with NVMe SSD disks, remember to upgrade the linux kernel to 3.10.0-957.27.2.el7.x86_64; after the upgrade the NVMe SSD sustains more than 2GB/s of IO.
Summary and remaining problems
After optimizing the mongodb service layer configuration, the storage engine, and the hardware IO, the average write latency of this high-traffic cluster was reduced from several hundred ms to 2-4ms, an overall performance improvement of tens of times.
However, as the post-optimization latency graphs show, the cluster still has occasional jitter. Due to length constraints, the next installment will share how that remaining jitter was eliminated, keeping latency at 2-4ms with no spikes above 10ms. Please look forward to the next article.
In addition, a number of pitfalls were collected during the cluster optimization; the next installment will continue to analyze the record of pitfalls encountered in this high-traffic cluster.
Note: some of the optimization methods in this article are not necessarily suitable for every mongodb scenario. Please tune according to your actual business scenario and hardware resources, rather than applying them blindly step by step.