Sharing of MongoDB optimization experience

2025-01-29 Update From: SLTechnology News&Howtos


Here is a summary of my experience using MongoDB over the past while, along with a few points worth paying attention to.

1. System parameters and mongo parameter settings

The two mongo parameters that matter most are storageEngine and directoryperdb; neither can be changed later if it is not chosen at the start.

directoryperdb stores each database in its own sub-folder, which makes subsequent backups and data migration much easier.

storageEngine defaults to MMAPv1; the wiredTiger engine added in 3.0 is recommended. In actual use, wiredTiger occupies about 1/5 of the disk space of MMAPv1, its indexes are about 1/2 the size, and query speed is also greatly improved. More importantly, wiredTiger locks at the document level, so reads are not blocked while a collection is being inserted into or updated. The only problem is that few tools on the market support querying this engine: MongoVUE cannot find collections stored with it, and NosqlManager-mongo can, but requires a .NET environment. Personally, I find that being comfortable with the mongo shell is enough, so the wiredTiger engine is highly recommended.
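As a quick sanity check, you can confirm from the mongo shell which engine a running instance actually uses, and compare on-disk size against logical data size to see wiredTiger's compression at work. A minimal sketch (requires a running mongod; on pre-3.0 MMAPv1 instances the storageEngine field may be absent):

```javascript
// Which storage engine is this mongod running?
var engine = db.serverStatus().storageEngine;
print("storage engine: " + engine.name);  // e.g. "wiredTiger"

// Compare logical data size vs. actual on-disk size for the current database.
// With wiredTiger, storageSize is typically much smaller than dataSize
// thanks to block compression.
var s = db.stats();
print("dataSize:    " + s.dataSize + " bytes");
print("storageSize: " + s.storageSize + " bytes");
```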

2. There is no need to split the collection horizontally

Having used relational databases before, where the usual remedy for an oversized single table is to split it into sub-tables, it felt natural to assume the same trick would work in mongo. Since the system generated its sub-collections dynamically, it turned out that the performance gained was far less than the maintenance cost added.

The main reason table splitting improves performance in relational databases is that in many of them a table is a single file, and splitting avoids the slow data access caused by one huge file. Mongo does not store data this way, so the argument does not hold.

Anyone who has used it knows that mongo depends heavily on indexes, and if the collections cannot all be designed up front, later indexes have to be created by script. Here is a script that dynamically creates indexes on large mongo collections:

```javascript
(function () {
    var infos = [];
    var collNames = db.getCollectionNames();
    for (var i = 0; i < collNames.length; i++) {
        var collName = collNames[i];
        var collSize = db.getCollection(collName).count();
        if (collSize > 1000000 && collName.indexOf("info_") == 0) {
            db.getCollection(collName).ensureIndex(
                { publishDate: -1, blendedScore: -1, publishTime: -1, isRubbish: 1 },
                { name: "ScoreSortIdx", background: true });
            db.getCollection(collName).ensureIndex(
                { similarNum: -1, publishTime: -1, isRubbish: 1 },
                { name: "HotSortIdx", background: true });
            db.getCollection(collName).ensureIndex(
                { publishTime: -1, isRubbish: 1 },
                { name: "TimeSortIdx", background: true });
            infos.push("name: " + collName + " index created successfully");
        }
    }
    return infos;
}())
```

Dynamic index creation can barely paper over the problem, but the worst part is that sharding cannot be handled this way at all: shardCollection must be given a collection name and shard key up front, so it cannot be applied to collections that are generated dynamically. Therefore, mongo collections do not need to be split horizontally (at least not at the ten-million-document level; beyond that, shard them directly); they only need to be separated by business domain.

3. Use Capped Collection

Some people use mongo as a cache for a fixed amount of data, yet still use an ordinary collection and clean it up on a schedule. In this scenario, performance is much better with a capped collection.
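A minimal sketch of creating one (requires a running mongod; the collection name and size limits are illustrative): a capped collection preallocates a fixed amount of space and automatically overwrites the oldest documents in insertion order, so no cleanup job is needed at all.

```javascript
// Create a capped collection: at most 100 MB on disk and 100,000 documents,
// whichever limit is hit first. Oldest documents are evicted automatically.
db.createCollection("cacheEvents", {
    capped: true,
    size: 100 * 1024 * 1024,  // maximum total size in bytes (required)
    max: 100000               // optional cap on document count
});

// Verify the collection is capped:
print(db.cacheEvents.isCapped());  // true
```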

4. Replica sets must be used in the production environment

Many production environments still run a standalone instance. Deployment is quick, but it forgoes features mongo provides out of the box, such as automatic failover and read-write separation, which matter greatly for later system expansion and performance tuning. Mongo is usually adopted once the data volume is large enough that query performance really matters, so going live with a replica set from day one is strongly recommended.

5. Learn to use explain

I used to rely on GUI tools for querying, but I now find I should use the mongo shell more often and inspect the query plan with explain. The hint command is also very useful when hunting for the best index.

```javascript
db.info.find({
    publishDate: { $gte: 20160310, $lte: 20160320 },
    isRubbish: { $in: [0, 1] },
    title: { $regex: ".*test.*" },
    $or: [{ useId: 10 }, { groupId: 20 }]
}).explain("executionStats")
```
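When comparing candidate indexes, hint() forces the optimizer to use a specific one so you can read its executionStats side by side with other plans. A sketch, assuming a running mongod and the ScoreSortIdx index name used earlier:

```javascript
// Force a specific index by name, then inspect how many documents and index
// keys were examined versus returned -- the key numbers when judging an index.
var plan = db.info.find({
    publishDate: { $gte: 20160310, $lte: 20160320 }
}).hint("ScoreSortIdx").explain("executionStats");

print("examined docs: " + plan.executionStats.totalDocsExamined);
print("examined keys: " + plan.executionStats.totalKeysExamined);
print("returned:      " + plan.executionStats.nReturned);
```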

6. Read-write separation cannot be used for frequent write operations

Because the system performs many writes, various W-level locks appeared frequently (these locks usually block reads). Since the system does not require strict data consistency (writes happen mostly in the background and reads in the foreground, so some delay is acceptable), we wanted to use the replica set for read-write separation. In actual testing, reads on the secondary were frequently blocked. db.currentOp() showed a recurring op:none operation requesting the global write lock, during which every operation sat at waitingForLock:true. A long search turned up no solution, until the following trap surfaced in the concurrency FAQ of the official documentation:

How does concurrency affect secondaries?

In replication, MongoDB does not apply writes serially to secondaries. Secondaries collect oplog entries in batches and then apply those batches in parallel. Secondaries do not allow reads while applying the write operations, and apply write operations in the order that they appear in the oplog.

So while a secondary is applying a batch of oplog entries replicated from the primary, reads on it are blocked, which essentially rules out reading from secondaries; several days of effort went to waste. This is why mongo officially does not recommend read-write separation: the pitfall lies here. In fact, read-write separation gains little in write-heavy, read-light workloads anyway, because the bottleneck is writing, while reads generally consume few resources (moreover, the wiredTiger engine locks at the document level, so lock contention is comparatively rare). The officially recommended approach is sharding, which effectively spreads writes across multiple servers, raising write throughput and letting the system scale horizontally.
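Enabling sharding is a two-step affair from the mongos shell. A sketch, assuming a running sharded cluster; the database name "mydb", collection "info", and the hashed key "userId" are illustrative choices, not taken from the original system:

```javascript
// Step 1: allow the database to be sharded.
sh.enableSharding("mydb");

// Step 2: shard the collection. A hashed shard key spreads inserts evenly
// across shards, which is what helps a write-heavy workload; a ranged key
// would instead keep adjacent values together for range queries.
sh.shardCollection("mydb.info", { userId: "hashed" });

// Inspect chunk distribution across shards:
sh.status();
```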

7. Never let the disk be full

Start planning to expand or split once disk usage reaches 80%. If your data grows very quickly, the disk will likely fill up before you get around to splitting, and MongoDB will go down. If the data volume is large, use sharding as much as possible rather than relying on a replica set alone, and plan disk capacity properly. Even with sharding, expand capacity in advance; chunk migration is, after all, slow.

8. Security risk

MongoDB does not prompt you to set a password by default, so if you expose an unauthenticated MongoDB instance to the public network, then "congratulations": it may already have been compromised and turned into someone's zombie host.
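The minimum fix is to create an administrative user and then restart mongod with auth enabled. A sketch (requires a running mongod; the user name and password below are placeholders to replace):

```javascript
// Create an admin user in the "admin" database. After this, restart mongod
// with --auth (or security.authorization: enabled in the config file) so
// unauthenticated connections are rejected.
var admin = db.getSiblingDB("admin");
admin.createUser({
    user: "admin",          // placeholder -- choose your own
    pwd: "changeMe",        // placeholder -- use a strong password
    roles: [{ role: "userAdminAnyDatabase", db: "admin" }]
});
```

Binding mongod to an internal interface (bind_ip) instead of all interfaces is a worthwhile second layer on top of authentication.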

9. Database level lock

The locking mechanism of MongoDB differs greatly from that of typical relational databases such as MySQL (InnoDB) and Oracle. InnoDB and Oracle provide row-level lock granularity, while MongoDB (under MMAPv1) provides only database-level granularity, which means that while a MongoDB write lock is held, all other read and write operations have to wait.

At first glance, database-level locks look like a serious problem under heavy concurrency, yet MongoDB can still sustain high concurrency and performance. That is because, although its lock granularity is coarse, its lock handling differs markedly from relational database locks, mainly as follows:

MongoDB has no full transaction support; operation atomicity extends only to a single document, so the granularity of each operation is usually small.

The actual time a MongoDB lock is held is the in-memory computation and modification time, which is usually very short.

MongoDB locks have a yield mechanism: when an operation must wait on slow IO to read or write data, it can temporarily release the lock and re-acquire it once the IO completes.

That it usually causes no problems does not mean it never will. Careless operations can still hold the write lock for a long time, such as a foreground index build (creating an index without background:true). When that happens, the entire database is completely blocked and cannot serve any reads or writes, which is very serious.

The fix is to avoid operations that hold the write lock for a long time wherever possible. If such operations on some collection are unavoidable, consider moving that collection into a separate MongoDB database: locks in different databases are isolated from one another, so separating the collection prevents one collection's operations from blocking everything globally.
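Isolating a hot collection in its own database is a one-line change on the client side. A sketch (requires a running mongod; "logdb" and "accessLog" are hypothetical names for illustration):

```javascript
// Write the lock-heavy collection into its own database. A long write lock
// on "logdb" no longer blocks reads and writes in the main database.
var logdb = db.getSiblingDB("logdb");
logdb.accessLog.insert({ ts: new Date(), path: "/index", status: 200 });

// The main database's collections remain accessible under "db" as before.
```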
