
MongoDB LSM reduces disk latency


Background

Part1: Foreword

In a replica set architecture, when the workload is write-heavy and read-light and most of the writes are updates, the bottleneck of the WT engine becomes apparent. This leads directly to problems such as the business reporting that write operations take too long. For this reason, the Percona build of MongoDB supports the RocksDB storage engine, which copes with write-heavy workloads much more comfortably.

Part2: Background

In scenarios with a large volume of business updates, we found that disk latency on the WT storage engine was relatively high, and increasing cache_size, concurrency, and eviction settings did not help much, so we tried switching to the RocksDB engine instead.

Part3: Measures

Change the write concern from majority to 1 and observe the effect.

w: the number of nodes the data must be written to before acknowledging to the client. {w: 0}: no acknowledgement is sent to the client; suitable for scenarios that demand high performance but do not care about correctness.

{w: 1}: the default writeConcern. The write is acknowledged to the client once the data has been written to the Primary.

{w: "majority"} data is written to most members of the replica set and an acknowledgment is sent to the client, which is suitable for scenarios that require high data security. This option reduces write performance.

j: acknowledge to the client only after the write operation's journal entry has been persisted. The default is {j: false}. If the Primary's write must be journaled before acknowledging, set this option to true.

majority was already enabled; in addition, since version 3.2.6, {w: "majority"} also implies flushing the journal to disk, which increases disk time and in turn slows down writes. Changing the writeConcern to 1 increases the write rate.

Related parameters

writeConcernMajorityJournalDefault
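As a reference, here is a minimal mongo shell sketch of how these write concerns are passed on an update and how to inspect the journaling default; the kgcxxxt database and col collection are taken from the log excerpt in Part4, while the filter and update values are made up:

// acknowledge once the write reaches the Primary only -- the change applied in Part3
db.getSiblingDB("kgcxxxt").col.updateOne(
    {_id: 1},
    {$set: {status: "ok"}},
    {writeConcern: {w: 1}}
)

// previous behaviour: wait for a majority of members (and, since 3.2.6, their journals)
db.getSiblingDB("kgcxxxt").col.updateOne(
    {_id: 1},
    {$set: {status: "ok"}},
    {writeConcern: {w: "majority"}}
)

// check whether w:"majority" also implies journal persistence
rs.conf().writeConcernMajorityJournalDefault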

Part4: Log excerpts

2018-08-21T01:00:50.096+0800 I COMMAND [conn4072412] command kgcxxxt.$cmd command: update {update: "col", ordered: true, writeConcern: {w: "majority"}, $db: "kgcxxxt"} numYields:0 reslen:295 locks: {Global: {acquireCount: {r: 2, w: 2}}, Database: {acquireCount: {w: 2}}, Collection: {acquireCount: {w: 1}} Oplog: {acquireCount: {w: 1} protocol:op_query 137ms

....

....

2018-08-21T01:00:50.096+0800 I COMMAND [conn4072176] command kgcxxxt.$cmd command: update {update: "col", ordered: true, writeConcern: {w: "majority"}, $db: "kgcxxxt"} numYields:0 reslen:295 locks: {Global: {acquireCount: {r: 2, w: 2}}, Database: {acquireCount: {w: 2}}, Collection: {acquireCount: {w: 1}} Oplog: {acquireCount: {w: 1} protocol:op_query 137ms

Part5: Monitoring

After switching the write concern to 1, QPS rose to 15k.

Part6: Adjusting the relevant parameters

Try enlarging cache_size and eviction:

db.adminCommand({setParameter: 1, wiredTigerEngineRuntimeConfig: "cache_size=90G"})
db.adminCommand({setParameter: 1, wiredTigerEngineRuntimeConfig: "eviction=(threads_min=1,threads_max=8)"})
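A minimal sketch for verifying the effect of these adjustments; the fields below are standard WiredTiger statistics exposed by serverStatus, though the exact output varies by version:

var wt = db.serverStatus().wiredTiger

// cache actually configured vs. in use
print(wt.cache["maximum bytes configured"])
print(wt.cache["bytes currently in the cache"])
print(wt.cache["tracked dirty bytes in the cache"])

// write/read tickets: a small "available" value means operations are queueing
printjson(wt.concurrentTransactions)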

As can be seen, the queueing came down after the parameters were adjusted.

But the improvement did not last long; before long the symptoms we did not want to see were back, so we decided to switch to the RocksDB engine.

Hands-on practice

Part1: Overall architecture

The original cluster is a 3-node replica set, all members using the WT storage engine. We switched one of the slave libraries (secondaries) to the RocksDB storage engine to observe its disk latency.

The figure above shows the QPS of the primary node.

Part2: RocksDB engine slave library

We configured one of the slave libraries with the RocksDB engine. A big improvement of RocksDB over WT is its use of an LSM-tree storage structure. WiredTiger organizes data in a B-tree; in some extreme scenarios, cache eviction and write amplification problems can lead to write hangs (details can be found in the relevant issues on the MongoDB JIRA). The MongoDB team has been optimizing these problems, and the stability of WiredTiger keeps improving. RocksDB, on the other hand, organizes data in an LSM tree, which is optimized for writing: random writes are converted into sequential writes, ensuring sustained, efficient write throughput. After replacing one of the slave libraries with the RocksDB engine, we compared its disk latency with that of the other slave library still running the WT engine.

The following figure shows the disk latency of the slave library running the RocksDB storage engine (LSM-tree structure); it stays within 1 ms.

Its QPS is shown in the following figure; repl_delete is about 3k and remains stable.

Part3: WT engine slave library

The following picture shows the slave library using the WT storage engine. Its disk latency matches the primary's: write latency reaches 8 ms, and reads are around 4 ms.

Its QPS is about 2.3k; there is a gap in the monitoring, and the slave library even had login timeouts because of the high pressure.

Part4: RocksDB engine configuration parameters

storage:
  engine: rocksdb
  useDeprecatedMongoRocks: true
  dbPath: /home/work/mongodb/mongo_28000/data/db_28000
  indexBuildRetry: true
  journal:
    enabled: true
    commitIntervalMs: 100
  rocksdb:
    cacheSizeGB: 10
    compression: snappy
    maxWriteMBPerSec: 1024
    configString: write_buffer_size=512M;level0_slowdown_writes_trigger=12;min_write_buffer_number_to_merge=2;
    crashSafeCounters: false
    counters: true
    singleDeleteIndex: false

Note 1

# storage options
storage:
  engine: "rocksdb"
  useDeprecatedMongoRocks: true    # required from Percona version 3.6 onward
  dbPath: /home/work/mongodb/mongo_10001/data
  rocksdb:
    cacheSizeGB: 10                # defaults to 30% of physical memory
    compression: snappy
    maxWriteMBPerSec: 1024         # unit MB; the rate at which rocks writes to storage. Lowering it reduces read-latency spikes, but too low a value slows writes
    configString: write_buffer_size=512M;level0_slowdown_writes_trigger=12;min_write_buffer_number_to_merge=2;
    crashSafeCounters: false       # whether counters must be accurate after a crash; turning this on may hurt performance
    counters: true                 # (on by default) whether to use advanced counters; turning it off improves write performance
    singleDeleteIndex: false

After adding the configString parameter write_buffer_size=512M;level0_slowdown_writes_trigger=12;min_write_buffer_number_to_merge=2; the disk latency and IOPS dropped further.
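To double-check from the mongo shell which engine a node is actually running, something like the sketch below can be used; the rocksdb section of serverStatus is specific to Percona's MongoRocks build, so treat its contents as indicative:

db.serverStatus().storageEngine.name    // expected: "rocksdb" on the converted slave library

// MongoRocks exposes its own statistics section (stalls, memtable counts, compaction info)
printjson(db.serverStatus().rocksdb)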

Note that the RocksDB (MongoRocks) engine is deprecated in Percona Server for MongoDB from version 3.6 and may be removed in the next major release. Whether to choose it depends on the actual situation; it is best to run a stress test together with the business to confirm it meets the requirements.

https://www.percona.com/doc/percona-server-for-mongodb/LATEST/mongorocks.html

Note 2

Tuning RocksDB parameters is generally a balance between three factors: write amplification, read amplification, and space amplification.

1. Flush options

write_buffer_size: the maximum size of a memtable. Once it is exceeded, RocksDB turns the memtable into an immutable memtable and opens a new one.

max_write_buffer_number: the maximum number of memtables. If the active memtable fills up and the number of active plus immutable memtables has reached this threshold, RocksDB stalls subsequent writes. This usually means writes are arriving faster than flush can keep up.

min_write_buffer_number_to_merge: the minimum number of memtables to merge before flushing to level0. With a value of 2, RocksDB waits until there are at least two immutable memtables, merges them, and then flushes them to level0. Pre-merging reduces the amount of key data that has to be written; for example, if a key is modified in several memtables, the modifications can be collapsed into one. But too large a value hurts read performance, because a Get has to traverse all the memtables to see whether the key exists.

Example: write_buffer_size=512MB; max_write_buffer_number=5; min_write_buffer_number_to_merge=2. Assuming a write rate of 16MB/s, a new memtable is produced every 32 seconds and two memtables start to merge every 64 seconds. Depending on the actual data, the amount flushed to level0 is somewhere between 512MB and 1024MB, and one flush may take several seconds (depending on the sequential write speed of the disk). There can be at most 5 memtables; once this threshold is reached, RocksDB stalls subsequent writes.

2. Level-style compaction

level0_slowdown_writes_trigger: when the number of level0 files reaches this value, writes are slowed down so that compaction from level0 to level1 can keep up. The size of level0 is therefore usually write_buffer_size * min_write_buffer_number_to_merge * level0_file_num_compaction_trigger.

max_background_compactions: the maximum number of concurrent background compaction threads. The default is 1; to make full use of CPU and storage, it can be set to the number of machine cores.

max_background_flushes: the maximum number of threads performing flushes concurrently. Setting it to 1 is usually sufficient.
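The flush arithmetic in the example above can be restated as a small calculation; the 16MB/s write rate is the assumption from that paragraph, not a measured value:

var writeBufferSizeMB = 512   // write_buffer_size
var minMergeCount     = 2     // min_write_buffer_number_to_merge
var ingestMBPerSec    = 16    // assumed incoming write rate

var secondsPerMemtable = writeBufferSizeMB / ingestMBPerSec   // 32s until one memtable fills
var secondsPerMerge    = secondsPerMemtable * minMergeCount   // 64s until two memtables merge and flush
var flushSizeMB        = writeBufferSizeMB * minMergeCount    // up to ~1024MB flushed to level0 at once

print(secondsPerMemtable + "s per memtable, " + secondsPerMerge + "s per flush, up to " + flushSizeMB + "MB per flush")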

The following is a simple insert test using sysbench, where the inserted collection has a secondary index by default. At first the write performance of WiredTiger is much higher than that of RocksDB, but as the data volume grows, WT's write capacity starts to decline while RocksDB's writes remain stable.

For more comparisons between WiredTiger and MongoRocks, see Facebook's technical talk at Percona Live.

https://www.percona.com/live/17/sessions/comparing-mongorocks-wiredtiger-and-mmapv1-performance-and-efficiency?spm=a2c4e.11153940.blogcont231377.21.6c457b684BOXvj

Summary

In this article we looked at the characteristics of the RocksDB engine and compared its disk latency with that of the WT storage engine. Different business scenarios behave differently, so the choice of storage engine has to be evaluated against the specific business. As the author's level is limited and the writing time was short, errors or inaccuracies are inevitable; readers are urged to point them out. If you like the article, click follow in the upper right corner, thank you.

Reference:

https://www.percona.com/doc/percona-server-for-mongodb/LATEST/mongorocks.html

https://yq.aliyun.com/articles/231377

https://www.percona.com/live/17/sessions/comparing-mongorocks-wiredtiger-and-mmapv1-performance-and-efficiency?spm=a2c4e.11153940.blogcont231377.21.6c457b684BOXvj

Bonus

After a year of work, I co-authored the book "MongoDB Operation and Maintenance Practice" with my good friend Mr. Zhang. I would like to thank the Publishing House of Electronics Industry for helping me realize my dream of publishing a book; thanks to Brother Youdong, Brother Li Dan, Brother Li Bin, and Brother Zhang Liang for their book reviews; and thanks to Brother Fei and Brother Rulin for their guidance and help in my work since I joined Xiaomi. Thanks to my wife, Ms. Li Aixuan; without your support I could not have completed such a big project. The book is in stock on JD.com, and readers who like MongoDB are welcome to support it.
