
MongoDB Export Scenario Query Optimization #1


Original link: https://github.com/aCoder2013/blog/issues/1. If reproduced, please credit the source.

Introduction

Some time ago I ran into a data export scenario. The export got slower and slower as it went on: exporting 1 million records took 40-60 minutes, and the logs showed the per-batch time climbing steadily.

Reason

Judging from the code, the data was exported in batches, much like front-end pagination, and implemented with skip+limit. So what is the problem with this approach? Let's look at the MongoDB documentation for these two methods:

The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return results. As the offset (e.g. pageNumber above) increases, cursor.skip() will become slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound.
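For reference, the slow path looked roughly like the sketch below (the names are hypothetical and the legacy DBCollection API is assumed); this is exactly the pattern the documentation is warning about:

int pageSize = 1000;
for (int pageNumber = 0; ; pageNumber++) {
    DBCursor page = collection.find(query)
            .skip(pageNumber * pageSize)   // the server must walk past every previously exported document
            .limit(pageSize);
    if (!page.hasNext()) {
        break;                             // no more data to export
    }
    while (page.hasNext()) {
        DBObject doc = page.next();
        // business code: write the document to the export target
    }
}

Each iteration re-executes the query and skips over everything already exported, which is why the later pages become so expensive.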

Simply put, as the page number grows, skip() gets slower and slower. But for an export like this there is no need to recompute the offset and redo all that work on every batch; my intuition was that we should be able to hold on to a single cursor and simply traverse it. After a bit of googling, that turns out to be exactly what MongoDB supports.

We can add a method in the persistence layer that returns a cursor for the upper layer to traverse, so walking the exported result set drops from O(N^2) to O(N). We can also specify a batchSize to control how many documents (elements) are fetched from MongoDB per round trip; note that a single batch is capped at roughly 4 MB.

/**
 * Limits the number of elements returned in one batch. A cursor
 * typically fetches a batch of result objects and stores them
 * locally.
 *
 * If {@code batchSize} is positive, it represents the size of each
 * batch of objects retrieved. It can be adjusted to optimize
 * performance and limit data transfer.
 *
 * If {@code batchSize} is negative, it will limit the number of objects
 * returned, that fit within the max batch size limit (usually 4MB), and
 * the cursor will be closed. For example if {@code batchSize} is -10,
 * then the server will return a maximum of 10 documents and as many as
 * can fit in 4MB, then close the cursor. Note that this feature is
 * different from limit() in that documents must fit within a maximum
 * size, and it removes the need to send a request to close the cursor
 * server-side.
 */
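A minimal sketch of the persistence-layer method suggested above might look like this (the class and method names are hypothetical; the legacy DBCollection API is assumed):

import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;

public class ExportDao {

    private final DBCollection collection;

    public ExportDao(DBCollection collection) {
        this.collection = collection;
    }

    // Returns a cursor so the caller can stream the whole result set in one pass
    // instead of paging with skip+limit.
    public DBCursor findForExport(DBObject query, int batchSize) {
        return collection.find(query).batchSize(batchSize);
    }
}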

For example, with a batchSize of 8000, the mongo client will fetch up to 8000 documents per round trip by default.

A quick local test showed a dramatic improvement. Exporting 300,000 records the old way took an average of about 500 ms per page towards the end, 60,039 ms in total; after the optimization the average per-batch time was 100-200 ms and the total came to 16,667 ms (including the business logic in between).

Usage:

DBCursor cursor = collection.find(query).batchSize(8000);
while (cursor.hasNext()) {
    DBObject nextItem = cursor.next();
    // business code
}

So let's take a look at the logic inside hasNext().

@Override
public boolean hasNext() {
    if (closed) {
        throw new IllegalStateException("Cursor has been closed");
    }
    if (nextBatch != null) {
        return true;
    }
    if (limitReached()) {
        return false;
    }
    while (serverCursor != null) {
        // an instruction is sent to mongo to grab the next batch of data
        getMore();
        if (nextBatch != null) {
            return true;
        }
    }
    return false;
}

private void getMore() {
    Connection connection = connectionSource.getConnection();
    try {
        if (serverIsAtLeastVersionThreeDotTwo(connection.getDescription())) {
            try {
                // you can see that the command is issued here and the reply's `nextBatch` field is decoded
                initFromCommandResult(connection.command(namespace.getDatabaseName(), asGetMoreCommandDocument(), false,
                        new NoOpFieldNameValidator(), CommandResultDocumentCodec.create(decoder, "nextBatch")));
            } catch (MongoCommandException e) {
                throw translateCommandException(e, serverCursor);
            }
        } else {
            initFromQueryResult(connection.getMore(namespace, serverCursor.getId(),
                    getNumberToReturn(limit, batchSize, count), decoder));
        }
        if (limitReached()) {
            killCursor(connection);
        }
    } finally {
        connection.release();
    }
}

Finally, initFromCommandResult takes the reply and parses it into BSON objects.
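To make "an instruction is sent to mongo" concrete, the command the driver assembles in asGetMoreCommandDocument() looks roughly like the sketch below (the collection name and helper are hypothetical; MongoDB 3.2+ command syntax is assumed):

import org.bson.Document;
import com.mongodb.client.MongoDatabase;

class GetMoreSketch {
    // Roughly the command sent for each additional batch after the initial query.
    static Document fetchNextBatch(MongoDatabase db, long cursorId) {
        Document getMore = new Document("getMore", cursorId)   // cursor id returned by the initial find
                .append("collection", "exportCollection")      // hypothetical collection name
                .append("batchSize", 8000);                     // the batchSize configured on the cursor
        return db.runCommand(getMore);                          // the reply's cursor.nextBatch holds the documents
    }
}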

Summary

When writing code, it is a good idea to add timing instrumentation ("burying points") for each method or interface, or even at a finer granularity, ideally logged at debug level. With a logging framework such as log4j/logback that can change levels dynamically, you can then inspect the timings at any moment and optimize in a targeted way. For a scenario like this one, first check whether the code logic itself is at fault, then look at the database side, for example a missing index or simply too much data, and only then optimize accordingly; don't start by blindly rewriting the code.
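As a minimal sketch of that kind of timing "burying point" (class names are hypothetical; SLF4J is assumed as the logging facade):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;

public class ExportService {

    private static final Logger log = LoggerFactory.getLogger(ExportService.class);

    private final DBCollection collection;

    public ExportService(DBCollection collection) {
        this.collection = collection;
    }

    public void exportAll(DBObject query) {
        long start = System.currentTimeMillis();
        DBCursor cursor = collection.find(query).batchSize(8000);
        try {
            while (cursor.hasNext()) {
                DBObject doc = cursor.next();
                // business code: write the document to the export target
            }
        } finally {
            cursor.close();
            // Logged at debug level, so the timing can be switched on at runtime
            // via logback/log4j's dynamic level configuration.
            log.debug("exportAll finished in {} ms", System.currentTimeMillis() - start);
        }
    }
}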
