2025-02-23 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 05/31 Report
In this article I would like to share how to solve MongoDB disk IO problems. Since many readers are not familiar with the topic, we first review the IO concepts involved and then walk through three concrete techniques. I hope you learn a lot from it.
IO concept
Database optimization and storage planning constantly refer to a handful of important IO concepts, so we record them in detail here; how well you understand them largely determines how well you understand database and storage optimization. The definitions below are working notes, not authoritative documentation.
Read/write IO is the most common term. A read IO is an instruction to read the contents of sectors from disk: the instruction tells the disk the starting sector, the number of consecutive sectors to read from that starting point, and whether the action is a read or a write. When the disk receives the instruction, it reads or writes the data as requested. One instruction (plus its data) issued by the controller is one IO, either a read or a write.
Large/small block IO refers to the number of consecutive sectors requested in a single controller instruction. If the number is large, say 64 or 128 sectors, it counts as a large IO; if it is small, say 1, 4, or 8 sectors, it counts as a small IO. There is no sharp boundary between the two.
Continuous/random IO refers to whether the starting sector of one IO immediately follows, or lies close to, the ending sector of the previous IO. If so, it counts as continuous IO; if the two addresses are far apart, it counts as random IO. With continuous IO the head barely needs to seek, or the seek is very short, because consecutive requests touch neighboring sectors. With heavy random IO the head keeps seeking between distant tracks and efficiency drops sharply.
Sequential/concurrent IO describes whether the disk controller issues one instruction set (the instructions and data needed to complete one operation) to the disk group at a time, or several. If only one, the IOs queued in the controller cache are served one by one: this is sequential IO. If the controller can issue instruction sets to several disks in the group at the same time, multiple IOs execute at once: this is concurrent IO, which improves efficiency and speed.
IO concurrency probability. For a single disk the probability of concurrent IO is 0, because one disk can serve only one IO at a time. For a 2-disk RAID 0 array with a sufficiently large stripe depth (if the stripe is too small, IOs cannot run concurrently, as discussed below), the probability that 2 IOs run concurrently is 1/2. Other cases can be worked out in the same way.
IOPS. The time taken by one IO = seek time + data transfer time, so IOPS = IO concurrency coefficient / (seek time + data transfer time). Because seek time is usually orders of magnitude larger than transfer time, the key factor limiting IOPS is seek time. With continuous IO the seek time is very short (a seek is needed only when changing tracks), and under that premise, the shorter the transfer time, the higher the IOPS.
IO throughput per second. Clearly, throughput per second = IOPS × average IO size: for a given IOPS, the larger the IO size, the higher the throughput. Let V be the (fixed) rate at which the head reads and writes data. Substituting transfer time = IO_SIZE / V gives IOPS = concurrency coefficient / (seek time + IO_SIZE / V), and therefore throughput per second = concurrency coefficient × IO_SIZE × V / (V × seek time + IO_SIZE). So the factors that most affect throughput per second are IO size and seek time: the larger the IO size and the smaller the seek time, the higher the throughput. For IOPS itself, the one factor with significant influence is seek time.
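The formulas above can be turned into a quick back-of-the-envelope calculator. The following is a minimal Python sketch; the 8 ms seek time and 100 MB/s transfer rate are illustrative assumptions, not measurements:

```python
def iops(concurrency, seek_time_s, io_size_bytes, transfer_rate_bps):
    """IOPS = concurrency coefficient / (seek time + IO_SIZE / V)."""
    return concurrency / (seek_time_s + io_size_bytes / transfer_rate_bps)

def throughput(concurrency, seek_time_s, io_size_bytes, transfer_rate_bps):
    """Throughput per second = IOPS * IO_SIZE."""
    return iops(concurrency, seek_time_s, io_size_bytes, transfer_rate_bps) * io_size_bytes

# A 4 KiB random read on a disk with 8 ms average seek and 100 MB/s transfer:
small_iops = iops(1, 0.008, 4096, 100e6)            # seek time dominates
small_tput = throughput(1, 0.008, 4096, 100e6)      # throughput is tiny
# A 1 MiB read on the same disk: far fewer IOs, but each one moves much
# more data, so the seek cost is amortized and throughput rises sharply.
large_tput = throughput(1, 0.008, 1 << 20, 100e6)
```

Plugging in numbers this way makes the article's point concrete: seek time caps IOPS, while growing the IO size is what lifts throughput.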
Three methods to solve the IO problem of MongoDB disk
1. Use combined large documents
We know that MongoDB is a document database in which each record is a JSON-style document. For example, suppose one statistic like the following is generated every day:
{ metric: "content_count", client: 5, value: 51, date: ISODate("2012-04-01T13:00:00Z") }
{ metric: "content_count", client: 5, value: 49, date: ISODate("2012-04-02T13:00:00Z") }
If you use a combined large document, you can store all the data for a month in one record:
{ metric: "content_count", client: 5, month: "2012-04", 1: 51, 2: 49, ... }
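The two layouts can be sketched in plain Python, with dicts standing in for MongoDB documents; `combine_month` is a hypothetical helper that folds one month of daily documents into a single large document, following the field names in the examples above:

```python
# Daily layout: one small document per day (dicts stand in for BSON docs).
daily_docs = [
    {"metric": "content_count", "client": 5, "value": 51, "date": "2012-04-01"},
    {"metric": "content_count", "client": 5, "value": 49, "date": "2012-04-02"},
]

def combine_month(docs):
    """Fold one month of daily documents into one combined document,
    keyed by day-of-month, as in the article's second layout."""
    month_doc = {
        "metric": docs[0]["metric"],
        "client": docs[0]["client"],
        "month": docs[0]["date"][:7],        # "2012-04-01" -> "2012-04"
    }
    for d in docs:
        day = str(int(d["date"][8:10]))      # "2012-04-01" -> "1"
        month_doc[day] = d["value"]
    return month_doc

combined = combine_month(daily_docs)
# combined == {"metric": "content_count", "client": 5,
#              "month": "2012-04", "1": 51, "2": 49}
```

One combined document replaces ~30 daily ones, which is exactly why the second layout needs far fewer reads, and hence far fewer seeks.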
With each of these two storage schemes, about 7 GB of data was loaded in advance (on a machine with only 1.7 GB of memory), and the test read one year of information. The read performance differs markedly:
First method: 1.6 seconds
Second method: 0.3 seconds
So what's the problem?
The reason is that combined storage reads far fewer documents. When the documents cannot all fit in memory, the cost is dominated by disk seeks. The first scheme has to read many more documents to obtain one year of data, so it incurs many more disk seeks and is therefore slower.
In fact, foursquare, a well-known MongoDB user, uses this technique widely to improve read performance.
2. Use a special index structure
We know that MongoDB, like traditional databases, uses a B-tree as its index data structure. For a tree-shaped index, the more the index entries for hot data are clustered together in storage, the less memory the index wastes. So let's compare the following two index structures:
db.metrics.ensureIndex({ metric: 1, client: 1, date: 1 })
db.metrics.ensureIndex({ date: 1, metric: 1, client: 1 })
The difference in insert performance between the two structures is also very clear.
With the first structure, insert speed stays around 10k/s while the data volume is under 20 million rows; beyond that it slowly drops to 2.5k/s, and it may fall further as the data keeps growing.
With the second structure, insert speed stays stable at roughly 10k/s.
The reason is that the second structure puts the date field first in the index, so new data always updates the index at its tail rather than in the middle; entries inserted earlier hardly ever need to be touched by later inserts. With the first structure, because date is not in the leading position, index updates land in the middle of the tree, causing large-scale restructuring of the index.
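A rough way to see this is to model each index as a sorted list of key tuples and check where an entry for the newest date would be inserted. This is an illustrative sketch (with made-up metric names and client ids), not MongoDB internals:

```python
import bisect

# Model an index as a sorted list of composite-key tuples.
clients = [3, 5]
metrics = ["content_count", "page_view"]
dates = [f"2012-04-{d:02d}" for d in range(1, 11)]   # existing data

metric_first = sorted((m, c, t) for m in metrics for c in clients for t in dates)
date_first = sorted((t, m, c) for m in metrics for c in clients for t in dates)

# A brand-new record always carries the newest date.
pos_date_first = bisect.bisect(date_first, ("2012-04-11", "content_count", 3))
pos_metric_first = bisect.bisect(metric_first, ("content_count", 3, "2012-04-11"))

# With date first, the insertion point is the very end of the key space:
# the B-tree only ever grows at its tail.
print(pos_date_first == len(date_first))             # tail insert
# With metric first, the insertion point falls in the middle of the keys,
# so index updates hit the middle of the tree.
print(0 < pos_metric_first < len(metric_first))      # mid-tree insert
```

The same new record lands at the tail of the date-first key space but in the interior of the metric-first key space, which is the behavior the two insert benchmarks above reflect.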
3. Reserved space
Like the first method, this one exploits the fact that a traditional mechanical hard drive spends most of its time on disk seeks.
Continuing the example from the first method: when inserting, we allocate in advance all the space this year's data will need. This guarantees that the twelve months of data live in one record stored sequentially on disk, so reading a full year may take just one sequential read instead of the previous twelve, with only a single seek.
db.metrics.insert([
    { metric: "content_count", client: 3, date: "2012-01", 0: 0, 1: 0, 2: 0, ... },
    { metric: "content_count", client: 3, date: "2012-02", 0: 0, 1: 0, 2: 0, ... },
    ...
    { metric: "content_count", client: 3, date: "2012-12", 0: 0, 1: 0, 2: 0, ... }
])
Results:
Without reserved space, reading one year of records takes 62 ms.
With reserved space, reading one year of records takes only 6.6 ms.
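The preallocation step can be sketched as a small helper that builds a monthly document with every day slot zero-filled, following the article's example schema (`preallocated_month` is a hypothetical name, and a plain dict stands in for the BSON document):

```python
def preallocated_month(metric, client, month, days=31):
    """Build one month's document with all day slots zero-filled, so the
    document reaches its final size at insert time and later updates
    rewrite fields in place instead of growing (and moving) the record."""
    doc = {"metric": metric, "client": client, "date": month}
    doc.update({str(day): 0 for day in range(days)})  # slots "0".."30"
    return doc

# Preallocate the whole year up front, one document per month:
year = [preallocated_month("content_count", 3, f"2012-{m:02d}") for m in range(1, 13)]
```

Because every document is written out at full size from the start, the twelve records can stay contiguous on disk, which is what lets a year's data come back in a single sequential read.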
That concludes "How to solve MongoDB disk IO problems". Thank you for reading! I hope the content shared here is helpful to you.