What is the difference between InnoDB's data storage file and MyISAM 07/02 Update SLTechnology News&Howtos

What is the difference between InnoDB's data storage file and MyISAM

2025-07-02 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)05/31 Report--

This article will explain in detail the differences between InnoDB data storage files and MyISAM. The editor thinks it is very practical, so I share it with you as a reference. I hope you can get something after reading this article.

Why do you need to build an index?

First of all, we all know that the purpose of indexing is to improve query speed, so why can we improve query speed with indexes?

Let's look at a schematic diagram of an index.

If I have a SQL statement that is: select * from Table where id=15, then the whole table will be scanned without an index, that is, I will look for it one by one until I find this record of id=15. The time complexity is O (n).

What if you query when there is an index? First of all, the binary search will be carried out in the index value according to id=15. The efficiency of binary search is very high, and its time complexity is O (logn).

This is why the index can improve the query efficiency, but the amount of index data is also relatively large, so it is generally not stored in memory, but directly stored in the disk, so to read the contents of the files in the disk, it is inevitable to carry out disk IO.

Why does the index of MySQL use B+Tree

As we said above, index data is generally stored on disk, but the calculation data is carried out in memory. If the index file is very large, it cannot be loaded into memory all at once, so when using the index for data lookup, disk IO will be performed several times to load the index data into memory in batches, so a good index data structure can be obtained under the premise of getting the correct results. It must be the one with the least number of disk IO.

Hash Typ

At present, MySQL actually has two index data types to choose from, one is BTree (actually B+Tree), and one is Hash.

But why in the actual use process, basically most of them choose BTree?

Because if you use an index of Hash type, MySQL will perform a Hash operation on the index data when creating the index, so that you can quickly locate the disk pointer according to the hash value, even if the amount of data is large, it can also quickly and accurately locate the data.

However, for range queries such as select * from Table where id > 15, indexes of Hash type can not be handled. For this kind of range query, full table scanning will be done directly, and indexes of Hash type can not be sorted either.

In addition, although the bottom layer of MySQL has done a series of processing, it is still not completely guaranteed that there will not be Hash collisions.

Binary tree

So why doesn't MySQL have a binary tree as its index data structure? As we all know, binary tree locates data through binary search, so the effect is good, and the time complexity is O (logn).

But the problem with the binary tree is that in special cases, it will degenerate into a stick, that is, an one-way linked list. At this time, its time complexity will be reduced to O (n).

So when we want to query the records of id=50, it's the same as a full table scan. Therefore, because of this situation, binary tree is not suitable as the data structure of index.

Balanced binary tree

So since a binary tree will degenerate into a linked list in special cases, why not a balanced binary tree?

The height difference of the child nodes of a balanced binary tree cannot exceed 1, such as the binary tree in the following figure. If the key is 15, the height of the left child node is 0, the height of the right child node is 1, and the height difference is not more than 1. So the following tree is a balanced binary tree.

Because it can keep the balance, its query time complexity is O (logN). As for how to keep the balance, it is mainly to do some left-handed, right-handed, etc., the details of keeping the balance are not the main content of this article, you can search if you want to know.

What is the problem with using this data structure to index MySQL?

Too many disk IO: in MySQL, one IO operation only reads one node, so if a node has at most two child nodes, then it only has the query scope of these two child nodes, so to be accurate to specific data, you need to read multiple times. If the tree is very deep, there will be a lot of disk IO. The performance has naturally deteriorated.

Low space utilization: for balanced binary trees, each node value holds a keyword, a data area, and a pointer to two child nodes. As a result, only such a small amount of data is loaded in a painstaking IO operation, which is really a bit of a killer.

The query effect is unstable: if in a highly deep balanced binary tree, if the queried data happens to be the root node, it will be quickly found that if the queried data happens to be the leaf node, it will be returned after multiple disk IO, and the response time may not be in the same order of magnitude as that of the root node.

Although binary tree solves the problem of balance, it also brings new problems, that is, because of the depth of its own tree, it will cause a series of efficiency problems.

Then in order to solve the problem of balanced binary tree, balanced multi-tree (Balance Tree) has become a better choice.

Balanced multi-tree (Balance Tree-B-Tree)

B-Tree means balanced multi-tree. Generally speaking, a node in a B-Tree has as many child nodes as we call it a B-Tree. M is usually used to denote the order, and when m is 2, it is a balanced binary tree.

Each node of a B-Tree can have at most one keyword Math.ceil, and at least one keyword should be stored. All leaf nodes are on the same layer. The figure below is a 4-order B-Tree.

So let's take a look at how B-Tree looks up data:

If you query the data of id=7, load the node of keyword 20 into memory first and judge that 7 is smaller than 20.

Then load the first child node, and if the query data is equal to 12 or 17, it will be returned directly. It does not mean to continue to look down and find that 7 is less than 12.

Then continue to load the first child node, find 7, and directly return the data data under 7.

In this way, the whole operation actually takes 3 IO operations, but in fact, the average B-Tree has many branches at each layer (usually greater than 100).

In order to make better use of the IO capability of the disk, MySQL sets the size of the operation page to 16K, that is, the size of each node is 16K. If the keywords in each node are of int type, then it is 4 bytes. If the size of the data area is 8 bytes, and the node pointer occupies 4 bytes, then the number of keywords that can be saved in each node of B-Tree is: (1601000) / (410004) = 1000, each node can store up to 1000 keywords, and each node can have up to 1001 branch nodes.

In this way, when querying index data, a disk IO operation can read 1000 keywords into memory for calculation, a disk IO operation of B-Tree, and N disk IO operations on top of balanced binary data.

It should be noted that: in order to ensure the balance of data, B-Tree will do a series of operations, this process of maintaining balance is time-consuming, so when creating an index, choose the appropriate field, and do not create too many indexes, if you create too many indexes, the process of updating the index is also time-consuming when updating the data.

Also, do not select the value of the low differentiation field as the index, such as the gender field, which has a total of two values, which may cause the depth of the B-Tree to be too large and the indexing efficiency reduced.

B+Tree

B-Tree has solved the problem of balanced binary tree and guaranteed query efficiency, so why is there B+Tree?

Let's start with what B+Tree looks like.

B+Tree is a variant of B-Tree, and the formula relationship between each node keyword and m-order of B+Tree is different from that of B-Tree.

First of all, the number of child nodes in each node and the proportion of keywords that can be stored in each node is 1:1, and then the left closed interval is used to query the data. there is no data in the branch node, only the keywords and child nodes point to, and the data is stored in the leaf node.

So take a look at how data queries are done in B+Tree.

For example:

Now if you want to query the data of id=2, you will first take out the root node, load it into memory, and find that id=2 exists in the root node. Because the data is stored in the left closed interval, id

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.