2025-01-16 Update From: SLTechnology News & Howtos > Database
Shulou (Shulou.com), 06/01 report
This article walks through the overall architecture of the MySQL database, in the hope of supplementing and refreshing some knowledge. If there is anything else you would like to learn, you can follow my future articles on industry topics.
I. The overall structure of the database
First, let's look at the overall structure of MySQL:
[Figure: MySQL system architecture]
This is a classic MySQL system architecture diagram, from which you can see what each part of MySQL does.
When a client connects to the database, the first component it meets is the connection pool, which manages user connections and performs authentication and authorization.
Once connected, the client sends SQL statements, and the SQL interface module receives them.
SQL statements must conform to strict syntax rules, so a parser is needed to analyze each statement. The principle is the one taught in compiler theory: a statement is turned into a syntax tree.
A user's query may admit several execution paths; choosing the fastest one is the job of the optimizer.
To speed up queries, there is a query cache module: if a query hits the cache, its result can be returned directly without executing the query. (Note that the query cache was removed in MySQL 8.0.)
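As a rough illustration of the idea (not MySQL's actual implementation), a query cache can be sketched as a map from the exact SQL text to its result, with every cached entry evicted as soon as any table it reads is modified:

```python
# Illustrative sketch of a query cache: exact SQL text -> (result, tables),
# invalidated whenever any referenced table is written to.
class QueryCache:
    def __init__(self):
        self._cache = {}                         # sql text -> (result, tables)

    def get(self, sql):
        entry = self._cache.get(sql)
        return entry[0] if entry else None

    def put(self, sql, result, tables):
        self._cache[sql] = (result, set(tables))

    def invalidate(self, table):
        # Any write to `table` evicts every cached query that reads it.
        stale = [s for s, (_, ts) in self._cache.items() if table in ts]
        for s in stale:
            del self._cache[s]

cache = QueryCache()
cache.put("SELECT * FROM t WHERE id = 1", [(1, "a")], ["t"])
print(cache.get("SELECT * FROM t WHERE id = 1"))   # cache hit
cache.invalidate("t")                              # a write to t arrives
print(cache.get("SELECT * FROM t WHERE id = 1"))   # None: entry evicted
```

This coarse invalidation (any write to a table flushes all queries touching it) is also one reason the feature was eventually dropped.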
All of the components above belong to the database service layer; below it sits the storage engine layer, where the mainstream engine today is InnoDB.
Any change to the database is recorded by the binary log (binlog) in the service layer; this is the basis of primary-standby replication.
For the storage engine layer, a well-known diagram looks like this:
[Figure: InnoDB storage engine architecture]
In the storage engine layer there are also caches and logs, and eventually the data lands on disk.
The cache in the storage engine layer also exists to improve performance, but it differs from the service-layer cache. The service-layer query cache is organized around query logic: a SQL statement maps to its result. The engine-layer cache holds data pages, for both reads and writes; it is physical rather than logical.
A write may reach only the engine-layer cache, yet from the service layer's point of view it already counts as persisted. This of course leaves the cached page inconsistent with the page on disk; that inconsistency is protected by the engine layer's own log, which can replay the changes after a crash.
So the engine-layer log also differs from the service-layer log: the service-layer binlog records modifications logically, one by one, while the engine-layer log records the physical difference between the cached page and the page on disk.
II. The workflow of the database
When a query is received, the various components in the architecture of MySQL work as follows:
The client establishes a TCP connection with the database service layer, and the connection management module sets up the connection and requests a connection thread. If the connection pool has an idle connection thread, it is assigned to this connection; if not, and the maximum number of connections is not exceeded, a new connection thread is created to serve the client.
Before any real operation, the user module performs an authorization check to verify that the user has the required privileges. Once cleared, the connection thread begins receiving and processing SQL statements from the client.
After receiving a SQL statement, the connection thread hands it to the SQL parsing module for syntactic and semantic analysis.
If it is a query statement, the query cache is checked first; on a hit, the result is returned to the client directly.
On a cache miss, the query must actually reach the storage engine, so it is sent to the SQL optimizer for query optimization. If the statement changes a table, it is handed to the insert, update, delete, create, or alter processing module as appropriate.
The next step is to request the database engine layer, open the table, and obtain the appropriate lock if necessary.
The next processing goes to the database engine layer, such as InnoDB.
In the storage engine layer, the engine first checks whether the corresponding data page is already in the cache; if so, it is returned directly, and if not, it is read from disk.
When the data is found on disk, it is loaded into the cache, making subsequent queries more efficient. Because memory is limited, an LRU (least recently used) list is typically used to manage the cached pages, keeping frequently accessed data in the cache.
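The LRU idea can be sketched in a few lines. This is a plain LRU page cache for illustration only (InnoDB's real buffer pool uses a midpoint-insertion variant of LRU, which this sketch does not model):

```python
from collections import OrderedDict

# Minimal LRU page cache sketch: recently used pages move to the end of
# the ordered dict; the least recently used page is evicted when the
# capacity is exceeded.
class PageCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()                 # page_no -> page data

    def get(self, page_no, read_from_disk):
        if page_no in self.pages:
            self.pages.move_to_end(page_no)        # cache hit: mark as recent
            return self.pages[page_no]
        data = read_from_disk(page_no)             # cache miss: go to disk
        self.pages[page_no] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)         # evict the LRU page
        return data

pc = PageCache(capacity=2)
disk_reads = []
def read_from_disk(page_no):                       # stand-in for real I/O
    disk_reads.append(page_no)
    return f"page-{page_no}"

pc.get(1, read_from_disk)
pc.get(2, read_from_disk)
pc.get(1, read_from_disk)      # hit: no disk read
pc.get(3, read_from_disk)      # miss: evicts page 2, the LRU page
print(disk_reads)              # [1, 2, 3]
```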
The data is fetched and returned to the client, the connection is closed, the connection thread is released, and the process ends.
III. The principle of database indexes
In this whole process, the most likely bottleneck is reading and writing data, which usually means sequential or random disk I/O, and disks are slow.
How can we speed this up? As you have surely guessed: indexing.
Why does an index speed things up?
We have all been to a food court with a dazzling number of restaurants. If you are not in a hurry, not hungry, and have no performance requirements for your search, you can stroll slowly through the mall, visiting one restaurant after another, knowing you will eventually find the one you want. But when you are hungry, or you have a reservation at a particular restaurant, you want to go straight there. In that case you look at the floor directory, quickly find your target restaurant's location, and head directly to it, saving a lot of time. That is the purpose of an index.
So an index lets you go from a value to its location quickly, so that the data can be accessed quickly.
Another use of an index is to answer some questions without touching the data at all. For example, to check whether a certain restaurant exists in the mall, a glance at the directory is enough; you do not have to walk the whole mall. And to list all the Sichuan restaurants, you only need to read the directory, not visit every one of them.
So how does indexing work in MySQL?
The index structure in MySQL is usually a B+ tree.
A B+ tree of order M has the following properties:
1. Nodes are divided into index nodes and data nodes. The index nodes correspond to the internal nodes of a B-tree and together form a tree with all the properties of a B-tree; they store keys and pointers, not the actual elements. The data nodes correspond to the external nodes of a B-tree; in a B-tree these are empty, but in a B+ tree they store the real data elements, each holding a key and the element's other information, with no child pointers.
2. The tree of index nodes only determines which data node holds the element with a given key. Finding the key in an index node is not the end: the search continues to the data node, which is then read and searched, by binary search or a sequential scan, for the actual element.
3. The order M only controls the degree of the index nodes; how many elements each data node holds is independent of M.
4. A linked list strings all the data nodes together, so they can be scanned sequentially.
This definition is rather abstract. Let's look at a concrete example.
[Figure: a third-order B+ tree whose data nodes hold up to five items]
From the figure we can see a third-order B+ tree whose data nodes hold up to five items each. If an insertion lands in a data node without causing a split or merge, the tree formed by the index nodes does not change.
If an item 76 is inserted into the data node holding 71 to 75, it causes a split: 71, 72, 73 become one data node and 74, 75, 76 another; for the index nodes, this is equivalent to inserting the key 74.
If 43 is deleted from the data node holding 41 to 43, it causes a merge: 41, 42, 61, 62, 63 become one node, which for the index nodes is equivalent to deleting the key 60.
Lookups are fast because the B+ tree is shallow. For example, to find the value 62: at the root, 62 is greater than 40 and less than 70, so we descend accordingly; it is greater than 60, so we go right; in that leaf node we find 62, and the search succeeds.
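The descent described above can be sketched in code. This is a toy structure, simplified to two levels (index nodes over leaves), assuming the separator keys and leaf contents shown here:

```python
from bisect import bisect_right

# Toy B+ tree sketch: index nodes hold separator keys and children;
# leaves hold the sorted data values.
class Index:
    def __init__(self, keys, children):
        self.keys, self.children = keys, children

class Leaf:
    def __init__(self, keys):
        self.keys = keys

def search(node, key):
    # Descend: at each index node, pick the child whose range holds `key`.
    while isinstance(node, Index):
        node = node.children[bisect_right(node.keys, key)]
    # At the leaf, scan (here: membership test) for the actual element.
    return key in node.keys

tree = Index([40, 70], [Leaf([10, 20, 30]),
                        Leaf([41, 61, 62, 63]),
                        Leaf([71, 75, 80])])
print(search(tree, 62))   # True
print(search(tree, 50))   # False
```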
In MySQL's InnoDB, there are two types of B+ tree indexes, one is called clustered index, and the other is called secondary index.
The leaf nodes of the clustered index are the data nodes, and the primary key usually serves as the clustered index. The leaf nodes of a secondary index store the indexed column's value plus the primary key value, so reaching a row through a secondary index takes two index lookups.
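The double lookup can be illustrated with two plain maps standing in for the two B+ trees (the table, column names, and values here are hypothetical):

```python
# Sketch of the two lookups a secondary index needs: the secondary index
# maps the indexed column to the primary key, and the clustered index
# maps the primary key to the full row.
clustered = {                               # primary key -> whole row
    1: {"id": 1, "name": "alice", "city": "beijing"},
    2: {"id": 2, "name": "bob",   "city": "shanghai"},
}
secondary = {"alice": 1, "bob": 2}          # name -> primary key

def find_by_name(name):
    pk = secondary[name]                    # lookup 1: secondary index
    return clustered[pk]                    # lookup 2: back to clustered index

print(find_by_name("bob")["city"])          # shanghai
```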
[Figure: clustered index vs. secondary index]
Another form of index is the composite (joint) index, which is built on multiple columns.
[Figure: a composite index on multiple columns]
Such an index sorts by the first column, compares the second column when the first is equal, and so on.
IV. The advantages and disadvantages of database indexes
The most obvious advantage of an index is that it reduces the number of I/O operations. Here are several scenarios.
For an equality (=) condition on an indexed column, the B+ tree can be descended directly: a small number of disk reads (proportional to the height of the tree) reaches the leaf node and locates the data.
For range conditions, because the B+ tree is sorted, the range's boundaries can be located quickly through the tree.
Similarly, ORDER BY, GROUP BY, DISTINCT, MAX, and MIN can produce their results quickly because the B+ tree is ordered.
There is also a common scenario called a covering index. For example, with A and B as condition fields and C and D in the select list, we often build a joint secondary index (A, B). The search descends the secondary index's B+ tree to the matching leaf records, but those records hold only the indexed columns plus the primary key ID, so a second descent of the clustered index's B+ tree is needed to reach the real table row and read out C and D. If the joint index is instead (A, B, C, D), everything can be returned directly from the secondary index's B+ tree, skipping the second tree search.
Of course, the index must have a price. There is no such thing as a free lunch.
Most of an index's benefits improve read efficiency; its cost is reduced write efficiency.
Inserting or modifying data can mean changing the index.
For inserts, a clustered index is usually built on the primary key, so it is best for the primary key to auto-increment: inserted rows always go at the end, sequentially and efficiently. A UUID primary key is relatively random, which leads to random writes and poor efficiency. Do not use a business-related primary key either, because it may be updated, and an update amounts to a delete plus a re-insert, which is slow.
From the B+ tree discussion above, we can see that splits are relatively expensive, and splits happen during insertion.
On the other hand, the modification of data is basically equivalent to deletion and re-insertion, and the cost is also relatively high.
For secondary indexes on string columns, writes and reads often become random, putting heavy pressure on I/O.
V. The principles behind common database rules of thumb
With these index principles understood, we can explain why many so-called database rules look the way they do. Let's go through them one by one.
Under what circumstances should you use a composite index instead of a separate index?
Suppose the condition is A = ? AND B = ?. With two separate indexes on A and B, only one index takes effect under the AND, and B must then be checked row by row; with a composite index (A, B), a single tree traversal answers the whole condition, which is far more efficient. For A = ? OR B = ?, however, the composite index does not help, precisely because it is an OR; here separate indexes work better, since both can take effect at the same time.
Why should indexed columns be selective, and why should the most selective column come first in a composite index?
An unselective column such as gender merely splits the table in two; a search still has to traverse half the table, which makes the index pointless.
If there is a combined index, do you still need a single-column index?
If the composite index is (A, B), a condition on A alone can use it, because the index is sorted by the first column first, so a separate index on A is unnecessary. But this does not work for B alone: the second column is only compared when the first column is equal, so equal values of B may be scattered across different nodes and cannot be located quickly.
Are more indexes always better?
Of course not; add an index only where it is needed. Indexes not only slow down inserts and updates; at query time, too many indexes can also confuse the query optimizer, which may fail to find the best query path and choose a slow plan instead.
Why use self-incrementing primary keys?
Because string and random primary keys cause rows to be inserted at random positions, which is inefficient; and a primary key should rarely be updated, to avoid frequent merging and splitting of the B+ tree.
Why try not to use NULL?
NULL is awkward to handle in a B+ tree: it often requires special-case logic, which reduces efficiency.
Why not index fields that are updated frequently?
Updating a field means updating its index, which often amounts to a delete followed by an insert. An index is a data structure built ahead of time, at write time, to make reads more efficient; for a field that is written often and read rarely, an index is not worthwhile.
Why not use functions in query conditions?
Take the condition ID + 1 = 10. The index was built on ID when the data was written; evaluating ID + 1 at query time is beyond the index. There is no way to recompute every index entry and then compare: the cost would be too high. So rewrite the condition as ID = 10 - 1.
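The difference can be sketched with a presorted list standing in for the index: the rewritten predicate ID = 9 is answered by one binary search, while the original form must evaluate the expression for every entry.

```python
from bisect import bisect_left

# Sketch: an "index" is a presorted list of ID values.
ids = list(range(1000))                  # presorted index on ID

# Sargable form (ID = 10 - 1): one binary search.
i = bisect_left(ids, 9)
fast = i < len(ids) and ids[i] == 9

# Non-sargable form (ID + 1 = 10): must compute ID + 1 for every entry.
slow = any(x + 1 == 10 for x in ids)

print(fast, slow)   # True True
```

Both find the same row; only the work required differs (O(log n) versus O(n)).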
Why not use negative query conditions such as NOT?
Picture a B+ tree node whose key is 40: if the condition is "equal to 20", go left; if "equal to 50", go right; but if the condition is "not equal to 66", which way should the index go? Only a full traversal can answer that.
Why don't fuzzy queries start with wildcards?
For a B+ tree whose root holds the string def: with a trailing wildcard such as abc%, the search goes left; with efg%, it goes right; but with a leading wildcard such as %abc, there is no telling which way to go, so everything must be scanned.
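A sorted list makes the contrast concrete: a prefix pattern becomes a range scan with a known starting point, while a suffix pattern has no starting point at all (illustrative sketch with made-up words):

```python
from bisect import bisect_left

# Sketch: a sorted "index" supports prefix matches ('abc%') as a range
# scan, but a leading wildcard ('%abc') forces a full scan.
words = sorted(["abcde", "abcxy", "defgh", "efabc", "xyabc"])

def prefix_match(prefix):
    # Range scan: jump to the first word >= prefix, stop once the
    # prefix no longer matches.
    out = []
    i = bisect_left(words, prefix)
    while i < len(words) and words[i].startswith(prefix):
        out.append(words[i])
        i += 1
    return out

def suffix_match(suffix):
    return [w for w in words if w.endswith(suffix)]   # full scan

print(prefix_match("abc"))   # ['abcde', 'abcxy']
print(suffix_match("abc"))   # ['efabc', 'xyabc']
```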
Why change OR to IN, or use Union?
With OR conditions the optimizer often struggles to find the best path, especially when there are many of them. For conditions on the same field, prefer IN: the database sorts the values in the IN list and handles them uniformly with binary search. For conditions on different fields, use UNION so that each subquery can use its own index.
Why should data types be as small as possible, why use integers instead of strings, and when should a prefix index be considered for long string columns?
Because the database stores data in fixed-size pages: a larger data type means fewer rows per page, more pages overall, and a taller tree, so each lookup costs more I/O, and nodes split more readily on insert, reducing efficiency. Integers index more efficiently than strings, so consider storing values such as IP addresses as integers. For long string columns that must be queried through an index, consider indexing only a prefix of the column rather than the whole value, to keep the index small.
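For the IP-address example: an IPv4 address fits in 4 bytes as an unsigned integer, versus up to 15 bytes as a dotted string (MySQL offers INET_ATON/INET_NTOA for exactly this conversion). A sketch of the round trip:

```python
import ipaddress

# "Use integers instead of strings": store an IPv4 address as a 32-bit
# unsigned integer instead of a dotted-quad string.
ip_int = int(ipaddress.IPv4Address("192.168.1.10"))
print(ip_int)                                    # 3232235786

# Convert back for display.
print(str(ipaddress.IPv4Address(ip_int)))        # 192.168.1.10
```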
VI. Methodology of query optimization
To optimize, first collect the problematic SQL statements.
MySQL provides a slow query log: with the slow_query_log parameter enabled, statements whose execution time exceeds the long_query_time threshold are recorded.
Statements that do not use an index can be logged as well, by enabling the log_queries_not_using_indexes parameter.
With min_examined_row_limit set, only statements that examine more rows than this value are written to the slow query log.
Having found the problem statements, the next step is to run EXPLAIN on them to obtain the execution plan: whether records are scanned through an index (if not, creating one can improve efficiency); whether too many records are scanned; whether locks are held too long or conflict; and whether too many records are returned.
Then optimize case by case. For filter columns not covered by an index, create one on a selective column; if several columns are involved, prefer a composite index.
If many records are scanned but few are returned, the index is poorly selective: re-evaluate the columns involved in the statement and build a composite index on several highly selective columns.
If both the scanned and the returned record counts are very large, the filter is too weak: add filtering conditions to the SQL.
The sys schema view schema_redundant_indexes shows which indexes are redundant: if several indexes cover the same columns in the same order, they can be merged into one composite index. The companion view schema_unused_indexes shows which indexes have never been used.
VII. The principle of read-write separation
Databases tend to be read far more than they are written, so the first step in performance optimization is read-write separation.
[Figure: read-write separation architecture]
Master-slave replication is based on the master's service-layer log (the binlog): an IO thread on the slave reads the log and writes it to a local relay log, and another thread, the SQL thread, reads the relay log and re-executes the events on the slave.
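The pipeline can be shown as a toy simulation (this is not real MySQL, just the three roles described above: the master appending to its binlog, the IO thread copying events into a relay log, and the SQL thread replaying them):

```python
# Toy simulation of the master-slave replication pipeline.
master_data = {}
master_binlog = []      # service-layer log on the master
relay_log = []          # local copy on the slave
slave_data = {}

def master_write(key, value):
    master_data[key] = value
    master_binlog.append(("set", key, value))   # every change is logged

def io_thread():
    # Copy the events the slave has not fetched yet into the relay log.
    relay_log.extend(master_binlog[len(relay_log):])

def sql_thread():
    # Re-execute the relayed events on the slave.
    for op, key, value in relay_log:
        if op == "set":
            slave_data[key] = value

master_write("a", 1)
master_write("b", 2)
io_thread()
sql_thread()
print(slave_data)   # {'a': 1, 'b': 2}
```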
[Figure: master-slave asynchronous replication flow]
The figure shows asynchronous replication: the master returns success as soon as its own engine has committed the write, and the event is then shipped to the slave and replayed there. This mode is fast, but if the master dies before an event has been replicated, that data may be lost.
[Figure: synchronous replication flow]
Synchronous replication is different: the master returns to the client only after the slave has written the log to disk, which of course costs performance. The NetEase database team improves this performance with techniques such as group commit and parallel replication.
With master-slave replication in place, a read-write split strategy can be implemented in the application's DAO layer, or through database middleware.
Database logs have many other uses too. For example, Canal (an Alibaba open-source project for incremental subscription and consumption of MySQL binlogs) can subscribe to the database's binlog, which can then be used to update caches, among other things.
That concludes this walkthrough of the overall MySQL architecture; I hope it proves helpful in practice. Space here is limited, so there are inevitably gaps still to be filled.
© 2024 shulou.com SLNews company. All rights reserved.