2025-01-30 Update. From: SLTechnology News&Howtos > Database
Shulou (Shulou.com) 06/01 report:
This guide collects two articles from different sources, together billed as "the most complete MySQL optimization guide in history." The first article is published today and the second tomorrow.
Author: Chenchuan. Source: https://www.jianshu.com/p/d7665192aaaf
(Copyright belongs to the original author; this reprint is only intended to spread information more widely. If there is any infringement, please contact us and we will deal with it promptly.)
When it comes to MySQL query optimization, you probably know plenty of tricks: avoid SELECT *, avoid NULL fields, create indexes sensibly, choose appropriate data types for fields... But do you really understand these techniques? Do you understand the principles behind them? Do they actually improve performance in real scenarios? Probably not. That is why understanding the principles behind these suggestions matters so much. I hope this article leads you to re-examine these recommendations and apply them sensibly in real business scenarios.
MySQL logical architecture
If you can build a mental picture of how MySQL's components work together, it will help you understand the MySQL server more deeply. The figure below shows MySQL's logical architecture.
The MySQL logical architecture is divided into three layers. The top layer is the client layer, which is not unique to MySQL; functions such as connection handling, authorization and authentication, and security are handled here.
Most of MySQL's core services are in the middle layer, including query parsing, analysis, optimization, caching, built-in functions (such as time, mathematics, encryption, etc.). All cross-storage engine functions are also implemented at this layer: stored procedures, triggers, views, and so on.
The lowest layer is the storage engine, which is responsible for data storage and extraction in MySQL. Similar to file systems under Linux, each storage engine has its advantages and disadvantages. The middle service layer communicates with the storage engine through API, and these API interfaces shield the differences between different storage engines.
MySQL query process
We always want MySQL to achieve higher query performance, and the best way is to figure out how MySQL optimizes and executes queries. Once you understand this, you will find that a lot of query optimization work actually follows some principles to enable MySQL's optimizer to run in a reasonable way.
So what exactly does MySQL do when you send it a request?
Client / server communication protocol
The MySQL client/server communication protocol is "half-duplex": at any moment, either the server is sending data to the client or the client is sending data to the server, never both at once. Once one end starts sending a message, the other end must receive the entire message before it can respond, so a message cannot (and need not) be cut into small pieces and sent independently, and there is no flow control.
The client sends a query to the server in a single packet, so the max_allowed_packet parameter matters when query statements are very long. Note that if a query packet exceeds this limit, the server refuses to receive more data and throws an error.
In contrast, the server usually responds with a lot of data, consisting of multiple packets. When the server responds, the client must receive the entire result completely; it cannot simply take the first few rows and ask the server to stop sending. That is why, in real development, keeping queries simple and returning only the necessary data is such a good habit: it reduces the size and number of packets exchanged, which is one reason to avoid SELECT * and to add LIMIT to queries.
Query cache
Before parsing a query statement, if the query cache is open, MySQL checks to see if the query statement hits the data in the query cache. If the current query happens to hit the query cache, the results in the cache are returned directly after checking the user permissions once. In this case, the query will not be parsed, nor will it generate an execution plan, let alone execute.
MySQL stores cache entries in a reference table (don't read that as an SQL table; think of it as a data structure similar to a HashMap), keyed by a hash computed from the query text itself, the database being queried, the client protocol version, and other information that may affect the result. So any single-character difference between two queries (for example, spaces or comments) causes a cache miss.
If the query contains any user-defined functions, stored functions, user variables, temporary tables, or system tables in the mysql database, its results will not be cached. For example, NOW() or CURRENT_DATE() return different results at different times, and queries containing CURRENT_USER or CONNECTION_ID() return different results for different users; caching such results would be pointless.
Since it is a cache, it will be invalidated, so when will the query cache expire? MySQL's query caching system tracks each table involved in the query, and if these tables (data or structure) change, all cached data associated with this table will be invalidated. Because of this, MySQL must set all caches of the corresponding table to invalidate during any write operation. If the query cache is very large or fragmented, this operation may cause a lot of system consumption, or even cause the system to freeze for a while. And the extra consumption of query caching on the system is not only for write operations, but also for read operations:
1. Every query must be checked against the cache before it runs, even if the SQL statement will never hit it
2. If a query's result is cacheable, the result is written to the cache after execution, which also costs something
Based on this, understand that the query cache does not always improve system performance: caching and invalidation carry overhead, and the cache only pays off when the resources it saves exceed the resources it consumes. However, evaluating whether enabling the cache yields a net gain is very difficult and beyond the scope of this article. If the system does have performance problems, you can try enabling the query cache and making some optimizations in the database design, such as:
1. Replace one large table with several smaller tables, being careful not to over-design
2. Use batch inserts instead of looping over single-row inserts
3. Control the size of the cache sensibly; generally a few tens of megabytes is appropriate
4. Use SQL_CACHE and SQL_NO_CACHE to control whether a particular query should be cached
The final advice is not to enable the query cache lightly, especially for write-intensive applications. If you really want it, set query_cache_type to DEMAND, so that only queries carrying SQL_CACHE are cached and all others are not, giving you full control over which queries get cached.
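As a sketch of this advice (note that the query cache was removed entirely in MySQL 8.0, so this only applies to 5.7 and earlier; the table name posts below is hypothetical):

```sql
-- my.cnf (set at server startup; query cache exists only up to MySQL 5.7):
--   query_cache_type = 2        -- 2 = DEMAND: cache only on explicit request
--   query_cache_size = 64M

-- Explicitly cacheable query:
SELECT SQL_CACHE id, title FROM posts WHERE author_id = 42;

-- Explicitly uncached query:
SELECT SQL_NO_CACHE COUNT(*) FROM posts;
```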
Of course, the query caching system itself is very complex, and only a small part of it is discussed here. Deeper topics, such as how the cache uses memory, how memory fragmentation is controlled, and how transactions affect the query cache, are left for readers to explore on their own; the discussion here is only meant as a starting point.
Syntax parsing and preprocessing
MySQL parses the SQL statement by keywords and generates a corresponding parse tree. In this phase the parser validates mainly against syntax rules, for example whether wrong keywords are used or whether keywords appear in the correct order. Preprocessing then further checks whether the parse tree is legal according to MySQL's rules, for example whether the tables and columns being queried actually exist.
Query optimization
The syntax tree produced by the previous step is now known to be legal, and the optimizer converts it into an execution plan. In most cases a query can be executed in many different ways, all returning the same result; the optimizer's job is to find the best one.
MySQL uses a cost-based optimizer: it estimates the cost of executing the query under each candidate plan and selects the one with the lowest cost. In MySQL, you can see the estimated cost of the last query in the current session by inspecting the Last_query_cost status value.
mysql> select * from t_message limit 10;
...(result set omitted)
mysql> show status like 'last_query_cost';
+-----------------+-------------+
| Variable_name   | Value       |
+-----------------+-------------+
| Last_query_cost | 6391.799000 |
+-----------------+-------------+
The result in this example means the optimizer estimates it needs roughly 6391 random data-page reads to complete the query. This figure is computed from a series of statistics: the number of pages in each table or index, index cardinality, the lengths of indexes and data rows, index distribution, and so on.
There are many reasons why MySQL may pick the wrong execution plan: inaccurate statistics; costs outside its control that it does not account for (user-defined functions, stored procedures); or simply that MySQL's idea of "best" differs from ours (we want the shortest execution time, but MySQL picks what it estimates to be the lowest cost, and low cost does not always mean short execution time); and so on.
MySQL's query optimizer is a very complex part that uses a lot of optimization strategies to generate an optimal execution plan:
1. Reordering table joins (when a query joins multiple tables, the join order does not have to follow the order written in the SQL; there are also hints to pin the join order)
2. Optimizing the MIN() and MAX() functions (to find the minimum value of an indexed column, only the leftmost end of the B+Tree index needs to be read; the maximum is found symmetrically at the rightmost end. See below for details)
3. Terminating the query early (for example with LIMIT, the query stops as soon as enough rows have been found)
4. Optimized sorting (older versions of MySQL used a two-pass sort: first read the row pointers and sort fields into memory and sort them, then fetch the data rows according to the sorted result; newer versions use a single-pass sort that reads all needed columns at once and sorts by the given column, which is much more efficient for I/O-intensive applications)
With the continuous development of MySQL, the optimization strategies used by optimizers are also constantly evolving. Here are only a few optimization strategies that are not frequently used and easy to understand. Other optimization strategies can be found by yourself.
Query execution engine
After the parsing and optimization phases, MySQL has a corresponding execution plan, and the query execution engine executes it step by step according to the plan's instructions. Most of this work is done by calling interfaces implemented by the storage engine, known as the handler API. Each table in a query is represented by a handler instance; in fact MySQL creates a handler instance per table during query optimization, and the optimizer obtains table information through these instances, including all column names, index statistics, and so on. The storage engine interfaces are quite rich in functionality, yet there are only a few dozen of them at the bottom, and like building blocks they compose most of a query's operations.
Return the result to the client
The final stage of query execution is to return the results to the client. Even if the data cannot be queried, MySQL still returns information about the query, such as the number of rows affected by the query, the execution time, and so on.
If the query cache is turned on and the query can be cached, MySQL will also store the results in the cache.
Returning the result set to the client is an incremental, step-by-step process: MySQL may begin returning results to the client as soon as it produces the first row. This way the server does not have to hold a large result set in memory, and the client gets the first results as early as possible. Note that each row in the result set is sent as a packet conforming to the communication protocol described earlier, then transmitted over TCP; during transmission, MySQL's packets may be buffered and sent in batches.
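The incremental model described above can be illustrated with a toy generator-based sketch (this is only an analogy in Python, not MySQL's actual protocol code; the row contents are made up):

```python
def execute_query(rows):
    """Toy 'server': yield result rows one at a time instead of
    materializing the whole result set in memory first."""
    for row in rows:
        # In MySQL, each row would be wrapped in a protocol packet here.
        yield row

# The 'client' can consume the first row before the 'server' has produced
# the remaining 999,999.
stream = execute_query({"id": i} for i in range(1_000_000))
first_row = next(stream)   # available immediately; nothing else is buffered yet
```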
Let's go back and summarize the whole query execution process of MySQL, which is divided into five steps:
1. The client sends a query request to the MySQL server
2. The server first checks the query cache; on a hit it immediately returns the cached result, otherwise it moves on to the next stage
3. The server performs SQL parsing and preprocessing, then the optimizer generates an execution plan
4. Following the execution plan, MySQL calls the storage engine's API to execute the query
5. The result is returned to the client and, where applicable, stored in the query cache
Performance optimization recommendations
Having read this far, you may be expecting some concrete optimization techniques, and yes, they follow, from three different angles. But first, one more piece of advice: don't treat anything you read about optimization as absolute truth, including what is discussed in this article. Instead, test your assumptions about execution plans and response times in your real business scenario.
Schema Design and Data Type Optimization
Follow the principle of small and simple: smaller data types are usually faster, consume less disk and memory, and need fewer CPU cycles to process; simpler data types need fewer CPU cycles to compare. For example, integer comparisons are cheaper than string comparisons, so use integers to store IP addresses and use DATETIME rather than strings to store times.
Here are a few techniques that may be easy to misunderstand:
1. Generally speaking, changing a nullable column to NOT NULL does not improve performance much; the exception is that if you plan to index the column, it is better to make it NOT NULL.
2. Specifying a display width for an integer type, such as INT(11), is useless. INT always uses 32 bits (4 bytes) of storage, so its range is fixed: INT(1) and INT(20) are identical for storage and computation.
3. UNSIGNED disallows negative values, which roughly doubles the upper limit for positive numbers. For example, TINYINT ranges from -128 to 127, while UNSIGNED TINYINT ranges from 0 to 255.
4. Generally speaking, there is little need for the DECIMAL type. Even for financial data you can use BIGINT: if you need accuracy to 1/10000, multiply the values by 10,000 and store them as BIGINT. This avoids both the inaccuracy of floating-point arithmetic and the cost of exact DECIMAL arithmetic.
5. TIMESTAMP uses 4 bytes of storage while DATETIME uses 8. As a result, TIMESTAMP can only represent times from 1970 to 2038, a much smaller range than DATETIME, and TIMESTAMP values depend on the time zone.
6. In most cases there is no need for ENUM types. One drawback is that the list of enumeration strings is fixed: adding or removing strings (enumeration options) requires ALTER TABLE (although simply appending elements to the end of the list does not rebuild the table).
7. Don't give a schema too many columns. The storage engine API copies data between the server layer and the storage engine layer in a row-buffer format, which the server layer then decodes into columns; this conversion is expensive. With too many columns, of which few are actually used, CPU usage can become excessive.
8. ALTER TABLE on a large table is very time-consuming: for most structural changes, MySQL creates an empty table with the new structure, copies all rows from the old table into it, and then drops the old table. This takes even longer when memory is tight or the table and its indexes are large. There are tricks to work around this; interested readers can look them up.
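Two of the tips above can be sanity-checked in a few lines: the fixed ranges behind tips 2 and 3, and the scaled-integer storage from tip 4 (a sketch; the 10**4 factor assumes 1/10000 precision, so pick the factor to match your own precision):

```python
from decimal import Decimal

# Tips 2/3: an integer column's range depends only on bit width and
# signedness, never on the display width in INT(11).
def int_range(bits, unsigned=False):
    if unsigned:
        return (0, 2 ** bits - 1)
    return (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)

print(int_range(8))         # TINYINT:          (-128, 127)
print(int_range(8, True))   # TINYINT UNSIGNED: (0, 255)

# Tip 4: money as a scaled integer stored in a BIGINT column.
SCALE = 10 ** 4  # 1/10000 precision

def to_scaled(amount: str) -> int:
    return int(Decimal(amount) * SCALE)    # exact, unlike binary floats

def from_scaled(value: int) -> Decimal:
    return Decimal(value) / SCALE

stored = to_scaled("19.99")   # all arithmetic stays exact integer math
```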
Create a high performance index
Indexing is an important way to improve MySQL query performance, but too many indexes can lead to high disk usage and a large memory footprint, hurting the application's overall performance. Try to avoid adding indexes as an afterthought: retrofitting them means monitoring lots of SQL to locate the problem, which takes far longer than designing the right indexes up front. Index design takes real skill.
The next step is to show you a series of strategies for creating high-performance indexes and how each strategy works. But before that, understanding some algorithms and data structures related to the index will help to better understand the content of the following article.
Index-related data structures and algorithms
Generally speaking, "index" refers to the B-Tree index, the most common and effective index for looking up data in relational databases, and one supported by most storage engines. The term B-Tree is used because it is the keyword MySQL uses in CREATE TABLE and other statements, but different storage engines may actually use different data structures underneath; InnoDB, for example, uses a B+Tree.
The B in B+Tree stands for balance. Note that a B+Tree index cannot by itself locate a specific row for a given key value; it only finds the page containing the row. The database then reads that page into memory, searches within it in memory, and finally obtains the data.
Before introducing B+Tree, let's look at the binary search tree, a classic data structure: the left subtree's values are always less than the root's, and the right subtree's values are always greater, as shown in figure ① below. To find the record with value 5 in this search tree, the process is roughly: start at the root, whose value is 6, greater than 5, so go to the left subtree and find 3; 5 is greater than 3, so go to 3's right subtree; 3 comparisons in total. Likewise, finding the record with value 8 takes 3 comparisons. So the average number of lookups in this binary search tree is (1 + 2 + 2 + 3 + 3 + 3) / 6 ≈ 2.3, while a sequential scan needs only 1 lookup for the record with value 2 but 6 for the record with value 8, averaging (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5. Therefore, in most cases, the binary search tree is faster on average than a sequential scan.
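The averages above can be checked with a short sketch that rebuilds the six-key tree from the figure (assuming keys {2, 3, 5, 6, 7, 8} inserted so that 6 becomes the root):

```python
def insert(node, key):
    """Insert a key into an (unbalanced) binary search tree."""
    if node is None:
        return {"key": key, "left": None, "right": None}
    side = "left" if key < node["key"] else "right"
    node[side] = insert(node[side], key)
    return node

def depth_sum(node, depth=1):
    # Sum of the comparison counts needed to reach every key.
    if node is None:
        return 0
    return depth + depth_sum(node["left"], depth + 1) + depth_sum(node["right"], depth + 1)

keys = [6, 3, 7, 2, 5, 8]          # insertion order that reproduces the figure
root = None
for k in keys:
    root = insert(root, k)

bst_avg = depth_sum(root) / len(keys)                 # (1+2+2+3+3+3)/6 ≈ 2.33
seq_avg = sum(range(1, len(keys) + 1)) / len(keys)    # (1+2+3+4+5+6)/6 = 3.5
```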
Because a binary search tree can be built in any shape, the same values could also produce a tree like figure ②. Clearly, that tree's query efficiency is about the same as a sequential scan. For a binary search tree to deliver its best query performance, it must be balanced, that is, a balanced binary tree (AVL tree).
The balanced binary tree first needs to conform to the definition of the binary search tree, and secondly, it must satisfy that the height difference between the two subtrees of any node can not be greater than 1. Obviously, the graph ② does not satisfy the definition of balanced binary tree, while the graph ① is a balanced binary tree. The search performance of the balanced binary tree is relatively high (the best performance is the optimal binary tree). The better the query performance is, the greater the maintenance cost is. For example, in the balanced binary tree of figure ①, when the user needs to insert a new node with a value of 9, the following changes need to be made.
Restoring balance with a single left rotation after insertion is the simplest case; real scenarios may require multiple rotations. Now consider a question: balanced binary trees search efficiently, are simple to implement, and their maintenance cost is acceptable, so why don't MySQL indexes use a balanced binary tree directly?
As the data in a database grows, the index itself grows too, so it cannot all be kept in memory; indexes are therefore usually stored on disk as index files. Searching the index then incurs disk I/O, which is several orders of magnitude slower than memory access. Imagine the depth of a binary tree with millions of nodes: if such a tree lived on disk, reading each node would cost one disk I/O, and the total search time would clearly be unacceptable. So how do we reduce the number of I/O operations per lookup?
An effective solution is to reduce the tree's depth by turning the binary tree into an m-way (multiway) search tree, and B+Tree is one such multiway search tree. To understand B+Tree you only need its two most important properties. First, all keys (think of them as the data) are stored in the leaf nodes (Leaf Pages); non-leaf nodes (Index Pages) store no real data, and all records sit on the same leaf level, ordered by key. Second, all leaf nodes are linked together by pointers. The figure below shows a simplified B+Tree of height 2.
How should we understand these two properties? MySQL sets each node's size to an integer multiple of a page (explained below), so with a fixed node size each node can hold more keys, letting each node index a larger and more precise range. The advantage of linking all leaf nodes with pointers is range access: in the figure above, to find records greater than 20 and less than 30, you locate node 20 and then follow the pointers to 25 and 30 in turn. Without the link pointers, range searches would be impossible. This is an important reason why MySQL uses B+Tree as its index storage structure.
To see why MySQL sets the node size to an integer multiple of a page, you need to understand how disks are accessed. Disk access is far slower than main memory access, and with the cost of mechanical movement added (especially on ordinary spinning disks), a disk access often runs at a tiny fraction, on the order of millionths, of main memory speed. To minimize disk I/O, disks do not read strictly on demand but read ahead: even if only one byte is needed, the disk reads a certain length of data sequentially from that position into memory. The read-ahead length is generally an integer multiple of the page size.
A page is a logical block of computer managed memory. Hardware and OS often divide the main memory and disk storage area into continuous blocks of equal size, and each storage block is called a page (in many OS, the page size is usually 4K). Main memory and disk exchange data in pages. When the data to be read by the program is not in the main memory, a page fault exception will be triggered, and the system will send a read signal to the disk, which will find the starting position of the data and continuously read one or more pages back into memory, and then return together, and the program continues to run.
MySQL cleverly exploits disk read-ahead by setting a node's size equal to one page, so each node can be loaded with a single I/O. To achieve this, every new node is allocated a full page of space, guaranteeing that a node is physically stored within one page; and since memory allocation is page-aligned, reading a node takes exactly one I/O. If the height of the B+Tree is h, a lookup needs at most h-1 I/O operations (the root node stays resident in memory), and the complexity is O(h) = O(log_m N). In practice the fanout m is usually large, often over 100, so the tree is shallow, usually no more than 3 levels.
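The claim about tree height can be sanity-checked with integer arithmetic (fanout 100 is the assumption used above):

```python
def min_height(n_keys: int, fanout: int) -> int:
    """Smallest h such that a tree with the given fanout can index n_keys."""
    h, capacity = 0, 1
    while capacity < n_keys:
        capacity *= fanout
        h += 1
    return h

print(min_height(1_000_000, 100))   # one million keys fit in three levels
```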
Finally, we briefly understand the operation of the B+Tree node, and have a general understanding of the maintenance of the index as a whole. Although the index can greatly improve the query efficiency, it still costs a lot to maintain the index, so it is particularly important to create the index reasonably.
Still taking the above tree as an example, we assume that each node can only store four internal nodes. First, insert the first node 28, as shown in the following illustration.
Neither leaf page nor index page is full.
Then insert the next node, 70. Searching the Index Page shows it belongs in the leaf between 50 and 70, but that leaf node is already full, so it must be split. The current leaf node starts at 50, so split it at its middle value, as shown below.
Leaf Page split
Finally, insert a node 95, when both Index Page and Leaf Page are full, you need to split it twice, as shown in the following figure.
Leaf Page and Index Page split
After the split, it finally formed such a tree.
Final tree
To stay balanced, a B+Tree may perform many page splits for newly inserted values, and every page split requires I/O. To minimize splits, B+Tree also offers a rotation operation, similar to a balanced binary tree: when a Leaf Page is full but its left or right sibling is not, the B+Tree does not rush to split; instead it moves records into a sibling of the current page, usually checking the left sibling first. In the second example above, inserting 70 would then trigger a rotation to the left sibling instead of a page split.
Rotation operation
Rotation minimizes page splits and thereby reduces disk I/O during index maintenance, improving its efficiency. Note that deleting a node is similar to inserting one: rotations and merges may still be required, which we won't detail here.
High performance strategy
From the above you should now have a general picture of the B+Tree data structure. But how do indexes organize data storage in MySQL? Consider a simple example: suppose you have the following table:
CREATE TABLE People (
   last_name  varchar(50) not null,
   first_name varchar(50) not null,
   dob        date not null,
   gender     enum('m', 'f') not null,
   key (last_name, first_name, dob)
);
For each row of data in the table, the index contains the values of the last_name, first_name, and dob columns. The following figure shows how the index organizes the data store.
How indexes organize data storage, from: high-performance MySQL
As you can see, the index is sorted first by the first field, last_name; when last names are equal, by the second field, first_name; and then by the third field, dob (date of birth). This ordering is exactly why indexes have a "leftmost prefix" rule.
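The "leftmost prefix" behavior falls straight out of plain tuple ordering, as a small sketch shows (the names are made-up sample rows):

```python
import bisect

# A (last_name, first_name, dob) index is just rows kept in tuple order.
rows = sorted([
    ("Allen",     "Cuba",     "1960-01-01"),
    ("Allen",     "Kim",      "1930-07-12"),
    ("Astaire",   "Angelina", "1980-03-04"),
    ("Barrymore", "Julia",    "2000-05-16"),
])

# Leftmost prefix (last_name = 'Allen'): matching rows are contiguous,
# so a cheap range scan finds them all.
lo = bisect.bisect_left(rows, ("Allen",))
hi = bisect.bisect_left(rows, ("Allen\x00",))   # just past every 'Allen' tuple
allens = rows[lo:hi]

# A non-leftmost column (first_name alone) is NOT in sorted order,
# so this index cannot serve that query without scanning everything.
first_names = [r[1] for r in rows]
```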
1. Cases where MySQL will not use indexes: non-independent columns
"Independent column" means that an index column cannot be part of an expression or an argument to a function. For example:
select * from t where id + 1 = 5
It's easy to see that it's equivalent to id = 4, but MySQL can't parse the expression automatically, and it's the same with functions.
2. Prefix index
If the column is long, you can usually index the first part of the characters, which can effectively save indexing space and improve indexing efficiency.
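Choosing the prefix length comes down to measuring selectivity at different lengths until it approaches the full column's (the sample data below is made up; in SQL you would compute COUNT(DISTINCT LEFT(col, n)) / COUNT(*)):

```python
def selectivity(values):
    """Fraction of distinct values: 1.0 means every value is unique."""
    values = list(values)
    return len(set(values)) / len(values)

cities = ["San Francisco", "San Diego", "Santiago", "Sao Paulo",
          "Berlin", "Bern", "Boston", "Bogota"]

full = selectivity(cities)                        # 1.0: all values distinct
for n in (2, 4, 6):
    print(n, selectivity(c[:n] for c in cities))  # how close does a prefix get?
```
A 6-character prefix already matches the full column's selectivity on this sample, so indexing only that prefix would lose nothing here.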
3. Multi-column index and index order
In most cases, creating independent indexes on several columns does not improve query performance, for a simple reason: MySQL has to pick one. Older versions (before MySQL 5.0) would simply choose one column's index, while newer versions use an index merge strategy. A simple example: in a film_actor table with separate indexes on both the actor_id and film_id columns, consider this query:
select film_id, actor_id from film_actor where actor_id = 1 or film_id = 1
Older MySQL versions would simply pick one of the two indexes, but newer versions optimize the query as follows:
select film_id, actor_id from film_actor where actor_id = 1
union all
select film_id, actor_id from film_actor where film_id = 1 and actor_id <> 1
When multiple indexes intersect (multiple AND conditions), generally speaking, one index containing all related columns is better than multiple independent indexes.
When multiple indexes are combined with union operations (multiple OR conditions), merging and sorting the result sets consumes a lot of CPU and memory, especially when some of the indexes are not very selective and large amounts of data must be returned and merged; in that case a full table scan may actually be cheaper.
Therefore, if you see an index merge in EXPLAIN (Using union appears in the Extra column), check whether the query and table structure are already optimal. If they are, the index design itself is probably poor and you should reconsider it carefully; a single multi-column index covering all the relevant columns may be more suitable.
We mentioned earlier how indexes organize data storage. As the figure shows, the order of columns in an index is critical to queries: obviously, the more selective field should come first in the index, so that most non-matching rows can be filtered out by the first field alone.
Index selectivity refers to the ratio of non-repeated index values to the total number of records in the data table. The higher the selectivity, the higher the query efficiency, because the higher the selectivity of the index, the more rows can be filtered out by MySQL. The selectivity of the unique index is 1, which is the best index selectivity and the best performance.
After understanding the concept of index selectivity, it is not difficult to determine which field is more selective, just check it out, such as:
SELECT * FROM payment WHERE staff_id = 2 AND customer_id = 584;
Should the index be (staff_id, customer_id) or the reverse? Run the following query and put the column whose selectivity is closer to 1 first:
SELECT COUNT(DISTINCT staff_id) / COUNT(*) AS staff_id_selectivity,
       COUNT(DISTINCT customer_id) / COUNT(*) AS customer_id_selectivity,
       COUNT(*)
FROM payment;
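Assuming the measurement shows customer_id_selectivity much closer to 1 than staff_id_selectivity (which is what the sakila sample data typically shows, since there are far more customers than staff), the index would be built with customer_id first; a sketch:

```sql
ALTER TABLE payment ADD INDEX idx_customer_staff (customer_id, staff_id);
```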
In most cases this principle works well, but always watch for special cases in your own data. A simple example: querying the users in a given user group who have made a trade:
SELECT user_id FROM trade WHERE user_group_id = 1 AND trade_amount > 0;
MySQL chose the index (user_group_id, trade_amount) for this query. Ignoring special cases, that looks fine; but in reality most of the rows in this table were migrated from an old system, and because the old data was not compatible, all migrated rows were assigned a default user group. In that case the number of rows scanned through the index is essentially the same as a full table scan, and the index contributes nothing.
Generally speaking, rules of thumb and inference are useful guides for development and design in most cases, but the real situation is often more complicated, and special cases in a real business scenario can undermine the whole design.
4. Avoid multiple range conditions
In real development we often use multiple range conditions, for example querying users who have logged in within a certain period:
SELECT user.* FROM user WHERE login_time > '2017-04-01' AND age BETWEEN 18 AND 30;
The problem with this query is that it has two range conditions, on the login_time column and on the age column: MySQL can use the index on login_time or the index on age, but not both.
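One common workaround, when the range column has only a few possible values, is to rewrite the range as a list of equality conditions so that an index on (age, login_time) can use both columns; a sketch:

```sql
SELECT user.* FROM user
WHERE age IN (18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30)
  AND login_time > '2017-04-01';
```

The optimizer treats IN() as multiple equality comparisons rather than a range, so the column after age in the index can still be used for the login_time condition.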
5. Covering indexes
If an index contains (or covers) the values of all the fields a query needs, there is no need to go back to the table; this is called a covering index. Covering indexes are a very powerful tool and can greatly improve performance, because a query that only needs to scan the index gains many benefits:
The index entry is much smaller than the data row size. If only the index is read, the amount of data access will be greatly reduced.
Indexes are stored in order of column values, so for I/O-intensive range queries, far less I/O is needed than randomly reading each row of data from disk.
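A minimal covering-index sketch (the table and index names are illustrative): when the index contains every column the query touches, EXPLAIN reports Using index in the Extra column and no table lookup is needed:

```sql
CREATE INDEX idx_store_film ON inventory (store_id, film_id);

-- Both selected columns live in the index itself:
EXPLAIN SELECT store_id, film_id FROM inventory WHERE store_id = 1;
```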
6. Use index scanning to sort
MySQL can produce ordered result sets in two ways: one is to sort the result set, and the other is that the results scanned according to the index order are naturally ordered. If the value of the type column in the explain result is index, it means that an index scan is used for sorting.
Scanning the index itself is fast, since it only requires moving from one index record to the adjacent one. But if the index does not cover all the columns the query needs, then for every index record scanned MySQL must look the row up back in the table. That read is essentially random I/O, so reading data in index order is usually slower than a sequential full table scan.
When designing an index, it is best if an index can satisfy both sorting and query.
The index can be used to sort the results only when the column order of the index is exactly the same as that of the ORDER BY clause and all columns are sorted in the same direction. If the query joins multiple tables, the index can be used for sorting only when every field referenced by the ORDER BY clause comes from the first table. The ORDER BY clause also has the same limitation as the query itself: it must satisfy the leftmost-prefix requirement (with one exception: when the leftmost column is fixed to a constant, as in the simple example below). In all other cases a sort operation is required and index-order sorting cannot be used.
-- Leftmost column is a constant. Index: (date, staff_id, customer_id)
SELECT staff_id, customer_id FROM demo WHERE date = '2015-06-01' ORDER BY staff_id, customer_id;
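For contrast, a few sketches of queries against the same (date, staff_id, customer_id) index that cannot use it for sorting (the email column is hypothetical):

```sql
-- Mixed sort directions (not index-sortable before MySQL 8.0's descending indexes):
SELECT * FROM demo WHERE date = '2015-06-01' ORDER BY staff_id DESC, customer_id ASC;

-- ORDER BY references a column that is not in the index:
SELECT * FROM demo WHERE date = '2015-06-01' ORDER BY staff_id, email;

-- Leftmost column is a range condition rather than a constant:
SELECT * FROM demo WHERE date > '2015-06-01' ORDER BY staff_id, customer_id;
```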
7. Redundant and duplicate indexes
Duplicate indexes are indexes of the same type created on the same columns in the same order; they should be avoided and dropped as soon as they are found. Redundant indexes are slightly different: if an index (A, B) exists, then an index (A) is redundant, because (A) is a leftmost prefix of (A, B). Redundant indexes usually appear when a new index is added to a table, for example someone creates (A, B) without noticing that it is an extension of the existing index (A).
In most cases you should try to extend existing indexes rather than create new ones. However, in rare cases redundant indexes are kept for performance reasons, for example when extending an existing index would make it too large and hurt other queries that use it.
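A short sketch with a hypothetical table t:

```sql
CREATE INDEX idx_ab ON t (A, B);
CREATE INDEX idx_a  ON t (A);     -- redundant: (A) is a leftmost prefix of (A, B)
CREATE INDEX idx_ba ON t (B, A);  -- not redundant: different leading column
```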
8. Delete indexes that have not been used for a long time
It is a good habit to delete indexes that have not been used for a long time on a regular basis.
I will stop the discussion of indexing here. Finally, remember that an index is not always the right tool: it is effective only when the benefit it brings to query speed outweighs the extra work it introduces. For very small tables, a simple full table scan is more efficient. For medium and large tables, indexes work very well. For very large tables, the cost of building and maintaining indexes grows, and other techniques such as partitioned tables may be more effective. And in every case, verify with EXPLAIN after making a change.
Optimizing specific types of queries
Optimizing COUNT() queries
COUNT() is probably the most misunderstood function. It has two different jobs: counting the number of values in a column, and counting the number of rows. When counting column values, the value must be non-NULL; NULLs are not counted. If the expression in the parentheses can never be NULL, you are actually counting rows. Put simply, COUNT(*) does not expand to all the columns as many assume; it ignores the columns entirely and counts the rows directly.
The most common misunderstanding is specifying a column in the parentheses while expecting the result to be the row count, often on the mistaken assumption that COUNT(column) performs better. It does not: if you want the number of rows, use COUNT(*) directly; the meaning is clearer and the performance is better.
Some business scenarios do not need a completely accurate count and can use an approximation instead. The row estimate from EXPLAIN is a good approximation, and EXPLAIN does not actually execute the query, so its cost is very low. In general, executing COUNT() requires scanning a large number of rows to get an accurate result, which makes it hard to optimize; the only thing that can be done at the MySQL level is to use a covering index. If that is not enough, the problem has to be solved at the architecture level, for example with summary tables or an external caching system such as Redis.
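The two jobs of COUNT(), plus the cheap EXPLAIN approximation, in a short sketch:

```sql
SELECT COUNT(*) FROM payment;            -- counts rows
SELECT COUNT(customer_id) FROM payment;  -- counts non-NULL customer_id values

-- Approximate row count: the rows column of EXPLAIN is an estimate
-- produced without actually executing the query:
EXPLAIN SELECT COUNT(*) FROM payment;
```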
Optimizing JOIN queries
In large-data scenarios, tables are often linked through a redundant field, which performs better than a direct JOIN. If you do need a join query, pay special attention to the following:
Make sure there are indexes on the columns in the ON and USING clauses, and take the join order into account when creating them. When tables A and B are joined on column c, and the optimizer joins them in the order A, B, there is no need to index the column on table A: unused indexes are just extra overhead. In general, unless there is another reason, you only need to create an index on the corresponding column of the second table in the join order (the reason is analyzed below).
Make sure that any expression in GROUP BY or ORDER BY references columns from only one table, so that MySQL has a chance to use an index to optimize it.
To understand the first tip, you need to know how MySQL executes joins. MySQL's current join strategy is very simple: it runs a nested-loop join for every join. That is, it loops over the rows of one table, and for each row looks up matching rows in the next table, continuing until matching rows have been found in all joined tables. It then builds the columns needed by the query from the rows matched in each table.
Too abstract? To illustrate with the example above, for example, there is a query like this:
SELECT A.xx, B.yy
FROM A INNER JOIN B USING (c)
WHERE A.xx IN (5, 6)
Assuming MySQL joins in the order A, B as written in the query, the following pseudocode shows how MySQL completes it:
outer_iterator = SELECT A.xx, A.c FROM A WHERE A.xx IN (5, 6)
outer_row = outer_iterator.next
while (outer_row) {
    inner_iterator = SELECT B.yy FROM B WHERE B.c = outer_row.c
    inner_row = inner_iterator.next
    while (inner_row) {
        output(inner_row.yy, outer_row.xx)
        inner_row = inner_iterator.next
    }
    outer_row = outer_iterator.next
}
As you can see, the outer query filters on the A.xx column, so even if there is an index on A.c, this join will not use it. Looking at the inner query, it is obvious that an index on B.c speeds it up. That is why you only need to create an index on the corresponding column of the second table in the join order.
Optimize LIMIT paging
When paging is required, it is usually implemented with LIMIT plus an offset, along with a suitable ORDER BY clause. With a matching index this is usually efficient; otherwise MySQL has to do a lot of filesort work.
A common problem arises when the offset is very large. For a query such as LIMIT 10000, 20, MySQL has to fetch 10020 records, return only 20 of them, and discard the first 10000, which is very expensive.
One of the simplest optimizations for such queries is to use a covering index scan wherever possible rather than querying all the columns, then join back as needed to fetch the remaining columns. When the offset is large, this improves efficiency considerably. Consider the following query:
SELECT film_id,description FROM film ORDER BY title LIMIT 50,5
If the table is very large, it is best to change the query to look like this:
SELECT film.film_id,film.description
FROM film INNER JOIN (
SELECT film_id FROM film ORDER BY title LIMIT 50,5
) AS tmp USING (film_id)
The delayed join here greatly improves efficiency: it lets MySQL scan as few pages as possible to find the records it needs, then joins back to the original table on the associated column to fetch the required columns.
Sometimes if you can use a bookmark to record the location of the last data fetch, you can start scanning directly from the location of the bookmark record next time, so you can avoid using OFFSET, such as the following query:
SELECT id FROM t LIMIT 10000, 10
Change to:
SELECT id FROM t WHERE id > 10000 LIMIT 10
Other optimizations include using a pre-calculated summary table, or associating to a redundant table that contains only primary key columns and columns that need to be sorted.
Optimize UNION
MySQL's strategy for UNION is to create a temporary table first, insert each query's results into it, and then read the final result back out of the temporary table. Many optimization strategies therefore work poorly on UNION queries, and it is often necessary to manually "push down" WHERE, LIMIT, ORDER BY and similar clauses into each subquery so that the optimizer can take full advantage of them.
Unless you really need the server to de-duplicate, be sure to use UNION ALL. If there is no ALL keyword, MySQL will add the DISTINCT option to the temporary table, which will cause the data of the entire temporary table to be checked for uniqueness, which is very expensive. Of course, even with the ALL keyword, MySQL always puts the result in a temporary table, then reads it out, and then returns it to the client. Although this is not necessary in many cases, for example, you can sometimes return the results of each subquery directly to the client.
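A sketch of pushing LIMIT down into each subquery (t1 and t2 are hypothetical tables): without the inner LIMITs, every row of both tables would be copied into the temporary table before the outer LIMIT is applied; with them, at most 20 rows per subquery reach the temporary table:

```sql
(SELECT id, name FROM t1 ORDER BY name LIMIT 20)
UNION ALL
(SELECT id, name FROM t2 ORDER BY name LIMIT 20)
ORDER BY name LIMIT 20;
```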
Conclusion
Understanding how queries are executed and where time is spent, coupled with some knowledge of the optimization process, can help you better understand MySQL and the principles behind common optimization techniques. I hope the principles and examples in this article can help you better connect theory with practice and apply more theoretical knowledge to practice.
That is all; I will leave you with two questions to think over. These are things we often repeat, but few people stop to ask why.
Many programmers share the view that stored procedures should be avoided as much as possible: they are hard to maintain and add operational cost, and business logic belongs in the application. If the application can do all of this, why do stored procedures exist at all?
(A stored procedure is a precompiled collection of SQL statements. If the same logic is implemented in the application, it may require multiple round trips to the database; a stored procedure avoids that and reduces the coupling between the program and the database.)
JOIN itself is convenient enough; you can just query directly, so why do we need views?
(A view is a virtual table; it simplifies operations for users and can provide security over confidential data.)