
MySQL optimization guidelines that veteran drivers must also master

2025-04-19 Update From: SLTechnology News&Howtos


When the number of records in a single MySQL table grows too large, the performance of inserts, deletes, updates, and queries drops sharply. This article provides some optimization references you can follow step by step.

Single table optimization

Unless the single-table data volume is expected to keep growing, do not reach for splitting first: splitting brings all kinds of complexity in logic, deployment, and operations.

Generally speaking, a table keyed by integers with fewer than 10 million rows, or keyed by strings with fewer than 5 million rows, is not a big problem.

In fact, in many cases, the performance of MySQL single table still has a lot of room for optimization, and even can normally support more than 10 million levels of data.

Field

About fields:

Prefer TINYINT, SMALLINT, or MEDIUMINT over INT for integer columns, and add UNSIGNED when values are non-negative.

The length of the VARCHAR allocates only the space that is really needed.

Use enumerations or integers instead of string types.

Try to use TIMESTAMP instead of DATETIME.

Do not put too many fields in a single table; fewer than 20 is recommended.

Avoid NULL-able fields: they are difficult to query and optimize, and they take up extra index space.

Store IP addresses as integers.
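As a sketch of the last point, MySQL's built-in INET_ATON() and INET_NTOA() functions convert dotted-quad IPv4 addresses to and from unsigned integers (the access_log table and its columns below are illustrative, not from the article):

```sql
-- Store IPv4 addresses as 4-byte unsigned integers instead of VARCHAR(15).
CREATE TABLE access_log (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    client_ip INT UNSIGNED NOT NULL
);

-- INET_ATON('192.168.1.1') = 3232235777
INSERT INTO access_log (client_ip) VALUES (INET_ATON('192.168.1.1'));

-- Convert back to dotted-quad form when reading:
SELECT INET_NTOA(client_ip) FROM access_log;
```

An INT UNSIGNED column is 4 bytes and compares and indexes faster than a 15-byte string.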

Indexes

About the index:

More indexes are not necessarily better. Create indexes based on actual queries: consider indexing the columns involved in WHERE and ORDER BY clauses, and use EXPLAIN to check whether a query uses an index or a full table scan.

Try to avoid checking fields for NULL in the WHERE clause; otherwise the engine may abandon the index and perform a full table scan.

Fields with low selectivity are poor index candidates — for example, "gender", which has only two or three distinct values.

Index character fields with prefix indexes only.

It is best not to use character fields as primary keys.

Do not use foreign keys; guarantee the constraint in application code.

Try not to use UNIQUE constraints; let the application guarantee uniqueness.

When using multi-column (composite) indexes, keep the column order consistent with the query conditions, and remove redundant single-column indexes.
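To illustrate the composite-index and prefix-index points above, here is a hedged sketch; the orders table and its columns are assumptions for illustration:

```sql
CREATE TABLE orders (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id INT UNSIGNED NOT NULL,
    status TINYINT UNSIGNED NOT NULL,
    remark VARCHAR(200),
    -- Composite index; a separate single-column index on user_id alone
    -- would be redundant, since it is the leftmost column here.
    KEY idx_user_status (user_id, status)
);

-- Matches the leftmost prefix of idx_user_status, so the index is used:
SELECT id FROM orders WHERE user_id = 42 AND status = 1;

-- Prefix index on a character column: index only the first 10 characters.
ALTER TABLE orders ADD KEY idx_remark (remark(10));
```

EXPLAIN on each query confirms whether the intended index is chosen.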

Query SQL

About querying SQL:

You can find slow SQL statements by enabling the slow query log.

Do not perform operations on columns: in SELECT id FROM t WHERE age + 1 = 10, any operation on a column — including function calls and computed expressions — causes a table scan. When writing a query, move the computation to the right-hand side of the comparison whenever possible.
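The rewrite in the point above can be sketched as follows, using the article's own example:

```sql
-- Bad: the expression wraps the column, so an index on age cannot be used
-- and the engine falls back to scanning the table.
SELECT id FROM t WHERE age + 1 = 10;

-- Good: the column stands alone on the left, so an index on age applies.
SELECT id FROM t WHERE age = 10 - 1;
```

The constant expression 10 - 1 is evaluated once, while age + 1 would have to be evaluated for every row.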

Keep SQL statements as simple as possible: one SQL statement can only run on one CPU; break large statements into smaller ones to reduce lock time; a single huge SQL statement can block the entire database.

No SELECT *.

Rewrite OR as IN: OR is O(n) while IN is O(log n); it is recommended to keep the number of values in an IN list below 200.
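For example, an OR chain on the same column can be rewritten as an IN list (the table and column names follow the article's later examples):

```sql
-- Before: a chain of OR conditions on one column.
SELECT id FROM t WHERE num = 10 OR num = 20 OR num = 30;

-- After: the equivalent IN list, which the optimizer handles more efficiently.
SELECT id FROM t WHERE num IN (10, 20, 30);
```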

Implement logic in the application instead of in database functions and triggers.

Avoid LIKE '%xxx' queries (leading wildcard).

Use less JOIN.

Compare values of the same type: compare '123' with '123', and 123 with 123.

Try to avoid using the != or <> operators in the WHERE clause; otherwise the engine abandons the index and performs a full table scan.

For consecutive values, use BETWEEN instead of IN: SELECT id FROM t WHERE num BETWEEN 1 AND 5.

Do not fetch a whole table for list data; paginate with LIMIT, and keep the page size moderate.
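A sketch of pagination, reusing the article's t table. The second query shows keyset pagination, a common complementary technique not covered above, which avoids the cost of deep OFFSETs (the literal 12345 is an illustrative "last seen id"):

```sql
-- Page 3 with 20 rows per page: skip 40 rows, return 20.
SELECT id, num FROM t ORDER BY id LIMIT 40, 20;

-- Keyset variant: seek past the last id of the previous page.
-- MySQL does not have to scan and discard the skipped rows.
SELECT id, num FROM t WHERE id > 12345 ORDER BY id LIMIT 20;
```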

Engines

The two most widely used engines at present are MyISAM and InnoDB:

MyISAM

The MyISAM engine is the default engine in MySQL 5.1 and earlier, with the following characteristics:

Row locks are not supported: reads lock every table they touch, and writes take an exclusive lock on the table.

Transactions are not supported.

Foreign keys are not supported.

Safe recovery after a crash is not supported.

Concurrent inserts are supported: new records can be inserted into a table while it is serving read queries.

Supports indexing the first 500 characters of BLOB and TEXT columns, and supports full-text indexing.

Support for delayed updating of indexes, greatly improving write performance.

For tables that will not be modified, compressed tables are supported, greatly reducing disk space consumption.

InnoDB

InnoDB became the default engine after MySQL 5.5. Its characteristics:

Row locks are supported and MVCC is used to support high concurrency.

Support transactions.

Foreign keys are supported.

Support for security recovery after a crash.

Full-text indexing is not supported.

PS: InnoDB is said to support full-text indexing as of MySQL 5.6.4.

In general, MyISAM is suitable for SELECT-intensive tables, while InnoDB is suitable for INSERT and UPDATE-intensive tables.

System tuning parameters

You can use the following tools for benchmarking:

Sysbench: a modular, cross-platform, multithreaded performance testing tool.

https://github.com/akopytov/sysbench

iibench-mysql: a Java-based index insert performance testing tool for MySQL/Percona/MariaDB.

https://github.com/tmcallaghan/iibench-mysql

tpcc-mysql: a TPC-C testing tool developed by Percona.

https://github.com/Percona-Lab/tpcc-mysql

There are many tuning parameters. For more information, please see the official documentation. Here are some important parameters:

back_log: indicates how many connection requests can be queued in a short period before MySQL temporarily stops answering new requests.

In other words, once the number of MySQL connections reaches max_connections, new requests are stored in a queue waiting for a connection to release resources; the queue length is back_log, and requests beyond that are refused. It can be raised from the default 50 to 500.

wait_timeout: how long an idle database connection is kept. Idle connections consume memory; this can be reduced from the default of 8 hours to half an hour.

max_user_connections: the maximum number of connections per user. The default 0 means no limit; it is best to set a reasonable upper bound.

Thread_concurrency: the number of concurrent threads, set to twice the number of CPU cores.

Skip_name_resolve: disables DNS parsing of external connections, eliminating DNS parsing time, but requires all remote hosts to access it with IP.

key_buffer_size: the size of the index block cache. Increasing it speeds up index handling and has the greatest impact on MyISAM table performance.

For a machine with about 4 GB of memory, it can be set to 256M or 384M. Check with SHOW STATUS LIKE 'key_read%'; ideally key_reads / key_read_requests stays below 0.1%.

innodb_buffer_pool_size: caches data pages and index pages; it has the greatest impact on InnoDB table performance.

Check with SHOW STATUS LIKE 'Innodb_buffer_pool_read%' and make sure (Innodb_buffer_pool_read_requests - Innodb_buffer_pool_reads) / Innodb_buffer_pool_read_requests is as close to 1 as possible.
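The two cache checks above can be issued as follows; the ratios are computed by hand from the returned counters:

```sql
-- MyISAM key cache: key_reads / key_read_requests should stay below 0.1%.
SHOW GLOBAL STATUS LIKE 'key_read%';

-- InnoDB buffer pool hit rate:
-- (Innodb_buffer_pool_read_requests - Innodb_buffer_pool_reads)
--   / Innodb_buffer_pool_read_requests should be close to 1.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
```

A low InnoDB hit rate usually means innodb_buffer_pool_size is too small for the working set.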

innodb_additional_mem_pool_size: the amount of memory the InnoDB storage engine uses for the data dictionary and other internal data structures.

When the database has many objects, increase this parameter so that everything fits in memory and access stays efficient. If it is too small, MySQL writes warnings to the error log — that is the signal to enlarge it.

innodb_log_buffer_size: the buffer used by the InnoDB transaction log; it is generally not recommended to exceed 32MB.

query_cache_size: caches result sets (ResultSet) of executed SQL statements in MySQL, so it applies only to SELECT statements.

Whenever the data in a table changes, all cached results in the Query Cache for SELECT statements referencing that table are invalidated.

So when data changes very frequently, the cost of the Query Cache may outweigh its benefit.

Tune it according to the hit rate (Qcache_hits / (Qcache_hits + Qcache_inserts) * 100%). It is generally not recommended to make it too large: 256MB is usually enough, and it can be raised appropriately for largely static data. You can inspect current Query Cache usage with SHOW STATUS LIKE 'Qcache%'.
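The inspection command mentioned above looks like this; the hit rate is computed from the returned counters:

```sql
-- Query cache counters; hit rate = Qcache_hits / (Qcache_hits + Qcache_inserts).
SHOW STATUS LIKE 'Qcache%';
```

If Qcache_hits is small relative to Qcache_inserts, the workload invalidates the cache too often to benefit from it.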

read_buffer_size: MySQL's read buffer size. Each request that scans a table sequentially is allocated a read buffer of this size.

If sequential scans of tables are frequent, increasing this value can improve their performance.

sort_buffer_size: the buffer MySQL uses for sorting. To speed up ORDER BY, first check whether MySQL can use an index instead of an extra sort phase; if not, try increasing sort_buffer_size.

read_rnd_buffer_size: MySQL's random read buffer size. When rows are read in non-sequential order (for example, in sort order), a random read buffer is allocated.

When sorting a query, MySQL scans this buffer first to avoid disk seeks and speed up the query. If you need to sort large amounts of data, you can increase the value appropriately.

However, MySQL allocates this buffer for every client connection, so set the value carefully to avoid excessive memory overhead.

Record_buffer: each thread that performs a sequential scan allocates a buffer of this size to each table it scans. If you do a lot of sequential scans, you may want to increase this value.

thread_cache_size: caches threads not currently associated with a connection but ready to serve a new one, so that a connection request can be answered quickly without creating a new thread.

table_cache: similar to thread_cache_size, but caches table file handles; it mainly benefits MyISAM and has little effect on InnoDB.

Upgrade Hardware

Scale-up, needless to say: depending on whether MySQL is CPU-bound or I/O-bound, upgrading the CPU and memory or switching to SSDs can significantly improve MySQL performance.

Separation of reading and writing

This is also a commonly used optimization: writes go to the master and reads go to replicas. Generally, avoid dual-master or multi-master setups, which introduce a lot of complexity; try the other approaches in this article first. Note that many split solutions also incorporate read-write separation.

Caching

Caching can occur at these levels:

Inside MySQL: the relevant settings are described in the system tuning parameters.

Data access layer: for example, MyBatis caches SQL results, while Hibernate can cache down to individual records; the cached object here is mainly the persistent object (Persistence Object).

Application service layer: here caching can be controlled more precisely and more strategies can be implemented programmatically; the cached object is the data transfer object (Data Transfer Object).

Web layer: caching for Web pages.

Browser client: the cache on the client side.

You can add caching at one or more of these levels depending on the situation. This article focuses on caching at the service layer.

At present, there are two main ways:

Write Through: after the data is written to the database, the cache is updated at the same time to maintain the consistency between the database and the cache.

This is how most application caching frameworks, such as Spring Cache, work. It is simple to implement and keeps data well synchronized, but efficiency is only average.

Write Back: when there is data to write to the database, only the cache is updated and then synchronized to the database asynchronously in batches.

This implementation is more complex and requires more application logic; it may also leave the database temporarily out of sync with the cache, but it is very efficient.

Table partition

The partition introduced by MySQL in version 5.1 is a simple horizontal split. Users need to add partition parameters when creating the table, which is transparent to the application without modifying the code.

To the user, a partitioned table is a single logical table, but underneath it consists of multiple physical sub-tables. The partitioning code is really a wrapper over a set of underlying tables; to the SQL layer, it is a black box that completely encapsulates the underlying layer.

The way MySQL implements partitioning also means that indexes are also defined according to partitioned child tables and there is no global index.

SQL statements need to be written with the partitioned table in mind: include the partitioning column in the query conditions so the query can be pruned to a few partitions; otherwise all partitions are scanned.

You can use EXPLAIN PARTITIONS to see which partitions a SQL statement touches, for SQL optimization.

In the following example, 5 records fall on two partitions:

mysql> EXPLAIN PARTITIONS SELECT COUNT(1) FROM user_partition WHERE id IN (1,2,3,4,5);
+----+-------------+----------------+------------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table          | partitions | type  | possible_keys | key     | key_len | ref  | rows | Extra                    |
+----+-------------+----------------+------------+-------+---------------+---------+---------+------+------+--------------------------+
|  1 | SIMPLE      | user_partition | p1,p4      | range | PRIMARY       | PRIMARY | 8       | NULL |    5 | Using where; Using index |
+----+-------------+----------------+------------+-------+---------------+---------+---------+------+------+--------------------------+
1 row in set (0.00 sec)

The benefits of partitioning are:

You can make a single table store more data.

Partitioned data is easier to maintain: you can delete large amounts of data in bulk by clearing an entire partition, and you can add new partitions to accommodate newly inserted data. Individual partitions can also be optimized, checked, and repaired independently.

Some queries can be determined from their conditions to fall on only a few partitions, and these will be very fast.

The data of the partition table can also be distributed over different physical devices, thus making efficient use of multiple hardware devices.

You can use partitioned tables to avoid some special bottlenecks, such as mutually exclusive access to a single index of InnoDB, and inode lock contention for ext3 file systems.

Individual partitions can be backed up and restored.

Limitations and disadvantages of partitions:

A table can only have a maximum of 1024 partitions.

If the table has a primary key or unique index, the partitioning columns must be included in every primary key and unique index.

Foreign key constraints cannot be used for partitioned tables.

The NULL value invalidates partition filtering.

All partitions must use the same storage engine.

Type of partition:

RANGE partition: assigns multiple rows to a partition based on column values that belong to a given contiguous interval.

LIST partitions: similar to partitioning by RANGE, except that LIST partitions are selected based on column values matching a value in a set of discrete values.

HASH partition: a partition that is selected based on the return value of a user-defined expression that uses the column values of the rows to be inserted into the table. This function can contain any valid expression in MySQL that produces a non-negative integer value.

KEY partitions: similar to partitioning by HASH, except that KEY partitions only support computing one or more columns, and the MySQL server provides its own hash function. One or more columns must contain integer values.
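As an illustrative sketch of HASH partitioning (the sessions table and its columns are hypothetical; note how the partitioning column user_id is part of the primary key, per the restriction listed earlier):

```sql
-- Spread rows over 4 partitions by hashing an integer column.
CREATE TABLE sessions (
    id BIGINT UNSIGNED NOT NULL,
    user_id INT UNSIGNED NOT NULL,
    created_at DATETIME NOT NULL,
    PRIMARY KEY (id, user_id)
)
PARTITION BY HASH (user_id)
PARTITIONS 4;
```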

The scenario best suited to partitioning is data with a strong time-series character, which can be partitioned by time.

As follows:

CREATE TABLE members (
    firstname VARCHAR(25) NOT NULL,
    lastname VARCHAR(25) NOT NULL,
    username VARCHAR(16) NOT NULL,
    email VARCHAR(35),
    joined DATE NOT NULL
)
PARTITION BY RANGE (YEAR(joined)) (
    PARTITION p0 VALUES LESS THAN (1960),
    PARTITION p1 VALUES LESS THAN (1970),
    PARTITION p2 VALUES LESS THAN (1980),
    PARTITION p3 VALUES LESS THAN (1990),
    PARTITION p4 VALUES LESS THAN MAXVALUE
);

Adding a time range condition to queries is then very efficient, and unwanted historical data is easy to delete in bulk.
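Both points can be sketched against the members table above:

```sql
-- A range condition on the partitioning column prunes to a few partitions;
-- EXPLAIN PARTITIONS shows which ones are touched.
EXPLAIN PARTITIONS
SELECT COUNT(*) FROM members
WHERE joined BETWEEN '1980-01-01' AND '1989-12-31';

-- Bulk-deleting historical data is a cheap partition drop,
-- not a row-by-row DELETE.
ALTER TABLE members DROP PARTITION p0;
```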

If the data has obvious hot spots — beyond which rows are rarely accessed — put the hot data in its own partition so it can be cached in memory. Queries then touch only a small partitioned table, making effective use of indexes and the cache.

In addition, MySQL has an older, simpler partitioning mechanism — merge tables (Merge Table) — which is more restrictive and lacks optimizations; it is not recommended, and the newer partitioning mechanism should be used instead.

Vertical split

Vertical database splitting divides tables across databases according to their relatedness. For example, if one database contains both user data and order data, a vertical split puts user data into a user database and order data into an order database.

Vertical table splitting divides a single table by columns. Commonly, a wide table with many fields is split by commonly used versus rarely used fields; each resulting table holds the same number of rows but different columns, linked through the shared primary key.

For example, the original user table is:

After a vertical split, it is:
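The original figures are not reproduced here, but a hedged sketch of such a split — using the Users and UserExtras names the article uses later, with illustrative column choices — might look like:

```sql
-- Hot, frequently read fields stay in the main table.
CREATE TABLE Users (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(16) NOT NULL,
    email VARCHAR(35)
);

-- Wide, rarely used fields move to an extension table
-- that shares the same primary key value.
CREATE TABLE UserExtras (
    id BIGINT UNSIGNED NOT NULL PRIMARY KEY,  -- same id as Users
    bio TEXT,
    preferences TEXT
);
```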

The advantages of vertical splitting are:

Row data becomes smaller, so one block holds more rows, reducing the number of block I/O operations per query (fewer blocks are read each time).

Cache usage can be maximized: when splitting vertically, group rarely changing fields together and frequently changing fields together.

Data maintenance is simple.

The disadvantages are:

The primary key is redundant and redundant columns need to be managed.

It causes table JOIN operations (increasing CPU overhead); this pressure can be moved off the database by performing the join on the application server.

There is still the problem of excessive amount of data in a single table (horizontal split is required).

Transaction processing is complex.

Horizontal split

Horizontal splitting distributes data across tables and databases according to some strategy, so that each piece of data lands on a different MySQL table or database; this achieves a distributed effect and can support very large data volumes. The table partitioning described earlier is essentially a special kind of intra-database table split.

Splitting tables within one database does not distribute the data across machines, so it only solves the problem of a single table being too large.

It therefore does little to relieve pressure on the MySQL server itself — the tables still compete for I/O, CPU, and network on the same physical machine — which is what database splitting is for.

If the previous vertically split user table is split horizontally, the result is:

In practice, vertical and horizontal splits are often combined: Users_A_M and Users_N_Z are each further split into Users and UserExtras, yielding four tables.
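A minimal sketch of the horizontal half of that scheme, assuming the Users table from the vertical split exists and rows are routed by the first letter of username (the routing itself lives in the application or middleware, not in SQL):

```sql
-- The same schema repeated per shard:
CREATE TABLE Users_A_M LIKE Users;
CREATE TABLE Users_N_Z LIKE Users;

-- The application routes by the sharding key, e.g.:
--   username 'alice' -> Users_A_M
--   username 'oscar' -> Users_N_Z
```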

The advantages of horizontal splitting are:

It removes the single-database bottlenecks of large data volume and high concurrency.

The modification of the application side is less.

The stability and load capacity of the system are improved.

The disadvantages are:

The consistency of fragmented transactions is difficult to solve.

Cross-node JOIN has poor performance and complex logic.

Repeated expansion of the data is difficult and carries a heavy maintenance burden.

Sharding principles

The sharding principles are as follows:

If you can avoid sharding, do: refer to single-table optimization first.

Keep the number of shards as small as possible and distribute them as evenly as possible across data nodes, because the more shards a query spans, the worse its overall performance (though still better than all data in one shard); expand capacity and add shards only when necessary.

Sharding rules need careful selection and advance planning; consider data growth patterns, data access patterns, shard relatedness, and shard expansion.

Popular sharding strategies are range sharding, list (enumeration) sharding, and consistent-hash sharding, all of which facilitate expansion.

Try not to span multiple shards of SQL in a transaction. Distributed transactions have always been a difficult problem to deal with.

Optimize query conditions and avoid SELECT *: large result sets consume a lot of bandwidth and CPU. Avoid returning huge result sets, and index the most frequently used query statements.

Reduce the possibility of cross-library JOIN through data redundancy and table partitioning.

Special emphasis belongs on choosing the sharding rule. If a table's data has an obvious time dimension — orders, transaction records, and so on — it is usually suited to range sharding by time, because such data is timely: we mostly care about recent data, and query conditions often filter on a time field.

A better approach is to shard currently active data with a relatively short time span while storing historical data with a longer span.

In general, the choice of sharding key depends on the conditions of the most frequent queries, because a query with no WHERE clause on the sharding key traverses all shards and performs worst; the more such queries there are, the greater the load on the system, so avoid them as much as possible.

Solution

Because of the complex logic involved in horizontal splitting, many mature solutions now exist. They fall into two main categories: client architecture and proxy architecture.

Client architecture

The data access layer (JDBC, DataSource, MyBatis, etc.) is modified to manage multiple data sources through configuration; the application connects directly to the databases, and shard results are merged inside the module. This is usually delivered as a JAR package.

This is an example of a client architecture:

As shown, sharding is implemented inside the application server by modifying the Spring JDBC layer.

The advantages of the client architecture are:

The application connects directly to the databases, reducing the downtime risk of depending on an extra peripheral system.

Low integration cost, with no extra components to operate and maintain.

The disadvantages are:

Limited to the database access layer, so extensibility is average; it may be inadequate for more complex systems.

The load of the sharding logic falls on the application servers, adding risk there.

Proxy architecture

An independent middleware manages all data sources and handles sharding and result merging; the back-end database cluster is transparent to front-end applications. The proxy component must be deployed, operated, and maintained independently.

This is an example of a proxy architecture:

Proxy components are generally deployed as a cluster to share load and avoid a single point of failure, and may need to be coordinated by service components such as ZooKeeper.

The advantages of the proxy architecture are:

It can handle very complex requirements, is not limited by the original database access layer implementation, and is highly extensible.

Transparent to the application server without any additional load.

The disadvantages are:

The proxy middleware must be deployed, operated, and maintained independently, which is costly.

The application reaches the database through the proxy, adding an extra network hop, which costs performance and adds risk.

Comparison of different schemes

At present, there are still many options for the industry to choose from, but how should they be chosen?

In my opinion, it can be considered according to the following ideas:

Decide between the client architecture and the proxy architecture: small, medium, or relatively simple scenarios tend toward the client architecture, while complex scenarios or large-scale systems tend toward the proxy architecture.

Check whether the specific features you need are supported: for example, if cross-node ORDER BY is required, products supporting it take priority.

Rule out products not updated within the past year: stagnant development can mean no maintenance or technical support.

Prefer, in this order: big company → community → small company → individual.

Choose ones that have a good reputation, such as Github stars, user quantity and quality, and user feedback.

Prefer open source: projects often have special needs that may require changing the source code.

In accordance with the above ideas, the following options are recommended:

Client architecture: ShardingJDBC

Proxy architecture: MyCat or Atlas

MySQL-compatible and horizontally scalable database

There are also some open source databases that are compatible with MySQL protocols, such as:

TiDB

Cubrid

However, their industrial maturity still lags behind MySQL, and they require significant operations investment. To migrate existing MySQL data to a scalable new database, you can also consider cloud databases:

Aliyun PetaData

Aliyun OceanBase

Tencent Cloud DCDB

NoSQL

Sharding on MySQL is dancing in shackles. In fact, many large tables have little need for an RDBMS like MySQL at all and do not require ACID guarantees.

Consider migrating these tables to NoSQL to completely solve the horizontal scaling problem, such as:

Log class, monitoring class, statistical data

Unstructured or weakly structured data

Data with no strong transaction requirements and few associated operations.


Author: please call me boss
