
How to optimize tens of millions of large tables by MySQL


This article introduces the relevant knowledge of "how to optimize tens of millions of large tables in MySQL". Many people run into this dilemma in real cases, so let me walk you through how to deal with these situations. I hope you read carefully and come away with something!

How to optimize a table with tens of millions of rows is a very technical problem. Our intuition usually jumps straight to sharding or data partitioning. Here I would like to step back and sort the topic out, and sum up some experience with you; suggestions are also welcome.

Starting from a jumble of ideas, through constant self-criticism, and with reference to some team experience, I sorted out the following outline.

Since we are going to get to the bottom of this problem, we must go back to the source. I will divide the problem into three parts:

"tens of millions", "big table", and "optimization",

which correspond to three dimensions:

"data volume", "object", and "target".

Let me explain them step by step and arrive at a series of solutions.

1. Amount of data: tens of millions

In fact, ten million is just a sensory number: the "large amount of data" in our impression. We need to refine this concept, because data volume changes with business and time. We should examine this indicator with dynamic thinking and apply different strategies to different scenarios.

1) The data volume is tens of millions and may reach 100 million or more.

This is typical of flow or logging business data, which grows steadily over time and very easily exceeds the 10-million threshold.

2) The data volume is tens of millions and relatively stable.

Relatively stable volumes usually belong to state-oriented data. For example, with 10 million users, each user's information has a corresponding row in the table; as the business grows, this level remains relatively stable.

3) The data volume is tens of millions, but there should not be that much data.

We usually discover this situation passively, and usually too late. For example, you find a configuration table with tens of millions of rows, or a table whose data has accumulated for a long time and 99% of it is expired or junk data.

Data volume is only an overall impression; we need a closer understanding of the data itself, which leads to the second part.

2. Object: the data table

Data flows through the database like multiple pipelines; the data inside them differs in usage and ownership.

By business type, data generally falls into three categories:

(1) Pipeline data

Pipeline data is stateless, and individual business operations are independent of each other: each operation generates a new record, such as a transaction flow or payment flow. As long as the new record can be inserted, the operation completes. The characteristic is that later data does not depend on earlier data; all data flows into the database in time order.

(2) State data

State data carries state, multiple business operations depend on it, and its accuracy must be guaranteed. For example, when recharging an account, you must read the original balance before the payment can succeed.

(3) Configuration data

This type of data is small in volume and simple in structure; it is generally static and changes very infrequently.

At this point we have the overall background. If we want to optimize, we face a 3x3 matrix; and if we also consider each table's read-write ratio (read-heavy/write-light, read-light/write-heavy, and so on), it expands to roughly 24 combinations. Obviously exhaustive enumeration is neither shown here nor necessary; different policies can be specified according to the storage and business characteristics of different data.

Instead, we grasp the key points and sort out some common optimization ideas, especially the core ones, which serve as the yardstick of the whole optimization design; the difficulty determines both the motivation and the risk of doing it.

As for the optimization schemes, I will elaborate along the business-oriented dimension.

3. Goal: optimization

At this stage we talk about the optimization plan itself; summed up, it is relatively comprehensive.

The whole is divided into five parts: standard design, business layer optimization, architecture layer optimization, database optimization, and management optimization.

In fact, what we usually call sharding schemes (splitting databases and tables) is only a small part of this; expanded, the toolbox is much richer.

This is not hard to understand: the data volume we must support is in the tens of millions, which is fairly large, and a DBA usually maintains more than one such table. How to manage these tables well, support expansion as the business develops, and still guarantee performance — these are the mountains in front of us.

Let's talk about these five types of improvements separately:

Optimization design scheme 1: standard design

The first thing to mention is standard design, not some high-end design solution.

Hegel said: order is the first condition of freedom. This is especially important in work that is divided and collaborative; otherwise teams entangle each other and problems multiply.

Under specification design I would like to mention the following specifications; they are only part of a development specification and can serve as a reference.

The essence of a specification is not to solve problems but to effectively rule out a class of potential ones. For the specifications that a ten-million-row table should follow, I have combed out the following details, which basically cover our common design and usage problems. For example, declaring every field as varchar(500) by default is a very non-standard implementation. Let's go through these specifications.

1) Configuration specification

(1) MySQL databases use the InnoDB storage engine by default.

(2) Make sure the character set is unified: the character sets of the MySQL instance, databases, and tables are all UTF8, and anywhere the character set can be set (such as application connections and display layers) is also set to UTF8.

Note: UTF8 in MySQL cannot store emoji and other 4-byte characters; UTF8MB4 is needed for that and can be set at the MySQL character-set level. In MySQL 8.0 the default is already UTF8MB4; unify or customize it according to the company's business situation.

(3) The transaction isolation level of MySQL is RR (Repeatable-Read) by default; it is recommended to set it to RC (Read-Committed) during initialization, which is more suitable for OLTP business (see the sketch after this list).

(4) Tables in a database should be planned reasonably and the data volume of a single table controlled. For MySQL, it is recommended to keep the number of records in a single table within 20 million.

(5) For a MySQL instance, keep the number of databases and tables as small as possible: generally no more than 50 databases, and no more than 500 tables per database (including partitioned tables).
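As a minimal sketch of items (2) and (3), assuming a hypothetical database named app_db (the my.cnf equivalent is noted in comments):

```sql
-- Inspect the effective character set and transaction isolation level:
SHOW VARIABLES LIKE 'character_set%';
SELECT @@transaction_isolation;   -- @@tx_isolation on pre-8.0 versions

-- Create a database with an explicit 4-byte-safe character set:
CREATE DATABASE app_db DEFAULT CHARACTER SET utf8mb4;

-- Switch the session to READ COMMITTED (to make it persistent, set
-- transaction_isolation = READ-COMMITTED in my.cnf):
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
```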

2) Table creation specification

(1) Foreign key constraints are forbidden in InnoDB tables; integrity can be guaranteed at the program level.

(2) Use DECIMAL instead of FLOAT and DOUBLE to store exact numeric values.

(3) There is no need to define a display width for integers; for example, use INT instead of INT(4).

(4) The ENUM type is not recommended; TINYINT can be used instead.

(5) Avoid TEXT and BLOB types where possible. If you must use them, it is recommended to split large or rarely used descriptive fields into a separate table; in addition, storing pictures or files in the database is forbidden.

(6) Use YEAR(4) instead of YEAR(2) when storing years.

(7) It is recommended to define fields as NOT NULL.

(8) It is recommended that the DBA provide an SQL audit tool, and that table-creation statements pass through it for standardization review (a table definition following these rules is sketched below).
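To make the rules concrete, here is a hedged sketch of a table definition that follows them; the order_info table and its columns are purely illustrative:

```sql
-- Hypothetical table following the rules above: InnoDB, no foreign keys,
-- DECIMAL for exact amounts, plain INT/TINYINT, every column NOT NULL.
CREATE TABLE order_info (
    id          BIGINT        NOT NULL AUTO_INCREMENT COMMENT 'primary key',
    user_id     BIGINT        NOT NULL COMMENT 'buyer id; integrity checked in code, no FK',
    amount      DECIMAL(12,2) NOT NULL COMMENT 'order amount (exact, not FLOAT/DOUBLE)',
    status      TINYINT       NOT NULL DEFAULT 0 COMMENT '0=created,1=paid,2=shipped (instead of ENUM)',
    create_time DATETIME      NOT NULL COMMENT 'creation time',
    PRIMARY KEY (id)
) ENGINE = InnoDB DEFAULT CHARSET = utf8mb4 COMMENT 'hypothetical order table';
```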

3) Naming convention

(1) Library, table, and field names are all lowercase.

(2) Library, table, field, and index names use lowercase letters separated by "_".

(3) It is recommended that library, table, and field names not exceed 12 characters. (They support up to 64 characters, but keeping them within 12 makes them uniform, easy to recognize, and cheaper to transmit.)

(4) Library, table, and field names should be self-explanatory at a glance, so that no comment is needed to understand them.

In brief, the object naming conventions come down to: all lowercase, underscore-separated, at most 12 characters, and self-describing.

4) Index specification

(1) Recommended index naming rules: idx_col1_col2[_colN] and uniq_col1_col2[_colN] (abbreviate if the field names are too long).

(2) It is recommended that the number of fields in an index not exceed 5.

(3) Keep the number of indexes on a single table within 5.

(4) InnoDB tables should generally have an explicit primary key column, especially in high-availability cluster scenarios.

(5) When building a composite index, give priority to the fields with high selectivity.

(6) UPDATE and DELETE statements need indexes matching their WHERE conditions.

(7) Fuzzy queries with a leading % are not recommended, such as LIKE '%weibo'; they cannot use an index and result in a full table scan.

(8) Make rational use of covering indexes. For example, for SELECT email, uid FROM user_email WHERE uid=xx, if uid is not the primary key, you can create the covering index idx_uid_email(uid, email) to improve query efficiency (see the sketch after this list).

(9) Avoid applying functions to indexed fields; otherwise the index becomes unusable for the query.

(10) Contact the DBA when deciding whether an index needs to be changed.
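The covering-index rule in item (8) can be sketched as follows; the user_email schema is assumed for illustration:

```sql
-- Assumed schema for the covering-index example in rule (8):
CREATE TABLE user_email (
    id    BIGINT       NOT NULL AUTO_INCREMENT,
    uid   BIGINT       NOT NULL,
    email VARCHAR(128) NOT NULL,
    PRIMARY KEY (id)
) ENGINE = InnoDB;

-- Named per rule (1); (uid, email) covers the query below entirely,
-- so no lookup back to the table rows is needed:
CREATE INDEX idx_uid_email ON user_email (uid, email);

SELECT email, uid FROM user_email WHERE uid = 100;

-- Anti-patterns from rules (7) and (9):
-- SELECT ... WHERE email LIKE '%weibo';      -- leading % => full scan
-- SELECT ... WHERE UPPER(email) = 'A@B.COM'; -- function disables the index
```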

5) Application specification

(1) Avoid stored procedures, triggers, and custom functions: they easily couple business logic with the database and become a bottleneck when you later move to a distributed solution.

(2) Prefer UNION ALL and use UNION less: UNION ALL does not deduplicate and saves the sorting operation, so it is relatively faster than UNION. If deduplication is not required, UNION ALL is preferred.

(3) Prefer LIMIT N and avoid LIMIT M, N, especially when the table is large or M is relatively large.

(4) Reduce or avoid sorting. For example, if a GROUP BY statement does not need sorting, you can add ORDER BY NULL.

(5) Use COUNT(*) instead of COUNT(primary_key) and COUNT(1) to count rows; avoid COUNT(*) operations on InnoDB tables. Counters with strong real-time requirements can use Memcache or Redis; non-real-time statistics can use a separate statistics table updated periodically.

(6) When changing a field (MODIFY COLUMN / CHANGE COLUMN), the original comment attribute must be carried along, otherwise the comment is lost after the modification.

(7) Use prepared statements; they improve performance and prevent SQL injection.

(8) The IN clause in an SQL statement should not contain too many values.

(9) UPDATE and DELETE statements must have an explicit WHERE condition.

(10) Values in WHERE conditions must match the field's data type, to avoid implicit type conversion by MySQL.

(11) SELECT and INSERT statements must specify field names explicitly; SELECT * and INSERT INTO table_name VALUES () are prohibited.

(12) Submit INSERT statements in batches (INSERT INTO table_name (...) VALUES (), (), ()...), but the number of value lists should not be too large (see the sketch after this list).
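Rule (12) in sketch form, reusing the hypothetical order_info table from the table-creation sketch:

```sql
-- One batched INSERT (rule 12) with explicit column names (rule 11):
INSERT INTO order_info (user_id, amount, status, create_time) VALUES
    (101, 20.00, 0, NOW()),
    (102, 35.50, 0, NOW()),
    (103,  9.90, 0, NOW());
```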

Optimization design scheme 2: business layer optimization

Business layer optimization is usually the most profitable kind, and it is fully visible to the business layer. The main levers are business splitting, data splitting, and two common optimization scenarios (read-heavy/write-light and write-heavy/read-light).

1) Business split

Split the mixed business into independent services

Separation of state and historical data

Business splitting means splitting one mixed business into clearer independent businesses — business 1, business 2, and so on. The total business volume is still very large, but each part is relatively independent, and reliability is still guaranteed.

For the separation of state data and historical data, let me give an example.

Suppose we have a table account in which a user's balance is 100.

After the data changes, we need to be able to trace the history of the change: when the state data is updated — the balance is increased by 100 — the balance becomes 200.

This process corresponds to one UPDATE statement plus one INSERT statement.

We can split this into two data sets, account and account_hist: account_hist receives two INSERT records and account receives one UPDATE, as sketched below.
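A minimal sketch of what this looks like, assuming illustrative account(user_id, balance) and account_hist(user_id, balance, create_time) schemas:

```sql
-- History table: every change is appended as an INSERT (illustrative schema).
INSERT INTO account_hist (user_id, balance, create_time)
VALUES (1001, 100, NOW());   -- balance before the recharge
INSERT INTO account_hist (user_id, balance, create_time)
VALUES (1001, 200, NOW());   -- balance after the recharge

-- State table: only the latest balance is kept.
UPDATE account SET balance = 200 WHERE user_id = 1001;
```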

This is a very basic form of hot-cold separation; it can greatly reduce maintenance complexity and improve business response efficiency.

2) Data split

2.1 Split by date. This is common; along the date dimension the change at the program level is very small, but the scalability benefit is large (see the sketch after this list).

Split data by day, such as test_20191021

Split data by week or month, such as test_201910

Split data by quarter or year, such as test_2019
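Under the assumption of a base table named test (taken from the naming examples above), the split tables can be created like this:

```sql
-- Clone the base table's structure for each period:
CREATE TABLE test_20191021 LIKE test;  -- daily table
CREATE TABLE test_201910   LIKE test;  -- monthly table
CREATE TABLE test_2019     LIKE test;  -- yearly table
```

Writes then target the current period's table, and old periods can be archived or dropped as whole tables.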

2.2 Use partitioning. Partitioned tables are also a common approach, mostly with HASH, RANGE, and similar schemes. In MySQL, however, I do not recommend partitioned tables: as storage grows, although the data is split, it is still hard to scale horizontally in the end, and MySQL has better ways to scale out.

2.3 The read-heavy/write-light optimization scenario

Using caching technology such as Redis, read requests hit the cache layer, which greatly reduces the pressure of hot-data queries on MySQL.

2.4 The write-heavy/read-light optimization scenario. Three steps can be taken:

1) Use asynchronous commit mode; the most intuitive gain for the application layer is performance, since synchronous waiting is minimized.

2) Use queueing technology: a large number of write requests can be buffered in a queue and written in batches.

3) Reduce the write frequency. This one is harder to grasp, so let me give an example.

For business data such as points or product classifications, where the business priority is slightly lower, if updates are too frequent you can lengthen the update interval (for example, from once per minute to once every 10 minutes) to reduce the frequency.

For example, take updating state data with a score of 200. If the business updates too frequently in a short period — say 100 updates per minute, with point values ranging from 100 to 10000 — the changes can be accumulated and submitted in batches at a fixed frequency.

Instead of generating 100 transactions (200 SQL statements), the same work can be folded into 2 SQL statements, as sketched below.
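A hedged sketch of that folding, assuming illustrative user_score and user_score_hist tables and three sample increments (100, 300, 150):

```sql
-- Before: each increment is its own transaction (UPDATE + history INSERT),
-- i.e. 100 transactions / 200 statements per minute:
-- UPDATE user_score SET score = score + 100 WHERE user_id = 1001;
-- INSERT INTO user_score_hist (user_id, delta, create_time)
--     VALUES (1001, 100, NOW());

-- After: accumulate the increments in the application, then flush once:
-- one aggregated UPDATE plus one batched history INSERT.
UPDATE user_score SET score = score + 550 WHERE user_id = 1001;  -- 100+300+150
INSERT INTO user_score_hist (user_id, delta, create_time) VALUES
    (1001, 100, NOW()),
    (1001, 300, NOW()),
    (1001, 150, NOW());
```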

The concrete thresholds, such as the exact update frequency, can be discussed and decided according to the specific business scenario.

Optimization design scheme 3: architecture layer optimization

Architecture layer optimization is what we usually regard as the high-tech work: introducing new components at the architecture level according to the business scenario.

3.1. Horizontal system scaling scenarios

3.1.1 Use middleware technology to achieve data routing and horizontal scaling. Common middleware includes MyCAT, ShardingSphere, ProxySQL, and so on.

3.1.2 Use read-write separation, which scales out read traffic and applies mostly to state tables. If a certain delay is acceptable, a multi-replica setup can scale reads horizontally, possibly combined with middleware such as MyCAT, ProxySQL, MaxScale, or MySQL Router.

3.1.3 Adopt load balancing, such as LVS, or domain-name-based service discovery such as Consul.

3.2. For business scenarios that mix OLTP and OLAP, you can use NewSQL, giving priority to HTAP technology stacks compatible with the MySQL protocol, such as TiDB.

3.3. For offline statistics scenarios there are several options.

3.3.1 Use a NoSQL/analytic architecture. There are two main types: one is a data warehouse compatible with the MySQL protocol, such as Infobright or ColumnStore; the other is column-oriented storage in a heterogeneous direction, such as HBase.

3.3.2 Use a data warehouse based on an MPP architecture, such as Greenplum, for the statistics.

Optimization design scheme 4: database optimization

In database optimization there are actually many cards to play, though relatively speaking the room is smaller; let's go through them one by one.

4.1 Transaction optimization

Select the transaction model according to the business scenario and whether the business strongly depends on transactions.

For the transaction dimensionality-reduction strategies, let me give a few small examples.

4.1.1 Dimensionality-reduction strategy 1: convert stored procedure calls into transparent SQL calls

For new business, stored procedures are obviously not a good idea. Compared with commercial databases, the functionality and performance of MySQL stored procedures still need verification, and for today's lightweight business processing, stored procedures are too "heavy".

Some application architectures look distributed, but the database layer is driven by stored procedures that encapsulate a lot of logic, which is hard to debug and not very portable. Business logic and performance pressure both land at the database level, which easily becomes the bottleneck and makes true distribution impossible.

So a clear direction for improvement is to turn stored procedures into plain SQL calls, which greatly improves business processing efficiency and keeps the database interface simple, clear, and controllable.

4.1.2 Dimensionality-reduction strategy 2: convert DDL operations into DML operations

Some businesses frequently and urgently need to add fields to a table, which exhausts both the DBA and the developers. Imagine a table with hundreds of fields, basically named name1, name2, ..., name100 — the design itself is problematic, never mind the performance. The root cause is dynamically changing requirements: a game item may have 20 attributes that grow to 40 a month later, so every item carries 40 attributes whether used or not, with a lot of redundancy.

We mentioned the basic design elements in the specification section; on top of that, keep the number of fields bounded. If you need this kind of extensibility, implement it through configuration: convert the dynamically added fields into configuration rows, which can be modified and extended by DML and are far more dynamic and extensible for data entry, as sketched below.
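One common way to realize this, sketched under an assumed equipment example (the equipment_attr table and its columns are hypothetical):

```sql
-- Dynamic attributes live as rows instead of columns:
CREATE TABLE equipment_attr (
    equipment_id BIGINT       NOT NULL,
    attr_name    VARCHAR(64)  NOT NULL,
    attr_value   VARCHAR(255) NOT NULL,
    PRIMARY KEY (equipment_id, attr_name)
) ENGINE = InnoDB;

-- Adding the 21st attribute is now a DML INSERT, not an ALTER TABLE:
INSERT INTO equipment_attr (equipment_id, attr_name, attr_value)
VALUES (42, 'fire_resistance', '15');
```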

4.1.3 Dimensionality-reduction strategy 3: convert Delete operations into efficient operations

Some businesses need to clean up periodic data regularly — for example, keeping only one month of data and clearing anything older. When the table is large, the cost of such a DELETE is too high, so we have two kinds of solutions that turn the DELETE into something more efficient.

The first is to build periodic tables along business dimensions — monthly, weekly, or daily tables — so that data cleanup becomes a relatively controllable and efficient operation.

The second is to use MySQL's RENAME operation. For example, if a table of 20 million rows needs 99% of its data cleaned out, we can copy the records that match the retention condition into a new table and then "transpose" the two tables, as sketched below.
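A sketch of the "transposition", assuming an illustrative big_table with a create_time retention condition:

```sql
-- Copy the ~1% of rows to keep into a fresh table, then swap atomically:
CREATE TABLE big_table_new LIKE big_table;

INSERT INTO big_table_new
SELECT * FROM big_table
WHERE create_time >= DATE_SUB(NOW(), INTERVAL 1 MONTH);

RENAME TABLE big_table     TO big_table_old,
             big_table_new TO big_table;

-- big_table_old can later be dropped (or archived) at low cost.
```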

4.2 SQL optimization

Relatively speaking, SQL optimization calls for minimalist design; most of the points are already in the standard design section, and if the specifications are followed most problems are eliminated. A few points to add:

4.2.1 Simplify SQL statements. Simplification is a sharp weapon of SQL optimization: the simpler, the better.

4.2.2 Avoid or eliminate complex multi-table joins as much as possible. Large-table joins are a nightmare for large-table processing: once this door is opened, more and more requirements demand joins, and for performance optimization there is no turning back — not to mention that large-table joins are a weakness of MySQL. Hash Join has only just been introduced; do not treat it as a weapon of mass destruction that makes the problems commercial databases struggled with for years simply disappear.

4.2.3 Avoid anti-joins and semi-joins in SQL as much as possible; these are weak points of the optimizer. What are anti-joins and semi-joins? They are easy to understand: NOT IN and NOT EXISTS are anti-joins, IN and EXISTS are semi-joins. On a table of tens of millions of rows this kind of construct can cost several orders of magnitude in performance, as the sketch below shows.
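For illustration, a common rewrite of an anti-join, assuming hypothetical orders and blocked_users tables (and that blocked_users.id is NOT NULL, since NOT IN behaves differently in the presence of NULLs):

```sql
-- Anti-join as written (a weak spot for the optimizer on large tables):
SELECT o.id
FROM orders o
WHERE o.user_id NOT IN (SELECT b.id FROM blocked_users b);

-- Equivalent LEFT JOIN ... IS NULL form, assuming blocked_users.id is
-- NOT NULL; this shape is usually friendlier to MySQL's optimizer:
SELECT o.id
FROM orders o
LEFT JOIN blocked_users b ON b.id = o.user_id
WHERE b.id IS NULL;
```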

4.3 Index optimization

Index optimization is a matter of degree that must be grasped when optimizing large tables.

4.3.1 First, there must be a primary key. This is rule one of the specification design, and no rebuttal is accepted.

4.3.2 Second, SQL queries should be driven by indexes or unique indexes, keeping the query model as simple as possible.

4.3.3 Finally, eliminate range queries where possible; on tables of tens of millions of rows, reduce range scans as much as you can.

Optimization design scheme 5: management optimization

This part is the most easily overlooked of all the solutions. I put it at the end, and I also want to pay tribute to our operations colleagues, who quietly take responsibility for many problems the rest of us take for granted.

Generally speaking, cleaning data out of a table of tens of millions of rows is time-consuming, so I suggest building hot-cold separation into the design from the start. That may sound like a mouthful, so let me give an example: converting the Drop operation of a large table into a reversible DDL operation.

A Drop is committed by default and is irreversible — in database operations it is synonymous with "deleting the database and running away". MySQL currently has no built-in recovery for a Drop short of restoring from backup, but we can turn the Drop into a reversible DDL operation.

By default, every table in MySQL has a corresponding ibd file. We can replace the Drop with a RENAME that migrates the table from testdb to testdb_arch. In terms of permissions, testdb_arch is invisible to the business, so the rename smoothly implements the "deletion"; once it is confirmed after an agreed period that the data can be cleaned up, the cleanup happens out of sight of existing business processes, as sketched below.
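A sketch using the testdb / testdb_arch names from the text (the table name is illustrative):

```sql
-- Reversible "drop": move the table into an archive schema that the
-- business account cannot see; the ibd file moves with it.
CREATE DATABASE IF NOT EXISTS testdb_arch;

RENAME TABLE testdb.big_table TO testdb_arch.big_table;

-- If the data is still needed, rename it back; once the retention
-- window passes, drop it for real:
-- DROP TABLE testdb_arch.big_table;
```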

In addition, two more suggestions for large-table changes: prefer low-peak online changes, for example with the pt-osc tool, or make the changes during a maintenance window. I will not elaborate further here.

Finally, to sum up, it comes down to one sentence:

Optimizing a table of tens of millions of rows means optimizing at an acceptable cost according to the business scenario; it is never an isolated, one-dimensional optimization.

This concludes "How to optimize tens of millions of large tables in MySQL". Thank you for reading. If you want to learn more about the industry, you can follow this site, where the editor will keep producing practical, high-quality articles!
