How to Optimize a Large MySQL Table with 20 Million Rows


This article explains how to optimize a large MySQL table holding roughly 20 million rows. The content is straightforward and easy to follow.

Overview of the problem

The system runs on Aliyun RDS for MySQL (MySQL 5.6). One online record table accumulated nearly 20 million rows of user data in six months and held 40 million rows over the past year. Queries were extremely slow, causing daily stalls and seriously affecting the business.

Background: the people who designed the old system back then probably never finished college; both the table design and the SQL statements were not merely rubbish, they could not be looked at directly. The original developers have left, and maintenance fell to me. This is the legendary "can't maintain it? run away!", and I am the one who fell into the pit!

I set out to solve the problem, and this log is the record.

Overview of the solutions

Plan 1: optimize the existing MySQL database. Advantages: the existing business is unaffected, the source code needs no changes, and the cost is the lowest. Disadvantage: there is an optimization ceiling; once the data volume passes 100 million rows it hits a wall.

Plan 2: upgrade to a 100% MySQL-compatible database. Advantages: the existing business is unaffected, the source code needs no changes, and you hardly need to do anything to improve database performance. Disadvantage: it costs more money.

Plan 3: go all the way to a big-data solution and replace MySQL with a NewSQL/NoSQL database. Advantages: strong scalability, low cost, no data-volume ceiling. Disadvantage: the source code must be modified.

The three plans above can be applied in sequence. There is no need to move to NoSQL below 100 million rows; the development cost is too high. I tried all three, and each became a working solution. Along the way I expressed my condolences to the departed developers ten thousand times :)

Plan 1 details: optimize the existing MySQL database

I discussed the problem with an Aliyun database expert by phone, searched Google for solutions, and asked the experts in a group chat. The summary (all distilled essence) is as follows:

1. Consider performance when designing the database and creating tables

2. Optimize how SQL statements are written

3. Partitioning

4. Table splitting

5. Database splitting

1. Consider performance when designing the database and creating tables

The MySQL database itself is highly flexible, which means its performance depends heavily on the developer's skill: a skilled developer gets high performance out of MySQL. This is a common trait of relational databases, and it is why companies usually pay their DBAs well.

When designing tables, you should pay attention to:

Avoid NULL values in table fields; NULL is hard to query against and takes extra index space. Use a default of 0 instead.

Use INT rather than BIGINT, and add UNSIGNED if the value is non-negative (which doubles the range). Better still, use TINYINT, SMALLINT, or MEDIUMINT where they fit.

Use enumerations or integers instead of string types.

Prefer TIMESTAMP over DATETIME.

Do not put too many fields in a single table; fewer than 20 is recommended.

Store IP addresses as integers (a sketch covering these rules follows this list).
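As a minimal sketch of these rules (the table and column names are hypothetical, chosen only for illustration):

CREATE TABLE access_log (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,                    -- INT UNSIGNED rather than BIGINT
    user_type TINYINT UNSIGNED NOT NULL DEFAULT 0,              -- small integer code, NOT NULL with default 0
    ip INT UNSIGNED NOT NULL DEFAULT 0,                         -- IP stored as an integer
    accessed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,   -- TIMESTAMP rather than DATETIME
    PRIMARY KEY (id)
) ENGINE=InnoDB;

-- MySQL's built-in functions convert IPs to and from integer form:
INSERT INTO access_log (user_type, ip) VALUES (1, INET_ATON('192.168.0.1'));
SELECT INET_NTOA(ip) FROM access_log WHERE id = 1;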

Indexes

More indexes are not always better. Create targeted indexes based on the actual queries, and consider indexing the columns used in WHERE and ORDER BY clauses. Use EXPLAIN to check whether a query uses an index or does a full table scan.

Avoid NULL tests on fields in the WHERE clause; they make the engine abandon the index and do a full table scan.

Fields with low value dispersion, such as "gender" with only two or three distinct values, are poor index candidates.

Index character fields by prefix only.

Character fields are best not used as primary keys.

Foreign keys are unnecessary; enforce the constraint in the application.

Avoid UNIQUE where possible; enforce the constraint in the application.

When using multi-column indexes, keep the column order consistent with the query conditions, and delete single-column indexes that become unnecessary (see the sketch after this list).
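A minimal sketch of the last two points, reusing the hypothetical access_log table from above:

-- Composite index whose column order matches the query conditions:
CREATE INDEX idx_type_time ON access_log (user_type, accessed_at);

-- EXPLAIN shows whether the index is used or the whole table is scanned:
EXPLAIN SELECT id FROM access_log
WHERE user_type = 1
ORDER BY accessed_at DESC;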

In short: use appropriate data types and choose appropriate indexes.

Choose the appropriate data type:

(1) Use the smallest data type that can hold the value: integer < date/time < char/varchar < blob.

(2) Use simple data types. Integers are cheaper to process than characters, because string comparison is more complex. For example, store times in an INT column, and store IPs in an integer column via the conversion functions.

(3) Use sensible field lengths; fixed-length tables are faster. Use ENUM or CHAR instead of VARCHAR.

(4) Define fields as NOT NULL wherever possible.

(5) Use TEXT as little as possible.

Choose the appropriate index columns:

(1) Frequently queried columns: those appearing in WHERE, GROUP BY, ORDER BY, and ON clauses.

(2) Columns used in WHERE conditions with =, BETWEEN, IN, or LIKE 'string%' (trailing wildcard).

(3) Columns of small length: the smaller the indexed field, the better, because the database's storage unit is the page, and the more entries fit in one page the better.

(4) Columns with high dispersion (many distinct values) go first in a composite index. To check dispersion, count the distinct values of each column; the larger the count, the higher the dispersion (see the sketch below).
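Counting distinct values to compare dispersion might look like this (same hypothetical table):

SELECT COUNT(DISTINCT user_type) AS type_count,
       COUNT(DISTINCT ip)        AS ip_count
FROM access_log;
-- The column with the larger count is more selective and belongs
-- first in a composite index.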

The original developers have run away and the tables were created long ago. I cannot modify them, so: this approach cannot be implemented. Abandon!

2. Optimize how SQL statements are written

Use LIMIT to cap the number of rows a query returns.

Avoid SELECT *; list only the fields you need.

Use JOIN instead of subqueries.

Split large DELETE or INSERT statements into batches (see the sketch after this list).

Enable the slow query log to find the slow SQL.

Do not compute on columns: in SELECT id FROM t WHERE age + 1 = 10, any operation on a column, including database functions and expression evaluation, causes a table scan. Move the computation to the right of the equals sign whenever possible.

Keep SQL statements as simple as possible: one SQL statement can use only one CPU; break large statements into small ones to shorten lock times; a single huge SQL statement can block the whole database.

Rewrite OR as IN: OR has O(n) efficiency while IN has O(log n); keep the number of IN values below 200.

Avoid functions and triggers; implement the logic in the application.

Avoid LIKE '%xxx' queries (leading wildcard).

Use JOIN sparingly.

Compare values of the same type: '123' with '123', and 123 with 123.

Avoid the != or <> operators in the WHERE clause; they make the engine abandon the index and do a full table scan.

For consecutive values, use BETWEEN instead of IN: SELECT id FROM t WHERE num BETWEEN 1 AND 5.

Do not pull the whole table for list data; paginate with LIMIT and keep each page small.
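Two of the tips above, sketched against the hypothetical table from earlier (the slow-log settings use standard MySQL variables; on RDS they are normally set through the console):

-- Find slow statements via the slow query log:
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 1;   -- log anything slower than 1 second

-- Split a large DELETE into batches so locks are held only briefly;
-- repeat the statement until it affects zero rows:
DELETE FROM access_log
WHERE accessed_at < '2018-01-01'
LIMIT 10000;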

The original developers have run away and the application is already live. I cannot modify the SQL, so: this approach cannot be executed. Abandon!

Engine

The two engines in widespread use today are MyISAM and InnoDB:

MyISAM

MyISAM was the default engine in MySQL 5.1 and earlier. Its characteristics:

No row locks: reads take locks on all the tables they touch, and writes take exclusive locks on the table.

No transaction support.

No foreign key support.

No safe recovery after a crash.

New records can be inserted into a table while read queries are running on it.

Supports full-text indexing of BLOB and TEXT columns, covering their first 500 characters.

Supports delayed index updates, greatly improving write performance.

Tables that will no longer be modified can be compressed, greatly reducing disk usage.

InnoDB

InnoDB became the default engine as of MySQL 5.5. Its characteristics:

Supports row locks, and uses MVCC to support high concurrency.

Supports transactions.

Supports foreign keys.

Supports safe recovery after a crash.

Does not support full-text indexing (InnoDB gained full-text indexing in MySQL 5.6.4).

Generally speaking, MyISAM suits SELECT-heavy tables, while InnoDB suits INSERT- and UPDATE-heavy tables. A sketch of checking and converting the engine follows.
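To check which engine a table uses and convert it (hypothetical table name; note that the ALTER rebuilds the table, which is slow on large tables):

SHOW TABLE STATUS WHERE Name = 'access_log';

ALTER TABLE access_log ENGINE = InnoDB;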

MyISAM may be super fast and light on storage, but the application requires transaction support, so InnoDB is mandatory. This approach cannot be executed either. Abandon!

3. Partitioning

Partitioning, introduced in MySQL 5.1, is a simple form of horizontal splitting. Users add partitioning clauses when creating the table; it is transparent to the application and requires no code changes.

To the user, a partitioned table is one independent logical table, but underneath it is made up of multiple physical sub-tables. The partitioning code is really a wrapper over a group of underlying table objects; to the SQL layer it is a black box that completely encapsulates the underlying layer. The way MySQL implements partitioning also means that indexes are defined per partition sub-table: there is no global index.

SQL statements need to be tuned for the partitioned table: put the partition-key column into the WHERE conditions so the query is pruned to a small number of partitions, otherwise all partitions are scanned. EXPLAIN PARTITIONS shows which partitions a given statement will hit, which guides the SQL optimization. In my tests, even queries that did not filter on the partition column got faster, so this measure is worth trying.
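For example, against this article's readroom_website table (the id value is made up), partition pruning can be verified like this on MySQL 5.6:

EXPLAIN PARTITIONS
SELECT * FROM readroom_website WHERE id = 12345;
-- The "partitions" column of the output lists which partitions the
-- query touches; if every partition appears, no pruning is happening.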

The benefits of partitioning are:

A single table can store more data.

Partitioned data is easier to maintain: large batches of data can be removed cleanly by dropping whole partitions, and newly inserted data can be absorbed by adding partitions. Individual partitions can also be optimized, checked, and repaired on their own.

Some queries can be pruned, from their conditions alone, to only a few partitions, and these will be very fast.

Partitioned data can also be spread across different physical devices, making efficient use of multiple pieces of hardware.

Partitioning can be used to avoid certain special bottlenecks, such as mutually exclusive access to a single InnoDB index, or inode lock contention on ext3 file systems.

Individual partitions can be backed up and restored.

Limitations and drawbacks of partitioning:

A table can have at most 1024 partitions.

If the table has a primary key or unique index, the partitioning column must be part of every primary key and every unique index (a concrete sketch follows this list).

Partitioned tables cannot use foreign key constraints.

NULL values defeat partition pruning.

All partitions must use the same storage engine.
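The primary-key rule above in concrete form: to partition on a time column, that column must be part of every unique key. A minimal sketch (hypothetical table):

CREATE TABLE t_partitioned (
    id INT UNSIGNED NOT NULL,
    accesstime DATETIME NOT NULL,
    PRIMARY KEY (id, accesstime)          -- must include the partitioning column
)
PARTITION BY RANGE (YEAR(accesstime)) (
    PARTITION p2017 VALUES LESS THAN (2018),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);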

Types of partition (sketches of RANGE and HASH follow this list):

RANGE partitioning: rows are assigned to partitions based on column values falling within given contiguous intervals.

LIST partitioning: like RANGE partitioning, except that partitions are selected by the column value matching one of a set of discrete values.

HASH partitioning: the partition is selected by the return value of a user-defined expression computed over the column values of the rows being inserted. The expression may be any expression valid in MySQL that produces a non-negative integer value.

KEY partitioning: like HASH partitioning, except that only one or more columns are supplied and the MySQL server provides its own hashing function; the columns are not required to hold integer values.
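Minimal sketches of the two types I tried, on this article's table (assuming the unique-key rule above is satisfied; the month boundaries are abbreviated to keep the example short):

-- RANGE partitioning by month (the first attempt; three partitions shown,
-- the real table would carry one per month):
ALTER TABLE readroom_website
PARTITION BY RANGE (MONTH(accesstime)) (
    PARTITION p01  VALUES LESS THAN (2),
    PARTITION p02  VALUES LESS THAN (3),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);

-- HASH partitioning on id (the approach that finally worked):
ALTER TABLE readroom_website
PARTITION BY HASH (id)
PARTITIONS 64;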

For the detailed concepts of MySQL partitioning, see Google or the official documentation; I am only scratching the surface here.

I first created 12 RANGE partitions by month. Query efficiency improved roughly sixfold, but the effect was not dramatic enough. So I switched to HASH partitioning on id, split into 64 partitions, and query speed improved markedly. Problem solved!

The result is as follows:

PARTITION BY HASH (id) PARTITIONS 64

SELECT COUNT(*) FROM readroom_website;   -- 11,901,336 rows
/* Affected rows: 0  Found rows: 1  Warnings: 0  Duration for 1 query: 5.734 sec. */

SELECT * FROM readroom_website WHERE MONTH(accesstime) = 11 LIMIT 10;
/* Affected rows: 0  Found rows: 10  Warnings: 0  Duration for 1 query: 0.719 sec. */

4. Table splitting

Table splitting: when a big table is still slow to query after all the optimization above, split it into multiple tables, turn one query into multiple queries, and merge the results back to the user.

Splitting can be vertical or horizontal, and usually keys on one field. For example, splitting on the id field into 100 tables named tableName_id0 through tableName_id99 (see the sketch below).
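A sketch of how the application might route a query under that scheme (names hypothetical, matching the example above):

-- The shard suffix for a given id is id % 100:
SELECT CONCAT('tableName_id', 12345 % 100);   -- returns 'tableName_id45'

-- The application then issues the real query against that shard:
-- SELECT * FROM tableName_id45 WHERE id = 12345;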

But: table splitting means modifying the source code, which creates a lot of work for development and greatly increases cost. It only suits projects that anticipate a huge data volume at the start of development and design the split in from day one; retrofitting it after the application is live is far too expensive. Choosing this plan also costs more than plans 2 and 3 below! So it is not recommended.

5. Database splitting

Splitting one database into several. My advice is to do read-write separation instead; a true database split also brings heavy development costs, and the loss outweighs the gain. Not recommended.

Plan 2 details: upgrade to a 100% MySQL-compatible database

If MySQL's performance is not good enough, change databases. To leave the source code untouched and migrate the existing business smoothly, the replacement must be 100% MySQL-compatible.

Open-source options

TiDB https://github.com/pingcap/tidb

Cubrid https://www.cubrid.org/

Open-source databases bring considerable operations and maintenance cost, and their industrial maturity still trails MySQL's; there are many pitfalls to step in. If your company requires self-hosted databases, choose this class of product.

Cloud database options

Aliyun POLARDB

https://www.aliyun.com/product/polardb?spm=a2c4g.11174283.cloudEssentials.47.7a984b5cS7h5wH

Official introduction: POLARDB is a next-generation relational, distributed, cloud-native database developed by Aliyun. It is 100% MySQL-compatible, stores up to 100 TB, and performs up to six times better than MySQL. POLARDB combines the stability, reliability, and high performance of commercial databases with the simplicity, scalability, and continuous iteration of open-source databases, at one tenth the cost of a commercial database.

I opened a trial instance. It supports free MySQL data migration with no operations overhead, performance improved roughly tenfold, and the price is close to RDS. A good alternative!

Aliyun OceanBase

Used by Taobao, it withstands Singles' Day traffic and performs excellently, but it was still in beta testing and I could not try it. Worth looking forward to, though.

Aliyun HybridDB for MySQL (formerly PetaData)

https://www.aliyun.com/product/petadata?spm=a2c4g.11174283.cloudEssentials.54.7a984b5cS7h5wH

Official introduction: HybridDB for MySQL (formerly PetaData) is an HTAP (Hybrid Transaction/Analytical Processing) relational database that supports both massive-scale online transactions (OLTP) and online analytics (OLAP).

I tested it too. It is a combined OLTP-and-OLAP solution, but the price is too high, up to 10 yuan per hour. For pure storage it is too wasteful; it suits workloads that keep storage and analysis together.

Tencent Cloud DCDB

https://cloud.tencent.com/product/dcdb_for_tdsql

Official introduction: DCDB, also known as TDSQL, is a high-performance distributed database compatible with the MySQL protocol and syntax. It supports automatic horizontal splitting: the business sees a single complete logical table while the data is evenly split across multiple shards. Each shard uses a master/slave architecture by default, with full solutions for disaster recovery, backup and restore, monitoring, and zero-downtime scaling, suiting TB- and PB-scale data scenarios.

I don't like using Tencent's products, so I won't say much. The reason: when something goes wrong you cannot reach anyone, and a production problem nobody can solve is a headache! But it is cheap and suits very small companies that want to experiment.

Plan 3 details: drop MySQL and move the data to a big-data engine

Once the data volume grows past 100 million rows, there is little choice but to turn to big data.

Open source solution

The Hadoop family. HBase and Hive are the key pieces. But the operations cost is high; ordinary companies cannot afford to play, and without an investment of 100,000 yuan there will not be much output!

Cloud solution

There are more options here, and this is the future trend: professional companies provide big data as a service, and small companies or individuals purchase it. Like water and electricity, big data becomes a public utility present in every corner of society.

The best domestic provider is Aliyun.

I chose Aliyun's MaxCompute with DataWorks: very comfortable to use, pay-as-you-go, and extremely cheap.

MaxCompute can be understood as a hosted counterpart of open-source Hive. It offers SQL, MapReduce, AI algorithms, Python scripts, shell scripts, and other ways to work with data. Data is presented as tables, stored distributedly, and processed by scheduled and batch tasks. DataWorks adds workflow-style management of the processing tasks, plus scheduling and monitoring.

You could also choose Aliyun HBase or other products; my scenario here is mainly offline processing, so I chose MaxCompute. The work was almost entirely in the graphical interface, about 300 lines of SQL, and under 100 yuan of cost solved the data-processing problem.
