A Tutorial on Sharding Databases and Tables

2025-02-23 Update From: SLTechnology News & Howtos

Shulou(Shulou.com)05/31 Report--

This article walks through the main approaches to sharding MySQL tables and databases: why a single table hits its limits, vertical and horizontal table splits, MySQL partitioning, database sharding, and the problems that splitting introduces.

MySQL table size limit

MySQL is typically deployed on Linux (for example CentOS 7.4) with the default InnoDB storage engine and independent tablespaces enabled (parameter innodb_file_per_table=1). When you create a table orders, a data file orders.ibd is generated automatically, and its maximum size is limited by the file system's block size. The table below shows the relationship between block size and maximum file size on the ext3 file system.

Block size    Max file size    Max file system size
1KB           16GB             2TB
2KB           256GB            8TB
4KB           2TB              16TB
8KB           16TB             32TB

View operating system page size and block size
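The commands for this step did not survive extraction; on a typical Linux host they would look something like the following (the /var/lib/mysql path is the conventional MySQL datadir and is an assumption here):

```shell
# OS memory page size in bytes -- typically 4096
getconf PAGESIZE

# Filesystem block size of the volume holding the MySQL datadir
# (stat -f reports it without root; fall back to / if the path is absent)
stat -f -c '%S' /var/lib/mysql 2>/dev/null || stat -f -c '%S' /
```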

With a 4KB block size, a single MySQL table on ext3 therefore cannot exceed 2TB. Let's do a quick calculation: assuming an average row length of 16KB (MySQL caps a row at 65,535 bytes, just under 64KB), how many rows fit? 2TB = 2 x 1024 x 1024 x 1024 KB, and 2,147,483,648 KB / 16 KB = 134,217,728 rows, roughly 134 million.

For a takeout trading system like Ele.me's or Meituan's, 134 million order rows is easy to reach: at an average of 20 million orders a day, the table passes 140 million within a week. Such systems typically run a multi-site active-active architecture, writing each order to the IDC data center nearest the user; that is a high-end sharding scheme beyond the scope of this article.

Table splitting schemes

Table splitting applies when a single table's data volume grows so fast that it hurts business-interface response times, while the load on the MySQL instance itself is still low. In that case you only need to split the table, not the database (the instance).

A table's size satisfies the formula TABLE_SIZE = AVG_ROW_SIZE x ROWS: a table is too large either because the average row is too long (too many fields) or because there are too many rows. This leads to two splitting schemes: splitting fields (vertical split) and splitting records (horizontal split).

Vertical table split

Take the order table orders again. Splitting by field requires deciding which fields go where so as to maximize DML performance on the table. The conventional approach is hot/cold separation: put the frequently accessed fields in one table and the rarely used ones in another.

After the split, orders becomes two tables, orders01 and orders02, stored on disk as two data files, orders01.ibd and orders02.ibd, raising the maximum total size to 4TB. How do queries work after the split? For example:
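The sample query was stripped during extraction; given the columns named in the next paragraph, it was presumably along these lines (the literal 10001 is a placeholder):

```sql
SELECT order_id, order_sn, source
FROM orders
WHERE order_id = 10001;
```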

The columns in the select list above now live in two tables (order_id and order_sn in orders01, source in orders02), so the SQL has to be rewritten as follows.
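The rewritten statement was also stripped; assuming both child tables keep order_id as the shared primary key, the rewrite is a join of the two:

```sql
SELECT o1.order_id, o1.order_sn, o2.source
FROM orders01 o1
JOIN orders02 o2 ON o1.order_id = o2.order_id
WHERE o1.order_id = 10001;
```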

With database middleware such as mycat or sharding-sphere, this query rewriting happens automatically. Without middleware, you can maintain a routing table (primary key ID, original table name, field name, child table name): every time a SQL statement is parsed, look up the required child tables by original table name plus field name, rewrite the SQL, execute it, and return the result. That is a large and error-prone code change, which is why vertical splits are rarely used in real business systems.

If a business table has to store Text-type data, however, a vertical split can move the text fields into a child table to shrink the main table.

With the text columns split off, the original table's average row length drops and it can hold more rows.

Horizontal table split

A horizontal split divides a table by record. For example, the order table orders currently holds 20 million rows and, given business growth, is expected to reach 100 million within a year. Following the Aliyun RDS for MySQL best practice that a single table should stay under 5 million rows, the 100 million rows can be split across 20 child tables.

The question is what to split on: the primary key id, or the user's user_id? Splitting evenly by primary key id works, but there are almost no scenarios that query orders by id; most business access filters on user_id, which has high selectivity (each user_id maps to only a few order rows). Using user_id as the sharding key satisfies most business scenarios and distributes data fairly evenly across the child tables.
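The routing rule described here is a simple modulus on the sharding key. A minimal sketch (20 shards and the orders_N naming are assumptions taken from this example):

```python
# Route a user's orders to one of 20 child tables by user_id modulus.
N_TABLES = 20

def route_orders_table(user_id: int) -> str:
    """Return the name of the child table holding this user's orders."""
    return f"orders_{user_id % N_TABLES}"
```

For example, route_orders_table(1001) yields "orders_1", since 1001 % 20 = 1.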

In this way, orders is split into 20 child tables, stored by InnoDB as 20 data files (orders_0.ibd, orders_1.ibd, and so on). The statement select order_id, order_sn, source from orders where user_id = 1001 can then be routed straight to orders_1 (1001 % 20 = 1) and rewritten as select order_id, order_sn, source from orders_1 where user_id = 1001. Many middleware products already implement this query rewriting, e.g. sharding-sphere and sharding-jdbc.

Splitting by month

Accounting and billing systems typically run daily settlement jobs every night and monthly settlement jobs on the 1st of each month; once those jobs finish, the data in the relevant tables is static and the business layer no longer needs it. Given that access pattern, you can create one table per month. For the billing table bills, for example: bills_202010 for October, bills_202011 for November, and so on. After the monthly job completes, the tables can be archived to a history database and fed into the data warehouse ETL for analysis reports. Once the data is confirmed synchronized to the history database, the tables can be dropped to release space.
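The monthly naming convention above is mechanical enough to sketch as a helper (the bills_YYYYMM pattern comes from this example; the function name is hypothetical):

```python
from datetime import date

def bills_table_for(day: date) -> str:
    """Monthly table name for a billing date, e.g. bills_202010 for October 2020."""
    return f"bills_{day:%Y%m}"
```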

MySQL partition table

You may be thinking that after the horizontal split above, the code has to be modified to route each SQL statement correctly, execute it, and return the result, which lengthens the call chain. Is there a solution inside MySQL itself? Yes: MySQL's HASH partitioning. A conventional hash partition takes the modulus of the partition count, exactly like user_id % 20 above. Let's see an example.
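The table definition was stripped from this article; a minimal version of the HASH-partitioned orders table (column list abbreviated and illustrative) would be:

```sql
CREATE TABLE orders (
    order_id    BIGINT      NOT NULL,
    order_sn    VARCHAR(64) NOT NULL,
    source      VARCHAR(32),
    user_id     BIGINT      NOT NULL,
    create_time DATETIME    NOT NULL,
    PRIMARY KEY (order_id, user_id)  -- the partition key must appear in every unique key
) ENGINE = InnoDB
PARTITION BY HASH (user_id)
PARTITIONS 20;
```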

This creates 20 partitions, corresponding to 20 data files on disk (ordersp0.ibd through ordersp19.ibd). Now take a look at how SQL executes against it.
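The EXPLAIN example was stripped; it was presumably a query of this shape, whose plan demonstrates partition pruning:

```sql
EXPLAIN SELECT order_id, order_sn, source
FROM orders
WHERE user_id = 1019;
-- the plan's `partitions` column should list only p19, since 1019 % 20 = 19
```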

As the execution plan shows, filtering on the partition key user_id locates partition p19 directly (user_id = 1019, and 1019 % 20 = 19, hence p19), so only the data file for p19 is read. The advantage of this scheme is that MySQL implements the SQL routing internally; no business code needs to change.

Database splitting schemes

We have covered table splitting; when is it time to split the database? Most MySQL high-availability architectures are one master with multiple slaves, so all writes land on the master. As the business and data volume grow, many APIs slow down and frequently time out, and upgrading the MySQL instance's configuration no longer solves the problem. At that point there are usually two ways to split: by business, or by table. Both schemes follow.

Splitting databases by business

For example, the trading system's trade database is deployed on a single RDS instance. Transaction features keep growing, the tables for orders, prices, and inventory grow rapidly, some interfaces get slower, slow-query alarms pile up, and upgrading the RDS configuration barely helps. It is time to split by business and move the inventory- and price-related interfaces out.

After splitting by business module, the trade database becomes three RDS instances: write capacity increases, service response times drop, and system stability improves.

Splitting databases by table

The table-splitting schemes introduced above (vertical and horizontal) keep all the child tables in the same RDS instance. The corresponding database splits are vertical and horizontal database sharding, where the split happens at the RDS-instance level and the child tables land on different instances. Vertical database sharding is rarely used in real business, so it is not covered here.

For example, the order table orders of the trading database holds over 200 million rows and the RDS instance has hit a write bottleneck: an ordinary insert takes 50ms and CPU-utilization alarms arrive constantly, so it is time to shard the database. Based on the projected business growth, the plan is to add a second RDS instance with the same configuration and split orders into 20 child tables, 10 on each instance.

This resolves the oversized orders table. A query must first locate the right RDS instance via the sharding key user_id, then the specific child table, and only then perform the DML. The downsides are heavy code changes and a longer service call chain, which affects system stability. In practice, database middleware such as mycat and Aliyun's DRDS already implements this sharding.

Distributed database

The splitting schemes introduced above run into three main classes of problems:

Write performance bottleneck on a single MySQL master.

After sharding, SQL parsing and routing lengthen the service call chain and make the system less stable.

Dynamic scaling after sharding is hard to implement: starting with 20 tables, it is difficult to expand to 50 without affecting the business.

Problems after split

Vertical split

Cross-library Join problem

Before a vertical split, the system could fetch whatever it needed with a table Join; after the split, the tables may be distributed across different RDS instances, and Joins become troublesome. MySQL development guidelines generally forbid cross-database Joins, so how do you cope?

First reconsider whether the vertical split is reasonable; if it can be adjusted, adjust it first. If it cannot, practical experience suggests several common solutions.

Global table

Anyone who has sharded with mycat knows its concept of a global table: every DataNode holds a full copy of the data. Rarely modified data such as data dictionary tables fits this pattern and sidesteps the performance problems of cross-database Joins.

Data synchronization

In distributed systems, the databases of different services live on different RDS instances; when cross-database Joins are forbidden, data synchronization is one solution.

A data synchronization tool replicates the user database's users table into the trade database in real time, so the Join can be done directly inside trade. This depends on the stability of the sync tool: replication lag leads to inconsistent or dirty data, so you need a risk assessment and a fallback plan.

Distributed transaction problem

After the split, data is distributed across RDS instances, and a DML operation on one table becomes DML on multiple child tables, which brings in distributed transactions and the ACID properties. Two theories are relevant here: CAP (Consistency, Availability, Partition tolerance) and BASE (Basically Available, Soft state, Eventually consistent). Several different schemes address the distributed transaction problem.

MySQL XA transaction

MySQL supports distributed transactions (XA transactions, i.e. two-phase commit), split into a Prepare phase and a Commit phase. The transaction flow is as follows.
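The flow diagram was lost in extraction; the per-branch statement sequence on each participating MySQL node follows the standard XA syntax (the xid 'trx-1001' and the UPDATE are illustrative):

```sql
XA START 'trx-1001';
UPDATE orders_1 SET status = 'PAID' WHERE order_id = 10001;  -- this branch's work
XA END 'trx-1001';
XA PREPARE 'trx-1001';  -- phase 1: the node durably records the prepared branch
XA COMMIT 'trx-1001';   -- phase 2: issued by the coordinator once every node prepared
                        -- (XA ROLLBACK 'trx-1001' instead, if any node failed to prepare)
```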

If any participant refuses to prepare, the XA transaction manager instructs every database to roll back its branch of the transaction. The advantage is strong data consistency, suitable for businesses that require it; the disadvantages are a complex implementation, reduced availability, and a heavy performance cost, which make XA unsuitable for high-concurrency, high-performance scenarios.

Local message table

The local message table is probably the most widely used pattern in industry. Its core idea is to break a remote distributed transaction into a series of local transactions.

Processing flow

Message producer: create an additional message table that records delivery status. The message row and the business data are committed in one transaction, i.e. they live in the same database. The message is then delivered to the consumer via MQ; if delivery fails, it is sent again.

Message consumer: process the message and complete the local business logic. If the local transaction succeeds, processing is done; if it fails, execution is retried. If the failure is a business failure, send a compensation message back to the producer so it can roll back.

Both producer and consumer periodically scan their local message tables and resend unfinished or failed messages. With reliable automatic reconciliation on top, this scheme is very practical.
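The producer side of this pattern can be sketched in a few lines. Here sqlite stands in for the business database, and the orders/message schema is purely illustrative; the point is that the business row and the outgoing message commit in one local transaction:

```python
import sqlite3

# In-memory database standing in for the producer's business database.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders  (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE message (msg_id  INTEGER PRIMARY KEY AUTOINCREMENT,
                          payload TEXT NOT NULL,
                          status  TEXT NOT NULL DEFAULT 'PENDING');
""")

def create_order(order_id: int, amount: float) -> None:
    """Write the business row and its outgoing message in ONE local transaction."""
    with db:  # commits both inserts atomically, or rolls both back on error
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
        db.execute("INSERT INTO message (payload) VALUES (?)",
                   (f"order_created:{order_id}",))

def pending_messages():
    """What the periodic relay job would push to MQ and then mark as SENT."""
    return db.execute(
        "SELECT msg_id, payload FROM message WHERE status = 'PENDING'"
    ).fetchall()

create_order(10001, 99.5)
```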

Horizontal split

Distributed globally unique ID

InnoDB tables conventionally use an auto-increment primary key. After sharding, the tables are distributed across shards, and auto-increment IDs generated independently on each shard would collide; a Snowflake-style algorithm can generate globally unique IDs instead.
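A minimal Snowflake-style generator, following the conventional 41-bit timestamp / 10-bit worker / 12-bit sequence layout (the epoch constant is Twitter's original; any fixed past epoch works):

```python
import threading
import time

class Snowflake:
    """Snowflake-style IDs: (timestamp << 22) | (worker_id << 12) | sequence."""
    EPOCH = 1288834974657  # arbitrary fixed epoch in ms (Twitter's original)

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024  # 10-bit worker id
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            ms = int(time.time() * 1000)
            if ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit sequence
                if self.sequence == 0:          # sequence exhausted this ms:
                    while ms <= self.last_ms:   # spin to the next millisecond
                        ms = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = ms
            return ((ms - self.EPOCH) << 22) | (self.worker_id << 12) | self.sequence

gen = Snowflake(worker_id=1)
```

IDs from a single generator are unique and monotonically increasing, so they still work well as a clustered primary key.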

Sharding key selection

When selecting the sharding key, first inventory all the SQL against the table, and prefer a frequently used, high-cardinality field such as user_id or order_id: it both distributes the data evenly and locates rows quickly.

Data expansion

For example, the order table orders in the trade database has been split horizontally across two RDS instances. Now their write capacity is no longer enough, so a third RDS must be added and orders expanded from 20 child tables to 40 (user_id % 40). That requires a data migration to rebalance: downtime to move the data plus code changes, a rather thankless job.

Cross-library Join problem

This is the same as the cross-database Join problem under vertical split.

Sorting paging across libraries

For order by ... limit n, when the sort column is the shard key user_id, the target shard is located directly via the key. When it is not, e.g. order by create_time, things get complicated: each shard must sort and return its own results, and the coordinator merges and re-sorts the per-shard result sets before returning them to the user.
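The coordinator-side merge can be sketched as follows: each shard answers its own "ORDER BY create_time LIMIT n" locally, and the coordinator k-way-merges the already-sorted lists and applies the global limit (plain integers stand in for timestamps here):

```python
import heapq

def global_order_limit(shard_results, n):
    """Merge per-shard result lists (each already sorted ascending), keep first n."""
    return list(heapq.merge(*shard_results))[:n]

shard_a = [1, 4, 9]  # sorted result page from shard 0
shard_b = [2, 3, 8]  # sorted result page from shard 1
```

Note that for page k each shard must return its first k*n rows, which is why deep pagination across shards is expensive.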

Cross-library function processing

For aggregate functions such as max, min, sum, and count, apply the function on each shard first, then merge the per-shard results into the final answer to return.
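A sketch of that merge step: SUM and COUNT add up, MAX/MIN take the extreme of the per-shard extremes, and AVG must be rebuilt from the global SUM and COUNT (averaging the per-shard averages would be wrong). The dict shape is illustrative:

```python
def merge_aggregates(partials):
    """partials: per-shard dicts with 'sum', 'count', 'max', 'min' already computed."""
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return {
        "sum":   total,
        "count": count,
        "max":   max(p["max"] for p in partials),
        "min":   min(p["min"] for p in partials),
        "avg":   total / count if count else None,  # rebuilt, not averaged averages
    }

shards = [
    {"sum": 10, "count": 2, "max": 9,  "min": 1},
    {"sum": 30, "count": 3, "max": 15, "min": 5},
]
```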

ER fragmentation

Tables in an RDBMS often have relationships with one another. If those relationships are identified up front and related rows are stored on the same shard (ER sharding), cross-shard Joins can largely be avoided.

Non-shard-key filtering

Most business scenarios can filter on the sharding key, but some cannot, for example querying the order table orders by status and time range, as in the following SQL.
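The sample SQL was stripped here; a query filtering only on status and time range would look like this (the status code 2 and the date range are placeholders):

```sql
SELECT order_id, order_sn, source
FROM orders
WHERE status = 2
  AND create_time BETWEEN '2020-10-01 00:00:00' AND '2020-10-31 23:59:59';
```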

This is painful: every shard must be scanned and the per-shard results Unioned before replying to the client. A composite index (status, create_time) at least turns the scan into an index range scan and cuts the amount of data returned. For core business scenarios, consider a real-time data warehouse (an MPP analytical database such as ADB, or a distributed columnar database such as ClickHouse): synchronize the needed tables into it in real time and run such queries there. This, too, is a common solution in practice.

That concludes this tutorial on sharding databases and tables. Theory works best when paired with practice, so go and try it!
