How to Achieve Smooth Data Migration When Expanding Database and Table Sharding

2025-02-25 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/03 Report

This article explains how to achieve smooth data migration when expanding database and table sharding. The method described here is simple, fast, and practical, so interested readers may wish to read on.

I. Background

In 2020, the ride-hailing order system the author was responsible for (deployed in the "outside" elastic-computing environment; see the note below) went through a sharding expansion and database migration. The order system is deployed entirely on Alibaba Cloud: services run on Alibaba Cloud ECS, the database is Alibaba Cloud RDS, the configuration center is built in-house on top of Alibaba Cloud ACM, data synchronization is built in-house on top of Alibaba Cloud DTS, and the sharding component and distributed ID component are likewise self-developed.

The background of this expansion: the original setup had 4 instances and 4 databases, with 64 tables per database and 256 tables in total. Some single tables had already exceeded the 10-million-row mark, and at the current daily order volume, a single table would reach hundreds of millions of rows within a year. A single table with that much data causes database performance problems.

Note: "inside" and "outside" here refer to two independent elastic-computing network environments. "Inside" is the elastic-computing environment deployed on Alibaba's internal production network; it was originally built on Taobao's technology stack and mainly supports Taobao's business. "Outside" is the elastic-computing environment deployed on Alibaba Public Cloud, which supports Alibaba Cloud's computing business.

II. Capacity planning

1. Current sharding layout

4 instances (16C/64G/3T SSD), 4 libraries (one library per instance), 64 tables per library, 256 tables in total.

Table space usage was calculated with the one-click diagnostics in the RDS console (the test-environment database is shown as an example).

2. Capacity calculation

Number of instances

Database bottlenecks show up mainly in disk, CPU, memory, network, and connection count, and the connection count is in turn constrained mainly by CPU and memory. CPU and memory can be upgraded on the fly, but SSD disk capacity tops out at 6T (at most 3T below 32C, at most 6T at 32C and above).

However, considering cost at this stage, we first simply doubled the number of instances: 8 instances (16C/64G/3T SSD), with 4 databases per instance and 128 tables per database. (This was really a cost trade-off. In theory the "more databases, fewer tables" principle should apply; 128 tables in one database is too many, and 32 or 64 tables per database would be more appropriate.)

If instance pressure grows later, the instance configuration can be upgraded (16C/128G, 32C/128G, 32C/256G, etc.). If a single-instance upgrade is no longer enough, more instances can be added: the databases only need to be migrated to the new instances, which is a low-cost operation.

Number of tables

Estimating at most 10 million (1000w) rows per table, 4096 tables can support either 50 million (5000w) orders per day for 3 years (the October 1 holiday pressure-test standard) or 20 million (2000w) orders per day for 5 years. (There are many business tables, so the per-row size calculation is omitted here.)

Number of databases

32 databases with 128 tables each. In the future this can be expanded to at most 32 instances without any rehashing; only data migration is required.
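As a sanity check on the arithmetic above, here is a small illustrative calculation (the numbers come from this article; the class and method names are made up for the example):

```java
// Illustrative capacity arithmetic for the plan above; not project code.
public class CapacityPlan {
    // 8 instances x 4 databases x 128 tables = 4096 tables
    static long totalTables(int instances, int dbsPerInstance, int tablesPerDb) {
        return (long) instances * dbsPerInstance * tablesPerDb;
    }

    // Average rows per table after `years` at a given daily order volume
    static long rowsPerTable(long ordersPerDay, int years, long tables) {
        return ordersPerDay * 365L * years / tables;
    }
}
```

At 20 million orders per day for 5 years, the average table ends up just under the 10-million-row target, which is the "2000w * 5 years" claim above.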

Aliyun RDS Specifications and Price List

III. Data migration

Because the expansion involves a rehash (256 tables become 4096 tables) and Alibaba Cloud DTS only supports homogeneous database migration, we built our own data synchronization middleware on top of DTS's binlog-to-Kafka capability.

The migration work as a whole covers: preliminary preparation, the data synchronization pipeline (full synchronization of historical data, real-time synchronization of incremental data, rehash), the data verification pipeline (full verification, real-time verification, verification rule configuration), data repair tools, and so on.

1. Preparations

Unique business ID

Before data synchronization, the unique business ID of every table must be sorted out. Only once the unique business ID is determined can data synchronization be implemented.

Note:

Does any business use the database auto-increment ID as its business ID? If so, that business must be changed first. Fortunately, the order business does not.

Does every table have a unique index? During this review we found several tables without one.

If a table has no unique index, data can be duplicated during synchronization, so for the tables lacking one we first added a unique index (a composite unique index where necessary) chosen according to the business scenario.

As an aside: when Alibaba Cloud DTS performs homogeneous migration using the database auto-increment ID as the unique ID, bidirectional synchronization causes data-overwrite problems. There are solutions; our earlier practice was to give the old and new instances odd and even auto-increment IDs respectively, so their IDs could never conflict. Because this migration uses our self-developed bidirectional synchronization component, that problem is not discussed in detail here.

Sorting out table rules

Different sharding rules lead to different rehash and verification logic. We needed to sort out which tables are sharded by the user-ID dimension, which by a non-user-ID dimension, which are only split into tables (without a database split), which are not split at all, and so on.

2. Data synchronization

The overall data synchronization scheme is shown in the figure below. Synchronization is binlog-based and performed by an independent intermediate service, with no intrusion into business code.

Next, each link is introduced.

Full synchronization of historical data

A standalone service selects data from the old database in batches with a cursor, rehashes it, and batch-inserts it into the new database. Note that the JDBC connection-string parameter rewriteBatchedStatements=true must be set for the batch operations to take effect.
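A minimal JDBC sketch of this step, under stated assumptions: the class, table, and column names are illustrative, and the rehash that picks each row's target shard is omitted for brevity.

```java
import java.sql.*;

// Illustrative cursor-based full sync. Reads the old table in id-ordered pages
// and batch-inserts into the new database; rewriteBatchedStatements=true in the
// new database's connection URL lets the MySQL driver collapse the batch into
// one multi-row INSERT.
public class FullSync {
    // Keyset pagination: "WHERE id > lastId" is far cheaper than OFFSET on big tables.
    static String pageQuery(String table, long lastId, int batchSize) {
        return "SELECT * FROM " + table + " WHERE id > " + lastId
                + " ORDER BY id LIMIT " + batchSize;
    }

    static void copyTable(Connection oldDb, Connection newDb, String table) throws SQLException {
        long lastId = 0;
        while (true) {
            int rows = 0;
            try (Statement st = oldDb.createStatement();
                 ResultSet rs = st.executeQuery(pageQuery(table, lastId, 1000));
                 PreparedStatement ps = newDb.prepareStatement(
                         "INSERT INTO " + table + " (id, user_id, body) VALUES (?, ?, ?)")) {
                while (rs.next()) {
                    ps.setLong(1, rs.getLong("id"));
                    ps.setString(2, rs.getString("user_id"));
                    ps.setString(3, rs.getString("body"));
                    ps.addBatch();
                    lastId = rs.getLong("id");
                    rows++;
                }
                if (rows == 0) return;   // cursor exhausted
                ps.executeBatch();       // one round trip thanks to rewriteBatchedStatements
            }
        }
    }
}
```

Only the new-database URL needs the parameter, e.g. `jdbc:mysql://host:3306/db?rewriteBatchedStatements=true`; without it, Connector/J sends each batched INSERT as a separate statement.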

Another point to note is that historical data keeps being updated: if full synchronization starts first, freshly copied rows may no longer be the latest. The approach is therefore: first enable one-way incremental synchronization (old database to new database), but only let the Kafka messages accumulate without consuming them; then start full synchronization of the historical data; once the full sync completes, begin consuming the backlogged Kafka messages for incremental synchronization. (Keeping the full sync fast, so the backlog stays small, is also a key point.) This guarantees data consistency throughout the migration.

Real-time incremental synchronization

Incremental synchronization uses a real-time bidirectional scheme, chosen for stability, disaster recovery, and rollback capability during gray traffic switching. If the new database develops a stability problem or a data-consistency problem, traffic can be rolled back to the old database quickly, keeping the database stable and the data reliable.

Real-time incremental synchronization is implemented by data-sync, a self-developed synchronization component built on Alibaba Cloud DTS. The main flow: DTS's data-subscription capability automatically converts the subscribed database binlog into Kafka messages; the data-sync component subscribes to those messages and performs filtering, merging, grouping, rehashing, splitting by target table, batch insert/update, and finally offset commit.

Filtering circular messages: binlog messages produced by the synchronization loop itself must be filtered out. This problem matters and is described separately later.

Data merge: of multiple operations on the same record, only the last one is kept. To improve performance, the data-sync component does not apply each Kafka message immediately; it first puts messages into a local blocking queue, and a local scheduled task applies N queued messages every X seconds. Those N messages may contain several operations on the same record of the same table, so only the last one needs to be kept (similar to Redis AOF rewriting).

Update to insert: when merging, if an insert+update pair were collapsed to just the final update, execution would fail (the row does not exist in the target yet), so the update must be converted into an insert statement.

Merge by new table: the N records to be committed are split and grouped by target (new) table, so batched database operations can be issued per new table, improving insert efficiency.
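The three merge rules above can be sketched as follows (a simplified model: the real component carries full row images and SQL, and these class names are illustrative):

```java
import java.util.*;

// Illustrative merge step: keep only the last operation per (table, unique key),
// demote an UPDATE that follows an unflushed INSERT back to INSERT, and group
// the survivors by target (new) table for batched execution.
public class BinlogMerger {
    enum Op { INSERT, UPDATE }

    static class Event {
        final String newTable; final String uniqueKey; final Op op;
        Event(String newTable, String uniqueKey, Op op) {
            this.newTable = newTable; this.uniqueKey = uniqueKey; this.op = op;
        }
    }

    static Map<String, List<Event>> mergeAndGroup(List<Event> batch) {
        // LinkedHashMap preserves binlog arrival order of the surviving events
        Map<String, Event> last = new LinkedHashMap<>();
        for (Event e : batch) {
            String k = e.newTable + "|" + e.uniqueKey;
            Event prev = last.get(k);
            // If the row was INSERTed inside this same batch, a later UPDATE must
            // still be written as an INSERT: the row does not exist downstream yet.
            Op op = (prev != null && prev.op == Op.INSERT) ? Op.INSERT : e.op;
            last.put(k, new Event(e.newTable, e.uniqueKey, op));
        }
        Map<String, List<Event>> byTable = new LinkedHashMap<>();
        for (Event e : last.values()) {
            byTable.computeIfAbsent(e.newTable, t -> new ArrayList<Event>()).add(e);
        }
        return byTable;
    }
}
```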

There are a few issues to note throughout the process:

Question 1: How do we prevent data-consistency problems caused by out-of-order asynchronous messages?

First, Kafka messages consumed asynchronously can indeed arrive out of order in general, but the binlog itself is ordered, and DTS delivers it to Kafka in that order. What must be ensured here is that there is only a single consumer; that preserves the order of the data so that no state is overwritten, solving the data-consistency problem.

Question 2: Can messages be lost, for example when the consumer service restarts?

The offset is not auto-committed. Instead, after each batch of consumed data has been durably stored, the offset is asynchronously saved to a MySQL table. If the consumer service restarts or crashes, it reads the latest offset from MySQL after restarting and resumes consumption from there. The only cost is that a few messages may briefly be consumed twice, but because the binlog is ordered (as described above), the data still ends up consistent.
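A minimal sketch of this commit-after-store idea (the store is in-memory here purely for illustration; the article's component persists the offset to a MySQL table):

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative at-least-once offset handling: the offset is recorded only AFTER
// the batch has been applied, so a crash replays a few messages instead of
// losing them; ordered binlog replay makes the duplicates harmless.
public class OffsetStore {
    private final AtomicLong committed = new AtomicLong(-1);

    // Resume from the message after the last durably applied one.
    long loadOnRestart() { return committed.get() + 1; }

    void applyBatch(long fromOffset, int count, Runnable flushToDb) {
        flushToDb.run();                        // 1) write the rows to the new shards
        committed.set(fromOffset + count - 1);  // 2) only then record the offset
    }
}
```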

Question 3: Does converting update to insert lose any fields?

No. The binlog carries all fields of the row, so no fields are lost.

Question 4: Circular message problem.

This is described separately below.

Rehash

As mentioned earlier, because 256 tables become 4096 tables, each record's target database and table must be recomputed by rehashing.

To explain the rehash, we first have to introduce the current sharding strategy for order data: the last four digits of the user ID are redundantly embedded in the order ID, and the database number and table number are determined by a hash over those four digits.

During synchronization from the old database to the new one, the four user-ID digits carried in the order ID are taken modulo 4096 to locate the record in the new layout; from the new database back to the old one, the same four digits are taken modulo 256 to locate the record in the old layout.
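A sketch of the two rehash directions, assuming the trailing four characters of the order ID are the embedded user-ID digits and that the "hash" is a plain modulo (both assumptions for illustration; the article does not spell out the exact function):

```java
// Illustrative rehash rule: old layout = 256 tables, new layout = 4096 tables
// (32 databases x 128 tables), both keyed on the four user-ID digits embedded
// at the end of the order ID.
public class Rehash {
    static final int OLD_TABLES = 256;
    static final int NEW_TABLES = 4096;
    static final int TABLES_PER_NEW_DB = 128;

    // Assumption: the trailing 4 characters of the order ID are the digits.
    static int lastFourDigits(String orderId) {
        return Integer.parseInt(orderId.substring(orderId.length() - 4));
    }

    static int oldTableIndex(String orderId) { return lastFourDigits(orderId) % OLD_TABLES; }
    static int newTableIndex(String orderId) { return lastFourDigits(orderId) % NEW_TABLES; }
    static int newDbIndex(String orderId)    { return newTableIndex(orderId) / TABLES_PER_NEW_DB; }
}
```

Because 4096 is a multiple of 256, rows that shared an old table map to a predictable set of new tables, which is also why a later move to 32 instances needs no further rehash.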

Binlog Cycle Consumption Problem with Bidirectional Synchronization

Imagine the business writes a record to a table in the old instance, generating a binlog event. The data-sync middleware receives it and writes the record to the new instance, which generates a binlog event there, and the middleware receives that event too. The cycle goes on, messages multiply, and the ordering of the data is disrupted.

How do we solve this? We use a data-coloring scheme: as long as writes made by the data-sync middleware can be distinguished from writes made by the business, the middleware can simply skip its own echoes the next time those binlog events arrive.

To that end, the data-sync middleware requires a transaction table in every database instance. The table tb_transaction has only the fields id, tablename, status, create_time, and update_time, with status defaulting to 0.

Returning to the problem above: the business writes a record to a table in the old instance, generating a binlog event. When the data-sync middleware receives that event, it does the following:

-- Start a transaction; the transaction guarantees the statements below are applied atomically
start transaction;
set autocommit = 0;
-- Set status=1 in the transaction table: everything after this marker is "colored"
update tb_transaction set status = 1 where tablename = ${tableName};
-- The replayed binlog statements
insert xxx;
update xxx;
update xxx;
-- Set status=0: writes after this marker are no longer colored
update tb_transaction set status = 0 where tablename = ${tableName};
commit;

The data-sync middleware submits the statements above to the new instance as one package. After the data is applied, the new instance generates binlog events for these statements as well. When the middleware receives them, it only has to notice the tb_transaction row going to status=1 and then discard everything that follows until it sees status=0, after which it resumes applying data. This guarantees that the middleware only replays messages generated by the business.
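The discard window can be sketched as a tiny state machine (illustrative names; the real middleware works on parsed binlog events):

```java
// Illustrative loop filter: while a tb_transaction update with status=1 has been
// seen, every following event is a data-sync echo and is dropped; the status=0
// update closes the colored window.
public class LoopFilter {
    private boolean colored = false;

    /**
     * @param setStatusTo the status value written, or null for a normal row event
     * @return true if the event should be applied, false if it is an echo
     */
    boolean accept(String table, String setStatusTo) {
        if ("tb_transaction".equals(table)) {
            colored = "1".equals(setStatusTo);
            return false;   // the marker rows themselves are never replayed
        }
        return !colored;
    }
}
```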

3. Data verification

Data verification is implemented by the data-check module of the verification service. It is based on record-by-record comparison at the database level: each field of each record is checked for consistency, and any inconsistency is retried or alerted on according to the configured verification rules.

Full verification

Based on the old database, query whether each piece of data exists in the new database and whether each field is consistent.

Based on the new database, query whether each piece of data exists in the old database and whether each field is consistent.

Real-time verification

A check runs every 5 minutes: query the records updated in the old and the new database in the last 5+1 minutes and diff them.

Mismatched records are checked a second and a third time (to allow for concurrency and replication delay); if a record still differs after all three checks, an alarm is raised.
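The check-three-times rule can be sketched as follows (illustrative: the real data-check module diffs actual rows, while this model only tracks mismatch counters per business key):

```java
import java.util.*;

// Illustrative re-check logic: a key alerts only after staying mismatched for
// three consecutive checks; a clean re-check resets its counter. The 5+1 minute
// window gives one minute of overlap to absorb synchronization lag.
public class RealtimeCheck {
    private final Map<String, Integer> failCount = new HashMap<>();

    /** @return keys that have now failed 3 consecutive checks and should alert */
    List<String> recordDiff(Set<String> mismatchedKeys) {
        List<String> alerts = new ArrayList<>();
        failCount.keySet().retainAll(mismatchedKeys);   // clean keys drop out
        for (String k : mismatchedKeys) {
            int n = failCount.merge(k, 1, Integer::sum);
            if (n >= 3) alerts.add(k);
        }
        return alerts;
    }

    // [from, to) window covering the last 5+1 minutes
    static long[] window(long nowMillis) {
        return new long[]{ nowMillis - 6 * 60_000L, nowMillis };
    }
}
```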

4. Data repair

After data verification, once data inconsistency is found, it is necessary to repair the data.

There are two repair schemes. The first suits large-scale inconsistency: reset the Kafka offset and re-consume the data messages, overwriting the problematic data.

The second suits small-scale inconsistency: the data-repair module automatically pulls the SLS logs written by the data-check module, parses them, generates synchronization statements, and applies them to the target database.

IV. Gray switching data source

1. Overall gray traffic-switching scheme

Overall gray scheme: implemented along the SP and user dimensions. SP dimension: relies on the gray-environment switch. User dimension: relies on a percentage cut over the last four digits of the user ID.

The gray switch must be coordinated with a write stop (on the order of seconds). Why stop writes? Because data synchronization has some delay (normally milliseconds), and all operations on a record must be guaranteed to land on one instance. Otherwise, if the business has just modified a record in the old database and traffic switches to the new database before that change has synchronized, the stale data causes consistency problems. So the steps are:

Stop writes first.

Observe that data synchronization has fully caught up.

Switch the data source.

Finally, lift the write stop and resume normal business writes.

2. Preparation before the traffic switch: ABC verification

Although extensive testing had been done in the test environment before the switch, the test environment differs from production, and a database problem in production can be a disaster. Data verification and repair were described above, but intercepting a problem before it occurs is the most important part of service stability.

So we proposed the concept of ABC verification, prepared in the gray environment:

Buy two new database instances. The current order database is A; the two new instances are B and C.

Configure DTS synchronization from A to B (DTS supports homogeneous synchronization, so no rehash is needed). B serves as the verification copy of the old database, and C as the new database.

Use B and C for the production drill verification.

When the B/C drill is complete, configure formal bidirectional synchronization between A and C.

3. Gray switching steps

Specific grayscale scheme and data source switching process:

The code is configured in advance with both sets of sharding rules.

Configure the gray ratio in ACM.

The code intercepts mybatis requests and, based on the last four digits of the user ID against the gray percentage configured in ACM, passes a new-database flag to the sharding component through a ThreadLocal.

It also checks the current gray whitelist; if the user hits it, the new-database flag is likewise passed to the sharding component through the ThreadLocal.

The sharding component reads the new sharding rules from the ACM configuration and performs database reads and writes accordingly.

During the switch, writes are stopped for the users hit by the gray configuration, coordinated through ACM.
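The routing decision in the steps above can be sketched as follows (illustrative names; the real interceptor reads the ratio and whitelist from ACM, and the exact mapping from the four digits to a percentage is an assumption here):

```java
// Illustrative gray routing: whitelist users always go to the new database;
// otherwise the last four digits of the user ID, reduced to a 0-99 bucket,
// are compared with the configured gray percentage.
public class GrayRouter {
    // Set by the mybatis interceptor; read by the sharding component.
    static final ThreadLocal<Boolean> USE_NEW_DB = ThreadLocal.withInitial(() -> Boolean.FALSE);

    static boolean routeToNew(String userId, int grayPercent, java.util.Set<String> whitelist) {
        if (whitelist.contains(userId)) return true;
        int lastFour = Integer.parseInt(userId.substring(userId.length() - 4));
        return lastFour % 100 < grayPercent;   // grayPercent in [0, 100]
    }

    static void mark(String userId, int grayPercent, java.util.Set<String> whitelist) {
        USE_NEW_DB.set(routeToNew(userId, grayPercent, whitelist));
    }
}
```

Routing on the same digits that drive the sharding keys means a given user is always consistently on one side of the cut, which is what makes the per-user write stop workable.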

V. Summary

The whole data migration was fairly complex, and time was not abundant (the full-link pressure-test transformation was interleaved with it). Within the limited time, the team repeatedly discussed and dug out potential problems, then validated the solutions, not letting go of any step that could cause trouble. To repeat: intercepting problems before they occur is the most important part of service stability.

Many details arose along the way, from migration preparation to synchronization testing, and from settling the gray process to the formal production switch. Especially where business and data meet, there are many details to consider that this article does not list.

In the end, after nearly two months of intense work, the sharding expansion and data migration were completed smoothly, with no intrusion into business code and zero incidents.
