A tutorial on code changes and optimizations for the MySQL streaming tool Maxwell

2025-01-19 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 05/31 Report

This article covers the code changes and optimizations we made to the MySQL streaming tool Maxwell, based on problems we actually ran into. I hope you can read it carefully and take something useful away from it.

Maxwell is an open source product, and its codebase is much smaller than Canal's. After weighing the options, we chose Maxwell so we could get moving quickly.

From quick start to feature coverage, it is a solid product with good all-around support, and it sped up our technical research and iteration considerably.

A side-by-side comparison of the two tools usually ends up as a feature matrix (the ratings are subjective and for reference only). Since bootstrap was a hard requirement for us, the level of support for that feature weighed heavily in the trade-off.

Recently, while moving database data into our big data platform, we found several problems and areas for improvement in Maxwell:

1) Maxwell's service management currently only supports a start mode; stopping it means a manual kill, which is crude. After discussing with the author, signal handling can be used to achieve a graceful stop indirectly.
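As a hedged illustration of the signal-handling idea (this is not Maxwell's actual code), a JVM shutdown hook lets a long-running Java service react to SIGTERM or Ctrl-C and exit cleanly instead of being killed outright:

```java
// Hypothetical sketch: a cleaner stop path for a long-running service.
// A JVM shutdown hook fires on SIGTERM / Ctrl-C, giving the service a chance
// to flush state and close connections before the process exits.
public class GracefulStop {
    private static volatile boolean running = true;

    static boolean isRunning() { return running; }

    static void requestStop() { running = false; }

    public static void main(String[] args) throws InterruptedException {
        Runtime.getRuntime().addShutdownHook(new Thread(GracefulStop::requestStop));
        // stand-in for the service's main loop (bounded here so the demo terminates)
        for (int i = 0; i < 50 && isRunning(); i++) {
            Thread.sleep(10);
        }
        System.out.println("exiting cleanly");
    }
}
```

The same flag-plus-hook pattern is how most JVM daemons turn a `kill <pid>` (SIGTERM) into an orderly shutdown rather than an abrupt one.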

2) Maxwell's core configuration is the filter that selects which objects to synchronize, and it supports regular expressions and other patterns. But if the filtering rules are complex or keep being adjusted, every change requires a restart of the Maxwell service; there is no reload-style mechanism.
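A reload-style workaround can be approximated outside the tool. As a rough sketch (the class name, config format, and file layout are invented for illustration), a poller can watch the filter file's modification time and rebuild the filter without restarting the service:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

// Hypothetical sketch of a "reload" workaround: poll the filter config file's
// modification time and rebuild the in-memory filter when it changes,
// instead of restarting the whole service.
public class FilterReloader {
    private final Path configPath;
    private FileTime lastModified;
    private volatile String activeFilter = "";

    FilterReloader(Path configPath) throws Exception {
        this.configPath = configPath;
        reload();
    }

    // Convenience factory used for demonstration: writes the given content
    // to a temp file and builds a reloader over it.
    static FilterReloader fromTempFile(String content) throws Exception {
        Path p = Files.createTempFile("filter", ".conf");
        Files.write(p, content.getBytes());
        return new FilterReloader(p);
    }

    // Returns true if the filter was rebuilt because the file changed.
    boolean pollAndReload() throws Exception {
        FileTime now = Files.getLastModifiedTime(configPath);
        if (!now.equals(lastModified)) {
            reload();
            return true;
        }
        return false;
    }

    private void reload() throws Exception {
        lastModified = Files.getLastModifiedTime(configPath);
        activeFilter = new String(Files.readAllBytes(configPath)).trim();
    }

    String activeFilter() { return activeFilter; }
}
```

A real version would parse the file into Maxwell's filter structures and swap them atomically; the point is only that a mtime poll avoids a full restart.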

3) For DDL changes: once Maxwell's initialization is complete and the service is running, creating a table later causes Maxwell to record the change in the `schemas` table to maintain version history, but the existing metadata tables `tables` and `columns` are not updated. This skews how the back-end service resolves table structures (the DDL change does produce a corresponding JSON event).

1. Pinpointing the problem

Most of the items above are suggestions rather than fatal flaws, so overall progress could continue. Recently, however, we found data problems after synchronizing several large tables.

1) bootstrap takes a long time. Looking at Maxwell's monitoring, overall throughput is about 800 rows/s, which appears to be the synchronization bottleneck. A table with over 2 million rows takes about an hour to synchronize, which is a long time. In a recent test, serially initializing several tables of 10 million+ rows took roughly 2-3 hours, which is simply too long.

2) time field values differ in the synchronized data. This surfaced when comparing data between the mid-tier (where Maxwell runs) and the back end (Flink and Kudu): the comparison of bootstrapped data almost never matched, which means bootstrap has latent data problems. So attention turned to Maxwell's bootstrap code.

Reading the code logic genuinely surprised me. The problem only occurs in the bootstrap path: the time field of a source row holds one value, but after Maxwell's processing a time-zone calculation is applied, and the zone offset is silently added to the data.

The real issue here is not the performance risk of initialization but the intrusive corruption of the data, which leaves it in a mess.

For this problem, the analysis focused on how the time zone is handled. I originally assumed the change would be tiny, but debugging and setting up an integration environment took considerable effort.

2. Problem correction

The data difference comes from the time-zone handling of the datetime data type; in our case the offset is 13 hours.
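A 13-hour gap happens to match the winter difference between, for example, Asia/Shanghai (UTC+8) and America/New_York (UTC-5). The small self-contained demo below (not Maxwell code; the zones and literal are illustrative) shows the class of bug at play: `java.sql.Timestamp.valueOf` interprets a DATETIME literal in the JVM's default time zone, so the same literal maps to different instants under different zones:

```java
import java.sql.Timestamp;
import java.util.TimeZone;

public class DatetimeShift {
    // Parse the same DATETIME literal under a given JVM default zone and
    // return the resulting epoch millis. Timestamp.valueOf(String) uses the
    // JVM default zone, which is exactly why zone settings corrupt datetimes.
    static long epochUnderZone(String literal, String zoneId) {
        TimeZone prev = TimeZone.getDefault();
        try {
            TimeZone.setDefault(TimeZone.getTimeZone(zoneId));
            return Timestamp.valueOf(literal).getTime();
        } finally {
            TimeZone.setDefault(prev);
        }
    }

    public static void main(String[] args) {
        long shanghai = epochUnderZone("2021-01-15 10:00:00", "Asia/Shanghai");
        long newYork  = epochUnderZone("2021-01-15 10:00:00", "America/New_York");
        // Same literal, but the underlying instants differ by the zone gap
        System.out.println((newYork - shanghai) / 3_600_000); // 13 in winter
    }
}
```

In summer the same pair of zones differs by 12 hours (DST), which is another reason zone-dependent parsing of DATETIME values is dangerous in a replication pipeline.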

Check Maxwell's SynchronousBootstrapper class. After debugging, the code that needs to change sits in the setRowValues function: the datetime values read there go through a time-zone conversion, and adjusting that logic so the original values are kept removes the discrepancy. After the change, debugging and repeated testing show the whole bootstrap flow behaves correctly.

3. Trade-off and correction of performance problems

Beyond the fix above, we also made some smaller improvements.

The first problem is bootstrap performance: throughput seemed stuck at a bottleneck around 800 rows/s. I made the following improvements:

1) bootstrap reads each table with the pattern `select * from xxxx order by id;`. On a table with a large amount of data, the order by on the primary key forces a full index scan, which overall performs worse than a plain full table scan, so I simply removed the order by clause from the logic.

The change itself is very light:

```java
private ResultSet getAllRows(String databaseName, String tableName, Table table,
                             String whereClause, Connection connection) throws SQLException {
    ...
    // removed: no longer append "order by <pk>" for bootstrap reads
    // if (pk != null && !pk.equals("")) {
    //     sql += String.format(" order by %s", pk);
    // }
    ...
}
```

2) remove the 1-millisecond sleep after writing each row

Further analysis of the code showed that the bootstrap throughput bottleneck is an odd sleep(1) call: sleeping 1 ms after each row caps throughput at roughly 1,000 rows per second, which matches the ~800 rows/s we observed. The likely intent is that a bootstrap task generates a large burst of data, putting pressure on bandwidth and load, and the sleep throttles it to keep the whole thing under control.

In addition, bootstrap's log statistics count the number of rows synchronized; we do not depend on this heavily, and our data-verification process stops the slave before comparing data anyway.

Removing the throttle improves performance by about 3-5x, so you can keep or drop this logic depending on your situation. In our pipeline design the data is pulled from the slave side, so it does not impact the primary database. The change is again very light: just comment out sleep(1).

```java
public void performBootstrap(BootstrapTask task, AbstractProducer producer, Long currentSchemaID) throws Exception {
    ...
    producer.push(row);
    // Thread.sleep(1);   // commented out: this throttle capped throughput at ~1,000 rows/s
    ++insertedRows;
    ...
}
```

After the change, testing and comparison show much better performance: peaks above 6,000 rows/s, and the same initialization now completes in under 15 minutes.

These small improvements brought us some sense of achievement. Data synchronization has since continued to scale up with no reported data-quality problems, though there is still work to refine on this foundation.

4. Subsequent improvements to the direction of bootstrap

1) use a sharding approach to improve bootstrap

To improve extraction efficiency for tables with tens of millions of rows, the extraction can be split into ranges (taking into account the impact of concurrent data changes and writes); the current logic is too rigid.

```java
private ResultSet getAllRows(String databaseName, String tableName, Table table,
                             String whereClause, Connection connection) throws SQLException {
    Statement statement = createBatchStatement(connection);
    String pk = table.getPKString();
    String sql = String.format("select * from `%s`.`%s`", databaseName, tableName);
    if (whereClause != null && !whereClause.equals("")) {
        sql += String.format(" where %s", whereClause);
    }
    if (pk != null && !pk.equals("")) {
        sql += String.format(" order by %s", pk);
    }
    return statement.executeQuery(sql);
}
```
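The range-splitting idea from point 1) can be sketched as follows (a hypothetical helper, not Maxwell code): instead of one unbounded `select *`, the primary-key space is cut into fixed-size intervals, and each chunk becomes a bounded scan that can be run, paced, and retried independently:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of chunked bootstrap extraction: split [minId, maxId]
// into primary-key ranges of at most chunkSize rows of id-space each.
// A real implementation must also account for rows changing while chunks run.
public class ChunkedBootstrap {

    static List<String> chunkQueries(String db, String table, String pk,
                                     long minId, long maxId, long chunkSize) {
        List<String> queries = new ArrayList<>();
        for (long lo = minId; lo <= maxId; lo += chunkSize) {
            long hi = Math.min(lo + chunkSize - 1, maxId);
            queries.add(String.format(
                "select * from `%s`.`%s` where %s between %d and %d",
                db, table, pk, lo, hi));
        }
        return queries;
    }

    public static void main(String[] args) {
        // e.g. a 2.5M-row table split into three bounded scans
        for (String q : chunkQueries("test", "big_table", "id", 1, 2_500_000, 1_000_000)) {
            System.out.println(q);
        }
    }
}
```

Because each range query is bounded by the primary key, it stays an index range scan regardless of table size, and chunks can be parallelized or resumed after a failure.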

2) data dictionary index optimization

Some of the SQL that Maxwell runs against its own data dictionary executes frequently but is a full table scan at the database level; these details need further tuning.

For example, the following SQL statement:

> > explain select * from bootstrap where is_complete = 0 and client_id = 'dts_hb30_130_200_maxwell003' and (started_at is null or started_at)
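As a hedged illustration only (the column names are taken from the query above, and `maxwell` is the tool's default metadata schema; verify against the actual `bootstrap` table definition and data distribution first), a composite index matching the query's predicates would let this polling query avoid a full scan:

```sql
-- hypothetical index; check the real schema and column cardinality before applying
alter table maxwell.bootstrap
    add index idx_client_complete (client_id, is_complete, started_at);
```

Leading with `client_id` fits the equality predicate, `is_complete` narrows further, and `started_at` supports the range/null check without touching the base rows for filtering.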
