What is the application of DataX in data migration

2025-07-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article shares practical experience with applying DataX to data migration, drawn from two projects.

1. DataX definition

Let's start with a brief introduction to what DataX is.

DataX is an offline data synchronization tool and platform widely used within Alibaba Group. It provides efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), DRDS and so on.

2. Commercial version of DataX

Aliyun DataWorks Data Integration is the commercial product built by the DataX team on Aliyun. It aims to provide fast, stable data movement between a wide range of heterogeneous data sources in complex network environments, as well as data synchronization solutions for complex business scenarios. At present it serves nearly 3000 cloud customers and synchronizes more than 3 trillion records in a single day. DataWorks Data Integration currently supports 50 kinds of offline data sources and covers synchronization scenarios such as whole-database migration, batch upload to the cloud, incremental synchronization, and sharded databases and tables. The real-time synchronization capability, updated in 2020, supports any combination of reads and writes across 10+ data sources, and provides a one-click full plus incremental synchronization solution from MySQL, Oracle and other data sources to big data engines such as Aliyun MaxCompute and Hologres.

For the address of the DataX Git repository, see reference [1].

2.1 Application Cases

Next, we introduce application cases from two projects.

2.1.1 Case 1: Using DataX to help analyze a data synchronization link

While migrating a customer's Oracle database to the cloud, a packaged product was used, but its transfer rate stayed very low, at only 6 MB/s. The customer suspected a bottleneck in the VPC network or the RDS database in the cloud, but in fact both the cloud network and the database were under very low load.

Since it was not convenient for us to test the customer's transfer tool directly, we temporarily deployed DataX on ECS for analysis and comparative testing.

The test results are shown in the following figure.

Figure 1: test results

The commonly used optimization parameters are as follows:

1. Channel (channel): concurrency

The tests show that the channel setting has an obvious effect on transfer efficiency. In the experiments above, with all settings at 1 (that is, no concurrency), the synchronization speed was 8.9 MB/s. Increasing the channel setting clearly multiplied the speed, but beyond a certain point the bottleneck shifted to other configuration items.

2. Sharding (splitPk)

The official documentation in the Git repository describes it as follows:

Description: when MysqlReader extracts data, specifying splitPk means the user wants the data to be sharded on the field that splitPk names. DataX then starts concurrent tasks for data synchronization, which can greatly improve synchronization performance.

It is recommended to use the table's primary key as splitPk, because primary key values are usually evenly distributed, so the resulting shards are less prone to data hotspots.

Currently, splitPk supports sharding on integer data only, and does not support floating point, string, date or other types. If the user specifies an unsupported type, MysqlReader reports an error.

If splitPk is not filled in, meaning it is either not provided or its value is empty, DataX synchronizes the table using a single channel.

Required: no

Default value: empty

In practice, the test results show that sharding must be combined with channel. If only splitPk is set but channel is configured as 1, there is no concurrency effect.
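The interplay between channel and splitPk can be sketched as a small Python helper that builds a job description. The layout follows the job JSON format documented in the DataX repository (`job.setting.speed.channel` and the reader parameter `splitPk`). The helper names `build_job` and `is_concurrent`, the table, columns, credentials and JDBC URLs are placeholder assumptions, not values from the project.

```python
import json


def build_job(channel: int, split_pk: str = "") -> dict:
    """Build a minimal MySQL-to-MySQL DataX job description.

    All connection details below are placeholders (assumptions).
    """
    reader = {
        "name": "mysqlreader",
        "parameter": {
            "username": "src_user",
            "password": "src_password",
            "column": ["id", "name"],
            # Integer column used to shard the table; empty means no sharding.
            "splitPk": split_pk,
            "connection": [
                {
                    "table": ["orders"],
                    "jdbcUrl": ["jdbc:mysql://source-host:3306/source_db"],
                }
            ],
        },
    }
    writer = {
        "name": "mysqlwriter",
        "parameter": {
            "writeMode": "insert",
            "username": "dst_user",
            "password": "dst_password",
            "column": ["id", "name"],
            "connection": [
                {
                    "table": ["orders"],
                    "jdbcUrl": "jdbc:mysql://target-host:3306/target_db",
                }
            ],
        },
    }
    return {
        "job": {
            # channel sets how many concurrent channels the job may use.
            "setting": {"speed": {"channel": channel}},
            "content": [{"reader": reader, "writer": writer}],
        }
    }


def is_concurrent(job: dict) -> bool:
    """Mirror the test finding: concurrency needs splitPk AND channel > 1."""
    speed = job["job"]["setting"]["speed"]
    reader_param = job["job"]["content"][0]["reader"]["parameter"]
    return speed["channel"] > 1 and bool(reader_param.get("splitPk"))


job_json = json.dumps(build_job(channel=4, split_pk="id"), indent=2)
```

The resulting `job_json` string can be saved to a file such as `job.json` and passed to the `datax.py` launcher in the DataX `bin` directory.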

3. Batch size (batchSize)

The official documentation in the Git repository describes it as follows:

Description: the number of records submitted in one batch. A larger value can greatly reduce the number of network interactions between DataX and MySQL and improve overall throughput. However, setting this value too high may cause the DataX process to run out of memory (OOM).

Required: no

Default value: 1024

In our tests the effect was not obvious, mainly because the data volume was small. With a 1c1g configuration, moderately increasing the batch size can improve the synchronization speed.

There are many other parameters left for readers to explore, and we may cover them in a later article.

The test results show that the tool used in the project reached only 6 MB/s, which is far from enough, and that the performance bottleneck was neither the network nor the database. After only a few tests, we found that a 4c16g database could easily reach a synchronization rate of 27 MB/s.
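To see why a larger batch reduces the number of network interactions, here is a rough count. It assumes one round trip per submitted batch, which is a simplification. The row count is a made-up figure for illustration, and `batches_needed` is a hypothetical helper, not part of DataX.

```python
import math


def batches_needed(row_count: int, batch_size: int) -> int:
    """Number of batch submissions needed to write row_count rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(row_count / batch_size)


rows = 1_000_000  # hypothetical table size
default_batches = batches_needed(rows, 1024)  # 977 with the default batchSize
larger_batches = batches_needed(rows, 4096)   # 245 with a 4x larger batch
```

Each batch is held in memory before it is submitted, which is why the documentation warns that very large values can lead to OOM.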

2.1.2 Case 2: Using DataX for data synchronization

In this case, the customer migrated data between two clouds: the customer had built a new cloud and needed to migrate the data from the previous one.

The plan is as follows:

Figure 2: scenario

2.2 Deployment mode

DataX itself does not support automatic cluster deployment, so it has to be deployed manually, machine by machine.

We used 15 V2 physical machines, each with about 150 GB of spare memory. (The V2 cluster was about to go offline and its products no longer carried any business, so the idle physical machines were used. ECS or other virtual machines can be used when physical machines are not available.)

Machine requirements: a synchronization task uses a lot of memory once started. In our tests, one task (large table) used about 10 GB of memory on average.

A single machine with 150 GB of memory can therefore support 15 concurrent tasks.
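The capacity estimate above is simple arithmetic, sketched below. The function names are hypothetical, and the cluster-wide figure assumes that every one of the 15 machines has the same 150 GB of headroom.

```python
def concurrent_tasks(spare_memory_gb: float, memory_per_task_gb: float) -> int:
    """How many tasks fit into the spare memory of one machine."""
    if memory_per_task_gb <= 0:
        raise ValueError("memory_per_task_gb must be positive")
    return int(spare_memory_gb // memory_per_task_gb)


def cluster_tasks(machines: int, spare_memory_gb: float,
                  memory_per_task_gb: float) -> int:
    """Total concurrent tasks, assuming identical machines."""
    return machines * concurrent_tasks(spare_memory_gb, memory_per_task_gb)


per_machine = concurrent_tasks(150, 10)  # 15 tasks on one machine
whole_cluster = cluster_tasks(15, 150, 10)  # 225 if all 15 machines match
```

The 10 GB figure was measured on large tables, so small tables may allow more tasks per machine.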

2.3 Synchronization mode

2.3.1 Data characteristics

Category: number of projects, total size

Projects larger than 1T: 7, 97T

Projects smaller than 1T: 77, 15T

History tables: historical data; few in number, but they take up more than 80% of the space

Dimension tables + temporary tables: large in number, small in size, and a small share of the space

Data characteristics: customer analysis reports from dozens of departments, with more partitions in small tables and relatively fewer in large tables

2.3.2 Synchronization order

Synchronize the large tables first, then the small tables

The business has no requirement on synchronization order. Small tables change often while large tables are stable, so large tables are synchronized first.

Large table synchronization requires less manual configuration and is also suitable for early debugging. (For debugging before synchronization officially starts, do not pick a very large table. A table of 10G-50G is recommended, which is enough to measure efficiency and load.)

Advantages: with many machines and high parallelism, throughput can be pushed to the bottleneck quickly, improving overall synchronization efficiency

Disadvantages: DataX is a basic migration tool without automation capabilities, so a large number of synchronization tasks have to be configured manually

Note: it is recommended to reserve several machines specifically for synchronizing small tables with many partitions, while the other machines synchronize large tables.
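The ordering and machine reservation described above can be sketched as a simple planner. This is an illustration only: the `Table` type, the `plan_sync` function, the size threshold, the round-robin assignment, and the table and machine names are all assumptions, not the scripts used in the project.

```python
from typing import Dict, List, NamedTuple


class Table(NamedTuple):
    name: str
    size_gb: float
    partitions: int


def plan_sync(tables: List[Table], machines: List[str],
              reserved_for_small: int,
              large_threshold_gb: float) -> Dict[str, List[str]]:
    """Assign tables to machines: large tables first, small ones isolated."""
    if not 0 < reserved_for_small < len(machines):
        raise ValueError("reserve at least one machine for each group")
    small_machines = machines[:reserved_for_small]
    large_machines = machines[reserved_for_small:]

    # Largest tables go first so the long-running work starts early.
    large = sorted((t for t in tables if t.size_gb >= large_threshold_gb),
                   key=lambda t: t.size_gb, reverse=True)
    small = [t for t in tables if t.size_gb < large_threshold_gb]

    plan: Dict[str, List[str]] = {m: [] for m in machines}
    for i, table in enumerate(large):
        plan[large_machines[i % len(large_machines)]].append(table.name)
    for i, table in enumerate(small):
        plan[small_machines[i % len(small_machines)]].append(table.name)
    return plan


# Hypothetical inventory and machine names.
tables = [
    Table("dim_region", 2, 300),
    Table("history_orders", 900, 12),
    Table("tmp_report", 5, 120),
    Table("history_logs", 1500, 24),
]
plan = plan_sync(tables, ["m1", "m2", "m3"],
                 reserved_for_small=1, large_threshold_gb=100)
```

Keeping the small, heavily partitioned tables on their own machines means their many short tasks do not compete with the long-running large table transfers.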

2.4 Summary

To sum up, the advantages of DataX are obvious:

First, deployment is very simple. Whether on a physical machine or a virtual machine, data synchronization can run as long as the network is reachable. This is very convenient for implementation staff and avoids the deployment constraints of standard data synchronization products.

Second, it is an open source product with almost no cost.

However, its disadvantages are also obvious:

As an open source tool it provides only basic capabilities and lacks functions such as task management, progress tracking and verification. Users can only keep track of these through scripts or spreadsheets, and there is currently no graphical console. It is therefore best suited as a temporary data synchronization tool.

Therefore, when the business requires standardized management or standardized operation and maintenance, or uses data synchronization scheduling frequently and continuously so that production safety is critical, it is recommended to use Aliyun's official products.
