DataX: an offline data synchronization tool that supports synchronizing almost all heterogeneous data sources to MaxCompute

Updated 2025-04-04, from SLTechnology News&Howtos (shulou.com)

Overview

DataX is an offline data synchronization tool and platform that is widely used within Alibaba Group. It provides efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), DRDS and others.

As a data synchronization framework, DataX abstracts the synchronization of different data sources into Reader plug-ins, which read data from the source, and Writer plug-ins, which write data to the target. In theory, the DataX framework can therefore support synchronization for any type of data source. The plug-in system also works as an ecosystem: each time a new data source is connected, it immediately becomes interoperable with all of the existing data sources.

Offline data synchronization is used in big data analysis, data backup, data migration and similar scenarios, so this article introduces Alibaba's open-source tool for the job: DataX.
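The Reader/Writer split can be illustrated with a small sketch. DataX's real plug-ins are written in Java against the framework's own interfaces; the Python names below (Reader, Writer, ListReader, ListWriter, sync) are hypothetical and exist only to show why a newly added reader can be paired with every existing writer.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, List

Record = List[object]  # one row of column values (illustrative type only)


class Reader(ABC):
    """Hypothetical reader plug-in: yields records from a source."""

    @abstractmethod
    def read(self) -> Iterator[Record]:
        ...


class Writer(ABC):
    """Hypothetical writer plug-in: consumes records into a target."""

    @abstractmethod
    def write(self, records: Iterable[Record]) -> int:
        ...


class ListReader(Reader):
    """Reads records from an in-memory list (stands in for a real source)."""

    def __init__(self, rows: List[Record]) -> None:
        self.rows = rows

    def read(self) -> Iterator[Record]:
        yield from self.rows


class ListWriter(Writer):
    """Appends records to an in-memory list (stands in for a real target)."""

    def __init__(self) -> None:
        self.sink: List[Record] = []

    def write(self, records: Iterable[Record]) -> int:
        count = 0
        for record in records:
            self.sink.append(list(record))
            count += 1
        return count


def sync(reader: Reader, writer: Writer) -> int:
    """The framework only sees the two interfaces, never concrete sources."""
    return writer.write(reader.read())
```

Because sync depends only on the two interfaces, N readers and M writers give N x M possible synchronization paths without any pair-specific code.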

Preparatory work

1. Prepare the environment: a Linux server with JDK 8, Maven and Python 2.6+ installed.

2. Download the source code: https://github.com/alibaba/DataX.git

3. Decompress and compile the source code: mvn -U clean package assembly:assembly -Dmaven.test.skip=true

When Maven finishes with a BUILD SUCCESS message, the compilation has succeeded. The build takes a while: because DataX supports many data sources and pulls in the corresponding dependencies, it may take about 20 minutes, depending on download speed and machine performance.

Common errors:

In step 3 (compiling), the build may fail because the tablestore-streamclient dependency cannot be resolved. In that case, download the package from https://mvnrepository.com/artifact/com.aliyun.openservices/tablestore-streamclient/1.0.0 and place it under the corresponding path in the local Maven repository.

Tool use

After DataX compiles successfully, the executables are generated under the target/datax/datax/ directory, and DataX can be used to synchronize offline data between the supported data sources (the full list is in https://github.com/alibaba/DataX/blob/master/userGuid.md).

Data sources that are not in that list can be supported by writing a custom plug-in. For more information, see https://github.com/alibaba/DataX/blob/master/dataxPluginDev.md

For example, the simplest possible task generates a few records and prints them to the console, with the job described in a JSON file:

Switch to the bin directory: cd target/datax/datax/bin. On our 192.168.1.63 server, for example, that is /home/data-transfer/datax/target/datax/datax/bin

Print the configuration template: python datax.py -r streamreader -w streamwriter

Write the configuration file. The stream2stream.json file is as follows:

{
  "job": {
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
            "sliceRecordCount": 10,
            "column": [
              {
                "type": "long",
                "value": "10"
              },
              {
                "type": "string",
                "value": "Hello, hello, World-DataX"
              }
            ]
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {
            "encoding": "UTF-8",
            "print": true
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel": 5
      }
    }
  }
}
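Hand-written job files are easy to break with a missing comma or brace, so one option is to generate the file from a script. The sketch below builds the same stream2stream job as a Python dict and serializes it with the standard json module; build_stream_job and write_job are hypothetical helper names, not part of DataX.

```python
import json


def build_stream_job(record_count: int = 10, channels: int = 5) -> dict:
    """Return the streamreader -> streamwriter job as a plain dict."""
    return {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "streamreader",
                        "parameter": {
                            "sliceRecordCount": record_count,
                            "column": [
                                {"type": "long", "value": "10"},
                                {"type": "string",
                                 "value": "Hello, hello, World-DataX"},
                            ],
                        },
                    },
                    "writer": {
                        "name": "streamwriter",
                        "parameter": {"encoding": "UTF-8", "print": True},
                    },
                }
            ],
            "setting": {"speed": {"channel": channels}},
        }
    }


def write_job(path: str, job: dict) -> None:
    """Serialize the job; json.dump guarantees syntactically valid JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(job, fh, indent=2, ensure_ascii=False)
```

Calling write_job("stream2stream.json", build_stream_job()) produces a file equivalent to the listing above.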

Run the script: python datax.py ./stream2stream.json. After execution, the generated records and a job summary are printed to the console.
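When DataX jobs are launched from a scheduler or another script, the two command forms used in this article can be wrapped in a small helper. This is only a sketch: datax_command and run_datax are hypothetical names, and the interpreter name "python" is an assumption that should match whichever Python version your copy of datax.py supports.

```python
import subprocess
from typing import List, Optional


def datax_command(job_file: Optional[str] = None,
                  reader: Optional[str] = None,
                  writer: Optional[str] = None,
                  python: str = "python",
                  script: str = "datax.py") -> List[str]:
    """Build the argument list for the two invocations shown in the article."""
    if job_file is not None:
        # Run a job: python datax.py ./stream2stream.json
        return [python, script, job_file]
    if reader and writer:
        # Print a template: python datax.py -r <reader> -w <writer>
        return [python, script, "-r", reader, "-w", writer]
    raise ValueError("pass either job_file or both reader and writer")


def run_datax(args: List[str], cwd: Optional[str] = None) -> int:
    """Run the command from the DataX bin directory; return its exit code."""
    return subprocess.run(args, cwd=cwd, check=False).returncode
```

For example, run_datax(datax_command(job_file="./stream2stream.json"), cwd="target/datax/datax/bin") would start the job above from the bin directory.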

As another example, for offline data synchronization from MySQL to MySQL, get the configuration template with: python datax.py -r mysqlreader -w mysqlwriter

For the other available writers, see the writer folder under the plugin directory. These Writer plug-ins are officially included by default, and custom ones can be added.

For the other available readers, see the reader folder under the plugin directory. These Reader plug-ins are officially included by default, and custom ones can be added.

Note: to synchronize data incrementally in offline mode, specify a where filter condition in the reader section of the configuration file, so that each run reads only the rows that match it.
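A minimal sketch of the incremental approach: compute a where condition that covers a single day and place it in the mysqlreader parameter block. The key names (username, password, column, where, connection, table, jdbcUrl) follow the mysqlreader template as the author of this sketch recalls it and should be checked against the template printed by your own DataX build. The column name gmt_modified, the table name and the JDBC URL are purely hypothetical.

```python
from datetime import date, timedelta
from typing import List


def incremental_where(column: str, day: date) -> str:
    """Select rows whose timestamp column falls within the given day."""
    start = day.isoformat()
    end = (day + timedelta(days=1)).isoformat()
    return f"{column} >= '{start}' AND {column} < '{end}'"


def mysql_reader_parameter(jdbc_url: str, table: str, columns: List[str],
                           where: str, username: str, password: str) -> dict:
    """Assumed shape of the mysqlreader 'parameter' block (verify locally)."""
    return {
        "username": username,
        "password": password,
        "column": columns,
        "where": where,  # only rows matching this condition are read
        "connection": [{"table": [table], "jdbcUrl": [jdbc_url]}],
    }


# Hypothetical values for illustration only.
param = mysql_reader_parameter(
    jdbc_url="jdbc:mysql://127.0.0.1:3306/demo",
    table="orders",
    columns=["id", "amount", "gmt_modified"],
    where=incremental_where("gmt_modified", date(2025, 4, 4)),
    username="reader_user",
    password="change-me",
)
```

Using a half-open range (>= start and < next day) avoids reading the same row twice when consecutive daily runs are scheduled.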
