In the article "Application scenarios of the data fusion platform DataPipeline" released last week, we walked through seven scenarios that customers encounter when using the latest version 2.6. This article continues with several more application scenarios.
1. Support for sub-database and sub-table (sharding) scenarios
Scenario description
Within a single data task, data from one source table needs to be synchronized to different tables in different databases according to the business logic carried in the data. For example, the group's sales data is routed into the sales table of the corresponding branch's database according to which branch each record belongs to.
Scenario applicability
Source / destination: relational database
Read mode: unlimited
Operation steps
(1) Determine the sub-database and sub-table rules according to the design.
(2) Select and create the corresponding data sources according to the established rules.
(3) On the destination side, write the sub-database and sub-table rules to a CSV file in the specified format.
(4) DataPipeline writes the source data to the destinations according to the rules defined in the CSV file.
Note: for more details, please contact DataPipeline for the development documentation.
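To make the routing idea concrete, below is a minimal, self-contained sketch of how a rule file could map a branch field to a destination database and table. The CSV layout (branch_id, target_db, target_table) and the field name branch_id are assumptions made for illustration only; the actual rule format is defined in DataPipeline's development documentation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: the CSV columns and field names are assumptions,
// not the actual DataPipeline rule specification.
public class ShardRouter {
    // Maps a branch id to "databaseName.tableName".
    private final Map<String, String> rules = new HashMap<>();

    // Each rule line is assumed to look like: branch_id,target_db,target_table
    public ShardRouter(Path ruleCsv) throws IOException {
        for (String line : Files.readAllLines(ruleCsv)) {
            String[] cols = line.split(",");
            if (cols.length == 3) {
                rules.put(cols[0].trim(), cols[1].trim() + "." + cols[2].trim());
            }
        }
    }

    // Returns the destination for one source row, e.g. "sales_bj.sales_data".
    public String route(Map<String, Object> row) {
        return rules.get(String.valueOf(row.get("branch_id")));
    }

    public static void main(String[] args) throws IOException {
        ShardRouter router = new ShardRouter(Path.of("shard_rules.csv"));
        Map<String, Object> row = Map.of("branch_id", "BJ01", "amount", 1200);
        System.out.println(router.route(row));
    }
}
```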
2. Custom data source scenarios
Scenario description
In current data transmission requirements, besides distributing data from different upstream business databases to downstream systems in real time or on a schedule, many enterprises also need to obtain business data from external partners and suppliers. In such cases, enterprises usually write ad hoc scripts for each need, manually call the API interfaces provided by the third-party systems, write their own cleaning logic after fetching the data, and finally land the data themselves. For this kind of scenario, DataPipeline's custom data source feature offers the following advantages:
1. Data acquisition logic is managed uniformly, and JAR packages can be merged in quickly, reducing the amount of script development;
2. When the upstream changes, there is no need to adjust each data transmission task;
3. It can be combined with DataPipeline's data parsing function, cleaning tools and destination initialization function to reduce overall development effort and provide monitoring and alerting.
Scenario applicability
Source: custom data source
Destination: unlimited
Read mode: timing mode
Operation steps
(1) Create a custom data source and upload the JAR package (or call a previously uploaded JAR package).
(2) Select the destination where the data will be stored.
(3) Use the cleaning tools to complete the data parsing logic.
(4) Complete the configuration by configuring the destination table structure.
Note: for more information, please refer to the article "Custom data sources: solving the problem of acquiring external data with complex request logic".
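As a rough illustration of the kind of pull logic such a JAR package would contain, the sketch below polls a third-party API from a checkpoint. The endpoint URL, the checkpoint parameter and the class name are placeholders invented for this example; the actual interface a custom data source must implement is described in the DataPipeline documentation.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

// Hypothetical sketch of the pull logic a custom data source JAR might package.
// The endpoint and checkpoint format are assumptions made for illustration.
public class PartnerOrderSource {
    private static final String ENDPOINT = "https://partner.example.com/api/orders?since=";
    private final HttpClient client = HttpClient.newHttpClient();

    // One timed poll: fetch everything newer than the last checkpoint.
    public List<String> poll(String lastCheckpoint) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(ENDPOINT + lastCheckpoint))
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // Parsing and cleaning are left to DataPipeline's parsing function and
        // cleaning tools; the raw JSON body is returned as a placeholder here.
        return List.of(response.body());
    }

    public static void main(String[] args) throws Exception {
        new PartnerOrderSource().poll("2024-01-01T00:00:00Z").forEach(System.out::println);
    }
}
```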
3. After the MySQL source Slave1 goes down, how to continue synchronization from Slave2 without losing data
Scenario description
To avoid putting load on the MySQL master, DataPipeline connects to the slave Slave1 and performs real-time synchronization by parsing its Binlog. When Slave1 goes down, in order not to interrupt the task, synchronization needs to switch to Slave2 and continue in real time from that slave.
However, the Binlog obtained from Slave2 lags behind Slave1, which can lead to missing data.
DataPipeline provides a rollback feature for this. The user rolls the original task back on DataPipeline to a point in time that guarantees no data is missed, obtains the corresponding GTID, and then uses that GTID to find the matching Binlog file and position on Slave2. This operation does not lose data, but it may produce duplicate data (if the destination is a relational database with a primary key, duplicates can be removed based on the primary key).
Scenario applicability
Source / destination: MySQL / relational database
Read mode: real-time mode
Operation steps
(1) Create the data source (Slave1, which must meet the conditions required for Binlog synchronization).
(2) Activate the task normally.
(3) If the task fails, perform the rollback operation and obtain the GTID value at the chosen point in time.
(4) Create another data source (Slave2, which must also meet the conditions required for Binlog synchronization).
(5) When activating, select Custom as the starting point (enter the GTID obtained from the Slave1 rollback, and use that GTID to find the corresponding Binlog file, position and other information on Slave2).
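For step (5), the Binlog file and position corresponding to the GTID can also be looked up manually on Slave2. The sketch below does this over JDBC by scanning the binlog event list; the host, credentials and class name are placeholders, it assumes the MySQL JDBC driver is on the classpath, and a brute-force scan like this is only practical when the binlogs are small.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hedged sketch: locate the binlog file/position on Slave2 that corresponds to
// the GTID obtained from the rollback on Slave1. Connection details are
// illustrative assumptions.
public class GtidLocator {
    public static void main(String[] args) throws Exception {
        String gtid = args[0]; // e.g. "3e11fa47-71ca-11e1-9e33-c80aa9429562:23"
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://slave2:3306/", "repl_reader", "secret");
             Statement st = conn.createStatement()) {

            // Walk every binlog file on Slave2 and look for the matching GTID event.
            try (ResultSet logs = st.executeQuery("SHOW BINARY LOGS")) {
                while (logs.next()) {
                    String file = logs.getString("Log_name");
                    try (Statement evSt = conn.createStatement();
                         ResultSet events = evSt.executeQuery(
                                 "SHOW BINLOG EVENTS IN '" + file + "'")) {
                        while (events.next()) {
                            if ("Gtid".equals(events.getString("Event_type"))
                                    && events.getString("Info").contains(gtid)) {
                                System.out.printf("GTID %s found in %s at position %d%n",
                                        gtid, file, events.getLong("Pos"));
                                return;
                            }
                        }
                    }
                }
            }
            System.out.println("GTID not found; Slave2 may have purged those binlogs.");
        }
    }
}
```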
4. Synchronizing multiple tables to one Kafka Topic
Scenario description
For business reasons, multiple tables need to be synchronized into one Kafka Topic so that the data can be consumed downstream. This scenario can be implemented in DataPipeline, but there are some points to note.
Scenario applicability
Source / destination: unlimited / Kafka
Read mode: unlimited (real-time mode or incremental identification field mode is recommended, because data at a Kafka destination cannot be deduplicated)
Operation steps
(1) If you do not use DataPipeline's advanced cleaning feature, you can ask DataPipeline operations and maintenance staff to turn on a global parameter that lets you select multiple tables in one task and write them to the same Kafka destination. In real-time mode, with this parameter enabled, each record is appended with fields such as the DML type and insert timestamp for downstream use.
(2) If you need to use DataPipeline's advanced cleaning feature, distribute these tables across different tasks and write them all to the same Kafka Topic.
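On the consuming side, a downstream application reads the shared Topic and relies on the appended metadata to tell records apart. The sketch below is a plain kafka-clients consumer; the broker address, topic name and the exact names of the appended fields are assumptions, so verify them against the messages DataPipeline actually delivers.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Hedged sketch of a downstream consumer for the shared topic; names are placeholders.
public class MergedTopicConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "merged-topic-reader");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("dp_merged_topic"));
            while (true) {
                for (ConsumerRecord<String, String> record :
                        consumer.poll(Duration.ofSeconds(1))) {
                    // Each value is expected to carry the row plus appended metadata
                    // such as the DML type and insert timestamp, so the consumer can
                    // tell which operation and source table produced it.
                    System.out.println(record.value());
                }
            }
        }
    }
}
```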
5. Synchronizing multiple tables to one table
Scenario description
For business reasons, multiple tables need to be synchronized into one destination table. This can be done in DataPipeline, but there are a few things to pay attention to.
Scenario applicability
Source / destination: unlimited
Read mode: unlimited
Operation steps
(1) In versions prior to 2.6.0, advanced cleaning needs to be enabled, and default values are assigned to the added fields in advanced cleaning.
(2) In versions after 2.6.0, advanced cleaning does not need to be enabled: add the fields in the write settings and turn off the blue toggle (the field is synchronized, but not its data).
(3) The destination table structures configured in DataPipeline must be consistent; if a field is missing, it needs to be added accordingly. You also need to build multiple tasks and distribute the source tables across them.
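The end result of this scenario is equivalent to unioning several source tables into one destination table while filling a default value for any column a source table lacks. The JDBC sketch below shows that shape directly; the connection strings, table names and column names are invented for illustration and are not part of DataPipeline's configuration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Illustrative sketch: rows from several source tables land in one destination
// table, with a default value supplied where a source table lacks a column.
public class MultiTableMerge {
    public static void main(String[] args) throws Exception {
        try (Connection src = DriverManager.getConnection(
                 "jdbc:mysql://source:3306/sales", "reader", "secret");
             Connection dst = DriverManager.getConnection(
                 "jdbc:mysql://dest:3306/warehouse", "writer", "secret");
             PreparedStatement insert = dst.prepareStatement(
                 "INSERT INTO all_orders (order_id, amount, region) VALUES (?, ?, ?)")) {

            // orders_north carries a region column; orders_south does not,
            // so a default value is filled in for the added field.
            copy(src, insert, "SELECT order_id, amount, region FROM orders_north", null);
            copy(src, insert, "SELECT order_id, amount FROM orders_south", "SOUTH");
        }
    }

    private static void copy(Connection src, PreparedStatement insert,
                             String query, String defaultRegion) throws Exception {
        try (Statement st = src.createStatement(); ResultSet rs = st.executeQuery(query)) {
            while (rs.next()) {
                insert.setLong(1, rs.getLong("order_id"));
                insert.setBigDecimal(2, rs.getBigDecimal("amount"));
                insert.setString(3, defaultRegion != null ? defaultRegion : rs.getString("region"));
                insert.executeUpdate();
            }
        }
    }
}
```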
6. How to synchronize images to HDFS
Scenario description
Images from each region need to be uploaded to HDFS and kept for subsequent computation.
Scenario applicability
Source / destination: FTP / HDFS
Read mode: timing mode
Operation steps
(1) Click File synchronization.
(2) Create the task as normal.
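Conceptually, the task copies files from an FTP directory into HDFS. The sketch below does the same thing directly with Apache Commons Net and the Hadoop FileSystem API; the host names, credentials and paths are placeholders, and in DataPipeline this is all handled through the task configuration rather than code.

```java
import java.io.OutputStream;
import java.net.URI;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;
import org.apache.commons.net.ftp.FTPFile;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch: copy image files from an FTP directory into HDFS.
// Hosts, credentials and paths are illustrative placeholders.
public class FtpToHdfsSync {
    public static void main(String[] args) throws Exception {
        FTPClient ftp = new FTPClient();
        ftp.connect("ftp.example.com");
        ftp.login("ftpuser", "secret");
        ftp.enterLocalPassiveMode();
        ftp.setFileType(FTP.BINARY_FILE_TYPE);

        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

        // Copy every image in the region directory, streaming each file into HDFS.
        for (FTPFile file : ftp.listFiles("/images/region_bj")) {
            if (!file.isFile()) continue;
            try (OutputStream out = hdfs.create(new Path("/data/images/region_bj/" + file.getName()))) {
                ftp.retrieveFile("/images/region_bj/" + file.getName(), out);
            }
        }

        ftp.logout();
        ftp.disconnect();
        hdfs.close();
    }
}
```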
This article has covered six scenarios. If you encounter similar problems in your work, you are welcome to get in touch with us.