2025-04-03 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/03 Report--
DataPipeline has gone through several product iterations over the past year. Do you know which usage scenarios the latest 2.6 release supports? Below, we interpret them for you in two parts; we hope the scenarios you care about appear among them.
Scenario 1: Handling frequent changes to production data structures
Scenario description
When synchronizing production data, the source side often drops tables or adds and removes fields as the business evolves. The task should keep synchronizing in such cases, and when the source adds or removes a field, the destination should be able to follow suit (or not) according to its settings.
Applicability
Source / destination: relational databases
Read mode: no restriction
Operation steps
No restriction on the DataPipeline version
Under the task settings of DataPipeline, in the advanced options of the data destination settings, there is a "data source change settings" option, which can be configured as prompted.
Scenario 2: Calling a Jenkins task after a data task ends
Scenario description
As soon as a data synchronization task finishes, start a predefined Jenkins job, guaranteeing execution order and dependencies.
Applicability
Source / destination: traditional databases (others require scripts)
Read mode: batch full or incremental identification field
Operation steps
Create a task flow in DataPipeline
Create a scheduled data synchronization task in the flow
Add a [remote command execution] node, fill in the server IP, write a Python script, and place it in the specified directory on the server
For detailed operation, please contact the DataPipeline staff.
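The remote-command step above can be a small Python script that calls Jenkins's Remote Access API. The sketch below is illustrative only: the Jenkins URL, job name, and credentials are placeholders you must replace with your own.

```python
# Hypothetical sketch: trigger a Jenkins job from the remote-command step.
# The URL, job name, user, and token below are placeholders, not real values.
import base64
import urllib.parse
import urllib.request

def build_trigger_request(base_url, job, user, token, params=None):
    """Build the POST request that starts a Jenkins build (with or without parameters)."""
    if params:
        url = f"{base_url}/job/{job}/buildWithParameters?" + urllib.parse.urlencode(params)
    else:
        url = f"{base_url}/job/{job}/build"
    req = urllib.request.Request(url, method="POST")
    # Jenkins API tokens are sent via HTTP Basic auth
    auth = base64.b64encode(f"{user}:{token}".encode()).decode()
    req.add_header("Authorization", f"Basic {auth}")
    return req

if __name__ == "__main__":
    req = build_trigger_request("http://jenkins.example.com", "nightly-etl",
                                "ci-user", "secret-token", {"ENV": "prod"})
    # urllib.request.urlopen(req)  # uncomment on the server to actually fire the job
    print(req.full_url)
```

Because DataPipeline's task flow runs this script only after the synchronization task completes, the execution order and dependency are guaranteed by the flow itself.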
Scenario 3: Synchronizing production data for testing
Scenario description
MySQL -> MySQL real-time synchronization. During synchronization, the test team may want to test several tables in the task, and the tests will issue INSERT/UPDATE/DELETE operations against the destination. Before testing, a script should automatically pause synchronization of those tables; after testing, another script resynchronizes them so that the destination data becomes consistent with the online data again (that is, all dirty data generated by the tests is cleaned up).
Applicability
Source / destination: relational database destination
Read mode: no restriction (in full / incremental identification field mode, you may need to enable [allow target table data to be cleared before each scheduled batch synchronization])
Operation steps
Required DataPipeline version >= 2.6.0
Before testing the destination tables, execute the script provided by DataPipeline
After the destination-side testing is finished, execute the script to add the tested tables back
Run the script to resynchronize the tested tables so that the data remains consistent with the online data
Refer to the DataPipeline swagger API list; script templates are available for reference.
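As a rough illustration of what such a pause/resume script looks like, the sketch below assembles REST calls against the DataPipeline API. The endpoint paths and task ID here are invented placeholders; the real paths come from your instance's swagger API list and the templates provided by DataPipeline.

```python
# Illustrative only: the endpoint paths below are hypothetical placeholders.
# Look up the real ones in your DataPipeline swagger API list.
import json
import urllib.request

API = "http://dp.example.com/v3"  # placeholder base URL

def build_call(action, task_id, tables):
    """Assemble one API call (URL + JSON body) to pause or resume table sync."""
    assert action in ("pause", "resume")
    url = f"{API}/tasks/{task_id}/tables/{action}"  # hypothetical path
    body = json.dumps({"tables": tables}).encode()
    return urllib.request.Request(url, data=body, method="POST",
                                  headers={"Content-Type": "application/json"})

# Before testing: pause the tables under test.
# urllib.request.urlopen(build_call("pause", 42, ["orders", "users"]))
# After testing: resume, which resyncs the tables and overwrites the test dirty data.
# urllib.request.urlopen(build_call("resume", 42, ["orders", "users"]))
```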
Scenario 4: Improving Hive -> GP column-storage synchronization speed
Scenario description
For Hive -> GP, if the GP destination is a manually created column-oriented table, synchronization on DataPipeline will be very slow. This is due to limitations of GP column storage itself. If the destination is instead a row-oriented table created by DataPipeline, which a script then converts into a column-oriented table, efficiency improves by dozens of times.
Applicability
Source / destination: Hive source / GP destination
Read mode: incremental / full
Operation steps
Let DataPipeline create the destination table automatically as a row-oriented table
Write a script that converts the row-oriented table into a column-oriented one
After the data task's synchronization completes, invoke the conversion script through the DataPipeline task flow
Then provide the column-oriented data for downstream use
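The conversion script boils down to a few SQL statements that rebuild the row table as a column-oriented copy. The sketch below generates that SQL; the table names are placeholders, and you would execute the statements against Greenplum (e.g. via psql or psycopg2) from the task flow's remote-command step.

```python
# A minimal sketch of the row-to-column conversion for Greenplum.
# Table names are placeholders; run the emitted SQL against GP after sync completes.
def row_to_column_sql(row_table, column_table):
    """Return the SQL that rebuilds a row-oriented table as a column-oriented copy."""
    return [
        f"DROP TABLE IF EXISTS {column_table};",
        # Greenplum column storage: appendonly storage with column orientation
        f"CREATE TABLE {column_table} "
        f"WITH (appendonly=true, orientation=column) "
        f"AS SELECT * FROM {row_table};",
        # refresh planner statistics for the new table
        f"ANALYZE {column_table};",
    ]

for stmt in row_to_column_sql("stage.orders_row", "dw.orders_col"):
    print(stmt)
```

Bulk-loading into the row table and converting once at the end avoids the per-row overhead of writing directly into GP column storage, which is where the speedup comes from.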
Scenario 5: Encrypting and desensitizing data
Scenario description
For user privacy or other security reasons, some data fields need to be desensitized or encrypted. The advanced cleaning function of DataPipeline fully covers this kind of scenario.
Applicability
Source / destination: unlimited
Read mode: no restriction
Operation steps
No restriction on the DataPipeline version
Configure the task as usual, just enable the advanced cleaning function.
Package the encryption or desensitization code you have written into a jar, upload it to the server's execution directory, and call it directly.
You can contact DataPipeline for advanced-cleaning code templates
Note: the jar you write needs to be uploaded to the /root/datapipeline/code_engine_lib directory (the general default) on the servers where webservice, sink, and manager are located.
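The actual advanced-cleaning code is packaged as a Java jar as described above; the Python sketch below only illustrates two typical desensitization techniques you might implement there, with placeholder field formats and salt.

```python
# Illustration only: the real cleaning code ships as a jar, but the logic
# typically looks like this. Field formats and the salt are placeholders.
import hashlib

def mask_phone(phone):
    """Partial masking: keep the first 3 and last 4 digits, star out the middle."""
    if len(phone) < 7:
        return "*" * len(phone)
    return phone[:3] + "*" * (len(phone) - 7) + phone[-4:]

def hash_id(id_number, salt="change-me"):
    """One-way desensitization: salted SHA-256, irreversible by design."""
    return hashlib.sha256((salt + id_number).encode()).hexdigest()

print(mask_phone("13812345678"))  # 138****5678
```

Masking keeps the field human-recognizable, while hashing is the right choice when downstream consumers only need to join or deduplicate on the field, never read it.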
Scenario 6: Attributing upstream and downstream data problems via the error queue
Scenario description
A data department needs to receive upstream data and pass it on to the corresponding departments according to downstream needs. When dirty data or other data problems appear, it is therefore sometimes difficult to locate the cause and assign responsibility. Moreover, dirty data is usually discarded outright, so the upstream side cannot trace what caused it. Through DP's advanced cleaning function, you can route non-conforming data into the error queue according to your own rules.
Applicability
Source / destination: unlimited
Read mode: no restriction
Operation steps
No restriction on the DataPipeline version
Configure the task as usual, just enable the advanced cleaning function.
In the advanced cleaning code, apply your business logic to the relevant fields and throw non-conforming data into the DP error queue.
You can contact DataPipeline for advanced-cleaning code templates
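The routing logic inside that cleaning step is straightforward; the Python sketch below illustrates it (the real code is Java in the jar, and the field names and checks are invented examples). Recording the failure reasons alongside each rejected record is what makes it possible to trace responsibility back upstream.

```python
# Sketch of error-queue routing as done in advanced cleaning (real code is Java).
# The fields "amount" and "user_id" and their checks are example business rules.
def route(record, error_queue, output):
    """Send invalid records to the error queue with reasons; pass valid ones downstream."""
    reasons = []
    if record.get("amount") is None or record["amount"] < 0:
        reasons.append("amount missing or negative")
    if not record.get("user_id"):
        reasons.append("user_id empty")
    if reasons:
        # keep the offending record and why it failed, so upstream can investigate
        error_queue.append({"record": record, "reasons": reasons})
    else:
        output.append(record)

errors, ok = [], []
route({"user_id": "u1", "amount": 9.5}, errors, ok)
route({"user_id": "", "amount": -1}, errors, ok)
print(len(ok), len(errors))  # 1 1
```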
Scenario 7: Easier support for manually added destination fields
Scenario description
For Oracle -> SQLServer, you want to manually add a timestamp-type column to the destination that is automatically assigned a default value, recording each row's INSERT time.
Applicability
Source / destination: relational database destination
Read mode: no restriction
Operation steps
Required DataPipeline version >= 2.6.0
On the DataPipeline mapping page, add a column whose field name matches the manually added destination column (any scale/type can be given; it does not need to match)
Toggle the blue button for this field (on means the field's data is synchronized; off means the field sends no data) and click Save. As shown in the figure below:
Manually add the column to the destination
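For the manual destination-side step, the column can be added with a single DDL statement. The helper below emits one possible form of it; the table name, column name, and the choice of DATETIME2 with SYSDATETIME() (rather than SQL Server's rowversion TIMESTAMP, which cannot hold a wall-clock time) are assumptions for illustration.

```python
# Hypothetical helper emitting the SQL Server DDL for the manually added column.
# Table/column names and the DATETIME2 + SYSDATETIME() choice are assumptions.
def add_insert_time_column_sql(table, column="dp_insert_time"):
    """DDL: add a column whose default records each row's INSERT time."""
    return (f"ALTER TABLE {table} "
            f"ADD {column} DATETIME2 NOT NULL "
            f"CONSTRAINT df_{column} DEFAULT SYSDATETIME();")

print(add_insert_time_column_sql("dbo.orders"))
```

With the mapping-page field toggled off, DataPipeline sends no data for this column, so every synchronized row falls back to the default and gets its INSERT time stamped by the database itself.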
This article has covered the above 7 scenarios. If you encounter the same problems at work, you are welcome to get in touch with us.
© 2024 shulou.com SLNews company. All rights reserved.