2025-04-03 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/03 Report--
DataPipeline has gone through several product iterations over the past year. Do you know which usage scenarios the latest 2.6 release supports? Below, we interpret them for you in two parts; we hope the scenarios you care about appear among them.
Scenario 1: Handling frequent changes to production data structures
Scenario description
When synchronizing production data, the source side often drops tables or adds and removes fields as the business evolves. The task should keep synchronizing in such cases, and when the source adds or removes a field, the destination should be able to follow suit (or not) according to its settings.
Applicability
Source / destination: relational databases
Read mode: no restriction
Operation steps
No restriction on the DataPipeline version
Under the task settings of DataPipeline, in the advanced options of the data destination settings, there is a "data source change settings" option, which can be configured as prompted.
Scenario 2: Calling a Jenkins task after a data task ends
Scenario description
As soon as a data synchronization task finishes, start a predefined Jenkins job, guaranteeing execution order and dependencies.
Applicability
Source / destination: traditional databases (others require scripts)
Read mode: batch full or incremental identification field
Operation steps
Create a task flow in DataPipeline
Create a scheduled data synchronization task in the flow
Add a [remote command execution] node, fill in the server IP, write a Python script, and place it in the specified directory on the server
For detailed operation, please contact the DataPipeline staff.
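The remote-command step above can be a small Python script that calls Jenkins's Remote Access API. The sketch below is illustrative only: the Jenkins URL, job name, and credentials are placeholders you must replace with your own.

```python
# Hypothetical sketch: trigger a Jenkins job from the remote-command step.
# The URL, job name, user, and token below are placeholders, not real values.
import base64
import urllib.parse
import urllib.request

def build_trigger_request(base_url, job, user, token, params=None):
    """Build the POST request that starts a Jenkins build (with or without parameters)."""
    if params:
        url = f"{base_url}/job/{job}/buildWithParameters?" + urllib.parse.urlencode(params)
    else:
        url = f"{base_url}/job/{job}/build"
    req = urllib.request.Request(url, method="POST")
    # Jenkins API tokens are sent via HTTP Basic auth
    auth = base64.b64encode(f"{user}:{token}".encode()).decode()
    req.add_header("Authorization", f"Basic {auth}")
    return req

if __name__ == "__main__":
    req = build_trigger_request("http://jenkins.example.com", "nightly-etl",
                                "ci-user", "secret-token", {"ENV": "prod"})
    # urllib.request.urlopen(req)  # uncomment on the server to actually fire the job
    print(req.full_url)
```

Because DataPipeline's task flow runs this script only after the synchronization task completes, the execution order and dependency are guaranteed by the flow itself.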
Scenario 3: Synchronizing production data for testing
Scenario description
MySQL -> MySQL real-time synchronization. During synchronization, the test team may want to test several tables in the task, and the tests will issue INSERT/UPDATE/DELETE operations against the destination. Before testing, a script should automatically pause synchronization of those tables; after testing, another script resynchronizes them so that the destination data becomes consistent with the online data again (that is, all dirty data generated by the tests is cleaned up).
Applicability
Source / destination: relational database destination
Read mode: no restriction (in full / incremental identification field mode, you may need to enable [allow target table data to be cleared before each scheduled batch synchronization])
Operation steps
Required DataPipeline version >= 2.6.0
Before testing the destination tables, execute the script provided by DataPipeline
After the destination-side testing is finished, execute the script to add the tested tables back
Run the script to resynchronize the tested tables so that the data remains consistent with the online data
Refer to the DataPipeline swagger API list; script templates are available for reference.
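As a rough illustration of what such a pause/resume script looks like, the sketch below assembles REST calls against the DataPipeline API. The endpoint paths and task ID here are invented placeholders; the real paths come from your instance's swagger API list and the templates provided by DataPipeline.

```python
# Illustrative only: the endpoint paths below are hypothetical placeholders.
# Look up the real ones in your DataPipeline swagger API list.
import json
import urllib.request

API = "http://dp.example.com/v3"  # placeholder base URL

def build_call(action, task_id, tables):
    """Assemble one API call (URL + JSON body) to pause or resume table sync."""
    assert action in ("pause", "resume")
    url = f"{API}/tasks/{task_id}/tables/{action}"  # hypothetical path
    body = json.dumps({"tables": tables}).encode()
    return urllib.request.Request(url, data=body, method="POST",
                                  headers={"Content-Type": "application/json"})

# Before testing: pause the tables under test.
# urllib.request.urlopen(build_call("pause", 42, ["orders", "users"]))
# After testing: resume, which resyncs the tables and overwrites the test dirty data.
# urllib.request.urlopen(build_call("resume", 42, ["orders", "users"]))
```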
Scenario 4: Improving Hive -> GP column-storage synchronization speed
Scenario description
For Hive -> GP, if the GP destination is a manually created column-oriented table, synchronization on DataPipeline will be very slow. This is due to limitations of GP column storage itself. If the destination is instead a row-oriented table created by DataPipeline, which a script then converts into a column-oriented table, efficiency improves by dozens of times.
Applicability
Source / destination: Hive source / GP destination
Read mode: incremental / full
Operation steps
Let DataPipeline create the destination table automatically as a row-oriented table
Write a script that converts the row-oriented table into a column-oriented one
After the data task's synchronization completes, invoke the conversion script through the DataPipeline task flow
Then provide the column-oriented data for downstream use
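The conversion script boils down to a few SQL statements that rebuild the row table as a column-oriented copy. The sketch below generates that SQL; the table names are placeholders, and you would execute the statements against Greenplum (e.g. via psql or psycopg2) from the task flow's remote-command step.

```python
# A minimal sketch of the row-to-column conversion for Greenplum.
# Table names are placeholders; run the emitted SQL against GP after sync completes.
def row_to_column_sql(row_table, column_table):
    """Return the SQL that rebuilds a row-oriented table as a column-oriented copy."""
    return [
        f"DROP TABLE IF EXISTS {column_table};",
        # Greenplum column storage: appendonly storage with column orientation
        f"CREATE TABLE {column_table} "
        f"WITH (appendonly=true, orientation=column) "
        f"AS SELECT * FROM {row_table};",
        # refresh planner statistics for the new table
        f"ANALYZE {column_table};",
    ]

for stmt in row_to_column_sql("stage.orders_row", "dw.orders_col"):
    print(stmt)
```

Bulk-loading into the row table and converting once at the end avoids the per-row overhead of writing directly into GP column storage, which is where the speedup comes from.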
Scenario 5: Encrypting and desensitizing data
Scenario description
For user privacy or other security reasons, some data fields need to be desensitized or encrypted. The advanced cleaning function of DataPipeline fully covers this kind of scenario.
Applicability
Source / destination: unlimited
Read mode: no restriction
Operation steps
No restriction on the DataPipeline version
Configure the task as usual, just enable the advanced cleaning function.
Package the encryption or desensitization code you have written into a jar, upload it to the server's execution directory, and call it directly.
You can contact DataPipeline for advanced-cleaning code templates
Note: the jar you write needs to be uploaded to the /root/datapipeline/code_engine_lib directory (the general default) on the servers where webservice, sink, and manager are located.
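The actual advanced-cleaning code is packaged as a Java jar as described above; the Python sketch below only illustrates two typical desensitization techniques you might implement there, with placeholder field formats and salt.

```python
# Illustration only: the real cleaning code ships as a jar, but the logic
# typically looks like this. Field formats and the salt are placeholders.
import hashlib

def mask_phone(phone):
    """Partial masking: keep the first 3 and last 4 digits, star out the middle."""
    if len(phone) < 7:
        return "*" * len(phone)
    return phone[:3] + "*" * (len(phone) - 7) + phone[-4:]

def hash_id(id_number, salt="change-me"):
    """One-way desensitization: salted SHA-256, irreversible by design."""
    return hashlib.sha256((salt + id_number).encode()).hexdigest()

print(mask_phone("13812345678"))  # 138****5678
```

Masking keeps the field human-recognizable, while hashing is the right choice when downstream consumers only need to join or deduplicate on the field, never read it.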
Scenario 6: Attributing upstream and downstream data problems via the error queue
Scenario description
A data department needs to receive upstream data and pass it on to the corresponding departments according to downstream needs. When dirty data or other data problems appear, it is therefore sometimes difficult to locate the cause and assign responsibility. Moreover, dirty data is usually discarded outright, so the upstream side cannot trace what caused it. Through DP's advanced cleaning function, you can route non-conforming data into the error queue according to your own rules.
Applicability
Source / destination: unlimited
Read mode: no restriction
Operation steps
No restriction on the DataPipeline version
Configure the task as usual, just enable the advanced cleaning function.
In the advanced cleaning code, apply your business logic to the relevant fields and throw non-conforming data into the DP error queue.
You can contact DataPipeline for advanced-cleaning code templates
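The routing logic inside that cleaning step is straightforward; the Python sketch below illustrates it (the real code is Java in the jar, and the field names and checks are invented examples). Recording the failure reasons alongside each rejected record is what makes it possible to trace responsibility back upstream.

```python
# Sketch of error-queue routing as done in advanced cleaning (real code is Java).
# The fields "amount" and "user_id" and their checks are example business rules.
def route(record, error_queue, output):
    """Send invalid records to the error queue with reasons; pass valid ones downstream."""
    reasons = []
    if record.get("amount") is None or record["amount"] < 0:
        reasons.append("amount missing or negative")
    if not record.get("user_id"):
        reasons.append("user_id empty")
    if reasons:
        # keep the offending record and why it failed, so upstream can investigate
        error_queue.append({"record": record, "reasons": reasons})
    else:
        output.append(record)

errors, ok = [], []
route({"user_id": "u1", "amount": 9.5}, errors, ok)
route({"user_id": "", "amount": -1}, errors, ok)
print(len(ok), len(errors))  # 1 1
```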
Scenario 7: Easier support for manually added destination fields
Scenario description
For Oracle -> SQLServer, you want to manually add a timestamp-type column to the destination that is automatically assigned a default value, recording each row's INSERT time.
Applicability
Source / destination: relational database destination
Read mode: no restriction
Operation steps
Required DataPipeline version >= 2.6.0
On the DataPipeline mapping page, add a column whose field name matches the manually added destination column (any scale/type can be given; it does not need to match)
Toggle the blue button for this field (on means the field's data is synchronized; off means the field sends no data) and click Save. As shown in the figure below:
Manually add the column to the destination
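For the manual destination-side step, the column can be added with a single DDL statement. The helper below emits one possible form of it; the table name, column name, and the choice of DATETIME2 with SYSDATETIME() (rather than SQL Server's rowversion TIMESTAMP, which cannot hold a wall-clock time) are assumptions for illustration.

```python
# Hypothetical helper emitting the SQL Server DDL for the manually added column.
# Table/column names and the DATETIME2 + SYSDATETIME() choice are assumptions.
def add_insert_time_column_sql(table, column="dp_insert_time"):
    """DDL: add a column whose default records each row's INSERT time."""
    return (f"ALTER TABLE {table} "
            f"ADD {column} DATETIME2 NOT NULL "
            f"CONSTRAINT df_{column} DEFAULT SYSDATETIME();")

print(add_insert_time_column_sql("dbo.orders"))
```

With the mapping-page field toggled off, DataPipeline sends no data for this column, so every synchronized row falls back to the default and gets its INSERT time stamped by the database itself.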
This article has covered the above 7 scenarios. If you encounter the same problems at work, you are welcome to get in touch with us.
© 2024 shulou.com SLNews company. All rights reserved.