It has been a month since the last user experience upgrade post. What has the DataPipeline team been working on in the meantime?
To better serve users, the latest version of DataPipeline supports:
1. Data from one data source can be distributed (in real time or on a schedule) to multiple destinations at the same time.
2. Improved Hive usage scenarios:
When writing to a Hive destination, any destination table field can be selected as a partition field.
Hive can be used as a data source and its data distributed to multiple destinations on a schedule.
3. When synchronizing relational database data on a schedule, the read strategy can be customized to meet the incremental synchronization requirements of each table.
This article first introduces the one-to-many data distribution and batch read mode 2.0 functions; the features will be released one after another.
Launch background
Background of the "one-to-many data distribution" feature
In historical versions, DataPipeline allowed only one data source and one destination per task, and data read from the data source could be written to only one target table. As a result, two customer requirement scenarios could not be fully supported:
Requirement scenario 1:
After obtaining JSON data from an API data source or from a Kafka topic, the customer parses it with advanced cleaning and writes it to multiple tables or databases at the destination. Historical versions could not write to multiple destinations in the same task, so customers had to create multiple tasks. This meant the same batch of data was fetched repeatedly on the source side (with no full guarantee of data consistency), wasted resources, and could not be managed in a unified way.
Requirement scenario 2:
The customer wants to create one data task that distributes data from a relational database table to multiple data destinations in real time (or on a schedule). In historical versions, users had to create multiple tasks, each of which read the same table from the data source again, wasting resources. Customers would rather read the data once and write it directly to multiple tables.
Problems solved by the new features:
1. After selecting a data source in a data task, the user can select multiple destinations or tables as write targets, without creating multiple tasks.
2. Within a single task, users can set the destination table structure and write strategy separately for each destination according to its type and characteristics, which greatly reduces the number of data source reads and the management cost.
Background of extending Hive-related usage scenarios
Historical versions of DataPipeline already supported synchronizing data from various types of data sources to Hive destinations. However, because customers use Hive and store data in different ways, two requirement scenarios were not supported:
Requirement scenario 1:
Dynamic partition fields. Historical versions only allowed a time-type field to be selected as the partition field. In real customer scenarios, besides time-based partitioning strategies, customers want to designate any field of the Hive table as a partition field.
Requirement scenario 2:
Besides writing data to Hive on a schedule with Hive as the destination, customers also want to use DataPipeline to regularly distribute Hive table data to various application systems to meet business needs.
Problems solved by the new features:
1. Allows the user to specify any field in the destination Hive table as a partition field and supports selecting multiple partition fields (see the sketch after this list).
2. Adds Hive as a data source, so Hive tables can be read by data tasks.
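To illustrate what an arbitrary (non-time) partition field looks like on the Hive side, here is a minimal HiveQL sketch; the table and field names are hypothetical, and it shows standard Hive dynamic partitioning rather than DataPipeline's internal implementation.

-- Hypothetical destination table partitioned by two non-time fields.
CREATE TABLE IF NOT EXISTS orders_dest (
  order_id   BIGINT,
  amount     DOUBLE,
  created_at TIMESTAMP
)
PARTITIONED BY (region STRING, channel STRING)
STORED AS ORC;

-- Enable dynamic partitioning so partition values are taken from the data itself.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The last two selected columns populate the partition fields (region, channel).
INSERT INTO TABLE orders_dest PARTITION (region, channel)
SELECT order_id, amount, created_at, region, channel
FROM orders_staging;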
Background of the "batch read mode 2.0" function
Requirement scenario:
The customer's relational database tables (take MySQL as an example) cannot be read via BINLOG because the required permission is missing, yet the business needs incremental data to be synchronized periodically. With only SELECT permission, incremental data still has to be synchronized.
Before this version, when users chose batch read mode, DataPipeline provided an incremental identification field: an auto-increment field or an update-time field could be selected as the condition for synchronizing incremental data. However, some tables have no such field, or the simple incremental logic does not fit the requirement (for example, synchronizing only the data of the past hour, or only data older than five minutes).
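For reference, reading by an incremental identification field corresponds roughly to the following query pattern; this is a sketch with hypothetical table and column names, MySQL syntax assumed, where :last_synced_value stands for the high-water mark recorded after the previous batch.

-- Incremental read driven by an update-time field (hypothetical names, MySQL syntax).
-- :last_synced_value is the largest update_time already synchronized in the previous batch.
SELECT *
FROM orders
WHERE update_time > :last_synced_value
ORDER BY update_time;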
Problems solved by the new features:
1. When a relational database is the data source, users can set a WHERE read condition for each table, and a last_max() method is provided.
2. With this function, DataPipeline takes the maximum value of a specified field among the data already synchronized by the task and uses it as a read condition in the WHERE statement.
3. On the first run, or when the corresponding field has no value yet, read conditions that reference the last_max() function are ignored.
4. Users can combine the read condition with functions provided by the database itself, as in the example below:
Example: using a time field as the read condition, each run synchronizes only data older than one hour, and only data that has not been read yet.
SELECT * FROM table1 WHERE update_time > 'last_max(update_time)' AND update_time ...
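A complete version of this read condition might look like the following sketch (MySQL syntax assumed): DATE_SUB expresses "only data older than one hour", and last_max(update_time) supplies the largest value already synchronized, so previously read rows are skipped.

-- Assumed full form of the read condition (MySQL syntax).
SELECT *
FROM table1
WHERE update_time > 'last_max(update_time)'            -- skip rows already synchronized
  AND update_time <= DATE_SUB(NOW(), INTERVAL 1 HOUR); -- only rows older than one hour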