I. On data fusion and the enterprise data fusion platform
Data fusion is the logical or physical consolidation of data with different sources, formats, and characteristics, so that the enterprise gains a comprehensive, shared view of its data.
An enterprise data fusion platform is usually a distributed system running a large number of data synchronization and transformation tasks. The sources are typically real-time business data stores of various kinds, and the destinations are data warehouses or object storage.
II. Typical architecture of an enterprise data fusion platform
The following figure shows the typical architecture of a data fusion platform: different data storage systems sit on the source side, and various data warehouses, relational databases, or file storage sit on the destination side. In the middle is the data fusion platform itself, where the source connectors component is responsible for data collection.
After the data is collected, it is formatted and placed into the transport channel. Message queues or other streaming data frameworks are generally used as this intermediate cache, providing distributed operation and data distribution, and the sink connectors are responsible for writing the data to the different destinations.
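As a minimal sketch of this flow, assuming Kafka is used as the transport channel and using a hypothetical topic name `transport.orders`, a source-side process publishes collected records to the channel and a sink-side process consumes them and hands them to the destination writer:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class TransportChannelSketch {
    static final String TOPIC = "transport.orders";   // hypothetical channel topic

    // Source connector side: collect a record and publish it to the transport channel.
    static void publish(String key, String formattedRecord) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("key.serializer", StringSerializer.class.getName());
        p.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>(TOPIC, key, formattedRecord));
        }
    }

    // Sink connector side: consume from the channel and write to the destination.
    static void consumeAndWrite() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("group.id", "sink-connector");
        p.put("key.deserializer", StringDeserializer.class.getName());
        p.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(p)) {
            consumer.subscribe(List.of(TOPIC));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                writeToDestination(r.key(), r.value());   // e.g. a warehouse loader
            }
        }
    }

    static void writeToDestination(String key, String value) {
        System.out.printf("writing %s -> %s%n", key, value);   // placeholder destination
    }
}
```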
III. Key problems to be solved in enterprise data fusion
Data heterogeneity
The platform faces the tedious work of adapting to many different sources and destinations and of transforming data between heterogeneous systems.
Data structures that change from time to time
The structure of a data source can change at any time, which can cause downstream writes to fail. When the structure changes, the platform has to keep synchronization running and ensure that the data written downstream remains correct.
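One common way to keep writes working through such changes is for the sink to compare the incoming record's fields against the destination table and add any missing columns before writing. This is an illustrative sketch only, not necessarily how the platform described here handles it; the table, field names, and column type are hypothetical and JDBC is assumed:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SchemaDriftHandler {

    // Add any columns that appear in the incoming record but not in the target table,
    // so a structure change on the source does not break downstream writes.
    static void alignSchema(Connection conn, String table, Map<String, Object> record) throws Exception {
        Set<String> existing = new HashSet<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM " + table + " WHERE 1=0")) {
            ResultSetMetaData meta = rs.getMetaData();
            for (int i = 1; i <= meta.getColumnCount(); i++) {
                existing.add(meta.getColumnName(i).toLowerCase());
            }
        }
        for (String field : record.keySet()) {
            if (!existing.contains(field.toLowerCase())) {
                // Naive type choice for illustration only; a real platform maps types properly.
                try (Statement st = conn.createStatement()) {
                    st.executeUpdate("ALTER TABLE " + table + " ADD COLUMN " + field + " TEXT");
                }
            }
        }
    }
}
```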
Scalability of the data platform
The platform has to scale horizontally as the business grows, and sometimes has to handle one-to-many distribution requirements. It also has to manage quality of service (QoS) when many tasks run in parallel.
Data consistency
Data must remain consistent under all circumstances; this is a guarantee the production pipeline has to provide.
IV. The role of the message queue in a data fusion platform
The first is decoupling. A message queue completely decouples data collection on the source side from data writing on the destination side: if anything goes wrong on the writing side, the stability of data acquisition is not affected.
Schema mapping helps decouple the structure of the data source from that of the destination and reduces the complexity of developing a new connector.
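In Kafka Connect terms (the foundation described in section V below), this decoupling shows up as a neutral in-flight record format: a source connector maps its native rows into a Schema/Struct once, and every sink only has to understand that neutral form. A small sketch, with illustrative field names:

```java
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;

public class NeutralRecordExample {
    public static void main(String[] args) {
        // Intermediate, source/destination-independent schema (field names are illustrative).
        Schema userSchema = SchemaBuilder.struct().name("user")
                .field("id", Schema.INT64_SCHEMA)
                .field("name", Schema.STRING_SCHEMA)
                .field("updated_at", Schema.OPTIONAL_INT64_SCHEMA)
                .build();

        // A source connector maps a native row into this neutral Struct once;
        // every sink connector works only against the neutral format.
        Struct user = new Struct(userSchema)
                .put("id", 42L)
                .put("name", "alice")
                .put("updated_at", 1700000000000L);

        System.out.println(user);
    }
}
```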
A message queue also provides horizontal scalability and high availability. When more data needs to be ingested than the system can currently handle, the queue can easily be scaled out to support the larger volume.
In addition, the message queue provides consistency guarantees for data synchronization; at a minimum it preserves the order in which data is synchronized.
V. Existing architecture of DataPipeline
The following figure shows the architecture of DataPipeline, which is built on Kafka Connect. Kafka itself is a very mature message queue, and Kafka Connect is a sub-project that essentially wraps the Kafka consumer and producer: it implements distribution and high availability and takes care of the interaction with Kafka for us.
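Because Kafka Connect takes care of the producer/consumer plumbing, distribution, and failover, adding a new synchronization task in such a setup largely amounts to submitting a connector configuration to the Connect REST API. A minimal sketch: the connector name, topic, and output file are illustrative, while the endpoint and FileStreamSinkConnector ship with Kafka Connect:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        // The "name"/"config" envelope and the /connectors endpoint are standard Kafka Connect;
        // the concrete values below are only for illustration.
        String body = """
            {
              "name": "orders-sink",
              "config": {
                "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
                "tasks.max": "1",
                "topics": "transport.orders",
                "file": "/tmp/orders.out"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```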
VI. Kafka Connect offset management
Consumers have the concept of an offset, which records how far consumption has progressed. Kafka Connect manages offsets automatically: after a batch of data has been consumed, it commits the progress and stores it back in Kafka.
When reading data, the connector extracts records from the data source and writes them to a data topic, which serves as the intermediate cache. During synchronization the connector also periodically commits its offset to an offset topic, which is effectively a checkpoint of how far it has read.
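In code this looks like a SourceTask whose records each carry a source offset; Kafka Connect commits those offsets to the offset topic on a schedule and hands the last committed value back on restart. A minimal sketch, where the counter source and topic name are invented for illustration:

```java
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

import java.util.Collections;
import java.util.List;
import java.util.Map;

public class CounterSourceTask extends SourceTask {
    // Identifies this logical source; the framework keys committed offsets by it.
    private static final Map<String, String> PARTITION = Collections.singletonMap("source", "counter");
    private long position = 0;

    @Override
    public void start(Map<String, String> props) {
        // On restart, resume from the offset the framework last committed to the offset topic.
        Map<String, Object> saved = context.offsetStorageReader().offset(PARTITION);
        if (saved != null && saved.get("position") != null) {
            position = (Long) saved.get("position");
        }
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        Thread.sleep(500);
        position++;
        // Each record carries its offset; Kafka Connect periodically commits it for us.
        return List.of(new SourceRecord(
                PARTITION,
                Collections.singletonMap("position", position),
                "transport.counter",                 // data topic (illustrative)
                Schema.INT64_SCHEMA,
                position));
    }

    @Override
    public void stop() { }

    @Override
    public String version() { return "0.1"; }
}
```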
If a periodic offset commit fails, the task cannot be restored exactly to the last written position when it restarts. Data is then read and written again, which creates a consistency problem. The following approaches can mitigate it:
Rely on the characteristics of the destination to deduplicate and reach eventual consistency; for example, an RDBMS can deduplicate by primary key (an upsert).
Rely on the transaction facilities of the message queue to avoid duplication on the source side, so that writing the data and writing the offset are committed transactionally.
After each write, the destination records a separate offset in a Redis cache; after the task resumes, incoming records are filtered by that offset to avoid repeated writes. This reduces the duplication caused by offset rewind, but because writing the data and recording the offset are not one transactional operation, exactly-once delivery is still not guaranteed.
Rely on the transactionality of the destination: keep a small offset table at the destination, write the data and the offset in the same transaction, and filter by that offset after the task is restored to avoid repeated writes. This can guarantee exactly-once delivery, but it requires a transactional destination and adds extra storage there (a sketch follows below).
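A sketch of that last approach, using hypothetical `users` and `sync_offsets` tables and a PostgreSQL-style upsert, which also combines the primary-key deduplication from the first point with a destination-side offset committed in the same transaction as the data:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class TransactionalSink {
    // Write a record and its offset in ONE destination transaction, so a crash can never
    // leave the data written but the offset unrecorded (or vice versa).
    static void write(Connection conn, long offset, long id, String name) throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement data = conn.prepareStatement(
                 "INSERT INTO users(id, name) VALUES(?, ?) "
               + "ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name");
             PreparedStatement off = conn.prepareStatement(
                 // Assumes the 'users-sink' row was seeded when the task was created.
                 "UPDATE sync_offsets SET last_offset = ? WHERE task = 'users-sink'")) {
            data.setLong(1, id);
            data.setString(2, name);
            data.executeUpdate();
            off.setLong(1, offset);
            off.executeUpdate();
            conn.commit();
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }

    // After a restart, read the committed offset and skip anything already applied.
    static long lastCommittedOffset(Connection conn) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                 "SELECT last_offset FROM sync_offsets WHERE task = 'users-sink'");
             ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getLong(1) : -1L;
        }
    }
}
```

Because the data row and the offset either commit together or roll back together, replayed records after a restart are simply filtered out by comparing against the stored offset.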