This article explains in detail how to use Apache Hudi and Debezium to build a robust CDC pipeline; I hope it leaves you with a good grasp of the relevant concepts.
The material comes from a talk at the Bangalore Hadoop Meetup on building CDC pipelines with Apache Hudi and Debezium, presented by Pratyaksh, an active contributor to the Apache Hudi community.
CDC (Change Data Capture) is a software design pattern used to identify and track changed data so that action can be taken on it. A simple example is capturing change records from MySQL and importing them into a data lake.
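For instance, a captured change record typically carries the row state before and after the change plus an operation code. The sketch below shows this shape in the envelope style Debezium uses; all field values are invented for illustration:

```python
# Shape of a typical CDC change event (Debezium-style envelope).
# All values here are invented for illustration.
change_event = {
    "op": "u",                       # c = create, u = update, d = delete
    "ts_ms": 1560000000000,          # when the change was captured
    "before": {"id": 42, "status": "PENDING"},    # row state before the update
    "after":  {"id": 42, "status": "SHIPPED"},    # row state after the update
    "source": {"db": "shop", "table": "orders"},  # where the change came from
}
```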
Business units need business insight; service owners need to validate every version of a record over time; and data engineers need low-maintenance pipelines that move data with low latency from transactional systems (MySQL, Postgres, Cassandra, MongoDB) to analytical systems (HDFS). CDC brings the following advantages: event handling, real-time analytics and dashboards, audit logs, and round-the-clock load jobs.
There are different solutions for CDC: log-based ones such as Debezium, and query-based ones such as JDBC connectors and Sqoop. Most companies use Sqoop to move data, but they then have to handle schema changes in the data source and manage the file storage format themselves, and formats such as CSV are difficult to work with.
In the past we used Maxwell, but gave it up because of its limited openness and community support.
NiFi is a good dataflow tool as long as high-frequency stream processing is avoided. It is IO-heavy, so the disk can become a bottleneck, and it offers no data redundancy, so AWS EBS should be configured. In addition, we had to patch the CaptureChangeMySQL processor to handle memory buffering.
Debezium is an active project backed by Red Hat. It is built on Kafka Connect, supports both SQL and NoSQL databases, and keeps its cached schema up to date by merging the SQL information schema with ALTER statements.
Bootstrap: because the binlog/WAL is not retained for long, Debezium processes a snapshot of the entire database the first time it starts.
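As a rough illustration of both points, here is a minimal sketch of registering a Debezium MySQL connector through the Kafka Connect REST API, including the snapshot.mode setting that drives the first-start bootstrap. The host names, credentials, and topic names are hypothetical placeholders, and the keys follow the Debezium 1.x MySQL connector (newer releases rename some of them):

```python
# A minimal sketch of registering a Debezium MySQL connector with Kafka
# Connect. All hosts, credentials, and names below are placeholders.
import requests

connector = {
    "name": "mysql-cdc-connector",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql.internal",  # assumed source host
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",
        "database.server.name": "inventory",  # logical name; prefixes topics
        # Take a full snapshot on first start, since the binlog/WAL is not
        # retained forever (the bootstrap step described above).
        "snapshot.mode": "initial",
        # Where Debezium tracks the schema changes it merges from ALTERs.
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.inventory",
    },
}

# Kafka Connect exposes a REST API for managing connectors.
resp = requests.post("http://connect.internal:8083/connectors", json=connector)
resp.raise_for_status()
```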
Several storage-layer options exist. Databricks recently open-sourced Delta.io, which not long ago added support for Presto and Athena. Uber open-sourced Apache Hudi, whose storage format rewrites and splits Parquet files and exposes a Parquet file input format (usable from Athena). Standardizing on Parquet may seem controversial, but the file format evolves well within the Spark community. Hive, even with LLAP support, still feels slow on MR/Tez.
The overall structure of the system is as follows. The source database can be SQL or NoSQL, producing a binlog or WAL. The entire service runs on Kubernetes, and we built an abstraction layer on top: Debezium handles near-real-time (NRT) requirements, since freshness always comes at a higher cost, while JDBC handles batch loads from databases where reading a change log is not supported.
Hudi stands for Hadoop Upserts, Deletes and Incrementals. In other words, Hudi provides an efficient platform for data ingestion, reconciliation, and querying. For ingestion and reconciliation, it relies on the Hudi key.
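To make the Hudi key concrete, here is a minimal PySpark sketch of an upsert keyed on a record key and routed by a partition path. The table, field names, and paths (orders, order_id, order_date, updated_at) are assumptions for illustration:

```python
# A minimal PySpark sketch of a Hudi upsert. Table name, field names, and
# paths are illustrative assumptions.
from pyspark.sql import SparkSession

# Assumes the hudi-spark bundle is on the Spark classpath.
spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()

df = spark.read.json("/tmp/cdc-events")  # assumed input of change records

hudi_options = {
    "hoodie.table.name": "orders",
    # The Hudi key: every update with the same record key and partition path
    # is routed to the same file group.
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    # Field used to pick the winning version when updates to one key collide.
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

(df.write.format("hudi")  # "org.apache.hudi" on older Hudi/Spark versions
   .options(**hudi_options)
   .mode("append")
   .save("/data/lake/orders"))
```

The precombine field decides which version of a record wins when several updates for the same key arrive in one batch.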
For deduplication, multiple updates to the same record need to go to the same partition path; Hudi uses an index (Bloom or HBase) for this.
If a record already exists, the index marks the incoming record with the record's current location.
When writing, Hudi maintains a minimum HDFS file size, which is also how it solves the small-files problem.
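The index choice and file-sizing behavior correspond to Hudi write configs. A hedged sketch with illustrative values follows; the dict would be merged into the write options from the earlier sketch:

```python
# A hedged sketch of the index and file-sizing options described above;
# the values are illustrative, not the presenter's settings.
hudi_tuning = {
    # Index used to tag each incoming record with its current file location.
    "hoodie.index.type": "BLOOM",  # or "HBASE" for an external global index
    # Parquet files below this size keep absorbing new inserts, which is how
    # Hudi avoids accumulating small files.
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),  # 100 MB
    "hoodie.parquet.max.file.size": str(128 * 1024 * 1024),     # 128 MB
}
```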
In COW (copy-on-write) mode, a cleanup policy can be used to clean up outdated file versions.
For queries, multiple views are supported: read-optimized, real-time, and incremental views.
COW supports the read-optimized and incremental views.
MOR supports all three views.
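As an example of the incremental view, the sketch below reads only the commits after a given instant time, assuming the SparkSession and table path from the earlier sketch; the begin instant is an illustrative commit timestamp, and older Hudi releases spell the first option hoodie.datasource.view.type:

```python
# A minimal sketch of an incremental read; assumes `spark` and the table
# path from the earlier sketch. The begin instant time is illustrative.
incremental_options = {
    # Older Hudi releases use "hoodie.datasource.view.type" instead.
    "hoodie.datasource.query.type": "incremental",
    # Only commits made after this instant are returned.
    "hoodie.datasource.read.begin.instanttime": "20190601000000",
}

changes = (
    spark.read.format("hudi")
    .options(**incremental_options)
    .load("/data/lake/orders")
)
changes.show()
```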
The following is the overall system architecture around Apache Hudi: it uses Spark micro-batches to read data, supports indexing, syncs tables to the Hive Metastore, and supports the three query views.
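A hedged sketch of that micro-batch ingestion path: read Debezium change events from Kafka with Spark Structured Streaming and upsert each batch into Hudi. It reuses `spark` and `hudi_options` from the earlier sketch; the topic, bootstrap servers, and checkpoint path are hypothetical, and extracting the row fields from the Debezium JSON payload is omitted for brevity:

```python
# Read Debezium events from Kafka and upsert each micro-batch into Hudi.
# Assumes `spark` and `hudi_options` from the earlier sketch; names and
# paths below are placeholders. Payload parsing is omitted for brevity.

def upsert_batch(batch_df, batch_id):
    # Each micro-batch is written as a Hudi upsert.
    (batch_df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("/data/lake/orders"))

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "inventory.shop.orders")  # assumed Debezium topic
    .load()
)

query = (
    events.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/tmp/checkpoints/orders")
    .start()
)
```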
There are also some challenges in using Hudi.
Contributions have been made back to both the Hudi community and the Debezium community.
Roadmap: building a UI for orchestration, a UI for data analysis, authentication and authorization, and so on.
The Hudi Spark job is started with the appropriate Hive Metastore properties, along the following lines.
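Here is a hedged sketch of the Hive sync properties that register the Hudi table in the Hive Metastore; the JDBC URL, database, and table names are illustrative assumptions, and it reuses `df` and `hudi_options` from above:

```python
# A hedged sketch of Hive Metastore sync for the Hudi table. The JDBC URL,
# database, and table names are assumptions, not values from the talk.
hive_sync_options = {
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "lake",  # assumed Hive database
    "hoodie.datasource.hive_sync.table": "orders",
    "hoodie.datasource.hive_sync.partition_fields": "order_date",
    "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://hiveserver:10000",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .options(**hive_sync_options)
   .mode("append")
   .save("/data/lake/orders"))
```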
The cleanup policy in Hudi is configured along these lines.
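A minimal sketch of such a policy, with an illustrative retention count; these options are passed alongside the other write options above:

```python
# A minimal sketch of a Hudi cleanup policy; the retention count is
# illustrative, not the presenter's setting.
cleaner_options = {
    # Keep file versions from the latest N commits and delete older ones,
    # bounding the storage consumed by COW rewrites.
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "10",
}
```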
That covers how to use Apache Hudi and Debezium to build a robust CDC pipeline. I hope the content above is helpful; if you found the article useful, feel free to share it with others.