This article describes OnZoom's integrated batch-and-stream data architecture built on Apache Hudi. The approach is simple, fast, and practical.
1. Background
OnZoom is a new Zoom product: an online events platform and marketplace built on Zoom Meeting. As an extension of the Zoom unified communications platform, OnZoom is an end-to-end solution that lets paid Zoom users create, host, and monetize events such as fitness classes, concerts, stand-up or improv shows, and music lessons on the Zoom conferencing platform.
In the OnZoom data platform, source data falls into two categories: MySQL DB data and log data. Log data in Kafka is consumed in real time by a Spark Streaming job, while MySQL data is synchronized on a schedule by a Spark Batch job; both sink the source data to AWS S3. Spark Batch jobs are then scheduled for data warehouse development, and finally the data is sunk to the appropriate storage according to the actual business requirements or usage scenarios.
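For illustration only, the sketch below shows the log branch of this first-version pipeline using Spark Structured Streaming (the article just says "Spark Streaming job"); the Kafka servers, topic, bucket, and checkpoint paths are assumptions, not the production values.

// A minimal sketch of the first-version log pipeline; all names are illustrative.
import org.apache.spark.sql.SparkSession

object LogToS3 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("onzoom-log-to-s3").getOrCreate()

    val logs = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "onzoom-logs")
      .load()
      .selectExpr("CAST(value AS STRING) AS raw", "timestamp")

    // Each micro-batch writes its own files to S3, which is the source of the
    // small-file problem noted below.
    logs.writeStream
      .format("parquet")
      .option("path", "s3://my-bucket/raw/logs")
      .option("checkpointLocation", "s3://my-bucket/checkpoints/logs")
      .start()
      .awaitTermination()
  }
}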
Issues with the first-version architecture
Pulling MySQL data via SQL and syncing it to S3 is an offline process, and in some scenarios (such as physical deletes) only a full synchronization is possible each time.
The Spark Streaming job that sinks data to S3 has to deal with small files.
The default S3 storage layout does not support CDC (Change Data Capture), so only an offline data warehouse is supported.
For security reasons we sometimes need to delete or update certain customer data; this can only be done by recomputing the full table (or a specified partition) and overwriting it, which performs poorly.
2. Architecture optimization and upgrade
To address the problems above, after extensive technical research, selection, and POCs, we carried out the following two architecture optimizations and upgrades.
2.1 Canal
MySQL Binlog is MySQL's binary log; it records all table-structure and table-data changes in MySQL.
Canal parses the MySQL Binlog, provides incremental data subscription and consumption, and sinks the data to Kafka to achieve CDC.
Consuming the Binlog in real time with a Spark Streaming job solves the timeliness and physical-deletion issues of the first-version architecture described above.
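As an illustration (not the exact production job), the sketch below consumes Canal's JSON messages from Kafka with Structured Streaming. The schema is an assumption based on Canal's typical flat-message format, and the topic, servers, and column handling are placeholders.

// A hedged sketch: parse Canal flat-message JSON (database, table, type, ts, data[]).
// Assumes an active SparkSession `spark` (e.g. spark-shell).
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val canalSchema = new StructType()
  .add("database", StringType)
  .add("table", StringType)
  .add("type", StringType)                                 // INSERT / UPDATE / DELETE
  .add("ts", LongType)
  .add("data", ArrayType(MapType(StringType, StringType)))

val changes = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")
  .option("subscribe", "mysql-binlog")
  .load()
  .select(from_json(col("value").cast("string"), canalSchema).as("evt"))
  .select("evt.*")

// Physical deletes now arrive as explicit DELETE events instead of silently
// disappearing from periodic full extracts.
val deletes = changes.filter(col("type") === "DELETE")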
2.2 Apache Hudi
We needed a data lake solution that is compatible with S3 storage and supports both batch processing of large volumes of data and streaming processing of incremental data. We ultimately chose Hudi as our data lake architecture for the following reasons:
Hudi maintains an index to support efficient record-level additions and deletions.
Hudi maintains a timeline containing all instant operations performed on the dataset at different instant times, so CDC data (incremental queries) can be obtained from a given time onward. It also provides read-optimized queries over the latest raw Parquet files. This lets us build a unified stream-and-batch architecture instead of a typical Lambda architecture.
Hudi intelligently and automatically manages file sizes without user intervention, which solves the small-file problem.
Hudi supports S3 storage and the Spark, Hive, and Presto query engines; the entry cost is low since you only need to add the corresponding Hudi bundle (a minimal read sketch follows).
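As a sketch of that low entry cost, reading an existing Hudi table from Spark looks roughly like this. The S3 path is illustrative, and an active SparkSession with the hudi-spark-bundle on the classpath (e.g. spark-shell) is assumed.

// A minimal read sketch; older Hudi versions may require partition globbing,
// e.g. load("s3://my-bucket/hudi/events/*/*").
val events = spark.read
  .format("hudi")
  .load("s3://my-bucket/hudi/events")

events.createOrReplaceTempView("events")
spark.sql("SELECT _hoodie_commit_time, count(*) FROM events GROUP BY _hoodie_commit_time").show()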
3. Practical experience with Hudi
The default PAYLOAD_CLASS_OPT_KEY for a Hudi upsert is OverwriteWithLatestAvroPayload, which means that on upsert every field is overwritten with the value from the incoming DataFrame. In many scenarios, however, you only want to update some fields and keep the rest consistent with the existing data. In that case, set PAYLOAD_CLASS_OPT_KEY to OverwriteNonDefaultsWithLatestAvroPayload and set the fields that should not be updated to null. This upsert approach has its own limitations, though: for example, it cannot update a value to null.
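A hedged sketch of such a partial-field upsert is shown below. The table name, key/precombine columns, S3 path, and the sample DataFrame are illustrative; the option key and payload class are those of the Hudi 0.7.x Spark datasource. An active SparkSession `spark` (e.g. spark-shell with the Hudi bundle) is assumed.

import org.apache.spark.sql.SaveMode
import spark.implicits._

// Incoming partial update: `title` is null, meaning "keep the stored value".
val updatesDf = Seq(("evt-1", "2021-06-01 10:00:00", null.asInstanceOf[String]))
  .toDF("event_id", "updated_at", "title")

updatesDf.write
  .format("hudi")
  .option("hoodie.table.name", "events")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "event_id")
  .option("hoodie.datasource.write.precombine.field", "updated_at")
  // Default OverwriteWithLatestAvroPayload would overwrite every field;
  // this payload keeps existing values for fields that arrive as null.
  .option("hoodie.datasource.write.payload.class",
    "org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload")
  .mode(SaveMode.Append)
  .save("s3://my-bucket/hudi/events")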
We have both real-time data synchronization and offline data-rerun scenarios, but we currently run Hudi 0.7.0, which does not support multiple jobs writing to the same Hudi table concurrently. The temporary workaround is to pause the real-time task whenever a rerun is needed; since concurrent writing is supported from version 0.8.0, we will consider upgrading later.
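If and when we upgrade, enabling concurrent writers would look roughly like the sketch below, based on the 0.8.0 optimistic-concurrency-control options. This is not something we run on 0.7.0; the ZooKeeper endpoint, lock paths, table, columns, and sample data are all assumptions.

// A hedged sketch of 0.8.0-style optimistic concurrency control.
// Assumes an active SparkSession `spark`; all values are illustrative.
import org.apache.spark.sql.SaveMode
import spark.implicits._

val df = Seq(("evt-1", "2021-06-01 10:00:00")).toDF("event_id", "updated_at")

df.write
  .format("hudi")
  .option("hoodie.table.name", "events")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "event_id")
  .option("hoodie.datasource.write.precombine.field", "updated_at")
  .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
  .option("hoodie.cleaner.policy.failed.writes", "LAZY")
  .option("hoodie.write.lock.provider",
    "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
  .option("hoodie.write.lock.zookeeper.url", "zk1")
  .option("hoodie.write.lock.zookeeper.port", "2181")
  .option("hoodie.write.lock.zookeeper.lock_key", "events")
  .option("hoodie.write.lock.zookeeper.base_path", "/hudi/locks")
  .mode(SaveMode.Append)
  .save("s3://my-bucket/hudi/events")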
Initially our tasks synchronized Hudi metadata to the Hive table on every data change, which is the default. For real-time tasks, however, updating the Hive Metastore on every commit wastes resources, because most operations only change data, not the table structure or partitions. So we later turned off metadata synchronization (hudi-hive-sync-bundle-*.jar) for real-time tasks and run a separate sync only when the metadata actually needs updating.
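A minimal sketch of what the real-time writer's options look like with per-commit Hive sync turned off (the key name is from the 0.7.x Spark datasource; the table, columns, path, and sample data are illustrative; an active SparkSession `spark` is assumed):

// Reuse the usual write options but skip the Hive Metastore sync on every commit;
// metadata is synced separately, only when the schema or partitions change.
import org.apache.spark.sql.SaveMode
import spark.implicits._

val hudiOptions = Map(
  "hoodie.table.name" -> "events",
  "hoodie.datasource.write.operation" -> "upsert",
  "hoodie.datasource.write.recordkey.field" -> "event_id",
  "hoodie.datasource.write.precombine.field" -> "updated_at",
  "hoodie.datasource.hive_sync.enable" -> "false"
)

val df = Seq(("evt-1", "2021-06-01 10:00:00")).toDF("event_id", "updated_at")

df.write.format("hudi")
  .options(hudiOptions)
  .mode(SaveMode.Append)
  .save("s3://my-bucket/hudi/events")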
The semantics of a Hudi incremental query is to return all change data after a given time, so it looks up all historical commit files in the timeline. However, historical commit files are cleaned up according to the retainCommits parameter, so the complete change data may not be available if the given time range is large. If you only care about the final state of the data, you can filter the incremental data by _hoodie_commit_time.
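A hedged sketch of such an incremental query, keeping only the final state per record key (option keys as in the 0.7.x Spark datasource; the begin instant, key column, and path are illustrative; an active SparkSession `spark` is assumed):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val inc = spark.read
  .format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20210101000000")
  .load("s3://my-bucket/hudi/events")

// Keep only the latest change per record key if only the final state matters.
val latestPerKey = inc
  .withColumn("rn", row_number().over(
    Window.partitionBy("event_id").orderBy(col("_hoodie_commit_time").desc)))
  .filter(col("rn") === 1)
  .drop("rn")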
Hudi's default Spark shuffle parallelism is 1500; adjust it to match the actual input data size (the corresponding parameters are hoodie.[insert|upsert|bulkinsert].shuffle.parallelism).
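A minimal sketch of tuning this for an upsert (the value 200 and the other options are illustrative; an active SparkSession `spark` is assumed):

import org.apache.spark.sql.SaveMode
import spark.implicits._

val df = Seq(("evt-1", "2021-06-01 10:00:00")).toDF("event_id", "updated_at")

df.write.format("hudi")
  .option("hoodie.table.name", "events")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "event_id")
  .option("hoodie.datasource.write.precombine.field", "updated_at")
  // Lower the default 1500 shuffle parallelism to match a small input.
  .option("hoodie.upsert.shuffle.parallelism", "200")
  .mode(SaveMode.Append)
  .save("s3://my-bucket/hudi/events")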
Hudi stores data in columnar Parquet format and supports backward-compatible schema evolution, but only adding new fields in the incoming DataFrame is supported; full schema evolution is expected in version 0.10. If fields need to be deleted or renamed, the table can only be overwritten. In addition, adding fields may cause the Hive metadata sync to fail, in which case you need to execute a drop table in Hive first.
Hudi insert handles data with the same recordKey differently depending on configuration. The decisive parameters are the following three:
hoodie.combine.before.insert
hoodie.parquet.small.file.limit
hoodie.merge.allow.duplicate.on.inserts
Among them, hoodie.combine.before.insert decides whether data with the same recordKey within the same batch is merged (default false), while hoodie.parquet.small.file.limit and hoodie.merge.allow.duplicate.on.inserts control the small-file merge threshold and how small files are merged. If hoodie.parquet.small.file.limit > 0 and hoodie.merge.allow.duplicate.on.inserts is false, data with the same recordKey is merged during small-file merging, so deduplication happens with some probability (namely, when records with the same recordKey are written to the same file). If hoodie.parquet.small.file.limit > 0 and hoodie.merge.allow.duplicate.on.inserts is true, data with the same recordKey is not merged during small-file merging.
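A hedged sketch of an insert that de-duplicates same-recordKey rows within the batch using these parameters (the values, columns, path, and sample data are illustrative; an active SparkSession `spark` is assumed):

// hoodie.combine.before.insert merges same-key rows within this batch;
// the small-file settings control how same-key rows are handled when small
// files are rewritten (100 MB threshold shown here).
import org.apache.spark.sql.SaveMode
import spark.implicits._

val batchDf = Seq(
  ("evt-1", "2021-06-01 10:00:00"),
  ("evt-1", "2021-06-01 11:00:00")   // same recordKey within one batch
).toDF("event_id", "updated_at")

batchDf.write
  .format("hudi")
  .option("hoodie.table.name", "events")
  .option("hoodie.datasource.write.operation", "insert")
  .option("hoodie.datasource.write.recordkey.field", "event_id")
  .option("hoodie.datasource.write.precombine.field", "updated_at")
  .option("hoodie.combine.before.insert", "true")
  .option("hoodie.parquet.small.file.limit", "104857600")
  .option("hoodie.merge.allow.duplicate.on.inserts", "false")
  .mode(SaveMode.Append)
  .save("s3://my-bucket/hudi/events")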
At this point, you should have a deeper understanding of OnZoom's integrated architecture based on Apache Hudi. The best way to consolidate it is to try it out in practice.