Design idea and working principle of big data bus platform DBus

2025-04-20 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains the design ideas and working principles of DBus, a big data bus platform, for readers who are new to it.

I. Background

A large amount of an enterprise's business data is stored in the databases of its various business systems. In the past, there were several common ways to synchronize this data:

Each data consumer extracts the data it needs during off-peak business hours (the disadvantages are repeated extraction and inconsistent data).

A unified data warehouse platform extracts data from each system through sqoop (the disadvantage is poor timeliness; sqoop extraction is typically T+1, that is, data arrives a day late).

Incremental changes are obtained through triggers or timestamps (the disadvantage is that this is intrusive to the business side and causes performance loss, among other problems).

None of these schemes is perfect. After studying and weighing the different implementation methods, we concluded that a log-based solution is the more reasonable way to achieve both data consistency and real-time delivery, while also offering message subscription to downstream systems.

The DBus (data bus) project was created to meet this requirement. DBus focuses on data collection and real-time data stream computing. Through simple and flexible configuration, it collects source data non-intrusively and uses a highly available stream computing framework to gather the data generated by the business processes of each of the company's IT systems. After conversion, the data is output in a unified JSON format (UMS) that different data consumers can subscribe to and consume. It serves as a data source for the data warehouse platform, the big data analysis platform, real-time reports, real-time marketing and other businesses.

II. System architecture and working principle

DBus is divided into two main parts: source data collection and multi-tenant data distribution. Kafka is the medium that connects the two. Users who do not need multi-tenant resource and data isolation can consume the data that the source collection layer writes to Kafka directly, without configuring multi-tenant data distribution.

2.1 DBus source data collection

Source data collection in DBus falls into two broad categories:

Reading RDBMS incremental logs to obtain incremental data in real time, with support for full pulls.

Obtaining data in real time through collection tools such as logstash, flume and filebeat, and producing structured output from it through visual configuration.

The implementation consists of the following main modules:

Log capture module: reads incremental logs from the RDBMS standby database and synchronizes them to Kafka in real time.

Incremental conversion module: converts incremental data to UMS data in real time, and handles schema changes, desensitization (data masking) and so on.

Full extraction program: pulls the full data set from the RDBMS and converts it to UMS data.

Log operator processing module: structures the log data from the different capture tools according to operator rules.

Heartbeat monitoring module: for RDBMS sources, regularly sends heartbeat data to the source, monitors it at the end of the pipeline and sends early-warning notifications; for log sources, monitors and warns directly at the end of the pipeline.

Web management module: manages all of the related modules.
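The heartbeat check described above can be illustrated with a minimal sketch. The `Heartbeat` record, its field names and the 3000 ms threshold are assumptions made for illustration; they are not taken from DBus itself.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    table: str        # table the heartbeat belongs to (hypothetical field)
    sent_ms: int      # time the heartbeat was written at the source
    received_ms: int  # time it was observed at the end of the pipeline


def check_heartbeats(beats, max_delay_ms=3000):
    """Return (table, delay_ms) for every heartbeat slower than the threshold."""
    alerts = []
    for hb in beats:
        delay = hb.received_ms - hb.sent_ms
        if delay > max_delay_ms:
            alerts.append((hb.table, delay))
    return alerts


beats = [
    Heartbeat("db1.orders", sent_ms=1_000, received_ms=1_800),
    Heartbeat("db1.users", sent_ms=1_000, received_ms=6_500),
]
late = check_heartbeats(beats)  # only db1.users exceeds the threshold
```

A real monitor would also have to treat a heartbeat that never arrives as a failure; this sketch only covers heartbeats that arrive late.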

2.2 Multi-tenant data distribution

When different tenants have different access rights and desensitization requirements for different source data, a Router distribution module is needed. It distributes the source data to the Topics assigned to each tenant according to the configured permissions, the source tables each user may access, the applicable desensitization rules, and so on. This layer brings in user management, Sink management, resource allocation, desensitization configuration and related functions of the DBus management system. Each project consumes from the topics allocated to it.
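The routing idea can be sketched roughly as follows. The tenant names, topic names, configuration layout and the use of MD5 as the masking rule are all hypothetical and only illustrate per-tenant filtering and masking; they do not reflect the actual Router configuration.

```python
import hashlib

# Hypothetical tenant configuration: which source tables each tenant may read,
# which topic its data goes to, and which columns must be masked.
TENANTS = {
    "marketing": {
        "topic": "tenant.marketing",
        "tables": {"db1.orders"},
        "mask": {"phone"},
    },
    "warehouse": {
        "topic": "tenant.warehouse",
        "tables": {"db1.orders", "db1.users"},
        "mask": set(),
    },
}


def mask(value):
    """Replace a value with its MD5 hex digest (one possible masking rule)."""
    return hashlib.md5(str(value).encode("utf-8")).hexdigest()


def route(table, row, tenants=TENANTS):
    """Return {topic: row_copy} for each tenant allowed to see `table`."""
    out = {}
    for cfg in tenants.values():
        if table not in cfg["tables"]:
            continue  # this tenant has no access to the source table
        out[cfg["topic"]] = {
            col: mask(val) if col in cfg["mask"] else val
            for col, val in row.items()
        }
    return out


routed = route("db1.orders", {"id": 7, "phone": "13800000000"})
```

Here the same source row reaches both tenants, but only the `warehouse` tenant sees the phone number in clear text.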

Main functions:

Non-intrusive access to a variety of data sources: the business system needs no modification; real-time incremental changes are obtained by reading the database system's logs non-intrusively. Currently, the supported RDBMS sources are MySQL and Oracle (for Oracle data sources, please refer to Oracle's related agreements), and log sources are supported through extraction schemes based on logstash, flume and filebeat.

Real-time transmission of massive data: uses a Storm-based stream computing framework, with second-level latency and overall high availability with no single point of failure.

Multi-tenant support: provides rich functions such as user management, resource allocation, Topology management and tenant table management. Different tenants can be given different access rights to source table data according to their needs, and different desensitization rules can be applied, achieving multi-tenant resource isolation and differentiated data security.

Awareness of source-side schema changes: when the schema changes on the source side, DBus detects the change automatically, adjusts the UMS version number, and notifies downstream consumers through Kafka messages and e-mail.
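As a rough illustration of schema change detection, the sketch below compares two column-to-type mappings and bumps a version counter when they differ. The function names and the integer version are assumptions; how DBus actually detects changes and numbers UMS versions is not described here.

```python
def detect_schema_change(old_fields, new_fields):
    """Compare two {column: type} mappings and describe the differences."""
    added = sorted(set(new_fields) - set(old_fields))
    removed = sorted(set(old_fields) - set(new_fields))
    retyped = sorted(
        col for col in set(old_fields) & set(new_fields)
        if old_fields[col] != new_fields[col]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


def next_version(version, change):
    """Bump the version counter only when something actually changed."""
    return version + 1 if any(change.values()) else version


old = {"id": "int", "name": "varchar"}
new = {"id": "bigint", "name": "varchar", "email": "varchar"}
change = detect_schema_change(old, new)
version = next_version(3, change)
```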

Real-time data desensitization: specified columns can be desensitized (masked) in real time as required. The available strategies include direct replacement, hashing with MD5, murmur and other algorithms, salted hashing, regular expression replacement and so on. Users can also develop their own jar packages to implement custom desensitization strategies that DBus does not cover.
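Three of these strategies can be sketched with the Python standard library alone. The function names are made up for this example, and murmur is left out because it is not part of the standard library; this is not the DBus implementation.

```python
import hashlib
import re


def replace_with(value, constant="***"):
    """Direct replacement: discard the value and return a constant."""
    return constant


def md5_hex(value, salt=""):
    """MD5 digest of the value, optionally prefixed with a salt."""
    return hashlib.md5((salt + value).encode("utf-8")).hexdigest()


def regex_replace(value, pattern, repl):
    """Regular expression replacement, e.g. to hide part of a number."""
    return re.sub(pattern, repl, value)


phone = "13812345678"
masked_phone = regex_replace(phone, r"(\d{3})\d{4}(\d{4})", r"\1****\2")
# masked_phone == "138****5678"
```

Salting matters for low-entropy values such as phone numbers: without a salt, an unsalted digest can be reversed by hashing every possible input.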

Initial loading: supports efficient initial loading and reloading to any specified output topic, so that customer needs can be met flexibly.

Unified, standardized message protocol: outputs messages in the unified UMS (JSON) schema format for easy consumption, provides a data-line-level ums_id to guarantee data ordering, and outputs Insert, Update (before/after) and Delete event data.
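The role of an ordering id can be shown with a small sketch. The message layout below is purely illustrative and is not the real UMS schema; apart from the `ums_id` name mentioned above, every field name and operation label is an assumption.

```python
import itertools
import json

# Hypothetical operation labels for the event types named in the text.
OPS = {"insert", "update_before", "update_after", "delete"}


class EventBuilder:
    """Builds events carrying a monotonically increasing ums_id."""

    def __init__(self):
        self._ids = itertools.count(1)

    def build(self, op, table, row):
        if op not in OPS:
            raise ValueError(f"unknown operation: {op}")
        return {"ums_id": next(self._ids), "ums_op": op, "table": table, "row": row}


def in_order(events):
    """Restore the original sequence by sorting on ums_id."""
    return sorted(events, key=lambda e: e["ums_id"])


builder = EventBuilder()
events = [
    builder.build("insert", "db1.orders", {"id": 1, "status": "new"}),
    builder.build("update_before", "db1.orders", {"id": 1, "status": "new"}),
    builder.build("update_after", "db1.orders", {"id": 1, "status": "paid"}),
    builder.build("delete", "db1.orders", {"id": 1, "status": "paid"}),
]
shuffled = [events[2], events[0], events[3], events[1]]
replayed = [e["ums_op"] for e in in_order(shuffled)]
wire = json.dumps(events[0])  # JSON text as it might be sent downstream
```

A consumer that receives events out of order can sort on the id before applying them, which is the property the ordering id is meant to provide.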

Reliable multi-channel message subscription and distribution: uses Kafka to store and deliver messages, which makes multi-user subscription reliable and convenient.

Partitioned table / series table data collection: the data of a partitioned table can be aggregated into one "logical table". User-defined series of tables can likewise be aggregated into one "logical table".
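One way to picture the "logical table" is a set of name-matching rules that map physical tables onto a shared logical name. The rules and table names below are hypothetical and only illustrate the idea; they are not DBus configuration.

```python
import re

# Hypothetical rules: regex on the physical table name -> logical table name.
RULES = [
    (re.compile(r"^orders_\d{4}_\d{2}$"), "orders"),
    (re.compile(r"^user_log_\d+$"), "user_log"),
]


def logical_table(physical):
    """Map a physical table name to its logical table, if a rule matches."""
    for pattern, logical in RULES:
        if pattern.match(physical):
            return logical
    return physical  # tables without a rule keep their own name


def group_by_logical(rows):
    """Group (physical_table, row) pairs under their logical table."""
    grouped = {}
    for table, row in rows:
        grouped.setdefault(logical_table(table), []).append(row)
    return grouped


grouped = group_by_logical([
    ("orders_2024_01", {"id": 1}),
    ("orders_2024_02", {"id": 2}),
    ("customers", {"id": 9}),
])
```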

Real-time monitoring & early warning: the visual monitoring system shows the real-time traffic and latency of each data line at any time; when a data line becomes abnormal, the responsible person is notified automatically by e-mail or SMS according to the configured policy.
