Guide: The real-time data platform (RTDP, Real-time Data Platform) is an important and common piece of big data infrastructure. The first part (design) introduced RTDP from the perspectives of modern data warehouse architecture and typical data processing, and discussed the overall design of RTDP. This second part (technology) introduces the technology selection and related components of RTDP from a technical point of view, and discusses the patterns suited to different application scenarios. This is where RTDP's agile road unfolds.
Extended reading: How to Design a Real-time Data Platform (Design)
I. Introduction to Technology Selection
In the design part, we presented the overall architecture of RTDP (Figure 1). In this technical part, we will recommend an overall selection of technical components; briefly introduce each technical component, especially the four platforms we abstracted and implemented (unified data collection platform, unified stream processing platform, unified computing service platform, unified data visualization platform); and discuss end-to-end, cross-cutting Pipeline topics such as functional integration, data management, data security, and so on.
Figure 1 RTDP architecture
1.1 Overall technology selection
Figure 2 Overall technology selection
First, let's take a brief look at Figure 2:
Data sources and clients: these enumerate the common data source types found in most data application projects.
Data bus platform DBus: as the unified data collection platform, DBus is responsible for docking various data sources, extracting data incrementally or in full, performing some common data processing, and finally publishing the processed messages on Kafka.
Distributed messaging system Kafka: Kafka connects message producers and consumers with distributed, highly available, high-throughput, publish-subscribe capabilities.
Stream processing platform Wormhole: as the unified stream processing platform, Wormhole is responsible for on-stream processing and for docking various target data stores. It consumes messages from Kafka, supports configuring SQL on the stream to implement on-stream processing logic, and supports configurably landing data into different target data stores (Sinks) with an eventual-consistency (idempotent) effect.
Data computing and storage layer: here the RTDP architecture keeps the choice of components open; users can select appropriate storage based on actual data characteristics, computing patterns, access patterns, data volume and other information to solve specific data project problems. RTDP also supports selecting several different data stores at the same time, giving more flexibility in supporting different project needs.
Computing service platform Moonbox: as the unified computing service platform, Moonbox is responsible for integrating heterogeneous data stores, pushdown optimization, and mixed computation across them (data virtualization technology). Toward data display and interaction, it is responsible for unified metadata queries, unified data computing and distribution, a unified data query language (SQL), unified data service interfaces, and so on.
Visualization application platform Davinci: as the unified data visualization platform, Davinci supports a variety of data visualization and interaction requirements in a configurable way, and can be integrated into other data applications to provide partial data visualization solutions. It also supports different data practitioners collaborating on the platform to complete everyday data applications.
Other data terminal consumer systems, such as the data development platform Zeppelin and the data algorithm platform Jupyter, are not introduced in this article.
Cross-cutting topics such as data management, data security, development, operation and maintenance, and the driver engine can be integrated and further developed against the service interfaces of DBus, Wormhole, Moonbox and Davinci to support end-to-end management and governance requirements.
Below, we further expand on the technical components and cross-cutting topics in the figure above, introduce the functional features of the technical components, focus on the design ideas of our self-developed components, and discuss the cross-cutting topics.
1.2 Technical components introduction
1.2.1 Data bus platform DBus
Figure 3 DBus of RTDP architecture
1.2.1.1 DBus design idea
1) Design ideas from an external perspective
Responsible for docking different data sources and extracting incremental data in real time, using operation log extraction for databases and supporting a variety of Agents for log types.
All messages are published on Kafka in a unified UMS message format. UMS is a standardized JSON format that carries its own metadata. Through UMS, logical messages are decoupled from physical Kafka Topics, so that multiple UMS message tables can travel over the same Topic.
Supports full data pulls from databases, merged with incremental data into the UMS message stream, so that downstream consumption is transparent and unaffected.
2) Design ideas from an internal perspective
Performs data processing based on the Storm computing engine to keep end-to-end message latency at a minimum. Standardizes and formats data from different data sources to generate UMS messages, including:
✔ generating a unique, monotonically increasing id for each message, corresponding to the system field ums_id_
✔ determining the event timestamp of each message, corresponding to the system field ums_ts_
✔ determining the operation type of each message (insert, update, delete, or insert-only), corresponding to the system field ums_op_
Changes to database table structures are sensed in real time and managed with version numbers, so that upstream metadata changes are clear to downstream consumers. When publishing to Kafka, messages are guaranteed to be strongly ordered (though not absolutely ordered) with at-least-once semantics. A heartbeat table mechanism ensures end-to-end message probing and liveness awareness.
1.2.1.2 DBus features
Supports configurable full data pulls
Supports configurable incremental data pulls
Supports configurable online log formatting
Supports visual monitoring and alerting
Supports configurable multi-tenant security control
Supports aggregating sub-table data into a single logical table
1.2.1.3 DBus technical architecture
Figure 4 DBus data flow architecture diagram
For more technical details and user interface of DBus, please see:
GitHub: https://github.com/BriData
1.2.2 Distributed message system Kafka
Kafka has become the de facto standard distributed messaging system. Of course, Kafka keeps expanding and improving, and it now also offers certain storage and stream processing capabilities. There are many articles on Kafka's own functionality and technology, so this article will not elaborate on Kafka's capabilities.
Here we specifically discuss the topics of message metadata management (Metadata Management) and schema evolution (Schema Evolution) on Kafka.
Figure 5
Image source: http://cloudurable.com/images/kafka-ecosystem-rest-proxy-schema-registry.png
Figure 5 shows that Confluent, the company behind Kafka, introduced a metadata management component, Schema Registry, into its solution. This component is mainly responsible for managing the metadata and Topic information of messages transferred on Kafka, and it provides a series of metadata management services. The component is introduced so that Kafka's consumers can understand what data is being transferred on different Topics, together with its metadata, and parse and consume it effectively.
Any data transfer link, no matter what system it flows through, raises the question of managing the metadata of that link, and Kafka is no exception. Schema Registry is a centralized metadata management solution for Kafka data links, and on top of Schema Registry, Confluent provides corresponding Kafka data security mechanisms and schema evolution mechanisms.
For more information about Schema Registry, please see:
Kafka Tutorial:Kafka, Avro Serialization and the Schema Registry
http://cloudurable.com/blog/kafka-avro-schema-registry/index.html
So how does the RTDP architecture address Kafka message metadata management and schema evolution?
1.2.2.1 Metadata Management
DBus automatically records database metadata changes sensed in real time and provides them as a service
DBus automatically records the metadata of online-formatted logs and provides it as a service
DBus publishes unified UMS messages on Kafka; since UMS itself carries the message metadata, downstream consumers can obtain the metadata directly from the UMS message without calling a centralized metadata service
1.2.2.2 Schema Evolution
UMS messages carry the Namespace of the schema. A Namespace is a 7-layer location string that uniquely locates any table in any life cycle, acting as the data table's IP address, as follows:
[Datastore].[Datastore Instance].[Database].[Table].[Table Version].[Database Partition].[Table Partition]
Example: oracle.oracle01.db1.table1.v2.dbpar01.tablepar01
Here [Table Version] represents the version number of the table's schema; when the data source is a database, it is maintained automatically by DBus.
In the RTDP architecture, Kafka's downstream is consumed by Wormhole. When consuming UMS, Wormhole treats [Table Version] as *, meaning that when a table's upstream schema changes and the version number is automatically incremented, Wormhole ignores the version change and consumes incremental/full data of all versions of the table. So how does Wormhole support compatible schema evolution? In Wormhole, the SQL and output fields processed on the stream are configurable; when an upstream schema change is a "compatible change" (adding fields, or widening the type of an existing field, etc.), it does not affect the correct execution of the Wormhole SQL. When an incompatible change happens upstream, Wormhole reports an error, and human intervention is needed to fix the logic for the new schema.
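To make the compatibility notion concrete, here is a small illustration with a hypothetical table and columns (not from the original article); an on-stream SQL configured on a Wormhole Flow might look like:

-- Hypothetical on-stream processing SQL for a Wormhole Flow.
SELECT id, name, amount * rate AS normalized_amount
FROM source_table
WHERE amount > 0;

A compatible upstream change, such as adding a new column address to source_table, leaves this statement unaffected because it never references the new column; dropping or renaming amount, however, is an incompatible change that breaks the statement and requires human intervention, as described above.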
As can be seen from the above, Schema Registry and DBus+UMS are two different design approaches to metadata management and schema evolution, each with its own advantages and disadvantages; see the simple comparison in Table 1.
Table 1 comparison between Schema Registry and DBus+UMS
Here is an example of UMS:
Figure 6 UMS message example
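Since the figure itself is not reproduced in text, below is a rough, hand-written sketch of what a UMS message could look like. The system fields ums_id_, ums_ts_ and ums_op_ and the namespace format come from the text above; the overall layout is an assumption, not the authoritative UMS specification.

{
  "schema": {
    "namespace": "mysql.mysql01.db1.table1.v2.dbpar01.tablepar01",
    "fields": [
      { "name": "ums_id_", "type": "long",     "nullable": false },
      { "name": "ums_ts_", "type": "datetime", "nullable": false },
      { "name": "ums_op_", "type": "string",   "nullable": false },
      { "name": "id",      "type": "long",     "nullable": false },
      { "name": "name",    "type": "string",   "nullable": true  }
    ]
  },
  "payload": [
    { "tuple": ["1", "2018-06-02 12:00:00.000", "i", "100", "hello"] }
  ]
}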
1.2.3 Stream processing platform Wormhole
Figure 7 Wormhole of RTDP architecture
1.2.3.1 Wormhole design idea
1) Design ideas from an external perspective
Consumes UMS messages and custom JSON messages from Kafka.
Responsible for docking different target data stores (Sinks), achieving eventual consistency on the Sink through idempotent logic.
Supports configuring SQL to implement on-stream processing logic.
Provides the Flow abstraction: a Flow is defined by a Source Namespace and a Sink Namespace and is unique; processing logic can be defined on a Flow. A Flow is the logical abstraction of on-stream processing: by decoupling it from the physical Spark Streaming / Flink streams, one Stream can handle multiple Flows, and a Flow can drift freely between Streams.
Supports a backfill-based Kappa architecture; supports a Wormhole Job-based Lambda architecture.
2) Design ideas from an internal perspective
Performs stream processing on the Spark Streaming and Flink computing engines: Spark Streaming supports high-throughput, batch Lookup and batch Sink-write scenarios, while Flink supports low-latency, CEP-rule scenarios.
Implements idempotent storage logic for different Sinks through ums_id_ and ums_op_ (a sketch follows after the lists below).
Optimizes Lookup logic through computation pushdown.
Supports functional flexibility and design consistency through abstraction:
✔ unified DAG higher-order fractal abstraction
✔ unified general stream message UMS protocol abstraction
✔ unified data logical table Namespace abstraction
Abstracts several interfaces to support extensibility:
✔ SinkProcessor: extend support for more Sinks
✔ SwiftsInterface: support for custom on-stream processing logic
✔ UDF: support for more UDFs in on-stream processing
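As mentioned above, Wormhole achieves eventual consistency on Sinks through idempotent logic keyed on ums_id_ and ums_op_. A minimal sketch of how such logic might look on a relational Sink is given below; table and column names are hypothetical, and the actual Wormhole implementation is more involved.

-- Update the record only when the incoming ums_id_ is newer, so that
-- replayed (at-least-once) messages can never roll the Sink backwards:
UPDATE sink_table
SET name = 'hello', ums_id_ = 20180602120000001, ums_op_ = 'u'
WHERE id = 100 AND ums_id_ < 20180602120000001;

-- Insert the record only if the business key is not present yet:
INSERT INTO sink_table (id, name, ums_id_, ums_ts_, ums_op_)
SELECT 100, 'hello', 20180602120000001, '2018-06-02 12:00:00', 'u'
FROM (SELECT 1) AS t
WHERE NOT EXISTS (SELECT 1 FROM sink_table WHERE id = 100);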
Aggregates dynamic metrics and statistics of streaming jobs in real time through Feedback messages.
1.2.3.2 Wormhole features
Supports visual, configuration-based, SQL-based development and implementation of streaming projects
Supports instruction-driven dynamic stream management, operation and maintenance, diagnosis and monitoring
Supports unified structured UMS messages and custom semi-structured JSON messages
Supports processing of insert/update/delete three-state event message streams
Supports a single physical stream simultaneously processing multiple logical business flows
Supports Lookup Anywhere and Pushdown Anywhere on the stream
Supports event-timestamp-based stream processing driven by business policies
Supports UDF registration management and dynamic loading
Supports concurrent idempotent writes to multiple target data systems
Supports multi-level data quality management based on incremental messages
Supports stream processing and batch processing based on incremental messages
Supports both the Lambda architecture and the Kappa architecture
Supports seamless integration with third-party systems and can serve as the flow-control engine of a third-party system
Supports private cloud deployment, security permission control and multi-tenant resource management
1.2.3.3 Wormhole technical architecture
Figure 8 Wormhole data flow architecture diagram
For more technical details and user interface of Wormhole, please see:
GitHub: https://github.com/edp963/wormhole
1.2.4 Common data computing and storage selection
The RTDP architecture takes an open, integrative attitude toward data computing and storage selection. Different data systems have their own strengths and suitable scenarios, and no single data system fits all storage and computing scenarios. Therefore, when appropriate, mature, mainstream data systems emerge, Wormhole and Moonbox integrate and support them as needed.
Here are some general selections:
Relational databases (Oracle/MySQL, etc.): suitable for complex relational computation over small data volumes
Distributed column storage system
✔ Kudu: scan optimization, suitable for OLAP analysis and computation scenarios
✔ HBase: random read and write, suitable for providing data service scenarios
✔ Cassandra: high-performance write, suitable for high-frequency write scenarios of massive data
✔ ClickHouse: high-performance computing, suitable for insert-only write scenarios (update and delete operations will be supported later)
Distributed file system
✔ HDFS/Parquet/Hive: append-only, suitable for batch computing scenarios over massive data
Distributed document system
✔ MongoDB: balanced capabilities, suitable for moderately complex computation on large data volumes
Distributed indexing system
✔ ElasticSearch: indexing ability, suitable for fuzzy query and OLAP analysis scenarios
Distributed precomputing system
✔ Druid/Kylin: precomputing capability, suitable for high-performance OLAP analysis scenarios
1.2.5 Computing service platform Moonbox
Figure 9 Moonbox of RTDP architecture
1.2.5.1 Moonbox Design idea
1) Design ideas from an external perspective
Responsible for docking different data systems and supporting ad hoc mixed computation across heterogeneous data systems (a sketch follows below).
Provides three client call modes: RESTful service, JDBC connection, ODBC connection.
Unifies metadata closure; unifies query language (SQL) closure; unifies access control closure.
Provides two modes for writing query results: Merge and Replace.
Provides two interaction modes: Batch mode and Adhoc mode.
Implements data virtualization and multi-tenancy, so it can be regarded as a virtual database.
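To illustrate what an ad hoc mixed computation could look like, here is a sketch of a unified SQL query spanning two mounted data systems. The namespaces, tables and columns are hypothetical, chosen only to show the idea of pushdown plus mixed computation.

-- Join data living in MySQL with data living in Elasticsearch.
SELECT o.user_id, u.user_name, SUM(o.amount) AS total_amount
FROM mysql_sales.orders o      -- scan/filter pushed down to MySQL where possible
JOIN es_profile.users u        -- lookup pushed down to Elasticsearch
  ON o.user_id = u.user_id
GROUP BY o.user_id, u.user_name;  -- remaining mixed computation done by Moonbox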
2) Design ideas from an internal perspective
Parses the SQL through the conventional Catalyst processing flow, finally generating logical execution subtrees for the pushdown-capable data systems; the pushed-down computation runs there, and the results are pulled back for mixed computation and returned.
Supports a two-tier namespace (database.table) to provide the experience of a virtual database.
Provides the distributed service module Moonbox Grid for high availability and high concurrency.
Provides a fast execution channel for fully pushed-down logic (no mixed computation).
1.2.5.2 Moonbox features
Supports seamless mixed computation across heterogeneous systems
Supports unified SQL syntax for query, computation and writing
Supports three call modes: RESTful service, JDBC connection, ODBC connection
Supports two interaction modes: Batch mode, Adhoc mode
Supports a Cli Command tool and Zeppelin
Supports a multi-tenant user permission system
Supports table-level permissions, column-level permissions, read permissions, write permissions, UDF permissions
Supports YARN scheduler resource management
Supports metadata services
Supports scheduled tasks
Supports security policies
1.2.5.3 Moonbox technical architecture
Figure 10 Moonbox logic module
For more technical details and user interface of Moonbox, please see:
GitHub: https://github.com/edp963/moonbox
1.2.6 Visual Application platform Davinci
Figure 11 Davinci of RTDP architecture
1.2.6.1 Davinci Design idea
1) Design ideas from an external perspective
Responsible for all kinds of data visualization and display functions.
Supports JDBC data sources.
Provides an equal-weight user system: each user can create their own Org, Team and Project.
Supports writing data processing logic in SQL.
Supports drag-and-drop editing and visual display.
Provides an environment for multi-user social division of labor and collaboration.
Provides a variety of chart interaction and customization capabilities.
Provides the ability to be embedded and integrated into other data applications, meeting a wide range of data visualization needs.
2) Design ideas from an internal perspective
Everything revolves around View and Widget. A View is the logical view of the data; a Widget is the visual view of the data. Through user-defined selection of categorical data, ordinal data and quantitative data, views are displayed automatically according to reasonable visualization logic.
1.2.6.2 Davinci features
1) Data sources
Supports JDBC data sources
Supports CSV file upload
2) Data views
Supports defining SQL templates, SQL highlighting, SQL testing, and write-back operations
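As an illustration of a data view, here is a hypothetical SQL template of the kind a View might define; the placeholder syntax is purely illustrative and not necessarily Davinci's actual template syntax.

-- Hypothetical View SQL template; $start_date$ would be supplied at
-- query time, e.g. by a dashboard controller component.
SELECT city, SUM(amount) AS total_amount
FROM dw.orders
WHERE order_date >= $start_date$
GROUP BY city;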
3) Visual components
Supports predefined charts
Supports controller components
Supports free styles
4) Interaction capabilities
Supports full-screen display of visual components
Supports local controllers
Supports filtering linkage between visual components
Supports group control controllers
Supports local advanced filters on visual components
Supports pagination and sliders for large data display
5) Integration capability
Supports CSV download of visual components
Supports public sharing of visual components
Supports authorized sharing of visual components
Supports public sharing of dashboards
Supports authorized sharing of dashboards
6) Security permissions
Supports data row permissions
Supports LDAP login integration
For more technical details and user interface of Davinci, please see:
GitHub: https://github.com/edp963/davinci
1.3 Discussion of cross-cutting topics
1.3.1 Data management
1) Metadata management
DBus can obtain data source metadata in real time and provide query services; Moonbox can obtain data system metadata in real time and provide query services. For the RTDP architecture, the metadata of real-time data sources and ad hoc data sources can be collected by calling the RESTful services of DBus and Moonbox, and an enterprise metadata management system can be built on top of that.
2) Data quality
Wormhole can be configured to land messages into HDFS in real time (hdfslog). Wormhole Jobs based on hdfslog support the Lambda architecture; backfill based on hdfslog supports the Kappa architecture. The Sink can be refreshed periodically through scheduled tasks, choosing either the Lambda or the Kappa approach, to ensure the eventual consistency of the data. Wormhole also supports feeding back, in real time, information about messages that hit exceptions in on-stream processing or Sink writes into the Wormhole system, and it provides RESTful services for third-party applications to call and handle them. Moonbox can run ad hoc computations across heterogeneous systems, which gives it a "Swiss Army knife" kind of convenience: we can write scheduled SQL script logic through Moonbox to compare data between the heterogeneous systems of interest, or compute statistics over the table fields of interest, and thus develop a data quality detection system on top of Moonbox's capabilities.
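For example, a scheduled data-quality check through Moonbox might compare a source-side table with its Sink copy; the namespaces and table names below are hypothetical.

-- Compare row counts between the source and the Sink; a follow-up job
-- could raise an alert whenever the two diverge beyond the expected
-- replication lag.
SELECT
  (SELECT COUNT(*) FROM mysql_src.orders) AS src_count,
  (SELECT COUNT(*) FROM kudu_sink.orders) AS sink_count;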
3) Lineage analysis
The on-stream processing logic of Wormhole is usually expressed in SQL, and these SQL statements can be collected through RESTful services. Moonbox controls the unified entrance for data queries, and all of its logic is SQL, which can be collected from Moonbox logs. For the RTDP architecture, by collecting the SQL of real-time processing logic and ad hoc processing logic through Wormhole's RESTful services and Moonbox's logs, an enterprise data lineage analysis system can be built.
1.3.2 Data security
Figure 12 RTDP data Security
The figure above shows that in the RTDP architecture, the four open source platforms cover the end-to-end data flow link, and every node has considerations and support for data security, ensuring the end-to-end data security of the real-time data pipeline.
In addition, because Moonbox has become the unified entrance for application-layer data access, operation audit logs based on Moonbox can yield a great deal of security-related information; a data security early-warning mechanism can be established around the operation audit log, and an enterprise data security system built from there.
1.3.3 Development, operation and maintenance
1) Operation and maintenance management
The operation and maintenance management of real-time data processing has always been a pain point. DBus and Wormhole provide visual operation and maintenance management capabilities through visual UI, which makes manual operation and maintenance easier. DBus and Wormhole provide health check, operation management, Backfill, Flow drift and other RESTful services, based on which an automated operation and maintenance system can be developed.
2) Monitoring and early warning
Both DBus and Wormhole provide visual monitoring interfaces where logical-table-level throughput and latency can be observed in real time. They also provide RESTful services such as heartbeat, stats and status, on top of which an automated early-warning system can be developed.
II. Discussion of Pattern Scenarios
In the previous chapter, we introduced the design and functional features of the RTDP architecture's technical components, giving a concrete picture of how the architecture can be put into practice. So which common data application scenarios can the RTDP architecture address? Below we explore several usage patterns and the demand scenarios each pattern fits.
2.1 Synchronization pattern
2.1.1 Pattern description
The synchronization pattern refers to a usage pattern in which only real-time data synchronization between heterogeneous data systems is configured, with no processing logic on the stream.
Specifically, DBus is configured to extract data from the data source onto Kafka in real time, and Wormhole is configured to write the data on Kafka into the Sink storage in real time. The synchronization pattern provides two main capabilities:
Subsequent data processing logic no longer executes on the business standby database, reducing the pressure on it.
It offers the possibility of synchronizing data from different physical business databases into the same physical data store in real time.
2.1.2 Technical difficulty
The concrete implementation is relatively simple.
IT implementers do not need deep knowledge of common stream processing problems, nor do they need to consider the design and implementation of on-stream processing logic; they only need to understand basic flow-control parameter configuration.
2.1.3 Operation and maintenance management
Operation and maintenance management is relatively simple.
Manual operation and maintenance is needed. However, since there is no processing logic on the stream, the flow rate is easy to control, and a relatively stable synchronization pipeline configuration can be produced without considering the computational cost of on-stream processing logic. It is also easy to run regular end-to-end data comparisons to ensure data quality, because the data on the source and target sides is identical.
2.1.4 Applicable scenarios: real-time cross-departmental data synchronization and sharing; decoupling of transaction databases from analysis databases; supporting real-time ODS layer construction in data warehouses; user self-service real-time development of simple reports, etc.
2.2 Stream computing pattern
2.2.1 Pattern description
The stream computing pattern refers to configuring on-stream processing logic on top of the synchronization pattern.
In the RTDP architecture, on-stream processing logic is mainly configured and supported on the Wormhole platform. On top of the synchronization pattern's capabilities, the stream computing pattern provides two main capabilities:
On-stream computation spreads the concentrated cost of batch computation over the stream, greatly reducing the latency of result snapshots.
On-stream computation provides a new entry point (Lookup) for mixed computation across heterogeneous systems.
2.2.2 Technical difficulty
It is relatively difficult to implement.
Users need to know what stream processing can do, what it is suitable for, and how to convert full computation logic into incremental computation logic. Factors such as the computational cost of on-stream processing logic and the external data systems it depends on must also be considered, and more parameters have to be tuned.
2.2.3 Operation and maintenance management
Operation and maintenance management is relatively difficult.
Manual operation and maintenance is needed, and it is harder than in the synchronization pattern, mainly in that flow-control parameter configuration requires more consideration, end-to-end data comparison is no longer supported, an eventual-consistency strategy for result snapshots must be chosen, Lookup time-alignment strategies on the stream must be considered, and so on.
2.2.4 Applicable scenarios: data application projects or reports requiring low latency; low-latency invocation of external services on the stream (such as calling external rule engines, using online algorithm models, etc.); supporting real-time fact-table plus dimension-table wide-table construction in data warehouses; real-time multi-table fusion, splitting, cleaning and standardized Mapping scenarios, etc.
2.3 Rotation pattern
2.3.1 Pattern description
The rotation pattern means that, on top of the stream computing pattern, data is landed into storage in real time, short-cycle scheduled tasks perform further computation on the database, and the results are put back onto Kafka for the next round of stream computation, so that stream computation and batch computation alternate in turns.
In the RTDP architecture, the integration pattern Kafka -> Wormhole -> Sink -> Moonbox -> Kafka can realize rotation computation over any number of rounds at any frequency. On top of the stream computing pattern's capabilities, the rotation pattern mainly provides the theoretical ability to support arbitrarily complex stream computing logic with low latency.
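As a sketch of the scheduled Moonbox step in such a rotation (table names and the time window are hypothetical): aggregate what Wormhole has landed so far, and let the scheduled job publish the result back onto Kafka for the next round.

-- Periodic aggregation over the Sink table populated by Wormhole; the
-- result set would be re-published to Kafka by the scheduled job.
SELECT user_id, COUNT(*) AS event_cnt, MAX(event_time) AS last_event_time
FROM kudu_sink.user_events
WHERE event_time >= '2018-06-02 12:00:00'   -- e.g. the last 5-minute window
GROUP BY user_id;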
2.3.2 Technical difficulties
It is difficult to implement.
With Moonbox introduced on top of Wormhole, there are more variables to consider than in the stream computing pattern, such as the choice of multiple Sinks, the frequency of Moonbox computation, how to split the computation between Wormhole and Moonbox, and so on.
2.3.3 Operation and maintenance management
Operation and maintenance management is difficult.
Manual operation and maintenance is needed. Compared with the stream computing pattern, more data system factors must be considered, more parameters configured and tuned, and data quality management and diagnostic monitoring become harder.
2.3.4 Applicable scenarios: low-latency, multi-step complex data processing logic scenarios; company-level real-time data flow processing network construction
2.4 Intelligent pattern
2.4.1 Pattern description
The intelligent pattern refers to applying rules or algorithm models to optimize and improve the efficiency of the patterns above.
Points that can be made intelligent:
Intelligent drift of Wormhole Flows (intelligent automated operation and maintenance)
Intelligent optimization of Moonbox precomputation (intelligent automated tuning)
Intelligent conversion of full computation logic into stream computation logic, then deployment on Wormhole + Moonbox (intelligent automated development and deployment)
Etc.
2.4.2 Technical difficulty
The concrete implementation is theoretically the simplest, but making the technology effective is the hardest.
Users only need to complete offline logic development and leave the rest to intelligent tools, which handle development, deployment, tuning, operation and maintenance.
2.4.3 Operation and maintenance management
Zero operation and maintenance.
2.4.4 applicable scenarios
The whole scene.
With this, our discussion of the topic "how to design a real-time data platform" comes to an end. We started from the conceptual background, moved on to architecture design, then introduced the technical components, and finally discussed pattern scenarios. Since every topic involved here is very large, this article only offers a shallow introduction and discussion. Going forward, we will from time to time discuss specific topics in detail and present our practice and experience, hoping to spark further discussion and draw on collective wisdom. If you are interested in the four open source platforms in the RTDP architecture, you are welcome to find us on GitHub, learn how to use them, and share your suggestions.
Author: Lu Shanwei
Source: Yixin Institute of Technology