How to Design a Real-Time Data Platform (Design Part)


Introduction: This article, in two parts, discusses an important and common big data infrastructure platform: the real-time data platform.

In this first part, the design part, we first introduce the real-time data platform from two perspectives: the real-time data platform as seen from the modern data warehouse architecture, and real-time data processing as seen from typical data processing workloads. We then discuss the overall architecture design of the real-time data platform, together with the specific problems it must consider and their solutions.

In the second part, the technical part, we will give the technology selection for the real-time data platform, introduce the related components, and explore which application scenarios suit which modes. We hope that through this discussion readers can obtain a real-time data platform construction scheme that follows clear rules and can actually be put into practice.

Extended reading: How to Design a Real-Time Data Platform (Technical Part)

I. Related conceptual background

1.1 Viewing the real-time data platform from the perspective of modern data warehouse architecture

The modern data warehouse evolved from the traditional data warehouse; compared with its predecessor it shares much in common but also adds many new developments. First, let's look at the module architecture of the traditional data warehouse (figure 1) and the modern data warehouse (figure 2):

Figure 1 Traditional data warehouse

Figure 2 Modern data warehouse

We are all familiar with the traditional data warehouse, so we will not introduce it at length here. Generally speaking, a traditional data warehouse can only support data processing at a one-day (T+1) latency, the processing is mainly ETL, and the final output is mainly reports.

The modern data warehouse builds on the traditional one, adding ingestion and storage of more diverse data sources, more diverse processing methods and timeliness (supporting T+0 delivery), more diverse data usage, and more diverse downstream data services.

The modern data warehouse is a big topic; here we present its new features and capabilities as conceptual modules. First, let's look at Melissa Coates's summary in figure 3:

From Melissa Coates's summary in figure 3, we can conclude that the data warehouse is "modern" because it has a series of capabilities such as a multi-platform architecture, data virtualization, near-real-time data analysis, agile delivery, and so on.

Building on Melissa Coates's summary of the modern data warehouse and on our own understanding, we also extract several important capabilities of the modern data warehouse, as follows:

Real-time data (real-time synchronization and streaming capabilities)

Data virtualization (virtual mixed computing and unified service capabilities)

Data democratization (visualization and self-service configuration capabilities)

Data collaboration (multi-tenancy and division-of-labor / cooperation capabilities)

1) Real-time data (real-time synchronization and streaming capabilities)

Real-time data means that, from the moment data is generated (an update to a business database or a log) to its final consumption (reports, dashboards, analysis, mining, data applications, etc.), millisecond/second/minute latency is supported (strictly speaking, second/minute latency is near-real-time, but we call it real-time here). This involves how to extract data from the source in real time; how to move it in real time; how to support computation and processing while the data is in flight, in order to improve timeliness and reduce end-to-end latency; how to store it in real time; and how to serve downstream consumers in real time. Real-time synchronization refers to end-to-end synchronization from multiple sources to multiple destinations, and streaming refers to logical transformation and processing on the stream.

But we must recognize that not all processing and computation can be carried out on the stream; our goal is to reduce end-to-end data latency as much as possible, which requires combining streaming with other data transfer and processing methods. We will discuss this later.
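As a minimal illustration of "processing while the data is in flight", the sketch below consumes change messages from a Kafka topic, applies a trivial cleanse, and republishes them to a downstream topic. The broker address and topic names are hypothetical, and a real platform would use a streaming engine rather than this hand-rolled loop.

```scala
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object StreamRelay {
  def main(args: Array[String]): Unit = {
    val consumerProps = new Properties()
    consumerProps.put("bootstrap.servers", "localhost:9092") // hypothetical broker address
    consumerProps.put("group.id", "rtdp-demo")
    consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val producerProps = new Properties()
    producerProps.put("bootstrap.servers", "localhost:9092")
    producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val consumer = new KafkaConsumer[String, String](consumerProps)
    val producer = new KafkaProducer[String, String](producerProps)
    consumer.subscribe(java.util.Collections.singletonList("cdc.orders")) // hypothetical source topic

    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      for (rec <- records.asScala) {
        // stand-in for real cleansing / light transformation on the stream
        val cleaned = rec.value().trim
        producer.send(new ProducerRecord("dw.orders.cleaned", rec.key(), cleaned))
      }
    }
  }
}
```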

2) Data virtualization (virtual mixed computing and unified service capabilities)

Data virtualization means that users or user programs face a unified interaction mode and query language, without having to care about which physical store the data actually lives in, which dialect it speaks, or how it is accessed (heterogeneous systems / heterogeneous query languages). The user experience is that of operating on a single database, but it is in fact a virtualized database: the data itself is not stored in it.

Virtual mixed computing means that the virtualization layer can transparently run mixed computations over data in heterogeneous systems; unified service means providing users with a single service interface and access method.

Figure 4 Data virtualization

(Figures 1-4 are taken from "Designing a Modern Data Warehouse + Data Lake", Melissa Coates, Solution Architect, BlueGranite)
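To make the virtual mixed computing just described a little more concrete, here is a minimal sketch, not the platform's actual implementation, that uses Spark SQL to put a relational table and a data-lake dataset behind a single SQL dialect. The JDBC URL, credentials, paths, and table names are all hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object VirtualQueryDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("virtual-mixed-compute").getOrCreate()

    // Register a MySQL table under a logical name (hypothetical connection details).
    spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://biz-db:3306/shop")
      .option("dbtable", "orders")
      .option("user", "reader")
      .option("password", "***")
      .load()
      .createOrReplaceTempView("orders")

    // Register a data-lake dataset under another logical name (hypothetical path).
    spark.read.parquet("hdfs:///warehouse/dim_user")
      .createOrReplaceTempView("dim_user")

    // The caller sees one SQL dialect; the engine decides what to push down to each store.
    val result = spark.sql(
      """SELECT u.city, COUNT(*) AS order_cnt
        |FROM orders o JOIN dim_user u ON o.user_id = u.user_id
        |GROUP BY u.city""".stripMargin)

    result.show()
  }
}
```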

3) Data democratization (visualization and self-service configuration capabilities)

Ordinary users (data practitioners without a professional big data background) can use data to do their jobs and meet their needs through a visual user interface, configuration, and SQL, without having to care about underlying technical problems (enabled by cloud-based computing resources, data virtualization, and related technologies). That is our interpretation of data democratization.

For an interpretation of Data Democratization, you can also refer to the following link:

https://www.forbes.com/sites/bernardmarr/2017/07/24/what-is-data-democratization-a-super-simple-explanation-and-the-key-pros-and-cons

That article discusses how to support data democratization at the technical level and gives several examples: data virtualization software, data federation software, cloud storage, self-service BI applications, and so on. Data virtualization and data federation are essentially similar technical solutions, and the concept of self-service BI is also raised there.

4) Data collaboration (multi-tenancy and division-of-labor / cooperation capabilities)

Should technical staff learn more about the business, or should business staff learn more about technology? This has always been a contentious question inside enterprises. We believe that modern BI is a process of deep collaboration: technical and business staff can each play to their strengths on the same platform and divide the work to complete day-to-day BI activities. This places higher demands on the platform's multi-tenancy and collaboration capabilities, and a good modern data platform should support strong data collaboration.

We hope to design a modern real-time data platform that delivers the real-time, virtualization, democratization, and collaboration capabilities described above, and that becomes an important and indispensable part of the modern data warehouse.

1.2 Viewing real-time data processing from the perspective of typical data processing

Typical data processing workloads include OLTP, OLAP, streaming, ad hoc queries, machine learning, and so on. Figure 5 gives the definitions and a comparison of OLTP and OLAP:

(Figure 5 is taken from the article "Relational Databases are not Designed for Mixed Workloads", Matt Allen)

From one point of view, OLTP activity mainly happens on the business transaction database side, while OLAP activity mainly happens on the analytical database side. So how does data flow from the OLTP databases to the OLAP databases? If the timeliness requirement on this flow is high, the traditional T+1 batch ETL approach cannot satisfy it.

We call the flow from OLTP to OLAP the Data Pipeline: all the transfer and processing steps between where data is produced and where it is consumed, including data extraction, data synchronization, stream processing, data storage, data querying, and so on. Complex transformations may happen along the way (for example, converting multi-source heterogeneous data with overlapping semantics into a unified star schema, turning detail tables into summary tables, or combining multiple entity tables into wide tables). How to support real-time Pipeline processing therefore becomes a challenging topic, and we describe it as "Online Pipeline Processing (OLPP)".

The real-time data platform discussed in this article therefore aims, from the data-processing point of view, to solve the OLPP problem and to fill in the missing real-time link from OLTP to OLAP. Next, we discuss how to design such a real-time data platform at the architecture level.

II. Architecture design

2.1 Positioning and objectives

The Real-Time Data Platform (hereafter RTDP) aims to provide end-to-end real-time data processing capability (millisecond/second/minute latency); it can connect to multiple data sources for real-time extraction and can serve real-time data consumption in many application scenarios. As part of the modern data warehouse, RTDP supports the real-time, virtualization, democratization, and collaboration capabilities, making real-time data application development lower-threshold, faster to iterate, higher in quality, more stable in operation, simpler to maintain, and more capable.

2.2 Overall design architecture

The conceptual module architecture is the layered architecture and capability breakdown of the real-time data processing Pipeline at the conceptual level; it is general and reusable, and reads more like a set of requirement modules. Figure 6 shows the overall conceptual module architecture of RTDP; the meaning of each module is self-explanatory, so we do not detail them here.

Figure 6 Overall conceptual module architecture of RTDP

Below, based on the figure above, we take the design further and give the high-level design ideas at the technical level.

Figure 7 Overall design idea

As figure 7 shows, we apply a unified abstraction to four layers of the conceptual module architecture:

Unified data acquisition platform

Unified streaming platform

Unified computing service platform

Unified data visualization platform

At the same time, the design stays open at the storage layer: users can choose different storage layers to meet the needs of a specific project without breaking the overall architecture, and can even select multiple heterogeneous stores within one Pipeline. Below is our interpretation of the four abstraction layers.

1) Unified data acquisition platform

The unified data acquisition platform supports not only full extraction from different data sources but also incremental extraction. For business databases, incremental extraction reads the database log so as to reduce the read pressure on the business database. The platform can also process the extracted data uniformly and publish it to the data bus in a unified format. Here we choose a custom standardized message format, UMS (Unified Message Schema), as the data-layer protocol between the unified data acquisition platform and the unified streaming platform.

UMS carries namespace information and schema information; it is a self-locating, self-describing message protocol format, which brings the following advantages:

The whole architecture does not need to rely on an external metadata management platform

Messages are decoupled from the physical medium (physical media here include Kafka topics, Spark Streaming streams, and the like), so one physical medium can carry multiple message flows in parallel, and message flows can drift freely across physical media.

The platform also supports multi-tenancy and configurable lightweight processing and cleansing.
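Purely as an illustration of what "self-locating, self-describing" means, here is a hedged sketch of the kind of structure such a message might carry; the field names and namespace layout are illustrative, not the exact UMS specification.

```scala
// Illustrative model of a UMS-like message: the namespace locates the data,
// the schema describes it, and the payload rows are aligned with the schema.
case class UmsField(name: String, fieldType: String, nullable: Boolean = true)

case class UmsSchema(
  namespace: String,          // e.g. "mysql.shop.orders": which source table the rows belong to
  fields: Seq[UmsField]
)

case class UmsMessage(
  protocol: String,           // e.g. an incremental-change protocol identifier
  schema: UmsSchema,
  payload: Seq[Seq[Any]]      // rows, one value per schema field
)

val msg = UmsMessage(
  protocol = "data_increment_data",
  schema = UmsSchema(
    namespace = "mysql.shop.orders",
    fields = Seq(
      UmsField("ums_ts_", "datetime"),  // event time of the change
      UmsField("ums_op_", "string"),    // i / u / d
      UmsField("order_id", "long"),
      UmsField("amount", "decimal"))
  ),
  payload = Seq(Seq("2019-01-01 10:00:00", "i", 1001L, BigDecimal(42.5)))
)
```

Because the message carries its own namespace and schema, any consumer on the data bus can interpret it without consulting an external metadata service.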

2) Unified streaming platform

The unified streaming platform consumes messages from the data bus and supports both UMS messages and plain JSON messages. The platform also provides the following capabilities:

Supports visual, configuration-driven, and SQL-based development to lower the threshold of developing, deploying, and managing streaming logic

Supports configurable idempotent writes into multiple heterogeneous target stores to guarantee eventual consistency of the data (see the sketch after this list)

Supports a multi-tenant system that isolates project-level computing resources, table resources, and user resources
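As an illustration of the idempotent-write idea, here is a minimal sketch assuming a MySQL-style target table keyed by order_id with a monotonically increasing ums_id_ version column (table and column names are hypothetical). Replayed or out-of-order duplicates can never move a row backwards, so repeated delivery converges to the same final state, which is what gives eventual consistency.

```scala
import java.sql.Connection

// Upsert one change record idempotently: the version guard makes the write safe to replay.
def upsertIdempotent(conn: Connection, orderId: Long, amount: BigDecimal, umsId: Long): Unit = {
  val sql =
    """INSERT INTO orders_sink (order_id, amount, ums_id_)
      |VALUES (?, ?, ?)
      |ON DUPLICATE KEY UPDATE
      |  amount  = IF(VALUES(ums_id_) > ums_id_, VALUES(amount), amount),
      |  ums_id_ = GREATEST(ums_id_, VALUES(ums_id_))""".stripMargin  // MySQL-flavoured upsert

  val ps = conn.prepareStatement(sql)
  ps.setLong(1, orderId)
  ps.setBigDecimal(2, amount.bigDecimal)
  ps.setLong(3, umsId)
  ps.executeUpdate()
  ps.close()
}
```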

3) Unified computing service platform

The unified computing service platform is an implementation of data virtualization / data federation. Internally it supports pushdown computation and pull-up mixed computation over multiple heterogeneous data sources; externally it exposes a unified service interface (JDBC/REST) and a unified query language (SQL). Because the platform acts as the single gateway through which services pass, we can build unified metadata management, data quality management, data security auditing, data security policy, and other modules on top of it. The platform also supports multi-tenancy.
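From the client's point of view, the unified service looks like a single database. The sketch below issues one cross-store SQL query over plain JDBC; the JDBC URL format, credentials, and SQL are hypothetical and would follow the actual service driver's documentation.

```scala
import java.sql.DriverManager

// Query the unified computing service through standard JDBC (hypothetical endpoint).
val conn = DriverManager.getConnection(
  "jdbc:moonbox://service-host:10010/default", "analyst", "***")

val stmt = conn.createStatement()
val rs = stmt.executeQuery(
  "SELECT city, SUM(amount) FROM orders JOIN dim_user USING (user_id) GROUP BY city")

while (rs.next()) {
  // Behind this single result set, the data may live in several heterogeneous stores.
  println(s"${rs.getString(1)} -> ${rs.getBigDecimal(2)}")
}
conn.close()
```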

4) Unified data visualization platform

The unified data visualization platform, combined with multi-tenancy and a sound user and permission system, supports division of labor and cooperation among data practitioners across departments, letting users work closely together in a visual environment and each play to their strengths to complete the last mile of the data platform.

The above is a unified abstract design based on the overall module architecture, with storage choices kept open to improve flexibility and fit different requirements. This RTDP design embodies the real-time, virtualization, democratization, and collaboration capabilities of the modern data warehouse and covers the end-to-end OLPP data flow.

2.3 Specific problems and considerations

Below, based on the overall RTDP architecture design, we discuss the problems to be considered and their solutions along several dimensions.

1) Functional considerations

Functional considerations center on one question: can a real-time Pipeline handle all complex ETL logic?

We know that streaming engines such as Storm and Flink process data record by record; Spark Streaming-style engines process it micro-batch by micro-batch; and offline batch jobs process it day by day. The processing scope is therefore one dimension of the data (the scope dimension).

In addition, streaming works on incremental data: if the data source is a relational database, incremental data usually means incremental change records (revisions). Batch processing, by contrast, works on snapshot data (snapshots). The form in which the data is presented is therefore another dimension of the data (the change dimension).
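To make the change dimension concrete, here is a minimal sketch (the record shape is hypothetical) showing how a sequence of change records per key collapses into a single snapshot row, which is exactly the convergence described next.

```scala
// A change record: op is "i"/"u"/"d" and version increases with each revision.
case class Change(orderId: Long, amount: BigDecimal, version: Long, op: String)

// Collapse revisions into a snapshot: keep the latest version of each key,
// and drop keys whose latest revision is a delete.
def toSnapshot(changes: Seq[Change]): Map[Long, Change] =
  changes
    .groupBy(_.orderId)
    .map { case (id, cs) => id -> cs.maxBy(_.version) }
    .filter { case (_, c) => c.op != "d" }

val changes = Seq(
  Change(1L, BigDecimal(10), 1, "i"),
  Change(1L, BigDecimal(15), 2, "u"),
  Change(2L, BigDecimal(99), 1, "i"),
  Change(2L, BigDecimal(99), 2, "d"))

// The snapshot keeps order 1 at amount 15; order 2 ends up deleted.
println(toSnapshot(changes))
```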

The change records of a single piece of data can be projected and collapsed into a single snapshot, so the change dimension can converge into the scope dimension. The essential difference between stream processing and batch processing is therefore the data scope dimension: the unit of stream processing is a "limited range", while the unit of batch processing is the "full table range". Data over the "full table range" can support all SQL operators, whereas data over a "limited range" can support only some of them, as shown below:

Join:

✔ Left join: supported. The "limited range" data can left-join external lookup tables (achieving a hash-join-like effect via pushdown).

✔ Right join: not supported. It would require pulling all data from the lookup table on every trigger, which is neither feasible nor reasonable.

✔ Inner join: supported. It can be rewritten as a left join plus a filter, so it can be supported.

✔ Full outer join: not supported, because it implies a right join, which is unreasonable here.

Union: supported. It can be used to pull local-scope data back together for windowed aggregation.

Agg: not supported. Local windowed aggregation can be done via union, but full-table aggregation cannot be supported.

Filter: supported. No shuffle is needed, so it is a natural fit.

Map: supported. No shuffle is needed, so it is a natural fit.

Project: supported. No shuffle is needed, so it is a natural fit.

Join usually requires a shuffle, which is the most time-consuming operation, whereas a lookup-style left join turns the join into hash-join-like lookup operations and spreads out, along the data flow, the computing resources and time that a batch join would concentrate in one place. A left join on the stream is therefore the most cost-effective way to compute a join.
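Here is a minimal sketch of the left-join-on-the-stream idea: each incoming fact record is enriched by a keyed lookup (an in-memory map stands in for the external lookup table or pushed-down hash join), so no full-table shuffle is needed. The record shapes are hypothetical.

```scala
case class OrderEvent(orderId: Long, userId: Long, amount: BigDecimal)
case class UserDim(userId: Long, city: String)

// In practice this would be an external or broadcast lookup table.
val userLookup: Map[Long, UserDim] = Map(
  7L -> UserDim(7L, "Beijing"),
  8L -> UserDim(8L, "Shanghai"))

// Left-join semantics: an event with no matching dimension row is kept, not dropped.
def enrich(event: OrderEvent): (OrderEvent, Option[UserDim]) =
  (event, userLookup.get(event.userId))

val stream = Seq(OrderEvent(1L, 7L, BigDecimal(42)), OrderEvent(2L, 9L, BigDecimal(7)))
stream.map(enrich).foreach {
  case (e, Some(u)) => println(s"order ${e.orderId} from ${u.city}")
  case (e, None)    => println(s"order ${e.orderId} from unknown user")
}
```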

Complex ETL is not a single operator; it is usually composed of many operators. From the above we can see that simple stream processing cannot support all complex ETL logic. So how can a real-time Pipeline support more complex ETL operators while preserving timeliness? This requires the ability to convert between "limited range" and "full table range" processing.

Imagine that the streaming platform performs the processing that suits the stream and lands the data into different heterogeneous stores in real time, while the computing service platform periodically (every few minutes or less) runs mixed batch computations over those multi-source heterogeneous stores, and each batch result is published back onto the data bus so that it keeps flowing. The streaming platform and the computing service platform thus form a computation closed loop, each handling the operators it is good at. Data triggers the flow at different frequencies and undergoes all kinds of operator transformations; in theory such an architectural pattern can support all complex ETL logic.

Figure 8 Evolution of data processing architecture

Figure 8 shows the evolution of data processing architectures and an architectural pattern for OLPP. Wormhole and Moonbox are, respectively, our open-source streaming platform and computing service platform, which will be described in detail later.
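A hedged sketch of the closed loop described above: every few minutes a batch job mixes the heterogeneous stores that the streams have landed data into, and the result is pushed back onto the data bus so that it keeps flowing. runMixedBatch and publishToBus are hypothetical hooks standing in for the computing service platform and a Kafka producer.

```scala
import java.util.concurrent.{Executors, TimeUnit}

val scheduler = Executors.newSingleThreadScheduledExecutor()

// Stand-in for a cross-store SQL job executed by the computing service platform.
def runMixedBatch(): Seq[String] =
  Seq("""{"city":"Beijing","order_cnt":1024}""")

// Stand-in for publishing the batch result back to the data bus.
def publishToBus(rows: Seq[String]): Unit =
  rows.foreach(r => println(s"-> topic dw.city_stats: $r"))

// Trigger the mixed batch every five minutes, closing the loop between the
// streaming platform and the computing service platform.
scheduler.scheduleAtFixedRate(() => publishToBus(runMixedBatch()), 0, 5, TimeUnit.MINUTES)
```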

2) Quality considerations

The figure above also leads to the two mainstream real-time data processing architectures: the Lambda architecture and the Kappa architecture. There is plenty of material online introducing both, so we will not repeat it here. Lambda and Kappa each have their advantages and disadvantages, but both support eventual consistency of the data, which guarantees data quality to some extent. How Lambda and Kappa can borrow from each other to form some kind of fused architecture will be discussed in detail in a separate article.

Of course, data quality is itself a very large topic; supporting re-runs and backfills alone does not solve every data quality problem, but it does provide an engineering mechanism at the technical architecture level for repairing data. We will also devote a separate article to data quality in big data.

3) Stability considerations

This topic involves, but is not limited to, the following points; for each we outline a simple approach:

High availability (HA)

Choose highly available components along the entire real-time Pipeline to ensure overall high availability in principle; support data backup and replay mechanisms on data-critical links; and support a dual-run fusion mechanism on business-critical links.

SLA guarantee

On the premise that the cluster and the real-time Pipeline are highly available, support dynamic scaling and automatic drift of data processing flows.

Elasticity and anti-fragility

✔ Elastic resource scaling based on rules and algorithms

✔ Failure handling through an event-triggered action engine

Monitoring and early warning

Multi-faceted monitoring and alerting at the cluster/infrastructure level, the physical pipeline level, and the data logic level

Automatic operation and maintenance

Capture and archive missing data, handle exceptions, and provide a periodic automatic retry mechanism to repair problem data (a retry sketch follows)
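A minimal sketch of the periodic automatic-retry idea: re-run a repair action on problem data with bounded attempts and backoff, and let the caller archive whatever still fails. The repair hook is hypothetical.

```scala
// Retry a repair action a bounded number of times with linear backoff.
def retryRepair[A](maxAttempts: Int, backoffMs: Long)(repair: () => A): Option[A] = {
  var attempt = 1
  while (attempt <= maxAttempts) {
    try return Some(repair())
    catch {
      case _: Exception =>
        Thread.sleep(backoffMs * attempt) // back off a little longer each time
        attempt += 1
    }
  }
  None // exhausted: the caller archives the failed batch for later inspection
}
```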

Resistance to upstream metadata changes

✔ Require upstream business databases to make only backward-compatible metadata changes

✔ Have the real-time Pipeline process only explicitly declared fields

4) Cost considerations

This topic involves, but is not limited to, the following points; for each we outline a simple approach:

Manpower cost

Reduce staffing costs by supporting the democratization of data applications

Resource cost

Reduce resource waste caused by static resource occupation by supporting dynamic resource utilization

Operation and maintenance cost

Reduce operation and maintenance costs by supporting mechanisms such as automatic operation and maintenance, high availability, and elasticity/anti-fragility

Trial and error cost

Reduce trial and error costs by supporting agile development / fast iterations

5) Agility considerations

Agile big data is a complete theoretical system and methodology that has been described in earlier articles. From the point of view of data use, agility here means: configuration-driven, SQL-based, and democratized.

6) Management considerations

Data management is also a very large topic; here we focus on two aspects: metadata management and data security management. Managing metadata and data security uniformly in a modern data warehouse environment with multiple data stores is a challenging topic. We account for both aspects on every link of the real-time Pipeline, providing built-in support on each platform, while also supporting integration with an external unified metadata management platform and unified data security policies.

In this article we discussed the conceptual background and the architecture design of the real-time data platform (RTDP). In the architecture design we focused on RTDP's positioning and objectives, the overall design architecture, and the specific problems and considerations involved. Some of these topics are large enough to deserve separate articles, but overall we have given a complete set of design ideas and plans for RTDP. In the next, technical, article we will make the RTDP architecture design concrete, give the recommended technology selection and our open-source platform solutions, and discuss the different application modes of RTDP according to the requirements of different scenarios.

Author: Lu Shanwei

Source: Yixin Institute of Technology
