What is the design idea of Wormhole big data streaming processing platform?


This article explains the design idea behind the Wormhole big data streaming platform. The content is concise and easy to follow, and the detailed introduction should leave you with a clear picture of the platform.

Guide: With the rapid development of the Internet, data itself is no longer expensive; what matters is how quickly value can be extracted from it. Real-time data has therefore become a major trend: more and more business scenarios require real-time analysis that delivers results with very low latency, improving business efficiency and creating higher value. As an important means of real-time processing, stream processing is booming along with the growth of real-time data. This article is the opening introduction to Wormhole, a real-time stream processing platform built against the background of Agile big data (Agile BigData): what kind of platform is Wormhole?

I. Background of Wormhole

In the field of streaming computation, more and more mature technical frameworks have appeared in the open source world, such as Storm, Heron, Spark, Samza, Flink, Beam, and so on. Streaming technology has gradually evolved to support rich computation syntax on the stream (SQL-like), at-least-once or exactly-once semantics, high reliability and high availability, high throughput with low latency, event-based computation, and unified integration and access abstractions, capabilities that were previously out of reach.

However, although stream processing technology is already very rich, it is still difficult for enterprises to put it into practice, mainly because of high costs and long delivery cycles for requirements. The causes fall into two areas: the organizational structure of the enterprise, and the technology itself.

In the traditional data warehouse and BI organizational structure, the relevant technical personnel are gathered into an independent big data department, and each business department submits requirements to it for customized development.

(organizational structure of the enterprise)

As shown above, the big data department not only operates and maintains the big data environment but also does customized development and maintains online business. Both consume a great deal of manpower and increase management and communication costs. Take requirements development as an example, shown in the following figure:

(requirements development process)

The figure above shows a development process commonly used by enterprises, and it reflects several problems:

High labor cost

As can be seen from this figure, at least three roles are required to complete a requirement, and streaming developers spend a lot of time understanding the requirements, the business, the table structures, and so on.

Long launch cycle and low efficiency

Every requirement is proposed by product staff, analyzed by business staff, and then designed and developed together with streaming developers; testing and verifying the results also takes a lot of time.

Low reuse

Many requirements are similar to one another, but because of business differences and customization, code cannot be reused well, which results in a lot of repeated development.

High cost of business maintenance

When an online requirement changes, the original code has to be modified, and streaming developers once again need to understand the business process, table structures, and so on. This consumes substantial human resources, the cycle is long, and every change increases the probability of introducing problems.

High resource consumption

To isolate functions and reduce maintenance difficulty, each customized function starts its own streaming application. These applications cannot be reused and occupy a large amount of hardware resources.

These problems with stream processing have greatly restricted the development of enterprise real-time big data, and companies are looking for a lighter solution. Based on years of practice and experience in real-time big data projects, we independently developed a stream processing platform, Wormhole, which solves the problems above to a great extent. Let's introduce Wormhole in detail.

II. What is Wormhole?

Wormhole is a stream processing platform for real-time big data project implementers. It is committed to unifying and simplifying big data development and management, especially for typical real-time / quasi-real-time streaming data processing scenarios: it shields the underlying technical details and provides a very low development threshold. Project implementers only need simple configuration and SQL to support most business scenarios, making the development and management of big data business systems more lightweight, controllable, and reliable.

(sample Wormhole data processing)

Wormhole is mainly based on Spark technology, and it implements functions such as SQL-based data processing on the stream and idempotent writing to heterogeneous systems. As shown in the figure above, Wormhole ingests data from the stream, converts the birthdate field in the data into an age through a user-written SQL statement, and writes the result to another storage system.
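To make this concrete, here is a minimal sketch of the same idea in Spark Structured Streaming. It is an illustration of SQL-on-stream processing under stated assumptions, not Wormhole's actual internals; the broker address, topic name, field names, and SQL are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object BirthdateToAgeFlow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("sql-on-stream-sketch").getOrCreate()
    import spark.implicits._

    // Source: a Kafka topic carrying JSON records such as {"name":"alice","birthdate":"1990-05-01"}
    val source = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // hypothetical broker
      .option("subscribe", "topicA")                       // hypothetical topic
      .load()
      .selectExpr("CAST(value AS STRING) AS json")
      .select(
        get_json_object($"json", "$.name").as("name"),
        get_json_object($"json", "$.birthdate").cast("date").as("birthdate"))

    // The user-written SQL: turn birthdate into age on the stream
    source.createOrReplaceTempView("source_table")
    val result = spark.sql(
      """SELECT name,
        |       floor(datediff(current_date(), birthdate) / 365.25) AS age
        |FROM source_table""".stripMargin)

    // Sink: emit each micro-batch (console stands in for the real target system)
    result.writeStream.format("console").start().awaitTermination()
  }
}
```

In Wormhole itself the user supplies only the configuration and the SQL; the platform builds and manages the equivalent pipeline.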

By implementing SQL-based stream processing, Wormhole greatly lowers the technical threshold of stream processing. At the same time, delivering functionality through a platform with visual configuration reduces the number of roles involved across the requirements life cycle and streamlines the whole development process, which shortens the development cycle and also reduces development and maintenance costs.

III. Wormhole Design Objectives

3.1 Design Goals

Based on the idea of Agile big data, the design goals of Wormhole are as follows:

Platform / componentization

Through platform support and component assembly, prototypes can be verified quickly, forming a fast feedback loop with the demand side for rapid iteration.

Standardization

Standardize the data format so that it can be applied universally, reducing the cost of data formatting and maintenance.

Configuration / Visualization

Users can configure, deploy, manage, and monitor visually, which lowers the threshold of big data product development while ensuring high-quality output.

Low latency / high performance / high availability

To meet real-time requirements, stream processing must deliver low latency, high throughput, and fault tolerance to keep the system running normally.

Self-service / automation

Let the enterprise shift from a data-center model to platform services, so that every data practitioner can work in a more self-service way and data processing capability is unleashed. The system takes over repetitive low-level work from humans, letting practitioners return to the essence of the data and the business.

3.2 Effects in Practice

The effects brought by building the Wormhole platform are mainly reflected in the following aspects:

A more reasonable organizational structure:

As shown in the figure below, the big data department no longer does customized development and business maintenance; instead it focuses on the platform and the stability of the big data environment, which greatly reduces wasted manpower.

(organizational structure based on Wormhole)

A lower technical threshold for streaming development:

Stream processing development becomes a matter of visual configuration and writing SQL: business people can complete more than 80% of business scenarios this way, without needing a deep understanding of streaming technology.

A shorter requirements launch cycle:

In the Wormhole-based requirements development process shown in the figure below, only product and business personnel are needed to take a requirement from proposal to launch, which greatly reduces communication and learning costs and thus greatly shortens the requirements launch cycle.

IV. Wormhole Design Specification

(Wormhole process design diagram)

The figure above introduces the design of Wormhole and reflects the flow of streaming data from input to output. In this process, Wormhole defines new concepts, standardizes the whole stream, and turns customized stream processing into standardized stream processing, abstracting it along three dimensions.

1) Unified data logical table namespace - Namespace

Namespace is the "IP address" of the data: it uniquely locates the physical position of a table through a seven-part structure, that is,

[Data System].[Instance].[Database].[Table].[Table Version].[Database Partition].[Table Partition]
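As a minimal sketch of how such a namespace could be modeled, the following Scala snippet parses the seven dot-separated parts; the parsing code and the example string are our own illustration, not Wormhole's API.

```scala
// A minimal model of the seven-part namespace described above.
case class Namespace(dataSystem: String, instance: String, database: String,
                     table: String, tableVersion: String,
                     databasePartition: String, tablePartition: String)

object Namespace {
  def parse(s: String): Namespace = s.split('.') match {
    case Array(sys, inst, db, tbl, ver, dbPart, tblPart) =>
      Namespace(sys, inst, db, tbl, ver, dbPart, tblPart)
    case _ =>
      throw new IllegalArgumentException(s"expected 7 dot-separated parts, got: $s")
  }
}

object NamespaceDemo extends App {
  // Hypothetical example: the `user` table, version 1, in a Kafka instance
  val ns = Namespace.parse("kafka.kafka01.mydb.user.v1.0.0")
  println(s"${ns.dataSystem} / ${ns.table} / ${ns.tableVersion}") // kafka / user / v1
}
```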

2) Unified general stream message protocol - UMS

UMS is the stream message protocol specification defined by Wormhole.

UMS attempts to abstract and unify all structured messages

UMS itself carries structured data Schema information to facilitate data processing.

UMS allows each message to carry one schema and multiple data tuples, so that when several records travel together, the message size is reduced and processing efficiency is improved.

Description:

Protocol-type currently supports data_increment_data (incremental data) and data_initial_data (initializing full data)

Schema-namespace specifies the namespace corresponding to the data

Schema-fields describes the name, type, and nullability of each field. ums_id_ is the record id and must be monotonically increasing; ums_op_ is the data operation (i: insert; u: update; d: delete); ums_ts_ is the data update time.

Payload-tuple is the content of one record, corresponding one-to-one with schema-fields.
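Putting these pieces together, a UMS message carrying one schema and two data tuples might look roughly like the following. The field names ums_id_, ums_ts_, and ums_op_ come from the description above; the envelope details and the values are illustrative and may differ between versions.

```scala
object UmsExample {
  // An illustrative UMS message as a JSON string: one schema block plus several
  // payload tuples, each tuple aligned one-to-one with schema.fields.
  val ums: String =
    """{
      |  "protocol": { "type": "data_increment_data" },
      |  "schema": {
      |    "namespace": "kafka.kafka01.mydb.user.v1.0.0",
      |    "fields": [
      |      { "name": "ums_id_", "type": "long",     "nullable": false },
      |      { "name": "ums_ts_", "type": "datetime", "nullable": false },
      |      { "name": "ums_op_", "type": "string",   "nullable": false },
      |      { "name": "name",    "type": "string",   "nullable": true  },
      |      { "name": "age",     "type": "int",      "nullable": true  }
      |    ]
      |  },
      |  "payload": [
      |    { "tuple": ["1", "2024-01-01 00:00:00", "i", "alice", "33"] },
      |    { "tuple": ["2", "2024-01-01 00:00:01", "u", "bob",   "41"] }
      |  ]
      |}""".stripMargin
}
```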

Note: since version Wormhole_v0.4.0, user-defined semi-structured JSON formats are also supported, in response to community needs.

3) Unified data computing logic pipeline - Flow

Flow is Wormhole's abstraction of a streaming logic pipeline.

Flow consists of Source Namespace, Sink Namespace and processing logic.

Flow supports UMS and custom JSON message protocols.

Flow supports both Event and Revision Sink write modes

Flow unifies the computing logic standard (SQL / UDF / interface extensions).

(Flow)

Note: the blue boxes and arrows in the figure above form a Flow. Data belonging to Namespace1 (SourceNamespace) is first read from TopicA, with the data protocol being UMS or custom JSON; the processing logic configured by the user is then applied, and the result is output to the data system corresponding to Namespace2 (SinkNamespace). Writing supports both insertOnly and idempotent modes (the latter guarantees eventual consistency for data with the same key arriving in different states).
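To illustrate the idempotent write mode, here is a minimal in-memory sketch, our own illustration rather than Wormhole's sink code: a record is applied only when its ums_id_ is newer than the state the target already holds, so replays and out-of-order arrivals converge to the same final result.

```scala
import scala.collection.mutable

// Simplified record: key, monotonically increasing ums_id_, op (i/u/d), payload value.
case class Record(key: String, umsId: Long, op: String, value: String)

// Toy in-memory sink; a real target would be a database upsert keyed on `key`.
// Deletes are kept as tombstones so that late, stale updates stay rejected.
class IdempotentSink {
  private val store = mutable.Map.empty[String, Record]

  def write(r: Record): Unit = store.get(r.key) match {
    case Some(existing) if existing.umsId >= r.umsId =>
      () // duplicate or out-of-order replay: ignore, the final state is unchanged
    case _ =>
      store.update(r.key, r) // keep insert/update/delete together with its ums_id_
  }

  // Currently visible rows: every key whose latest state is not a delete.
  def snapshot: Map[String, String] =
    store.collect { case (k, rec) if rec.op != "d" => k -> rec.value }.toMap
}

object IdempotentDemo extends App {
  val sink = new IdempotentSink
  val batch = Seq(Record("u1", 1, "i", "alice"), Record("u1", 2, "u", "alice2"))
  (batch ++ batch).foreach(sink.write)           // replaying the batch is harmless
  assert(sink.snapshot == Map("u1" -> "alice2")) // same final state either way
}
```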

As a real-time big data stream processing platform, Wormhole's design goals and design specifications all serve data processing on the stream.

The above is the design idea of the Wormhole big data streaming platform. We hope it has given you useful knowledge of the platform.
