This article introduces how to build a real-time big data processing platform with StreamWorks.
Data Stack is a cloud-native, one-stop data platform PaaS. We have an interesting open source project on GitHub and Gitee: FlinkX. FlinkX is a batch-stream unified data synchronization tool based on Flink that can collect both static data and real-time changing data; it is a global, heterogeneous, batch-stream integrated data synchronization engine. If you like it, please give us a star!
Github open source project: https://github.com/DTStack/flinkx
Gitee open source project: https://gitee.com/dtstack_dev_0/flinkx
During the Spring Festival of 2020, a sudden epidemic spread across the country and broke everyone's normal rhythm of work and life. While staying at home during the epidemic, people could check a real-time big data map of the outbreak at any time, or browse the Douyin videos they were currently interested in. The key technology behind all of this is real-time big data processing.
Now that the epidemic is subsiding, the state has proposed accelerating the construction of new infrastructure such as big data centers, and building a real-time big data processing platform has become an increasingly important part of enterprises' digital and intelligent transformation.
1. What is real-time computing
In the field of big data processing, tasks are usually divided into real-time computing and offline computing according to the nature of the data. Take a temperature-sensor scenario as an example: suppose a city has installed a large number of temperature sensors, each sensor uploads the temperature it collects every minute, and the meteorological center aggregates the readings and refreshes the result every 5 minutes. These data are generated continuously and never stop. Real-time computing is mainly used in scenarios where data is produced continuously without stopping and the results must be obtained with minimal delay, usually within seconds or minutes.
To handle such scenarios, with large data volumes and strict latency requirements, real-time computing technology is usually adopted. The continuous flow of data in real-time computing makes its processing model completely different from that of offline computing.
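As a minimal sketch of the temperature-sensor scenario above (all table names, fields, and connector options are hypothetical and depend on the Flink version and connectors available), a Flink SQL job could continuously compute a 5-minute average temperature per sensor as the readings arrive:

-- Hypothetical source: sensor readings arriving continuously from Kafka
CREATE TABLE sensor_readings (
    sensor_id   STRING,
    temperature DOUBLE,
    event_time  TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'sensor-readings',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json'
);

-- 5-minute tumbling-window average per sensor, emitted continuously as the stream flows
SELECT
    sensor_id,
    TUMBLE_START(event_time, INTERVAL '5' MINUTE) AS window_start,
    AVG(temperature) AS avg_temperature
FROM sensor_readings
GROUP BY sensor_id, TUMBLE(event_time, INTERVAL '5' MINUTE);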
Figure 1: The difference between real-time computing and offline computing
Unlike offline computing, which is batch-oriented, high-latency, and actively initiated, real-time computing is a continuous, low-latency, event-triggered computing task. Offline computing loads the data first, then submits the offline job, and finally returns the result of the computation; real-time computing first submits the streaming job, then waits for real-time streaming data to arrive, and continuously produces a stream of real-time results.
Figure 2: The difference between real-time computing and offline computing (illustrated)
An intuitive way to picture it: offline computing is like rowing a boat out onto a lake (the database) to fish, while real-time computing is like building a dam on a river (the data stream) to generate electricity. Taking the analogy further, lakes are formed by rivers, and rivers are bounded upstream and downstream by lakes; in this sense, offline computing can be understood as a special case of real-time computing.
2. Problems that can be solved by real-time computing
Figure 3: Problems that real-time computing can solve
From a technical point of view, real-time computing is mainly used in the following scenarios:
Real-time data ETL based on a Data Pipeline: the goal is to move data from point A to point B in real time. Data cleaning and integration may be added along the way, for example building a search system's index in real time or the ETL process in a real-time data warehouse (a minimal sketch follows this list).
Real-time data analysis (Data Analysis): the process of extracting the relevant information from raw data and aggregating it according to business goals, for example the top 10 items by daily sales, the average turnaround time of a warehouse, the average click-through rate of web pages, or the open rate of real-time push notifications. Real-time data analysis makes this process real time, usually surfaced to end users as real-time reports or real-time dashboards.
Data-driven, event-driven applications: systems that process or respond to a stream of subscribed events. Event-driven applications usually rely on internal state, for example click-fraud detection, risk control systems, and operations anomaly detection. When a user's behavior triggers certain risk-control points, the system captures the event and analyzes the user's current and past behavior to decide whether to take risk-control action against the user.
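As an illustrative sketch of the Data Pipeline scenario referenced above (table names, fields, and connector options are hypothetical and depend on the Flink version and connectors in use), a real-time ETL job might clean raw events from one Kafka topic and write the normalized records to another:

-- Hypothetical raw events topic
CREATE TABLE raw_events (
    user_id STRING,
    url     STRING,
    ts      TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'raw-events',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json'
);

-- Hypothetical cleaned events topic consumed by downstream jobs
CREATE TABLE cleaned_events (
    user_id STRING,
    domain  STRING,
    ts      TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'cleaned-events',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json'
);

-- Continuous cleaning: drop records without a user id and keep only the URL's domain
INSERT INTO cleaned_events
SELECT user_id, REGEXP_EXTRACT(url, 'https?://([^/]+)', 1) AS domain, ts
FROM raw_events
WHERE user_id IS NOT NULL;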
3. Full-link process for real-time development
Figure 4: Full-link process for real-time development
Real-time data acquisition uses streaming collection tools to gather data and deliver it in real time to a big data message store (such as Kafka). This streaming data store sits upstream of real-time computing and provides a continuous stream of data that triggers the streaming computing jobs. Because stream data is the trigger source that drives real-time computing, a real-time computing job must use at least one stream as its source, and each arriving record directly triggers one round of stream processing. After the data has been processed and analyzed by the real-time computing system, it is immediately written to downstream data storage. The downstream database is generally business-related and serves real-time reports, real-time dashboards, and other data consumers.
4. Real-time acquisition: the key to a full-link real-time development platform
In full-link real-time development, real-time acquisition sits upstream of real-time computing. Enterprises already have data storage systems, but a large portion of them are offline relational databases. How to make the real-time incremental data of these offline relational databases available for real-time computation and analysis is an urgent problem. The figure below shows the functional architecture of Kangaroo Cloud's real-time data acquisition tool.
Figure 5: Data flow of the real-time data acquisition tool FlinkX
As a module of the StreamWorks platform, Kangaroo Cloud's real-time data acquisition has the following features:
FlinkX supports batch data extraction as well as real-time capture of changing data from MySQL, Oracle, SQLServer, and other databases, achieving unified batch-stream collection.
The bottom layer is based on Flink's distributed architecture, supporting high-volume, highly concurrent synchronization; compared with single-node synchronization it offers better performance and higher stability.
Real-time synchronization is supported both by reading the database binlog directly and by interval polling (see the analogous sketch after this list).
Supports resumable synchronization (breakpoint continuation), dirty-data recording, and display of real-time acquisition metric curves.
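FlinkX itself is configured through its own JSON job definitions, so the following is only an analogous illustration: a hedged sketch of binlog-based change capture expressed in Flink SQL using the community mysql-cdc connector rather than FlinkX's native reader, with hypothetical connection parameters and table fields:

-- Hypothetical CDC source: row changes captured from the MySQL binlog
CREATE TABLE orders_cdc (
    order_id    BIGINT,
    amount      DECIMAL(10, 2),
    update_time TIMESTAMP(3),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'mysql-host',
    'port' = '3306',
    'username' = 'flink_user',
    'password' = '******',
    'database-name' = 'shop',
    'table-name' = 'orders'
);

-- Each insert/update/delete on shop.orders now arrives as a changelog record
SELECT * FROM orders_cdc;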
5. Introduction to the StreamWorks real-time development platform
Kangaroo Cloud's real-time development platform (StreamWorks) is a cloud-native, one-stop big data streaming computing platform based on Apache Flink. It covers the full link from real-time data collection to real-time data ETL, offers sub-second processing latency and DataStream API job development, and is compatible with existing big data components, helping enterprises transform their data in real time and build new infrastructure.
In most data development stacks, SQL alone can handle the majority of business scenarios. The core capability of StreamWorks is SQL-based streaming data analysis (FlinkStreamSQL), which lowers the development threshold, and it provides Exactly-Once processing semantics to guarantee the accuracy and consistency of business results.
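The Exactly-Once guarantee mentioned above ultimately rests on Flink's checkpointing mechanism. As a hedged illustration only (the option names follow recent open-source Flink versions and are not StreamWorks-specific settings), checkpointing with exactly-once mode can be enabled in the Flink SQL client like this:

-- Enable periodic checkpoints with exactly-once semantics (Flink SQL client)
SET 'execution.checkpointing.interval' = '60 s';
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';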
Figure 6: StreamWorks functional architecture
As shown in the figure above, StreamWorks consists of the following modules:
Real-time data acquisition: supports real-time data acquisition from sources such as MySQL, SQLServer, Oracle, PolarDB, Kafka, and EMQ. Rate and concurrency controls let users manage the acquisition process more precisely.
Data development: supports FlinkSQL and Flink task types. FlinkSQL jobs provide visual source/sink configuration, job development, syntax checking, and other functions; Flink tasks run real-time jobs from an uploaded jar package.
Task operation and maintenance: task monitoring, data curves, running logs, data delay, checkpoints, failover, attribute parameters, alarm configuration, and other functions.
Project management: user management, role management, overall project configuration, project member management, and so on.
6. Advantages of the StreamWorks real-time big data development platform
Figure 7: StreamWorks platform layers
As shown in the figure above, the StreamWorks real-time big data development platform is built on the Apache Flink computing engine, wrapped with a SQL layer, and topped with an online development IDE. The platform has the following advantages:
Easy to use: provides an online IDE and development tools tailored to FlinkSQL.
Visual DDL: provides visual table-creation tools; just configure the parameters to complete the DDL.
Built-in functions: provides a rich set of FlinkSQL built-in functions to simplify development.
Efficient operation and maintenance: provides dozens of operating metrics to address the operation and maintenance pain points of open source.
Real-time acquisition: provides real-time acquisition tools to support the full-link real-time development platform.
FlinkX: a self-developed batch-stream integrated data acquisition tool, already open source.
Figure 8: Traditional development mode vs. StreamWorks development mode
7. Fourteen lines of code to handle real-time business development
Having said all that, how do our products simplify the development of real-time business logic? Let's use the most common example, website traffic analysis, to illustrate. Suppose a website needs to analyze where its visits come from:
As shown in the figure below, the job reads the site access log from the log service, parses the source out of each log record, checks whether the source is in the list of websites of interest (a whitelist of source websites stored in MySQL), counts the page views (PV) coming from each website, and writes the final result to MySQL.
Figure 9: Business logic flow chart
Implementing this in StreamSQL is very simple; it takes only about 14 lines of pseudocode.
-- Source table: website access log read from Kafka
CREATE TABLE log_source (dt STRING, url STRING, …)
WITH (type = kafka);

-- Dimension table: whitelist of source websites stored in MySQL
CREATE TABLE mysql_dim (url STRING, … , PRIMARY KEY (url))
WITH (type = mysql);

-- Result table: PV per source website written to MySQL
CREATE TABLE mysql_result (url STRING, … , PRIMARY KEY (url))
WITH (type = mysql);

-- Join the log stream with the whitelist and count PV per source website
INSERT INTO mysql_result
SELECT l.url, COUNT(*) AS pv, …
FROM log_source l JOIN mysql_dim d ON l.url = d.url
GROUP BY l.url;
8. Building a real-time recommendation system based on StreamWorks
Recommendation systems are generally implemented based on tags. Tag-based recommendation is in fact very common; products such as Toutiao and Douyin make heavy use of tags. This kind of recommendation system has many advantages, such as being simple to implement and easy to interpret. So how can tags be used to recommend goods or content in real time?
When a new user registers an account in an app, they fill in some static information such as age and occupation. This information can be analyzed offline and stored in a long-term interest tag library. The content the user has recently shown interest in (for example, the topics followed in the last 10 minutes) can be analyzed in real time to produce short-term interest tags. The real-time development platform's dimension-table association then joins the short-term interest tags with the long-term interest tag library, and finally new recommended content is generated and pushed to the client, closing the loop on the user data flow. This yields a simple real-time recommendation system. The specific process is shown in the figure below, and a hedged sketch of the dimension-table join follows the figure.
Figure 10: Building a real-time recommendation system based on StreamWorks
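As a hedged sketch of the dimension-table association step described above (all table names, fields, and connector options are hypothetical), the short-term behavior stream can be enriched with the long-term tag library through a Flink SQL lookup join:

-- Hypothetical real-time behavior stream from Kafka (short-term interest)
CREATE TABLE user_clicks (
    user_id     BIGINT,
    content_tag STRING,
    proc_time AS PROCTIME()
) WITH (
    'connector' = 'kafka',
    'topic' = 'user-clicks',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json'
);

-- Hypothetical long-term interest tag library maintained offline in MySQL
CREATE TABLE long_term_tags (
    user_id       BIGINT,
    interest_tags STRING,
    PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://mysql-host:3306/profile',
    'table-name' = 'long_term_tags'
);

-- Lookup join: enrich each short-term event with the user's long-term tags;
-- the combined signal feeds the recommendation logic downstream
SELECT c.user_id, c.content_tag, t.interest_tags
FROM user_clicks AS c
JOIN long_term_tags FOR SYSTEM_TIME AS OF c.proc_time AS t
    ON c.user_id = t.user_id;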
9. Conclusion: turn the future into the present
The epidemic will pass, and life goes on. As the construction of "new infrastructure" deepens, more and more real-time scenarios will appear in our lives. As a new-infrastructure solution provider, Kangaroo Cloud's slogan is to turn the future into the present, and it will enable the real-time transformation of more and more enterprises.