01 Background
The purpose of the online trading service platform is to reduce the computing pressure and performance load on the core system. The platform captures the core system's transaction data in real time, computes and processes it in real time, and saves the results in SequoiaDB, from which it provides online transaction query services to users in real time. The platform is designed around a real-time processing architecture: data changes in the core system are synchronized to the platform database in real time, achieving real-time data replication and external service provision.
This paper analyzes the technical principles and overall architecture of the real-time processing system. It first introduces the technical principles behind the architecture, then explains the overall implementation in detail, covering the data acquisition layer, the real-time processing layer, and the data storage layer.
02 Technical requirements
2.1 How to build a real-time collection system for database log files
The platform needs to obtain customer balance changes and transaction detail data in real time from multiple bank trading systems. This requires a data acquisition component that provides real-time capture and transmission with high performance, high availability, and high security and reliability, so we use the OGG and CDC acquisition frameworks, which have these characteristics.
CDC (Change Data Capture) captures data-source changes in real time from database logs and transmits them to the target side. The CDC component reads the log files of each business production system's database to capture the changed (insert, delete, update) transaction data. After row filtering and character-encoding conversion, the data is sent to the target side over TCP/IP. On the target side, after value conversion, character-encoding conversion, and conflict detection, the change data is published to Kafka through the Confluent REST API, and the message queue buffers the data before it is persisted.
OGG (Oracle GoldenGate) is a log-based mining technology. It obtains incremental data changes by parsing the online or archived logs of the source database and transmits these changes to Kafka, where the message queue likewise buffers the data before it is persisted.
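To illustrate the hand-off into Kafka, here is a minimal sketch of publishing one captured change record through the Confluent REST Proxy's v2 JSON endpoint, as the CDC/OGG target side does; the proxy address, topic name, and record fields are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestProxyPublisher {
    public static void main(String[] args) throws Exception {
        // one captured "update" record, wrapped in the REST Proxy record envelope
        String body = "{\"records\":[{\"value\":{"
                + "\"op\":\"update\",\"table\":\"ACCT_BALANCE\","
                + "\"account_id\":\"62220001\",\"balance\":1024.50}}]}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://restproxy:8082/topics/txn_log"))
                .header("Content-Type", "application/vnd.kafka.json.v2+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // on success the proxy returns the partition/offset each record landed at
        System.out.println(resp.statusCode() + " " + resp.body());
    }
}
```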
2.2 How to ensure real-time processing of massive data
Compared with micro-batch frameworks such as Spark Streaming, Storm offers higher real-time performance and lower latency, while the online trading service platform requires millisecond-level data processing. As a pure real-time computing framework, Storm can achieve millisecond-level processing.
Storm is a stream-based real-time processing system that provides high-throughput real-time computing. When a piece of data arrives, the system immediately performs the corresponding computation in memory, which makes Storm well suited to real-time data analysis. Storm also supports distributed parallel computation, so it can keep up even when a large volume of data flows in. Its other advantages include low latency, high availability, distribution, scalability, no data loss, and a simple, easy-to-understand development interface.
2.3 How to connect the acquisition layer to the real-time processing layer
A message queue is usually added between the acquisition layer and the real-time processing layer to decouple the two and to buffer the data awaiting real-time processing, ensuring that all data is processed in order and correctly.
In addition, the data stream collected from the source side is not uniform: it fluctuates, and under high concurrency the volume of database logs can grow explosively. If Storm's consumption speed (fast as its real-time computation already is) falls behind the rate at which logs are generated, a large backlog will inevitably build up and data may be lost. We therefore add the Kafka messaging system as a data buffer: Kafka turns the uneven input into a steady message stream, which, combined with Storm, yields stable stream computing.
Kafka is a distributed, partitioned, replicated commit-log service. As a scalable, highly reliable messaging system, it is often used in stream processing to collect and retain stream data and hand it to a downstream stream-processing framework such as Storm. Compared with most message systems, Kafka offers better throughput along with built-in partitioning, replication, and failover, which helps it process large volumes of messages promptly.
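A minimal sketch of how a downstream consumer drains the buffered stream, using the standard Kafka Java client; the broker addresses, topic, and group id are hypothetical.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ChangeLogConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092,kafka2:9092");
        props.put("group.id", "storm-ingest");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("txn_log"));
            while (true) {
                // poll returns whatever has accumulated in the buffer so far
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
                }
            }
        }
    }
}
```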
03 Advantages of SequoiaDB as the storage tier
The online transaction service platform needs high-speed storage and efficient retrieval of the massive data produced by real-time processing, and it must guarantee the availability and reliability of that data. SequoiaDB is a distributed database well suited to storing massive data. Its underlying design is built around distribution, high availability, high performance, and dynamic data types, while retaining many strengths of relational databases, such as transactions, multiple indexes, dynamic query and update, and SQL. Using SequoiaDB's distributed storage mechanism and multi-index capability, the platform can provide applications with high-concurrency, low-latency query, update, write, and delete services.
SequoiaDB uses an MPP (massively parallel processing) architecture. The cluster consists of three roles: coordination nodes, catalog nodes, and data nodes. Catalog nodes store metadata, coordination nodes are responsible for task distribution across the distributed system, and data nodes store and operate on the data. When an application sends a request to a coordination node, the node first consults the catalog nodes to learn the structure and distribution rules of the underlying data, then dispatches the query to the relevant data nodes, and finally aggregates and sorts the results from all data nodes into the final query result.
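Since all access goes through a coordination node, a minimal query sketch with the SequoiaDB Java driver looks as follows; the coordinator address, collection space, collection, and field names are hypothetical.

```java
import com.sequoiadb.base.CollectionSpace;
import com.sequoiadb.base.DBCollection;
import com.sequoiadb.base.DBCursor;
import com.sequoiadb.base.Sequoiadb;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;

public class CoordQuery {
    public static void main(String[] args) {
        // connect to a coordination node, which plans and dispatches the query
        Sequoiadb db = new Sequoiadb("sdbserver:11810", "", "");
        try {
            CollectionSpace cs = db.getCollectionSpace("trade");
            DBCollection cl = cs.getCollection("txn_detail");
            // query(matcher, selector, orderBy, hint): fetch one account's records
            BSONObject matcher = new BasicBSONObject("account_id", "62220001");
            DBCursor cursor = cl.query(matcher, null, null, null);
            while (cursor.hasNext()) {
                System.out.println(cursor.getNext());
            }
            cursor.close();
        } finally {
            db.disconnect();
        }
    }
}
```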
SequoiaDB has the following advantages:
1) Rich query model: SequoiaDB suits a wide range of applications. It provides rich index and query support, including secondary indexes and an aggregation framework (see the index sketch after this list).
2) Native drivers: developers interact with the database through native driver libraries integrated into their system environment and code base, which makes using SequoiaDB simple and natural.
3) Horizontal scalability: developers can grow a SequoiaDB system's capacity by adding servers or cloud infrastructure to cope with growth in data volume and throughput.
4) High availability: multiple replicas of the data are maintained through replication. On failure, the system fails over automatically to a secondary node, rack, or data center, so the enterprise does not need to customize or rework its code to keep the system running.
5) Memory-level performance: reads and writes go directly to memory, while the system continuously flushes data to disk in the background for durability. Together these give the system fast performance without a separate caching layer.
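As a small illustration of the secondary-index support in 1), the following sketch continues the CoordQuery example above (reusing its cl collection handle); the index and field names are hypothetical.

```java
// createIndex(name, key, isUnique, enforced) builds a secondary index on the
// collection; subsequent queries filtering or sorting on account_id can use it
cl.createIndex("idx_account", new BasicBSONObject("account_id", 1), false, false);
```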
04 Technical architecture
The real-time processing architecture is divided into three modules: real-time data acquisition, real-time processing, and real-time storage. CDC and OGG acquire the data, Kafka stages it temporarily, Storm computes over it in real time, and SequoiaDB, a distributed database, persists the results.
In the overall architecture of the real-time analysis system, OGG/CDC first captures the database log files in real time, extracts the data changes (inserts, deletes, updates, and so on), and stores them in the Kafka message system. Storm then consumes the messages from Kafka, with consumption offsets managed by a Zookeeper cluster, so that even after Kafka consumption is interrupted and restarted, the last consumed position can be recovered and consumption can resume from that point on the Kafka brokers. A user-defined Storm topology parses the log records and writes the output to the SequoiaDB distributed database for persistence, and finally an online real-time query interface is provided to users. A sketch of a Zookeeper-tracked spout configuration follows.
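A minimal sketch of a KafkaSpout whose consumed offsets are tracked in Zookeeper, matching the recovery behavior described above; it assumes the storm-kafka module, and the Zookeeper addresses, topic, and id strings are hypothetical.

```java
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;

public class SpoutFactory {
    public static KafkaSpout build() {
        // Kafka brokers are discovered through the Zookeeper ensemble
        BrokerHosts hosts = new ZkHosts("zk1:2181,zk2:2181,zk3:2181");
        // offsets are stored under /kafka-offsets/txn-spout in Zookeeper, so a
        // restarted topology resumes from the last recorded position
        SpoutConfig cfg = new SpoutConfig(hosts, "txn_log", "/kafka-offsets", "txn-spout");
        cfg.scheme = new SchemeAsMultiScheme(new StringScheme());
        return new KafkaSpout(cfg);
    }
}
```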
4.1 Data acquisition
For log collection, we designed different collection flows for different system environments. Peripheral systems use the real-time data synchronization tool OGG. OGG's capture process parses the database log files on the source system, extracts data changes such as inserts, deletes, and updates, converts them into OGG's own intermediate format stored in trail (queue) files, and then a transfer process ships the trail files over TCP/IP into the Kafka queue.
For the core system, InfoSphere CDC is deployed on the source side to collect the database logs and log files in real time, capturing the insert, delete, and update transaction records generated by the source database and continuously transmitting the latest transaction data to the target side in continuous-mirroring mode. InfoSphere CDC also runs on the target system, receives data from the different source systems, and publishes it to Kafka through the Confluent REST API, where the message queue buffers the data before it is computed on or persisted.
4.2 Real-time processing
Storm is used here for real-time processing. As a real-time processing framework, Storm is low-latency, highly available, distributed, and scalable, and it loses no data; these characteristics let Storm process data quickly while guaranteeing that none of it is lost.
The daemon running on the master node of a Storm cluster is called Nimbus; it is responsible for distributing the computing program across the cluster, assigning tasks, and monitoring the tasks and worker nodes. The daemon running on each worker node is called a Supervisor; it receives and runs the tasks Nimbus assigns. Part of a topology runs in each worker, and a topology is executed by multiple workers across the cluster. Coordination between Nimbus and the Supervisors is managed through Zookeeper. Nimbus and the Supervisors are themselves stateless; their state is kept in Zookeeper, so the failure or dynamic addition of any single node does not affect the operation of the whole cluster, and a fail-fast mechanism is supported.
To run real-time computation on Storm, we define a custom computing program called a topology, composed of spouts and bolts. Through the topology, Storm generates our target data stream with reliable (ACK-based) distributed computation. We use a KafkaSpout to continuously pull the relevant topic's data from the Kafka queue, a custom bolt to process the data and distinguish insert, delete, and update records, and a custom bolt that calls the SequoiaDB API to apply those inserts, deletes, and updates to the SequoiaDB database, achieving real-time replication of the source data. A sketch of this wiring follows.
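A minimal sketch of the topology described above: the KafkaSpout built in the earlier sketch feeds a bolt that applies each change to SequoiaDB. Class names, the coordinator address, the message layout, and the field names are all hypothetical; only inserts are shown, with deletes and updates handled analogously.

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;
import com.sequoiadb.base.DBCollection;
import com.sequoiadb.base.Sequoiadb;
import org.bson.BasicBSONObject;

public class ReplicationTopology {

    public static class SdbWriterBolt extends BaseRichBolt {
        private transient Sequoiadb db;
        private transient DBCollection cl;
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext ctx, OutputCollector collector) {
            this.collector = collector;
            db = new Sequoiadb("sdbserver:11810", "", "");   // coordination node
            cl = db.getCollectionSpace("trade").getCollection("txn_detail");
        }

        @Override
        public void execute(Tuple tuple) {
            // StringScheme emits the raw Kafka message under the field "str";
            // assume a simple "op,account_id,amount" layout for this sketch
            String[] parts = tuple.getStringByField("str").split(",");
            BasicBSONObject record = new BasicBSONObject();
            record.put("account_id", parts[1]);
            record.put("amount", Double.parseDouble(parts[2]));
            if ("insert".equals(parts[0])) {
                cl.insert(record);               // deletes/updates analogous
            }
            collector.ack(tuple);                // ACK completes Storm's reliability cycle
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: emits nothing downstream
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", SpoutFactory.build(), 3);
        builder.setBolt("sdb-writer", new SdbWriterBolt(), 3)
               .shuffleGrouping("kafka-spout");
        StormSubmitter.submitTopology("txn-replication", new Config(),
                builder.createTopology());
    }
}
```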
4.3 Data storage
After the data from the sources has been processed in real time by Kafka and Storm, the parsed results are stored in SequoiaDB through the SequoiaDB API. SequoiaDB can then be queried with SQL to support OLAP scenarios, and it also serves OLTP workloads for online applications through JDBC, as in the sketch below.
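A minimal sketch of SQL access for an OLAP-style aggregation, assuming a PostgreSQL-compatible SQL layer in front of SequoiaDB (such as SequoiaSQL) and standard JDBC; the host, port, database, credentials, and table names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OlapQuery {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://sqlserver:5432/trade", "sdbadmin", "sdbadmin");
             Statement st = conn.createStatement();
             // aggregate transaction amounts per account across the cluster
             ResultSet rs = st.executeQuery(
                 "SELECT account_id, SUM(amount) FROM txn_detail GROUP BY account_id")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " " + rs.getBigDecimal(2));
            }
        }
    }
}
```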
Storing the massive data in the SequoiaDB distributed database, and using its built-in distributed storage mechanism and multi-index capability, provides applications with high-concurrency, low-latency query, update, write, and delete services.
At the bottom layer, SequoiaDB uses multi-dimensional partitioning to spread massive data across multiple data partition groups for storage. By combining the advantages of hash distribution and range (partition) distribution, it distributes the data in a collection at a finer granularity across the database's data partition groups, improving database performance.
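A minimal sketch of creating a hash-sharded collection so records spread across data partition groups, using the SequoiaDB Java driver; the coordinator address and all names are hypothetical, and the option keys are assumed from the driver's collection-creation options.

```java
import com.sequoiadb.base.CollectionSpace;
import com.sequoiadb.base.Sequoiadb;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;

public class CreateShardedCollection {
    public static void main(String[] args) {
        Sequoiadb db = new Sequoiadb("sdbserver:11810", "", "");
        try {
            CollectionSpace cs = db.getCollectionSpace("trade");
            BSONObject options = new BasicBSONObject();
            options.put("ShardingKey", new BasicBSONObject("account_id", 1)); // partition key
            options.put("ShardingType", "hash");  // hash distribution across groups
            options.put("AutoSplit", true);       // spread partitions over all data groups
            cs.createCollection("txn_hash", options);
            System.out.println("collection created");
        } finally {
            db.disconnect();
        }
    }
}
```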
The main purpose of partitioning is to overcome the hardware limits of a single server, such as memory or disk I/O bottlenecks, so that the system can scale horizontally; it also spreads load across multiple machines, improving system performance without increasing application complexity. Combined with SequoiaDB's replica mechanism, it also ensures high availability of the system.
05 Realized value
5.1 Business value
More and more enterprises are no longer satisfied with processing information in nightly batch jobs; they prefer to extract value from data in real time. They believe data is most valuable at the moment it is generated, and that moving, processing, and using it as soon as it is generated makes the most sense. As a best practice of the real-time processing architecture, the online transaction service platform processes each system's data in real time, integrates the valuable data, and saves it to the SequoiaDB database for users to query in real time. The real-time data processing system not only improves user satisfaction but also effectively combines real-time processing technology with practical business applications. In the future, more business scenarios will need this technology's support.
5.2 Technical value
A stable, reliable, and efficient real-time processing architecture is the foundation for turning real-time data into value. Built on this real-time data processing architecture, the online transaction service platform runs stably in the production environment and provides efficient services, so it has high technical reference value. The architecture achieves real-time integration between SequoiaDB and other databases, facilitates data migration and backup from other databases, and can serve as middleware for real-time integration between SequoiaDB and other databases.