
Introduction and practice of database-related middleware in Internet industry!

2025-01-18 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)06/01 Report--

This article mainly introduces database-related middleware in the Internet industry. Database-related platforms mainly solve the following three problems:

Provide high-performance, high-capacity and high-availability access to massive foreground data

Provide quasi-real-time subscription and consumption of data changes

Provide efficient remote data synchronization

The application layer accesses the database through table- and database-sharding middleware, performing both read operations (SELECT) and write operations (UPDATE, INSERT, DELETE, DDL, DCL). Write operations produce change records in the database: MySQL's are called binlog, Oracle's redo log. Incremental-data subscription and consumption middleware parses these change records, stores them in a unified format, and downstream applications consume this data. Data is also migrated between databases themselves, which can be handled directly without the incremental subscription and consumption middleware.

There are several kinds of database middleware:

Distributed database by table and database

Data incremental subscription and consumption

Database synchronization (full, incremental, cross-room, replication)

Migration across databases (data sources)

The whole product family diagram is as follows:

The top layer is the distributed table- and database-sharding middleware, which faces the upper application and can present itself as a single database while shielding the complex details of the underlying system. Beyond the basic sharding functions, such middleware can be enriched with features such as integrated read-write separation or horizontal scaling; read-write separation itself can also stand alone as independent middleware. (Cobar, MyCAT, TDDL, DRDS, DDB)

Incremental data subscription and consumption: user operations on the database, such as DML, DCL and DDL, produce incremental data, and downstream applications can subscribe to this incremental data and process it accordingly. The typical representative is Canal, implemented on top of MySQL's binlog; there is also middleware that subscribes to and consumes incremental data from Oracle (redo log). (Canal, Erosa)

Database synchronization middleware handles synchronization between databases; it can implement cross-machine-room synchronization, remote disaster-recovery backup, traffic diversion and other functions. It can involve many kinds of databases, and the processed data can be stored in many forms. (Otter, JingoBus, DRC)

Data also gets migrated (synchronized) between databases. Homogeneous synchronization is relatively simple, such as MySQL master/standby synchronization, which only requires the corresponding configuration at the database layer; cross-database synchronization is more complex, such as Oracle -> MySQL. Data migration generally has three steps: full replication, which moves all existing data from the original database to the new one while new data keeps being generated; incremental synchronization, which synchronizes the newly generated data and continues for a period of time to ensure the two stay in sync; and finally stopping writes on the original library and switching over to the new database. The meaning of "cross-database" extends to "cross-data-source": sources such as HDFS, HBase and FTP can also be synchronized with each other. (yugong, DataX)
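The three migration steps can be sketched in-memory. This is a minimal illustration of the idea, not any tool's actual implementation; the changelog here stands in for the binlog that a real migration tool would replay:

```java
import java.util.*;

// Hypothetical in-memory sketch of the three migration steps described above:
// full copy, incremental catch-up, then stop-write and switch.
public class MigrationSketch {
    // The source keeps a changelog so writes arriving during the full copy are not lost.
    static Map<String, String> source = new LinkedHashMap<>();
    static List<String[]> changelog = new ArrayList<>(); // {key, value}
    static Map<String, String> target = new HashMap<>();

    static void writeSource(String k, String v) {
        source.put(k, v);
        changelog.add(new String[]{k, v});
    }

    // Step 1: full replication of everything present when the copy starts.
    static void fullCopy() { target.putAll(source); }

    // Step 2: incremental sync replays changes recorded since the copy began.
    static void incrementalSync(int fromIndex) {
        for (int i = fromIndex; i < changelog.size(); i++)
            target.put(changelog.get(i)[0], changelog.get(i)[1]);
    }

    public static void main(String[] args) {
        writeSource("a", "1");
        int mark = changelog.size();
        fullCopy();                 // step 1
        writeSource("b", "2");      // a write arriving during migration
        incrementalSync(mark);      // step 2: catch up
        // step 3: stop writes on the source and switch the application to target
        System.out.println(target.equals(source)); // true
    }
}
```

The key point the sketch shows is why step 2 exists: without replaying the changelog, the write to "b" made during the full copy would be lost.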

Distributed database

As Internet products keep growing in size and scale, both Oracle and MySQL quickly hit single-machine bottlenecks in disk, CPU and memory. The product side then not only has to purchase high-specification servers whose cost is hard to control, but also faces repeated online data migrations. Whether for large volumes of structured data or fast-growing business scale, a horizontal scale-out method is urgently needed: one that spreads storage cost across cost-controllable commodity servers, while linear expansion reduces the impact of full data migration on online services. Hence the sub-database and sub-table (sharding) scheme came into being.

Sharding middleware mainly serves applications in two forms:

One is a JDBC jar package that Java applications depend on directly. Through the provided JDBC package, a Java application transparently accesses each sub-database and sub-table in the distributed database cluster. Typical representatives are NetEase's DDB and Ali's TDDL.

The other is an independently deployed service that meets the application's sharding needs. In this form, the application accesses the Proxy through standard JDBC; the Proxy parses client requests according to the MySQL standard communication protocol, restores the application's SQL request, accesses the database cluster on its behalf, and finally returns the results to the client, again via the MySQL standard protocol. Typical representatives are Ali's Cobar, the Cobar variant MyCAT, Ali's DRDS, and NetEase's DDB proxy mode and private cloud mode.

Cobar

Cobar is a middleware providing distributed services for a relational database (MySQL). It lets a traditional database scale almost linearly while still looking like a single database and remaining transparent to the application.

Cobar sits between the foreground application and the actual databases as a Proxy; the interface it exposes to the foreground is the MySQL communication protocol. It rewrites the foreground SQL statement, sends it to the appropriate backend sub-database according to the data distribution rules, and then merges the returned results to simulate the behavior of a single database.

Cobar came out of Ali's B2B business group, starting in 2008. It served inside Ali for more than three years, taking over 3000+ MySQL schemas, with the cluster handling more than 5 billion online SQL requests per day. After its main sponsor left, Cobar stopped being maintained. Later middleware builds on it, such as MyCAT, which is based on Cobar, and DRDS, now serving inside Ali, which also reuses Cobar-Proxy-related code.

Cobar structure

It interacts with the application via the MySQL protocol: a proxy structure that exposes jdbc:mysql://CobarIP:port/schema to the outside and is transparent to the application.

No new jar package needs to be introduced; migrating from direct database access to Cobar can reuse the original JDBC-based DAO.

Both the front and back ends of Cobar implement the MySQL protocol. When an SQL request is received, parsing (SQL Parser) and routing (SQL Router) are performed in turn, and then the SQL Executor backend module fetches the datasets (the backend module is also responsible for heartbeat checking). If the datasets come from multiple data sources, Cobar combines them (Result Merge) and finally returns the response.
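The Result Merge step can be pictured as a k-way merge: each shard returns rows already sorted by the ORDER BY key, and the proxy merges the per-shard streams into one globally sorted result. A minimal sketch (not Cobar's actual code), with integer rows standing in for result rows:

```java
import java.util.*;

// Sketch of the Result Merge idea: merge pre-sorted per-shard results.
public class ResultMerge {
    static List<Integer> merge(List<List<Integer>> shardResults) {
        // Priority queue holds {value, shardIndex, offsetInShard}.
        PriorityQueue<int[]> pq = new PriorityQueue<>(Comparator.comparingInt(a -> a[0]));
        for (int s = 0; s < shardResults.size(); s++)
            if (!shardResults.get(s).isEmpty())
                pq.add(new int[]{shardResults.get(s).get(0), s, 0});
        List<Integer> merged = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] top = pq.poll();
            merged.add(top[0]);                       // smallest head across shards
            int next = top[2] + 1;
            List<Integer> shard = shardResults.get(top[1]);
            if (next < shard.size()) pq.add(new int[]{shard.get(next), top[1], next});
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(merge(List.of(List.of(1, 4, 9), List.of(2, 3), List.of(5))));
        // [1, 2, 3, 4, 5, 9]
    }
}
```

Merging this way only ever buffers one row per shard, which is why a proxy can return sorted cross-shard results without materializing everything in memory.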

Database connection reuse: Cobar uses connection pools to interact with the backend databases. (In practice, depending on the application, the proxy structure can reduce the number of database connections by a factor of 2-10.)

Cobar transactions: Cobar maintains strong transaction consistency within a single database and only weak consistency across sub-databases. Sub-database transactions use a 2PC protocol, with an execution phase and a commit phase.

Cobar's front end is NIO, while the interaction between the back end and MySQL is blocking. The NIO code only provides a framework and has not been implemented; the non-open-source version of Cobar is said to implement backend NIO.

Cobar can appear to hang ("false death"). After such a hang, Cobar will switch between master and slave (if so configured), and this automatic switching itself carries hidden risks.

From the figures above, Cobar's per-schema TPS can be estimated as 5,000,000,000 / (3000 × 24 × 60 × 60) ≈ 20.

There is also a related project, Cobar-Client.

Cobar implements data access by forwarding SQL statements. For an SQL statement sent by the user, Cobar parses its content, determines which sub-database the data involved in the statement is distributed in, and forwards the statement to that sub-database for execution. When the split field involved in the SQL statement has multiple values, such as with IN, or the split field does not appear in the WHERE condition, the statement is forwarded to all backend sub-databases for execution, and the execution results are then sent back to the application as MySQL protocol packets.
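The routing rule just described can be illustrated with a toy router (the modulo rule and shard count are made up for illustration; they are not Cobar's actual configuration): a single split-field value maps to one sub-database, multiple values (IN) fan out to the involved shards, and a missing split field broadcasts to all of them.

```java
import java.util.*;

// Hypothetical split-field routing: one value -> one shard; IN -> fan-out;
// no split field in WHERE -> broadcast to every sub-database.
public class SplitFieldRouter {
    static final int DB_COUNT = 4;

    static int route(long splitValue) {
        return (int) (splitValue % DB_COUNT);   // simple modulo distribution rule
    }

    // null means the WHERE clause did not constrain the split field.
    static Set<Integer> route(List<Long> splitValues) {
        Set<Integer> targets = new TreeSet<>();
        if (splitValues == null || splitValues.isEmpty()) {
            for (int i = 0; i < DB_COUNT; i++) targets.add(i); // broadcast
        } else {
            for (long v : splitValues) targets.add(route(v));  // fan out
        }
        return targets;
    }

    public static void main(String[] args) {
        System.out.println(route(List.of(42L)));        // single shard: [2]
        System.out.println(route(List.of(1L, 2L, 5L))); // IN (1,2,5): [1, 2]
        System.out.println(route((List<Long>) null));   // broadcast: [0, 1, 2, 3]
    }
}
```

This is also why the constraints listed later require inserts to include the split field: without it, the router has no single shard to send the row to.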

The communication module is responsible for identifying each MySQL protocol packet from the continuous network data flow, and then parsing the protocol packet to identify the SQL statement and outputting the SQL statement to the Parser module. At the same time, the execution result input by the Result Merge module is encoded into the MySQL protocol packet. It is implemented in NIO and has high execution efficiency. After optimization, a ByteBuffer pool is introduced to manage the Buffer of NIO, which reduces garbage collection during NIO data interaction.

The Cobar front end uses the optimized NIO communication module, and in order for this module to be used in the back end, Cobar removes JDBC. When interacting with the back-end database, Cobar is directly oriented to the protocol. At present, the back-end interaction based on MySQL protocol is realized.

After the horizontal split, there are multiple data sources in the background, and the management of them is divided into two levels: DataNode and replica (HA Pool).

The DataNode layer manages the shards: one DataNode stores one shard of the data, with no overlap between shards. Each shard stores multiple copies of its data to ensure high availability; each copy is called a replica and is managed by the HA layer. Each replica represents a specific data source, i.e. a connection pool in which each concrete JDBC connection is managed. Routing operations only concern the DataNode layer; the layers below are invisible to it.

Data replication and synchronization between replicas is done by MySQL's own replication protocol, and only one replica provides service at a time (called the Master; the remaining replicas are called Slaves). Cobar keeps a heartbeat with the Master, and once it is found to be unavailable, Cobar switches to another replica, avoiding a single point of failure.

To save the number of machines in the database, you can deploy it as shown in the following figure:

HA

When the user has configured a MySQL heartbeat, Cobar can automatically send heartbeats to the backend MySQL instances to determine their running status. If an instance behaves abnormally, Cobar can automatically switch to the slave. However, two points must be emphasized:

Cobar's master/slave switchover can be triggered in two ways: manually by the user, or automatically when Cobar's heartbeat statement detects an exception. If the heartbeat detects an exception on the master, Cobar switches to the slave; if the master later recovers, the user must switch back manually. Cobar will not switch back automatically when the master recovers, unless the slave's heartbeat also returns an exception.

Cobar only checks for MySQL master/slave exceptions and does not care about data synchronization between master and slave. Users therefore need to configure bi-directional synchronization on the MySQL master and slave before using Cobar. For details, see the MySQL reference manual.

Problems solved by Cobar

Distributed: Cobar's distribution is mainly achieved by putting tables into different libraries.

Cobar supports splitting a table horizontally into multiple parts and putting them into different libraries to achieve horizontal splitting of tables.

Cobar also supports putting different tables into different libraries.

In most cases, users mix the above two methods

It should be emphasized that Cobar does not support splitting a table, e.g. a test table, into test_1, test_2, test_3, ... placed in the same library; the split tables must be placed in different libraries to achieve distribution.

Constraints of Cobar

Join, paging, sorting and subquery operations across libraries are not supported.

SET statement execution is ignored, except for transactions and character set settings

In the case of sub-libraries, the insert statement must include split field column names

In the case of sub-library, the update statement cannot update the value of the split field

SAVEPOINT operation is not supported

Only MySQL data nodes are supported temporarily

When using JDBC, the rewriteBatchedStatements=true parameter setting is not supported (default is false)

When using JDBC, the useServerPrepStmts=true parameter setting is not supported (default is false)

When using JDBC, the BLOB, BINARY, VARBINARY fields cannot use the setBlob () or setBinaryStream () methods to set parameters

MyCAT

By definition and classification, MyCAT is an open-source distributed database system: a server implementing the MySQL protocol. Front-end users can regard it as a database proxy accessible with MySQL client tools and the command line, while its back end can communicate with multiple MySQL servers using the MySQL native protocol, or with most mainstream database servers using JDBC. Its core function is sharding: splitting a large table horizontally into N small tables stored on backend MySQL servers or other databases.

With the development of the current version, MyCAT is no longer a simple MySQL agent, its back-end can support mainstream databases such as MySQL, SQL Server, Oracle, DB2, PostgreSQL, etc., as well as MongoDB, a new type of NoSQL storage, and will support more types of storage in the future.

MyCAT is a powerful database middleware. It can be used not only for read-write separation, sharding and disaster-recovery management, but also for multi-tenant application development and cloud platform infrastructure, giving your architecture strong adaptability and flexibility. With MyCAT's forthcoming intelligent optimization module, the system's data access bottlenecks and hotspots become clear at a glance; based on these statistics, backend storage can be adjusted automatically or manually, moving different tables to different storage engines without changing a single line of application code.

MyCAT is a version developed on the basis of Cobar with two significant improvements:

The backend has been changed from BIO to NIO, and the concurrency has been greatly improved.

Added aggregation support for Order By, Group By and Limit (although Cobar can also pass through Order By, Group By and Limit syntax, the results are not aggregated but simply returned to the front end, so aggregation has to be completed by the business system itself)

MyCAT architecture

Transactions are weak XA.

The most important verb in MyCAT's principle of operation is "intercept". It intercepts the SQL statement sent by the user, first performs some specific analysis on it, such as sharding analysis, routing analysis, read-write separation analysis and cache analysis, then sends the SQL to the real backend databases, processes the returned results appropriately, and finally returns them to the user.

For SQL statements it does not support, MyCAT provides a solution called "annotations": an extra piece of annotation-organized SQL is added before the statement to be executed so that the SQL can execute correctly. An annotation is effectively a transparent proxy forwarding of a statement MyCAT does not support, delivering it directly to the target data node for execution.

MyCAT itself has a management and monitoring mode similar to other databases, which can be managed through the MySQL command line, logging in to the management port (9066) to execute the corresponding SQL, or through jdbc for remote connection management.

HA

As proxy middleware, the high availability of a MyCAT system involves both the high availability of MyCAT itself and that of the backend MySQL. In most cases it is recommended to use MySQL master-slave replication for high availability and let MyCAT perform the automatic master-slave switching of the backend MySQL nodes.

HA on the MySQL side

The MySQL nodes run a master-slave replication configuration; the master node is configured as the writeHost in MyCAT's dataHost and the slave node as the readHost. MyCAT periodically initiates heartbeat detection on all writeHost and readHost nodes in a dataHost.

Normally, MyCAT uses the first writeHost as the write node, and all DML SQL is sent to this node.

If read-write separation is enabled in MyCAT, queries are sent to the readHost (and optionally the writeHost) for execution according to the read-write separation strategy.

If the first writeHost goes down, MyCAT automatically switches to the next available writeHost for DML SQL statements after three failed heartbeats (the default).

When the originally configured MySQL write node recovers after downtime, it becomes a slave node following the new master, and master-slave synchronization must be reconfigured.

MyCAT's own HA

The official recommendation is hardware-based load balancing or software such as HAProxy.

If you are also worried about the stability and single-node risk of HAProxy itself, you can use the floating VIP function of keepalived to strengthen it.

MyCAT features and features

Support for the SQL 92 standard

Supports MySQL clusters and can be used as a Proxy

Supports JDBC connections to multiple databases

Support for NoSQL databases

Supports Galera for MySQL cluster, Percona Cluster or MariaDB cluster, providing a highly available sharding cluster

Automatic failover, high availability

Supports read-write separation, MySQL dual-master multi-slave mode, and single-master multi-slave mode

Supports global tables, automatically splitting data to multiple nodes for efficient table-association queries

Supports consistent-hash sharding, effectively solving the shard expansion problem

Multi-platform support, easy to deploy and try

Supports Catlet development, similar to database stored procedures, for custom-coded implementations of complex cross-shard SQL

Supports two network communication mechanisms, NIO and AIO: AIO is recommended under Windows, NIO under Linux

Support for MySQL stored procedure calls

Support for SQL interception and rewriting as a plug-in

Supports auto-increment sequences and Oracle-style sequence mechanisms

Supports MySQL, MongoDB, Oracle, SQL Server, Hive, DB2, PostgreSQL and so on
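The consistent-hash sharding listed above eases shard expansion because adding a node only remaps the keys that fall into its arc of the hash ring, instead of reshuffling almost everything as plain modulo hashing would. A minimal ring with virtual nodes (the vnode count and node names are illustrative, not MyCAT's actual implementation):

```java
import java.util.*;

// Minimal consistent-hash ring with virtual nodes, to illustrate why
// expansion moves only a fraction of the keys.
public class ConsistentHash {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private static final int VNODES = 100; // virtual nodes smooth the distribution

    void addNode(String node) {
        for (int i = 0; i < VNODES; i++)
            ring.put((node + "#" + i).hashCode(), node);
    }

    String nodeFor(String key) {
        Map.Entry<Integer, String> e = ring.ceilingEntry(key.hashCode());
        return (e != null ? e : ring.firstEntry()).getValue(); // wrap around the ring
    }

    public static void main(String[] args) {
        ConsistentHash ch = new ConsistentHash();
        ch.addNode("db0"); ch.addNode("db1"); ch.addNode("db2");
        Map<String, String> before = new HashMap<>();
        for (int k = 0; k < 1000; k++) before.put("row" + k, ch.nodeFor("row" + k));
        ch.addNode("db3");                  // expand by one shard
        int moved = 0;
        for (int k = 0; k < 1000; k++)
            if (!before.get("row" + k).equals(ch.nodeFor("row" + k))) moved++;
        // Roughly 1/4 of keys move to db3; plain modulo would move ~3/4 of them.
        System.out.println(moved);
    }
}
```

With a `key % 4` scheme, almost every key's shard changes when the divisor changes from 3 to 4; here only the keys whose ring position now falls under a db3 virtual node are remapped.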

MyCAT's current project

MyCAT-Server:MyCAT core services

MyCAT-Spider:MyCAT crawler technology

MyCAT-ConfigCenter:MyCAT configuration Center

MyCAT-BigSQL:MyCAT big data processing (not more detailed yet)

MyCAT-Web:MyCAT Monitoring and web (new version under development)

MyCAT-Balance:MyCAT load balancer (not more detailed yet)

DRDS/TDDL

Alibaba. Distributed Relational Database Service.

Ali's distributed database DRDS was formerly Taobao's distributed database layer TDDL. Around 2012, Ali began trying to export the TDDL system to Aliyun, and it gained a new name: DRDS.

TDDL

Taobao developed TDDL (Taobao Distributed Data Layer, nicknamed "Big Head") according to its own business characteristics. It mainly solves transparent sharding for applications and data replication between heterogeneous databases. It is a JDBC datasource implementation based on centralized configuration, with functions such as active/standby switching, read-write separation and dynamic database configuration.

TDDL is not standalone middleware; it can only be regarded as a middle layer, provided to the application as a jar package to call. It follows the JDBC shard idea.

TDDL is between the business tier and the JDBC tier.

In fact, TDDL can be divided into three layers: the Matrix layer, the Group layer and the Atom layer. The Matrix layer implements the sharding logic and holds multiple Group instances underneath. Group and Atom together constitute a dynamic data source: the Group layer implements the read-write separation logic of the database's Master/Slave schema and holds multiple Atom instances underneath, while the Atom layer (which holds the data sources) implements dynamic push of database ip, port, password, connectionProperties and other information, and manages the underlying JBoss data sources.

The TDDL community is at a standstill and there are few resources available online.

DRDS

DRDS/TDDL is a distributed database service independently developed by Alibaba. DRDS grew out of Alibaba's open-source Cobar distributed database engine and absorbed the Cobar-Proxy source code at Cobar's core, implementing an independent MySQL-Proxy-like protocol parsing layer that parses and processes incoming SQL and shields applications from the complex underlying DB topology, giving them the same experience as a single-machine database. It also draws on Taobao TDDL's rich practical experience with distributed databases: it supports distributed Join, aggregate functions such as SUM/MAX/COUNT/AVG, sorting and other functions, and solves a series of problems derived from using a distributed database through features such as heterogeneous indexes and small-table broadcast, finally forming a complete distributed database solution.

The location of DRDS in the entire Ali system:

For many applications, a standalone database eventually hits the ceiling of single-machine performance, running into limits on a series of system resources such as TPS, QPS, memory capacity and disk capacity. DRDS's main goal is to solve these problems. It mainly provides two functions, read-write separation and database sharding:

Read-write separation allows one machine to handle writes while multiple machines handle reads; for read-heavy applications this removes the system bottleneck at very low cost.

Database sharding is the ultimate solution to the storage bottleneck. Its core idea is simply divide and conquer: by distributing the data across multiple machines and ensuring requests are evenly spread across them, all kinds of business performance bottlenecks can be solved at very low cost. Of course, sharding also has a price. The most obvious one is that a distributed database restricts some operations that were available on a single machine, because in a distributed environment those operations are very inefficient or slow; even if they can be implemented, they are unusable for performance reasons.

Other functional features

1. Distributed MySQL execution engine

The main goal is full compatibility with the standalone database SQL engine, achieved through intelligent push-down of SQL: intelligently analyzing which SQL can be forwarded directly, which needs optimization and what to optimize, and which instance nodes to route to, so as to fully exploit the capacity of the database instances and reduce network data transfer. Finally, the small result sets processed by the different instances are aggregated and returned to the application caller. This is the intelligent push-down function of the distributed SQL engine.

The responsibility of the distributed engine includes four processes: SQL parsing, optimizing, executing and merging.

It supports almost all languages on the market (those with MySQL access) and is compatible with more than 90% MySQL syntax.

Case study:

For example, consider a simple AVG operation. In some basic distributed database models, the common practice is to send AVG directly to all storage nodes, which is syntactically compatible but semantically wrong and produces incorrect results. DRDS's intelligent push-down engine instead performs full semantic adaptation of the SQL: for an AVG operation, the engine parses the logical AVG SQL and optimizes it into SUM and COUNT SQL before pushing down. The underlying database instance nodes compute SUM and COUNT, making full use of their computing power; the engine layer then aggregates the SUM and COUNT results of each storage node and finally computes the AVG.
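The AVG rewrite can be demonstrated with plain numbers. This sketch is not DRDS code; it just shows why averaging per-shard averages is wrong while combining per-shard SUM and COUNT is right:

```java
import java.util.*;

// Sketch of the AVG push-down: AVG(x) is rewritten into per-shard SUM(x)
// and COUNT(x); the final average is computed in the aggregation layer.
public class AvgPushdown {
    static double distributedAvg(List<List<Double>> shards) {
        double sum = 0;
        long count = 0;
        for (List<Double> shard : shards) {   // each shard returns its SUM and COUNT
            for (double v : shard) sum += v;  // SUM(x) computed on the shard
            count += shard.size();            // COUNT(x) computed on the shard
        }
        return sum / count;                   // final AVG in the engine layer
    }

    public static void main(String[] args) {
        // Shard 1 holds {1, 2, 3} (AVG 2.0), shard 2 holds {10} (AVG 10.0).
        // Naively averaging the per-shard AVGs gives (2 + 10) / 2 = 6.0 — wrong.
        List<List<Double>> shards = List.of(List.of(1.0, 2.0, 3.0), List.of(10.0));
        System.out.println(distributedAvg(shards)); // 4.0, i.e. 16 / 4 — correct
    }
}
```

The error of the naive approach grows with the imbalance in row counts between shards, which is exactly why the semantic rewrite matters.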

2. Online smooth expansion

The focus of online data expansion is the word "online": users do not need to stop their business for the splitting operation and can directly add new RDS nodes to the cluster for seamless expansion. DRDS divides the whole expansion process into several stages, including full migration, incremental synchronization and database switching. The data is moved ahead of time, and incremental parallel synchronization runs for a period of time, so the final switchover of the database can be completed in a very short time (seconds) without affecting the business.

3. Small-table broadcast

After large business tables are split, there are often small, rarely updated reference tables left over. These tables frequently join with the split large tables, which physically results in distributed join queries whose overall efficiency is relatively low. For this distributed join scenario, a dedicated ETL tool is used to broadcast the small table: all data of the reference table (including incremental updates) is automatically broadcast to the machines holding the large table, turning the original distributed query into a single-machine local query.

4. Globally unique ID

The goal of the DRDS sequence function is only to ensure the global uniqueness of the data. Although it is basically obtained by time series, it is not globally ordered.
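A sequence with exactly these properties (globally unique, time-prefixed, but not totally ordered across nodes) can be sketched with a snowflake-style bit layout. The layout and field widths below are assumptions for illustration, not DRDS's actual implementation:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a time-based unique ID: timestamp | worker | counter.
// IDs are globally unique, roughly time-ordered, but IDs issued by different
// workers in the same millisecond are not totally ordered.
public class SequenceSketch {
    private final long workerId;               // must be unique per node (assumption)
    private final AtomicLong counter = new AtomicLong();

    SequenceSketch(long workerId) { this.workerId = workerId; }

    long next() {
        long millis = System.currentTimeMillis();
        // 41 bits of time | 10 bits of worker id | 12 bits of per-ms counter
        return (millis << 22) | ((workerId & 0x3FF) << 12) | (counter.getAndIncrement() & 0xFFF);
    }

    public static void main(String[] args) {
        SequenceSketch seq = new SequenceSketch(7);
        long a = seq.next(), b = seq.next();
        System.out.println(a != b);            // unique even within one millisecond
    }
}
```

Note the trade-off the article describes: the timestamp prefix makes IDs roughly increase over time, but clock skew between workers means no global ordering guarantee.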

5. Heterogeneous index

Solve the inefficiency problem caused by the inconsistency between the data split dimension and the data query dimension in the distributed scenario.

When a data table is split into multiple sub-databases and sub-tables, the distribution rule of the data is fixed, but the business usage scenarios for the data are usually complex. If the query dimension of the data is consistent with the split-distribution rule, a single SQL executes on one sub-database and sub-table; if it is inconsistent, a single SQL may execute on multiple sub-databases and sub-tables. Such cross-database queries increase IO cost, and query efficiency inevitably drops.

The idea for solving this problem follows the usual distributed database principle of making SQL execute on a single database. The actual approach is a "space for efficiency" solution: the same data table is stored redundantly in multiple copies, each split according to a different business usage scenario so that the split dimension matches the usage dimension, with real-time data replication between the copies to keep them consistent. This is the heterogeneous index scheme. Of course, heterogeneous index tables cannot be abused indefinitely: too many of them hurt synchronization efficiency and put synchronization pressure on the source data table.
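The "space for efficiency" trade-off can be shown with a toy example. The table name, columns (buyer_id, seller_id) and shard rule are invented for illustration: orders are sharded by buyer, and a redundant copy is sharded by seller, so a seller-dimension query hits exactly one shard instead of all of them.

```java
import java.util.*;

// Toy heterogeneous index: the same rows stored twice under two split dimensions.
public class HeterogeneousIndex {
    static final int SHARDS = 4;
    // shard -> rows; each row is {buyerId, sellerId, orderId}
    static List<List<long[]>> byBuyer = new ArrayList<>();
    static List<List<long[]>> bySeller = new ArrayList<>();
    static {
        for (int i = 0; i < SHARDS; i++) {
            byBuyer.add(new ArrayList<>());
            bySeller.add(new ArrayList<>());
        }
    }

    static void insert(long buyerId, long sellerId, long orderId) {
        long[] row = {buyerId, sellerId, orderId};
        byBuyer.get((int) (buyerId % SHARDS)).add(row);   // primary split dimension
        bySeller.get((int) (sellerId % SHARDS)).add(row); // redundant index copy
    }

    // A seller-dimension query touches exactly one shard of the index copy.
    static List<Long> ordersOfSeller(long sellerId) {
        List<Long> ids = new ArrayList<>();
        for (long[] r : bySeller.get((int) (sellerId % SHARDS)))
            if (r[1] == sellerId) ids.add(r[2]);
        return ids;
    }

    public static void main(String[] args) {
        insert(1, 9, 100);
        insert(2, 9, 101);
        insert(3, 5, 102);
        System.out.println(ordersOfSeller(9)); // [100, 101]
    }
}
```

In a real system the double write would be done by asynchronous replication rather than in-line as here, which is exactly where the synchronization pressure the article mentions comes from.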

Other middleware of the same style

Atlas, Vitess, Heisenberg, CDS, DDB, OneProxy, etc.

Atlas

Qihoo 360.

A data middle-layer project based on the MySQL protocol, developed and maintained by the infrastructure team of the Web Platform Department. It is optimized on top of mysql-proxy 0.8.2, with some new features added.

Atlas is located between the application and the MySQL. It implements the client and server protocols of MySQL, communicates with the application as the server, and communicates with the MySQL as the client. It shields the details of DB from the application.

Atlas cannot implement distributed sub-tables: all sub-tables must be in the same database of the same DB instance and must all be created in advance; Atlas has no automatic table creation function.

Heisenberg

Baidu.

Its advantages: sharding is separated from the application, and using sharded tables feels the same as using a single table; it reduces pressure on the number of DB connections; hot restart of configuration; horizontal scaling; compliance with the MySQL native protocol; read-write separation; no language restrictions (mysqlclient, C, Java can all use it); the Heisenberg server can be inspected through management commands, e.g. connection counts, thread pools, nodes; and velocity-based partition scripts can be used to customize partition rules very flexibly.

(The open-source version is no longer maintained.)

CDS

JD. Completed Database Sharding.

CDS is a client-based sharding middleware product. It implements the standard JDBC API and supports sharding, read-write separation, data operation and maintenance, and many other common functions, providing high-performance, high-concurrency and highly reliable routing access to massive data; business systems can adopt it at near-zero cost. It currently supports MySQL, Oracle and SQL Server.

(Architecturally similar to Cobar and MyCAT, but it connects directly via JDBC rather than implementing the MySQL protocol, and has no NIO/AIO. Its SQL parser module uses JSqlParser; candidate SQL parsers ranked: druid > JSqlParser > fdbparser.)

DDB

NetEase (nicknamed "the pig farm"). Distributed DataBase.

DDB has undergone three major service model changes: Driver model-> Proxy model-> cloud model.

Driver mode: based on JDBC driver access, provides a db.jar, similar to TDDL, located between the application layer and JDBC.

Proxy mode: a group of proxy servers are built in DDB to provide standard MySQL services, and the logic of sub-library and sub-table is implemented in the proxy server. The application accesses the DDB Proxy through the standard database driver, and the Proxy internally restores the request to SQL through the MySQL decoder, and the result is obtained by DDB Driver execution.

Private cloud model: Cloudadmin, a platform management tool developed based on NetEase's private cloud, breaks up the original Master functions of DDB, integrates some sub-library related functions into proxy, such as sub-library management, table management, user management, and some centralized functions into Cloudadmin, such as alarm monitoring. In addition, Cloudadmin provides platform features such as one-click deployment, automatic and manual backup, and version management.

Data incremental subscription and consumption

Based on parsing the database's incremental logs, this class of middleware provides incremental data subscription and consumption; MySQL is the main database supported.

A review of middleware on data incremental subscription and consumption:

The incremental subscription and consumption module should include binlog log crawling, binlog log parsing, event distribution filtering (EventSink), storage (EventStore) and other main modules.

If HA is needed, the state of each sub-module can be saved in ZooKeeper, making the whole incremental subscription and consumption module stateless; the consumer's (client's) position can likewise be saved in ZK.

As a whole, it is managed centrally and resources are allocated through a Manager System.
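As a toy illustration of the modules above, the sketch below wires parsed binlog events through an EventSink filter into a bounded EventStore. This is illustrative Python only; the class names mirror the text, not canal's actual (Java) API, and the filter rule is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class BinlogEvent:
    schema: str
    table: str
    kind: str      # "insert" | "update" | "delete"
    row: dict

class EventSink:
    """Filters and routes parsed events before they reach the store."""
    def __init__(self, table_filter: Callable[[BinlogEvent], bool]):
        self.table_filter = table_filter

    def sink(self, events):
        return [e for e in events if self.table_filter(e)]

class EventStore:
    """A bounded in-memory queue standing in for the real ring-buffer store."""
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.events: List[BinlogEvent] = []

    def put(self, events):
        if len(self.events) + len(events) > self.capacity:
            raise OverflowError("store full; consumer is lagging")
        self.events.extend(events)

# wire the modules together the way the text describes
parsed = [
    BinlogEvent("shop", "orders", "insert", {"id": 1}),
    BinlogEvent("shop", "audit_log", "insert", {"id": 2}),
]
sink = EventSink(lambda e: e.table != "audit_log")   # drop a noisy table
store = EventStore()
store.put(sink.sink(parsed))
print([e.table for e in store.events])   # ['orders']
```

The point of the separation is that parsing, filtering, and buffering can fail and be restarted independently, with positions kept in ZK.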

Canal

Canal architecture diagram:

Description:

Server represents one running canal instance and corresponds to one JVM

Instance corresponds to one data queue (1 server hosts 1..n instances)

Instance module:

EventParser (data source access, simulation of slave protocol and master interaction, protocol parsing)

EventSink (Parser and Store linkers for data filtering, processing, distribution)

EventStore (data storage)

MetaManager (incremental subscription & consumption Information Manager)

Description: one canal is deployed per machine, and one canal can run multiple instances (configured via destinations, etc.). In general one client connects to one instance (each instance can be configured with a standby), and multiple clients can connect to the same instance, but only one client may consume that instance's data at any moment; this is arbitrated by ZooKeeper.
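The single-active-consumer rule can be sketched as follows. This is a toy in-process model: a plain lock stands in for the ZooKeeper arbitration, and all names are invented; real canal clients use the Java CanalConnector API.

```python
import threading

class Instance:
    """One data queue. At most one client may consume it at a time; in real
    canal ZooKeeper arbitrates this, here an in-process lock stands in."""
    def __init__(self, destination):
        self.destination = destination
        self._owner_lock = threading.Lock()
        self._queue = [{"binlog_offset": n} for n in range(3)]

    def try_subscribe(self):
        # True for the single winning client, False for everyone else
        return self._owner_lock.acquire(blocking=False)

    def get(self, batch_size):
        batch, self._queue = self._queue[:batch_size], self._queue[batch_size:]
        return batch

inst = Instance("example_destination")
client_a = inst.try_subscribe()   # first client wins the instance
client_b = inst.try_subscribe()   # second client is turned away
batch = inst.get(2)               # the winner consumes in batches
print(client_a, client_b, len(batch))   # True False 2
```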

Database synchronization

Otter

Background: given the characteristics of Alibaba's B2B business, sellers are concentrated in China while buyers are concentrated abroad, which created the need for data centers in both Hangzhou and the United States. To improve the user experience, the two data centers run active-active (writable on both sides), which gave birth to a product like otter.

The first version of otter dates back to 2004-05; the externally open-sourced version is version 4, under continuous development since July 2011. Today essentially all of Alibaba B2B's local and cross-region data-center synchronization needs run on otter4.

Otter is a distributed database synchronization system that, based on parsing incremental database logs, synchronizes data quasi-real-time to MySQL/Oracle databases in the local or a remote data center.

working principle

Principle description:

Based on Canal open source products, obtain database incremental log data.

Typical management system architecture, manager (Web management) + node (work node)

At runtime, the manager pushes the synchronization configuration to the nodes

The nodes feed their synchronization status back to the manager

Based on zookeeper, it can solve the distributed state scheduling and allow multi-node nodes to work together.

The role of Otter

Heterogeneous library

MySQL -> MySQL/Oracle. (The current open-source version only captures MySQL increments; the target library can be MySQL or Oracle, depending on canal's capabilities.)

Synchronization within a single data center (database-to-database RTT (Round-Trip Time) < 200ms)

Computer room disaster recovery

Two-way synchronization

Avoid loopback algorithm (a general solution that supports most relational databases)

Data consistency algorithm (guarantees eventual consistency of data in active-active mode)

File synchronization

Site mirroring (when data is replicated, the associated files are replicated too; for example, when product data is copied, the product images are copied along with it.)
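One common way to realize the loopback-avoidance idea listed above is to mark transactions that arrived via replication so they are never shipped back to their origin. A minimal sketch, assuming an invented SYNC_MARK marker (otter's actual mechanism works at the database/transaction level and is more involved):

```python
# Each side tags the transactions it applies from the remote side; the local
# log shipper then skips tagged transactions so they never loop back.
SYNC_MARK = "_otter_sync"   # hypothetical marker written inside the same tx

def apply_remote_tx(local_log, tx):
    """Apply a replicated transaction, stamping it with the sync marker."""
    local_log.append({"stmts": tx, "mark": SYNC_MARK})

def outbound_txs(local_log):
    """Ship only transactions that did NOT originate from replication."""
    return [t for t in local_log if t.get("mark") != SYNC_MARK]

log = []
log.append({"stmts": ["update price"], "mark": None})   # a genuine local write
apply_remote_tx(log, ["insert order"])                   # arrived via sync
print(len(outbound_txs(log)))   # 1 -- the replicated tx is not echoed back
```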

Copy schematic diagram of single computer room

Description:

- Data stays on-the-fly and touches disk as little as possible, so synchronization is faster. (With the node load-balancing algorithm enabled, if the S+ETL stages land on different Nodes, the data goes through a network transfer.)

-node nodes can have failover/loadBalancer.

SETL

S: Select

S stands for Select, which absorbs differences between data sources: for example, connecting to canal to obtain incremental data, or connecting to other systems to obtain other kinds of data.

E: Extract

T: Transform

L: Load

Similar to the ETL model of data warehouse, it can be used for data join, data transformation, and data loading.
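The four stages can be illustrated with a toy pipeline. The field names and conversions here are invented for the example; they are not otter's API.

```python
def select(source):      # S: pull increments from canal or another source
    return list(source)

def extract(rows):       # E: keep only the columns downstream cares about
    return [{"id": r["id"], "price": r["price"]} for r in rows]

def transform(rows):     # T: convert values, join lookup data, etc.
    return [{**r, "price": r["price"] / 100} for r in rows]   # cents -> yuan

def load(rows, target):  # L: write into the target store
    target.extend(rows)

target_db = []
load(transform(extract(select([{"id": 1, "price": 1999, "junk": "x"}]))),
     target_db)
print(target_db)   # [{'id': 1, 'price': 19.99}]
```

Splitting the flow this way is what lets otter scatter S/E and T/L onto Nodes in different data centers.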

Copy schematic diagram across computer rooms

Data crosses the network: the S/E/T/L stages are spread across 2 or more Nodes, which cooperate through zookeeper (typically the Select and Extract Nodes sit in one data center and the Transform/Load Nodes in the other)

Node nodes can have failover/loadBalancer. (the Node node of each computer room can be a cluster, one or more machines)

More:

Otter scheduling model: batch processing + two-node deployment.

Otter data storage algorithm

Otter two-way loop control

Otter data consistency

Otter high availability

Otter extensibility

Remote dual active data Architecture Infrastructure DRC

DRC is short for Data Replication Center. Its replication is real-time, heterogeneous, and highly available (with a strict disaster-recovery system and good real-time behavior), and it supports subscription and distribution. The project was originally established for Taobao's remote disaster recovery, synchronizing primary and standby databases; the same technology later spawned projects such as DRC-TAIR and DRC-DUMP.

The so-called remote dual activity mainly focuses on two things, one is data synchronization and the other is data distribution.

What kind of application will require double work in different places? There are three common scenarios:

There are large numbers of users in two or more regions. For example, users in China should use the RDS service in Hangzhou while users in the United States use the RDS service in the US, which requires cross-region data synchronization. Many gaming, finance, media, and e-commerce businesses have this need. The difficulty lies in the cross-region network: long latency, high packet loss, and the risk of data leakage when data travels over the public Internet.

There are many data sources, involving a variety of heterogeneous data. For example, an application may need data from services such as ODPS, RDS, OTS, OceanBase, and PostgreSQL, whose data structures and interfaces all differ, making direct integration costly. One workable approach is to write multiple copies at write time, one for each data structure.

There are many downstream subscribers, such as a data backup system, a notification system, a big-data analysis system, an indexing system, and so on. The multi-write scheme above could cope, but it brings other difficulties: data consistency, scalability, stability of cross-network synchronization, and real-time synchronization.

DRC supports reading real-time incremental data from many different data sources, such as MySQL, RDS, OceanBase, Hbase, Oracle, etc., and supports writing to database, MetaQ, ODPS and other storage media.

In the past, with an active-active pair of data centers in the same city, the two rooms were peers, writes were distributed randomly, and data was synchronized through master/slave HA. With this peer-room design, business and data growth can only be absorbed by piling more machines into both rooms; and if the whole city loses power, active-active becomes dead-dead. The next idea was cross-city: an early common practice was to write in one city and keep a cold standby in the other, synchronizing at night, but that means an incident during the day puts the whole day's data at risk. Another idea is multi-write across two cities with data landing on both sides; the problem is that if the application makes frequent calls, even one or two remote calls per request make the overall latency very long. The refinement of that idea is to close traffic within a unit to reduce remote calls, which requires reworking the business.

Along this line of thinking, Ali's cross-region active-active focuses on a few things. One is hot pluggability: nodes can be added at a business peak and shut down afterwards; a key enabler is real-time traffic switching, and DRC can migrate traffic from one region to another in under 20 seconds. The other is real-time data recovery: through a degree of redundancy, once one unit dies, a full recovery can be performed in another unit.

The challenge of cross-region active-active data is enormous. During the Singles' Day sales, transactions surge, so the trading link has been unitized. Trading-link data splits along three dimensions: buyer, seller, and commodity. Buyers rarely overlap with each other, which naturally suits this isolation; sellers are latency-sensitive, so data is partitioned by the seller dimension and closed within a unit, while both seller and commodity data are written in the center.

Two core requirements for data:

Consistency: seller and commodity data must agree, and unit and center must agree; synchronization must lose no data and corrupt no data, while still guaranteeing transactions.

Real-time, need to achieve the level of delay in seconds.

There are two types of dual-unit synchronization architectures:

One is read-write separation: write in the center, read in the units. If the data a unit needs has not been synchronized from the center in time, or was synchronized incorrectly, every transaction in that window is affected. The core here is guaranteeing second-level latency while ensuring consistency. (JD's multi-center trading system takes this approach.)

The second synchronization architecture is unit closure: the center and each unit both write, and through redundancy either side can take over for the other at any time. (Similar to Otter.)

The key here is:

Avoid circular replication: by transparently passing and marking transactions in DB.

Rate limiting: peak pressure. The services chosen for unitization are exactly those with surging traffic, and both sides synchronize 100% of the data in real time, so peak pressure on every system persists undiminished. Both the store and congo components of DRC can limit rates by TPS or by traffic volume. The core ideas of the rate-limiting algorithm are per-batch sampling, reward/punishment intervals, and smooth speed changes.
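A crude illustration of the batch-sampled, smooth style of limiting: measure how long each batch actually took, then sleep just enough to hold the configured TPS instead of stopping and starting. This is a generic sketch, not DRC's actual algorithm, and the class name is invented.

```python
import time

class TpsLimiter:
    """Per-batch TPS limiter: sample each batch, then pad its duration so the
    sustained rate never exceeds max_tps (smooth, not stop-and-go)."""
    def __init__(self, max_tps):
        self.max_tps = max_tps

    def throttle(self, batch_size, elapsed):
        # minimum wall-clock time this batch *should* have taken
        min_duration = batch_size / self.max_tps
        if elapsed < min_duration:
            time.sleep(min_duration - elapsed)

limiter = TpsLimiter(max_tps=1000)
start = time.monotonic()
limiter.throttle(batch_size=100, elapsed=0.0)  # batch "finished instantly"
waited = time.monotonic() - start
print(round(waited, 1))   # about 0.1s: 100 rows held to 1000 TPS
```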

The difference between Otter and DRC:

- Otter is a product of Alibaba B2B; DRC is a product of Alibaba's technical support team.

- Otter targets MySQL, while DRC supports many types of data sources.

- DRC partitions by business and can close traffic within a unit; Otter's implementation does not touch the business and is purely database-layer technology.

- Otter is double-write; DRC is write-in-center/read-in-unit, or partial write with mutual synchronization.

- Otter's network environment is worse than DRC's, so its consistency problem is more complex to solve (one-way loopback remediation based on a trusted source; remediation based on time intersections). DRC has two implementation approaches, described above.

The core competencies of DRC in remote multi-activity are low latency, consistency and high availability.

Consistency: based on log stream crawling, playback of database table structure changes, transaction-based conflict detection.

Low latency: maximum delay less than 1s, message protocol optimization, three-level data storage, pre-read optimization IO, multi-connection multiplexing and transmission compression, efficient concurrent replication algorithm.

High availability: active / standby switching, topology change, heartbeat tracking, multi-dimensional disaster recovery.

JD multi-center trading system

JD's data replication middleware surveyed and drew on open-source implementations such as Databus, Canal/Otter, and OpenReplicator, and uses Canal's DBSync in its parsing part.

Multi-center trading is essentially a larger distributed system; the data and services it depends on and produces have different characteristics, and it inevitably involves common distributed-systems problems such as data partitioning, routing, replication, read-write consistency, and latency.

Among these, data consistency is the primary problem an e-commerce website must face: the more traffic grows, the more the immediacy and accuracy of data updates must be guaranteed. Seller data and commodity data need to be synchronized between the centers; if the synchronization delay is too long, neither buyers nor sellers will accept it. For example, a seller changes a price or stock level and users cannot see it for a long time. Likewise, data accuracy is a great challenge: sold goods must be decremented promptly and returned goods incremented promptly. All of this continually tests the robustness of the back-end systems and the database platform.

In addition to data consistency, keeping routing rules consistent is also a key issue. Technically, a single user must be routed by exactly the same rules along the full link, from login to service access to database access. If routing goes wrong, the data the user sees is wrong, which also hurts the end-user experience.

Architecture

The system comprises a main center and several sub-centers, which exchange data through the data bus. In the data flow, master data (commodity, merchant, and user data, etc.) flows from the main center to the sub-centers through the data bus, with the sub-centers read-only; transaction data (order data) flows from a sub-center to the main center in real time; on failure, traffic switches from the sub-center to the main center.

The diversion idea shows up in several ways in this system. First, when buyers visit JD.com's website to place orders, they are preferentially routed to the nearest trading center. Second, given the characteristics of the trading system, multi-center trading diverts by user dimension before order-taking (including the shopping cart, checkout page, etc.), as shown in the figure below. When a user logs in, the system queries the user-to-region mapping table (roughly, which zone you belong to), determines which sub-center the user belongs to, saves that identity in a cookie, and then routes the user to the designated sub-center. When the user visits other systems, such as the shopping cart or checkout page, the identity is read from the cookie and the user is redirected to the corresponding sub-center page.
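The login-time routing described above might look like the following toy sketch. The mapping table, center names, and dict-as-cookie are all invented for illustration; JD's real implementation is not public.

```python
REGION_OF_USER = {"alice": "north", "bob": "east"}   # hypothetical mapping table
DEFAULT_CENTER = "main"

def route(user, cookie):
    """Resolve the sub-center once at login, then trust the cookie on
    every subsequent page (cart, checkout, ...)."""
    if "center" not in cookie:
        cookie["center"] = REGION_OF_USER.get(user, DEFAULT_CENTER)
    return cookie["center"]

cookie = {}
print(route("alice", cookie))   # north  (looked up at login)
print(route("alice", cookie))   # north  (later pages just read the cookie)
```

Caching the decision in the cookie is what keeps the full link (entry, services, database) on one consistent routing rule.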

By distributing users to their corresponding sub-centers, response is faster and the user experience better, since there is no cross-region access to the data center; and each center serves a bounded number of users, giving good horizontal scalability and supporting a larger transaction scale. Of course, multiple data centers cannot be run carelessly: disaster-recovery backup must also be considered. (Recall the Alipay fiber-cut incident.)

The trading system includes an application part and a data part. The application part is stateless: the servers are interchangeable, and if one fails another simply takes over, so active-active across rooms is relatively easy to achieve. The data is different: multi-center trading is essentially a larger distributed system and inevitably involves common distributed problems such as data partitioning, routing, replication, read-write consistency, and latency.

In addition, the data and services that the transaction process depends on and produces have different characteristics. For example, the read services for goods, promotions, prices, and inventory can be called basic master data and cannot be partitioned during order placement, otherwise traffic could not form a closed loop within a single room; that is, partition inconsistency must not cause the same user to see different data within a single flow (if a promotion is 20 yuan when you add it to the cart and 25 yuan at checkout, wouldn't your face fall?). On the other hand, the write services for goods, promotions, and prices are called by third-party POP merchant applications; that business scenario's availability goal can be met with a primary-room plus cold-backup deployment, and the write-replication delay is absorbed by the merchants' operational workflow.

Put simply, the data problems show up as follows. First, how to guarantee the immediacy and accuracy of data: seller and commodity data must be synchronized between centers, and if the delay is too long, buyers and sellers will not accept it; since the centers are deployed in different regions, the delay is best kept within 1 second. For example, a seller changes a price or stock and users cannot see it for a long time. Accuracy is likewise a great challenge, because a data failure differs from an application-layer failure: an application failure may only affect user access, but data written incorrectly cannot be recovered. Second, how to keep routing rules consistent: the routing rules along the full link, from the user's entry to service access to the database, must be the same; if routing goes wrong, the data the user sees is wrong.

The change from the distribution of double computer rooms in the same city to the distribution of multi-computer rooms in different places brings new challenges to data synchronization, so how to design the data bus is also the key factor of whether the project can be realized. JD.com 's multicenter trading system carries out fast data exchange through the data bus JingoBus, the synchronization performance is more than three times that of mysql, and the availability is high and the architecture is flexible. Among them, the brand-new bus design solves the problems of database replication across computer rooms in multi-center transactions and heterogeneous data synchronization among multiple data sources, and realizes a high-performance, low-latency and robust data synchronization mechanism.

As the figure shows, the data bus mainly consists of Relay, Snapshot, and Replicator. Relay extracts transaction logs from the source database and offers a log subscription service to Replicator; in role it is equivalent to the MySQL slave I/O thread. Snapshot subscribes to all transaction logs from Relay, persists them as snapshots, and offers a bulk log subscription service to Replicator; in role it is equivalent to the MySQL slave relay log. Replicator is the consumer side of the transaction log: it pulls logs from Relay or Snapshot and applies them to the target database under the configured consistency; in role it is equivalent to the MySQL slave SQL thread. (Compare the MySQL primary/standby replication diagram below.)

Normally, Replicator connects directly to Relay and consumes the transaction logs in Relay's memory queue. In some cases, however, due to network jitter, high load on the target library, and similar factors, Replicator may lag far behind Relay; there is also the cold-start problem when a new consumer subscribes to the same data source. To avoid taking a fresh full snapshot from the data source, Snapshot acts as a special consumer of Relay: it continuously consumes the online transaction log from Relay in a high-throughput mode and, by processing the log effectively, maintains a consistent snapshot of the data source (Consistent Snapshot), i.e. a snapshot of the latest state of every row in the source tables, along with a transaction log (Log Store) that reaches further back than Relay's buffer. Seen this way, the data bus, as a general CDC component of the data layer, provides an overall solution for the multi-center trading project and for asynchronous replication scenarios, and forms the core of the project.
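The Relay-first, Snapshot-fallback behavior can be sketched as below. This is illustrative only; JingoBus's real interfaces are not public, so the classes and window sizes are invented.

```python
class Relay:
    """Serves only the most recent transaction logs from memory."""
    def __init__(self, window=100):
        self.window, self.log = window, []

    def append(self, entry):
        self.log.append(entry)

    def fetch(self, offset):
        first_kept = max(0, len(self.log) - self.window)
        if offset < first_kept:
            return None                 # too old: fell out of the memory queue
        return self.log[offset:]

class Snapshot:
    """Persists everything Relay has seen, for lagging or cold-start consumers."""
    def __init__(self):
        self.log = []

    def fetch(self, offset):
        return self.log[offset:]

def replicate(offset, relay, snapshot):
    """Replicator: prefer the fast in-memory relay, fall back to the snapshot."""
    batch = relay.fetch(offset)
    if batch is None:
        batch = snapshot.fetch(offset)  # catch up from persistent storage
    return batch

relay, snap = Relay(window=2), Snapshot()
for i in range(5):
    relay.append(i)
    snap.log.append(i)

print(replicate(0, relay, snap))   # [0, 1, 2, 3, 4] -- served via Snapshot
print(replicate(4, relay, snap))   # [4] -- served straight from Relay
```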

Migration across databases (data sources)

Yugong

A tool for migrating data off Oracle. Positioning: database migration (currently mainly Oracle -> MySQL/DRDS).

Around 2008, Alibaba began researching MySQL and developed products based on MySQL sharding technology, Cobar/TDDL (today's Aliyun DRDS), solving the scalability limits that a single Oracle instance could not meet. This also set off a wave of de-IOE projects (removing IBM, Oracle, and EMC), out of which the Yugong project was born. Its goal is to help users migrate data from Oracle to MySQL, the first step of de-IOE.

Overview

The whole data migration process is divided into two parts:

Full migration

Incremental migration

Process description:

Incremental data collection (create an incremental materialized view of the Oracle table)

Make a full copy

Perform incremental replication (data validation can be performed in parallel)

Stop writing the original library and switch to the new library

The Oracle full copy pulls data over JDBC; the increment is implemented with materialized views.

Architecture

Description:

One JVM container corresponds to multiple instances, and each instance corresponds to one table-migration task

Instance is divided into three parts.

Extractor (extracting data from the data source database, which can be divided into full / incremental implementation)

Translator (custom transform the data on the source database according to the requirements of the target database)

Applier (update data to the target database, which can be divided into full / incremental / contrast implementation)

Custom data conversion

If the Oracle and MySQL schemas to be migrated differ, for example in table names, column names, or incompatible column types, custom data transformation is needed. If they are identical, this step can be skipped.

The overall data flow is: DB -> Extractor -> DataTranslator -> Applier -> DB. The program exposes a DataTranslator interface (Java only) that lets users plug in custom data-processing logic, for example:

Table name is different

Field names are different

Different field types

The number of fields is different

Running process join other table data to do calculation, etc.
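A sketch of what such a translator hook does, covering the renamed-column and changed-type cases above. Yugong's real interface is Java; the class and column names here are invented for illustration.

```python
class DataTranslator:
    """Hypothetical stand-in for yugong's Java DataTranslator hook."""
    def translate(self, row):
        return row   # identity by default

class OrderTranslator(DataTranslator):
    """Rename columns and reshape types while the row flows from
    Extractor to Applier: ORDER_NO/AMT (Oracle) -> order_no/amount (MySQL)."""
    def translate(self, row):
        return {"order_no": str(row["ORDER_NO"]), "amount": int(row["AMT"])}

pipeline = OrderTranslator()
print(pipeline.translate({"ORDER_NO": 42, "AMT": "100"}))
# {'order_no': '42', 'amount': 100}
```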

Introduction of operation mode

1.MARK mode (MARK)

Turns on incremental logging. For Oracle, a materialized view is created.

2.CLEAR mode (CLEAR)

Cleans up the incremental log. For Oracle, the materialized view is dropped.

3. Full volume mode (FULL)

Full mode, as the name suggests, is to perform a full operation on the source table and insert the target table after traversing all the data in the source table.

There are two ways to deal with the total quantity:

Paged processing: if the source table has a primary key consisting of a single Number-typed column, paged mode is selected by default. Advantages: supports resuming from a breakpoint; relatively light pressure on the source library. Disadvantage: slower migration.

One-shot processing: read one MVCC version of the entire source table via select *, traversing the whole result set with cursor.next. Advantage: fast migration, roughly 5x paged processing. Disadvantages: heavy pressure on the source database, and if the source sees heavy concurrent modification, too many MVCC versions accumulate and errors occur. Resuming from a breakpoint is not supported.
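The paged strategy is essentially keyset pagination over the numeric primary key. A sketch of the idea (illustrative only, not yugong's code; the in-memory "tables" stand in for JDBC queries):

```python
def full_copy_paged(fetch_page, apply_batch, page_size=2):
    """Keyset pagination: resumable and gentle on the source, but slower
    than a single streaming scan."""
    last_id = None
    while True:
        page = fetch_page(last_id, page_size)   # WHERE id > :last ORDER BY id
        if not page:
            return last_id                      # checkpoint for resume
        apply_batch(page)
        last_id = page[-1]["id"]                # breakpoint survives restarts

source = [{"id": i, "v": i * i} for i in range(1, 6)]

def fetch_page(after_id, n):
    rows = [r for r in source if after_id is None or r["id"] > after_id]
    return rows[:n]

target = []
checkpoint = full_copy_paged(fetch_page, target.extend)
print(len(target), checkpoint)   # 5 5
```

Because only `last_id` needs to be persisted, a killed migration can resume from its checkpoint, which is exactly the breakpoint-resume advantage the text describes.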

4. Incremental mode (INC)

Incremental mode, as the name suggests, applies the source table's incremental changes to the target table; it depends on the logging function.

At present, the logging function of incremental mode is through the materialized view function of oracle.

5. Automatic mode (ALL)

Automatic mode, is a combination of full + incremental mode, automatic operation, reduce operating costs.

Internal implementation steps for automatic mode:

Turn on the logging function. (create a materialized view)

Run full synchronization mode. (after full completion, automatically proceed to the next step)

Run incremental synchronization mode. (incremental mode, there is no concept of completion, so it will not exit automatically. You need to judge whether the business can exit or not. You can take a look at the switching process.)

6. Contrast mode (CHECK)

The comparison mode is to make a full comparison of the data of the source database and the target database to verify the migration results. The comparison mode is an optional operation. After doing the full / incremental / automatic mode, you can selectively run the comparison mode to ensure the correctness of this migration.

DataX

DataX is a tool for high-speed data exchange between heterogeneous databases/file systems, enabling data exchange between arbitrary data-processing systems (RDBMS/HDFS/local file systems).

At present, there are many mature data import and export tools, but generally they can only be used for data import or export, and can only support one or several specific types of databases.

The problem: if we run many different kinds of databases/file systems (MySQL/Oracle/RAC/Hive/...) and often need to move data between them, we end up developing, maintaining, and learning many such tools (jdbcdump/dbloader/multithread/getmerge+sqlloader/mysqldumper...), and every library type added later grows the tool count linearly. When importing MySQL data into Oracle, haven't we all wanted to saw a piece off jdbcdump and dbloader and bolt them together? Some of these tools move data through files, some through pipes, each adding its own overhead to the transfer, with widely varying efficiency. Many cannot satisfy common ETL needs such as date-format conversion, special-character conversion, or encoding conversion. And sometimes we want to export one data set from one database into several databases of different types within a very short time window. DataX was created to solve these problems.

Left: when the (n+1)th data source is added, must n more data synchronization tools be developed?

Right: only a Reader/Writer plug-in pair for the new data source needs to be developed to exchange data with all existing sources.

Design concept

To solve the synchronization of heterogeneous data sources, DataX turns the complex mesh of synchronization links into a star-shaped data link, with DataX acting as the intermediate transmission carrier connecting the various data sources. To add a new data source, it only needs to be connected to DataX to synchronize seamlessly with every existing source. DataX is used throughout Alibaba Group, carrying all of big data's offline synchronization business, and has run stably for 6 years. It currently completes more than 80,000 synchronization jobs per day, moving over 300TB of data daily.

Frame design

DataX itself, as an offline data-synchronization framework, is built on a Framework + plugin architecture. Reading and writing of data sources are abstracted as Reader/Writer plug-ins and incorporated into the synchronization framework.

Reader: Reader is the data acquisition module, which is responsible for collecting data from the data source and sending the data to Framework.

Writer: the data-writing module, responsible for continuously fetching data from the Framework and writing it to the destination.

Framework: connects reader and writer, serving as the data-transmission channel between them and handling core technical concerns such as buffering, flow control, concurrency, and data conversion.

The DataX framework tackles the problems of high-speed data exchange with techniques such as double-buffered queues and thread-pool encapsulation, and provides a simple interface for plug-ins to interact with. Plug-ins come in two kinds, Reader and Writer, and new plug-ins are easy to develop against the framework's plug-in interface. For example, to export data from Oracle into MySQL, all that is needed is an OracleReader and a MysqlWriter plug-in assembled into the framework; such plug-ins are then generally reusable in other data-exchange scenarios.
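The Reader -> Framework(channel) -> Writer flow can be sketched with a bounded queue as the channel, so a fast reader blocks when the writer falls behind (flow control). Illustrative Python only; DataX itself is Java and its Channel implementation is more elaborate.

```python
import queue
import threading

END = object()                       # sentinel marking end-of-stream

def reader(q):                       # Reader plug-in: collect from the source
    for row in ({"id": i} for i in range(5)):
        q.put(row)                   # blocks when the channel is full
    q.put(END)

def writer(q, sink):                 # Writer plug-in: drain into the target
    while (row := q.get()) is not END:
        sink.append(row)

channel = queue.Queue(maxsize=2)     # the Framework's bounded buffer
sink = []
threads = [threading.Thread(target=reader, args=(channel,)),
           threading.Thread(target=writer, args=(channel, sink))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(sink))   # 5
```

The bounded `maxsize` is what gives backpressure: neither side needs to know the other's speed.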

Core architecture

The open-source DataX 3.0 runs synchronization jobs in single-machine multi-threaded mode. Following the sequence diagram of a DataX job's lifecycle, here is a brief overview of the overall architecture and the relationships between DataX modules.

Core module introduction:

A single data synchronization job in DataX is called a Job. On receiving a Job, DataX starts a process to carry out the entire job synchronization. The DataX Job module is the central management node of a single job, responsible for data cleaning, sub-task splitting (turning the single Job into multiple sub-Tasks), TaskGroup management, and so on.

After the DataX Job starts, it splits into several Tasks (sub-tasks) according to the source-side sharding strategy, so they can run concurrently. A Task is the smallest unit of a DataX job; each Task synchronizes a portion of the data.

After the Tasks are split out, the DataX Job calls the Scheduler module and, according to the configured concurrency, reassembles the Tasks into TaskGroups (task groups). Each TaskGroup is responsible for running all of its assigned Tasks at a certain concurrency; the default concurrency per task group is 5.

Each Task is started by TaskGroup. After Task is started, the thread of Reader- > Channel- > Writer will be started to complete the task synchronization.

While the job runs, the Job module monitors and waits for the TaskGroup tasks to finish; once all TaskGroups are done, the Job exits successfully. Otherwise, the process exits abnormally with a non-zero exit code.

DataX scheduling process:

For example, the user submits a DataX job and configures 20 concurrency to synchronize a 100-table mysql data into the odps. The scheduling decision idea of DataX is as follows:

1. The DataX Job splits into 100 Tasks according to the sharded databases and tables.

2. Given 20 configured concurrency, DataX computes that 4 TaskGroups need to be allocated.

3. The 4 TaskGroups divide the 100 Tasks evenly: each TaskGroup runs 25 Tasks at a concurrency of 5.
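The arithmetic of this example can be sketched directly (this mirrors the description above; `per_group=5` is the stated default concurrency per task group):

```python
import math

def plan(task_count, concurrency, per_group=5):
    """Compute how many TaskGroups are needed and how many Tasks each runs."""
    group_count = math.ceil(concurrency / per_group)
    per_group_tasks = math.ceil(task_count / group_count)
    return group_count, per_group_tasks

# the example from the text: 100 sharded tables, 20 configured concurrency
groups, tasks_each = plan(task_count=100, concurrency=20)
print(groups, tasks_each)   # 4 25
```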

[this article is reproduced from the Golang language community, original link: https://mp.weixin.qq.com/s?src=11×tamp=1534989045&ver=1077&signature=ToJO30*XGVmdovXyPH*VxtSZ1-luOF39RVYMhRawyBaEzImS5l9ypwOPovM3wRIfmjfKaqafI7bDrhL57krSP6E9fd2eQAJUHmSG2Yu9YhwRe-*0XuBcM6O8JeuNAeV9&new=1]
