Database Learning: High-Concurrency Database Design
With each successive round of Letv's hardware flash sales, the request pressure on Letv Group soared a hundredfold, even a thousandfold. As the final step of a purchase, the payment system must let users complete payment quickly and reliably. So in November 2015 we carried out a comprehensive architectural upgrade of the entire payment system, enabling it to stably process 100,000 orders per second and providing strong support for flash-sale activities of every kind across the Letv ecosystem.
I. Database and Table Sharding
In an Internet era dominated by caching systems such as Redis and Memcached, building a system that supports 100,000 reads per second is not complicated: expand cache nodes with consistent hashing, scale web servers horizontally, and so on. But to process 100,000 orders per second, the payment system needs several hundred thousand database update operations per second (inserts plus updates), which is an impossible task for any single database. So the first thing we had to do was shard the order table (order for short) across multiple databases and tables.
Database operations usually carry a user ID (uid for short) field, so we chose to shard both databases and tables by uid.
For the database-sharding strategy we chose "binary-tree sharding". "Binary-tree sharding" means that whenever we expand capacity, we always expand by a factor of 2: 1 database becomes 2, 2 become 4, 4 become 8, and so on. The advantage of this approach is that during expansion the DBA only needs to perform table-level data synchronization, rather than writing scripts for row-level data migration.
Sharding the databases alone is not enough. Continuous stress testing showed that, within the same database, concurrent updates spread across multiple tables are far more efficient than updates against a single table, so we further split the order table into 10 parts within each database: order_0, order_1, ..., order_9.
In the end, the order table lives in 8 shard databases (numbered 1 to 8, corresponding to DB1 to DB8), each holding 10 shard tables (numbered 0 to 9, corresponding to order_0 to order_9). The deployment structure is shown in the following figure:
Compute the database number from uid:
Database number = (uid / 10) % 8 + 1
Compute the table number from uid:
Table number = uid % 10
For uid = 9527, the algorithm effectively splits the uid into two parts, 952 and 7: 952 modulo 8 plus 1 gives the database number, and 7 is the table number. So the order information for uid = 9527 must be looked up in table order_7 of database DB1. The full algorithm flow is shown in the following figure:
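To make the routing concrete, here is a minimal Java sketch of the two formulas above (the class and method names are ours, for illustration only):

```java
// A minimal sketch of uid-based shard routing, assuming 8 databases
// (numbered 1-8) and 10 tables per database (numbered 0-9) as in the text.
public final class ShardRouter {

    private static final int DB_COUNT = 8;
    private static final int TABLE_COUNT = 10;

    // Database number = (uid / 10) % 8 + 1
    static int databaseNumber(long uid) {
        return (int) ((uid / TABLE_COUNT) % DB_COUNT) + 1;
    }

    // Table number = uid % 10
    static int tableNumber(long uid) {
        return (int) (uid % TABLE_COUNT);
    }

    public static void main(String[] args) {
        long uid = 9527L;
        // Prints "DB1, order_7", matching the worked example in the text.
        System.out.printf("DB%d, order_%d%n", databaseNumber(uid), tableNumber(uid));
    }
}
```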
With the sharding structure and algorithm settled, the last step is choosing a tool that implements the sharding. Roughly two kinds of tools exist on the market:
1. Client-side sharding: the sharding logic runs on the client, which connects directly to the databases.
2. Sharding middleware: the client connects to the middleware, which performs the sharding operations on its behalf.
Both kinds of tools are readily available, and we will not enumerate them here. Each has its pros and cons. Client-side sharding performs 15% to 20% better because it connects to the database directly; sharding middleware, by contrast, isolates the sharding logic from the client behind a unified layer, giving a cleaner module split and making unified management by the DBA easier.
We chose client-side sharding, because we had already developed and open-sourced a data-access-layer framework code-named "Mango". Mango natively supports sharding and is very simple to configure.
Mango home page: mango.jfaster.org
Mango source code: github.com/jfaster/mango
II. Order ID (uid dimension)
The order system's IDs must be globally unique. The simplest approach is a database sequence, which yields a globally unique auto-increment ID on every operation. But supporting 100,000 orders per second means generating at least 100,000 order IDs per second, and generating auto-increment IDs through the database clearly cannot meet that requirement. So we can only compute globally unique order IDs in memory.
The best-known unique ID in the Java world is probably the UUID, but a UUID is too long and contains letters, making it unsuitable as an order ID. After repeated comparison and screening, we implemented a globally unique ID by drawing on Twitter's Snowflake algorithm. Below is a simplified structure diagram of the order ID:
The above figure is divided into three parts:
Timestamp
The timestamp has millisecond granularity; System.currentTimeMillis() is used as the timestamp when generating an order ID.
Machine number
Each order server is assigned a unique number, which is used directly as the machine number when generating order IDs.
Auto-increment sequence number
When multiple requests to generate an order ID arrive on the same server within the same millisecond, the sequence number is incremented within that millisecond and restarts from 0 in the next millisecond. For example, if three order-ID requests hit the same server in the same millisecond, the sequence numbers of the three order IDs will be 0, 1, and 2 respectively.
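Here is a minimal sketch of such a three-part generator. The bit widths (41-bit timestamp, 10-bit machine number, 12-bit sequence) follow Twitter's original Snowflake layout and are an assumption; the article does not specify the exact widths used in production:

```java
// A sketch of the three-part order ID generator described above.
// Bit layout (an assumption, borrowed from the original Snowflake):
// [timestamp (ms)] [machine id, 10 bits] [sequence, 12 bits]
public final class OrderIdGenerator {

    private final long machineId;      // unique per order server, 0-1023
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public OrderIdGenerator(long machineId) {
        this.machineId = machineId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            // Same millisecond: increment the sequence (wraps at 4096).
            sequence = (sequence + 1) & 0xFFF;
            if (sequence == 0) {
                // Sequence exhausted within this millisecond; spin to the next.
                while (now <= lastTimestamp) {
                    now = System.currentTimeMillis();
                }
            }
        } else {
            // New millisecond: the sequence restarts from 0.
            sequence = 0;
        }
        lastTimestamp = now;
        return (now << 22) | (machineId << 12) | sequence;
    }
}
```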
Combining these three parts lets us generate globally unique order IDs quickly. But global uniqueness is not enough: in many cases we need to query order information directly by order ID alone. Without the uid we cannot tell which shard database to query, and traversing every table in every database is obviously unacceptable. So we need to embed the sharding information in the order ID. Below is the simplified structure diagram of an order ID carrying sharding information:
We prepend the sharding information to the generated global order ID, so the corresponding order can be located quickly from the order ID alone.
What does the sharding information contain? As discussed in part I, the order table is split by uid into 8 databases of 10 tables each. The simplest sharding information is a string of length 2: the first character stores the database number (range 1 to 8), and the second stores the table number (range 0 to 9).
Using the algorithm from part I to compute the database and table numbers from the uid: for uid = 9527, the database information is 1 and the table information is 7; concatenated, the two-character sharding information is "17". The algorithm flow is shown in the following figure:
Using the table number as the table information poses no problem, but using the database number as the database information hides a danger. Considering future expansion, suppose we need to grow from 8 databases to 16. Database information with values 1 to 8 cannot support the 1-to-16 scenario, and shard routing would no longer resolve correctly. We call this problem the loss of precision in the database information.
To solve the precision-loss problem, we must add redundant precision to the database information; that is, the database information we store now must already support future expansion. Here we assume we will eventually expand to 64 databases, so the new algorithm is:
Database information = (uid / 10) % 64 + 1
For uid = 9527 the new algorithm gives database information = 57. Note that 57 is not the real database number; it carries the redundant precision needed for the eventual expansion to 64 databases. We currently have only 8 databases, and the actual database number is computed by:
Actual database number = (database information - 1) % 8 + 1
For uid = 9527: database information = 57, actual database number = 1, and the full sharding information is "577".
Because we chose modulo 64 to store the precision-redundant database information, its length grows from 1 character to 2, and the total sharding information length becomes 3. The algorithm flow is shown in the following figure:
As the figure shows, computing the database information modulo 64 builds in redundant precision, so when the system later expands to 16, 32, or 64 databases there will be no problem.
Since the database information 57 in "577" does not point to a real database, (57 - 1) % 8 + 1 is applied to obtain the actual database number.
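Putting the precision-redundant formulas together, a sketch like the following (class and method names are ours, for illustration) reproduces the "577" example:

```java
// A sketch of the precision-redundant sharding prefix, assuming the final
// capacity of 64 databases and the current capacity of 8, as in the text.
public final class ShardPrefix {

    // Database information = (uid / 10) % 64 + 1
    static int databaseInfo(long uid) {
        return (int) ((uid / 10) % 64) + 1;
    }

    // Actual database number = (database information - 1) % 8 + 1
    static int actualDatabase(int databaseInfo, int currentDbCount) {
        return (databaseInfo - 1) % currentDbCount + 1;
    }

    public static void main(String[] args) {
        long uid = 9527L;
        int info = databaseInfo(uid);      // 57
        int table = (int) (uid % 10);      // 7
        int db = actualDatabase(info, 8);  // 1
        // Three-character sharding prefix "577"; a one-character version
        // number is later prepended in front of it (see below).
        System.out.println("prefix=" + info + table + ", actual DB" + db);
    }
}
```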
It can be understood like this:
1. The post-expansion sharding information is compatible with the pre-expansion sharding information. Under the 64-database design, uid 9527 lands in database 57, table 7; since at most 8 databases exist for now, it can only be placed among those 8, and after conversion it lands in database 1, table 7. Under the original 8-database design, 9527 also lands in database 1, table 7, so the two algorithms give consistent results.
2. After expansion, subsequent users are distributed evenly across the expanded databases and tables, while orders written before the expansion stay where they are; so a pre-expansion shard (such as database 8, table 9) holds more order data than a shard that only receives traffic after the expansion (such as database 64, table 9). In other words, the sharding information stored before the expansion remains valid afterwards; the redundancy is by design.
The order ID structure above satisfies our current and future expansion needs well, but given business uncertainty we also prepended a version number to the order ID to identify its structure. This version number is redundant data and is not used at present. Below is a simplified structure diagram of the final order ID:
Snowflake algorithm: github.com/twitter/snowflake
III. Eventual Consistency (bid dimension)
So far, by sharding the order table along the uid dimension, we have achieved ultra-high-concurrency writes and updates and can query orders by uid or by order ID. But as an open payment system for the group, we also need to query orders by business-line ID (also called merchant ID, bid for short). So we introduced an order-table cluster in the bid dimension, holding a redundant copy of the uid-dimension cluster's data. To query orders by bid, we only need to consult the bid-dimension cluster.
The scheme is simple, but keeping the two order-table clusters consistent is troublesome. The two clusters obviously sit in different database clusters, and introducing strongly consistent distributed transactions into writes and updates would drastically reduce system throughput and lengthen response times, which is unacceptable to us. So we introduced a message queue for asynchronous data synchronization, achieving eventual consistency. Of course, various failures in the message queue can still cause inconsistency, so we also introduced a real-time monitoring service that continuously computes the data difference between the two clusters and reconciles it.
The following is a simplified consistency synchronization diagram:
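A minimal sketch of the write path follows. MessageQueue and OrderDao are hypothetical interfaces standing in for the real middleware, which the article does not name:

```java
// The uid-dimension write is the source of truth; a message triggers the
// bid-dimension copy asynchronously. Interfaces are illustrative only.
interface MessageQueue { void send(String topic, String payload); }
interface OrderDao { void insert(String dimension, String orderJson); }

final class OrderWriter {
    private final OrderDao dao;
    private final MessageQueue mq;

    OrderWriter(OrderDao dao, MessageQueue mq) {
        this.dao = dao;
        this.mq = mq;
    }

    void createOrder(String orderJson) {
        // 1. Synchronous write to the uid-dimension cluster.
        dao.insert("uid", orderJson);
        // 2. Asynchronous replication to the bid-dimension cluster: a
        //    separate consumer performs that insert, and the monitoring
        //    service reconciles anything the queue loses.
        mq.send("order-sync", orderJson);
    }
}
```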
IV. Database High Availability
No machine or service can guarantee stable online operation without ever failing. For example, when a master database goes down at some moment, we can no longer read from or write to it, and online services are affected.
Database high availability means that when the database runs into problems for whatever reason, the database service can be restored and the data repaired in real time or very quickly, so that from the perspective of the whole cluster it is as if nothing had happened. Note that restoring the database service does not necessarily mean repairing the original database; it also includes switching the service over to a standby database.
The main work of database high availability is service recovery and data repair, and we usually measure the quality of high availability by how long these two tasks take. There is a vicious circle here: the longer recovery takes, the more inconsistent data accumulates; the more inconsistent data there is, the longer repair takes, and the longer the total repair time becomes. Fast database recovery is therefore the top priority. Imagine completing recovery within 1 second of a failure: the volume of inconsistent data, and the cost of repairing it, would shrink dramatically.
The following figure is the most classic master-slave structure:
In the figure above there is one web server and three databases, where DB1 is the master and DB2 and DB3 are slaves. We assume here that the web server is maintained by the project team, while the database servers are maintained by the DBA.
When slave DB2 has a problem, the DBA notifies the project team; the project team removes DB2 from the web service's configuration list and restarts the web server, so the faulty DB2 is no longer accessed and the database service is restored. When the DBA has repaired DB2, the project team adds it back.
When master DB1 has a problem, the DBA promotes DB2 to master and notifies the project team; the project team replaces DB1 with DB2 in the configuration and restarts the web server, so the web service uses the new master DB2 while DB1 is no longer accessed, and the database service is restored. When the DBA has repaired DB1, it can serve as a slave of DB2.
This classic structure has a serious drawback: whether the master or a slave fails, the DBA and the project team must cooperate to restore the database service. This is hard to automate, and recovery is too slow.
We believe database operations should be decoupled from the project team: when the database has a problem, the DBA alone should restore it in a unified way, without any action from the project team. This makes automation easier and shortens the service recovery time.
First, let's look at the high-availability structure diagram for the slave databases:
As the figure shows, the web server no longer connects directly to slaves DB2 and DB3; instead it connects to an LVS load balancer, which connects to the slaves. The advantage is that LVS automatically detects whether a slave is available: after DB2 goes down, LVS stops sending read requests to it. And when the DBA needs to add or remove slave nodes, only LVS needs to be operated; the project team no longer has to update configuration files and restart servers to cooperate.
Next, let's look at the high-availability structure diagram for the master database:
As the figure shows, the web server no longer connects directly to master DB1; instead it connects to a virtual IP provided by Keepalived, which maps to DB1, while a standby DB_bak replicates DB1's data in real time. Normally the web server still reads and writes DB1. When DB1 goes down, a script automatically promotes DB_bak to master and remaps the virtual IP to it, and the web service reads and writes the healthy DB_bak instead. The whole master recovery takes only a few seconds.
Combining the two structures above gives the complete master-slave high-availability diagram:
Database high availability also includes data repair. Because we write a log before updating core data, and because near-real-time recovery keeps the failure window short, the volume of data to repair is small; a simple repair script can complete the data repair quickly.
V. Data Classification
Besides the core payment order table and the payment flow (transaction journal) table, the payment system also has configuration tables and user-related tables. If all reads went to the database, system performance would drop sharply, so we introduced a data-classification mechanism.
We simply divide the data of the payment system into three levels:
Level 1: order data and payment flow data. These require high real-time accuracy, so no cache is added at all; reads and writes operate directly on the database.
Level 2: user-related data. This data is tied to users and is read far more often than it is written, so we cache it in Redis.
Level 3: payment configuration information. This data is unrelated to any user, is small in volume, is read frequently, and is almost never modified, so we cache it in local memory.
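A sketch of the three-level read path follows; RedisClient and Database are hypothetical interfaces used only to keep the example self-contained:

```java
import java.util.concurrent.ConcurrentHashMap;

interface RedisClient { String get(String key); void setex(String key, int ttlSeconds, String value); }
interface Database { String query(String sql); }

final class TieredReads {
    private final ConcurrentHashMap<String, String> localConfig = new ConcurrentHashMap<>();
    private final RedisClient redis;
    private final Database db;

    TieredReads(RedisClient redis, Database db) { this.redis = redis; this.db = db; }

    // Level 1: orders and payment flow records bypass all caches.
    String readOrder(long orderId) {
        return db.query("SELECT ... FROM order_x WHERE id = " + orderId);
    }

    // Level 2: user-related, read-mostly data cached in Redis.
    String readUser(long uid) {
        String cached = redis.get("user:" + uid);
        if (cached != null) return cached;
        String fresh = db.query("SELECT ... FROM user WHERE uid = " + uid);
        redis.setex("user:" + uid, 300, fresh);  // TTL is illustrative
        return fresh;
    }

    // Level 3: tiny, almost-never-modified configuration in local memory.
    String readConfig(String key) {
        return localConfig.computeIfAbsent(key,
                k -> db.query("SELECT ... FROM config WHERE k = '" + k + "'"));
    }
}
```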
Caching in local memory raises a data-synchronization problem: because the configuration is cached in memory, the local memory has no way of knowing when the configuration in the database is modified, leaving the database and the local memory inconsistent.
To solve this problem, we developed a highly available message push platform. When configuration is modified, the push platform pushes a configuration-update message to every server in the payment system; each server updates its configuration on receiving the message and reports success.
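A sketch of the push-driven refresh on each server; the callback name is hypothetical, since the push platform's actual API is not described:

```java
import java.util.concurrent.ConcurrentHashMap;

// Each payment server holds a local config cache and refreshes an entry
// when the push platform broadcasts an update, then acknowledges success.
final class ConfigRefresher {
    private final ConcurrentHashMap<String, String> localConfig;

    ConfigRefresher(ConcurrentHashMap<String, String> localConfig) {
        this.localConfig = localConfig;
    }

    // Invoked by the push platform when a configuration row changes.
    boolean onConfigUpdate(String key, String newValue) {
        localConfig.put(key, newValue);  // refresh the local copy
        return true;                     // positive feedback to the platform
    }
}
```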
VI. Thick and thin pipes
Front-end retries and other causes can make the number of requests soar. If a surge of requests kills our service, recovering it is a very painful and tedious process.
A simple example: our current order-processing capacity averages 100,000 orders per second, with a peak of 140,000 per second. If 1,000,000 order requests hit the payment system in the same second, the whole payment system would undoubtedly collapse, and the continuing stream of follow-up requests would prevent the service cluster from even starting up again. The only remedy would be to cut off all traffic, restart the entire cluster, and then slowly let traffic back in.
Adding a layer of "thick and thin pipes" in front of the external web servers solves this problem well.
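One common way to realize such a pipe is a token bucket that admits at most what the cluster can digest and sheds the excess immediately. The sketch below is illustrative, not the production implementation; only the capacity figures are taken from the text:

```java
// A simple token bucket in front of the web servers: a surge cannot pile
// up and kill the service, because excess requests fail fast instead of
// queuing. Capacity figures from the text: 140k peak, 100k average/sec.
final class TokenBucket {
    private final long capacity;       // burst ceiling, e.g. 140_000
    private final long refillPerSec;   // steady rate, e.g. 100_000
    private double tokens;
    private long lastRefillNanos = System.nanoTime();

    TokenBucket(long capacity, long refillPerSec) {
        this.capacity = capacity;
        this.refillPerSec = refillPerSec;
        this.tokens = capacity;
    }

    synchronized boolean tryAdmit() {
        long now = System.nanoTime();
        // Refill proportionally to elapsed time, capped at the burst ceiling.
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) / 1e9 * refillPerSec);
        lastRefillNanos = now;
        if (tokens >= 1) {
            tokens -= 1;               // admit: inside the thin pipe
            return true;
        }
        return false;                  // shed: reject fast with a "busy" response
    }
}
```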