This article is based on Shen Jian's live talk at the 10th China System Architect Conference (SACC2018) on October 18, 2018.
Lecturer introduction:
Shen Jian, CTO of Fast Dog Taxi, is an Internet architecture expert and the author of the "Architect's Road" official account. He was previously a senior engineer at Baidu, then a senior architect at 58.com and chairman of the 58.com Technical Committee. In 2015 he transferred within 58 to serve as Senior Director and Chairman of the Technical Committee, responsible for back-end technical systems such as infrastructure, the technology platform, operations and security, and information systems. As CTO of Fast Dog Taxi he is now responsible for building its technical systems; at heart, he remains a technical person.
Summary of this article:
Shen Jian shares Fast Dog Taxi's consistency practices for its ride-hailing database architecture, which trace the evolution of that architecture: from a single database, to multiple databases, to high availability, and beyond. What problems came up at each stage, and what technical means did Fast Dog Taxi use to solve them? Here he shares some of Fast Dog Taxi's practices.
Sharing outline:
Master-slave inconsistency and optimization practice
Cache inconsistency and optimization practice
Redundant-data inconsistency and optimization practice
Multi-database transaction inconsistency and optimization practice
Summary
Text of the speech:
Fast Dog Taxi (formerly 58 Express) is a start-up company, and the evolution of its technical architecture, technical systems, and database architecture is very similar to that of many companies here. Today I will talk about some of the consistency problems we have encountered in Fast Dog Taxi's database architecture.
The process of optimizing these inconsistencies is also the process by which the database architecture evolved.
The main line of this talk is how our database architecture changed. Along that line I have listed four consistency-related stages: master-slave inconsistency, cache inconsistency, redundant-data inconsistency, and multi-database, multi-instance transaction inconsistency. From a single database to where we are today, what pitfalls were waiting for us?
Let's start with the original database architecture. At first it looked like this: there was no microservice layering yet; the web layer accessed a single database through a DAO, which is what I built at the very beginning. A single database has no high-availability or high-concurrency properties, and its scalability is poor. I am sure many startups begin the same way.
What bottleneck does a single database hit first? As the business grows, the data volume increases, concurrency increases, and the business logic gets more complex. Where does the first bottleneck of the whole system appear? In my experience, the database. And where inside the database? In my experience, reads. The vast majority of businesses read far more than they write, so reads are the most likely to become the system's bottleneck.
What is the first optimization that comes to mind when the database can no longer cope? Internet companies move fast: if there is a problem today, can you fix it by tomorrow or the day after? So what is the first solution that comes to mind for quickly scaling out the database's read performance?
Add two instances: master-slave synchronization with read-write separation. For a start-up company, when database reads become the bottleneck, this is the first solution that comes to mind for quickly scaling read performance. And what problem does master-slave synchronization bring? That is the first topic of this talk: master-slave consistency.
As the data volume and throughput grow, writes go to the master, the master synchronizes to the slaves, and that synchronization has a delay. During the delay window, a read that read-write separation routes to a slave may return stale data. I believe you have run into this as well.
Many businesses simply tolerate this problem, if their consistency requirements are not that strict. But is there an optimization?
These two diagrams show two of our common practices.
The first is middleware. The service layer or site layer does not access the database directly; it goes through an intermediate layer that proxies the database. That middleware knows which library, which table, and which key was just written. If a read request for that key arrives within the following window (say the master-slave synchronization completes within one second), reading a slave would return stale data, so the middleware routes the read to the master instead and returns fresh data.
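A minimal sketch of this routing idea, in Python, assuming a fixed one-second synchronization window and simple in-memory bookkeeping of recently written keys (the names RoutingMiddleware, SYNC_WINDOW, and the set/get handles are illustrative, not from the talk):

```python
import time

SYNC_WINDOW = 1.0  # assumed: master-slave sync completes within one second

class RoutingMiddleware:
    """Routes reads to the master for keys written within the sync window."""

    def __init__(self, master, replica):
        self.master = master
        self.replica = replica
        self.recent_writes = {}  # key -> timestamp of last write

    def write(self, key, value):
        self.master.set(key, value)            # all writes go to the master
        self.recent_writes[key] = time.time()  # remember when the key was written

    def read(self, key):
        written_at = self.recent_writes.get(key)
        if written_at is not None and time.time() - written_at < SYNC_WINDOW:
            return self.master.get(key)  # replica may still be stale: read master
        return self.replica.get(key)     # outside the window: the replica is safe
```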
The second is forcing reads to the master. In the second diagram, dual-master synchronization: what are the benefits of forced master reads? First, high availability is solved: the two masters share one VIP, so if one master fails, the other can take over at any time. Second, it avoids master-slave inconsistency altogether.
What new problem does it bring? Consistency is solved, but the read-scaling problem is back: the master alone absorbs both reads and writes, so read capacity still cannot be expanded.
Besides adding slave libraries, Internet companies have another common way to improve read performance: caching plus servicization. Abstract a service layer that shields callers from the complexity of the underlying database, from its high-availability mechanisms, and from the cache, and exposes a clean service to the business layer.
Servicization plus caching is indeed an architectural way to improve a system's read capacity. What new problem does improving reads through a cache introduce? With a master-slave architecture comes master-slave inconsistency; with a cache architecture comes, of course, cache inconsistency. Whenever the same data lives in multiple places and the changes to those places happen at different times, inconsistent reads are possible.
What can we do when the data in the database and in the cache become inconsistent?
First, let's look at why the inconsistency arises. The common way to use a cache is the Cache Aside Pattern. How does the cache-aside pattern work? Its key conclusion is that, on a write, you evict the cache rather than update it.
What is the read and write sequence? For read requests there is no dispute: read the cache first; on a hit, return the value directly; on a miss, read a slave (under read-write separation), put the value into the cache, and return it. That is the read path.
For write requests, the cache-aside approach is to write the database first and then evict the cache. Under what circumstances does inconsistency occur? Under high concurrency, when a write to a key is immediately followed by a read of the same key. The write updates the master database and evicts the cache; the read arrives before master-slave synchronization has completed, misses the cache (it was just evicted), reads stale data from a slave, and puts that stale data into the cache. Inconsistency results.
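Here is a minimal cache-aside sketch in Python against Redis; the database handles and their query/update methods are illustrative stand-ins, not a real driver API:

```python
import redis

cache = redis.Redis()
CACHE_TTL = 600  # an expiration gives the cache a chance to self-correct

def read(key, replica_db):
    value = cache.get(key)
    if value is not None:
        return value                        # cache hit
    value = replica_db.query(key)           # cache miss: read a slave
    if value is not None:
        cache.setex(key, CACHE_TTL, value)  # stale if the slave lags the master!
    return value

def write(key, value, master_db):
    master_db.update(key, value)            # 1. write the database (master)
    cache.delete(key)                       # 2. evict the cache, don't update it
```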
Under high concurrency, a read that immediately follows a write can easily load dirty data into the cache.
Notice that this inconsistency is more serious than master-slave inconsistency. Master-slave inconsistency lasts only for the synchronization time difference; once synchronization completes, slaves serve fresh data. But inconsistency between the cache and the database persists: once dirty data enters the cache, it keeps being served until the next write evicts it. So it really is more serious.
How do we solve it? For cache-database inconsistency we have two practices: evict the cache asynchronously, after confirming the slave has caught up; and set an expiration time, so there is a chance to correct the inconsistency.
First, wait until the slave has fully synchronized, then evict the cache asynchronously. By subscribing to the slave's binlog, we know when the write has been applied on the slave; evicting the cache at that moment avoids the lag window.
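A sketch of this listener, assuming the python-mysql-replication package and log_slave_updates enabled on the slave so its binlog reflects applied writes; the talk names no specific tool, and the cache key scheme here is illustrative:

```python
import redis
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import UpdateRowsEvent, WriteRowsEvent

cache = redis.Redis()
SLAVE = {"host": "slave-db", "port": 3306, "user": "repl", "passwd": "secret"}

# Tail the slave's binlog: an event here means the write is applied on the slave.
stream = BinLogStreamReader(
    connection_settings=SLAVE,
    server_id=100,  # any id unique within the replication topology
    blocking=True,
    only_events=[WriteRowsEvent, UpdateRowsEvent],
)

for event in stream:
    for row in event.rows:
        values = row.get("after_values") or row.get("values") or {}
        pk = values.get("id")
        if pk is not None:
            cache.delete(f"order:{pk}")  # safe now: the slave has the new row
```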
Second, for data where a cache miss is acceptable, never set a permanent expiration. With an infinitely long expiration there is no chance to correct an inconsistency; with a timeout, dirty data is at worst served until the entry expires.
As the business develops, beyond the traffic growth that pushes us to improve the system's read performance and the database's availability, one more thing grows: the data volume. When the business data gets larger and larger, how do we usually handle it? For start-up companies, the following two schemes should be the most widely used.
The first is sharding into multiple libraries. Make each library smaller and reduce the data volume per instance, so the system as a whole can carry more data. What new problem does sharding bring? Take an order library as an example: it is queried along multiple dimensions, by order ID, by user ID, and by driver ID, and with one library none of this is a problem.
But after splitting into multiple libraries, once you shard along one dimension, queries along the other dimensions span multiple libraries, don't they?
Generally we shard by user ID and fold the shard factor into the order ID, so the relevant data can be located by either the user ID or the order ID. The driver ID is different: drivers and users are many-to-many. One user may place orders with multiple drivers, and one driver takes orders from multiple users, so a query by driver ID cannot fetch all the data at once; the same driver's orders are necessarily spread across multiple libraries. What to do? The most common solution here is data redundancy, as sketched below.
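First, a minimal sketch of folding the user's shard factor into the order ID at generation time, so that user-ID and order-ID queries route to the same library (the bit widths and function names are illustrative):

```python
NUM_SHARDS = 16  # illustrative; a power of two so the low bits are the factor

def shard_of_user(user_id: int) -> int:
    return user_id % NUM_SHARDS

def make_order_id(sequence: int, user_id: int) -> int:
    # Embed the user's shard factor in the low 4 bits of the order id.
    return (sequence << 4) | shard_of_user(user_id)

def shard_of_order(order_id: int) -> int:
    return order_id & (NUM_SHARDS - 1)  # recover the factor without any lookup
```

With this, a query by user_id or by order_id lands on exactly one library; only the driver-ID dimension still fans out, which the redundant table addresses.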
We keep two copies: one stores the forward data, sharded by user ID, guaranteeing that all of a user's orders are in the same library; the other stores the relational (redundant) data, sharded by driver ID, guaranteeing that all of a driver's orders are in the same library. Because the same data is queried along two dimensions, redundancy satisfies both without fanning out across libraries. This is a very common scheme in the industry.
What problem does redundant data bring? Look at the picture: the application on top, the service in the middle, and one piece of data stored in two libraries, one sharded by user ID and the other by driver ID. The caller writes the data into the first library and then writes the redundant copy into the other one. Can we guarantee the two copies stay consistent? There is no guarantee that both writes succeed together, so what do we do?
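The failure window is easy to see in a two-line sketch (the shard handles are illustrative):

```python
def create_order(order, user_shard, driver_shard):
    user_shard.insert(order)    # forward copy, sharded by user_id
    # a crash or network error here leaves the two copies inconsistent
    driver_shard.insert(order)  # redundant copy, sharded by driver_id
```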
This is the consistency problem of redundant data. Here are three methods we use to optimize redundant-data inconsistency; the underlying methodology of all three is eventual consistency.
The first scheme is a full scan. How do we even discover that redundant data is inconsistent? Write a script that runs every night. In theory, everything in library A should also be in library B; if the scan finds a record in A missing from B, compensate according to business rules, either re-inserting the missing half or deleting the orphaned half. The details depend on the business, but the idea is the same: an asynchronous job that restores consistency.
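A sketch of such a nightly job, with illustrative shard handles and an unspecified, business-dependent repair step:

```python
def full_scan(user_shards, driver_shards):
    # Every order in the user-sharded (forward) tables should also exist
    # in the driver-sharded (redundant) tables.
    for shard in user_shards:
        for order in shard.query("SELECT id, driver_id FROM orders"):
            target = driver_shards[order.driver_id % len(driver_shards)]
            if not target.exists("orders", order.id):
                repair(order)

def repair(order):
    # Business-specific compensation: replay the missing half of the write,
    # or delete the orphaned half.
    ...
```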
The second scheme is an incremental scan. The service still operates on the two libraries, but writes a log for each operation: one log for the first library, one for the second. The logs contain only the data that changed that day, so instead of scanning everything we scan only the daily changes. Wherever the two logs do not match, we repair asynchronously, guaranteeing eventual consistency.
The third way is more real-time than the previous two. Instead of writing logs, we send messages. With a messaging component, when the forward-table operation succeeds we send one message, and when the redundant-table operation succeeds we send another. An asynchronous service listens for both; if only one message arrives, it checks the databases for consistency and compensates asynchronously.
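A sketch of the pairing service, with an in-memory pending table, an assumed timeout, and a business-dependent repair step (all names illustrative):

```python
import time

PAIR_TIMEOUT = 5.0  # assumed: how long to wait for the partner message
pending = {}        # order_id -> (source, arrival_time)

def on_message(order_id, source):  # source is "forward" or "redundant"
    if pending.pop(order_id, None) is None:
        pending[order_id] = (source, time.time())  # wait for the partner
    # else: the pair matched, both writes succeeded, nothing to do

def sweep():
    # Run periodically: any message whose partner never arrived triggers
    # an asynchronous consistency check against both libraries.
    now = time.time()
    for order_id, (_, arrived) in list(pending.items()):
        if now - arrived > PAIR_TIMEOUT:
            del pending[order_id]
            check_and_repair(order_id)

def check_and_repair(order_id):
    ...  # read both libraries, then compensate the one that is missing data
```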
Finally, there is the multi-instance, multi-database case, another common solution for large data volumes. What inconsistency does it bring? Here is a case: placing an order may modify three pieces of data. One is the balance: we may deduct some balance. One is the order: we add a new order. One is the flow record: we add a new flow entry. A single-database transaction used to guarantee consistency, but now the data volume is large and these have become multiple databases: the balance on one instance, the orders on one instance, the flow records on one instance. The original single transaction, in the multi-database setting, becomes three transactions.
Multi-instance, multi-database transactions can be inconsistent. What to do? We have two optimization practices here.
The first is compensating transactions, which should also be a familiar practice in the industry.
For the balance, the forward operation deducts the balance; the compensating transaction adds it back.
For the order, the forward operation inserts the order; the compensating transaction deletes it.
For the flow record, the forward operation inserts the flow entry; the compensating transaction deletes it.
In short, compensating transactions work at the application layer: when a later step fails, you roll back the steps that already succeeded by executing their compensations, as sketched below.
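A minimal application-layer sketch under the assumption that each step comes paired with its compensation; the step functions here are illustrative stubs, not from the talk:

```python
# Illustrative stand-ins for the three libraries' forward and compensating ops.
def deduct_balance():   print("balance: -100")
def add_balance_back(): print("balance: +100")
def insert_order():     print("order: inserted")
def delete_order():     print("order: deleted")
def insert_flow():      print("flow: inserted")
def delete_flow():      print("flow: deleted")

def run_with_compensation(steps):
    # steps: list of (forward, compensate) callables, executed in order.
    done = []
    for forward, compensate in steps:
        try:
            forward()
            done.append(compensate)
        except Exception:
            for comp in reversed(done):  # undo what already succeeded
                comp()
            raise

run_with_compensation([
    (deduct_balance, add_balance_back),  # balance library
    (insert_order,   delete_order),      # order library
    (insert_flow,    delete_flow),       # flow-record library
])
```

Note that the compensations themselves can fail, so this shrinks the inconsistency window rather than eliminating it.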
The other approach, a pseudo-distributed-transaction solution, is post commit.
Let's look in detail at how the three transactions execute. Normally the first transaction executes and commits, the second executes and commits, and the third executes and commits. Executing a transaction is slow; committing is fast. In the example above, execution may take 200 milliseconds while a commit takes a few milliseconds. When does inconsistency occur? Any exception between the first transaction's successful commit and the last transaction's successful commit leaves the data inconsistent.
The optimization is very simple: commit afterwards. The first transaction executes, the second executes, the third executes; then the first commits, the second commits, the third commits. When does inconsistency occur? Still in the interval between the first successful commit and the last one: a network exception or server crash there causes inconsistency. But that interval is now only a couple of milliseconds, so the overall probability of inconsistency drops by roughly a hundred times.
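A sketch of the rearrangement with three DB-API connections, autocommit off; the table and column names are illustrative:

```python
def place_order_post_commit(balance_db, order_db, flow_db, user_id, order):
    # Slow part first: execute all three transactions without committing.
    balance_db.cursor().execute(
        "UPDATE balance SET amount = amount - %s WHERE user_id = %s",
        (order.price, user_id))
    order_db.cursor().execute(
        "INSERT INTO orders (id, user_id) VALUES (%s, %s)",
        (order.id, user_id))
    flow_db.cursor().execute(
        "INSERT INTO flows (order_id, amount) VALUES (%s, %s)",
        (order.id, order.price))

    # Fast part last: commits take milliseconds, so the window in which a
    # crash leaves the three libraries inconsistent shrinks accordingly.
    balance_db.commit()
    order_db.commit()
    flow_db.commit()
```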
Finally, a brief summary. In my experience, people remember only about 10% of a 40- or 50-minute technical talk the next day. If you are only going to remember 10%, I hope it is the contents of this page, with the logic clear.
The database architecture starts as a single library. What problem does a single library run into? A read-performance bottleneck. What is the earliest way to solve it? Master-slave synchronization with read-write separation. What problem does that bring? Master-slave inconsistency. What are the solutions? Our practices are routing middleware and forced master reads.
Servicization plus caching is also a common way to improve read performance. What new problem does it bring? Inconsistency between the cache and the database. Under the Cache Aside Pattern, a read immediately after a write can cache stale data. Our practice is asynchronous eviction: evict the cache only once the write has actually been applied on the slave. We also recommend setting a timeout on all data for which a cache miss is acceptable.
How does the database architecture handle large data volumes? The common solutions are sharding and multiple instances. What new problem does sharding bring? Remember my example: after sharding you can guarantee that the same user's data is in one library, but not that the same driver's data is too. How do we solve that? Data redundancy. What problem does redundancy bring? Redundant-data inconsistency, and the direction for solving it is eventual consistency. How do we ensure eventual consistency? Full scan, incremental scan, and real-time message pairing. Besides sharding, multiple instances also expand storage capacity; what problem arises there? Multi-database transactions cannot guarantee atomicity. Compensating transactions and post commit are our optimization practices.
That is all for today. I hope you have gained something. Thank you.