Practical information | How do you ensure the high availability of a cloud database after moving to the cloud?

2025-07-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

A friend recently complained to me about the shared responsibility model. Since the system he is responsible for moved to the cloud, it has suffered several cloud database failures, and on investigation the failures turned out to be his own team's responsibility, which left him in a very passive position. More awkward still, his team had assumed that once they were on the cloud, all database problems would be the public cloud vendor's responsibility, so they never recruited a DBA into the operations team. With no good optimization ideas of their own, he asked me to work through the problem with him.

My friend's case is very typical. He assumed that everything would be fine on the cloud, and as soon as something went wrong after the migration, he concluded that the cloud is unreliable. Among public cloud vendors, the widely accepted view is the "shared responsibility model". Overseas, Amazon AWS and Microsoft Azure have both adopted a security strategy that shares risk with users. AWS, for example, as an IaaS + PaaS service provider, is responsible for the security of the cloud itself, while the security of business systems is the customer's responsibility. Customers can choose suitable security products in the AWS Marketplace to protect their content, platforms, applications, systems, and networks. Microsoft Azure similarly applies a responsibility-sharing model to its IaaS, PaaS, and SaaS users. We do not intend to discuss this issue in depth here. We introduce the concept only to establish a basic understanding: after moving to the cloud, customers and the platform still have to work together to achieve good results.

What did he experience after moving to the cloud?

The following are the faults my friend described, limited to a restatement of their causes; I have left some cases out. After listening to him, I was very surprised. I thought to myself: what does any of this have to do with moving to the cloud? These problems would have happened without the cloud as well. I can only say that he was lucky they happened during the migration, when people still have some tolerance for new things. Otherwise the consequences do not bear thinking about.

· Back-end modules were restarted in batches, and each had to load business data from the database on restart. Because the restarts were concurrent and the SQL was slow (dozens of seconds), the cloud database load rose rapidly and some requests began to time out. The failed modules then retried indefinitely, the database crashed under the excessive load, and every other business relying on it failed.
· A large number of concurrent requests hit the database in a short period, with concurrency reaching about 2200 at peak. This produced a large number of slow SQL statements, database performance dropped sharply, and several business pages became noticeably slower.
· Tasks created in batches all had exactly the same execution time, so the system requested a large amount of data from the database in an instant. The number of connections rose, and all businesses on the database failed.
· During a flash-sale ("second kill") activity, request volume surged, with peak traffic reaching 30 times the usual level and far exceeding the estimate. Some functions issued a large number of database requests and used up the database connections, so the database crashed and the system could not run normally.
· A purchased security scanning product sent requests with empty parameters to an interface, and the interface performed a full table scan of the database for each such request. Pressure on the database soared, slow SQL appeared one after another, and database CPU utilization stayed at 100%, causing all other services on the database to fail.
· A service accessed the database incorrectly, so the list of legitimate users it returned to downstream requests was empty. The downstream modules then deleted all user permissions, leaving the system completely unavailable.
· There was no dedicated DBA, and changes to the cloud database were made directly by R&D. R&D made faulty database modifications on several occasions, causing service failures and data loss.
· R&D suspected that database performance had deteriorated and restarted the database. During the restart, one module's database request failed and the module crashed outright.
· A business system deployed in South China connected to a database in North China, and its response time stayed high for a long time. The cause was that one page issued many database requests: the latency of a single request increased by only 40 ms, but dozens of requests were executed serially, so the total latency increased by more than 2 s.

Failure cause analysis

After analyzing the cases above with colleagues from the Jingdong Cloud Platform Quality Department, we identified the following causes:

Slow SQL

Even under normal conditions the system contains many slow SQL statements, with execution times ranging from about 15 s at the low end to more than 60 s. When such statements run more often, pressure on the cloud database inevitably rises: database connections stay occupied and other requests are processed more slowly, until the connections are exhausted and services start to fail. Alternatively, before the connections run out, services fail because database CPU utilization has reached 100%.
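To see why a modest number of slow statements can exhaust the connections, a rough estimate based on Little's law may help: the average number of busy connections is roughly the request rate multiplied by the time each query holds a connection. The sketch below is only an illustration; the 15 s duration comes from the text, while the request rate and the function names are assumptions.

```python
def connections_in_use(requests_per_second: float, seconds_per_query: float) -> float:
    """Little's law: average busy connections = arrival rate x holding time."""
    return requests_per_second * seconds_per_query


def max_sustainable_rate(connection_limit: int, seconds_per_query: float) -> float:
    """Highest request rate a connection limit can absorb for a given query time."""
    return connection_limit / seconds_per_query


# Hypothetical load of 20 requests per second.
fast_case = connections_in_use(20, 0.5)  # each query holds a connection for 0.5 s
slow_case = connections_in_use(20, 15)   # each query holds a connection for 15 s
print(fast_case, slow_case)
```

With the same request rate, moving from 0.5 s to 15 s per query multiplies the number of occupied connections by 30, which is why slow SQL tends to show up first as connection exhaustion.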

High-frequency SQL

High-frequency SQL may not look like a problem, but once latency increases or the network jitters, it can turn into slow SQL, and because it runs so often, the added load is enough to bring the system down.
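The cross-region case described earlier shows how a small per-request delay is amplified by request count. As a rough worked example, the sketch below multiplies the extra latency by the number of serial round trips; the 40 ms figure is from the text, while the count of 60 requests and the batch size of 20 are assumptions chosen only to match "dozens of requests" and "more than 2 s".

```python
import math


def added_latency_ms(extra_ms_per_round_trip: int, round_trips: int) -> int:
    """Extra page latency when round trips are executed one after another."""
    return extra_ms_per_round_trip * round_trips


def round_trips_when_batched(requests: int, batch_size: int) -> int:
    """Number of round trips left if requests are grouped into batches."""
    return math.ceil(requests / batch_size)


serial = added_latency_ms(40, 60)                                 # 60 serial requests
batched = added_latency_ms(40, round_trips_when_batched(60, 20))  # 3 batched round trips
print(serial, batched)
```

Under these assumed numbers, serial execution adds 2400 ms while batching into three round trips adds 120 ms, which suggests that reducing the number of round trips matters as much as reducing the latency of each one.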

Shared database

In several of the failures above, an exception in one business affected every business on the database. This probably came from a wish to reduce operational complexity: a single database instance of the largest specification was created for everything. From a management point of view, having all businesses share one database is indeed simpler.

No read-write separation

Most of the cases above were failures caused by read requests. When requests suddenly rise for whatever reason, a single database instance with no horizontal scaling easily goes down.

Unreasonable database connection settings

As the fault descriptions show, any single kind of request could push the number of concurrent database connections above 2000 and make other services unavailable. Connection resources were not allocated sensibly among the different services.
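One possible way to allocate connections per service is to split the instance's connection limit by weight and enforce each share on the client side. The following is a minimal sketch under stated assumptions: the service names, weights, the limit of 2000, and the reserved 100 connections are all hypothetical, and the classes are illustrative rather than part of any cloud database API.

```python
import threading
from typing import Dict


def allocate_connections(max_connections: int, weights: Dict[str, int],
                         reserve: int = 0) -> Dict[str, int]:
    """Split the usable connections among services in proportion to their weights."""
    usable = max_connections - reserve  # keep some connections for admin/emergency use
    total_weight = sum(weights.values())
    return {name: usable * weight // total_weight for name, weight in weights.items()}


class ConnectionBudget:
    """Client-side guard: a service cannot take more connections than its share."""

    def __init__(self, limits: Dict[str, int]) -> None:
        self._slots = {name: threading.BoundedSemaphore(n) for name, n in limits.items()}

    def try_acquire(self, service: str) -> bool:
        # Non-blocking: refuse immediately instead of queueing on the database.
        return self._slots[service].acquire(blocking=False)

    def release(self, service: str) -> None:
        self._slots[service].release()


# Hypothetical services and weights.
limits = allocate_connections(2000, {"orders": 5, "search": 3, "reports": 2}, reserve=100)
budget = ConnectionBudget(limits)
```

With a budget like this, a burst from one service is rejected at its own limit rather than consuming the connections other services depend on.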

Lack of a change process

R&D modifies data directly in the online database. Mistakes include a wrong table name, a wrong WHERE condition, or altering the structure of a large table. Without offline test verification and a database backup before the operation, a major accident is easy to cause.

Confused permission management

In several cases R&D operated directly on online data, which is a sign of confused permission management and also very dangerous. If everyone can modify the database, the consequences are easy to imagine. If someone modifies transaction-related data, or deletes the database and runs away, the trouble is serious.

No rate limiting

Several cases also show this problem: none of the interfaces is rate limited, so access of any magnitude can be initiated, and a batch of requests from any single user is enough to bring down the system.

Suggestions of the Cloud Platform Quality Department

Considering this friend's situation, and after discussion, colleagues from the Cloud Platform Quality Department gave the following suggestions for improving the database. More common problems, such as crashing outright after a system exception or empty parameters, are not discussed here; a separate article will cover them.

TOP-N SQL rate limiting

TOP-N SQL falls into two situations.

Slow SQL, i.e. the TOP-N by execution time:
· Optimize the SQL
· Set the number of database connections reasonably
· Directly kill SQL that takes more than 1 s to execute (some scenarios can be exempted, such as step tasks, write SQL, and SQL of high importance)
· Urgently ban accounts that produce many problem SQL statements

High-frequency SQL, i.e. the TOP-N by execution frequency:
· Reduce high-frequency SQL through caching in the business layer
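Business-layer caching for high-frequency SQL can be as simple as keeping each result for a short time and reusing it. The sketch below is one possible shape, not a prescribed implementation: the class name, the injected clock, and the 5 second lifetime are assumptions, and a production cache would also need size limits and invalidation on writes.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny time-to-live cache: reuse a query result until it expires."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh: no database round trip
        value = loader()             # expired or missing: run the real query
        self._store[key] = (now + self._ttl, value)
        return value


# Hypothetical usage: the lambda stands in for a real database call.
cache = TTLCache(ttl_seconds=5)
user_list = cache.get_or_load("active_users", lambda: ["alice", "bob"])
```

Even a lifetime of a few seconds can remove most of the load from a statement that runs hundreds of times per second, at the cost of slightly stale reads.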

Jingdong Cloud provides a performance optimization function that can query all slow SQL. Be sure to use it.

Finally, find a way to kill slow SQL automatically on the cluster. Do not wait until a problem occurs and then look for someone to see whether these statements can be killed; by then it is too late. In our experience, once things reach that point, the outage lasts at least 40 minutes.
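The article does not say which database engine is used, so the following sketch assumes a MySQL-compatible instance, where running statements are listed in information_schema.PROCESSLIST (columns such as ID, USER, COMMAND, TIME, INFO) and can be stopped with KILL QUERY. The function names, the protected account, and the rule of killing only SELECT statements are assumptions that mirror the exemptions mentioned above (write SQL and important tasks are left alone). The sketch only decides what to kill; fetching the rows and executing the statements are left to whatever scheduler runs it.

```python
from typing import Any, Dict, FrozenSet, Iterable, List


def pick_queries_to_kill(processlist: Iterable[Dict[str, Any]],
                         max_seconds: int = 1,
                         protected_users: FrozenSet[str] = frozenset()) -> List[int]:
    """Return the IDs of long-running read queries that are safe to kill."""
    victims = []
    for row in processlist:
        if row["COMMAND"] != "Query":
            continue                              # idle or non-query threads
        if row["USER"] in protected_users:
            continue                              # exempt accounts, e.g. batch jobs
        sql = str(row["INFO"] or "").lstrip().upper()
        if not sql.startswith("SELECT"):
            continue                              # never kill writes automatically
        if row["TIME"] > max_seconds:             # TIME is in seconds
            victims.append(int(row["ID"]))
    return victims


def kill_statements(ids: Iterable[int]) -> List[str]:
    """Build the KILL QUERY statements to send to the database."""
    return [f"KILL QUERY {int(i)}" for i in ids]


# Hypothetical snapshot of the process list.
snapshot = [
    {"ID": 11, "USER": "web", "COMMAND": "Query", "TIME": 42, "INFO": "SELECT * FROM orders"},
    {"ID": 12, "USER": "web", "COMMAND": "Query", "TIME": 30, "INFO": "UPDATE orders SET s = 1"},
    {"ID": 13, "USER": "batch", "COMMAND": "Query", "TIME": 90, "INFO": "SELECT * FROM report"},
    {"ID": 14, "USER": "web", "COMMAND": "Sleep", "TIME": 300, "INFO": None},
]
to_kill = pick_queries_to_kill(snapshot, max_seconds=1, protected_users=frozenset({"batch"}))
```

Keeping the decision logic separate from execution makes it possible to run the job in a dry-run mode first and review what it would have killed.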

Isolated deployment

Core services must be deployed in isolation on separate database instances; shared instances should be considered only for non-core services. This prevents a problem in a single business from affecting all services. Isolation need not follow business lines alone. Depending on the situation, it can be applied along other dimensions too, such as separating reporting workloads from core business. Business operations has many isolation methods based on similar ideas; see "How to Improve the Availability of a Task Scheduling System through Isolation". In terms of cost, Jingdong Cloud has taken this into account: two small instances cost the same as one large instance, so splitting does not increase the cost, and the added management overhead is also very low.

Read-write separation

Jingdong Cloud's cloud database provides read-only instances, and this feature should be put to good use. The simple approach is to add several read-only instances and move read requests to them. The more sophisticated approach is to assign read requests of different business types to different read-only instances, using isolation to confine failures to a small scope and keep most functions working normally.
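Routing reads by business type could look roughly like the sketch below. It is an illustration only: the endpoint names, business names, and the rule that anything starting with SELECT counts as a read are assumptions, and real routing would also need to handle transactions, SELECT ... FOR UPDATE, and replication lag.

```python
import itertools
from typing import Dict, Iterator, List


class ReadWriteRouter:
    """Send writes to the primary and reads to a per-business pool of read-only instances."""

    def __init__(self, primary: str,
                 replicas_by_business: Dict[str, List[str]],
                 default_replicas: List[str]) -> None:
        self._primary = primary
        # One round-robin iterator per pool, so load spreads evenly within a pool.
        self._pools: Dict[str, Iterator[str]] = {
            name: itertools.cycle(endpoints)
            for name, endpoints in replicas_by_business.items()
        }
        self._default = itertools.cycle(default_replicas)

    @staticmethod
    def is_read(sql: str) -> bool:
        return sql.lstrip().upper().startswith("SELECT")

    def route(self, business: str, sql: str) -> str:
        if not self.is_read(sql):
            return self._primary
        return next(self._pools.get(business, self._default))


# Hypothetical endpoints.
router = ReadWriteRouter(
    primary="db-primary",
    replicas_by_business={"orders": ["ro-orders-1", "ro-orders-2"], "reports": ["ro-reports-1"]},
    default_replicas=["ro-shared-1"],
)
```

Because heavy report queries only ever reach their own read-only instance, a runaway report slows down reporting but leaves order reads untouched.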

Rate limiting

Rate limiting should not rely only on connection limits at the database level; it also needs to be applied earlier, on the service side. A service-side rate limiting mechanism is more flexible and can be customized to better meet the needs of the service. For how to limit traffic, see "Three Axes of Contingency Plans: The Rate Limiting Method".

Data backup

The database must be backed up before any modification or adjustment, to avoid the problems mentioned above. Jingdong Cloud provides flexible database backup management functions, which should be used well. The importance of this point cannot be overstated.
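The habit of backing up first and checking the blast radius before committing can be shown with a small local example. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for the cloud database; on a managed service the snapshot step would be the platform's own backup function. The function name, table, and row limit are hypothetical.

```python
import sqlite3
from typing import Sequence


def backup_then_apply(conn: sqlite3.Connection, backup_path: str,
                      sql: str, params: Sequence, max_rows: int) -> int:
    """Snapshot the database, run one change, and roll back if it touches too many rows."""
    # Step 1: take a backup before touching anything.
    target = sqlite3.connect(backup_path)
    try:
        conn.backup(target)
    finally:
        target.close()

    # Step 2: apply the change inside a transaction and check how many rows it hit.
    cursor = conn.execute(sql, params)
    if cursor.rowcount > max_rows:
        conn.rollback()  # e.g. a wrong WHERE condition matched far more than expected
        raise RuntimeError(
            f"change affected {cursor.rowcount} rows, expected at most {max_rows}"
        )
    conn.commit()
    return cursor.rowcount
```

A call such as backup_then_apply(conn, "before_change.db", "DELETE FROM users WHERE id = ?", (1,), max_rows=1) commits only if a single row is affected; a mistyped condition that matches every row is rolled back, and the snapshot remains available either way.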

Database monitoring

Before moving to the cloud, there may have been a dedicated DBA team to monitor the database. After moving to the cloud, if there is no full-time DBA, the business operations team needs to take on this responsibility. The following key indicators are taken from Jingdong Cloud's monitoring.

[Figure: key database indicators from Jingdong Cloud's monitoring]

Of course, functional monitoring of the database is also required. The Cloud Platform Quality Department has rich experience in this area; see also "Monitoring Is Not in Place, Downtime Brings Two Lines of Tears".
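For a team without a DBA, even a simple threshold check over a few indicators gives early warning. The sketch below is only an illustration: the metric names and thresholds are assumptions loosely based on the symptoms described in this article (CPU utilization, connection usage, slow SQL), not the actual indicators shown by Jingdong Cloud's monitoring.

```python
from typing import Dict, List


def breached_metrics(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the names of metrics whose current value is above its threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if metrics.get(name, 0.0) > limit
    )


# Hypothetical thresholds and one hypothetical sample.
thresholds = {"cpu_percent": 80.0, "connection_usage_percent": 70.0, "slow_queries_per_minute": 10.0}
sample = {"cpu_percent": 95.0, "connection_usage_percent": 40.0, "slow_queries_per_minute": 25.0}
alerts = breached_metrics(sample, thresholds)
```

Alerting well below the hard limits (for example at 70% connection usage rather than 100%) leaves time to react before services start failing.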

Process establishment

For change and permission management, the relevant processes need to be established step by step and automated as far as possible. For frequently performed operations, you can also provide operation manuals, checklists, and the like to minimize manual operations.

Three axes

It is my personal habit, for any problem with multiple solutions, to finish by prioritizing them into three axes so that everyone can grasp the key points:
· Isolated deployment and read-write separation: these can be done quickly using Jingdong Cloud's capabilities, so they come first.
· TOP-N SQL: easy to find, but optimization requires R&D cooperation, so it comes second. You can start with the SQL statements that take dozens of seconds to execute.
· Rate limiting: whether at the access layer or in the core modules, it takes somewhat longer to develop, so it comes third.
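Service-side rate limiting is often built on a token bucket, which allows short bursts while capping the sustained rate. The article does not name a specific algorithm, so the sketch below is one common choice rather than the author's method; the class name, rate, and capacity are assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: refill at a steady rate, spend one token per request."""

    def __init__(self, rate_per_second: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_second
        self._capacity = capacity
        self._tokens = capacity      # start full so an initial burst is allowed
        self._clock = clock          # injectable so the refill logic can be tested
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill for the time elapsed since the last call, up to the capacity.
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False                 # caller should reject or queue the request


# Hypothetical limit: 100 requests per second with bursts of up to 200.
limiter = TokenBucket(rate_per_second=100, capacity=200)
```

A request handler would call limiter.allow() before touching the database and return a "too many requests" response when it is refused, so excess traffic never reaches the connection pool.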

Finally, thanks to the colleagues in the Platform Quality Department for their concerted efforts in completing the plan above.

References:

How to Improve the Availability of a Task Scheduling System through Isolation https://www.infoq.cn/article/vsb2jGCAgXPPdqPS38S6
Three Axes of Contingency Plans: The Rate Limiting Method https://www.infoq.cn/article/L1FThcLIgzHSYlIaDk0R
Monitoring Is Not in Place, Downtime Brings Two Lines of Tears https://www.infoq.cn/article/txmNQW_d7Hpi8KyXf4wz
Shared responsibility model https://aws.amazon.com/cn/compliance/shared-responsibility-model/
