Handling complex faults gracefully in a distributed database is not obvious, and many newcomers are unsure where to start. This article summarizes the common causes of such faults and the mechanisms databases use to handle them, in the hope that it helps you deal with them in practice.
Introduction to faults
ACID names the four properties of transactions; the D stands for Durability. A large part of a database's value is that it handles failures effectively and guarantees that data is not lost. As distributed databases have developed, deployments have become more complex and databases face more and more failure scenarios.
Common hardware failures
Next, consider the failure rates of components in a typical data center, as summarized in "Designs, Lessons and Advice from Building Large Distributed Systems" by Jeff Dean.
Network failures
Beyond the hardware failures above, distributed system design also has to consider several kinds of network failure.
Split brain: as the name implies, a network failure partitions the system into multiple regions that keep running independently of one another.
Partial failure of one network plane when the system uses several network planes. This failure is rare in practice, because network planes are usually logical rather than bound to a particular network card, but an operator configuration error can cause it. A system that spans multiple network planes has to take it into account.
Fragile data centers
In fact, data centers are not as stable as one might expect. The chart below shows CloudHarmony monitoring data captured by the author on November 22, 2017, covering the reliability of more than 300 data centers.
As you can see, even the well-known Azure fell short of its claimed 99.95% availability; CloudHarmony publishes further details.
Beyond the statistics, here are a few concrete incidents.
September 29, 2017: Azure North Europe data center failure
In part of the North Europe data center, fire suppressant was accidentally released during routine maintenance of the fire suppression system. This caused the Air Handling Units (AHUs), which are designed for containment and safety, to shut down automatically. Some systems in the affected area shut down or restarted to prevent overheating. The AHUs were recovered manually after 35 minutes, but because some data had to be restored after the abrupt shutdown, the service did not return to normal for about 7 hours. The incident made storage services unavailable for some customers.
February 28, 2017: Amazon S3 outage
While investigating a slowdown in the billing system, an operations engineer intended to remove a small number of servers, but a typo in the command removed a much larger set, including servers of the index subsystem and the placement subsystem. As a result, S3 was unavailable from 9:37 AM to 1:54 PM. Most interestingly, the AWS Service Health Dashboard itself depends on S3, so the status page did not show the failure until 11:37 AM. The outage was said to have taken down half of the Internet outside China.
April 13, 2016: Google Compute Engine outage
Google Compute Engine lost service in every region and recovered after 18 minutes. The failure was triggered when an operations engineer deleted an unused IP block, but the deletion was not properly propagated by configuration synchronization, which triggered a consistency check in the network configuration system; on detecting the inconsistency, the system restarted itself, interrupting service.
May 27, 2015: Hangzhou Telecom cut Alibaba's network fiber
After the fiber was cut, some users could not access the service; it was restored about two hours later.
July 1, 2014: failure of Ningxia Bank's core database system
A regulatory notice ((2014) No. 187) describes the Ningxia Bank incident roughly as follows. On July 1, 2014, the database of Ningxia Bank's core system failed, interrupting deposit and withdrawal, transfer and payment, debit card, online banking, ATM and POS services across the bank, including remote branches.
A preliminary analysis found that, under the heavy settlement load at the end of the quarter, an abnormality in the backup system caused severe delays in reads and writes on the backup storage disks and left the backup inconsistent with primary storage. After the data backup and logging operations were interrupted, the production database was damaged and went down. Because Ningxia Bank's emergency recovery procedures were seriously inadequate, recovery was slow: the core system did not resume service until 05:40 on July 3, a business interruption of 37 hours and 40 minutes, during which all business was handled by hand.
Fault classification
Consider the interconnection diagram of a data center network.
Any hardware device on the diagram can fail, from hosts to switches to network cables.
We can classify faults roughly by failure domain. A failure domain is a set of components that become unavailable together because of a single fault. Common failure domains include:
A physical machine, including its local disks, network cards, memory, and so on
A rack in a data center that shares one set of power supplies
Several racks in a data center that share one network device
A data center served by a single optical fiber
Multiple data centers in the same region that share a city power grid or can be hit by the same natural disaster
How failure rates change
Different components fail at different rates. A Google study shows that for disks operating in the range of 36 °C to 47 °C, the failure rate rises with age, from only 1.7% in the first year to 8.6% in the third year.
There is now also a good deal of research applying big data and machine learning to disk failure prediction, with promising results.
Database fault handling
Log system
The database records changes to its data in logs, which come in three flavors: redo logs, undo logs, and redo/undo logs; redo logging is the most popular today.
Take a look at the structure of the PostgreSQL log:
Depending on what the records describe, logs fall into two types (a minimal sketch of both record shapes appears after this list):
Physical logs. The figure above shows a physical log. Replay is fast but the log volume is large; the implementation logic is relatively simple, though still easy to get wrong.
Logical logs. Replay is slower and the log volume is smaller. For a database with an MVCC mechanism there is an extra benefit: the standby can garbage-collect old versions on its own, independently of the host.
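To make the distinction concrete, here is a minimal Python sketch of what the two kinds of record might look like; the record layouts and names (PhysicalRecord, LogicalRecord) are purely illustrative and do not correspond to any particular database's format.

```python
from dataclasses import dataclass

@dataclass
class PhysicalRecord:
    """Describes raw bytes written at an offset within a page."""
    lsn: int          # log sequence number
    page_id: int      # which page was modified
    offset: int       # byte offset inside the page
    data: bytes       # the new byte image copied in during replay

@dataclass
class LogicalRecord:
    """Describes the operation itself; replay re-executes it."""
    lsn: int
    operation: str    # e.g. "INSERT", "UPDATE", "DELETE"
    table: str
    key: bytes
    values: dict      # new column values

def replay_physical(page: bytearray, rec: PhysicalRecord) -> None:
    # Physical replay is a plain memory copy: fast, but each record carries
    # the full byte image, so the log tends to be large.
    page[rec.offset:rec.offset + len(rec.data)] = rec.data
```

Replaying a LogicalRecord would instead re-run the insert, update, or delete through the executor, which is slower but keeps records small.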
There are two important principles for database logging systems:
The WAL principle: before a data page is flushed to disk, the corresponding log must be flushed first, and flushing here means not merely calling write but also calling sync. At the right moment, typically when a transaction commits, the log is flushed and sync is called so that data can be recovered after a power failure. Besides flushing the log at commit, sync is usually also called when metadata operations are involved, to keep metadata consistent.
Recovering through the log system requires not only an intact log but also an intact copy of the data (possibly an older one) as a starting point. A minimal sketch of both principles follows this list.
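The Python sketch below illustrates both principles together, under the assumption of a toy key-value store whose log records are simple key=value lines; SimpleWAL and recover are hypothetical names, not a real engine's API.

```python
import os

class SimpleWAL:
    """Toy write-ahead log; records are appended as b"key=value" lines."""

    def __init__(self, path: str):
        # O_APPEND keeps records ordered even if several writers append.
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

    def append(self, record: bytes) -> None:
        os.write(self.fd, record + b"\n")

    def commit(self) -> None:
        # WAL principle: the log must reach stable storage before the
        # transaction is reported as committed; write() alone is not enough.
        os.fsync(self.fd)

def recover(base_snapshot: dict, log_path: str) -> dict:
    # Recovery principle: start from an intact (possibly older) copy of the
    # data and replay every later log record, in order.
    state = dict(base_snapshot)
    with open(log_path, "rb") as log:
        for line in log:
            key, _, value = line.rstrip(b"\n").partition(b"=")
            state[key.decode()] = value.decode()
    return state
```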
Logging is a technique used widely in systems software, not only in databases. The log represents the changes of the system: it can drive recovery and backup, and it can also be used to notify other components. Whoever holds the system's log stream effectively holds the system's entire state. More abstractly, a system can be viewed as log + state machine: replaying the log advances the state machine, and shipping the log propagates state changes to every corner of the system. On this topic, the article "The Log: What every software engineer should know about real-time data's unifying abstraction" is highly recommended. The log is everything.
Log recycling
Logs represent every change in the system. If the data grows from 0 to 100 GB, the logs amount to at least 100 GB as well, and log growth is proportional to the changes users make. No system can store an endlessly growing log.
Recycling the storage occupied by logs is therefore unavoidable, and it brings two benefits:
Reduce the disk space occupied by logs
Reduce the time required for system recovery
In fact, for a database built on an MVCC mechanism, log recycling does not depend on when transactions commit, so the log can be kept strictly within a configured size, which makes operations much easier.
Recovery, as described above, needs an intact copy of the data as a starting point precisely because older logs have been recycled. If every log record from the initial state to the current state were kept, the log alone could rebuild the system; but obviously no system can keep all of its logs.
Checkpoint
Checkpoints are what allow logs to be recycled. A checkpoint proceeds roughly as follows (a minimal sketch appears after the steps):
Mark a point: record the current log position.
Flush all dirty data currently in memory and synchronize it to disk by calling sync; this must still obey the WAL principle.
Write a checkpoint log record, or store the checkpoint information as metadata.
Recycle the logs written before the checkpoint's starting point.
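A minimal sketch of a full checkpoint, assuming hypothetical buffer pool, log, and data file objects (dirty_pages, current_lsn, truncate_before and the other calls are illustrative, not a real engine's API):

```python
def full_checkpoint(buffer_pool, log, data_file):
    checkpoint_lsn = log.current_lsn()            # 1. mark: remember the log position
    for page in buffer_pool.dirty_pages():        # 2. flush every dirty page...
        data_file.write_page(page)
    data_file.sync()                              #    ...and sync so they are durable
    log.append_checkpoint_record(checkpoint_lsn)  # 3. record the checkpoint itself
    log.sync()
    log.truncate_before(checkpoint_lsn)           # 4. logs before the mark can be reclaimed
```

Step 2 is the source of the IO spike discussed next: every dirty page is written out at checkpoint time.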
The above is the common way of checkpointing, known as a full checkpoint. It is simple to implement, but the checkpoint is clearly an IO spike and can cause performance jitter.
There is another way of checkpointing, called an incremental checkpoint, which works as follows:
The background writer flushes dirty pages in the order in which they were first modified.
Mark a point: record the log position corresponding to the page currently being flushed, and either write it as a checkpoint log record or persist it as metadata.
With this approach the flushing is a plain background write and the checkpoint itself only needs to mark a point, which eliminates the IO spike and helps keep database performance stable. A minimal sketch follows.
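Here is a minimal sketch of the incremental approach, again with illustrative names rather than a real engine's API; the key assumption is that dirty pages are flushed in the order in which they were first modified, so the oldest still-dirty page bounds where recovery must start.

```python
from collections import OrderedDict

class IncrementalCheckpointer:
    def __init__(self, data_file, log):
        self.dirty = OrderedDict()   # page_id -> LSN of its first modification
        self.data_file = data_file
        self.log = log

    def mark_dirty(self, page_id, lsn):
        self.dirty.setdefault(page_id, lsn)   # keep only the first-modification LSN

    def background_write(self, batch=64):
        # Flush a batch of pages in first-modified order; no IO spike at
        # checkpoint time because flushing happens continuously.
        for _ in range(min(batch, len(self.dirty))):
            page_id = next(iter(self.dirty))
            self.data_file.write_page(page_id)
            self.dirty.pop(page_id)
        self.data_file.sync()

    def checkpoint(self):
        # The checkpoint only marks a point: the oldest dirty page's LSN.
        start = next(iter(self.dirty.values()), self.log.current_lsn())
        self.log.append_checkpoint_record(start)
        self.log.sync()
        self.log.truncate_before(start)
```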
Torn page
The database page size and the disk sector size usually differ, so if power is lost while a page is being flushed, only part of the page may have reached disk. This phenomenon is called a torn page. Such a page is effectively corrupted beyond use, and since log replay needs an intact page as its starting point, the page cannot be recovered by replay alone.
There are several ways to deal with these partial writes:
InnoDB's doublewrite and PostgreSQL's full page writes rest on a similar principle: before the page is written in place, a full image of the page is first written somewhere else and synced; only then is the page overwritten (a minimal sketch of this idea follows these two items).
Restore from backup: recover just the damaged page from a backup.
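A minimal sketch of the "write elsewhere first" idea, assuming 8 KB pages and an existing data file; the paths and function name are illustrative, and this is not how InnoDB or PostgreSQL actually lay out their files.

```python
import os

PAGE_SIZE = 8192

def safe_write_page(data_path: str, scratch_path: str, page_no: int, page: bytes) -> None:
    assert len(page) == PAGE_SIZE
    # 1. Write the full page image to a scratch area and sync it.
    with open(scratch_path, "wb") as scratch:
        scratch.write(page)
        scratch.flush()
        os.fsync(scratch.fileno())
    # 2. Only now overwrite the page in place. If power fails mid-write,
    #    recovery can copy the intact image back from the scratch area
    #    before log replay begins.
    with open(data_path, "r+b") as data:
        data.seek(page_no * PAGE_SIZE)
        data.write(page)
        data.flush()
        os.fsync(data.fileno())
```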
There are a few exceptions:
Append-only storage systems do not have this problem.
If the page size equals the sector size there is no problem either, a fact many metadata designs rely on.
Many file systems and distributed file systems, RAID cards, and even the disks themselves can also handle this failure. If the underlying hardware or storage layer already handles partial writes, the database need not enable this feature.
The disk is full
A full disk can only be handled operationally: every transaction commit must write a log record, so if the log cannot be written, no transaction can commit, which is effectively the same as the database being down. Disk-full failures are therefore generally handled through monitoring and early warning before space runs out.
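A minimal monitoring sketch of that early-warning idea, assuming a fixed alert threshold and a hypothetical data directory path; a real deployment would feed this into an alerting system rather than print.

```python
import shutil

def check_disk(path: str = "/var/lib/db", warn_ratio: float = 0.85) -> bool:
    usage = shutil.disk_usage(path)
    used_ratio = usage.used / usage.total
    if used_ratio >= warn_ratio:
        print(f"WARNING: {path} is {used_ratio:.0%} full; "
              f"commits will stall once the log can no longer be written")
        return False
    return True
```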
Disk damage
As mentioned above, database recovery requires an intact copy of the data plus the logs, so if either the data or the log is hit by disk corruption, the log system cannot recover on its own and other means are needed.
Backup
By level, the common backup methods are:
Full backup: the usual practice in a traditional database is to take a checkpoint first and then copy all of the data together with the logs written after the checkpoint. With a large data set this is a very heavy operation.
Incremental backup: an incremental backup also takes a checkpoint, then copies the pages changed since the last backup together with the logs written after the checkpoint. How are the changed pages found? Comparing every page is one method; recording page changes in a bitmap file is another, and Percona implements the second. Incremental backups can be stacked and merged, and full and incremental backups can also be merged in chronological order.
Log archiving: archiving the completed logs on a regular schedule.
Clearly, the cost of these operations is ordered full backup > incremental backup > log archiving. Database operations usually combine all three to reduce the RPO (recovery point objective) after a fault; a minimal sketch of how they combine during a restore follows.
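The sketch below shows, with purely illustrative structures, how the three levels are typically combined at restore time: start from the full backup, layer the incrementals on top, then replay archived logs from the newest backup's position forward.

```python
from dataclasses import dataclass

@dataclass
class FullBackup:
    lsn: int
    pages: dict              # page_id -> page image

@dataclass
class IncrementalBackup:
    lsn: int
    changed_pages: dict      # only the pages modified since the previous backup

def restore(full: FullBackup, increments: list, archived_log) -> dict:
    pages = dict(full.pages)                     # 1. start from the full backup
    last_lsn = full.lsn
    for inc in sorted(increments, key=lambda b: b.lsn):
        pages.update(inc.changed_pages)          # 2. newer pages overwrite older ones
        last_lsn = inc.lsn
    for record in archived_log.records_after(last_lsn):
        record.apply(pages)                      # 3. replay archived logs to shrink RPO
    return pages
```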
Amazon Aurora implements near-real-time backup, with a backup lag of no more than 5 minutes.
Multi-machine hot standby
If a machine fails for any reason (a burned-out CPU, bad memory, an operating system bug, or even physical destruction), it can be restored from backup.
However, recovering from backup usually takes a long time and cannot satisfy business continuity requirements. So in addition to backups, databases support hot standby machines, and the standbys can serve read-only queries.
Obviously, standbys should be deployed with anti-affinity across failure domains, according to customer requirements.
Master-standby (cascading)
Each host can have several standbys attached, and each standby can in turn have its own standbys; this is a common deployment for traditional databases. Postgres even supports multi-level cascading replication, although it is not widely used. This deployment handles single-machine failures well. A standby that serves read-only queries can effectively share the read load; such reads are delayed but satisfy the corresponding isolation level on the standby, although there is no consistency guarantee when reads on the standby are combined with reads on the host.
Transaction commit time
Depending on when a transaction on the host is considered committed, there are several commit levels (a minimal sketch of the two basic levels follows):
Commit once the host's log has been flushed to the host's local disk; if the host is lost at this point, transactions that have not yet reached the standby can be lost.
Commit only after the host's log has been flushed to disk and has also been sent to the standby machine.
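A minimal sketch of the two levels, assuming hypothetical wal and standby objects; waiting for the standby before acknowledging the client is what distinguishes the stronger level.

```python
def commit(txn_log: bytes, wal, standby, wait_for_standby: bool) -> None:
    wal.append(txn_log)
    wal.sync()                     # level 1: durable on the host's local disk
    if wait_for_standby:
        standby.send(txn_log)      # level 2: also shipped to the standby
        standby.wait_received()    #          before the commit is acknowledged
```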