
Why did the database lose data?

2025-01-16 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article looks at why a database can lose data, a topic many people know little about. The summary below is meant to make the causes easier to understand.

Today, database management systems are an essential part of software. Open source databases such as MySQL and PostgreSQL and commercial databases such as Oracle are everywhere, and almost every service relies on a database management system to store its data.

Figure 1-Database

It sounds like a given that a database will not lose data, and persistence should be the most basic guarantee a database offers. In a complex world, however, it is hard to guarantee that data is never lost. There are many examples of data loss caused by database problems:

For a long time, MongoDB could not guarantee persistence and could easily lose data [^1]

A bug in RocksDB's DeleteRange feature caused data loss [^2]

A hard disk failure at Tencent Cloud caused a startup to lose its online production data completely [^3]

Whether you run an open source database or use a service from a cloud provider, data loss can happen. This article attributes database data loss to the following causes, each covered in detail below:

Misconfiguration and operational mistakes caused by human error are the primary cause of database data loss.

The disk that the database uses to store data is damaged, and the data on it is lost.

The features and implementation of a database are complex, and data that is not flushed to disk in time is at risk of being lost.

Human error

Human error is the primary cause of data loss. In the Tencent Cloud incident, the trigger was a hardware failure, but it was improper operation by the operations staff that ultimately destroyed the integrity of the data:

First, the normal data migration process enables data verification by default, which effectively detects anomalies in the source data and ensures that the migrated data is correct. To finish the migration sooner, however, the operations staff turned verification off in violation of procedure.

Second, after a normal migration completes, the source storage should be retained for 24 hours so that data can be recovered if the migration went wrong. To lower storage utilization as quickly as possible, however, the operations staff reclaimed the source storage early, again in violation of procedure.

The best way to reduce human error is to standardize operations such as data backup and maintenance, and to use automated processes for any operation that touches data safety, so that the risk introduced by manual intervention is reduced.
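As an illustration of what an automated, verified migration step might look like, here is a minimal sketch in Python. It is not the procedure used in the incident above; the names sha256_of, migrate, and known_good are hypothetical, and the sketch assumes the data is a single file on a local file system. The point is that verification is built into the step and the source is never deleted by it, so skipping either safeguard requires changing the code rather than a manual shortcut.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Optional


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(src: Path, dst: Path, known_good: Optional[str] = None) -> str:
    """Copy src to dst and verify the copy (hypothetical helper).

    known_good is an optional digest recorded when the data was written;
    if given, a corrupted source is rejected before anything is copied.
    The source file is never deleted here: retention is a separate step.
    """
    expected = sha256_of(src)
    if known_good is not None and expected != known_good:
        raise IOError(f"source {src} does not match its recorded checksum")
    shutil.copyfile(src, dst)
    actual = sha256_of(dst)
    if actual != expected:
        dst.unlink()  # discard the bad copy, keep the source intact
        raise IOError(f"checksum mismatch migrating {src} -> {dst}")
    return actual
```

A separate job would reclaim the source only after the retention window has passed, and only for migrations whose verification succeeded.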

As software engineers, we should treat the production environment with respect, perform every operation in it carefully, and remember that any operation may affect services that are running online. Doing so lowers the probability of similar problems.

Hardware error

As discussed in our article on why basic services should not be highly available, it is largely a matter of chance that any online service keeps running normally; given a long enough time span, there is no way to guarantee 100% availability [^4]. Hardware such as disks is likely to fail if it is used long enough. According to data in a Google paper, the annualized failure rate (AFR) of hard drives within 5 years is 8.6% [^5].

The Tencent Cloud data corruption in 2018 was caused by a single-replica data error that resulted from a silent disk error (silent data corruption) [^6]. Silent disk errors are errors that neither the disk firmware nor the host operating system detects. Their causes include loose cables, unreliable power supplies, external vibration, data loss caused by the network, and so on.
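Because neither the disk nor the operating system reports a silent error, software that wants to notice one has to check the data itself. A common approach is to store a checksum next to each block and verify it on every read. The sketch below assumes a 4-byte CRC32 prefix; the names seal and unseal are hypothetical and do not correspond to any particular database's on-disk format.

```python
import struct
import zlib


def seal(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 (4 bytes, big-endian)."""
    return struct.pack(">I", zlib.crc32(payload)) + payload


def unseal(block: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum does not match."""
    if len(block) < 4:
        raise ValueError("block too short to contain a checksum")
    (stored,) = struct.unpack(">I", block[:4])
    payload = block[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch: block is corrupted")
    return payload


block = seal(b"account=42;balance=100")

# Simulate a silent error: flip one bit in the stored payload.
damaged = bytearray(block)
damaged[10] ^= 0x01

try:
    unseal(bytes(damaged))
except ValueError as err:
    print("detected:", err)
```

A checksum only detects the damage. Recovering the original bytes still requires another copy of the data.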

It is precisely because disk data corruption is so common that we need data redundancy, so that data can still be recovered when a disk hits an unrecoverable read error (URE). A Redundant Array of Independent Disks (RAID) is a data storage virtualization technology that combines multiple physical disks into a single logical disk, which can increase data redundancy and improve performance [^7].

Figure 2-three strategies of RAID

RAID mainly uses striping, mirroring, and parity to manage data on disk. Here are a few simple examples:

RAID 0 uses striping but neither mirroring nor parity. It offers almost no protection for the data: if any disk is damaged, the data on it cannot be recovered. Because there is no redundancy, however, it also offers better performance.

RAID 1 uses mirroring but neither parity nor striping. All data is written to two identical disks, and either disk can serve reads. This lowers disk utilization but improves read performance and provides a backup copy.

...

The striping and mirroring strategies in RAID resemble partitioning and replication in distributed databases: striping and partitioning split data and spread it across disks or machines, while mirroring and replication copy it.
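The three strategies can be sketched with plain byte strings standing in for disks. This is a toy model for intuition, not how a RAID controller is implemented; the helper names stripe and xor_blocks are hypothetical, and the parity shown is simple XOR parity over equal-sized blocks.

```python
from functools import reduce


def stripe(data: bytes, disks: int, block_size: int):
    """Split data into blocks and deal them round-robin across `disks` lists."""
    layout = [[] for _ in range(disks)]
    for offset in range(0, len(data), block_size):
        disk = (offset // block_size) % disks
        layout[disk].append(data[offset:offset + block_size])
    return layout


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Striping: consecutive blocks go to different disks, with no redundancy.
layout = stripe(b"AAAABBBBCCCCDDDD", disks=2, block_size=4)
# layout == [[b"AAAA", b"CCCC"], [b"BBBB", b"DDDD"]]

# Mirroring: every disk holds a full copy of the same data.
mirror = [b"AAAABBBB", b"AAAABBBB"]

# Parity: one extra block lets any single lost block be rebuilt.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d0, d1, d2])
rebuilt_d1 = xor_blocks([d0, d2, parity])  # pretend the disk holding d1 failed
assert rebuilt_d1 == d1
```

Losing either disk of the striped layout loses half the blocks for good, while the mirrored and parity layouts each survive the loss of one disk.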

Many modern operating systems provide software-based RAID implementations, and some cloud service providers use self-developed file systems or redundancy mechanisms:

Google uses the Google File System to manage files; it stores files as chunks and manages all chunks through a master server [^8]

Microsoft uses erasure coding in Azure to compute redundant data [^9]

Hardware errors are very common in production. Redundancy and verification are the only ways to reduce the chance of data loss, and adding redundancy can only keep lowering that probability; it can never eliminate it completely.
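A back-of-the-envelope calculation shows why the probability shrinks but never reaches zero. The sketch below makes strong simplifying assumptions that real systems do not satisfy: copies fail independently, and a failed copy is never repaired within the period. It reuses the 8.6% failure rate quoted above purely as an illustrative input, and loss_probability is a hypothetical helper.

```python
def loss_probability(p_single: float, copies: int) -> float:
    """Probability that every one of `copies` independent copies fails."""
    return p_single ** copies


# Each extra copy multiplies the loss probability by p_single again.
for copies in (1, 2, 3, 4):
    print(f"{copies} copies: {loss_probability(0.086, copies):.6%}")
```

Under these assumptions, two copies bring the chance of losing everything down to about 0.74% and three copies to about 0.064%, but for any finite number of copies the result stays above zero. Correlated failures, such as a shared power supply or a mistaken operation applied to every replica, make the real figure worse than this model suggests.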

Implementation complexity

A database management system eventually stores its data on disk, and for many databases, persistence is complete once the data is on disk. The disk is the layer beneath the database system, so the disk's ability to store data reliably is the foundation on which the database persists data.

Figure 3-The database depends on the disk

Many people assume that calling write puts data on disk, but this is wrong. Not only does write not guarantee that the data reaches the disk, some implementations do not even guarantee that space has been reserved for the written data [^10]. In general, writing to a file only updates the page cache in memory, and the dirty pages are not flushed to disk immediately. The operating system's flusher kernel threads write the data back to disk when one of the following conditions is met [^11]:

Free memory drops below a certain threshold, so the memory occupied by dirty pages has to be freed.

Dirty data has stayed in memory longer than a certain period, so the oldest data is written back to disk.

A user process calls the sync or fsync system call.

If we want the data flushed to disk immediately, we need to call a function such as fsync [^12] right after write. Once fsync returns, the database can notify the caller that the data has been written successfully.
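The write-then-fsync sequence can be sketched with Python's thin wrappers around the system calls, os.write and os.fsync. The function name durable_append is hypothetical, and the sketch assumes a POSIX-like system and a storage device that honors flush requests.

```python
import os


def durable_append(path: str, data: bytes) -> None:
    """Append data and return only after fsync has completed (hypothetical helper)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(data)
        while view:
            # write() only copies into the page cache and may accept
            # fewer bytes than requested, so loop until all are handed over.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to flush this file's dirty pages to the device.
        os.fsync(fd)
    finally:
        os.close(fd)
```

Only after durable_append returns without raising would a caller acknowledge the write. If os.fsync raises, the data should be treated as not persisted.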

Figure 4-write and flushing to disk

write and fsync are very important in database management systems; they are the core calls behind the persistence guarantee. Developers who misunderstand write end up writing incorrect code, which leads to data loss.
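The same misunderstanding shows up one layer higher, in buffered I/O libraries, where a write call may not even reach the kernel. The sketch below uses Python's buffered file objects as an example; write_record is a hypothetical name, and the three numbered stages describe CPython's default buffered files on a POSIX-like system.

```python
import os


def write_record(path: str, record: bytes) -> None:
    """Append a record through a buffered file object and force it to disk."""
    with open(path, "ab") as f:
        f.write(record)        # 1. stored in the library's user-space buffer
        f.flush()              # 2. handed to the kernel page cache via write()
        os.fsync(f.fileno())   # 3. kernel asked to flush it to the storage device
```

Code that stops after step 1 or step 2 usually appears to work, because the data does reach the disk eventually. The loss only shows up when the process or the machine crashes in between.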

In addition to persistence, a database may also need to provide ACID (Atomicity, Consistency, Isolation, Durability) or BASE (Basically Available, Soft state, Eventual consistency) guarantees. Some databases also provide complex features such as sharding, replication, and distributed transactions. These features add complexity to the database system, and as a program grows more complex, so does the likelihood of problems.

The database management system is one of the most complex and important systems in software engineering. Almost every service runs on the assumption that the database will not lose data. However, a database cannot fully guarantee the safety of its data, for the following reasons:

Operations staff can easily lose data through mistakes made during configuration and maintenance.

A hardware error on the underlying disk that the database depends on can leave data unrecoverable.

A database system supports many complex features, and data may be lost if it is not flushed to disk in time.

When data loss does happen, the impact is severe. When we use a database to store core business data, we cannot fully trust its stability, and we should consider hot backups and snapshots for disaster recovery.
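As a small illustration of a hot backup, the sketch below uses SQLite's online backup API through Python's standard sqlite3 module. SQLite is only a stand-in chosen because it ships with Python; databases such as MySQL and PostgreSQL have their own backup tools. The function name snapshot is hypothetical.

```python
import sqlite3


def snapshot(live: sqlite3.Connection, target_path: str) -> None:
    """Copy a consistent snapshot of a live SQLite database to target_path."""
    target = sqlite3.connect(target_path)
    try:
        # Connection.backup copies the database while `live` stays usable.
        live.backup(target)
    finally:
        target.close()
```

A backup is only useful if it can be restored, so a periodic job that opens the snapshot and runs a sanity query is worth automating alongside the backup itself.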
