Is the RAID disk array really 100% secure? What are the common failures of RAID?


2026-02-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

Data recovery is a remedy for data disasters. But can a data disaster still happen on a RAID disk array that was designed to be secure?

RAID was designed for roughly three reasons: to solve capacity problems, IO performance problems, and storage security (redundancy) problems. From the data recovery perspective, we set capacity and IO performance aside for now and discuss only storage security.

The common RAID levels that provide storage security are RAID1, RAID5 and their variants. Their basic design ideas are similar: an algorithm maintains a relationship among the data on multiple hard disks, so that when part of the data becomes abnormal it can be restored by that algorithm. Take RAID5 as a simple example. If we want to record two numbers, we can additionally record their sum to make the record redundant. Say we record 3 and 5, and also record 8 (the sum of 3 and 5). If we later forget the 3 but still have the 5, we only need 8-5 to calculate the missing number, and the same holds whichever of the three values is lost.

A disk array uses a similar algorithm to preserve data. When a 3-disk RAID5 works normally, all data written to the RAID is written correctly to specific disk addresses, and a calculated value (usually called the checksum, or parity) is generated and stored alongside it. In this state read and write efficiency is at its best. When one of the disks fails, the data originally stored on the failed disk is recovered from the data on the other disks. The controller (a RAID card for hardware RAID; for software RAID, effectively a driver) is responsible for this work. To avoid downtime, the controller also keeps the storage presenting normally, so the operating system does not see anything wrong with the disk subsystem.
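The sum example above can be sketched in code. Real RAID5 implementations generally use bitwise XOR rather than addition, because XOR is its own inverse and never overflows; the block contents and the `xor_blocks` helper below are illustrative assumptions, not any controller's actual code.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte position by byte position."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

# One stripe on a hypothetical 3-disk RAID5: two data blocks plus one parity block.
data_a = b"\x03\x00\x00\x00"
data_b = b"\x05\x00\x00\x00"
parity = xor_blocks(data_a, data_b)

# Simulate losing the disk that held data_a and recompute it from the survivors.
rebuilt_a = xor_blocks(data_b, parity)
print(rebuilt_a == data_a)
```

The same call recovers `data_b` or `parity` if one of those is the missing block, which is why a single-disk failure is survivable but a second one is not.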

Judging from the principles above, the storage security provided by RAID still has some weaknesses that are not easy to avoid. Failures are unlikely, but the data stored on RAID is often of inestimable value, so even the slightest failure may lead to a major information disaster.

To get to the point, the common failure modes of RAID are:

1. Not rebuilding promptly while degraded: RAID provides redundancy by dedicating part of its storage space to algorithmically derived data. Once a disk fails and goes offline, the array can no longer provide that redundancy. If the administrator does not replace the disk and REBUILD the volume in time, a further failure among the remaining disks will stop the RAID volume from working. This kind of failure accounts for a large share of RAID data recovery cases, and it tends to occur when server maintenance cannot keep up.

2. Controller failure: the controller is the data link between the physical hard disks and the operating system. A RAID's layout is not fixed by any universal convention: the disk capacity, number of disks, RAID level, logical disk partitioning, block size, parity mode and other factors combine into RAID information (RAID metadata) that is specific to each array. This information is sometimes written on the array card, sometimes on the hard disks, and sometimes on both. If the controller fails, in many cases a replacement controller cannot restore the RAID information, and mid-range and low-end controllers, built to a cost, are much more exposed to this weakness. Moreover, even if you remember the original RAID structure, re-creating the array is the wrong way to recover the data (see related article).

3. Firmware algorithm defects: the controller implements RAID creation, rebuilding, degradation, protection and other work through very complex algorithms. Much of that complexity exists to make the algorithms as foolproof as possible, and manufacturers do not readily admit to controller bugs, but no controller can avoid such problems entirely. Firmware bugs can cause failures that are hard to explain. For example, in some server data recovery cases involving early-production DELL 2950 servers, after a disk in the RAID went OFFLINE the alarm light did not match the failed disk. The customer then unplugged the wrong disk when replacing the failed disk for a REBUILD, and the entire RAID group crashed.

4. Disks dropped because of IO channel obstruction: RAID controllers are designed to avoid writing data to unstable storage media as far as possible. When the controller performs IO against a physical hard disk and the operation exceeds a time threshold, or the result does not satisfy the parity relationship, the controller concludes that the device can no longer work reliably, forces it offline, and notifies the administrator to resolve the problem as soon as possible. The intention behind this design is good, and it is the correct approach. However, when the cause is incidental, such as a loose physical link or a mechanical response timeout on the disk (which may still be intact), the controller cannot tell whether the device is still as stable as before. Small problems that nobody pays much attention to can therefore cause the RAID volume to fail, and this kind of failure is both quite likely and effectively unavoidable. This is also why, in most RAID failures, the hard disks themselves turn out not to have failed, and why many of our data recovery customers question the server manufacturers. To a certain extent, the more conservatively the controller is designed, the more often this happens.

5. Controller stability: a RAID controller is most stable in the ONLINE state (no disks offline). When some disks go offline because of damage (possibly only a logical failure), the controller works under a heavier load, which is why read and write performance drops sharply on many low-end RAID controllers after one disk goes offline. The heavy load greatly increases the chance of IO stalls during data throughput, which can take the RAID offline as described in point 4 above. A controller without a high-speed hardware processing chip and cache is much more prone to this kind of failure. To avoid the business interruption and extra cost of data recovery after a failure, try not to choose this kind of disk array controller.

6. Bad hard disks: this case is interesting. Many people assume that a normally working RAID cannot contain a bad hard disk, because as soon as a disk breaks the RAID takes it offline, and a REBUILD onto the replacement disk restores a healthy array. In practice bad disks are unavoidable, because a RAID volume, even after working for a long time, rarely reads the entire surface of any physical hard disk, let alone of all disks at the same time. A disk can develop bad tracks in areas that have never been read, or that read fine the last time, and the controller still considers the disk good because those areas have not been read or written since. These bad tracks do the most direct harm during a REBUILD. When a physical disk goes offline, technicians and official documentation usually advise running a REBUILD as soon as possible. But a REBUILD synchronizes the whole disk, so if other disks have such undetected bad tracks, the REBUILD will inevitably read them. The REBUILD then cannot complete and the new disk cannot come online; worse, the bad tracks found on the old disks cause more disks in the RAID to go offline. The RAID may then fail with no way to recover the data on its own.

7. Human error: a considerable share of the data disasters that end up in data recovery could have been avoided, but these situations keep happening: unauthorized personnel mistakenly unplug a disk in the RAID, no spare disk is prepared, a failed disk is not replaced in time, the original disk order is forgotten when the disks are pulled out for dust removal, the original RAID configuration is accidentally deleted, and so on.

8. Other reasons that I can't remember for the moment.
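The interaction between a degraded array and a latent bad track (points 1 and 6) can be illustrated with a toy model. This is a minimal sketch under simplifying assumptions: the `Disk` class, the `bad` set and the `rebuild` function are hypothetical names invented for the illustration, parity is plain XOR, and a real controller's handling of read errors is far more involved.

```python
from functools import reduce

class UnreadableSector(Exception):
    """Raised when a member disk cannot return a requested block."""

class Disk:
    def __init__(self, blocks, bad=()):
        self.blocks = list(blocks)
        self.bad = set(bad)  # indexes of latent bad blocks (hypothetical model)

    def read(self, index):
        if index in self.bad:
            raise UnreadableSector(f"block {index} is unreadable")
        return self.blocks[index]

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))

def rebuild(survivors, stripe_count):
    """Recompute every block of the missing disk by reading ALL surviving blocks."""
    return [xor_blocks([d.read(i) for d in survivors]) for i in range(stripe_count)]

# Two surviving disks of a 3-disk array; the third disk has already gone offline.
d0 = Disk([b"\x01", b"\x02", b"\x03"])
d1 = Disk([b"\x10", b"\x20", b"\x30"], bad={2})  # block 2 went bad unnoticed

try:
    rebuild([d0, d1], stripe_count=3)
    outcome = "rebuild completed"
except UnreadableSector as err:
    outcome = f"rebuild aborted: {err}"
print(outcome)
```

Normal workloads might never touch block 2 of `d1`, so the array looks healthy until the rebuild is forced to read every stripe; with no redundancy left, that single unreadable block has nothing to be reconstructed from.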

Apart from human causes, most of these disasters are difficult to prevent directly; they can only be addressed by combining backups with an overall storage security plan. Other articles will cover the causes and security advice that fall outside the topic of data recovery.

