How can I solve the problem with the RAID5 array hard drive? 04/01 Update SLTechnology News&Howtos


2026-04-01 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

[fault description]

A Huawei S5300 storage system with 12 FC hard drives. The storage space is built from 450 GB and 600 GB FC hard drives, 11 of which form a RAID5 array, with the remaining one used as a hot spare. One hard disk in the RAID5 array failed and the hot spare was activated successfully, but a second hard disk failed during synchronization. This took the RAID5 array down and made the upper-layer LUNs unusable.

[recovery process]

First, inspect the disks

The entire storage system was unavailable because some disks in the RAID array had gone offline. After receiving the disks, we therefore physically tested all of them. The tests found a physical fault on one hard disk; the rest showed no physical fault.

Second, back up the data

For the safety and recoverability of the data, all source data must be backed up before recovery begins, in case a later step makes the data unrecoverable. Use the dd command or the WinHex tool to image every disk into a file.
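As a rough illustration of what the imaging step does, here is a minimal Python sketch that copies a source block by block and zero-fills any block it cannot read, similar in spirit to running dd with conv=noerror,sync. The function name image_disk, the block size, and the paths are assumptions made for this example; the article only states that dd or WinHex was used.

```python
import os


def image_disk(src_path, dst_path, block_size=1024 * 1024):
    """Copy src_path to dst_path block by block.

    Blocks that cannot be read are replaced with zeros so that all
    later offsets in the image still line up with the source disk.
    Returns a list of (offset, length) ranges that were unreadable.
    """
    bad_ranges = []
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        size = src.seek(0, os.SEEK_END)  # total size of the source
        offset = 0
        while offset < size:
            length = min(block_size, size - offset)
            try:
                src.seek(offset)
                data = src.read(length) or b""
            except OSError:
                data = b""
            if len(data) < length:
                # Unreadable or short block: record it and pad with zeros.
                bad_ranges.append((offset + len(data), length - len(data)))
                data = data.ljust(length, b"\x00")
            dst.write(data)
            offset += length
    return bad_ranges
```

The returned list is worth keeping next to each image, because it tells later analysis steps which regions of a disk hold padding rather than real data.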

Third, fault analysis

1. Analyze the cause of the failure

The first two steps found a physical fault on one disk, so we inferred that the failure was caused by unstable reads and writes on some disks together with that physical fault. The Huawei S5300 controller applies a strict disk inspection policy: once a disk becomes unstable, the controller treats it as a bad disk and kicks it out of the RAID group. Once the number of disks dropped from a RAID group exceeds what the RAID level can tolerate, the RAID group becomes unavailable, and so do the upper-layer LUNs built on it. In this case a rebuild onto the hot spare began, and another hard disk was damaged during synchronization. The preliminary understanding at this point was that the LUNs on the RAID group were allocated to a Linux system and that the important data was an Oracle database.

2. Analyze the structure of RAID group.

LUNs on the Huawei S5300 are built on top of RAID groups, so the underlying RAID group has to be analyzed first and the original RAID group then reconstructed from that information. Analyzing each data disk showed that the data on one disk differed from that on the others, so we took it to be the hot spare disk. We then analyzed the remaining data disks and the distribution of Oracle database pages across them, and from that distribution derived the key parameters of the RAID group: stripe size, disk order, and data direction.
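To show how stripe size, disk order, and data direction together determine where a piece of data lives, here is a small Python sketch that maps a logical byte offset to a disk and an offset on that disk. It assumes a left-symmetric RAID5 rotation, which is only one of several common layouts; the article does not state which layout the S5300 uses, and the function name locate and the disk labels are invented for this example.

```python
def locate(logical_offset, chunk_size, disk_order):
    """Map a logical byte offset to (disk, offset_on_disk).

    Assumes a left-symmetric RAID5 layout: parity starts on the last
    disk and moves one disk to the left on each stripe row, and data
    chunks start on the disk right after the parity disk.
    """
    n = len(disk_order)                 # total disks, parity included
    chunk = logical_offset // chunk_size
    row = chunk // (n - 1)              # stripe row (n - 1 data chunks per row)
    k = chunk % (n - 1)                 # index of the data chunk within the row
    parity_slot = (n - 1) - (row % n)   # slot holding parity for this row
    data_slot = (parity_slot + 1 + k) % n
    disk_offset = row * chunk_size + logical_offset % chunk_size
    return disk_order[data_slot], disk_offset


# Example with four hypothetical disks and 64 KiB chunks.
disks = ["d0", "d1", "d2", "d3"]
print(locate(0, 65536, disks))               # ('d0', 0)
print(locate(3 * 65536, 65536, disks))       # ('d3', 65536)
print(locate(4 * 65536 + 10, 65536, disks))  # ('d0', 65546)
```

When the parameters are unknown, a candidate combination can be tested by checking whether consecutive database pages read through such a mapping come out in the expected order.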

3. Analyze the disk damaged during RAID group synchronization

Using the RAID information analyzed above, we tried to virtualize the original RAID group with the RAID virtualization program independently developed by North Asia. However, the RAID group had lost two disks, and the data on one hard disk had been damaged during synchronization. Careful analysis of the data on each hard disk showed that, within the same stripe, the data on one hard disk was clearly different from that on the others, so we made a preliminary judgment that this disk was the one damaged during synchronization. Checking the stripe with the RAID check program independently developed by North Asia confirmed that the disk had been damaged during synchronization.
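The idea behind a stripe check can be sketched with plain XOR parity: in a consistent RAID5 stripe, XOR-ing all blocks (data plus parity) yields zeros, and any single missing block equals the XOR of the others. The Python below is only an illustration of that principle with made-up function names, not the North Asia check program.

```python
def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    length = len(blocks[0])
    acc = 0
    for block in blocks:
        if len(block) != length:
            raise ValueError("all blocks in a stripe must have the same size")
        acc ^= int.from_bytes(block, "little")
    return acc.to_bytes(length, "little")


def stripe_is_consistent(blocks):
    """True if data blocks plus parity XOR to all zeros."""
    return not any(xor_blocks(blocks))


def rebuild_missing(surviving_blocks):
    """Rebuild the one missing block of a stripe from the others."""
    return xor_blocks(surviving_blocks)
```

Running such a check stripe by stripe, once with and once without a suspect disk, is one way to see whether that disk holds stale or partially synchronized data.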

4. Analyze the LUN information in the RAID group

Because the LUNs are built on the RAID group, the latest state of the RAID group first has to be virtualized from the information analyzed above. We then analyzed how LUNs are allocated within the RAID group and the block map assigned to each LUN. Only the LUN's block distribution map needs to be extracted. A program written from this information parses the LUN's map and then exports the LUN's data according to that map.
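Exporting a LUN from its block map can be pictured as follows. This Python sketch assumes the map has already been decoded into a list of (raid_offset, length) extents in LUN order; that representation, and the name export_lun, are hypothetical, since the article does not describe the on-disk format of the S5300 map.

```python
import io


def export_lun(raid, extent_map, out):
    """Write the LUN's data to out by walking its extent map.

    raid       : seekable binary file-like object for the virtual RAID
    extent_map : list of (raid_offset, length) tuples in LUN order
                 (hypothetical representation of the block map)
    out        : writable binary file-like object
    Returns the number of bytes written.
    """
    total = 0
    for raid_offset, length in extent_map:
        raid.seek(raid_offset)
        data = raid.read(length)
        if len(data) != length:
            raise IOError(f"short read at RAID offset {raid_offset}")
        out.write(data)
        total += length
    return total


# Tiny in-memory example: the LUN consists of the third block, then the first.
raid = io.BytesIO(b"AAAABBBBCCCCDDDD")
lun = io.BytesIO()
export_lun(raid, [(8, 4), (0, 4)], lun)
print(lun.getvalue())  # b'CCCCAAAA'
```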

Fourth, parse the EXT3 file system

1. Parse the EXT3 file system

Because of the hot spare's effect on the virtual RAID structure, the EXT3 file system could not be mounted normally, so the only option was to extract the Oracle database files directly. We parsed the file system with our self-developed file system parser, exported the Oracle database files, and handed them to the database engineers for verification.
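A first step in parsing an EXT3 volume without mounting it is to read the superblock, which sits 1024 bytes into the volume and carries the magic number 0xEF53. The Python sketch below reads a few of its fields from an exported LUN image; it is a minimal starting point under that assumption, not the self-developed parser mentioned above, and the function name is invented for this example.

```python
import struct

EXT_SUPERBLOCK_OFFSET = 1024  # the superblock starts 1024 bytes into the volume
EXT_MAGIC = 0xEF53            # magic number shared by ext2, ext3 and ext4


def read_ext_superblock(image):
    """Read basic geometry from the ext superblock of a binary image.

    image is a seekable binary file-like object positioned on a volume
    (for example an exported LUN image). Raises ValueError if the
    superblock is truncated or the magic number does not match.
    """
    image.seek(EXT_SUPERBLOCK_OFFSET)
    sb = image.read(1024)
    if len(sb) < 1024:
        raise ValueError("image too small to contain a superblock")
    inodes_count, blocks_count = struct.unpack_from("<II", sb, 0)
    (log_block_size,) = struct.unpack_from("<I", sb, 24)
    (magic,) = struct.unpack_from("<H", sb, 56)
    if magic != EXT_MAGIC:
        raise ValueError(f"bad magic 0x{magic:04X}, not an ext file system")
    return {
        "inodes_count": inodes_count,
        "blocks_count": blocks_count,
        "block_size": 1024 << log_block_size,
    }
```

If the magic number is missing at the expected offset, that is often a sign that the RAID parameters or the LUN map used to build the image are still wrong.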

Fifth, detect Oracle database files and repair them

1. Check whether the database file is complete.

We used the Oracle database file detection tool to check whether each database file was complete and to look for errors. A stricter check with the Oracle database detection tool independently developed by North Asia then found errors in some database files and log files. The system and sysaux tablespaces contained more than 100 bad blocks. All three control files contained many bad blocks and were corrupted. Three files in the eschoolspace tablespace contained as many as 1000 bad blocks, and undotbs02 was lost. The database engineers then repaired these files.

2. Repair Oracle database

We recreated the control file, created the undo tablespace, and started the database to the mount state. The bad blocks in the system data file prevented the database from opening, and none of the hidden parameters we tried could bypass them. We therefore built a new database environment and restored the database from dmp files. Imports from after March 9 all reported errors, and only about 10 GB of data could be imported.

Sixth, data verification

With the user's cooperation, we started the Oracle database and installed the OA client in a local virtual machine. The data records were verified through the OA client, and the user arranged for staff from different departments to verify the data remotely.

Seventh, data recovery conclusion

Because the RAID was rebuilt after the failure, the data on one disk was damaged during synchronization, which made the later data recovery difficult. Because the hot spare had only been written with part of the data during the synchronization period, recovery from the hot spare could restore only part of the data, namely the data from before March 9.
