EMC Storage RAID Fault Data Analysis Report

2026-02-14 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

I. Fault description

The user's EMC FC AX-4 storage system crashed. The storage consists of twelve 1 TB SATA hard drives: ten form a RAID5 array and the other two serve as hot spares. Two drives in the RAID5 array failed and only one hot spare was activated successfully, so the RAID5 array went down and the LUN built on it could no longer be used.

II. Check the disks

The entire storage is unavailable because some disks went offline. After receiving the disks, we therefore inspected all of them physically and found no physical fault. We then scanned every disk with a bad-sector detection tool and found no bad sectors.

III. Back up the data

To keep the data safe and recoverable, all source data must be backed up before recovery begins, in case a later step goes wrong and the data can no longer be recovered. We use WinHex to image every disk to a file. Because the source disks use 520-byte sectors, a special tool is also needed to convert all of the backed-up images from 520-byte sectors to 512-byte sectors.
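The 520-to-512 conversion can be sketched in a few lines of Python. This is a minimal illustration, not the special tool the article refers to. It assumes that each 520-byte sector holds 512 bytes of user data followed by 8 bytes of controller metadata; the function name and the file names in the usage note are hypothetical.

```python
SRC_SECTOR = 520  # sector size of the source disks, as stated in the article
DST_SECTOR = 512  # sector size that ordinary analysis tools expect
# Assumption: the 8 extra bytes sit at the END of each 520-byte sector.


def strip_sector_metadata(src, dst, sectors_per_read=4096):
    """Copy src to dst, keeping the first 512 bytes of every 520-byte sector.

    src and dst are binary file objects. Returns the number of sectors
    written. A trailing partial sector raises ValueError, because it means
    the image is truncated or is not made of 520-byte sectors.
    """
    written = 0
    while True:
        chunk = src.read(SRC_SECTOR * sectors_per_read)
        if not chunk:
            return written
        if len(chunk) % SRC_SECTOR:
            raise ValueError("image size is not a multiple of 520 bytes")
        for off in range(0, len(chunk), SRC_SECTOR):
            dst.write(chunk[off:off + DST_SECTOR])
            written += 1
```

A typical call would open a backed-up image such as `disk00_520.img` for reading and a new file such as `disk00_512.img` for writing, always working on the copy and never on the source disk.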

IV. Fault analysis and recovery process

1. Analyze the cause of the failure

Since the first two steps found neither a physical fault nor bad sectors, we infer that the failure was caused by unstable read/write behavior on some disks. The EMC controller applies a strict disk-checking policy: once a disk's performance becomes unstable, the controller treats it as a bad disk and kicks it out of the RAID group. Once the number of dropped disks exceeds what the RAID level can tolerate, the RAID group becomes unavailable, and so does every LUN built on it. Our preliminary understanding is that only one LUN is based on this RAID group; it is allocated to a SUN minicomputer, and the file system on top of it is ZFS.

2. Analyze the structure of the RAID group

LUNs on EMC storage are built on RAID groups, so we first analyze the underlying RAID group and then reconstruct the original RAID group from that information. Examining each disk shows that disk 8 and disk 11 contain no data at all. The management interface shows that disk 8 and disk 11 are both hot spares, and that the disk 8 hot spare replaced the failed disk 5. We can therefore conclude that although the disk 8 hot spare was activated successfully, the RAID5 group was still missing a disk, so no data was synchronized to disk 8. We then continue with the other 10 drives, analyzing how data is distributed across them, the size of the RAID stripes, and the order of the disks.
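Once the stripe size and disk order are known, a RAID5 group can be read "virtually" by mapping each logical chunk to a member disk and a stripe row. The sketch below shows that mapping for the common left-symmetric parity rotation. The article does not state which rotation this EMC array uses, so the layout, the function names, and the in-memory `members` list are all assumptions for illustration.

```python
def raid5_locate(chunk_index, n_disks):
    """Map a logical data chunk to its place in a left-symmetric RAID5.

    Returns (data_pos, row, parity_pos), where the positions index into
    the list of members in their analyzed order. Assumed layout: parity
    starts on the last member and moves one member to the left per row;
    data chunks start right after the parity chunk and wrap around.
    """
    data_per_row = n_disks - 1
    row, k = divmod(chunk_index, data_per_row)
    parity_pos = (n_disks - 1) - (row % n_disks)
    data_pos = (parity_pos + 1 + k) % n_disks
    return data_pos, row, parity_pos


def read_chunk(members, chunk_size, chunk_index):
    """Read one logical chunk from member images held as bytes-like objects."""
    data_pos, row, _ = raid5_locate(chunk_index, len(members))
    start = row * chunk_size
    return members[data_pos][start:start + chunk_size]
```

With four members, logical chunks 0 to 2 land on members 0 to 2 of row 0 with parity on member 3, and chunk 3 lands on member 3 of row 1 with parity on member 2. If the analyzed layout differs, only `raid5_locate` needs to change.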

3. Analyze the disks that dropped out of the RAID group

Using the RAID information analyzed above, we try to virtualize the original RAID group with the RAID virtualization program developed in-house by North Asia. Because two disks dropped out of the RAID group, we first need to determine the order in which they went offline. Careful analysis of the data on each disk shows that, within the same stripe, one disk's data differs markedly from the data on the other disks, so our preliminary judgment is that this disk went offline first. We then check that stripe with North Asia's in-house RAID check program and find that the result is best when this disk is excluded, which identifies it as the first disk to go offline.
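The idea behind such a check can be sketched as follows; this is not North Asia's RAID check program, and the function names are hypothetical. In RAID5, all chunks of a stripe row XOR to zero, so rows that fail this test were written after the first disk dropped out. The parity test alone is symmetric and does not say which member is stale, so each candidate is then excluded in turn and rebuilt from the remaining members, and the rebuilt data is judged for coherence.

```python
def xor_blocks(blocks):
    """XOR equally sized bytes-like objects together and return bytes."""
    size = len(blocks[0])
    acc = 0
    for block in blocks:
        if len(block) != size:
            raise ValueError("all blocks must have the same length")
        acc ^= int.from_bytes(block, "big")
    return acc.to_bytes(size, "big")


def inconsistent_rows(members, chunk_size):
    """Return the stripe rows whose chunks do not XOR to zero."""
    rows = min(len(m) for m in members) // chunk_size
    zero = bytes(chunk_size)
    bad = []
    for row in range(rows):
        start = row * chunk_size
        chunks = [m[start:start + chunk_size] for m in members]
        if xor_blocks(chunks) != zero:
            bad.append(row)
    return bad


def rebuild_member(members, excluded):
    """Recompute the member at index `excluded` from all other members."""
    others = [m for i, m in enumerate(members) if i != excluded]
    size = min(len(m) for m in others)
    return xor_blocks([m[:size] for m in others])
```

Converting whole buffers to integers keeps the sketch short; a real tool would process disk images in fixed-size windows instead of loading them into memory.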

4. Analyze the LUN information in the RAID group

Since the LUN is based on the RAID group, the RAID group is first reassembled from the information analyzed above. We then analyze how the LUN is allocated within the RAID group and the block map that the LUN uses. Because there is only one LUN, only one set of LUN information needs to be analyzed. Finally, the North Asia RAID recovery (datahf.net) program interprets the LUN's data map and exports all of the LUN's data based on it.
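Exporting a LUN from its block map amounts to copying the mapped regions of the reassembled RAID group in LUN order. The article does not describe the format of EMC's LUN map, so the sketch below assumes the map has already been decoded into a hypothetical list of `(offset, length)` extents in bytes; the function name is likewise an assumption.

```python
def export_lun(raid_image, extents, out, buf_size=1 << 20):
    """Copy (offset, length) extents from raid_image to out, in list order.

    raid_image is a seekable binary file object for the reassembled RAID
    group and out is a writable binary file object. Returns the number of
    bytes written. An extent that runs past the end of the image raises
    EOFError instead of silently producing a short LUN.
    """
    total = 0
    for offset, length in extents:
        raid_image.seek(offset)
        remaining = length
        while remaining:
            chunk = raid_image.read(min(buf_size, remaining))
            if not chunk:
                raise EOFError(f"extent at offset {offset} runs past the image end")
            out.write(chunk)
            remaining -= len(chunk)
            total += len(chunk)
    return total
```

Failing loudly on a short read matters here, because a silently truncated LUN would only surface later as file system errors.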

V. Interpret and repair the ZFS file system

1. Interpret the ZFS file system

The ZFS file system interpreter developed by North Asia data recovery (datahf.net) is used to interpret the exported LUN, but the program reports errors while interpreting some of the file system's metafiles. A development engineer was immediately assigned to debug the program and find the cause of the errors, and a file system engineer was assigned to check whether the program failed because it did not support this version of ZFS. After 7 hours of analysis and debugging, we found that some metafiles in the ZFS file system had been damaged when the storage suddenly went down, which is why the program could not interpret the file system.

2. Repair the ZFS file system

The analysis above establishes that some ZFS metafiles were damaged by the storage failure, so they must be repaired before the file system can be parsed normally. Analysis of the damaged metafiles shows that the storage went down while ZFS I/O was in progress, which left some metafiles not updated and damaged. We repaired these metafiles manually so that the ZFS file system could be parsed normally.
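The article does not say which metadata structures were damaged or how they were repaired. As one illustrative diagnostic for interrupted I/O, the sketch below scans a buffer for ZFS uberblocks and lists them by transaction group (txg), which shows the most recent states the pool committed before the crash. It relies on the ZFS uberblock magic number 0x00bab10c and on the uberblock starting with five 64-bit fields (magic, version, txg, GUID sum, timestamp). The 1 KiB slot size and the function name are assumptions.

```python
import struct

UB_MAGIC = 0x00BAB10C  # ZFS uberblock magic number
UB_SLOT = 1024         # assumption: 1 KiB uberblock slots
UB_HEADER = 40         # five unsigned 64-bit fields


def find_uberblocks(buf, base_offset=0):
    """Return uberblock candidates found in buf, newest txg first.

    Each entry is (offset, byte_order, version, txg, timestamp). Both byte
    orders are tried because ZFS writes in the host's native order, and
    the host's byte order is not assumed here.
    """
    found = []
    for off in range(0, len(buf) - UB_HEADER + 1, UB_SLOT):
        for fmt, name in ((">5Q", "big"), ("<5Q", "little")):
            magic, version, txg, _guid_sum, ts = struct.unpack_from(fmt, buf, off)
            if magic == UB_MAGIC:
                found.append((base_offset + off, name, version, txg, ts))
                break
    return sorted(found, key=lambda entry: entry[3], reverse=True)
```

If the newest uberblock points at metadata that cannot be read, an older txg is a natural place to look for a consistent state, although whether that applies to this case is not stated in the article.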

VI. Export all data

Use the program to parse the repaired ZFS file system, including all file nodes and the directory structure. Screenshots of some of the file directories follow:

VII. Verify the latest data

Because the data consists entirely of text files and DCM images, not much of a verification environment needs to be set up. The user's engineer selected some of the data for verification; all of the verification results were correct and the data was complete. Some of the verified files are shown below:

VIII. Conclusion of data recovery

Because the on-site environment was well preserved after the failure and no risky operations were performed on it, the later data recovery was made much easier. Although the recovery process ran into many technical obstacles, they were solved one by one. In the end, the data recovery was completed within the expected time, and the recovered data was checked and accepted by the user.
