EMC FC AX-4 storage crash: a RAID data recovery case

Updated 2026-02-12. From: SLTechnology News&Howtos

I. Fault description

An EMC FC AX-4 storage array at a hospital in Beijing crashed, taking its RAID group offline. After receiving the customer's call, the North Asia Data Recovery Center immediately sent an engineer to the customer site with a server. The initial inspection showed that the storage consists of twelve 1 TB SATA hard drives, two of which are hot spares. Two drives in the array had dropped out, but only one hot spare was activated successfully, so the RAID5 group went offline and the LUN built on it became unusable. The engineer first inspected all the disks and found no physical faults, then ran a bad-sector detection tool, which found no bad sectors.

II. Data backup

Because of the nature of data recovery work, all original data must be backed up before any recovery is attempted. In this RAID5 recovery, WinHex was used to image every disk to a file. Because the source disks use 520-byte sectors, a special tool was also needed to convert all the backed-up images from 520-byte sectors to 512-byte sectors.
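
The sector conversion described above can be sketched as follows. This is only an illustration, not the tool used in this case: it assumes that the 8 extra bytes sit at the end of each 520-byte sector, which the article does not state, and the function name `strip_520_to_512` is made up for this example.

```python
import io

SRC_SECTOR = 520  # sector size of the source disks, as reported in this case
DST_SECTOR = 512  # sector size expected by ordinary analysis tools
# Assumption: the 8 extra bytes are appended at the END of each 520-byte sector.


def strip_520_to_512(src, dst, sectors_per_chunk=4096):
    """Copy src to dst, keeping the first 512 bytes of every 520-byte sector.

    Returns the number of sectors converted.
    """
    chunk_bytes = SRC_SECTOR * sectors_per_chunk
    converted = 0
    while True:
        chunk = src.read(chunk_bytes)
        if not chunk:
            break
        if len(chunk) % SRC_SECTOR:
            raise ValueError("image length is not a multiple of 520 bytes")
        for off in range(0, len(chunk), SRC_SECTOR):
            dst.write(chunk[off:off + DST_SECTOR])
            converted += 1
    return converted
```

With real images, `src` and `dst` would be files opened in binary mode (`"rb"` and `"wb"`), and the conversion should only ever run on a copy of the image, never on the original disk.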

III. Fault analysis and recovery process

1. Analyze the cause of the fault. Since the disks showed no physical faults or bad sectors, the failure was most likely caused by unstable reads and writes on some disks. The EMC controller applies a very strict disk-checking policy: a disk whose performance is unstable is treated as bad and kicked out of the RAID group. Once the number of dropped disks exceeds what the RAID level can tolerate, the RAID group becomes unavailable, and so does any LUN built on it. In this case only one LUN was assigned to a Sun host, and the file system on top of it is ZFS.

2. Analyze the structure of the RAID group. LUNs on EMC storage are built on RAID groups, so the underlying RAID group must be analyzed first and then reconstructed from that information. Analysis of all the disk images showed that disk 8 and disk 11 contain no data at all. The management interface shows that both are hot spares, and that disk 8 had replaced the failed disk 5. It can therefore be concluded that, although hot spare disk 8 was activated successfully, the RAID5 group was already missing another disk, so the data could not be rebuilt onto disk 8. The remaining 10 disks were then analyzed to determine the data distribution, the RAID stripe size, and the disk order.
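
Once the stripe size and disk order are known, each logical chunk of the RAID group can be mapped to a member disk. The sketch below shows such a mapping for a left-symmetric RAID5 layout. The layout is an assumption made purely for illustration: the article does not say which parity rotation the AX-4 uses, and the real layout has to come from the analysis described above.

```python
def raid5_left_symmetric(chunk_index, n_disks):
    """Map a logical data chunk to (disk, row, parity_disk).

    Assumes a left-symmetric layout: parity starts on the last disk in
    row 0 and moves one disk to the left on each following row; data
    chunks start on the disk after the parity disk and wrap around.
    """
    if n_disks < 3:
        raise ValueError("RAID5 needs at least 3 disks")
    data_per_row = n_disks - 1
    row = chunk_index // data_per_row       # stripe row holding this chunk
    pos = chunk_index % data_per_row        # position among the row's data chunks
    parity_disk = (n_disks - 1) - (row % n_disks)
    disk = (parity_disk + 1 + pos) % n_disks
    return disk, row, parity_disk
```

The byte offset of the chunk on its member disk is then `row * chunk_size` plus whatever reserved area precedes the data on each disk, both of which must also be determined from the disk images.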

3. Analyze the order in which disks dropped out of the RAID group. Using the RAID information obtained above, the original RAID group was virtualized with a RAID virtualization program developed in-house by North Asia. Because two disks had dropped out of the group, the order in which they went offline had to be determined. Careful comparison of the disks showed that, within the same stripe, the data on one disk was clearly different from that on the others, so this disk was provisionally judged to be the first to go offline. Checking the stripe with North Asia's in-house RAID check program confirmed that the result was best when this disk's data was excluded, which identified it as the first disk to go offline.
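
North Asia's check program is not public, but the idea behind a RAID5 stripe check is simple: the XOR of all member blocks in a stripe row is zero when the row is consistent. The following minimal sketch counts inconsistent rows and recomputes one member from the others. All function names are illustrative. Note that a failed XOR test only shows that some member is stale; deciding which one still requires inspecting the content, as described above.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    length = len(blocks[0])
    acc = 0
    for block in blocks:
        if len(block) != length:
            raise ValueError("all blocks must have the same length")
        acc ^= int.from_bytes(block, "big")
    return acc.to_bytes(length, "big")


def parity_ok(row):
    """A RAID5 row is consistent when data and parity XOR to all zeros."""
    return not any(xor_blocks(row))


def count_bad_rows(rows):
    """Count stripe rows whose members fail the XOR test."""
    return sum(1 for row in rows if not parity_ok(row))


def rebuild_member(row, excluded):
    """Recompute the block of one member from all the other members."""
    return xor_blocks([b for i, b in enumerate(row) if i != excluded])
```

Here each `row` is a list holding one block per member disk at the same stripe row. Once the stale disk has been identified, `rebuild_member` yields the block it should have held.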

4. Analyze the LUN information in the RAID group. Because the LUN is built on the RAID group, the RAID group was first reassembled from the information above. The LUN allocation information within the RAID group and the block map allocated to the LUN were then analyzed. Since there is only one LUN, only one set of LUN information had to be analyzed. North Asia's RAID recovery program was then used to interpret the LUN's data map and export all of the LUN's data.
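
Exporting a LUN from a block map amounts to copying extents from the reassembled RAID group into a flat image. The sketch below assumes the map has already been decoded into a list of extents. The `Extent` type and its fields are hypothetical, since the article does not describe EMC's on-disk map format.

```python
from typing import BinaryIO, Iterable, NamedTuple


class Extent(NamedTuple):
    """Hypothetical decoded map entry: where a piece of the LUN lives."""
    lun_offset: int   # byte offset inside the exported LUN image
    raid_offset: int  # byte offset inside the virtual RAID group
    length: int       # extent length in bytes


def export_lun(raid: BinaryIO, extents: Iterable[Extent], out: BinaryIO,
               bufsize: int = 1 << 20) -> None:
    """Copy every extent from the RAID group image into the LUN image."""
    for ext in sorted(extents):  # NamedTuple order: by lun_offset first
        raid.seek(ext.raid_offset)
        out.seek(ext.lun_offset)
        remaining = ext.length
        while remaining:
            data = raid.read(min(bufsize, remaining))
            if not data:
                raise EOFError("extent runs past the end of the RAID image")
            out.write(data)
            remaining -= len(data)
```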

IV. Interpreting and repairing the ZFS file system

1. Interpret the ZFS file system. The exported LUN was parsed with a ZFS file system interpreter developed by North Asia Data Recovery, and the program reported errors on some of the file system's metadata files. A development engineer was assigned to debug the program and find the cause of the errors, and a file system engineer was assigned to check whether the program failed because it did not support this ZFS version. After 7 hours of analysis and debugging, it was found that some ZFS metadata files had been damaged by the sudden storage failure, which is why the program could not interpret the file system.
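
One quick way to check which ZFS version and transaction groups an image contains is to read the uberblocks in a vdev label. The sketch below is not North Asia's interpreter. It relies on the published ZFS on-disk layout (256 KiB labels, an uberblock array starting 128 KiB into the label, magic number 0x00bab10c) and assumes 1 KiB uberblock slots, which holds only for small sector sizes. The `label_offset` parameter is an assumption too: if the LUN carries a partition table, the first label does not start at byte 0.

```python
import struct

UB_MAGIC = 0x00BAB10C           # ZFS uberblock magic number
LABEL_SIZE = 256 * 1024         # size of one vdev label
UB_ARRAY_OFFSET = 128 * 1024    # uberblock array offset inside the label
UB_SLOT = 1024                  # assumption: 1 KiB slots (ashift <= 10)


def read_uberblocks(image, label_offset=0):
    """Return the valid uberblocks of one label, newest txg first."""
    image.seek(label_offset + UB_ARRAY_OFFSET)
    ring = image.read(LABEL_SIZE - UB_ARRAY_OFFSET)
    found = []
    for off in range(0, len(ring) - 39, UB_SLOT):
        # ZFS writes in the host's native byte order, so try both.
        for endian in ("<", ">"):
            magic, version, txg, _guid_sum, timestamp = struct.unpack_from(
                endian + "5Q", ring, off)
            if magic == UB_MAGIC:
                found.append({"slot": off // UB_SLOT, "version": version,
                              "txg": txg, "timestamp": timestamp,
                              "endian": endian})
                break
    return sorted(found, key=lambda ub: ub["txg"], reverse=True)
```

If no uberblock is found at the expected place, the label offset or slot size assumption is probably wrong for this pool. If uberblocks are found and the version is one the parser supports, the problem more likely lies in the metadata itself.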

2. Repair the ZFS file system. The analysis above established that some ZFS metadata files were damaged when the storage failed, so they had to be repaired before the file system could be parsed normally. Analysis of the damaged metadata files showed that the storage went down while I/O was in progress, leaving some of them only partially updated and therefore corrupt. These files were repaired manually so that the ZFS file system could be parsed normally.

V. Export and verify data

The program was then used to parse the repaired ZFS file system, including all file nodes and the directory structure. Because the data consisted entirely of text files and DCM images, verification did not require setting up a complex environment. The customer's engineer selected a sample of the data to verify; every checked item was correct and the data was complete.
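
Before manual spot checks, a rough automated pass can flag exported DCM files that are obviously broken. A standard DICOM file starts with a 128-byte preamble followed by the four bytes `DICM`. The sketch below tests only that signature and is not the verification performed in this case. It assumes the files carry a lowercase `.dcm` extension. Some older DICOM files have no preamble, so a miss means "review by hand", not "corrupt".

```python
from pathlib import Path


def looks_like_dicom(path):
    """True if the file has the 128-byte preamble followed by b'DICM'."""
    with open(path, "rb") as f:
        header = f.read(132)
    return len(header) == 132 and header[128:132] == b"DICM"


def summarize(root, pattern="*.dcm"):
    """Split the files under root into those with and without the signature."""
    ok, suspect = [], []
    for path in sorted(Path(root).rglob(pattern)):
        (ok if looks_like_dicom(path) else suspect).append(path)
    return ok, suspect
```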

VI. Conclusion of data recovery

Because the on-site environment was well preserved after the failure and no risky operations were attempted, the later recovery work was much easier. Many technical obstacles came up during the recovery, but they were resolved one by one. The entire recovery was completed within the expected time, and the customer was satisfied with the recovered data.
