HP storage data recovery scheme: VxFS file system on LVM after two RAID5 hard disks went offline

2026-02-11 Update From: SLTechnology News&Howtos


Fault description

An HP FC MSA2000 storage array failed after two hard disks in its RAID5 array were damaged and went offline. Only one hot spare disk was activated successfully, so the RAID5 array went down and the LUNs built on it could no longer be used. The user contacted North Asia Data. The storage consists of eight 450GB SAS hard disks: seven form the RAID5 array, and the remaining one serves as the hot spare.

Because the storage became unavailable only after member disks of the RAID array went offline, every disk was physically inspected on arrival, and no physical fault was found. A bad-sector detection tool was then run on each disk, and no bad sectors were found either.

Solution:

1. Back up the data

For the safety and recoverability of the data, all source data must be backed up before recovery begins, in case a later operation makes the data unrecoverable. Use the dd command or the WinHex tool to image every disk to a file. (Figure: backup of part of the data.)
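The imaging step can be sketched in Python as a chunked copy that also records a checksum, so a later copy of the image can be compared against the first one. This is a minimal sketch, not the tooling used in this case: `image_disk` and its parameters are hypothetical names, and it assumes the source device reads without errors (which matches the clean bad-sector scan described above).

```python
import hashlib


def image_disk(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src_path to dst_path chunk by chunk.

    Returns (bytes_copied, sha256_hex) so the image can be verified later.
    Assumption: the source reads cleanly; no bad-sector handling is done here.
    """
    digest = hashlib.sha256()
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # end of the source device or file
                break
            dst.write(chunk)
            digest.update(chunk)
            copied += len(chunk)
    return copied, digest.hexdigest()
```

All later analysis should then work on the image files only, leaving the original disks untouched.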

2. Analyze the cause of the fault

Since neither the physical inspection nor the bad-sector scan found a problem, the failure was most likely caused by unstable reads and writes on some disks. The HP MSA2000 controller applies a strict disk health policy: if a disk performs unstably, the controller treats it as failed and drops it from the RAID group. Once the number of dropped disks exceeds what the RAID level can tolerate, the RAID group becomes unavailable, and so do the LUNs built on it. The initial findings are that six LUNs are built on the RAID group and are all allocated to HP-UX minicomputers. The upper layer is LVM logical volumes, and the important data is an Oracle database and an OA server.

3. Analyze RAID group structure

On the HP MSA2000, LUNs are built on RAID groups, so the underlying RAID group must be analyzed first and then reconstructed from the information obtained. Examination of each disk showed that the data on disk 4 differed from the data on the other disks, so disk 4 was tentatively identified as the hot spare. The remaining data disks were then analyzed: from the distribution of Oracle database pages across the disks, the key RAID parameters were derived, including the stripe size, the disk order, and the direction of the data layout.
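One way to read a stripe size off a single member disk is to count how many consecutively numbered database pages sit next to each other before the numbering jumps. The sketch below assumes, purely for illustration, fixed-size pages that each carry a big-endian 32-bit page number at byte offset 4. The article does not describe the actual page layout or tools, and `page_numbers` and `estimate_stripe_pages` are hypothetical helper names.

```python
import struct
from collections import Counter

PAGE_SIZE = 8192  # assumed database page size, not stated in the article


def page_numbers(image, page_size=PAGE_SIZE):
    """Yield the assumed 32-bit big-endian page number at offset 4 of each page."""
    for off in range(0, len(image) - page_size + 1, page_size):
        yield struct.unpack_from(">I", image, off + 4)[0]


def estimate_stripe_pages(image, page_size=PAGE_SIZE):
    """Return the run length of consecutive page numbers that covers the most pages.

    Runs are weighted by the pages they cover, so short noise runs
    (parity stripes, non-database areas) do not outvote real data stripes.
    """
    coverage = Counter()
    run = 0
    prev = None
    for num in page_numbers(image, page_size):
        if prev is not None and num == prev + 1:
            run += 1
        else:
            if run:
                coverage[run] += run  # close the previous run
            run = 1
        prev = num
    if run:
        coverage[run] += run  # close the final run
    return coverage.most_common(1)[0][0] if coverage else 0
```

Under these assumptions the stripe size in bytes is the returned run length multiplied by the page size. Comparing where the page numbering continues from one disk to the next would then suggest the disk order.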

4. Analyze the disks dropped from the RAID group

Using the RAID information obtained above, the original RAID group was virtualized with a RAID virtualization program developed in-house by North Asia. Because two disks had dropped out of the RAID group, the order in which they dropped also had to be determined. Careful comparison of the data on each disk showed that, within the same stripe, one disk held data clearly different from the others, so that disk was tentatively judged to be the first one to go offline. A RAID verification program developed in-house by North Asia then confirmed that the data obtained with this disk excluded was the most consistent, which identified it as the first disk to drop.
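The idea of "exclude one disk and see which result looks best" can be sketched with RAID5 XOR parity: any one member of a stripe row can be rebuilt from the others, and a rebuild that includes a stale member produces garbage. This is a simplified sketch under stated assumptions, not North Asia's verification program: each row is a list of equally sized blocks (one per member, parity included), and `looks_valid` is a hypothetical caller-supplied check, for example a database page sanity test.

```python
def xor_blocks(blocks):
    """XOR equally sized byte blocks together."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    acc = 0
    for b in blocks:
        acc ^= int.from_bytes(b, "big")
    return acc.to_bytes(length, "big")


def parity_ok(row):
    """A consistent RAID5 row XORs to all zero bytes."""
    return not any(xor_blocks(row))


def rebuild_member(row, missing):
    """Recompute the block at index `missing` from all other members."""
    return xor_blocks([b for i, b in enumerate(row) if i != missing])


def likely_stale_member(rows, looks_valid):
    """Score each member by how many valid blocks appear when it is excluded.

    Returns (best_index, scores). Excluding the stale member rebuilds its
    current content from up-to-date members, so it should score highest.
    """
    members = len(rows[0])
    scores = [
        sum(1 for row in rows if looks_valid(rebuild_member(row, i)))
        for i in range(members)
    ]
    return max(range(members), key=scores.__getitem__), scores
```

A parity mismatch alone only shows that some member in the row is out of date. It is the content check on the rebuilt data that points at which one.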

5. Analyze LUN information in RAID group

Since the LUNs are built on the RAID group, the latest state of the RAID group was first virtualized from the information obtained above. The allocation of LUNs within the RAID group and the block map of each LUN were then analyzed. With six LUNs on the RAID group, it was enough to extract the block distribution map of each LUN. A program was then written to parse the data maps of all LUNs and export each LUN's data according to its map.
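Exporting a LUN from its block map amounts to reading the mapped ranges from the virtualized RAID image in LUN order and writing them out back to back. The sketch below assumes a hypothetical map format, a list of `(offset, length)` byte ranges; the MSA2000's real on-disk map format is not described in the article, and `export_lun` is an illustrative name.

```python
def export_lun(raid_image_path, extents, out_path, chunk_size=1024 * 1024):
    """Write a LUN image by concatenating mapped ranges of the RAID image.

    extents: list of (offset, length) byte ranges, already in LUN order
             (hypothetical map format used for illustration).
    Returns the number of bytes written.
    """
    written = 0
    with open(raid_image_path, "rb") as raid, open(out_path, "wb") as out:
        for offset, length in extents:
            raid.seek(offset)
            remaining = length
            while remaining:
                chunk = raid.read(min(chunk_size, remaining))
                if not chunk:
                    raise EOFError(f"extent at offset {offset} runs past the image")
                out.write(chunk)
                remaining -= len(chunk)
                written += len(chunk)
    return written
```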

6. Parse LVM logical volumes

Analysis of the exported LUNs showed that every LUN contains HP-UX LVM logical volume information. Parsing the LVM information in each LUN revealed three LVM groups. The 45G group holds a single LV, which stores the OA server data. The 190G group also holds a single LV, which stores temporary backup data. The remaining four LUNs form one LVM group of about 2.1T, again with a single LV, which stores the Oracle database files. A program was written to interpret the LVM structures and parse the LV in each group, but the program reported errors.
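Interpreting an LV that spans several LUNs comes down to translating an offset inside the LV into a physical volume and an offset on it. The sketch below shows that translation for an extent-based layout. The extent table format, `data_start`, and `locate` are hypothetical and chosen for illustration; the real HP-UX LVM on-disk structures are not described in the article.

```python
def locate(lv_offset, extent_map, pe_size, data_start):
    """Translate a byte offset in an LV to (pv_index, byte offset on that PV).

    extent_map[i] = (pv_index, pe_index) for logical extent i (assumed format).
    pe_size:    physical extent size in bytes.
    data_start: byte offset on each PV where extent 0 begins (assumed equal
                on all PVs for simplicity).
    """
    logical_extent, within = divmod(lv_offset, pe_size)
    pv_index, pe_index = extent_map[logical_extent]
    return pv_index, data_start + pe_index * pe_size + within
```

For example, with 4-byte extents, `data_start=10`, and `extent_map=[(0, 0), (1, 0), (0, 1)]`, LV offset 5 falls in logical extent 1 and maps to PV 1 at offset 11.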

7. Repair LVM logical volumes

To find the cause of the error, development engineers debugged the program to locate where it failed, while senior file system engineers examined the recovered LUNs to check whether the LVM information had been damaged by the storage failure. Careful examination confirmed that the LVM information had indeed been corrupted when the storage went down. The damaged areas were repaired manually, and the program was modified in parallel so that the LVM logical volumes could be parsed again.

8. Parse the VxFS file system

An HP-UX environment was set up, the parsed LVs were mapped to it, and a mount of the file system was attempted. The mount failed. The "fsck -F vxfs" command was used to repair the VxFS file system, but the file system still could not be mounted afterward. This suggested that some of the underlying VxFS metadata was damaged and had to be repaired manually.

9. Repair the VxFS file system

The parsed LV was analyzed carefully, and the integrity of the file system was checked against the underlying structure of VxFS. The analysis confirmed a problem in the underlying VxFS file system: I/O operations were in progress when the storage went down, so some file system metadata files were not fully updated and were left damaged. These damaged metadata files were repaired manually so that the VxFS file system could be parsed normally. The repaired LV was then attached to the HP-UX minicomputer again, and this time the file system mounted successfully without errors.

10. Recover all user files

After the file system was mounted on the HP-UX machine, all user data was backed up to the designated disk space. The user data totals about 1.2TB. (Figure: screenshots of part of the file directory.)

11. Check whether the database files are complete

Each database file was checked for integrity with "dbv", Oracle's database file verification tool, and no errors were found. The files were then checked with a stricter Oracle database inspection tool developed in-house by North Asia, which found that some database files and log files failed its consistency checks. Senior database engineers repaired these files and verified them again until every file passed.
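Running "dbv" over many data files is easy to script. The sketch below only builds the command lines, using the utility's `file`, `blocksize`, and `logfile` parameters. The 8192-byte block size, the `*.dbf` naming pattern, and the helper name `dbv_commands` are assumptions for illustration and should be adjusted to the actual database.

```python
from pathlib import Path


def dbv_commands(data_dir, blocksize=8192, pattern="*.dbf"):
    """Build one dbv command line per data file found in data_dir.

    Assumptions: data files match `pattern` and share one block size.
    Each command writes its report to '<datafile>.log'.
    """
    commands = []
    for path in sorted(Path(data_dir).glob(pattern)):
        commands.append([
            "dbv",
            f"file={path}",
            f"blocksize={blocksize}",
            f"logfile={path}.log",
        ])
    return commands
```

On a host with the Oracle utilities installed, each command could be passed to `subprocess.run`, after which the log files can be reviewed for pages reported as failing.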

12. Start the Oracle database

The HP-UX environment we provided did not have this version of the Oracle software, so we arranged with the user to bring the original production server to the North Asia Data Recovery Center. The recovered Oracle database was attached to that HP-UX production server, and the database started successfully. (Figure: screenshots of the database startup.)

13. Data validation

With the user's cooperation, the Oracle database and the OA server were started, and the OA client was installed on a local laptop. The latest records and the historical records were verified through the OA client, and the user arranged for staff from different departments to verify the data remotely. The final verification showed that the data was correct and complete, and the recovery was successful.

Because the site was left in good condition after the fault occurred and no risky operations were attempted, the later recovery work was much easier. Many technical obstacles came up during the recovery, but they were resolved one by one. The entire recovery was completed within the expected time, and the user was quite satisfied with the recovered data. All services, including the Oracle database and the OA server, started normally.