Data recovery process for a failed HP FC MSA2000 server

Updated 2026-02-14. From: SLTechnology News&Howtos

Description of the server failure

A company's HP FC MSA2000 server was configured with a RAID5 array. During normal use two hard drives failed and went offline, but only one hot spare was activated successfully. As a result the RAID5 array went down and the upper-layer LUNs became unusable. The user contacted the data recovery center to recover the server data.

The whole storage system was unavailable because some disks in the RAID array had gone offline. After the disks were received, every disk was checked physically, and no physical fault was found. A bad-track detection tool was then run against each disk, and no bad tracks were found.

Server data recovery process:

1. Back up the server data

To keep the source data safe and make every later step reversible, back up all source data before starting the recovery, in case a later operation leaves the data unrecoverable. Use the dd command or the WinHex tool to image every disk to a file. The following figure shows part of the backup:

Figure 1:
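The imaging step can be sketched in a few lines of Python. This is only an illustration of the idea behind dd, not the tool the engineers used: the function name `image_disk`, the 1 MiB chunk size and the SHA-256 digest are assumptions added here so that the image can be checked against the source later.

```python
import hashlib

CHUNK = 1024 * 1024  # assumed read size: 1 MiB per request


def image_disk(src_path, dst_path, chunk_size=CHUNK):
    """Copy src_path to dst_path byte for byte.

    Returns (bytes_copied, sha256_hex) so the image can be verified later.
    The source is opened read-only and is never written to.
    """
    digest = hashlib.sha256()
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(chunk_size)
            if not block:  # end of the source device or file
                break
            dst.write(block)
            digest.update(block)
            copied += len(block)
    return copied, digest.hexdigest()
```

All later analysis would then run against the image files, so the original disks stay untouched.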

2. Analyze the cause of server failure

Since the physical inspection and the bad-track scan found nothing, the failure was probably caused by unstable reads and writes on some disks. The HP MSA2000 controller applies a strict disk-checking policy: once the performance of a disk becomes unstable, the controller treats it as a bad disk and kicks it out of the RAID group. Once the number of disks dropped from a RAID group exceeds what the RAID level can tolerate, the RAID group becomes unavailable, and so do the upper-layer LUNs built on it. The preliminary finding is that there are 6 LUNs built on the RAID group, all assigned to an HP-UX minicomputer. LVM logical volumes are built on top of them, and the important data are an Oracle database and an OA server.

3. Analyze the server RAID group structure

The LUNs on the HP MSA2000 are built on a RAID group, so first analyze the underlying RAID group and then reconstruct the original RAID group from that information. Analysis of each data disk shows that the data on disk 4 differs from that on the other data disks, so it is provisionally treated as the hot spare disk. Then analyze the remaining data disks: examine how the Oracle database pages are distributed across the disks and, from that distribution, derive the key parameters of the RAID group, such as stripe size, disk order and data layout direction.
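Once stripe size, disk order and layout direction are known, a logical offset can be mapped back to a member disk. The sketch below assumes a left-symmetric RAID5 layout, which the article does not confirm for the MSA2000; the function names `locate_chunk` and `read_logical` are hypothetical, and `images` is a list of disk image files opened in the recovered disk order.

```python
def locate_chunk(chunk_index, n_disks):
    """Map a logical chunk number to (data_disk, row, parity_disk).

    Assumes left-symmetric RAID5: parity starts on the last disk and moves
    one disk to the left per row; data starts right after the parity disk.
    """
    if n_disks < 3:
        raise ValueError("RAID5 needs at least 3 disks")
    row, pos = divmod(chunk_index, n_disks - 1)
    parity_disk = (n_disks - 1) - (row % n_disks)
    data_disk = (parity_disk + 1 + pos) % n_disks
    return data_disk, row, parity_disk


def read_logical(images, stripe_size, offset, length):
    """Read `length` bytes at logical `offset` from the virtual RAID5 volume."""
    out = bytearray()
    while length > 0:
        chunk_index, within = divmod(offset, stripe_size)
        disk, row, _ = locate_chunk(chunk_index, len(images))
        take = min(length, stripe_size - within)
        images[disk].seek(row * stripe_size + within)
        data = images[disk].read(take)
        if len(data) != take:
            raise EOFError("disk image is shorter than expected")
        out += data
        offset += take
        length -= take
    return bytes(out)
```

A wrong guess for any parameter shows up quickly: database pages read through the virtual volume stop lining up on page boundaries.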

4. Determine which disk dropped out of the RAID group first

Using the RAID information obtained above, try to virtualize the original RAID group with the RAID virtualization program developed in-house by North Asia. Because two disks dropped out of the RAID group in total, the order in which they went offline has to be determined. Careful analysis of the data on each disk shows that one disk holds data that clearly differs from the other disks within the same stripe, so this disk is provisionally judged to be the first to go offline. Checking that stripe with the RAID verification program developed in-house by North Asia shows that the result is best when this disk's data is excluded, which confirms it as the first disk to go offline.
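The check described above relies on a RAID5 property: within one stripe row, the blocks of all current members XOR to zero. North Asia's verification program is not public, so the following is only a minimal sketch of a leave-one-out test under that property. The function names are hypothetical, and `images` is assumed to hold one more disk image than the array has members (for example, the stale disk plus the spare that replaced it).

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    length = len(blocks[0])
    acc = 0
    for block in blocks:
        if len(block) != length:
            raise ValueError("blocks must have equal length")
        acc ^= int.from_bytes(block, "big")
    return acc.to_bytes(length, "big")


def score_exclusions(images, stripe_size, rows):
    """For each disk, count the rows that are parity-consistent without it.

    The disk whose exclusion gives the highest count is the most likely
    stale member, i.e. the one that went offline first.
    """
    scores = {i: 0 for i in range(len(images))}
    for row in range(rows):
        blocks = []
        for img in images:
            img.seek(row * stripe_size)
            blocks.append(img.read(stripe_size))
        for excluded in scores:
            rest = blocks[:excluded] + blocks[excluded + 1:]
            if not any(xor_blocks(rest)):  # all-zero XOR means consistent
                scores[excluded] += 1
    return scores
```

Rows that were never rewritten after the first disk dropped stay consistent either way, so the scores only separate clearly on stripes that changed afterwards.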

5. Analyze the LUN information in the RAID group

Because the LUNs are built on the RAID group, first virtualize the latest state of the RAID group using the information analyzed above. Then analyze how the LUNs are allocated within the RAID group and the block map assigned to each LUN. Since there are six LUNs at the bottom layer, only the block distribution map of each LUN needs to be extracted. Then write a program based on this information to parse the data map of every LUN and export the data of all LUNs according to those maps.

Figure 2:
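The export step amounts to copying extents from the virtual RAID volume into one file per LUN. The MSA2000's on-disk map format is not described in the article, so the sketch below assumes the map has already been decoded into a list of hypothetical `Extent` records; `raid` and `out` are any seekable binary file objects.

```python
from collections import namedtuple

# Hypothetical decoded map entry: `length` bytes at `raid_offset` in the
# virtual RAID volume belong at `lun_offset` inside the LUN.
Extent = namedtuple("Extent", "lun_offset raid_offset length")


def export_lun(raid, extents, out, chunk_size=1024 * 1024):
    """Copy every extent of one LUN from `raid` into `out`.

    Returns the total number of bytes written.
    """
    total = 0
    for ext in sorted(extents, key=lambda e: e.lun_offset):
        raid.seek(ext.raid_offset)
        out.seek(ext.lun_offset)
        remaining = ext.length
        while remaining > 0:
            data = raid.read(min(chunk_size, remaining))
            if not data:
                raise EOFError("extent runs past the end of the RAID volume")
            out.write(data)
            remaining -= len(data)
            total += len(data)
    return total
```

Running it once per LUN with that LUN's extent list would produce six image files, matching the six LUNs found above.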

6. Parse the LVM logical volumes

Analysis of the exported LUNs shows that all of them contain HP-UX LVM logical volume information. Parsing the LVM information in each LUN reveals three sets of LVM: a 45 GB LVM with a single LV that stores the OA server data, a 190 GB LVM with a single LV that stores temporary backup data, and a 2.1 TB LVM made up of the remaining four LUNs, with a single LV that stores the Oracle database files. A program is written to interpret the LVM structures and the LV volumes in each set, but it reports an error.

7. Repair the LVM logical volume

Carefully analyze the cause of the program error: have a development engineer debug the program to locate the error, and have a senior file system engineer check the recovered LUNs to see whether the LVM information was damaged by the storage failure. Careful inspection confirms that the LVM information was indeed damaged by the storage failure. Repair the damaged areas manually, modify the program accordingly, and parse the LVM logical volumes again.

8. Parse the VxFS file system

Set up an HP-UX environment, map the parsed LV volumes to the HP-UX host, and try to mount the file system. The mount fails. An attempt is made to repair the VxFS file system with the "fsck -F vxfs" command, but the file system still cannot be mounted afterwards. Part of the underlying VxFS metadata is suspected to be damaged and needs to be repaired manually.

9. Repair the VxFS file system

Carefully analyze the parsed LVs and verify the integrity of the VxFS file system against its underlying on-disk structure. The analysis confirms that the underlying VxFS file system is damaged. When the storage went down, files in the file system were in the middle of I/O operations, so some file system metafiles were not updated and were left corrupted. Repair these damaged metafiles manually so that the VxFS file system can be parsed normally. Map the repaired LV volumes to the HP-UX machine again and try to mount the file system. This time no error is reported and the mount succeeds.

10. Restore all user files

After mounting the file system on the HP-UX machine, back up all user data to the designated disk space. The user data totals about 1.2 TB. Screenshots of some of the file directories follow:

Figure 3:

11. Check whether the database file is complete

Use the Oracle database file verification tool "dbv" to check whether each database file is complete. It finds no errors. Then run the Oracle database checking tool developed in-house by North Asia, which applies stricter checks. It finds that some database files and log files are inconsistent. Senior database engineers repair these files and verify them again until every file passes verification.
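Running dbv over many data files is easy to script. The sketch below only wraps the standard `file=`, `blocksize=` and `logfile=` parameters of dbv; the 8192-byte block size is an assumption (the article does not give the database block size), and the function names and the injectable `runner` argument are hypothetical conveniences.

```python
import subprocess


def dbv_command(datafile, blocksize=8192, logfile=None):
    """Build the argument list for one dbv run (block size is an assumption)."""
    cmd = ["dbv", f"file={datafile}", f"blocksize={blocksize}"]
    if logfile:
        cmd.append(f"logfile={logfile}")
    return cmd


def verify_datafiles(datafiles, blocksize=8192, runner=subprocess.run):
    """Run dbv on each data file and return {path: captured report text}.

    Both stdout and stderr are kept, so the report is captured whichever
    stream dbv writes it to.
    """
    reports = {}
    for path in datafiles:
        done = runner(dbv_command(path, blocksize),
                      capture_output=True, text=True)
        reports[path] = (done.stdout or "") + (done.stderr or "")
    return reports
```

The collected reports would still need to be read by an engineer. As the article notes, a clean dbv result did not rule out inconsistencies between data files and log files.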

12. Start the Oracle database

Because the HP-UX environment we provided does not have this version of Oracle, coordinate with the user to bring the original production environment to the data recovery center, then attach the recovered Oracle database to the HP-UX server from the original production environment. Try to start the Oracle database. It starts successfully. Some screenshots follow:

Figure 4:

13. Server data verification

With the user's cooperation, start the Oracle database and the OA server, and install the OA client on a local laptop. Verify the latest and the historical data records through the OA client. The user also arranges for staff from different departments to verify the data remotely. The final verification confirms that the data is correct and complete, and the data recovery is successful.

Because the on-site environment was left in good condition after the failure and no risky operations were performed on it, the later data recovery was much easier. Many technical difficulties came up during the recovery, but they were solved one by one. The data recovery for the whole server was completed within the expected time, and the user was satisfied with the recovered data. The Oracle database service, the OA server and the other services all start normally.
