Recovering data from a server RAID5 array with two hard disks offline (VxFS file system)

Updated 2025-02-28. From SLTechnology News&Howtos, Shulou (Shulou.com), 06/03.

Description of the server failure

The customer's server has eight 450 GB SAS hard drives: seven form a RAID5 array and one is a hot spare. Two drives in the array went offline, which brought down the RAID5 array and made the LUNs on top of it unusable. The drives have no physical faults and no bad sectors.

RAID data recovery process:

1. Back up the data

Use the dd command or a data recovery tool to image every disk to a file.
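The imaging step can be sketched in a few lines of Python. This is only an illustration, not the tool used in this case: the function name `image_disk`, the chunk size and the SHA-256 checksum are assumptions added here so that the copy can be verified later.

```python
import hashlib

def image_disk(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src_path to dst_path chunk by chunk.

    Returns (bytes_copied, sha256_hex) so the image can be verified later.
    Hypothetical helper: the article only says dd or a recovery tool was used.
    """
    digest = hashlib.sha256()
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # end of the source device or file
                break
            dst.write(chunk)
            digest.update(chunk)
            copied += len(chunk)
    return copied, digest.hexdigest()
```

All later analysis then runs against the image files, so the original disks are never written to.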

Figure 1:

2. Analyze the structure of the RAID group

The server's LUNs are built on top of the RAID group, so the underlying RAID group has to be analyzed first and then reconstructed from that information. The analysis shows that disk No. 4 is the hot spare. Next, examine how Oracle database pages are distributed across the disks to derive the key RAID parameters: stripe size, disk order and data direction.
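To show why stripe size, disk order and data direction matter, here is a minimal sketch of how a logical stripe unit maps to a member disk. It assumes a left-symmetric RAID5 layout and a 64 KiB stripe unit in the usage example; the article does not state which layout or stripe size this array actually used, and `locate_stripe_unit` is a hypothetical name.

```python
def locate_stripe_unit(logical_unit, n_disks, unit_size):
    """Map a logical stripe-unit number to its place on the member disks.

    Assumes a left-symmetric RAID5 layout (an assumption, not taken from
    the article). Returns (disk_index, byte_offset_on_disk, parity_disk).
    """
    data_per_row = n_disks - 1                # one unit per row holds parity
    row, k = divmod(logical_unit, data_per_row)
    # Parity starts on the last disk and moves one disk to the left per row.
    parity_disk = (n_disks - 1) - (row % n_disks)
    # Data units start on the disk right after parity and wrap around.
    disk = (parity_disk + 1 + k) % n_disks
    return disk, row * unit_size, parity_disk
```

For example, with seven members and a 64 KiB unit, `locate_stripe_unit(6, 7, 65536)` places the first data unit of the second row on disk 6 at byte offset 65536, with parity on disk 5. A wrong guess for any of the three parameters scatters database pages out of sequence, which is why the page distribution is a useful check.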

3. Identify the disk that dropped out of the RAID group first

Using the RAID parameters found above, try to assemble the original RAID group virtually with a RAID virtualization program. Examine the data on each disk in detail, check the stripes with the RAID verification program developed in-house by North Asia, and remove the disk that went offline first from the RAID group, because its data is stale.
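The stripe check rests on the RAID5 parity rule: the XOR of all units in one stripe row, parity included, is zero. The sketch below illustrates that rule and how a removed member can be recomputed from the others. It is not North Asia's verification program, and the function names are hypothetical.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    acc = 0
    for b in blocks:
        acc ^= int.from_bytes(b, "big")
    return acc.to_bytes(length, "big")

def stripe_is_consistent(blocks):
    """True if data units plus parity of one stripe row XOR to zero."""
    return not any(xor_blocks(blocks))

def rebuild_missing(surviving_blocks):
    """Recompute the one missing unit of a row from all the others."""
    return xor_blocks(surviving_blocks)
```

Rows that fail the check point to a member holding out-of-date data. Once that member is excluded, its units can be recomputed row by row from the remaining six disks.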

4. Analyze the LUN information in the RAID group

Because the LUNs are built on the RAID group, first assemble the latest state of the RAID group virtually from the information gathered above. Then analyze how the LUNs are allocated within the RAID group and extract the block map of each LUN. There are six LUNs in total, so only the block allocation map of each one needs to be extracted. Write a program based on this information to parse the data maps of all LUNs, then export the data of every LUN according to its map.
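Exporting a LUN from its block map can be sketched as follows. The on-disk format of the map is specific to the storage system and is not described in the article, so the `Extent` record and `export_lun` function below are hypothetical and assume the map has already been decoded into (LUN offset, RAID offset, length) triples.

```python
from typing import BinaryIO, Iterable, NamedTuple

class Extent(NamedTuple):
    lun_offset: int   # byte offset of the extent inside the LUN
    raid_offset: int  # byte offset of the same data in the virtual RAID image
    length: int       # extent length in bytes

def export_lun(raid: BinaryIO, extents: Iterable[Extent], out: BinaryIO) -> int:
    """Copy every mapped extent from the RAID image into the LUN image.

    Returns the number of bytes written. Extents are read whole here for
    brevity; large extents would need chunked copying in practice.
    """
    total = 0
    for ext in sorted(extents, key=lambda e: e.lun_offset):
        raid.seek(ext.raid_offset)
        data = raid.read(ext.length)
        if len(data) != ext.length:
            raise ValueError(f"short read at RAID offset {ext.raid_offset}")
        out.seek(ext.lun_offset)
        out.write(data)
        total += ext.length
    return total
```

Running this once per LUN would produce six image files, which are the input to the LVM analysis in the next step.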

Figure 2:

5. Parse the LVM logical volumes

Analysis of the exported LUNs shows that every LUN contains HP-UX LVM logical volume information. Parsing the LVM information in each LUN reveals three LVM volume groups: a 45 GB group with a single LV that stores the OA server data, a 190 GB group with a single LV that stores temporary backup data, and a 2.1 TB group made up of the remaining four LUNs, again with a single LV, which stores the Oracle database files. Write a program to parse the LVM metadata and try to extract the LV in each volume group; at this point the parsing program reports an error.

6. Repair the LVM logical volumes

Analyze the cause of the program error in detail: have a development engineer debug the program to locate the error, and have a senior file system engineer inspect the recovered LUNs to determine whether the LVM information was damaged when the storage went down. The inspection confirms that the LVM information was indeed damaged by the storage failure. Repair the damaged areas manually, update the program accordingly, and parse the LVM logical volumes again.

7. Parse the VxFS file system

Set up an HP-UX environment, map the extracted LVs to the HP-UX machine, and try to mount the file system. The mount fails. An attempt to repair the VxFS file system with the "fsck -F vxfs" command does not help: the file system still cannot be mounted afterwards. This suggests that part of the underlying VxFS metadata is damaged and has to be repaired manually.

8. Repair the VxFS file system

Analyze the extracted LVs in detail and verify the integrity of the VxFS file system against its on-disk structure. The analysis confirms that the underlying VxFS file system is damaged: when the storage went down, the file system was in the middle of I/O operations, so some of its metadata files were not updated or were corrupted. Repair these damaged metadata files manually so that the VxFS file system can be parsed normally. Then attach the repaired LVs to the HP-UX machine again and try to mount the file system. This time no error is reported and the mount succeeds.

9. Restore all user files

After mounting the file system on the HP-UX machine, back up all user data to the designated disk space. The user data totals about 1.2 TB. Screenshots of some of the file directories are as follows:

Figure 3:

10. Check whether the database files are complete

Use the Oracle database file verification tool "dbv" to check each database file for completeness; it finds no errors. Then run the stricter Oracle database verification tool developed in-house by North Asia, which finds that some database files and log files are inconsistent. Have senior database engineers repair these files and verify them again until every file passes.

11. Start the Oracle database

The HP-UX environment we provided does not have this version of Oracle, so we arranged with the user to bring the original production environment to the North Asia data recovery center and attached the recovered Oracle database to the HP-UX server from that environment. The Oracle database then starts successfully. Some screenshots are as follows:

Figure 4:

12. Data verification

With the user's cooperation, start the Oracle database, start the OA server, and install the OA client on a local laptop. Verify the latest and historical data records through the OA client, while the user arranges for staff from different departments to verify the data remotely. The verification confirms that the data is correct and complete, and the recovery is successful.

Because the on-site environment was left untouched after the failure and no risky operations were performed on it, the later recovery work was much easier. The recovery ran into a number of technical obstacles, but they were resolved one by one. The whole recovery was completed within the expected time, and the user was satisfied with the recovered data. The Oracle database service, the OA server and the other services all start normally.
