Fault recovery analysis for an IBM x3850 RAID5 server

2026-04-05 Update. From: SLTechnology News&Howtos (shulou)

Shulou(Shulou.com)06/02 Report--

[Basic information]

Server model: IBM x3850

Hard disk model: 73 GB SAS

Number of hard disks: 5 in total. Four of them form a RAID5 array and the fifth is a hot spare (Hot-Spare).

Operating system: Linux (Red Hat 5.3). The application is an OA (office automation) system built on Oracle.

[Fault symptoms]

Disk 3 went offline first, but the hot spare did not automatically start a rebuild (reason unknown). Disk 2 then went offline as well, and the RAID array crashed.

Because Oracle no longer provides follow-up support for this OA system, the user asked for both the data and the operating system to be recovered as far as possible.

[Preliminary findings]

The hot spare was never activated, the hard disks show no obvious physical failure, and there is no obvious sign that a synchronization (rebuild) took place. In this situation the data is usually recoverable.

[Recovery plan]

1. Protect the original environment: shut down the server and make sure it is not powered on again during the recovery process.

2. Label the failed hard disks with sequence numbers so that each one can be returned to its original slot after it has been removed.

3. Attach the failed hard disks to a read-only environment and make a complete image of every disk. Once the backup is finished, return the original disks; subsequent recovery work does not touch the original disks again until the data has been confirmed.

4. Analyze the RAID structure on the backup images to determine the original RAID level, stripe layout, stripe size, parity direction, metadata (META) area, and so on.

5. Build a virtual RAID5 environment from this RAID information.

6. Interpret the virtual disk and its file system.

7. Check whether the virtual structure is correct. If not, repeat steps 4-7.

8. After confirming that the data is correct, move the data back as the user requests. If the original disks are to be reused, first make sure they have been completely backed up, then rebuild the RAID and move the data back. To move the operating system back, you can boot from a Linux LiveCD or WinPE (usually not supported), or install a temporary operating system on another hard disk in the failed server, and then migrate the data back at the sector level.

9. After the data is handed over, our data recovery center keeps a copy for 3 days in case a problem was overlooked.
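Steps 4 to 7 of the plan amount to reading the member images through a software model of the array. The sketch below is a minimal illustration, not the tool used in this case: it assumes that "backward parity" means parity sits on the last member for stripe 0 and moves one member toward the first with each stripe, with data blocks filling the remaining members in ascending order. The function names, the toy block size, and the in-memory byte strings standing in for disk images are all hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte strings, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_member(stripe, n_disks):
    """ASSUMED rotation: last member for stripe 0, one member earlier per stripe."""
    return (n_disks - 1) - (stripe % n_disks)


def locate(logical_block, n_disks):
    """Map a logical block number to (stripe, member index)."""
    stripe, k = divmod(logical_block, n_disks - 1)
    p = parity_member(stripe, n_disks)
    # Data blocks skip over the member that holds parity for this stripe.
    return stripe, (k if k < p else k + 1)


def read_logical(members, logical_block, block_size):
    """Read one logical block; a member given as None is rebuilt by XOR."""
    stripe, disk = locate(logical_block, len(members))
    span = slice(stripe * block_size, (stripe + 1) * block_size)
    if members[disk] is not None:
        return members[disk][span]
    # Missing member: XOR of all surviving blocks in the same stripe.
    return xor_blocks([m[span] for m in members if m is not None])


def build_members(data, n_disks, block_size):
    """Test helper: lay data out on n_disks members and fill in parity."""
    n_blocks = len(data) // block_size
    n_stripes = n_blocks // (n_disks - 1)
    disks = [bytearray(n_stripes * block_size) for _ in range(n_disks)]
    for lb in range(n_blocks):
        stripe, disk = locate(lb, n_disks)
        span = slice(stripe * block_size, (stripe + 1) * block_size)
        disks[disk][span] = data[lb * block_size:(lb + 1) * block_size]
    for stripe in range(n_stripes):
        p = parity_member(stripe, n_disks)
        span = slice(stripe * block_size, (stripe + 1) * block_size)
        disks[p][span] = xor_blocks([disks[d][span] for d in range(n_disks) if d != p])
    return [bytes(d) for d in disks]


BLOCK = 4                      # toy block size in bytes, for illustration only
data = bytes(range(48))        # 12 logical blocks, i.e. 4 stripes on 4 members
members = build_members(data, 4, BLOCK)
degraded = members[:3] + [None]            # treat the last member as missing
restored = b"".join(
    read_logical(degraded, i, BLOCK) for i in range(len(data) // BLOCK)
)
print(restored == data)
```

If the assumed rotation or disk order is wrong, the reassembled data will not parse as a file system, which is why step 7 loops back to step 4.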

[Estimated timeline]

Backup time: about 2 hours

Time to interpret and export the data: about 4 hours

Operating system migration: about 4 hours
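The backup step (step 3 of the plan) has to cope with unreadable sectors without aborting. The following is a rough sketch of that idea under stated assumptions: it copies in large chunks, falls back to sector-by-sector reads when a chunk fails, zero-fills what cannot be read, and records the bad sector numbers. `image_disk` and `FlakyDisk` are hypothetical names; `FlakyDisk` only simulates a failing drive so the example can run without hardware, and it is not the imaging tool used in this case.

```python
import io

SECTOR = 512  # bytes per sector


class FlakyDisk:
    """Hypothetical stand-in for a failing drive: any read that touches
    a bad sector raises OSError, as a real device read might."""

    def __init__(self, data, bad_sectors):
        self._buf = io.BytesIO(data)
        self._bad = set(bad_sectors)

    def seek(self, offset):
        self._buf.seek(offset)

    def read(self, size):
        first = self._buf.tell() // SECTOR
        last = (self._buf.tell() + size - 1) // SECTOR
        if any(s in self._bad for s in range(first, last + 1)):
            raise OSError("unreadable sector")
        return self._buf.read(size)


def read_exact(src, sector, count):
    """Read count sectors starting at sector; raise OSError on a short read."""
    src.seek(sector * SECTOR)
    buf = src.read(count * SECTOR)
    if len(buf) != count * SECTOR:
        raise OSError("short read")
    return buf


def image_disk(src, dst, total_sectors, chunk_sectors=128):
    """Copy src to dst. Unreadable sectors are zero-filled in dst and
    their sector numbers are returned so they can be examined later."""
    bad = []
    sector = 0
    while sector < total_sectors:
        count = min(chunk_sectors, total_sectors - sector)
        try:
            dst.write(read_exact(src, sector, count))
        except OSError:
            # Chunk failed: retry one sector at a time to isolate the damage.
            for s in range(sector, sector + count):
                try:
                    buf = read_exact(src, s, 1)
                except OSError:
                    buf = bytes(SECTOR)
                    bad.append(s)
                dst.write(buf)
        sector += count
    return bad


# Simulated 300-sector disk in which sectors 5 and 200 cannot be read.
original = b"".join(bytes([i % 256]) * SECTOR for i in range(300))
image = io.BytesIO()
bad_sectors = image_disk(FlakyDisk(original, {5, 200}), image, 300)
print(bad_sectors)
```

The list of bad sectors matters later: it tells you which stripes of the reassembled array contain zero-filled data and need to be rebuilt from the other members.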

[Detailed process]

1. Make a complete image of each original hard disk. Imaging showed that disk 2 has 10-20 bad sectors; the other disks have none.

2. Structural analysis gave the following best-fit layout: disk order 0, 1, 2, 3, with disk 3 treated as missing; block size 512 sectors; backward parity (Adaptec). (Figure: RAID structure diagram.)

3. After assembling the array with this layout, verify the data: the most recent compressed archive larger than 200 MB decompresses without error, which confirms that the structure is correct.

4. Using this structure, write the virtual RAID out to a single hard disk. The file system opens without obvious errors.

5. Once the backup is confirmed safe and the customer agrees, rebuild the RAID on the original disks, replacing the damaged disk 2 with a new hard disk during the rebuild. Connect the single recovered disk to the failed server over USB, boot the server from the Linux SystemRescueCd, and write everything back with the dd command.

6. After the write-back, start the operating system.

7. After all the data has been written back with dd, the operating system fails to boot. The error message is: /etc/rc.d/rc.sysinit: Line 1: /sbin/pidof: Permission denied. This points to a problem with the permissions of that file.

8. Reboot into SystemRescueCd and check the file: its timestamps, permissions, and size are all clearly wrong, so the inode is evidently damaged.

9. Re-analyze the root partition in the reassembled data and locate the faulty /sbin/pidof. The problem turns out to be caused by the bad sectors on disk 2.

10. Use the three disks 0, 1, and 3 to repair the damaged area of disk 2 by XOR. Rechecking the file system afterwards still shows errors. Checking the inode table again shows that some inodes in the damaged area of disk 2 look as follows (the 55 / 55 part in the figure):

11. Although the uid recorded in these inodes is still present, the attributes, size, and initial allocated block are all wrong. After analyzing every possibility, it was determined that the damaged inodes cannot be recovered directly; the only options are to repair each inode by hand or to copy in an identical file. For every file that might be affected, the inode information of the original inode block was determined from the log and then corrected.

12. After the correction, write the root partition back with dd again and run fsck -fn /dev/sda5. It still reports errors (Figure: fsck output).

13. The messages indicate that multiple inodes share the same data blocks. Low-level analysis based on these messages shows that, because disk 3 dropped out early, old and new inode information overlap.

14. Tell the inodes apart by the files they belong to. After clearing the erroneous inodes, run fsck -fn /dev/sda5 again: there are still error messages, but only a few. The messages show that most of the remaining inodes are in the doc directory, which does not affect system startup, so run fsck -fy /dev/sda5 to force the repair.

15. After the repair, restart the system. It boots to the desktop successfully. Start the database service and the application software: everything works normally and no errors are reported.
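The condition found in step 13, several inodes claiming the same data block, can be illustrated with a simplified model. The sketch below assumes the block lists have already been extracted from the inode table (the parsing itself is not shown), and the inode and block numbers are made up for the example. `find_shared_blocks` is a hypothetical helper, not part of fsck.

```python
from collections import defaultdict


def find_shared_blocks(inode_blocks):
    """Return {block: [inodes]} for every block claimed by more than one inode.

    inode_blocks maps an inode number to the list of data block numbers
    that the inode points to.
    """
    owners = defaultdict(list)
    for inode, blocks in inode_blocks.items():
        for block in set(blocks):       # ignore repeats within one inode
            owners[block].append(inode)
    return {
        block: sorted(inodes)
        for block, inodes in owners.items()
        if len(inodes) > 1
    }


# Made-up example: inode 12 overlaps with inodes 57 and 91.
inode_blocks = {
    12: [100, 101, 102],
    57: [102, 103],
    90: [200],
    91: [101],
}
shared = find_shared_blocks(inode_blocks)
for block in sorted(shared):
    print(block, shared[block])
```

In a case like this one, each conflict then has to be resolved by deciding which inode is the current owner, for example by comparing the candidates against the files they belong to, as step 14 describes.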

At this point, the data recovery and system migration are complete.
