How to analyze the problem of disaster recovery in IBM DS4300 Storage

2026-02-10 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article looks at how to analyze a disaster recovery problem on IBM DS4300 storage, a topic that many people do not understand well. The notes below summarize one real incident and several related questions.

One of the company's old IBM DS4300 storage units failed suddenly just before the National Day holiday, and the data on the array could no longer be accessed, which was a serious problem. The recovery process is recorded here.

The hardware environment: an IBM DS4300 divided into two arrays, each RAID5 with its own hot spare. Each array is divided into two logical disks, and one 380 GB logical disk on array2 is shared by a two-node minicomputer cluster running AIX and Oracle 10g. The storage has dual controllers: controller A failed suddenly, but controller B could not take over controller A's LUNs, so some important data could not be accessed. Analysis of the array logs with the Server Raid management software showed that disk 2 had become abnormal on August 11 and had not taken part in the RAID since then, while disk 1 had reported an error only recently. After testing and imaging disks 1 and 2, disk 2 was found to have a small number of bad sectors, while disk 1 could be read normally.
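Why a RAID5 member that dropped out months ago is dangerous can be seen from the parity arithmetic. The following is a minimal sketch of generic RAID5 XOR parity, not the DS4300's actual on-disk layout; the four-member stripe and the block contents are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a hypothetical 4-member RAID5: three data blocks plus parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# With one member missing, its block is rebuilt from the survivors and parity.
rebuilt = xor_blocks([data[0], data[2], parity])

# If one "survivor" holds stale contents (it left the array earlier and
# missed later writes), the same calculation silently yields wrong data.
stale = b"AAAa"
wrong = xor_blocks([stale, data[2], parity])

print(rebuilt == data[1])  # the rebuild is exact when all survivors are current
print(wrong == data[1])    # a stale member corrupts the rebuilt block
```

This is the reason the stale disk 2 could not simply be forced back online to rebuild the array.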

Restore:

After testing and imaging disks 1 and 2, we left disk 1 in the DS4300 array and pulled disk 2 out, then tried various methods in the Server Raid management software to bring disk 1 ONLINE. Once disk 1 was finally ONLINE, we checked the VG information under AIX and found that it had been destroyed, so the recovery hit its next obstacle. Comparing the header areas of disk 1 and disk 2 showed that the header of disk 2 still held the relevant VG information, so we copied the VG information from disk 2 to the corresponding location on disk 1 and checked the VG and LV information under AIX again. This time the VG information was intact and the file system mounted without problems. Oracle, however, would not start and reported an error in the redo1.log file. After several attempts Oracle finally came up, and the data was exported immediately with exp. At that point the data had been recovered successfully.
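The header copy described above can be sketched on image files as follows. This is only an illustration under stated assumptions: the function name, the offset and the length are hypothetical, since the article does not say where the VG information sits on the disk, and the work is done on images rather than on the original drives.

```python
def copy_region(src_path, dst_path, offset, length):
    """Copy `length` bytes at `offset` from the source image to the same
    offset in the destination image. Returns the bytes that were overwritten
    so they can be saved and restored if the result is worse than before."""
    with open(src_path, "rb") as src:
        src.seek(offset)
        good = src.read(length)
    if len(good) != length:
        raise ValueError("source image is shorter than offset + length")
    with open(dst_path, "r+b") as dst:  # r+b: modify in place, do not truncate
        dst.seek(offset)
        old = dst.read(length)
        dst.seek(offset)
        dst.write(good)
    return old
```

A call might look like `old = copy_region("disk2.img", "disk1.img", offset, length)`, with `offset` and `length` taken from a hex comparison of the two headers. Keeping `old` makes the change reversible.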

Summary of experience:

Pay special attention when replacing a hard disk on the DS4300. It is best to use a disk of the same model and firmware version as the original. In this case the original disk 2 had failed; any replacement for it needs to match the model and firmware version of the original disk, otherwise the new disk will be unstable on the DS4300 and is likely to drop offline. The DS4300 is strict about disk firmware version matching, so always check it when replacing a disk.

The LVM information was also key to this recovery. The LVM information on disk 1 had been destroyed; a good copy was found on disk 2 and copied to the corresponding location on disk 1, which allowed the rest of the recovery to proceed smoothly.

Another question:

A DS4300 is divided into two arrays, each RAID5 with its own hot spare. Each array is divided into two logical disks. One hard disk in array2 is now flashing a yellow light but is still running, and errpt under AIX also reports an error on hdisk3. Questions: 1. Does the hot spare take over automatically at this point, or does it have to be configured manually? 2. Can I replace the bad disk directly online? 3. Should I then define a new hot spare, or keep using the original one as the hot spare? The disk array does not seem very reliable to me: something is always failing, either the battery or a hard disk, which is worrying.

In this example, since the RAID itself is not damaged, there is no need to image each physical hard disk separately; it is enough to image the LUN on which the error is reported. There are two ways to take the image: use the dd command under Linux to copy the LUN to another storage space, or present the LUN to be recovered to a Windows host and image it with the WinHex tool. Once the image is complete, the focus of the data recovery is to analyze the structure of the XFS file system and extract the data.
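The imaging step can be sketched in Python as a block-by-block copy that zero-fills unreadable blocks and records where they were, roughly what dd does with conv=noerror,sync. The function name, block size and paths are assumptions for illustration, and this is not a replacement for a proper imaging tool.

```python
import os

def image_lun(src_path, dst_path, block_size=64 * 1024):
    """Copy src to dst one block at a time. Blocks that cannot be read are
    written as zeros so later offsets stay aligned. Returns the list of byte
    offsets that could not be read."""
    unreadable = []
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        total = src.seek(0, os.SEEK_END)  # size of the file or block device
        offset = 0
        while offset < total:
            want = min(block_size, total - offset)
            try:
                src.seek(offset)
                chunk = src.read(want)
            except OSError:
                chunk = None
            if not chunk:
                # Read error or unexpected end: pad with zeros and move on.
                unreadable.append(offset)
                chunk = b"\x00" * want
            dst.write(chunk)
            offset += len(chunk)
    return unreadable
```

A smaller `block_size` loses less data around each bad sector at the cost of a slower copy.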

You can scan each imaged LUN with the Darth D-Recovery For XFS data recovery software to collect the XFS file system information (superblock, inodes, directories, file names, etc.) and then extract the data in full. If the corruption is not serious, you can instead restore the partition table or superblock to its state before the problem and attach the LUN back to the Linux environment, where the file system can be mounted normally. In this case, the problematic LUN could be mounted properly after the partition table or superblock was repaired, and one LUN needed the D-Recovery For XFS tool to export its data; in the end the recovery was complete.
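As a rough idea of what such a scan looks for, the sketch below searches an image for XFS superblock copies. It relies on the XFS on-disk format, where a superblock starts with the magic bytes "XFSB" followed by a big-endian 32-bit block size and a 64-bit data block count. The function name and the 512-byte sector step are assumptions, and this is not how the D-Recovery tool itself works.

```python
import struct

XFS_MAGIC = b"XFSB"

def find_xfs_superblocks(image_path, sector_size=512):
    """Return (offset, block_size, data_blocks) for every sector in the image
    that starts with the XFS superblock magic."""
    hits = []
    chunk_size = sector_size * 2048  # read in sector-aligned chunks
    base = 0
    with open(image_path, "rb") as img:
        while True:
            chunk = img.read(chunk_size)
            if not chunk:
                break
            for i in range(0, len(chunk) - 15, sector_size):
                if chunk[i:i + 4] == XFS_MAGIC:
                    # sb_blocksize (uint32) and sb_dblocks (uint64), big-endian
                    block_size, data_blocks = struct.unpack(">IQ", chunk[i + 4:i + 16])
                    hits.append((base + i, block_size, data_blocks))
            base += len(chunk)
    return hits
```

Several hits at regular intervals usually correspond to the backup superblocks kept in each allocation group, which is what makes restoring a damaged primary superblock possible.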

Add:

RAID10 architecture: take, for example, ten 146 GB Fibre Channel hard drives. Every two disks form a RAID1, giving five RAID1 groups in total, and these five groups are then configured into a RAID0. This is the so-called hybrid RAID10 architecture. The DS4300 array is attached to an IBM minicomputer and formatted with AIX JFS2 file systems. This RAID architecture looks safe, but it can still go wrong.

In this architecture, if even one of the five RAID1 groups fails, the whole array can no longer be accessed properly, because the RAID0 stripe depends on every group, and the file systems cannot be mounted on AIX.
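The arithmetic behind this layout can be sketched as follows. The pairing of disks 0 and 1, 2 and 3, and so on into mirror groups is an assumption made for illustration; the real array assigns pairs according to its own configuration.

```python
def raid10_usable_gb(disk_count, disk_gb):
    """Usable capacity of RAID10: half the disks hold mirror copies."""
    if disk_count % 2:
        raise ValueError("RAID10 needs an even number of disks")
    return (disk_count // 2) * disk_gb

def raid10_survives(failed_disks, disk_count):
    """True if every mirror pair still has at least one working disk.
    Assumes disks 2k and 2k+1 form mirror pair k (hypothetical layout)."""
    failed = set(failed_disks)
    return all(not {2 * k, 2 * k + 1} <= failed for k in range(disk_count // 2))

print(raid10_usable_gb(10, 146))               # 730
print(raid10_survives({0, 2, 4, 6, 8}, 10))    # True: one disk lost per pair
print(raid10_survives({0, 1}, 10))             # False: one whole pair lost
```

The array can therefore lose up to five disks if they fall in different pairs, yet two failures in the same pair are enough to take the whole RAID0 down.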

Add:

Replace the damaged controller

On a dual-controller DS4300, controller A would not come online, and neither its host interface light nor its network port light was lit. With no spare controller available, the host and the storage were shut down and the battery was replaced. Controller B returned to normal, but controller A still did not work.

A new controller is now available (its microcode version is unknown), and the replacement is about to be attempted again. The plan is as follows.

Storage conditions:

DS4300 base model, dual controllers (controller A has failed), no EXP expansion enclosure

Firmware version: 06.12.03.00

NVSRAM version: N1722F600R912V05

Hard disk microcode: JFQ3

The general steps of the operation:

1. Back up the data to a different location

2. Collect ASD

3. Stop the database, then shut down the host and the storage

4. Replace controller A and the battery

5. Power on the storage; leave the host off for now

6. With both controllers connected and no I/O reads or writes in progress, upgrade the hard disk microcode from JFQ3 to JFQ8

7. Confirm the microcode version of the new controller. If it is 06.12.03.00, controllers A and B are at the same level and no upgrade is needed.

If it is higher than 06.12, upgrade the microcode of the existing controller so that it matches the new controller A.

8. Power on the host and collect ASD again

Question:

1. I have never upgraded controller microcode before. I can find the microcode download for the DS4300 base model on the IBM website, but are there really only two 06-level versions? Also, version 6.60.22.00 cannot be downloaded.

It should not matter if the microcode versions differ. As long as the old controller you still have is healthy, plug in the new one and it should synchronize automatically; do the replacement online rather than with the storage shut down. In any case, make a backup before you start. Some suggest that the replacement controller's version should not be higher than the existing version. In your case the disk microcode is an additional risk, so I think the situation is quite complicated. Ask for advice before doing it.

If your disk microcode is still at the old JFQ3 level, that microcode may also cause disk alarms when a controller is pulled or inserted online.

You can consider the following steps (on the premise that the data and the array information have been backed up):

1. Try upgrading the disk microcode to JFQ8 with a single controller

2. Replace the controller

2.1: If the microcode of the new controller is lower than the original microcode, replace it directly online; the original, higher microcode should be synchronized to the new controller automatically.

2.2: If the microcode of the new controller is higher than the original microcode, first try upgrading the original controller so that its microcode is higher than that of the new controller, and then replace the controller online. Before upgrading the microcode, save the event log and then clear it.

3. Replace the battery
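The decision in step 2 comes down to comparing dotted version strings field by field. The sketch below encodes only the advice given above; the function names and the wording of the returned plans are illustrative, and equal versions are treated as needing no upgrade.

```python
def parse_version(text):
    """Turn a dotted version such as '06.12.03.00' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))

def replacement_plan(existing, new):
    """Suggest a replacement path from the two controller microcode levels."""
    old_v, new_v = parse_version(existing), parse_version(new)
    if new_v < old_v:
        return "replace online; the existing controller should sync the new one"
    if new_v > old_v:
        return "upgrade the existing controller first, then replace online"
    return "replace online; versions already match"

print(replacement_plan("06.12.03.00", "06.10.00.00"))
print(replacement_plan("06.12.03.00", "06.60.22.00"))
print(replacement_plan("06.12.03.00", "06.12.03.00"))
```

Comparing tuples of integers avoids the mistakes plain string comparison makes when fields have different widths, for example "6.9" versus "6.10".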
