Recovering data after two hard disks dropped out of a server RAID array


2026-05-04 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

1. Description of the server failure

The system requiring data recovery consists of 10 disk cabinets, each holding 24 hard drives. Nine cabinets are used for data storage and one for metadata storage. The metadata cabinet holds 24 146 GB hard drives, including 9 RAID 1 arrays, one 4-disk RAID 10 array, and 4 hot spare drives.

In the data storage cabinets, every 6 hard drives form one RAID 5 array, for a total of 36 RAID groups, and these 36 groups are divided between 2 storage systems. In one of the storage systems, a RAID 5 group went offline after two of its hard drives failed one after the other. Because RAID 5 tolerates the loss of only one member disk, the array failed and the whole storage system went down.
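Why a second failed disk is fatal for RAID 5 can be illustrated with a small sketch. Each stripe on a 6-disk group holds 5 data blocks plus one parity block that is the XOR of the other five, so exactly one missing block per stripe can be recomputed. The block contents and the helper names (`xor_blocks`, `rebuild_missing`) are made up for illustration and are not taken from the storage device.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR several equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a 6-disk RAID 5 group: 5 data blocks plus 1 parity block.
data_blocks = [bytes([i] * 8) for i in range(1, 6)]
parity = xor_blocks(data_blocks)
stripe = data_blocks + [parity]

def rebuild_missing(stripe_with_gaps):
    """Rebuild the one missing block (marked None) of a stripe.

    Raises ValueError unless exactly one block is missing, which is the
    only case RAID 5 parity can solve.
    """
    missing = [i for i, block in enumerate(stripe_with_gaps) if block is None]
    if len(missing) != 1:
        raise ValueError(f"cannot rebuild: {len(missing)} blocks missing")
    survivors = [block for block in stripe_with_gaps if block is not None]
    return missing[0], xor_blocks(survivors)
```

With one block set to `None` the function returns the original contents; with two blocks missing it raises, which mirrors the array going offline after the second drive failed.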

The storage and file system architecture is roughly as follows:

Note: Meta_LUN is the metadata volume; Data_LUN is the user data volume.

2. Disk backup

To avoid secondary damage to the original disks through misoperation during recovery, the customer's storage environment was first backed up with WinHex.

The backup process is shown in figure 2:

Number the 6 member disks of the faulty RAID group, remove them from the storage cabinet, connect them to the prepared backup platform, and image all 6 drives.

Back up the remaining RAID arrays, which did not fail, at the storage level. Connect the backup platform to the storage device with optical fiber cable, use the Kunteng storage device management interface to configure the two so that they can communicate normally, and use WinHex to image the LUNs in each RAID.

During the backup, one of the failed drives in the faulty RAID group was found to contain a large number of bad sectors. Imaging it failed partway through and could not be resumed. After the drive was opened, its firmware was replaced, and it was repaired with the PC3000 tool, imaging could continue, although the bad sectors remained. Figure 3:

Partial image file
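The imaging step, including tolerance for unreadable sectors, might be sketched as follows. This is not how WinHex or PC3000 work internally; it is a minimal illustration that assumes the source is a file or block device readable through ordinary OS calls. The function names, chunk size, and sector size are illustrative choices.

```python
import os

def _read_at(src, offset, length):
    """Read up to `length` bytes at `offset`, looping over short reads."""
    src.seek(offset)
    buf = bytearray()
    while len(buf) < length:
        piece = src.read(length - len(buf))
        if not piece:  # end of the device or file
            break
        buf += piece
    return bytes(buf)

def image_device(src_path, dst_path, chunk_size=1 << 20, sector_size=512):
    """Copy src to dst; zero-fill unreadable sectors and return them as (offset, length)."""
    bad_ranges = []
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        size = src.seek(0, os.SEEK_END)
        for offset in range(0, size, chunk_size):
            length = min(chunk_size, size - offset)
            try:
                data = _read_at(src, offset, length)
            except OSError:
                # The large read hit an error: retry one sector at a time.
                data = bytearray()
                for pos in range(offset, offset + length, sector_size):
                    n = min(sector_size, offset + length - pos)
                    try:
                        block = _read_at(src, pos, n)
                    except OSError:
                        block = b""
                        bad_ranges.append((pos, n))
                    data += block.ljust(n, b"\0")
            # Pad so that every offset in the image matches the source offset.
            dst.write(bytes(data).ljust(length, b"\0"))
    return bad_ranges
```

Keeping the image the same size as the source, with bad sectors zero-filled and logged, means later analysis can address the image by the original disk offsets and knows which ranges are untrustworthy.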

3. Data analysis

First, the faulty RAID array was analyzed to obtain its RAID parameters. The array was then virtually reassembled with WinHex, and the LUNs in the RAID were exported as image files. The analysis showed that the badly damaged drive was the one that went offline second; its large number of bad sectors may affect the recovery results.
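Virtual reassembly of a RAID 5 group from member images can be sketched as below. The article does not state the actual parity rotation, stripe unit size, or disk order, so the sketch assumes a left-symmetric layout (the Linux md default) and works on in-memory byte buffers; `locate`, `read_unit`, and `assemble` are hypothetical helper names.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR several equal-length byte blocks together."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def locate(unit_index, n_disks):
    """Map a logical stripe unit to (row, data_disk, parity_disk).

    Assumes a left-symmetric RAID 5 layout: parity moves one disk to the
    left on each row, and data starts on the disk after the parity disk.
    """
    row, idx = divmod(unit_index, n_disks - 1)
    parity_disk = n_disks - 1 - (row % n_disks)
    data_disk = (parity_disk + 1 + idx) % n_disks
    return row, data_disk, parity_disk

def read_unit(members, unit_index, unit_size):
    """Read one logical stripe unit; `members` holds one image per disk, None if absent."""
    row, data_disk, _ = locate(unit_index, len(members))
    start = row * unit_size
    if members[data_disk] is not None:
        return bytes(members[data_disk][start:start + unit_size])
    # The wanted disk is absent: XOR the same row of all other members.
    others = [m[start:start + unit_size] for m in members if m is not None]
    if len(others) != len(members) - 1:
        raise ValueError("more than one member is missing")
    return xor_blocks(others)

def assemble(members, unit_size, n_units):
    """Concatenate logical stripe units 0..n_units-1 into one LUN image."""
    return b"".join(read_unit(members, k, unit_size) for k in range(n_units))
```

In practice the layout parameters have to be determined from the disks themselves, and a real tool would stream from image files instead of holding whole disks in memory.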

Log in to the management interface of the Kunteng storage device to obtain basic information about the volumes in the StorNext file system, as shown in figure 4 below:

Next, the Meta volume and the Data volumes in the StorNext file system were analyzed. The customer's StorNext file system contains two Data volumes, and each complete Data volume is composed of LUNs from several RAID groups. Analyzing these LUNs revealed the rule by which they are combined, which made it possible to virtually reassemble the complete Data volumes.

Figure 5:
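The article does not disclose the rule by which the LUNs are combined. As a sketch of what such a rule can look like, the code below assumes plain round-robin striping across the LUNs with a fixed stripe breadth; both the scheme and the names (`volume_to_lun`, `read_volume`, `stripe_breadth`) are assumptions for illustration only.

```python
def volume_to_lun(volume_offset, n_luns, stripe_breadth):
    """Map a byte offset in the combined volume to (lun_index, lun_offset).

    Assumes consecutive chunks of `stripe_breadth` bytes are dealt out to
    the LUNs in round-robin order.
    """
    chunk, within = divmod(volume_offset, stripe_breadth)
    lun_row, lun_index = divmod(chunk, n_luns)
    return lun_index, lun_row * stripe_breadth + within

def read_volume(luns, offset, length, stripe_breadth):
    """Read `length` bytes of the combined volume starting at `offset`."""
    out = bytearray()
    while length > 0:
        lun_index, lun_offset = volume_to_lun(offset, len(luns), stripe_breadth)
        # Never read past the end of the current chunk on this LUN.
        take = min(length, stripe_breadth - lun_offset % stripe_breadth)
        out += luns[lun_index][lun_offset:lun_offset + take]
        offset += take
        length -= take
    return bytes(out)
```

Once a mapping like this is known, the Data volume never has to be physically rebuilt: any byte range can be read on demand from the LUN images.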

The Meta volume was then analyzed for its node and directory entry information, and for the correspondence between the Meta volume and the Data volumes. Because one Meta volume manages multiple Data volumes, the indexing algorithm from the Meta volume to the Data volumes was also worked out. The file node is shown in figure 6:

The directory block is shown in figure 7:

4. Data recovery

With the information needed for recovery in hand, we wrote a program that scans the Meta volume for node and directory entry information, parses the directory entries and nodes to rebuild the complete file system directory structure, parses the pointer information in each node, and records the results in a database.
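The general shape of such a scanner can be sketched as follows. The StorNext on-disk format is not described in the article, so the record layout here (a 4-byte `INOD` magic followed by node number, parent number, size, and a 32-byte name) is entirely hypothetical, as are the table and function names. Only the scan, parse, and record pattern is the point.

```python
import sqlite3
import struct

# Hypothetical record layout, NOT the real StorNext format:
# magic, node number, parent node number, file size, NUL-padded name.
RECORD = struct.Struct("<4sqqq32s")
MAGIC = b"INOD"

def scan_nodes(meta, step=512):
    """Yield (ino, parent, size, name) for every record found on a `step` boundary."""
    for pos in range(0, len(meta) - RECORD.size + 1, step):
        magic, ino, parent, size, raw_name = RECORD.unpack_from(meta, pos)
        if magic == MAGIC:
            yield ino, parent, size, raw_name.rstrip(b"\0").decode("utf-8", "replace")

def build_catalog(meta, db_path=":memory:"):
    """Scan the metadata image and store every node found in a SQLite table."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS node "
               "(ino INTEGER PRIMARY KEY, parent INTEGER, size INTEGER, name TEXT)")
    db.executemany("INSERT OR REPLACE INTO node VALUES (?, ?, ?, ?)", scan_nodes(meta))
    db.commit()
    return db

def full_path(db, ino, root=1):
    """Rebuild a path by walking parent links up to the assumed root node."""
    parts, seen = [], set()
    while ino != root and ino not in seen:
        seen.add(ino)  # guard against loops in damaged metadata
        row = db.execute("SELECT parent, name FROM node WHERE ino = ?", (ino,)).fetchone()
        if row is None:
            parts.append(f"<orphan-{ino}>")
            break
        parent, name = row
        parts.append(name)
        ino = parent
    return "/" + "/".join(reversed(parts))
```

Storing the parsed records in a database separates the slow scan from the later extraction, so extraction can be rerun or filtered without rescanning the Meta volume.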

The file information is as follows:

A file extraction program was then written to read the database and extract the data according to the parsed information and the aggregation algorithm between the two Data volumes.
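A minimal sketch of the extraction step is shown below. It assumes each file's pointer information has already been reduced to an ordered list of extents of the form (volume id, byte offset, length), which is a simplification chosen for illustration; the article does not give the real pointer format or aggregation algorithm.

```python
def extract_file(extents, volumes, size, out):
    """Copy a file's bytes from the data volumes into `out`.

    extents: iterable of (volume_id, offset, length) in file order (assumed format).
    volumes: dict mapping volume_id to a seekable binary file object.
    size:    file size in bytes; the last extent is trimmed to fit it.
    """
    remaining = size
    for volume_id, offset, length in extents:
        if remaining <= 0:
            break
        want = min(length, remaining)
        volume = volumes[volume_id]
        volume.seek(offset)
        chunk = volume.read(want)
        if len(chunk) < want:
            raise EOFError(f"volume {volume_id} ended inside extent at offset {offset}")
        out.write(chunk)
        remaining -= want
    if remaining > 0:
        raise ValueError("extents cover fewer bytes than the recorded file size")
```

Raising on short reads or missing extents, instead of writing a truncated file silently, makes damaged files easy to single out for a second pass.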

5. Recovery result

The recovered data was spot-checked by random sampling and no problems were found. The files the customer needed were extracted locally and, once the extraction was confirmed complete, handed over to the customer. The data transfer was completed and the customer was satisfied with the recovery result. Although the faulty drive had bad sectors, fortunately the main data was not damaged, and the data recovery was completed successfully.

