
Recovery process for data loss caused by RAID-6 disk array corruption (graphic tutorial)


I. Fault description

A sudden power outage in the computer room brought down the entire storage system, and it remained unusable after power was restored. After diagnosis, the user's engineers concluded that the storage array had been damaged by the power outage.

The storage is a RAID-6 disk array composed of 12 Hitachi 3 TB SAS hard drives, divided into a single volume and presented to several VMware ESXi hosts as shared storage. The volume holds a large number of Windows virtual machines, most of them created from templates, so the system disks are all 160 GB; the data disks vary in size and are thin provisioned.

II. Backing up the data

All of the failed storage's disks, together with the target disks used to back up their data, were connected to a Windows Server 2008 server, with all failed disks set to offline (read-only) status. In the professional tool WinHex, the connection status is as shown in the following figure (HD1-HD12 are the destination backup disks, HD13-HD24 are the source failed disks, model HUS723030ALS640):

Figure 1:

Reading sectors from HD13-HD24 at a low level with WinHex revealed a large number of damaged sectors. The preliminary guess was that this model of hard disk uses a read mechanism different from common disks. We tried replacing the host, the HBA card, and the expansion enclosure, and switching to a Linux operating system, but the same failure appeared in every case. The user's engineers, when contacted, responded that this controller has no special requirements for the disks.

Professional tools were then used to map the distribution of damaged sectors on each hard disk, and the following pattern emerged:

1. Each damaged region spans 256 sectors.

2. Apart from the starting position of the first damaged segment, which differs from disk to disk, consecutive damaged segments are spaced 2816 sectors apart.

The distribution of damaged sectors on all disks is shown in the following table (only the first three damaged-segment start sectors are listed):

ID   Serial number   1st bad sector   2nd bad sector   3rd bad sector
13   YHJ7L3DD        5376             8192             11008
14   YHJ6YW9D        2304             5120             7936
15   YHJ7M77D        2048             4864             7680
16   YHJ4M5AD        1792             4608             7424
17   YHJ4MERD        1536             4352             7168
18   YHJ4MH9D        1280             6912             9728
19   YHJ7JYYD        1024             6656             9472
20   YHJ4MHMD        768              6400             9216
21   YHJ7M4YD        512              6144             8960
22   YHJ632UD        256              5888             8704
23   YHJ6LEUD        5632             8448             11264
24   YHHLDLRA        256              5888             8704

A small program was written on the spot to skip the damaged sectors of each disk, and this program was used to image (mirror) the data of all the disks.
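The following is a minimal sketch of such a sector-skipping imaging loop, based on the 256-sector bad segments and 2816-sector spacing observed above; the device and image paths are placeholders, and the original in-house tool is not reproduced here.

```python
# Minimal sketch of a sector-skipping imaging loop (device/image paths
# are placeholders; the original in-house tool is not reproduced here).
SECTOR = 512          # bytes per sector
CHUNK = 256           # copy in 256-sector blocks, the observed bad-segment size
BAD_LEN = 256         # each bad segment spans 256 sectors
BAD_PERIOD = 2816     # start-to-start spacing between bad segments

def in_known_bad_segment(lba, first_bad):
    """True if the 256-sector block at lba falls inside a known bad segment."""
    return lba >= first_bad and (lba - first_bad) % BAD_PERIOD < BAD_LEN

def image_disk(src_path, dst_path, total_sectors, first_bad):
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        for lba in range(0, total_sectors, CHUNK):
            if in_known_bad_segment(lba, first_bad):
                continue                      # skip known-bad block, leave zeros
            try:
                src.seek(lba * SECTOR)
                data = src.read(CHUNK * SECTOR)
            except OSError:
                continue                      # unexpected read error: skip as well
            dst.seek(lba * SECTOR)
            dst.write(data)

# Example (hypothetical paths; first bad sector taken from the table above):
# image_disk("/dev/sdm", "hd13.img", 5_860_533_168, first_bad=5376)
```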

III. Fault analysis

1. Analyzing the damaged sectors

Careful analysis shows that the damaged sectors follow a regular pattern (a small cross-check sketch follows this list):

- Each damaged segment is 256 sectors in total.

- The damaged segments sit at fixed positions: roughly one bad 256-sector block in every eleven 256-sector blocks, which matches the 2816-sector spacing observed above.

- The damaged segments always fall within the RAID's P-parity or Q-parity areas.

- Of all the hard drives, only disk 10 has a single naturally occurring bad track.
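As a rough cross-check, the sketch below takes the bad-segment starts from the table and confirms that they are 256-sector aligned and spaced at multiples of 2816 sectors; reading this pattern as a rotating parity block is our interpretation, not something confirmed by the controller vendor.

```python
# Cross-check of the bad-segment pattern using offsets from the table above.
# 2816 = 11 x 256, consistent with a parity block rotating across the 11
# active disks one 256-sector stripe unit at a time (interpretation only).
BLOCK = 256      # stripe-unit size, in 512-byte sectors
PERIOD = 2816    # observed start-to-start spacing of bad segments

observed = {     # disk ID -> first three bad-segment start sectors
    13: [5376, 8192, 11008],
    14: [2304, 5120, 7936],
    18: [1280, 6912, 9728],
    22: [256, 5888, 8704],
}

for disk, starts in observed.items():
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    aligned = all(s % BLOCK == 0 for s in starts)     # 256-sector aligned
    periodic = all(g % PERIOD == 0 for g in gaps)     # multiples of 2816
    # Some disks (e.g. 18 and 22) show a 5632-sector gap, i.e. two periods,
    # where one expected parity block apparently did not go bad.
    print(f"disk {disk}: gaps={gaps}, aligned={aligned}, periodic={periodic}")
```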

2. Analyzing the partition size

Analysis of sectors 0-2 of HD13, HD23 and HD24 shows that the partition size is 52,735,352,798 sectors. In RAID-6 mode this is divided by 9 (the number of data disks), giving roughly 5,859,483,644 sectors per member disk, which is consistent with the physical hard disk size once the roughly 1,049,524 sectors reserved by the DS800 controller for its RAID information area are subtracted. At the same time, the raw contents of the physical disks show 512-byte sectors with no trailing 8-byte checksum, and the many all-zero sectors likewise carry no 8-byte checksum. The original storage therefore did not enable the DA technology (520-byte sectors) commonly used in storage arrays.
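The arithmetic can be checked directly, as in the sketch below; the 5,860,533,168-sector figure is the standard LBA count of a 3 TB drive and is an assumption on our part rather than a value read from the controller.

```python
# Sanity check of the partition-size arithmetic. The 3 TB LBA count of
# 5,860,533,168 sectors is assumed, not read from the controller.
partition_sectors = 52_735_352_798          # from the GPT entry, 512-byte sectors
data_disks = 9                              # RAID-6 over 11 active disks: 11 - 2

per_disk = round(partition_sectors / data_disks)
print(per_disk)                             # ~5,859,483,644 sectors per member

physical_sectors = 5_860_533_168            # typical 3 TB drive (assumption)
reserved = physical_sectors - per_disk
print(reserved)                             # ~1,049,524 sectors of controller metadata
print(f"{reserved * 512 / 2**20:.0f} MiB reserved per disk")   # roughly 512 MiB
```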

The partition size is as follows (the raw GPT partition table entry is shown; the highlighted bytes are the partition size, a 64-bit value in 512-byte sectors):

Figure 2:

IV. Reassembling the RAID

1. Analyzing the RAID structure

The storage uses a standard RAID-6 layout, so to reassemble the array we only need to determine the stripe size, the number of RAID members, and the rotation order of the RAID.

- Analyzing the RAID stripe size

The whole storage is carved into one large volume shared by several ESXi hosts, so the volume's file system must be VMFS. In turn, the VMFS volume holds a large number of Windows virtual machines, and Windows virtual machines mostly use NTFS, so the RAID stripe size and rotation order can be deduced from the order of MFT records in NTFS (a small sketch of this follows).
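A rough sketch of this MFT-based approach is shown below. It assumes standard 1 KiB NTFS FILE records with the record number stored at offset 0x2C (NTFS 3.1); the image paths and scan offsets are placeholders. Where the record numbers jump between member images, the MFT has crossed a stripe boundary, which reveals the stripe size and disk order.

```python
# Sketch: scan a raw member-disk image for NTFS MFT records and print
# their record numbers. Assumes 1 KiB FILE records with the record
# number at offset 0x2C (NTFS 3.1); paths and offsets are placeholders.
import struct

SECTOR = 512
RECORD = 1024                      # one MFT FILE record = 2 sectors

def scan_mft(image_path, start_sector, count):
    with open(image_path, "rb") as img:
        img.seek(start_sector * SECTOR)
        for i in range(count):
            rec = img.read(RECORD)
            if len(rec) < RECORD or rec[:4] != b"FILE":
                continue
            (recno,) = struct.unpack_from("<I", rec, 0x2C)
            print(f"{image_path}: sector {start_sector + i * 2}: MFT record {recno}")

# Example (hypothetical offsets): compare the same region on each member image.
# for path in ["hd13.img", "hd14.img", "hd15.img"]:
#     scan_mft(path, start_sector=1_000_000, count=512)
```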

- Analyzing whether any disk had dropped out of the RAID

While imaging the disks it was noticed that the last hard disk did not have the large numbers of bad tracks seen on the other drives: most of its sectors were intact, and the majority of them were all zeros. It can therefore be concluded that this drive is a hot spare.
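A simple way to quantify this is to compare the proportion of all-zero blocks in each image, roughly as below (image paths are placeholders); a hot spare that was barely used shows a far higher all-zero ratio than the active members.

```python
# Rough check for a hot spare: measure the fraction of all-zero blocks
# in each disk image (paths are placeholders).
def zero_ratio(image_path, block=1024 * 1024):
    total = zero = 0
    with open(image_path, "rb") as img:
        while True:
            buf = img.read(block)
            if not buf:
                break
            total += 1
            if not buf.strip(b"\x00"):       # block is entirely zero bytes
                zero += 1
    return zero / total if total else 0.0

# for path in ["hd13.img", "hd24.img"]:
#     print(path, f"{zero_ratio(path):.1%} all-zero blocks")
```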

2. Reassembling the RAID

Reassembling the RAID according to the analyzed structure made the directory structure visible. It was not certain, however, that this reflected the latest state, so several virtual machines were tested: some were normal, but many others showed data anomalies. The initial assumption was that a disk had dropped out of the RAID, so each disk was kicked out of the array in turn to see whether the anomalous data became consistent, without success. Careful analysis of the underlying data then showed that the problem was not at the RAID level but in the VMFS file system: a VMFS volume larger than 16 TB carries additional metadata records, which have to be skipped when rebuilding the RAID. After reassembling the RAID again, the previously anomalous data lined up. Verifying one of the virtual machines showed that with all disks included in the RAID the virtual machine started correctly, whereas removing a disk caused problems, so it was concluded that the whole RAID was in a state with no missing disks.

V. Validating data

1. Verifying the virtual machines

Verification of the user's more important virtual machines found that most of them could be powered on and reach the login screen. Some virtual machines blue-screened at boot or started a disk check, but after being repaired from a boot CD they could be started.

Some virtual machines boot as follows:

Figure 3:

2. Verifying the databases

Verifying the databases inside the important virtual machines found them to be normal. One database was, according to the user, missing part of its data, but careful checking showed that the missing data had never existed in that database. Querying the system views in the master database lists all of the original databases, as follows:
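The system-view query can be reproduced roughly as in the sketch below; it assumes the database is SQL Server (suggested by the master database and system views), and the connection string is a placeholder.

```python
# Sketch: list the databases recorded in the master database via the
# sys.databases system view (SQL Server assumed; connection string is
# a placeholder).
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=recovered-vm;DATABASE=master;UID=sa;PWD=***"
)
cur = conn.cursor()
cur.execute("SELECT database_id, name, create_date, state_desc FROM sys.databases")
for dbid, name, created, state in cur.fetchall():
    print(dbid, name, created, state)
```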

Figure 4:

3. Checking whether the entire VMFS volume is complete

Because there are so many virtual machines, verifying each one would take a long time, so the entire VMFS volume was checked instead. During this check it was found that a few virtual machines or virtual machine files were corrupted; the list is as follows:

Figure 5:

VI. Restoring data

1. Generating the data

The North Asia engineers communicated with the customer and described the current state of the recovery. After the user verified several important virtual machines and judged the recovered data acceptable, the North Asia engineers immediately began preparing to recover all of the data.

The target storage was prepared first: a Dell MD1200 with 11 3 TB hard drives configured as a RAID array. The reassembled RAID data was then mirrored onto this target array, and the professional tool "UFS" was used to parse the entire VMFS file system.

2. Attempting to mount the recovered VMFS volume

The recovered VMFS volume was attached to an ESXi 5.5 host in our virtualization environment and an attempt was made to mount it there. However, whether because of the version difference (the customer's ESXi hosts run version 5.0) or because the VMFS itself is damaged, the mount did not succeed. Mounting it from the ESXi command line was also unsuccessful, so mounting the VMFS volume was abandoned.

VII. Handing over data

Because time was short, North Asia engineers were arranged to bring the data on the MD1200 array to the user's site, and the professional tool "UFS" was then used to export the virtual machines from the VMFS volume one by one:

1. Connect the MD1200 array holding the data to the user's vCenter server through an HBA card.

2. Install the "UFS" tool on the vCenter server and use it to parse the VMFS volume.

3. Use the "UFS" tool to export the virtual machines from the VMFS volume onto the vCenter server.

4. Upload the virtual machines to the ESXi datastore using vCenter's upload function.

5. Register the uploaded virtual machines in the inventory and power them on to verify them.

6. If a virtual machine fails to boot, try to repair it from the command line, or rebuild the virtual machine and copy the recovered virtual disk (VMDK) files over.

7. Some virtual machines have very large data disks that contain very little data. In such cases the data can be exported directly, a new virtual disk created, and the exported data copied onto the new disk.

Counting the virtual machines in the entire storage gives about 200. In the current situation the recovered virtual machines could only be restored onto the user's ESXi hosts one by one in the way described above. Because everything is transferred over the network, the network was the bottleneck of the whole migration. Even after repeated tuning and swapping hosts an ideal transfer rate could not be reached, so given the time constraints it was finally decided to migrate the data in the existing environment.

VIII. Summary of data recovery

1. Fault summary

After careful analysis, the pattern of bad tracks across all the disks can be summarized as follows:

- Except for one naturally occurring bad track on the disk with SN YHJ6LEUD, all other bad tracks fall within the Q-parity blocks of the RAID-6.

- The bad-track regions mostly consist of complete 256-sector blocks, which is exactly the stripe-unit size used when the RAID-6 was created.

- Bad tracks appear in active areas and not necessarily in inactive areas. For example, the hot spare, which was online less than 10% of the time, has far fewer bad tracks than the other online disks (imaging the hot spare took 4 hours, versus about 40 hours for each of the other bad disks).

- All other, non-Q-parity areas are intact, with no failures.

Conclusion:

Overall, it can be inferred from the pattern above that the bad tracks arose while the controller was generating Q-parity: the IO commands it issued to the hard disks may have been non-standard, the disks' internal handling of them went wrong, and the result was this regular pattern of bad tracks.

2. Summary of data recovery

There were so many bad tracks that backing up the data took a long time. The bad tracks across the whole storage caused partial damage to the final recovered data, but they did not affect the data as a whole, and the end result was within an acceptable range.

The user needed the data urgently throughout the recovery, and we arranged for engineers to work overtime so that the data was recovered in the shortest possible time. The subsequent data migration was completed jointly by our engineers and the user's engineers.

IX. Project members
