Data recovery method and process for a company's HP EVA4400 storage after hard disks went offline


2025-04-27 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

I. Fault description

The EVA storage system consists of one EVA4400 controller, three EVA4400 expansion enclosures and 28 FC 300 GB hard drives. Two disks dropped offline, which left some of the storage LUNs unavailable and caused others to go missing, so the entire storage system became unusable. After receiving the disks, the North Asia engineers first inspected every disk physically and found no physical faults. They then ran a bad sector detection tool against each disk and found no bad sectors either. The bad sector detection log is as follows:

Figure 1:

II. Backing up the data

For data safety and recoverability, all source data must be backed up before recovery begins, in case an improper operation makes the data unrecoverable. All disks were imaged to files with WinHex. Because the source disks hold a large amount of data, the backup took a long time. Part of the backed-up data is shown below:

Figure 2:
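The imaging in this case was done with WinHex. Purely as an illustration of what a sector-level image involves, the following minimal Python sketch copies a source device or file to an image file in chunks and computes a SHA-256 digest so the copy can be checked later. The function name `image_disk`, the chunk size and the choice of SHA-256 are assumptions made for this example and are not taken from the original workflow.

```python
import hashlib

def image_disk(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src_path to dst_path chunk by chunk.

    Returns (bytes_copied, sha256_hex). Hypothetical helper for illustration;
    the case described in the article used WinHex for this step.
    """
    digest = hashlib.sha256()
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break  # end of the source reached
            dst.write(chunk)
            digest.update(chunk)
            copied += len(chunk)
    return copied, digest.hexdigest()
```

Recording the digest next to each image makes it possible to confirm later that the working copy still matches what was read from the source disk.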

III. Fault analysis and recovery process

1. Analyze the cause of the failure

Since neither physical faults nor bad sectors were detected in the first two steps, the failure was probably caused by unstable reads and writes on some disks. The EVA controller applies a strict disk-checking policy: once a disk performs unstably, the controller treats it as a bad disk and kicks it out of the disk group. Once the number of dropped disks in the same stripe of a LUN exceeds the limit that stripe can tolerate, the LUN becomes unavailable. If every LUN in the EVA includes the dropped disks, every LUN is affected, so it is expected that two dropped disks can make all LUNs on the storage unavailable. The current situation is that there are 8 LUNs, with 7 LUNs damaged and 6 LUNs lost. All LUN data needs to be recovered.

2. Analyze the structure of the LUNs

An HP EVA LUN stores its data in the form of RAID entries. The EVA combines blocks from different disks into a RAID entry, and a RAID entry can be of several types. We need to work out which types of RAID entries make up each LUN and which blocks each RAID entry is built from. This information is stored in the LUN_MAP, and every LUN has its own LUN_MAP. The EVA stores the LUN_MAPs on different disks and uses an index to record where each one is located. By finding the index that points to the LUN_MAPs on each disk, we can obtain the information about the existing LUNs.

3. Analyze the missing LUNs

The index that points to the LUN_MAPs is recorded on disk, but it covers only the existing LUNs; the missing LUNs have no index entry. This is because deleting a LUN in the EVA clears only the index entry of that LUN, not its LUN_MAP. We therefore need to scan all disks for every data block that matches the LUN_MAP format and then exclude the LUN_MAPs of the existing LUNs. The remaining LUN_MAPs do not all belong to the deleted LUNs, because some of them are simply old, and the LUN_MAP itself offers no way to tell them apart. The only option is to restore the data of every remaining LUN_MAP with a program and check manually which of the results are the deleted LUNs.
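The scan-and-exclude step can be sketched roughly as follows. The on-disk layout of a LUN_MAP is not described in this article, so the sketch assumes, purely for illustration, that a candidate block can be recognized by a fixed byte signature at the start of a block. The `signature` value, the block size and both function names are hypothetical placeholders, not the real EVA format or the real Scan_Map.exe logic.

```python
def find_candidate_blocks(image_path, signature, block_size=512):
    """Return byte offsets of blocks in image_path that begin with signature.

    `signature` is a hypothetical marker; the real LUN_MAP format is not
    documented in the article.
    """
    offsets = []
    with open(image_path, "rb") as img:
        offset = 0
        while True:
            block = img.read(block_size)
            if not block:
                break
            if block.startswith(signature):
                offsets.append(offset)
            offset += len(block)
    return offsets

def unindexed_candidates(found_offsets, indexed_offsets):
    """Drop candidates already referenced by the on-disk index.

    What remains may be deleted LUN_MAPs or merely stale ones, so it still
    needs the manual review described above.
    """
    indexed = set(indexed_offsets)
    return [off for off in found_offsets if off not in indexed]
```

In this sketch `indexed_offsets` stands for the locations obtained from the index of existing LUNs, and the returned list is the set of leftovers that must be restored and inspected by hand.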

4. Analyze the offline disks

As mentioned in the fault analysis above, the disks show no obvious physical faults and no bad sectors, yet they were still kicked out of the EVA disk group for performance reasons. These dropped disks contain stale data, so they must be excluded when the data is generated. But how can we tell which disks dropped offline? Since most of the LUNs use a RAID5 structure, we only need to compute the parity of a LUN's RAID entry with the RAID5 parity algorithm and compare it with the parity stored on disk to determine whether the entry contains a dropped disk. Checking every RAID entry in a LUN's LUN_MAP shows which RAID entries of that LUN contain a dropped disk, and the disk that appears in all of those RAID entries must be the dropped disk. Exclude the dropped disks, and then recover all LUN data according to the LUN_MAPs.
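The parity comparison described above can be sketched as follows. In RAID5 the parity block is the XOR of the data blocks, so XOR-ing every member of a healthy entry (data plus parity) yields all zero bytes. The data layout used here, one dictionary per RAID entry that maps a disk identifier to that disk's block, is an assumption made for the example and does not reflect the real EVA structures or the real Chk_Raid.exe.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def entry_is_consistent(blocks):
    """A RAID5 entry is consistent when data XOR parity is all zero bytes."""
    return not any(xor_blocks(blocks))

def suspect_disks(entries):
    """Return the disks that take part in every inconsistent RAID entry.

    entries: list of dicts mapping a (hypothetical) disk id to its block.
    """
    bad = [entry for entry in entries if not entry_is_consistent(entry.values())]
    if not bad:
        return set()
    common = set(bad[0])
    for entry in bad[1:]:
        common &= set(entry)  # keep only disks present in all bad entries
    return common
```

The more inconsistent entries there are, the smaller the intersection becomes. With only a few entries the result may still contain healthy disks, which is one reason the article combines the program output with manual analysis.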

5. Write the data recovery programs

The analysis and solution described above ultimately have to be implemented in code. Write a LUN_MAP scanning program, Scan_Map.exe, to scan for all LUN_MAPs, and combine its output with manual analysis to obtain the most accurate LUN_MAPs. Write a RAID entry checking program, Chk_Raid.exe, to detect all offline disks in each LUN, and combine its output with manual analysis to exclude the offline disks. Write a LUN data recovery program, Lun_Recovery.exe, to recover all LUN data based on the LUN_MAPs.

6. Recover all LUN data

Run the programs written for each of these functions, and finally use Lun_Recovery.exe together with the LUN_MAPs to restore all LUN data. Then check each LUN manually to make sure it matches the description given by Party A's engineer. Part of the recovered LUN data is shown below:

Figure 3:
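When a stale disk is excluded from a RAID5 entry, the block it should have contributed can be recomputed from the remaining members, because any one block of the entry equals the XOR of all the others. The sketch below shows only that single step. It is an illustrative assumption about how such a rebuild can be coded, not the actual implementation of Lun_Recovery.exe, and the function name is hypothetical.

```python
def rebuild_block(surviving_blocks):
    """Rebuild the one excluded block of a RAID5 entry.

    surviving_blocks: equal-length byte strings from every remaining member
    of the entry (data and parity alike). Their XOR is the missing block.
    """
    rebuilt = bytearray(len(surviving_blocks[0]))
    for block in surviving_blocks:
        for i, byte in enumerate(block):
            rebuilt[i] ^= byte
    return bytes(rebuilt)
```

This works for exactly one missing member per entry, which matches the RAID5 tolerance of a single failed disk per stripe.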

IV. Data verification

According to Party A's engineer, the LUN data falls into two groups: VMware virtual machines, and raw devices on HP-UX that store Oracle dbf database files. Because the recovery works at the LUN level, the files inside the LUNs are not visible, so we have to check manually which LUNs hold VMware data and which are HP-UX raw devices. Each LUN is then mounted in the matching verification environment to confirm that the recovered data is complete.

1. Deploy the verification environment for the VMware virtual machines

An ESXi 5.5 virtual host environment was installed on a Dell server, and the recovered LUNs were attached to the virtual host over iSCSI. Scanning for VMFS volumes in VMware vSphere Client found nothing, however. It later turned out that the customer's virtual host ran ESXi 3.5, so the VMFS volumes probably could not be scanned directly because of the version difference, and a different verification approach was used. All files in the LUNs that belong to VMware virtual machines were extracted, shared to the virtual host over NFS, and the virtual machines were then added to the inventory one by one. Some of the restored virtual machine files are shown below:

Figure 4:

2. Verify the VMFS virtual machines

After all virtual machines had been added to the virtual host over NFS and powered on, every virtual machine was able to boot its operating system. The integrity of the files inside the virtual machines could not be confirmed at that point because we did not have the login passwords. Party A then arranged for an engineer to access our server remotely, log in to every virtual machine and verify the data inside. The data in all virtual machines was correct, so all virtual machine data was restored successfully. Some of the booted virtual machines are shown below:

Figure 5:

3. Deploy the verification environment for the Oracle database

A working Oracle environment is needed to test the restored raw devices and to verify the data afterwards.

According to the environment information provided by Party A's engineer, the original system is an HP minicomputer with the Itanium architecture. Our own HP minicomputer is an RX2660 (Itanium 2), which is architecturally compatible, so the plan is to install single-instance Oracle software on this machine.

Operating system: HP-UX B.11.31

Database: Oracle 10.2.0.1.0 Enterprise Edition-64bit for HPUX

The following are simple steps to install the environment:

(1) Check the environment

# uname -a

HP-UX byhpux1 B.11.31 U ia64 1447541358 unlimited-user license

The machine has the IA64 architecture and runs HP-UX version B.11.31.

Then check the available storage space in each location to make sure there is enough room.

(2) Check the installation dependency packages

Check the patch bundles required by Oracle 10g according to the installation guide "b19068.pdf".

Check:

# swlist -l bundle | grep "GOLD"

# swlist -l patch | grep PHNE_31097

If a package is not found, download it from the official website and install it. Install the patch bundle:

# swinstall -s /patchCD/GOLDQPK11i -x autoreboot=true -x patch_match_target=true

(3) Create the users and groups

# groupadd dba

# useradd -g dba -d /home/oracle oracle

# passwd oracle

(4) Create the directories and modify permissions

Create a directory:

# mkdir -p /opt/oracle/product/10.2/oracledb

# chown -R oracle:dba /opt/oracle

Modify permissions:

# chown oracle:dba /usr/oracle_inst/database/

# chmod 755 /usr/oracle_inst/database/

(5) Set the environment variables

# vi /home/oracle/.profile

(6) Install Oracle

The Oracle installer requires a graphical interface, so first test that the graphical interface starts normally.

# export DISPLAY=192.168.0.1:0.0

$ ./runInstaller

Once the graphical interface is up, the installation is straightforward. Only the software is installed at this stage, not a database instance.

(7) Test the database connection

# su - oracle

$ sqlplus / as sysdba

4. Verify the Oracle database

(1) Mount the raw devices

Some of the LUNs are raw devices, but the recovered LUNs exist as files, so the file-based LUNs must be presented to the HP-UX server. To do this, install the iSCSI initiator on the HP-UX server as follows:

Check whether the package is complete

# swlist -d @ /tmp/B.11.31.03d_iSCSI-00_B.11.31.03d_HP-UX_B.11.31_IA_PA.depot

Install the package

# swinstall -x autoreboot=true -s /tmp/B.11.31.03d_iSCSI-00_B.11.31.03d_HP-UX_B.11.31_IA_PA.depot iSCSI-00

Add the iSCSI executables to PATH

# PATH=$PATH:/opt/iscsi/bin

Check whether iSCSI was installed successfully

# iscsiutil -l

Configure the iSCSI initiator name

# iscsiutil /dev/iscsi -i -N iqn.2014-10-15:LUN

Add the target iSCSI device

# iscsiutil -a -I 10.10.1.9

Delete the target iSCSI device (only if it has to be removed)

# iscsiutil -d -I 10.10.1.9

Verify that the iSCSI target was added successfully

# iscsiutil -pD

Discover the target device

# /usr/sbin/ioscan -H 255

Create the device files for the target

# /usr/sbin/insf -H 255

(2) Import the external VG information

Create the VG directory

# mkdir /dev/vgscope

Create the VG group device file

# mknod /dev/vgscope/group c 64 0x030000

Check whether the PV is normal

# pvdisplay -l /dev/dsk/c2t0d0

Import the PV into the VG

# vgimport -v /dev/vgscope /dev/dsk/c2t0d0

Activate the VG

# vgchange -a y vgscope

View the VG information

# vgdisplay -v vgscope

Figure 6:

(3) Rename the LVs

Because the VG was rebuilt in the new environment and the PV was then imported into the new VG, all LV names changed. The LVs must be renamed manually to the names shown below.

Figure 7:

The original system had two database instances, both stored on raw devices, so the database instances are created with the original configuration and names.

At the file system level, with the customer's assistance, all LVs were mounted and their permissions were modified.

Figure 8:

With the assistance of the customer's DBA, install the database instances and register all raw device files according to the original configuration.

Then adjust the configuration parameters and check the storage status of the database in preparation for starting it.

First switch to the scope instance (the most important one) and start the database:

SQL> startup mount

SQL> select file#, error from v$recover_file; -- check for corrupted files

There are no damaged files.

SQL> alter database open;

The database opened without reporting an error, although slowly. We then queried the users and ran queries against two randomly chosen tables of one user, and the result sets came back normally. Shortly afterwards the connection was suddenly interrupted. After reconnecting, the status showed that the database was closed. Starting the database again did not succeed, because it was forced to shut down each time.

Preliminary checks and routine recovery of the database state could not fix this problem.

Verify the NJYY database

Switch the environment variables to the other database, NJYY. Opening this database reports an out-of-memory error, and the database cannot be opened. Preliminary checks show that the data files are not damaged.

SQL> startup mount

SQL> select file#, error from v$recover_file;

SQL> alter database open;

Error 4030 detected in background process

5. Repair the Oracle databases

Fault repair

For the scope database, the behavior described above initially suggested a problem with the undo tablespace or the logs. A check of the integrity and consistency of the data files found that only one file, undo01.dbf, was corrupted, which confirmed that the undo tablespace was damaged. The corrupted undo tablespace was dropped with a command and rebuilt in its original location.

The other files were checked and no problems were found. After a restart the database started normally, queries returned normal results, and the integrity test passed.

A full database export was then run and completed normally after more than 3 hours.

For the NJYY database, the checks showed that the memory settings were incorrect. After they were adjusted with commands, the database returned to normal, started normally and could be used normally.

Finally, a full export of this database was run and completed normally after more than 4 hours.

Detailed verification

After the preliminary verification, Party A asked its DBA and business staff to verify the databases in more detail remotely. The verification environment and each database were verified together.

The final result confirmed that the databases were fully restored without problems.

After the data was validated, the data migration was carried out. Considering the size of the databases and the recovery time, expdp was chosen to export the full database data, because expdp is more efficient than exp.

After the export script had been written and tested without problems in the test environment, the scope database was exported first. 24 minutes after the export started, errors began to appear:

ORA-39171: Job is experiencing a resumable wait.

ORA-01654: unable to extend index SYSTEM.SYS_MTABLE_00003A964_IND_1 by 8 in tablespace SYSTEM

Investigation showed that the SYSTEM tablespace was full. An expdp export writes its export records into the SYSTEM tablespace (the object named in the error, SYSTEM.SYS_MTABLE_00003A964_IND_1, is the index that could not be extended). When a large amount of data is exported, these records keep growing, and the error is raised once the SYSTEM tablespace reaches its total capacity. Normally a tablespace extends its capacity automatically, so this error should not occur. A further query showed, however, that the SYSTEM tablespace was placed on a raw device with a capacity of 1 GB that cannot be increased. The expdp tool therefore could not be used for the export. Only the exp tool could be used, which is slower but does not run out of space in the SYSTEM tablespace.

In the end the full export of the scope database was done with exp, and the backup completed successfully after more than 6 hours. The backup file is 172 GB.

For the NJYY database, the full export completed normally after more than 7 hours, and the backup file reached 140 GB. A local copy of the database backup files was then made as a safe cold backup.

V. Handing over the data

1. Transfer the VMware virtual machine files and Oracle dump files

After all the data had been verified, the VMware virtual machine files and Oracle dump files were copied to a 2 TB Seagate hard drive, and the recovered LUN data was copied to two single 3 TB disks. On arrival at Party A's site, we first handed over the VMware virtual machine files and Oracle dump files, and Party A began to verify them.

2. Mirror the LUN data to Party A's EVA4400 storage

Because Party A required all LUN data to be restored to the original environment, the HP EVA4400 had to be reconfigured and LUNs of the same sizes as before had to be recreated. The WinHex tool was then used to mirror all the recovered LUN data to the new LUNs created on the EVA. Because of some problems with Party A's HP EVA4400 controller, reconfiguring the HP EVA4400 took a long time. After all the LUN data had been mirrored, Party A arranged for its Oracle database engineer to verify that the restored Oracle databases were normal. The check found that two missing dbf files prevented the Oracle service from starting. Analysis showed that these two dbf files had existed as ordinary files before the EVA failure and were later restored into LVs. When restoring the LVs, Party A's engineers did not rebuild the VG. All LVs were restored according to the earlier vg_map, which is why the two files were missing. The solution was to recreate the two LVs, extract the two files from the underlying LUN and write them into the newly created LVs with dd. After that the Oracle service started normally and the problem was resolved.
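The last step, taking a file's data from the underlying LUN and writing it into a newly created LV, was done with dd. As a rough illustration of what that operation does, the following Python sketch copies a byte range from a LUN image into an existing target such as an LV device file. The function name, the offsets and the paths are hypothetical, since the real locations of the two dbf files are not given in the article.

```python
def copy_range(src_path, dst_path, src_offset, length, chunk_size=1024 * 1024):
    """Copy `length` bytes starting at src_offset in src_path to the start of dst_path.

    dst_path must already exist (as a newly created LV device would), so it is
    opened for in-place update rather than being created or truncated.
    """
    remaining = length
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        src.seek(src_offset)
        while remaining > 0:
            chunk = src.read(min(chunk_size, remaining))
            if not chunk:
                raise EOFError("source ended before the requested range was copied")
            dst.write(chunk)
            remaining -= len(chunk)
    return length - remaining
```

This corresponds to a dd invocation that skips to an offset in the input and copies a fixed amount of data. The sketch always writes from the beginning of the target.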

Because the site was well preserved after the failure and no risky operations were performed, the later data recovery was much easier. The recovery ran into many technical obstacles along the way, but they were solved one by one. The entire data recovery was completed within the expected time, and Party A was quite satisfied with the recovered data.

VI. Future data security recommendations

1. Have staff inspect the machine room regularly and deal with any alarm information promptly.

2. Administrators should operate the storage carefully to avoid data loss caused by operator error.

3. Some modules of the EVA controller were found to be unstable on site and should be replaced promptly.

4. The EVA storage failure was caused by unstable disks, and the remaining disks are probably from the same batch, so their performance is likely close to its limit as well. It is recommended to replace these disks if circumstances allow.
