Record a recovery misoperation and delete the production server data 07/03 Update SLTechnology News&Howtos


2025-07-03 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Original: https://www.enmotech.com/web/detail/1/790/1.html

Introduction: after two days of unremitting effort, we finally recovered production server data that had been deleted by a misoperation. I am recording the accident and how we resolved it here, as a warning to myself and a reminder to others not to make the same mistake. I also hope that anyone facing a similar problem can find a glimmer of inspiration here.

Accident background

I had asked a young female colleague to install Oracle on a production server. She studied the procedure and installed it, but felt something had gone wrong, so she decided to uninstall it and reinstall.

She found an uninstall method on the Internet that required running a single command to delete the Oracle installation directory:

rm -rf $ORACLE_BASE/*

If the variable ORACLE_BASE is not set, the command expands to:

rm -rf /*
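This failure mode is easy to reproduce safely. The bash sketch below uses echo in place of rm, so nothing is deleted; it shows the naive expansion next to the standard ${VAR:?} guard, which makes the shell refuse to expand an unset or empty variable. The function names are illustrative only.

```bash
#!/usr/bin/env bash
# Sketch: why "rm -rf $ORACLE_BASE/*" becomes "rm -rf /*" when ORACLE_BASE
# is unset, and how ${VAR:?} refuses to expand instead. Both functions only
# echo the command they would run; nothing is deleted.

naive_command() {
  echo "rm -rf $ORACLE_BASE/*"
}

guarded_command() {
  # ${ORACLE_BASE:?...} aborts the subshell if the variable is unset or empty,
  # so the command line is never produced.
  ( echo "rm -rf ${ORACLE_BASE:?ORACLE_BASE is not set}/*" ) 2>/dev/null \
    || echo "refused: ORACLE_BASE is not set"
}

unset ORACLE_BASE
naive_command     # prints: rm -rf /*
guarded_command   # prints: refused: ORACLE_BASE is not set
```

In a real script the guarded form is usually written rm -rf "${ORACLE_BASE:?}"/*. Putting set -u at the top of a script catches unset (though not empty) variables in the same spirit.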

As luck would have it, she was logged in as root. So every file on the whole disk was deleted, including the Tomcat applications, the MySQL database, and so on.

Wasn't the MySQL database running? Can Linux delete files that are in use? Either way, everything was wiped out. In the end only one Tomcat log file survived, presumably because it was so large that it had not finished being deleted.
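To the question above: yes. On Linux, rm only removes a file's name (its directory entry); the data stays on disk until the last process holding the file open closes it, and while that process is alive the content can still be read through /proc/<pid>/fd/<n>. The Linux-only bash sketch below demonstrates this on a temporary file; the function name is illustrative.

```bash
#!/usr/bin/env bash
# Linux-only sketch: rm removes the name, but an open descriptor keeps the
# data alive and readable through /proc/<pid>/fd/<n>.

demo_deleted_but_open() {
  local tmp
  tmp=$(mktemp)
  echo "still here" > "$tmp"
  exec 3<"$tmp"                 # this shell now holds the file open on fd 3
  rm -f "$tmp"                  # unlink the name; the inode is not freed yet
  if [ -e "$tmp" ]; then
    echo "path still exists"    # not reached: the name is really gone
  fi
  cat "/proc/$BASHPID/fd/3"     # the data is still readable via the open fd
  exec 3<&-                     # closing the last descriptor frees the inode
}

demo_deleted_but_open           # prints: still here
```

This only helps while the owning process is still running. On a live system, lsof +L1 lists files that are deleted but still open; once the server has been stopped and the disk moved, as happened here, those descriptors are gone.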

Seeing the remorse in her eyes, I knew the fault was mine: I had given her the task without explaining the risks, and she had received no training, so I had to take the responsibility alone. Besides, how could I let such a pretty girl take the blame?

I called the machine room and had the disk attached to another server. Over SSH I could see that every file had been wiped. This server ran a customer's production system that had been live for more than half a year, so it had to be restored as soon as possible.

So we went to the offline database backups, only to find that the backup files were just 1 KB and contained nothing but the familiar few lines of mysqldump header comments (apparently the backup script run by cron was broken). The most recent usable backup dated from December 2013.
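A 1 KB dump file is easy to catch automatically. The check below is a minimal sketch, not the team's actual script: it assumes the dump is made with mysqldump's default comments enabled, in which case a complete dump ends with a "-- Dump completed" line, and it also rejects files below a size threshold (the 100 KB default is an arbitrary example). The function name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sanity check for a cron-produced mysqldump file.
# Assumption: mysqldump ran with its default comments, so a complete dump
# ends with a "-- Dump completed" line. The 100 KB minimum is only an example.

check_dump() {
  local file=$1 min_bytes=${2:-102400} size
  if [ ! -f "$file" ]; then
    echo "FAIL: $file does not exist"
    return 1
  fi
  size=$(( $(wc -c < "$file") ))   # arithmetic strips any padding from wc
  if [ "$size" -lt "$min_bytes" ]; then
    echo "FAIL: $file is only $size bytes"
    return 1
  fi
  if ! tail -n 5 "$file" | grep -q -- '-- Dump completed'; then
    echo "FAIL: $file has no '-- Dump completed' trailer"
    return 1
  fi
  echo "OK: $file ($size bytes)"
}
```

Run right after the nightly cron job, with a non-zero exit status wired to an alert, a check like this would flag an empty dump on the first night rather than when the backup is finally needed.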

It reminded me of a story a manager once told: when a production system went down, it turned out that every backup was bad; the burned CDs were scratched and the tape drive was broken too (he is an industry veteran, and CDs were probably used for backups back then). I never expected it to actually happen to me. What was I going to do?

Once informed, the department head drew up a worst-case Plan B: he would personally lead the team, together with product manager AA, to the customer's city on Sunday and meet the customer's leadership on Monday, while BB and CC would work through the customer's administrator to find a way to win the customer's understanding.

Grasping at a straw: ext3grep

I quickly searched the Internet for ways to recover accidentally deleted data and did find ext3grep, which can recover files deleted with rm -rf. Our disk was formatted as ext3, and there were many success stories online.

This lit up a glimmer of hope. We immediately unmounted the disk so that new writes could not overwrite the sectors of the deleted files, then downloaded and installed ext3grep (building and installing it was an ordeal, which I will skip for now).

First I ran the command that scans for file names:

ext3grep /dev/vgdata/LogVol00 --dump-names

It printed all the deleted files and their paths. I was ecstatic: the files were still there, so we would not need Plan B.

The tool cannot restore files by directory; the only bulk option is to restore everything:

ext3grep /dev/vgdata/LogVol00 --restore-all

But there was not enough free disk space for that, so we had no choice but to restore files individually. I tried several files; some were restored and some failed:

ext3grep /dev/vgdata/LogVol00 --restore-file var/lib/mysql/aqsh/tb_b_attench.MYD

My heart sank. Had new data already been written over the deleted files on disk? The chances of recovery did not look good. Still, every file we could restore counted, and perhaps the important data files happened to be among the recoverable MYD files.

So I first redirected all the file names into one file:

ext3grep /dev/vgdata/LogVol00 --dump-names > /usr/allnames.txt

Then I filtered out the file names belonging to the MySQL databases and saved them as mysqltbname.txt.
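The original does not show the filtering command. A plausible sketch follows, assuming allnames.txt holds one path per line relative to the filesystem root (the same form as the --restore-file argument above) and that the tables are MyISAM, each stored as .frm, .MYD and .MYI files. It also counts the tables and lists those with fewer than three files in the list. All function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical version of the filtering step. Assumes one path per line,
# relative to the filesystem root: var/lib/mysql/<db>/<table>.<ext>

# Keep only MyISAM table files that live inside a database directory.
filter_mysql_files() {
  grep -E '^var/lib/mysql/[^/]+/[^/]+\.(frm|MYD|MYI)$' "$1"
}

# Number of distinct tables found in the list.
count_tables() {
  echo $(( $(filter_mysql_files "$1" | sed -E 's/\.(frm|MYD|MYI)$//' | sort -u | wc -l) ))
}

# Tables that appear with fewer than three distinct files (.frm/.MYD/.MYI).
missing_parts() {
  filter_mysql_files "$1" | sort -u | sed -E 's/\.(frm|MYD|MYI)$//' \
    | sort | uniq -c | awk '$1 < 3 { print $2 }'
}

# Example: filter_mysql_files /usr/allnames.txt > mysqltbname.txt
```

With about 100 tables, count_tables should report roughly 100 and the filtered list should hold about 300 paths, which is a quick way to see how much is actually recoverable before starting a long restore run.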

Next, I wrote a script to restore the files one at a time:

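while read -r LINE
do
    echo "begin to restore file $LINE"
    ext3grep /dev/vgdata/LogVol00 --restore-file "$LINE"
    if [ $? -ne 0 ]
    then
        echo "restore failed, exit"
        # exit 1
    fi
done < ./mysqltbname.txt

I ran it. It took about 20 minutes and restored 40-odd files, which was far from enough: we had nearly 100 tables, each with three files (.frm, .MYD and .MYI), so there should have been 300 or so! I attached the recovered files to the existing database, changed their permissions to 777, and restarted MySQL, which at least brought back part of the data. But the customer's important attendance check-in data and the data reported from mobile devices (which the customer reportedly used to evaluate staff performance) were still missing. What now?

Along the way I also tried another tool, extundelete. Its syntax is basically the same as ext3grep's and it presumably works on the same principle, but it is said to be able to restore by directory. All right, let's give it a try:

extundelete /dev/vgdata/LogVol00 --restore-directory var/lib/mysql/aqsh

Sure enough, nothing could be recovered! Those files had already been destroyed. I reported to my manager that it would have to be Plan B... and, with nothing left to try, went home for the day. (It was the weekend; I would get some rest and think of something.)

A flash of inspiration: Binlog

I woke up early the next morning (too much on my mind), grabbed my laptop and went to the office (so much for the weekend: I would be lucky to escape criticism, a public notice, a fine or dismissal, never mind enjoying a weekend). I ran ext3grep and extundelete again, since those were the only tricks I had, and set the system up on a test server to see whether the missing data could somehow be patched in. On the test server I restored the mysqldump backup, copied the recovered files over it, set the file permissions, and restarted MySQL.

Wait, wait: don't we have the binlog? All our services are required to run with binary logging enabled, so maybe the data could be recovered from the binlog! I looked through the dumped file names and found three binlog files:

mysql-bin.000001
mysql-bin.000009
mysql-bin.000010

First I tried restoring 000001:

ext3grep /dev/vgdata/LogVol00 --restore-file var/lib/mysql/mysql-bin.000001

It failed... Then I looked at the other two files. mysql-bin.000010 was several hundred MB and seemed the more promising one. I ran the restore command and, to my surprise, it succeeded! I quickly copied it to the test server with scp and replayed the binlog:

mysqlbinlog /usr/mysql-bin.000010 | mysql -uroot -p

I entered the password and the command seemed to hang (a good sign). After a long wait it finally finished. I opened the application and, thank CCTV, thank MTV, the data was back!

Afterword

Although we were lucky enough to get the data back, the process was nerve-racking, and I still shudder at the consequences of my mistake and at the responsibility it brought down on my colleagues and managers. I hope to remember this accident and never make the same mistake again. My reflections on the accident are as follows:

1. When I assigned my colleague to maintain the server, I did not explain the stakes to her beforehand, and I did not take the task seriously myself; both management and process were chaotic. On a live production system, every change must be thought through before it is made.
2. The automatic backups had failed and nobody checked them. The person responsible for offline backups downloaded a 1 KB file from the server every time and never paid attention to it. Everyone's responsibilities in their role need to be made explicit.
3. The accident was not detected promptly, so some data was written to the disk afterwards, making parts of the deleted data unrecoverable. We need an application monitor that sends an SMS alert to the people responsible as soon as a service behaves abnormally.
4. One more, prompted by a comment: do not operate as root. The server should have separate users with different privilege levels.

During this incident, several colleagues who had nothing to do with the project or the accident came forward on their own to help, looking up information and helping with testing; one of them stayed until after 1 a.m. testing the data recovery. The product manager, despite the enormous pressure from the customer, did not panic or blame the developers or the person who ran the command, which let everyone calmly work on a solution. The department head also actively helped look for solutions, stayed late testing with us, and followed the progress in real time. Thanks to everyone's combined effort, the matter ended reasonably well. On Monday morning we will hold a group review to sum up the lessons learned, and we will do everything we can to avoid accidents like this.

Links to the tools used in this article:

1. ext3grep: https://code.google.com/p/ext3grep/
Building and installing it requires quite a few dependency packages; search online for installation instructions. Unfortunately, the howto the author provides is blocked by the Great Firewall; I got around the block and downloaded the howto as a PDF. Reading it will give you a deeper understanding of Linux file systems. The tool has a bug: after the following error it does not continue:

ext3grep: init_directories.cc:534: void init_directories(): Assertion `lost_plus_found_directory_iter != all_directories.end()' failed.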
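This makes the recovery fail. The author has released a patch for it (download link: "patch download"). I don't understand why the author has not included it in the newer release.

2. extundelete: http://extundelete.sourceforge.net/
Its functionality is similar to ext3grep's, and it probably works on a similar principle. It claims to be able to restore whole directories, but I did not manage to do so.

Have you ever deleted files by mistake? How did you deal with it? You are welcome to share your tips in the comments.

Author: zhouyu
Source: https://www.cnblogs.com/zhouyu629/p/3734494.html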
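The binlog replay that finally brought the data back piped mysqlbinlog straight into mysql. A slightly more cautious variant, sketched here under stated assumptions rather than as the author's procedure, first decodes the binlog to a SQL file that can be reviewed and, if needed, cut off with mysqlbinlog's --stop-datetime option before the moment of an accident, and only then applies it. The function names, the STOP_AT variable and the file names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: replay a recovered binlog in two steps instead of one pipe.
# STOP_AT is a hypothetical optional variable; when set (format
# "YYYY-MM-DD hh:mm:ss") it is passed to mysqlbinlog --stop-datetime so the
# replay ends before that moment, e.g. before an accidental DROP or DELETE.

# Step 1: decode the binary log into a plain SQL file that can be reviewed.
decode_binlog() {
  local binlog=$1 out=$2
  local args=()
  if [ -n "${STOP_AT:-}" ]; then
    args+=("--stop-datetime=$STOP_AT")
  fi
  mysqlbinlog "${args[@]}" "$binlog" > "$out"
}

# Step 2: after checking the SQL file, apply it (on a test server first).
apply_sql() {
  mysql -uroot -p < "$1"
}

# Usage:
#   decode_binlog /usr/mysql-bin.000010 replay.sql
#   less replay.sql
#   apply_sql replay.sql
```

Decoding to a file costs disk space for a binlog of several hundred MB, but it lets you grep for the statements you care about before anything touches the database.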
