
Hadoop Operation and Maintenance Record Series (16)


This is a record of recovering a domestic telecom operator's Hadoop cluster. The failure was severe: the NameNodes of a cluster that supposedly had HA were down. The exact sequence of events is unknown; what follows is a brief reconstruction from the victim's account.

First bombshell: the disk holding the active NameNode's metadata was completely, utterly full.

Second bombshell: when the operations staff found the disk full, they ran echo "" > edit_xxxx-xxxx... against the active NameNode's edit logs.

They then found that failover to the standby was impossible, and would have been useless anyway, because the standby's metadata and edit logs dated back to May. A result too painful to look at directly.

Because I had to travel to give lectures over the weekend, I couldn't be on site, so I looked at the cluster over QQ remote assistance, off and on for seven or eight days around the Qixi Festival. Several major problems had accumulated and together led to the final tragedy.

The NameNode metadata was stored on a single disk mount, and a small one at that. On top of that, any number of people had stuffed it with all kinds of miscellaneous files: jar packages, tar archives, shell scripts.

According to their description, the standby's metadata only went up to May, which means the standby had either died long ago or had never really been started at all.

There was no ZKFC to do automatic failover, so failover must have been handled manually (and, as it turned out, there was effectively no working JournalNode either; more on that later).

As for RAID0 and LVM, those problems were ignored entirely, although they are also a big deal.

They then messed around for a whole day without result and simply could not bring the cluster up. The best they achieved was starting both NameNodes as standby, unable to transition either one to active. Through a contact they found me and asked, as a paid engagement, whether I could help with the recovery. At first I thought it would be easy, so I agreed. In fact the failure was far more complicated than I imagined; it has been the hardest cluster data recovery I have encountered so far. Because they could not say exactly what they had done after the incident, whether out of ignorance or under pressure, I could only attempt recovery from whatever data still existed.

In the first recovery attempt I underestimated their destructive power, and it failed. I started after hours on the Qixi Festival. In this attempt I found that the HA setup did not use ZKFC for automatic failover at all; everything was manual, so I installed and configured ZKFC for them along the way. Then I went to initialize sharedEdits and discovered they had never done it, meaning the original JournalNodes probably never worked. I then started ZooKeeper, then the JournalNodes, then both NameNodes: both came up as standby. After starting ZKFC, automatic failover still failed. The logs showed that the inconsistent metadata on the two NameNodes had caused a split-brain. So I used the force option hidden in haadmin to explicitly transition one NameNode to active; the other then began automatically replaying edits to recover from the split-brain. The recovery log looked like this:

INFO org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: replaying edit log: xxxx/xxxx transactions completed. (77%)
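Putting the steps above into commands, the sequence was roughly the following. This is only a sketch, assuming a Hadoop 2.x command line; nn1 and nn2 are hypothetical NameNode service IDs, not the real ones.

```bash
# On each ZooKeeper host: make sure ZooKeeper itself is running.
zkServer.sh start

# One-time HA initialization (neither step had ever been done on this cluster):
hdfs zkfc -formatZK                    # create the failover znode in ZooKeeper
hadoop-daemon.sh start journalnode     # on each JournalNode host
hdfs namenode -initializeSharedEdits   # copy local edits into the shared JournalNode directory

# Start both NameNodes (both come up as standby), then the ZKFC daemons.
hadoop-daemon.sh start namenode        # on each NameNode host
hadoop-daemon.sh start zkfc            # on each NameNode host

# Automatic failover fails because of the split-brain metadata, so force one
# NameNode to active manually.
hdfs haadmin -transitionToActive --forcemanual nn1

# Verify which node is active now.
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
```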

After a long wait, the standby NameNode's metadata was restored, but the cluster had not yet left safemode. It was late, so I stopped there and told them: wait for it to exit safemode on its own, and if it never does, force it out with safemode leave. I was taking everything far too lightly.
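For reference, the safemode check and the forced exit mentioned above are plain dfsadmin calls; a minimal sketch:

```bash
# Check whether the NameNode is still in safemode.
hdfs dfsadmin -safemode get

# Force it out of safemode if it never leaves on its own (only sensible once
# you have accepted that some blocks may be gone for good).
hdfs dfsadmin -safemode leave
```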

The next day they called to say the cluster still didn't work. I logged in and looked: still in safemode. So I forced it out, but only the active NameNode left safemode; the standby stayed stuck, and HDFS could not accept writes. At that point the block counts reported on the HDFS web UI, for both active and standby, were far below what was required, and I realized metadata had been lost. The other side insisted that they had backed up the metadata immediately after the failure, and that the misoperation had only touched the edits logs, not the fsimage. Honestly, I would rather they had wiped the fsimage and kept the edits: an fsimage can be rebuilt from edits, but once the edits are emptied there is nothing you can do.
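As an aside, the block shortfall visible on the web UI can also be confirmed from the command line; a minimal sketch:

```bash
# Summary of DataNodes, capacity, and under-replicated / missing block counts.
hdfs dfsadmin -report

# Filesystem check: reports missing and corrupt blocks per file.
hdfs fsck / | tail -n 30
```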

So, the next day, I stopped the active and standby NameNodes again, stopped ZKFC, stopped the JournalNodes, and then the NameNode refused to come up at all:

The namenode has no resources available

One glance was enough: recovery had generated a lot of new logs, and the metadata disk was full again. I had to work out with them how to move the metadata onto another disk with more free space. I still don't understand why they picked such a small disk to store the metadata and shared it with the operating system. We moved the metadata directory, changed the configuration, then started ZooKeeper, the JournalNodes, the NameNode and the standby NameNode, and used bootstrapStandby to set up the standby manually. Starting ZKFC again, HDFS recovered, and I forced it out of safemode once more. A quick touchz and rm test confirmed that HDFS could create and delete files without problems.
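A sketch of what that sequence might look like on the command line. The paths are hypothetical (/data/small for the full metadata disk, /data2 for the roomier one); the real ones were different.

```bash
# With all NameNode-related daemons stopped, relocate the metadata directory
# to the disk with free space.
mv /data/small/nn /data2/nn
# ...then point dfs.namenode.name.dir in hdfs-site.xml at /data2/nn on both NameNodes.

# Rebuild the standby's metadata from the new active instead of copying by hand.
hdfs namenode -bootstrapStandby        # run on the standby NameNode host

# Smoke test: HDFS should accept creates and deletes again.
hdfs dfs -touchz /tmp/recovery_check
hdfs dfs -rm /tmp/recovery_check
```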

After this second attempt, HDFS itself was back and usable, but part of the metadata was simply gone; nothing could change that. So, after discussing it with them, we took a second approach and tried to rebuild the metadata from the logs, and they agreed to try. They copied their backed-up edits logs and fsimage from their backup folder into the metadata directory, and we used the recover command (sketched below) to rebuild the metadata from the edits. After a period of waiting the recovery finished, but once the NameNode and standby NameNode were restarted it turned out the recovered state was even older than what we had before, so we rolled back to the previous metadata, following the second half of the first approach. The reasons are explained below.
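The "recover command" here refers to the NameNode's offline recovery mode, which replays whatever edits it can and prompts on corrupt segments; a minimal sketch, to be run only with the NameNode stopped:

```bash
# Offline metadata recovery: replays readable edits, skips or prompts on
# corrupt segments. Interactive by default.
hdfs namenode -recover
```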

In the end, the recovered metadata covered a little more than 580 TB of data, with part of the data lost.

The original active NameNode's edit logs had been emptied. Whether its fsimage had been moved, or what else had been done to it before, I have no idea. And because that disk had filled up, the fsimage on it could not really be trusted anyway.

initializeSharedEdits had never been run for the JournalNodes, nor had formatZK, so the JournalNodes never actually did anything: there were no edits under the jn directory to recover from. The recovery in the second approach was therefore done with the standby NameNode's logs, and since the standby had never really worked, those logs could only recover the metadata as it was before the so-called HA was ever set up.

Although the original standby NameNode was running, and had been manually set to standby, there was no working JournalNode behind it: the DataNodes reported blocks to it, but it received no edits, so its metadata stayed stale. The last edit log it held was from May, and the cluster was already split-brained. The NameNode HA architecture diagram (not reproduced here) is important for understanding how NameNode HA works, especially the direction in which all the arrows point.

Finally, a summary of how the problem occurred, the analysis, and the solution.

First, some abbreviations:

ANN = Active NameNode

SNN = Standby NameNode

JN = JournalNode

ZKFC = ZooKeeper Failover Controller

Occurrence of the problem:

The ANN metadata lived on a small hard disk, with only a single copy. The disk filled up, and the operator ran echo "" > edits.... against the edits files on the ANN.

HA had been set up at the beginning, but initializeSharedEdits and formatZK were never run, so although the JNs were started they never did anything, and the SNN never worked either; it was a standby in name only. As a result there were no usable edits logs on the JNs.

The backup the operator made after the incident was really just the SNN's logs and metadata, because the ANN's edit log had been emptied; and since the ANN's disk had filled up, even a backup taken from it could not be trusted.

Recovery of the problem:

Restore the backed-up fsimage, or rebuild the fsimage from the edits logs.

Restore the fsimage and restart the NN, JN and other related processes. In safemode, Hadoop will try to repair the split-brain on its own, based on the current active NameNode's metadata.

If both the metadata and the edits are lost, please find God to solve it. What made this case so troublesome is that if the edits logs had merely been rm'd, then on an ext3 or ext4 filesystem, stopping all reads and writes immediately gives you a chance of getting them back; but with echo "" > edits the files are truncated in place and there is nothing to recover. And every worst case came together at once: the ANN disk was full, the logs were wiped, metadata was lost, the SNN never worked, and the JNs never worked.
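For the rm case only (not for truncation), an offline undelete attempt is sometimes worth trying. This is a generic ext3/ext4 sketch with the extundelete tool, not something that could help in this incident; /dev/sdb1 and the mount point are hypothetical.

```bash
# Unmount (or remount read-only) as soon as the deletion is noticed, so the
# freed blocks are not reused by new writes.
umount /data/nn_disk

# Attempt to recover the deleted files under the NameNode metadata directory;
# recovered files are written to ./RECOVERED_FILES/.
extundelete /dev/sdb1 --restore-directory nn/current
```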

Summary of the problem:

Party A, the one paying for the project, has no idea what Hadoop is beyond having heard the word, and knows nothing at all about the operational details.

Party B, the contractor running the project, knows a little more than Party A, but not much: they know some operational details, but only enough to keep the cluster running, with almost no concept of operations or tuning.

Party B's senior leadership believed that Hadoop could be learned and understood on the job, in the course of using it. What they didn't realize is that if a Hadoop deployment is not planned systematically and carefully up front, the trouble later on is enormous. Moreover, everyone at Party B was working overtime on Party A's data analysis tasks and had essentially no time to learn how the system works or how to maintain it. Otherwise no one would ever have run the command that emptied the edits; and according to Party B's liaison, the same thing had been done before, just never with consequences this serious. (Which is why I suspect other fatal operations were performed after the logs were emptied that they never told me about.)

A production Hadoop cluster involves an enormous number of details in the initial hardware and software planning, spanning the network, the servers, and the operating system; miss any part of them and major problems may surface later. RAID0 versus LVM, for example, is actually a big issue, yet many people pay it no attention. Yahoo's benchmarks showed JBOD performing 30% to 50% better than RAID for this workload, and JBOD does not amplify a single-disk failure without bound the way striping does, but I find few people care about such details. Many production clusters are built on RAID0 or RAID50.

Hadoop training is also a mixed bag: some courses actually teach that configuring the number of map and reduce slots is useless, which is either plain deception or a deliberate attempt to mislead people.

This was an extremely difficult cluster data recovery. The end result was recovering metadata for roughly 580 TB of data and fixing the split-brain. No exotic commands were used anywhere in the process: everything was an ordinary haadmin or dfsadmin command visible on the hadoop command line. Throughout the recovery, each server needed an extra CRT session open just to watch the logs of every process, so the recovery strategy and method could be adjusted at any moment. Last but not least, the seen_txid file matters a great deal.
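For context, seen_txid is a small plain-text file in the NameNode's current directory that records the last transaction ID the NameNode has seen; if it points past the edits that actually exist on disk, the NameNode will refuse to start. A minimal way to inspect it (the path is hypothetical):

```bash
# Show the last transaction id the NameNode believes it has seen.
cat /data2/nn/current/seen_txid

# Compare with the edit log segments actually present on disk.
ls /data2/nn/current/ | grep edits | tail
```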

The whole job was done remotely over QQ, through countless network drops: two days of work from Beijing and one day from Shanghai. To protect the parties' privacy, they are referred to here only as Party A and the victim.
