
What to do when the Namenode fails to start normally after an abnormal stop in Hadoop?


This article explains what to do when the Namenode in Hadoop cannot start normally after an abnormal stop. The content is simple and clear and easy to learn. Please follow the editor's train of thought and work through the problem step by step.

Our company runs CDH5 in HA mode in production, with two Namenode nodes. The Standby node stopped because of some errors related to the edits files, and during startup it kept reporting that various files could not be found.

At first I suspected that the problem existed only on the Standby node itself, so I tried bootstrapStandby to reinitialize it, but the problem remained. Later, because I tried restarting the ZKFC (Zookeeper Failover Controller) service, the Active node switched over automatically. When the switch failed and I tried to switch back, the service could not be started, and the error was exactly the same as on the Standby node, so the whole Hadoop cluster was down.

The problem was so serious that, after searching all over Google without finding any useful information, I had to turn to the boss for help. Finally, the boss came up with an idea: decompile the fsimage (metadata) file and the edits (edit log) file into text, look at what is inside them, and figure out why loading the edits file triggers an error.

This idea turned out to be the light at the end of the tunnel, and it eventually allowed us to repair the entire cluster.
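For reference, fsimage and edits files can be dumped to readable text with the offline viewers that ship with HDFS. This is only a sketch; the file names below are hypothetical, and the actual names in your current/ directory will differ.

# Dump an fsimage file to XML with the Offline Image Viewer (file name is hypothetical)
hdfs oiv -p XML -i fsimage_0000000000240823072 -o fsimage.xml
# Dump an edits segment to XML with the Offline Edits Viewer
hdfs oev -i edits_0000000000240823073-0000000000240838096 -o edits.xml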

Environment introduction:

idc2-server1: namenode, journalnode, zkfc

idc2-server2: namenode, journalnode, zkfc

idc2-server3: journalnode, resourcemanager

Specific process:

First, the following error appeared on the Standby Namenode, after which the process shut down abnormally:

2014-11-11 02:12:14,057 FATAL org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error encountered while tailing edits. Shutting down standby NN.
java.io.FileNotFoundException: File does not exist: /user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000015_0/part-00015
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)

It mentions "Unknown error encountered while tailing edits. Shutting down standby NN."

So we tried to start the Standby Namenode service, and it reported the following error:

2014-11-12 04:… INFO org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Fast-forwarding stream 'http://idc2-server10.heylinux.com:8480/getJournal?jid=idc2&segmentTxId=240823073&storageInfo=-55%3A1838233660%3A0%3ACID-d77ea84b-1b24-4bc2-ad27-7d2384d222d6' to transaction ID 240741256
2014-11-12 04:… ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=/user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000015_0/part-00015, replication=3, mtime=1415671845582, atime=1415670522749, blockSize=134217728, blocks=[], permissions=oozie:hdfs:rw-r--r--, aclEntries=null, clientName=, clientMachine=, opCode=OP_CLOSE, txid=240823292]
java.io.FileNotFoundException: File does not exist: /user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000015_0/part-00015
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
2014-11-12 04:… WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage
java.io.FileNotFoundException: File does not exist: /user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000015_0/part-00015
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)

It says that the file "/user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000015_0/part-00015" cannot be found.

In fact, this file was temporary and unimportant, and it had already been deleted.

But the log above also reports "ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp", from which you can see that the edits file is being loaded, and that the OP_CLOSE operation in it complains that the file cannot be found.

At first we suspected that there might be a problem with the fsimage or edits files on the Standby node, so we executed bootstrapStandby on it; this process automatically fetches the latest fsimage from the Active Namenode and downloads and replays the new edits files from the JournalNode servers.

sudo -u hdfs hadoop namenode -bootstrapStandby

However, after the initialization, the same error as above still occurred when loading the edits.

Next, because I tried restarting the ZKFC (Zookeeper Failover Controller) service, the Active Namenode automatically switched to Standby, but the original Standby could not take over. When the previous Active Namenode was switched back, it could not start normally either, and the error was the same as the one seen when starting the Standby.
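For context, the HA state of the two Namenodes can be checked and switched with hdfs haadmin. This is only an illustrative sketch; the namenode IDs nn1/nn2 below are assumptions and must match dfs.ha.namenodes.<nameservice> in your configuration.

# Check which Namenode is currently active/standby (nn1/nn2 are hypothetical IDs)
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
# Manually fail over from nn1 to nn2 (only meaningful when the target can take over)
hdfs haadmin -failover nn1 nn2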

As a result, the whole Hadoop cluster was down. After searching all over Google without finding anything useful, I called the boss, who also could not find anything useful by googling the error above. So the boss tried to grep the path above in the edits files and found several related edits files:

# cd /data1/dfs/nn/
# cp -rpa current current.backup.orig
# cd /data2/dfs/nn/
# cp -rpa current current.backup.orig
# cd /data1/dfs/nn/current
# grep attempt_1415171013961_37060_m_000015_0 *
Binary file edits_0000000000240687057-0000000000240698453 matches
Binary file edits_0000000000240823073-0000000000240838096 matches
Binary file edits_inprogress_0000000000244853266 matches

So we wondered if we could find some clues in these edits or fsimage files.

Among them, one description of the edits file gave us great hope:

In case there is some problem with hadoop cluster and the edits file is corrupted it is possible to save at least part of the edits file that is correct. This can be done by converting the binary edits to XML, edit it manually and then convert it back to binary.

From this description we learned that the edits file may be corrupted, and that one possible fix is to decompile it, modify it manually, and then convert it back to binary format to replace the original. So we copied the two related edits files found above and decompiled them:

mkdir /tmp2/
cd /data1/dfs/nn/current
cp edits_0000000000240687057-0000000000240698453 /tmp2/
cp edits_0000000000240823073-0000000000240838096 /tmp2/
cd /tmp2/
hdfs oev -i edits_0000000000240687057-0000000000240698453 -o edits_0000000000240687057-0000000000240698453.xml
hdfs oev -i edits_0000000000240823073-0000000000240838096 -o edits_0000000000240823073-0000000000240838096.xml

After decompilation, two XML files were generated. We searched them for "/user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000015_0/part-00015" and found the related OP_DELETE and OP_CLOSE records:

<RECORD>
  <OPCODE>OP_DELETE</OPCODE>
  <DATA>
    <TXID>240818498</TXID>
    <LENGTH>0</LENGTH>
    <PATH>/user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000015_0/part-00015</PATH>
    <TIMESTAMP>1415671972595</TIMESTAMP>
    <RPC_CLIENTID>4a38861d-3bee-40e6-abb6-d2b58f313781</RPC_CLIENTID>
    <RPC_CALLID>676</RPC_CALLID>
  </DATA>
</RECORD>

<RECORD>
  <OPCODE>OP_CLOSE</OPCODE>
  <DATA>
    <TXID>240823292</TXID>
    <LENGTH>0</LENGTH>
    <INODEID>0</INODEID>
    <PATH>/user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000015_0/part-00015</PATH>
    <REPLICATION>3</REPLICATION>
    <MTIME>1415671845582</MTIME>
    <ATIME>1415670522749</ATIME>
    <BLOCKSIZE>134217728</BLOCKSIZE>
    <CLIENT_NAME></CLIENT_NAME>
    <CLIENT_MACHINE></CLIENT_MACHINE>
    <PERMISSION_STATUS>
      <USERNAME>oozie</USERNAME>
      <GROUPNAME>hdfs</GROUPNAME>
      <MODE>420</MODE>
    </PERMISSION_STATUS>
  </DATA>
</RECORD>

As you can see, for "/user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000015_0/part-00015", the OP_DELETE appears before the OP_CLOSE, which is why "File does not exist" is reported when the OP_CLOSE is replayed.
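A quick way to locate such records in the decompiled XML is to grep for the attempt directory with some surrounding context; this is only a sketch, and the amount of context is arbitrary.

# Show matching lines plus surrounding context so the enclosing <RECORD> blocks are visible
grep -n -B 5 -A 25 'attempt_1415171013961_37060_m_000015_0/part-00015' edits_0000000000240823073-0000000000240838096.xml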

So we tried to replace this OP_CLOSE record with something harmless, such as a permission change on a file that still exists, keeping TXID 240823292 so that the edits file remains complete and consistent.

<RECORD>
  <OPCODE>OP_SET_PERMISSIONS</OPCODE>
  <DATA>
    <TXID>240823292</TXID>
    <SRC>/user/oozie-heylinux/.staging/job_1415171013961_37194</SRC>
    <MODE>504</MODE>
  </DATA>
</RECORD>

After the modification was completed, we converted the XML file back to binary format:

cd /tmp2/
cp edits_0000000000240823073-0000000000240838096.xml edits_0000000000240823073-0000000000240838096.xml.orig
vim edits_0000000000240823073-0000000000240838096.xml
hdfs oev -i edits_0000000000240823073-0000000000240838096.xml -o edits_0000000000240823073-0000000000240838096 -p binary
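As an optional sanity check (not part of the original procedure), the patched binary can be converted back to XML and compared against the edited XML to confirm the round trip preserved the change:

# Convert the freshly generated binary back to XML and compare it with what we edited
hdfs oev -i edits_0000000000240823073-0000000000240838096 -o verify.xml
diff edits_0000000000240823073-0000000000240838096.xml verify.xml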

Then copy the binary file back into the Namenode metadata directories and synchronize it to the JournalNode servers:

cd /var/hadoop/data/dfs/jn/idc2prod/
cp -rpa current current.backup.1
cd /tmp2/
cp edits_0000000000240823073-0000000000240838096 /data1/dfs/nn/current/
cp edits_0000000000240823073-0000000000240838096 /data2/dfs/nn/current/
cp edits_0000000000240823073-0000000000240838096 /var/hadoop/data/dfs/jn/idc2prod/current/
scp edits_0000000000240823073-0000000000240838096 root@idc2-server2:/var/hadoop/data/dfs/jn/idc2prod/current/
scp edits_0000000000240823073-0000000000240838096 root@idc2-server3:/var/hadoop/data/dfs/jn/idc2prod/current/
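To make sure every copy is identical, a checksum comparison across the local directories and the remote JournalNodes can help; this is an illustrative sketch using the same paths as above.

# Compare checksums of the patched segment in all local and remote locations
md5sum /data1/dfs/nn/current/edits_0000000000240823073-0000000000240838096 \
       /data2/dfs/nn/current/edits_0000000000240823073-0000000000240838096 \
       /var/hadoop/data/dfs/jn/idc2prod/current/edits_0000000000240823073-0000000000240838096
ssh root@idc2-server2 md5sum /var/hadoop/data/dfs/jn/idc2prod/current/edits_0000000000240823073-0000000000240838096
ssh root@idc2-server3 md5sum /var/hadoop/data/dfs/jn/idc2prod/current/edits_0000000000240823073-0000000000240838096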

Then we started the Namenode service and saw that the previous error was gone, replaced by the same kind of error for a different file:

2014-11-12 08:… INFO org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Fast-forwarding stream 'http://idc2-server1.heylinux.com:8480/getJournal?jid=idc2prod&segmentTxId=240823073&storageInfo=-55%3A1838233660%3A0%3ACID-d77ea84b-1b24-4bc2-ad27-7d2384d222d6' to transaction ID 240299210
2014-11-12 08:… ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=/user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000018_0/part-00018, replication=3, mtime=1415671845675, atime=1415670519537, blockSize=134217728, blocks=[], permissions=oozie:hdfs:rw-r--r--, aclEntries=null, clientName=, clientMachine=, opCode=OP_CLOSE, txid=240823337]
java.io.FileNotFoundException: File does not exist: /user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000018_0/part-00018
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
2014-11-12 08:… WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage
java.io.FileNotFoundException: File does not exist: /user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000018_0/part-00018
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)

The next step was simply to repeat the above actions. Sometimes a pattern emerges: OP_CLOSE records repeatedly reported for files under the same directory can be replaced in batches. More often, though, the files are random, so you have to modify the XML file again and again, convert it back to binary, start the Namenode, make the next targeted change, and keep repeating until the Namenode finally starts successfully.
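As a rough way to judge whether a batch replacement is worthwhile, you can count how many records in the decompiled XML still reference the same deleted directory; this is just an illustrative sketch against the file name used above.

# Count remaining references to the deleted temporary directory in the decompiled edits
grep -c '2014-10-30-research-with-confict-fix-bug-rerun/_temporary' edits_0000000000240823073-0000000000240838096.xml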

During the actual operation we also ran into errors about OP_ADD_BLOCK, which were caused by problems with some OP_UPDATE_BLOCK records when the previous edits file was converted back to binary.

We replaced the offending parts in the same way as above, and then successfully converted the edits file back to binary.

The specific fix is to locate the relevant OP_ADD_BLOCK record according to the "Encountered exception on operation AddBlockOp" message and replace it.

2014-11-12 18:… ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation AddBlockOp [path=/user/dong/data/dpp/classification/gender/vw-input/2014-10-30-research-with-no-confict-fix-bug-rerun/all_labelled/_temporary/1/_temporary/attempt_1415171013961_42350_m_001474_0/part-m-01474, penultimateBlock=NULL, lastBlock=blk_1109647729_35920089, RpcClientId=, RpcCallId=-2]
java.lang.IllegalStateException
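To find the corresponding record in the decompiled XML, grepping for the block ID from the error (here blk_1109647729) with some context works the same way as before; a sketch, where the segment file name is carried over from above and may differ in practice:

# Locate the OP_ADD_BLOCK record that references the problematic block
grep -n -B 10 -A 10 '1109647729' edits_0000000000240823073-0000000000240838096.xml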

Finally, after the Namenode started successfully, a large number of missing Blocks were reported. The fix is to clean up these bad Blocks with fsck.

hadoop fsck / -files -blocks -locations | tee -a fsck.out

Then, get all the affected paths and Block information from fsck.out, and run "hadoop fsck" with the "-move" option on them to clean up the bad Blocks.
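One possible way to drive this cleanup (an illustrative sketch, not the exact commands from the original write-up): list the files that still have corrupt or missing blocks, then move each affected path to /lost+found, or remove it outright with -delete.

# List files that currently have corrupt/missing blocks
hdfs fsck / -list-corruptfileblocks
# Move a specific affected path out of the way (path is hypothetical); use -delete to remove it instead
hdfs fsck /user/dong/data/dpp/classification/gender/vw-output-train -move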

Finally, leave safemode, and life is good again.

hadoop dfsadmin -safemode leave
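As a final check (not part of the original write-up), the safemode state and overall filesystem health can be confirmed afterwards:

# Confirm the Namenode has left safemode
hadoop dfsadmin -safemode get
# Run a full filesystem check and look at the summary at the end of the report
hadoop fsck / | tail -n 20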


Thank you for reading. The above is the content of "what to do when the Namenode fails to start normally after an abnormal stop in Hadoop". After studying this article, I believe you have a deeper understanding of the problem, though the specific steps still need to be verified in practice. The editor will continue to push more articles on related knowledge points; welcome to follow!
