How to deal with UNHEALTHY nodes on YARN

In this article I share how to deal with UNHEALTHY nodes on YARN. I hope you get something out of it; let's work through it together.

1. The error

I have three virtual machines: hadoop001, hadoop002, and hadoop003.

Looking at the web UI on port 23188, I find Unhealthy Nodes, and the number of active nodes is wrong.

Check from the command line as well:

$ yarn node -list -all
Total Nodes:4
         Node-Id      Node-State    Node-Http-Address    Number-of-Running-Containers
 hadoop001:34354       UNHEALTHY      hadoop001:23999                               0
 hadoop002:60027         RUNNING      hadoop002:23999                               0
 hadoop001:50623       UNHEALTHY      hadoop001:23999                               0
 hadoop003:39700       UNHEALTHY      hadoop003:23999                               0
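
If you need to see why a particular node was marked unhealthy, the node status report includes a health report. A quick check, assuming the node id taken from the listing above (the exact fields printed can vary by Hadoop version):

$ yarn node -status hadoop001:34354
# Look at the "Health-Report" and "Last-Health-Update" fields in the output;
# for a bad disk they typically point at the offending local-dirs / log-dirs.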

2. Log check

Looking at the ResourceManager log, you can see:

2016-09-10 12:02:... INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added node hadoop002:60027 cluster capacity: ...
2016-09-10 12:02:05,990 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 1 local-dirs are bad: /data/disk1/data/yarn/local; 1 log-dirs are bad: /opt/beh/logs/yarn/userlog
2016-09-10 12:02:... INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: hadoop001:50623 Node Transitioned from RUNNING to UNHEALTHY
(the "Node Transitioned from RUNNING to UNHEALTHY" line repeats many times in the log)

Checking the NodeManager log, you can see:

2016-09-10 12:02:02,869 INFO org.mortbay.log: jetty-6.1.26.cloudera.4
2016-09-10 12:02:... INFO org.mortbay.log: Extract jar:file:/opt/beh/.../hadoop-...-2.6.0....jar!/webapps/node to /tmp/Jetty_0_0_0_0_23999_node____tgfx6h/webapp
2016-09-10 12:02:03,242 INFO org.mortbay.log: Started HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:23999
2016-09-10 12:02:03,242 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /node started at 23999
2016-09-10 12:02:03,735 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules
2016-09-10 12:02:03,775 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out 0 NM container statuses: []
2016-09-10 12:02:03,783 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registering with RM using containers: []
2016-09-10 12:02:... INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2016-09-10 12:02:03,822 INFO org.apache.hadoop.io.retry.RetryInvocationHandler: Exception while invoking registerNodeManager of class ResourceTrackerPBClientImpl over rm2 after 1 fail over attempts. Trying to fail over after sleeping for 2138ms.
java.net.ConnectException: Call From hadoop002/192.168.30.22 to hadoop002:23125 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
    at org.apache.hadoop.ipc.Client.call(Client.java:1472)
    at org.apache.hadoop.ipc.Client.call(Client.java:1399)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at com.sun.proxy.$Proxy27.registerNodeManager(Unknown Source)
    at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy28.registerNodeManager(Unknown Source)
    at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:257)
    at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:191)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:264)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:463)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:509)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
    at org.apache.hadoop.ipc.Client.call(Client.java:1438)
    ... 19 more
2016-09-10 12:02:05,965 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to rm1
2016-09-10 12:02:05,996 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Rolling master-key for container-tokens, got key with id -1513537506
2016-09-10 12:02:... INFO org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM: Rolling master-key for container-tokens, got key with id 701920721
2016-09-10 12:02:05,999 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as hadoop002:60027 with total resource of ...
2016-09-10 12:02:... INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Notifying ContainerManager to unblock new container-requests

3. Error analysis

By default, the NodeManager checks the local disks (local-dirs) every two minutes to determine which directories are still usable. Note that once a disk has been judged bad, it will not be marked good again until the NodeManager is restarted, even if the disk has since recovered. When the number of good disks falls below a certain threshold, the node is marked UNHEALTHY and no more tasks are assigned to it.
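
These thresholds are controlled by NodeManager settings in yarn-site.xml. A minimal sketch, assuming the stock property names; the values shown are the usual defaults (2-minute check interval, at least 25% of disks healthy, a disk counted as bad above 90% utilization), not values taken from this cluster:

<!-- yarn-site.xml: NodeManager disk health checker (values are the usual defaults) -->
<property>
  <!-- How often the NodeManager runs the disk health check, in ms (120000 = 2 minutes) -->
  <name>yarn.nodemanager.disk-health-checker.interval-ms</name>
  <value>120000</value>
</property>
<property>
  <!-- Fraction of local-dirs/log-dirs that must stay healthy before the node is marked UNHEALTHY -->
  <name>yarn.nodemanager.disk-health-checker.min-healthy-disks</name>
  <value>0.25</value>
</property>
<property>
  <!-- A disk is considered bad once its utilization exceeds this percentage -->
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>90.0</value>
</property>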

Checking the disks of the virtual machines shows that the disks on hadoop001 and hadoop003 are almost full. After clearing out unneeded files to free up space, the UNHEALTHY nodes return to normal almost immediately.
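
The disk check and cleanup itself is plain shell work; a minimal sketch, assuming the local-dirs and log-dirs paths reported in the ResourceManager log above:

$ df -h                                    # confirm which filesystem is (nearly) full
$ du -sh /data/disk1/data/yarn/local/*     # what is taking the space under local-dirs
$ du -sh /opt/beh/logs/yarn/userlog/*      # and under log-dirs
# delete only what is confirmed to be unneeded (old application logs, stale cache files)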

$ yarn node -list -all
Total Nodes:4
         Node-Id      Node-State    Node-Http-Address    Number-of-Running-Containers
 hadoop001:34354         RUNNING      hadoop001:23999                               0
 hadoop002:60027         RUNNING      hadoop002:23999                               0
 hadoop003:39700         RUNNING      hadoop003:23999                               0
 hadoop001:50623            LOST      hadoop001:23999                               0

Why are there two hadoop001 entries here? Because the configuration file was modified and the node was restarted once, two entries appear: one in LOST state and one in normal RUNNING state. This does not affect use, and after YARN is restarted the list returns to normal.
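
To clear the stale LOST entry, a YARN restart is enough. A minimal sketch, assuming the standard Hadoop 2.x sbin scripts rather than any cluster-specific management tooling:

$ $HADOOP_HOME/sbin/stop-yarn.sh     # stops the ResourceManager and all NodeManagers
$ $HADOOP_HOME/sbin/start-yarn.sh    # starts them again
$ yarn node -list -all               # verify that only the expected nodes are listed as RUNNING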

After reading this article, I believe you have a better idea of how to deal with UNHEALTHY nodes on YARN. Thank you for reading!
