This article introduces how to recover the ResourceManager in Hadoop YARN, along with related YARN features and some common errors encountered in practice.
Recovery of ResourceManager
When the ResourceManager crashes and is restarted, the tasks that were running should continue rather than be re-executed. To make this possible, YARN needs to record the state of each running application.
The running state can be stored in one of the following backends:
ZooKeeper
A FileSystem, such as HDFS
LevelDB
A typical configuration that uses ZooKeeper as the state store is:
<property>
  <description>Enable RM to recover state after starting. If true, then yarn.resourcemanager.store.class must be specified.</description>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <description>The class to use as the persistent store.</description>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <description>Comma separated list of Host:Port pairs. Each corresponds to a ZooKeeper server (e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002") to be used by the RM for storing RM state. This must be supplied when using org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore as the value for yarn.resourcemanager.store.class.</description>
  <name>hadoop.zk.address</name>
  <value>127.0.0.1:2181</value>
</property>
HA of ResourceManager
Automatic failover between multiple active/standby ResourceManagers is implemented on top of ZooKeeper. There can be only one active ResourceManager at a time, while there can be several standbys.
To prevent split-brain during automatic failover, it is recommended that the ResourceManager recovery state store described above also use ZooKeeper. At the same time, leave ZooKeeper's zookeeper.DigestAuthenticationProvider.superDigest configuration unset, so that ZooKeeper administrators cannot access YARN application/user credential information.
A sample configuration is as follows:
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>cluster1</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm1</name>
  <value>master1:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm2</name>
  <value>master2:8088</value>
</property>
<property>
  <name>hadoop.zk.address</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
Documentation: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html
YARN Node Labels
Based on labels, a YARN-managed cluster is divided into multiple partitions, and different queues can be given access to different partitions.
Documentation: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeLabel.html
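As a rough illustration only (verify against the linked documentation), enabling node labels typically involves a yarn-site.xml setting plus rmadmin commands; the HDFS path, the host name worker1, and the label name gpu below are placeholders:
<property>
  <name>yarn.node-labels.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.node-labels.fs-store.root-dir</name>
  <value>hdfs://namenode:8020/yarn/node-labels</value>
</property>
yarn rmadmin -addToClusterNodeLabels "gpu(exclusive=true)"
yarn rmadmin -replaceLabelsOnNode "worker1=gpu"
Queues are then granted access to a partition through the scheduler configuration, for example the accessible-node-labels settings of the Capacity Scheduler.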
YARN Node Attributes
Node attributes define a set of attribute values for each NodeManager, so that an application can select NodeManagers based on these attributes and have its containers placed on them.
Documentation: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeAttributes.html
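As a sketch under stated assumptions (the property names and the attribute syntax should be checked against the linked documentation for your Hadoop version), a NodeManager can advertise attributes through a configuration-based provider; the attribute names and values below are hypothetical:
<property>
  <name>yarn.nodemanager.node-attributes.provider</name>
  <value>config</value>
</property>
<property>
  <name>yarn.nodemanager.node-attributes.provider.configured-node-attributes</name>
  <value>hostname,STRING,host1234:java_version,STRING,java8</value>
</property>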
Web Application Proxy
ApplicationMasters running applications in the cluster provide web UIs that are exposed through the ResourceManager for unified management. However, a malicious application in the cluster could provide a web UI that poses security risks. To reduce these risks, YARN uses a component called the Web Application Proxy: it takes over the web UI links provided by the ApplicationMaster, strips cookies from the requests, and flags links that are not safe.
By default, Web Application Proxy is started as part of Resource Manager. No separate configuration is required. If you want to deploy separately, additional configuration is required. Documentation: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html
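If the proxy is deployed separately, the main setting is its address; a minimal sketch, where proxyhost:9099 is a placeholder:
<property>
  <name>yarn.web-proxy.address</name>
  <value>proxyhost:9099</value>
</property>
The proxy can then be started on that host with $HADOOP_HOME/bin/yarn --daemon start proxyserver.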
YARN Timeline Server
The Timeline Server stores and answers queries about current and historical application execution information. Its data model consists of:
Timeline Domain: the top-level grouping, corresponding to a user's list of applications.
Timeline Entity: defines an application.
Timeline Event: an execution event of the application, such as application start, running, or termination.
The Timeline Server has a V1 and a V2 version: V1 stores its data in LevelDB, while V2 stores its data in HBase.
Documentation: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/TimelineServer.html
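A minimal sketch of enabling the Timeline Server (V1), assuming a dedicated host named timelinehost (a placeholder); see the linked documentation for the storage-related settings:
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.timeline-service.hostname</name>
  <value>timelinehost</value>
</property>
The daemon is started with $HADOOP_HOME/bin/yarn --daemon start timelineserver.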
Writing YARN Applications
Using the YARN APIs, you can write your own application and deploy it to a YARN cluster for execution.
Documentation: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
Application security
To keep applications secure, YARN provides a series of mechanisms that restrict what an application is allowed to do.
Documentation: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YarnApplicationSecurity.html
Node Manager
As mentioned earlier, the ResourceManager can continue executing tasks from where it left off after a restart. The NodeManager likewise needs a recovery feature for when it crashes and is restarted. See the documentation for the specific configuration.
Documentation: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManager.html
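A minimal sketch of the NodeManager recovery settings, assuming a local recovery directory of /var/hadoop/yarn-nm-recovery (a placeholder path); recovery also requires a fixed NodeManager port rather than an ephemeral one:
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/hadoop/yarn-nm-recovery</value>
</property>
<property>
  <name>yarn.nodemanager.address</name>
  <value>0.0.0.0:45454</value>
</property>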
Health Checker Service
The NodeManager's health checker service provides two types of health checks:
Disk Checker: checks the health of the NodeManager's disks and reports the result to the ResourceManager.
External Health Script: the administrator can specify custom health check scripts to be invoked by the NodeManager's health checker service.
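A minimal sketch of wiring in an external health script, assuming a hypothetical script at /etc/hadoop/nm-health-check.sh (a line of script output starting with ERROR marks the node unhealthy):
#!/bin/bash
# Hypothetical check: report the node as unhealthy when /tmp is not writable.
if ! touch /tmp/.nm-health-probe 2>/dev/null; then
  echo "ERROR: /tmp is not writable"
fi
The script is registered in yarn-site.xml:
<property>
  <name>yarn.nodemanager.health-checker.script.path</name>
  <value>/etc/hadoop/nm-health-check.sh</value>
</property>
<property>
  <name>yarn.nodemanager.health-checker.interval-ms</name>
  <value>600000</value>
</property>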
CGroups with YARN
YARN uses Linux CGroups for resource isolation and control. For more information, see the documentation: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html
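A minimal sketch of the yarn-site.xml settings involved, assuming the /hadoop-yarn cgroup hierarchy; the prerequisites (container-executor binary, permissions) are described in the linked documentation:
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/hadoop-yarn</value>
</property>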
Secure Containers
Each application container is restricted to the permissions of the user who submitted it, so containers submitted by different users cannot access each other's files and folders. For the specific configuration, see the documentation: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/SecureContainer.html
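A minimal sketch of the pieces involved, assuming hadoop as the group that owns the container-executor binary (a placeholder); the exact file permissions and Kerberos setup are covered in the linked documentation. In yarn-site.xml:
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value>
</property>
And in etc/hadoop/container-executor.cfg:
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred,bin
min.user.id=1000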
Removing nodes
There are two ways:
Normal: removes the node from the cluster immediately.
Graceful: waits for the tasks on the node to finish before removing it. Documentation: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/GracefulDecommission.html
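A rough sketch of decommissioning, assuming an exclude file at /etc/hadoop/yarn.exclude (a placeholder path) referenced from yarn-site.xml:
<property>
  <name>yarn.resourcemanager.nodes.exclude-path</name>
  <value>/etc/hadoop/yarn.exclude</value>
</property>
Add the host to the exclude file, then run one of:
yarn rmadmin -refreshNodes
yarn rmadmin -refreshNodes -g 3600 -server
The first removes the node immediately; the second decommissions it gracefully with a one-hour timeout tracked on the server side.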
Opportunistic Containers
YARN normally assigns a task's container to a NodeManager only when that NodeManager has free resources. When Opportunistic Containers are enabled, a container can be dispatched to a NodeManager even if it currently has no free resources; the container is queued there and executed as soon as the NodeManager becomes idle, hence the name. This can improve cluster resource utilization to some extent. Documentation:
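Separately, a minimal sketch of enabling opportunistic container allocation; the queue length below is an illustrative value:
<property>
  <name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.opportunistic-containers-max-queue-length</name>
  <value>20</value>
</property>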
Configuration and deployment
Deployment user: following the official Hadoop recommendation, the YARN-related components are run and managed by the yarn user.
Basic deployment mode
Start the ResourceManager:
$HADOOP_HOME/bin/yarn --daemon start resourcemanager
Start the NodeManager:
$HADOOP_HOME/bin/yarn --daemon start nodemanager
Start the proxy server:
$HADOOP_HOME/bin/yarn --daemon start proxyserver
Switch to the mapred user and start the history server:
$HADOOP_HOME/bin/mapred --daemon start historyserver
High performance deployment
YARN itself consists of multiple components, and some components run on many nodes (such as the NodeManager), so starting them one machine at a time is tedious. The Hadoop distribution package provides two scripts, sbin/start-yarn.sh and sbin/stop-yarn.sh, to start and stop all YARN-related components (nodemanager, resourcemanager, proxyserver) in one go.
These scripts work by logging in to the corresponding machines and starting or stopping the components there, based on the /opt/hadoop-3.2.1/etc/hadoop/workers file in the Hadoop installation directory. The workers file defines the hostnames of the worker machines, and login is done via passwordless SSH. For more information, see: https://www.cnblogs.com/niceshot/p/13019823.html
If the machine that runs the script also hosts a NodeManager itself, it needs to be able to SSH to itself without a password as well.
The scripts already know the ResourceManager machines from yarn-site.xml, so the workers file only needs to list the hostnames of all the NodeManager machines.
Generally, YARN NodeManagers are co-located with HDFS DataNodes, so the same workers file is also used for batch startup and shutdown of HDFS.
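For illustration, a workers file is just a list of worker hostnames, one per line (worker1 through worker3 are placeholder names):
worker1
worker2
worker3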
However, the $HADOOP_HOME/bin/mapred --daemon start historyserver command above belongs to MapReduce rather than YARN, so the history server has to be started and stopped separately and cannot be managed through the YARN scripts. The history server is covered in this YARN article simply to avoid writing a separate MapReduce article for it.
Some errors
Error 1
In the YARN management UI, a submitted SQL job fails with the following error:
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
The cause is a problem with YARN's classpath. According to Hadoop's official yarn-default.xml, the default value of the classpath configuration yarn.application.classpath loaded by YARN is:
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/share/hadoop/common/*,
$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,
$HADOOP_YARN_HOME/share/hadoop/yarn/*,
$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*
For Windows:
%HADOOP_CONF_DIR%,
%HADOOP_COMMON_HOME%/share/hadoop/common/*,
%HADOOP_COMMON_HOME%/share/hadoop/common/lib/*,
%HADOOP_HDFS_HOME%/share/hadoop/hdfs/*,
%HADOOP_HDFS_HOME%/share/hadoop/hdfs/lib/*,
%HADOOP_YARN_HOME%/share/hadoop/yarn/*,
%HADOOP_YARN_HOME%/share/hadoop/yarn/lib/*
If many of the environment variables in this path are not configured, the classes cannot be found. There are two solutions:
Configure the corresponding environment variables so that YARN's default classpath can be loaded properly. This is the recommended approach.
Run the hadoop classpath command to see the classpath Hadoop actually uses, copy it, and configure it in yarn-site.xml, for example:
<property>
  <name>yarn.application.classpath</name>
  <value>/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*</value>
</property>
Error 2
In the history server, you can see that the MapReduce phase throws the following exception:
org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:mapreduce_shuffle does not exist
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at ...
The reason is that MapReduce uses the mapreduce_shuffle auxiliary service, but it is not configured in YARN. The solution, again, is to modify yarn-site.xml and add the following configuration:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
Error 3
Container [pid=26153,containerID=container_e42_1594730232763_0121_02_000006] is running 598260224B beyond the 'VIRTUAL' memory limit. Current usage: 297.5 MB of 1 GB physical memory used; 2.7 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_e42_1594730232763_0121_02_000006:
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
The error means that the virtual memory used by the container exceeds the limit YARN allocated to it. Besides physical memory, a container can also use the operating system's virtual memory, i.e. swap space on disk.
There are two types of container, map and reduce. The physical memory size of a map container is determined by mapreduce.map.memory.mb, and that of a reduce container by mapreduce.reduce.memory.mb. The virtual memory a map container may request is mapreduce.map.memory.mb * yarn.nodemanager.vmem-pmem-ratio, and the virtual memory a reduce container may request is mapreduce.reduce.memory.mb * yarn.nodemanager.vmem-pmem-ratio.
So there are two ways to solve virtual memory overruns:
Increase the physical memory size of the container, i.e. increase mapreduce.map.memory.mb or mapreduce.reduce.memory.mb.
Increase the virtual-to-physical memory ratio, yarn.nodemanager.vmem-pmem-ratio.
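As a concrete check of the formula: with the default yarn.nodemanager.vmem-pmem-ratio of 2.1 and a 1 GB map container, the virtual limit is about 1 GB * 2.1 ≈ 2.1 GB, which matches the "2.7 GB of 2.1 GB virtual memory used" message above. A sketch of the two fixes with illustrative values, in mapred-site.xml:
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>
</property>
or in yarn-site.xml:
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>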
This concludes the discussion of how to recover the ResourceManager of Hadoop. Thank you for reading.