Hadoop2.6 HA deployment


Because a Spark environment needs to be deployed, a test Hadoop cluster was reinstalled from scratch. The steps are recorded below:

Hardware environment: four virtual machines (hadoop1~hadoop4), each with 3 GB of memory, a 60 GB disk, and a 2-core CPU

Software environment: CentOS 6.5, hadoop-2.6.0-cdh5.8.2, JDK 1.7

Deployment planning:

hadoop1 (192.168.0.3): namenode (active), resourcemanager

hadoop2 (192.168.0.4): namenode (standby), journalnode, datanode, nodemanager, historyserver

hadoop3 (192.168.0.5): journalnode, datanode, nodemanager

hadoop4 (192.168.0.6): journalnode, datanode, nodemanager

HDFS HA uses the QJM (JournalNode) method:

I. System preparation

1. Disable SELinux on each machine.

# vi /etc/selinux/config

SELINUX=disabled

2. Turn off the firewall on each machine (be sure to do this, otherwise formatting HDFS will fail with errors about being unable to connect to the JournalNodes)

# chkconfig iptables off

# service iptables stop

3. Install jdk1.7 on each machine

# cd /software

# tar -zxf jdk-7u65-linux-x64.gz -C /opt/

# cd /opt

# ln -s jdk1.7.0_65 java

(the tarball extracts to a jdk1.7.0_65 directory; the symlink should point to that directory, not to the .gz file)

# vi /etc/profile

export JAVA_HOME=/opt/java

export PATH=$PATH:$JAVA_HOME/bin

4. Create the Hadoop user on each machine and set up mutual SSH trust between the machines

# useradd grid

# passwd grid

(the detailed steps for establishing mutual trust are omitted here; a minimal sketch follows)
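
A minimal sketch of one common way to set up the trust for the grid user, assuming ssh-keygen and ssh-copy-id are available on CentOS 6.5 (run this on every node, repeating ssh-copy-id for each of the four hosts; if ssh-copy-id is missing, the public key can be appended to ~/.ssh/authorized_keys manually):

$ su - grid

$ ssh-keygen -t rsa

$ ssh-copy-id grid@hadoop1

$ ssh-copy-id grid@hadoop2

$ ssh-copy-id grid@hadoop3

$ ssh-copy-id grid@hadoop4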

5. Create the required directories on each machine

# mkdir -p /hadoop_data/hdfs/name

# mkdir -p /hadoop_data/hdfs/data

# mkdir -p /hadoop_data/hdfs/journal

# mkdir -p /hadoop_data/yarn/local

# chown -R grid:grid /hadoop_data

II. Hadoop deployment

The main point of the HDFS HA configuration is to specify the nameservice (if you are not doing HDFS federation there is only one nameservice ID), as well as the two namenodes under that nameservice ID and their addresses. The nameservice name here is set to hadoop-spark.

1. Unpack the Hadoop package on each machine

# cd /software

# tar -zxf hadoop-2.6.0-cdh5.8.2.tar.gz -C /opt/

# cd /opt

# chown -R grid:grid hadoop-2.6.0-cdh5.8.2

# ln -s hadoop-2.6.0-cdh5.8.2 hadoop

2. Switch to the grid user to continue.

# su - grid

$ cd /opt/hadoop/etc/hadoop

3. Configure hadoop-env.sh (only JAVA_HOME actually needs to be set)

$ vi hadoop-env.sh

# The java implementation to use.

export JAVA_HOME=/opt/java

4. Configure hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.nameservices</name>
    <value>hadoop-spark</value>
    <description>Comma-separated list of nameservices.</description>
  </property>
  <property>
    <name>dfs.ha.namenodes.hadoop-spark</name>
    <value>nn1,nn2</value>
    <description>The prefix for a given nameservice, contains a comma-separated list of namenodes for a given nameservice (eg EXAMPLENAMESERVICE).</description>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.hadoop-spark.nn1</name>
    <value>hadoop1:8020</value>
    <description>RPC address for namenode1 of hadoop-spark.</description>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.hadoop-spark.nn2</name>
    <value>hadoop2:8020</value>
    <description>RPC address for namenode2 of hadoop-spark.</description>
  </property>
  <property>
    <name>dfs.namenode.http-address.hadoop-spark.nn1</name>
    <value>hadoop1:50070</value>
    <description>The address and the base port where the dfs namenode1 web ui will listen on.</description>
  </property>
  <property>
    <name>dfs.namenode.http-address.hadoop-spark.nn2</name>
    <value>hadoop2:50070</value>
    <description>The address and the base port where the dfs namenode2 web ui will listen on.</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///hadoop_data/hdfs/name</value>
    <description>Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</description>
  </property>
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://hadoop2:8485;hadoop3:8485;hadoop4:8485/hadoop-spark</value>
    <description>A directory on shared storage between the multiple namenodes in an HA cluster. This directory will be written by the active and read by the standby in order to keep the namespaces synchronized. This directory does not need to be listed in dfs.namenode.edits.dir above. It should be left empty in a non-HA cluster.</description>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///hadoop_data/hdfs/data</value>
    <description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.hadoop-spark</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>false</value>
    <description>Whether automatic failover is enabled. See the HDFS High Availability documentation for details on automatic HA configuration.</description>
  </property>
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/hadoop_data/hdfs/journal</value>
  </property>
</configuration>

5. Configure core-site.xml (set fs.defaultFS to the HA nameservice name)

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-spark</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>

6. Configure mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
    <description>The runtime framework for executing MapReduce jobs. Can be one of local, classic or yarn.</description>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop2:10020</value>
    <description>MapReduce JobHistory Server IPC host:port</description>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop2:19888</value>
    <description>MapReduce JobHistory Server Web UI host:port</description>
  </property>
</configuration>

7. Configure yarn-site.xml

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop1</value>
    <description>The hostname of the RM.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>${yarn.resourcemanager.hostname}:8032</value>
    <description>The address of the applications manager interface in the RM.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>${yarn.resourcemanager.hostname}:8030</value>
    <description>The address of the scheduler interface.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>${yarn.resourcemanager.hostname}:8088</value>
    <description>The http address of the RM web application.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.https.address</name>
    <value>${yarn.resourcemanager.hostname}:8090</value>
    <description>The https address of the RM web application.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>${yarn.resourcemanager.hostname}:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>${yarn.resourcemanager.hostname}:8033</value>
    <description>The address of the RM admin interface.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    <description>The class to use as the resource scheduler.</description>
  </property>
  <property>
    <name>yarn.scheduler.fair.allocation.file</name>
    <value>${yarn.home.dir}/etc/hadoop/fairscheduler.xml</value>
    <description>fair-scheduler conf location</description>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/hadoop_data/yarn/local</value>
    <description>List of directories to store localized files in. An application's localized file directory will be found in ${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/application_${appid}. Individual containers' work directories, called container_${contid}, will be subdirectories of this.</description>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
    <description>Whether to enable log aggregation</description>
  </property>
  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/tmp/logs</value>
    <description>Where to aggregate logs to.</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>2048</value>
    <description>Amount of physical memory, in MB, that can be allocated for containers.</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>2</value>
    <description>Number of CPU cores that can be allocated for containers.</description>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
    <description>The valid service name should only contain a-zA-Z0-9_ and can not start with numbers.</description>
  </property>
</configuration>

8. Configure slaves

hadoop2

hadoop3

hadoop4

9. Configure fairscheduler.xml

The file defines a single queue for the grid user: minResources 0 mb, 0 vcores; maxResources 6144 mb, 6 vcores; a maximum of 50 running applications; the value 300 (most likely a preemption timeout in seconds); a weight of 1.0; and submit access for the grid user. A hedged reconstruction is sketched below.
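
A hedged reconstruction of what this fairscheduler.xml most likely contained, assuming the standard Fair Scheduler allocation-file elements; the queue name and the element chosen for the 300-second value are guesses, only the resource values come from the original text:

<?xml version="1.0"?>
<allocations>
  <!-- Hypothetical queue name; adjust to the queue actually used -->
  <queue name="grid">
    <minResources>0mb, 0 vcores</minResources>
    <maxResources>6144 mb, 6 vcores</maxResources>
    <maxRunningApps>50</maxRunningApps>
    <!-- Assumed mapping for the "300" value in the original -->
    <minSharePreemptionTimeout>300</minSharePreemptionTimeout>
    <weight>1.0</weight>
    <aclSubmitApps>grid</aclSubmitApps>
  </queue>
</allocations>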

10. Synchronize configuration files to each node

$ cd /opt/hadoop/etc

$ scp -r hadoop hadoop2:/opt/hadoop/etc/

$ scp -r hadoop hadoop3:/opt/hadoop/etc/

$ scp -r hadoop hadoop4:/opt/hadoop/etc/

III. Start the cluster (format the file system first)

1. Set up environment variables

$ vi ~/.bash_profile

export HADOOP_HOME=/opt/hadoop

export YARN_HOME_DIR=/opt/hadoop

export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop

export YARN_CONF_DIR=/opt/hadoop/etc/hadoop
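
After saving, reload the profile so the variables take effect in the current shell:

$ source ~/.bash_profile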

2. Start HDFS

Start journalnode first, on hadoop2~hadoop4:

$ cd /opt/hadoop/

$ sbin/hadoop-daemon.sh start journalnode

Format HDFS, and then start namenode. On hadoop1:

$ bin/hdfs namenode -format

$ sbin/hadoop-daemon.sh start namenode

Synchronize another namenode and start it. On hadoop2:

$ bin/hdfs namenode -bootstrapStandby

$ sbin/hadoop-daemon.sh start namenode

At this point both namenodes are in standby state, so switch hadoop1 to active (hadoop1 corresponds to nn1 in hdfs-site.xml):

$ bin/hdfs haadmin -transitionToActive nn1
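
To confirm the switch, the state of each namenode can be checked with the haadmin tool:

$ bin/hdfs haadmin -getServiceState nn1

$ bin/hdfs haadmin -getServiceState nn2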

Start the datanodes. On hadoop1 (the active namenode):

$ sbin/hadoop-daemons.sh start datanode

Note: after this initial setup, you only need sbin/start-dfs.sh to start HDFS. However, since ZooKeeper-based failover is not configured, HA can only be switched manually, so every time you start HDFS you must run bin/hdfs haadmin -transitionToActive nn1 to make the namenode on hadoop1 active.
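
So the routine start sequence under this manual-failover setup looks like:

$ sbin/start-dfs.sh

$ bin/hdfs haadmin -transitionToActive nn1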

3. Start YARN

On hadoop1 (resourcemanager):

$ sbin/start-yarn.sh
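
The deployment plan also places the MapReduce JobHistory Server on hadoop2; it is not started by start-yarn.sh. Assuming the stock Hadoop 2.x scripts, it can be started separately on hadoop2:

$ sbin/mr-jobhistory-daemon.sh start historyserver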

-

The HDFS HA configured above does not fail over automatically. To configure HDFS for automatic failover, add the following steps (stop the cluster first):

1. Deploy ZooKeeper on hadoop2, hadoop3 and hadoop4 and start it (the detailed steps are omitted here).

2. Add the following to hdfs-site.xml:

<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/exampleuser/.ssh/id_rsa</value>
</property>

For a detailed explanation, see the official documentation. This configuration uses the sshfence method, which logs in to the previously active node over SSH and kills the process listening on the service port; it only requires that the two namenodes can SSH to each other without a password.

There is another way to configure fencing:

<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
</property>

This configuration uses an arbitrary shell command to fence the old active node (for example, killing its ports or processes). If you do not want any real fencing action, dfs.ha.fencing.methods can be set to shell(/bin/true).

3. Add the following to core-site.xml:

<property>
  <name>ha.zookeeper.quorum</name>
  <value>hadoop2:2181,hadoop3:2181,hadoop4:2181</value>
</property>

4. Initialize ZKFC (execute on a namenode):

$ bin/hdfs zkfc -formatZK

5. Start the cluster
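
A rough sketch of the start sequence with automatic failover enabled, assuming ZooKeeper has already been started on hadoop2~hadoop4 (in Hadoop 2.6, start-dfs.sh also starts the ZKFC daemons on the namenode hosts when automatic failover is enabled). On hadoop1:

$ sbin/start-dfs.sh

$ sbin/start-yarn.sh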

_

ZKFC: runs on every namenode as a ZooKeeper client and is responsible for automatic failover.

ZooKeeper: an odd number of nodes; maintains the consistency lock and is responsible for electing the active node.

JournalNode: an odd number of nodes used for edit-log synchronization between the active and standby namenodes; the active namenode writes edits to these nodes and the standby reads them.
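
As a rough check against the deployment plan above (ZooKeeper is assumed to run as QuorumPeerMain, and the JobHistory Server only appears if it was started), jps on each node should show approximately:

hadoop1: NameNode, DFSZKFailoverController, ResourceManager

hadoop2: NameNode, DFSZKFailoverController, JournalNode, DataNode, NodeManager, QuorumPeerMain, JobHistoryServer

hadoop3 / hadoop4: JournalNode, DataNode, NodeManager, QuorumPeerMain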

-

Changing to ResourceManager HA:

Select hadoop2 as the second RM node.

1. Set up mutual SSH trust between hadoop2 and the other nodes.

2. Edit yarn-site.xml and synchronize it to the other machines (a hedged sketch of the RM HA properties is given after this list).

3. Copy fairscheduler.xml to hadoop2.

4. Start the RM on hadoop1.

5. Start the other RM on hadoop2.
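
A minimal sketch of the yarn-site.xml additions for ResourceManager HA, assuming the standard Hadoop 2.6 property names and reusing the ZooKeeper quorum configured earlier; the rm-ids (rm1, rm2) and the cluster-id value are illustrative, not from the original:

<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Illustrative cluster id -->
  <name>yarn.resourcemanager.cluster-id</name>
  <value>yarn-cluster</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>hadoop1</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>hadoop2</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>hadoop2:2181,hadoop3:2181,hadoop4:2181</value>
</property>

In Hadoop 2.6, start-yarn.sh only starts the local ResourceManager, so after running it on hadoop1 the second RM would be started on hadoop2 with:

$ sbin/yarn-daemon.sh start resourcemanager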
