Hadoop distributed installation


Before we begin:

Daemons in a Hadoop cluster

HDFS:

NameNode (NN)

SecondaryNameNode (SNN)

DataNode (DN)

/data/hadoop/hdfs/{nn,snn,dn}

NN: fsimage, editlog // the image file and the edit log

// the NN of HDFS keeps its metadata in memory and continually modifies it as file states change.

fsimage stores: which nodes each file's blocks are placed on after the file is split.

// changes to file metadata are written to the editlog and eventually merged into the fsimage, so the data still exists after the next NN restart: the NN reads the fsimage back into memory.

// once the NN crashes, data recovery takes a lot of time.
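For example, the fsimage and editlog under /data/hadoop/hdfs/nn/current/ can be inspected with Hadoop's offline viewers. A minimal sketch, assuming the nn directory layout used in this article (the numeric file names vary, so list the directory first):

ls /data/hadoop/hdfs/nn/current/
hdfs oiv -i /data/hadoop/hdfs/nn/current/fsimage_0000000000000000000 -o /tmp/fsimage.xml -p XML    # dump the image to XML
hdfs oev -i /data/hadoop/hdfs/nn/current/edits_inprogress_0000000000000000001 -o /tmp/edits.xml    # dump the edit log to XML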

SNN: when the NN crashes, the SNN can be brought online in time, saving the time needed to repair the NN and bring it back; but each data node must still report its block status, so some recovery time is still required.

Normally: the SNN copies the NN's fsimage and editlog and merges them on the SNN.

Checkpoint: because the NN changes constantly, the SNN merges up to a specified point in time.

// it is officially recommended to build hadoop clusters only when you have more than 30 nodes.

Does the data directory need RAID? // No: HDFS already replicates blocks, so there is no need to provide redundancy again.

How hadoop-daemon.sh runs

When you run hadoop-daemon.sh start datanode in cluster mode, the script needs to find each DataNode node automatically and then start the process on each of them.

How does it find them? The master node must be able to connect to each slave node automatically and have permission to execute commands there.

This is configured on the master node.
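A minimal connectivity check from the master, assuming the node2..node4 host names from the hosts file below and key-based ssh already set up for the service user (see the preparation section):

for i in 2 3 4; do
  ssh -o BatchMode=yes node${i} 'hostname'    # BatchMode fails instead of prompting if keys are missing
done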

YARN:

ResourceManager

NodeManager

yarn-daemon.sh start/stop

The actual production layout:

[NN]        [SNN]        [RM]

  |            |            |

  --------------------------

[node1/DN]  [node2/DN]  [node3/DN]

Started on each slave node: only the datanode process and the nodemanager process.

The experimental model:

        [NN/SNN/RM]

             |

  --------------------------

[node1/DN]  [node2/DN]  [node3/DN]

Running on the master node: the namenode, secondarynamenode, and resourcemanager processes.

Started on the other nodes: the datanode process and the nodemanager process.

Preparation:

1. Time synchronization with ntpdate

tzselect

timedatectl // view time zone settings

timedatectl list-timezones # list all time zones

timedatectl set-local-rtc 1 # keep the hardware clock on local time; 0 sets it to UTC

timedatectl set-timezone Asia/Shanghai # set the system time zone to Shanghai

cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime // the simplest solution
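A minimal sketch to sync every node at once, assuming ssh access to the slaves and a reachable NTP server (pool.ntp.org stands in for whatever server your site uses):

ntpdate pool.ntp.org    # sync the master
for i in 2 3 4; do
  ssh node${i} 'ntpdate pool.ntp.org'    # sync each slave
done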

2. hosts communication

172.16.100.67 node1.mt.com node1 master

172.16.100.68 node2.mt.com node2

172.16.100.69 node3.mt.com node3

172.16.100.70 node4.mt.com node4

If you want to start or stop the entire cluster from the master node, the users running the services on master (such as hdfs and yarn) must be able to log in to each slave with key-based ssh.

Node1:

I. Prelude

(1) Configure the environment

vim /etc/profile.d/java.sh

export JAVA_HOME=/usr

yum install java-1.8.0-openjdk-devel.x86_64

scp /etc/profile.d/java.sh node2:/etc/profile.d/

scp /etc/profile.d/java.sh node3:/etc/profile.d/

scp /etc/profile.d/java.sh node4:/etc/profile.d/

vim /etc/profile.d/hadoop.sh

export HADOOP_PREFIX=/bdapps/hadoop

export PATH=$PATH:${HADOOP_PREFIX}/bin:${HADOOP_PREFIX}/sbin

export HADOOP_YARN_HOME=${HADOOP_PREFIX}

export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}

export HADOOP_COMMON_HOME=${HADOOP_PREFIX}

export HADOOP_HDFS_HOME=${HADOOP_PREFIX}

. /etc/profile.d/hadoop.sh

scp /etc/profile.d/hadoop.sh node2:/etc/profile.d/

scp /etc/profile.d/hadoop.sh node3:/etc/profile.d/

scp /etc/profile.d/hadoop.sh node4:/etc/profile.d/
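A quick verification that the environment is in place, assuming the two profile scripts above:

. /etc/profile.d/java.sh
. /etc/profile.d/hadoop.sh
echo ${HADOOP_PREFIX}    # should print /bdapps/hadoop
java -version            # should report OpenJDK 1.8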

(2) Modify the hosts file

vim /etc/hosts

172.16.100.67 node1.mt.com node1 master

172.16.100.68 node2.mt.com node2

172.16.100.69 node3.mt.com node3

172.16.100.70 node4.mt.com node4

scp it to node2, node3, and node4

(3) Key-based login for the hadoop user

useradd hadoop // node2, 3, and 4 also need a hadoop user

echo "hadoop" | passwd --stdin hadoop

useradd -g hadoop hadoop // a single user is used for everything here; you could also create separate yarn and hdfs users

su - hadoop

ssh-keygen

for i in 2 3 4; do ssh-copy-id -i .ssh/id_rsa.pub hadoop@node${i}; done

Verify:

ssh node2 'date'

ssh node3 'date'

ssh node4 'date'

II. Install and deploy hadoop

(1) Decompression

mkdir -pv /bdapps /data/hadoop/hdfs/{nn,snn,dn} // the dn directory is not needed here, because the master node stores no data; it may be left uncreated.

chown -R hadoop:hadoop /data/hadoop/hdfs

tar xvf hadoop-2.6.2.tar.gz -C /bdapps/

cd /bdapps/

ln -sv hadoop-2.6.2 hadoop

cd hadoop

mkdir logs

chmod g+w logs

chown -R hadoop:hadoop ./*

(2) Configuration file modification

1. core-site.xml configuration

vim etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:8020</value>
    <final>true</final>
  </property>
</configuration>

// the access interface of hdfs; if master cannot be resolved, an IP address also works

// core-site points to the NN
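A quick way to confirm the setting is picked up, using the standard hdfs getconf tool:

hdfs getconf -confKey fs.defaultFS    # should print hdfs://master:8020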

2. yarn-site.xml file configuration

vim etc/hadoop/yarn-site.xml

<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>master:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>master:8088</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.auxservices.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  </property>
</configuration>

// :%s/localhost/master/g — replace localhost with master

// yarn-site points to the ResourceManager

3. hdfs-site.xml configuration

vim etc/hadoop/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>    <!-- number of block copies -->
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/hadoop/hdfs/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/hadoop/hdfs/dn</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>file:///data/hadoop/hdfs/snn</value>
  </property>
  <property>
    <name>fs.checkpoint.edits.dir</name>
    <value>file:///data/hadoop/hdfs/snn</value>
  </property>
</configuration>

4. mapred-site.xml is the only configuration file that does not need to be modified.

The default framework is yarn.

5. The slaves file

vim slaves

node2

node3

node4

// the hosts listed in slaves run the datanode and nodemanager processes

(3) Copy the configuration to the other nodes

Once node2, node3, and node4 have been prepared up to the chown -R hadoop:hadoop ./* step (see section IV below):

su - hadoop

scp /bdapps/hadoop/etc/hadoop/* node2:/bdapps/hadoop/etc/hadoop/

scp /bdapps/hadoop/etc/hadoop/* node3:/bdapps/hadoop/etc/hadoop/

scp /bdapps/hadoop/etc/hadoop/* node4:/bdapps/hadoop/etc/hadoop/

III. Format and start

su - hadoop

hdfs namenode -format

If the output contains "/data/hadoop/hdfs/nn has been successfully formatted", the format succeeded.
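A sanity check on the freshly formatted directory (paths from the configuration above; exact file names may vary by version):

ls /data/hadoop/hdfs/nn/current/
# expect a VERSION file, a seen_txid file, and an initial fsimage_... file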

There are two ways to start hadoop:

1. Start each required service individually on its node

Start the yarn services using the yarn user identity.

Master node: the NameNode and ResourceManager services

su - hdfs -c 'hadoop-daemon.sh start namenode'

su - hdfs -c 'yarn-daemon.sh start resourcemanager'

Slave nodes: the DataNode and NodeManager services

su - hdfs -c 'hadoop-daemon.sh start datanode'

su - hdfs -c 'yarn-daemon.sh start nodemanager'

2. Start the entire cluster from master

su - hdfs -c 'start-dfs.sh'

su - hdfs -c 'start-yarn.sh'

Older versions used start-all.sh and stop-all.sh to control both hdfs and mapreduce.

Start the services:

su - hdfs -c 'start-dfs.sh'

su - hdfs -c 'stop-dfs.sh' // stop hdfs

You will see the datanodes being started on node2, node3, and node4.

su - hdfs -c 'start-yarn.sh'

This starts the resourcemanager on master

and the nodemanager on each slave.

Test:

node3: su - hadoop

jps // shows the DataNode and NodeManager processes

node1: su - hadoop

jps // shows the SecondaryNameNode and NameNode processes

hdfs dfs -mkdir /test

hdfs dfs -put /etc/fstab /test/fstab

hdfs dfs -ls -R /test

hdfs dfs -cat /test/fstab

node3:

ls /data/hadoop/hdfs/dn/current/..../blk_... // the file's blocks are stored here

Note: one of node2, 3, and 4 does not store the file, because dfs.replication is set to 2.

vim etc/hadoop/hdfs-site.xml

<property>
  <name>dfs.replication</name>    <!-- number of block copies -->
  <value>2</value>
</property>
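The block placement can also be confirmed from the command line with the standard fsck tool; a minimal sketch using the file uploaded above:

hdfs fsck /test/fstab -files -blocks -locations
# the report should show each block with 2 replicas, matching dfs.replication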

View the web UI:

172.16.100.67:8088

The memory shows as 24G because each of the three nodes has 8G of physical memory.

172.16.100.67:50070

Datanodes: there are three.

A single file that is too small is not split; files larger than 64M are split into blocks.

Compressed files can be uploaded directly and will also be split.

Run a task to test:

yarn jar /bdapps/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.2.jar wordcount /test/fstab /test/functions /test/wc

hdfs dfs -cat /test/wc/part-r-00000
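A quick look at the job output, using the paths from the wordcount run above:

hdfs dfs -ls /test/wc
# expect an empty _SUCCESS marker plus the part-r-00000 result file
yarn application -list -appStates FINISHED    # the finished wordcount job should appear here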

IV. Other nodes

Node2:

useradd hadoop

echo "hadoop" | passwd --stdin hadoop

mkdir -pv /bdapps /data/hadoop/hdfs/{nn,snn,dn} // only dn is actually used

chown -R hadoop:hadoop /data/hadoop/hdfs/

tar xvf hadoop-2.6.2.tar.gz -C /bdapps/

cd /bdapps/

ln -sv hadoop-2.6.2 hadoop

cd hadoop

mkdir logs

chmod g+w logs

chown -R hadoop:hadoop ./*

// after the configuration files have been modified on the master, they can be copied directly to node3 and node4, since the configuration is identical.

Node3:

useradd hadoop

echo "hadoop" | passwd --stdin hadoop

mkdir -pv /bdapps /data/hadoop/hdfs/{nn,snn,dn} // only dn is actually used

chown -R hadoop:hadoop /data/hadoop/hdfs/

tar xvf hadoop-2.6.2.tar.gz -C /bdapps/

cd /bdapps/

ln -sv hadoop-2.6.2 hadoop

cd hadoop

mkdir logs

chmod g+w logs

chown -R hadoop:hadoop ./*

Node4:

useradd hadoop

echo "hadoop" | passwd --stdin hadoop

mkdir -pv /bdapps /data/hadoop/hdfs/{nn,snn,dn} // only dn is actually used

chown -R hadoop:hadoop /data/hadoop/hdfs/

tar xvf hadoop-2.6.2.tar.gz -C /bdapps/

cd /bdapps/

ln -sv hadoop-2.6.2 hadoop

cd hadoop

mkdir logs

chmod g+w logs

chown -R hadoop:hadoop ./*


Cluster management commands of yarn

yarn [--config confdir] COMMAND

resourcemanager -format-state-store // delete the RMStateStore

resourcemanager // run the ResourceManager

nodemanager // run a nodemanager on each slave

timelineserver // run the timeline server (task scheduling timeline)

rmadmin // resourcemanager administration

version

jar // run a jar file

application // display application information; also report on or kill an application

applicationattempt // reports related to application attempts

container // container-related information

node // display node information

queue // report queue information

logs // dump container logs

classpath // display the class loading path used when java runs a program

daemonlog // get or set the log level of a daemon

jar, application, node, logs, classpath, and version are the commonly used user commands.

resourcemanager, nodemanager, proxyserver, rmadmin, and daemonlog are the commonly used management commands.

yarn application [options]

-status ApplicationID // status information

yarn application -status application_1494685700454_0001

-list // lists the applications on yarn

-appTypes: MAPREDUCE, YARN

-appStates: ALL, NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED

yarn application -list -appStates ALL

-kill ApplicationID

yarn node

-list // list the nodes

-states: NEW, RUNNING, UNHEALTHY (unhealthy), DECOMMISSIONED (retired), LOST, REBOOTED

-status NodeId // display node information
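A usage sketch; the NodeId (host:port) comes from the -list output, and node2.mt.com:45454 below is only an illustrative value:

yarn node -list -states RUNNING
yarn node -status node2.mt.com:45454    # illustrative NodeId; copy the real one from -list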

logs: displays the logs of completed YARN applications (that is, applications in the FAILED, KILLED, or FINISHED state).

To view logs on the command line, yarn-site.xml must set

the yarn.log-aggregation-enable property to true; see the sketch after the options below.

yarn logs -applicationId [applicationID] [options]

-applicationId applicationID // required; used to fetch the application's details from the ResourceManager

-appOwner AppOwner // defaults to the current user; optional

-nodeAddress NodeAddress -containerId containerID // gets the log of the specified container on the specified node; the format of NodeAddress is the same as NodeId
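A minimal sketch, reusing the application ID from the earlier example and assuming log aggregation has been enabled as described above:

# required property in etc/hadoop/yarn-site.xml:
#   <property>
#     <name>yarn.log-aggregation-enable</name>
#     <value>true</value>
#   </property>
yarn logs -applicationId application_1494685700454_0001 -appOwner hadoop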

classpath:

yarn classpath // print the class path needed to run java programs

Administrative commands:

rmadmin

nodemanager

timelineserver

rmadmin is a client program of the ResourceManager that can be used to refresh access control policies, scheduler queues, the nodes registered with the RM, and so on.

Changes take effect after the refresh, with no restart required.

yarn rmadmin [options]

-help

-refreshQueues: reloads the queues' ACLs, states, and scheduler-specific properties; the scheduler is reinitialized from the information in the configuration file.

-refreshNodes: refreshes the RM's host information; it re-reads the RM node include and exclude files to update the list of nodes the cluster should include or exclude.

-refreshUserToGroupMappings: refreshes the user-to-group mappings by updating the group cache according to the configured Hadoop security group mapping.

-refreshSuperUserGroupsConfiguration: refreshes the superuser proxy group mappings, updating the proxy hosts and proxy groups defined by the hadoop.proxyuser properties in core-site.xml.

-refreshAdminAcls: refreshes the RM's administration ACL from the yarn.admin.acl property in the yarn site configuration file or the default configuration.

-refreshServiceAcl: reloads the service-level authorization policy file; the RM re-reads the policy file, checks whether hadoop security authorization is enabled, and refreshes the ACLs for the IPC Server, the ApplicationMaster, the Client, and the Resource tracker.
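A hedged sketch of -refreshNodes: retiring a node by listing it in the RM's exclude file and refreshing. The exclude-file path below is an assumption; use whatever yarn.resourcemanager.nodes.exclude-path points to in your yarn-site.xml:

echo "node4.mt.com" >> /bdapps/hadoop/etc/hadoop/yarn.exclude    # assumed exclude-file path
yarn rmadmin -refreshNodes
yarn node -list -states DECOMMISSIONED    # the retired node should appear here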

daemonlog: view or set the log level of a daemon.

http://host:port/logLevel?log=name

yarn daemonlog [options] args

-getLevel host:port name // displays the log level of the specified daemon

-setLevel host:port name level // sets the log level of the specified daemon

Running a YARN application

A yarn application can be a shell script, a MapReduce job, or any other type of job.

Steps:

1. Application initialization and submission // done by the client

2. Allocate memory and start the AM // done by the RM

3. AM registration and resource allocation // done by the AM on a nodemanager

4. Start and monitor containers // the AM reports to the NM, and the NM reports to the RM

5. Application progress reports // done by the AM

6. Application completion //
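As an illustration of this flow, Hadoop ships a DistributedShell sample application; a sketch, with the jar path and version assumed from the installation above:

# runs "date" in 2 containers; the client/RM/AM steps above happen behind the scenes
yarn jar ${HADOOP_PREFIX}/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.6.2.jar \
  org.apache.hadoop.yarn.applications.distributedshell.Client \
  -jar ${HADOOP_PREFIX}/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.6.2.jar \
  -shell_command date -num_containers 2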

Using ambari to deploy a hadoop cluster

https://www.ibm.com/developerworks/cn/opensource/os-cn-bigdata-ambari/

https://cwiki.apache.org/confluence/display/AMBARI/Installation+Guide+for+Ambari+2.5.0

IBM official technical forum: https://www.ibm.com/developerworks/cn/opensource/

Ambari 2.2.2 download resources

http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.2.2.0

http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.2.2.0/ambari.repo

http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.2.2.0/ambari-2.2.2.0-centos7.tar.gz

HDP 2.4.2 download resources

http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.4.2.0

http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.4.2.0/hdp.repo

http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.4.2.0/HDP-2.4.2.0-centos7-rpm.tar.gz

http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/centos7

http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/centos7/HDP-UTILS-1.1.0.20-centos7.tar.gz
