
How to deploy a Hadoop cluster


This article focuses on how to deploy a Hadoop cluster; interested readers may wish to take a look. The method introduced here is simple, fast and practical. Let's walk through how to deploy a Hadoop cluster.

Environment preparation

A total of five machines are used as the hardware environment, all running CentOS 6.4:

Namenode & resourcemanager master server: 192.168.1.1

Namenode & resourcemanager standby server: 192.168.1.2

Datanode & nodemanager servers: 192.168.1.100, 192.168.1.101, 192.168.1.102

Zookeeper server cluster (for automatic failover of the namenode HA): 192.168.1.100, 192.168.1.101

Jobhistory server (for mapreduce job logs): 192.168.1.1

NFS server for namenode HA: 192.168.1.100

Environment deployment

I. Add the CDH4 YUM repository

1. Option one is to put the CDH packages into a self-built YUM repository. For details on how to build one, see the separate article on self-built YUM repositories.
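If you go the self-built route, each hadoop machine only needs an ordinary yum repo definition pointing at it. A minimal sketch, assuming the repository is served from an internal web server (the repo id cdh-local and the URL http://yum.internal/cdh/ below are placeholders, not from the original article):

# write a local repo definition on every hadoop machine
# (hypothetical repo id and baseurl; replace with your own repository address)
cat > /etc/yum.repos.d/cdh-local.repo <<'EOF'
[cdh-local]
name=Local CDH repository
baseurl=http://yum.internal/cdh/
enabled=1
gpgcheck=0
EOF
# refresh yum metadata so the new repository is picked up
yum clean all && yum makecache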

2. If you do not want to build your own YUM repository, run the following on all hadoop machines to add the CDH4 repository:

wget http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-4-0.x86_64.rpm
sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.x86_64.rpm

II. Create an NFS server for namenode HA

1. Log in to 192.168.1.100 and execute the following script createNFS.sh

#!/bin/bash
yum -y install rpcbind nfs-utils
mkdir -p /data/nn_ha/
echo "/data/nn_ha *(rw,root_squash,all_squash,sync)" > /etc/exports
/etc/init.d/rpcbind start
/etc/init.d/nfs start
chkconfig --level 234 rpcbind on
chkconfig --level 234 nfs on
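Before moving on, it can be worth checking from the namenode machines that the export is actually visible. A small sketch (the mount point /mnt/nn_ha below is only an example; the CreateNamenode.sh script used later may already take care of mounting):

# run on 192.168.1.1 / 192.168.1.2
yum -y install nfs-utils
# confirm the export published by 192.168.1.100
showmount -e 192.168.1.100
# example manual mount (hypothetical mount point)
mkdir -p /mnt/nn_ha
mount -t nfs 192.168.1.100:/data/nn_ha /mnt/nn_ha
df -h /mnt/nn_ha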

III. Hadoop Namenode & resourcemanager master server environment deployment

1. Log in to 192.168.1.1, create the script directory, and clone the scripts from the git repository

yum -y install git
mkdir -p /opt/
cd /opt/
git clone http://git.oschina.net/snake1361222/hadoop_scripts.git
/etc/init.d/iptables stop

2. Modify the hostname

sh /opt/hadoop_scripts/deploy/AddHostname.sh

3. Modify the configuration file of the deployment script

vim /opt/hadoop_scripts/deploy/config
# add the address of the master server, i.e. the namenode master server
master="192.168.1.1"
# add the NFS server address
nfsserver="192.168.1.100"

4. Edit the hosts file (this file will be synchronized to all machines in the hadoop cluster)

vim /opt/hadoop_scripts/share_data/resolv_host
127.0.0.1     localhost localhost.localdomain localhost4 localhost4.localdomain4
::1           localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.1   nn.dg.hadoop.cn
192.168.1.2   nn2.dg.hadoop.cn
192.168.1.100 dn100.dg.hadoop.cn
192.168.1.101 dn101.dg.hadoop.cn
192.168.1.102 dn102.dg.hadoop.cn
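Once the deployment script has pushed this file out to /etc/hosts, a quick sanity check (not part of the original procedure) is to make sure the names resolve from the master server:

# each name should resolve to the address listed above
getent hosts nn.dg.hadoop.cn nn2.dg.hadoop.cn dn100.dg.hadoop.cn
ping -c 1 nn2.dg.hadoop.cn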

5. Execute the deployment script CreateNamenode.sh

sh /opt/hadoop_scripts/deploy/CreateNamenode.sh

6. Set up saltstack master

PS: saltstack is an open source server management tool similar to puppet, but more lightweight. Here it is used to manage the hadoop cluster and to drive the datanodes. For more information about saltstack, see the separate article on SaltStack deployment and use.

a. Installation

yum -y install salt salt-master

b. Modify the configuration file `/etc/salt/master`. The items to modify are marked below.

# modify the listening IP
interface: 0.0.0.0
# worker thread pool
worker_threads: 5
# enable the job cache (according to the official docs, the cache can support about 5000 minions)
job_cache: True
# enable automatic key acceptance
auto_accept: True

c. Start the service

/etc/init.d/salt-master start
chkconfig salt-master on
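To confirm the master really is up, you can check that it is listening on salt's default publish/return ports (4505 and 4506); a quick sketch:

# salt-master listens on 4505 (publish) and 4506 (return) by default
netstat -lntp | grep -E '4505|4506'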

7. The sample configuration has already been copied into place by the deployment script, so only a few of the configuration files need to be modified

a. /etc/hadoop/conf/hdfs-site.xml (adjust the host names and addresses to your actual environment)

<property>
  <name>dfs.namenode.rpc-address.mycluster.ns1</name>
  <value>nn.dg.hadoop.cn:8020</value>
  <!-- defines the rpc address of ns1 -->
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.ns2</name>
  <value>nn2.dg.hadoop.cn:8020</value>
  <!-- defines the rpc address of ns2 -->
</property>
<property>
  <name>ha.zookeeper.quorum</name>
  <value>dn100.dg.hadoop.cn:2181,dn101.dg.hadoop.cn:2181,dn102.dg.hadoop.cn:2181</value>
  <!-- the list of ZooKeeper servers used for HA -->
</property>

b. mapred-site.xml

<property>
  <name>mapreduce.jobhistory.address</name>
  <value>nn.dg.hadoop.cn:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>nn.dg.hadoop.cn:19888</value>
</property>

c. yarn-site.xml

<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>nn.dg.hadoop.cn:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>nn.dg.hadoop.cn:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>nn.dg.hadoop.cn:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>nn.dg.hadoop.cn:8033</value>
</property>
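After editing the three files, a simple way to confirm that hadoop actually sees the new values is to query them with hdfs getconf (assuming the hadoop client packages were already installed by CreateNamenode.sh); for example:

# print a couple of the values configured above
hdfs getconf -confKey dfs.namenode.rpc-address.mycluster.ns1
hdfs getconf -confKey ha.zookeeper.quorum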

IV. Hadoop Namenode & resourcemanager standby server environment deployment

1. Log in to 192.168.1.2, create the script directory, and synchronize the script from the main server

/etc/init.d/iptables stop
mkdir -p /opt/hadoop_scripts
rsync -avz 192.168.1.1::hadoop_s /opt/hadoop_scripts

2. Execute the deployment script CreateNamenode.sh

sh /opt/hadoop_scripts/deploy/CreateNamenode.sh

3. Synchronize the hadoop configuration files

rsync -avz 192.168.1.1::hadoop_conf /etc/hadoop/conf

4. Deploy saltstack client

sh /opt/hadoop_scripts/deploy/salt_minion.sh
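Since auto_accept is enabled on the master, the new minion should register automatically. A quick check from the saltstack master (192.168.1.1), assuming the minion id follows the host name:

# list the registered minion keys and ping the standby server's minion
salt-key -L
salt -v 'nn2*' test.ping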

V. Zookeeper server cluster deployment

Zookeeper is an open source distributed coordination service; here it is used for the namenode's automatic failover.

1. Installation

yum install zookeeper zookeeper-server

2. Modify the configuration file /etc/zookeeper/conf/zoo.cfg

maxClientCnxns=50
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored
dataDir=/var/lib/zookeeper
# the port at which the clients will connect
clientPort=2181
# list all the machines in the zookeeper cluster here; this part of the configuration is identical on every machine
server.1=dn100.dg.hadoop.cn:2888:3888
server.2=dn101.dg.hadoop.cn:2888:3888

3. Specify the id of the current machine and enable the service

# for example, the current machine is 192.168.1.100 (dn100.dg.hadoop.cn), which is server.1, so its id is 1:
echo "1" > /var/lib/zookeeper/myid
chown -R zookeeper.zookeeper /var/lib/zookeeper/
service zookeeper-server init
/etc/init.d/zookeeper-server start
chkconfig zookeeper-server on
# repeat the same steps on 192.168.1.101 (server.2, id 2)
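Once both zookeeper servers are up, the four-letter admin commands are a convenient way to confirm that the quorum has formed; for example:

# "ruok" should answer "imok"; "stat" shows whether the node is leader or follower
echo ruok | nc dn100.dg.hadoop.cn 2181
echo stat | nc dn100.dg.hadoop.cn 2181 | grep Mode
echo stat | nc dn101.dg.hadoop.cn 2181 | grep Mode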

VI. Datanode & nodemanager server deployment

1. Log in to the datanode machine, create a script directory, and synchronize the scripts from the main server

/etc/init.d/iptables stop
mkdir -p /opt/hadoop_scripts
rsync -avz 192.168.1.1::hadoop_s /opt/hadoop_scripts

2. Modify the hostname and execute the deployment script CreateDatanode.sh

sh /opt/hadoop_scripts/deploy/AddHostname.sh
sh /opt/hadoop_scripts/deploy/CreateDatanode.sh

Cluster initialization

At this point, the hadoop cluster environment has been deployed; now initialize the cluster.

1. Initialize namenode HA (high availability)

1. On the namenode master server (192.168.1.1), format the zookeeper state for the failover controller (zkfc)

sudo -u hdfs hdfs zkfc -formatZK

2. Start the zookeeper cluster service (192.168.1.100, 192.168.1.101)

/etc/init.d/zookeeper-server start

3. Start the zkfc service on the namenode active and standby servers (192.168.1.1, 192.168.1.2)

/etc/init.d/hadoop-hdfs-zkfc start

4. Format hdfs on the namenode master server (192.168.1.1)

# make sure the format is executed as the hdfs user
sudo -u hdfs hadoop namenode -format

5. Because of namenode HA, the data under name.dir must be copied to the namenode standby server, which can take a lot of time.

a. Execute on the primary server (192.168.1.1)

tar -zcvPf /tmp/namedir.tar.gz /data/hadoop/dfs/name/
nc -l 9999 < /tmp/namedir.tar.gz

b. Execute on the standby server (192.168.1.2)

wget 192.168.1.1:9999 -O /tmp/namedir.tar.gz
tar -zxvPf /tmp/namedir.tar.gz

6. Start the services on both the master and standby servers

/etc/init.d/hadoop-hdfs-namenode start
/etc/init.d/hadoop-yarn-resourcemanager start

7. View the web interface of hdfs

http://192.168.1.1:9080
http://192.168.1.2:9080
# if the web interface shows both namenodes in standby state, the auto failover configuration has not succeeded
# check the zkfc logs (/var/log/hadoop-hdfs/hadoop-hdfs-zkfc-nn.dg.s.kingsoft.net.log)
# check the zookeeper cluster logs (/var/log/zookeeper/zookeeper.log)

8. Now try shutting down the active namenode service and check whether the standby takes over.
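One way to watch the failover from the command line is hdfs haadmin; a sketch, using ns1/ns2, the namenode IDs from the hdfs-site.xml snippet above:

# check which namenode is currently active
sudo -u hdfs hdfs haadmin -getServiceState ns1
sudo -u hdfs hdfs haadmin -getServiceState ns2
# stop the active namenode (here assumed to be 192.168.1.1) ...
/etc/init.d/hadoop-hdfs-namenode stop
# ... and within a few seconds the other namenode should report "active"
sudo -u hdfs hdfs haadmin -getServiceState ns2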

2. Start the hdfs cluster

At this point, all of the hadoop deployment is complete; now start the cluster and verify that it works.

1. Start all datanode servers

# remember the saltstack management tool we set up earlier? This is where it comes into play.
# log in to the saltstack master (192.168.1.1) and execute:
salt -v "dn*" cmd.run "/etc/init.d/hadoop-hdfs-datanode start"

2. Check the hdfs web interface to see whether all the datanodes show up as live nodes.
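The same information is also available from the command line, which is handy when the web interface is not reachable:

# lists every datanode and whether it is in service
sudo -u hdfs hdfs dfsadmin -report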

3. If there is no problem, you can try the hdfs operation now

# create a tmp directory
sudo -u hdfs hdfs dfs -mkdir /tmp
# create a 10G file of zeros, calculate its MD5 value, and put it into hdfs
dd if=/dev/zero of=/data/test_10G_file bs=1G count=10
md5sum /data/test_10G_file
sudo -u hdfs hdfs dfs -put /data/test_10G_file /tmp
sudo -u hdfs hdfs dfs -ls /tmp
# now try shutting down one datanode, then pull the test file back out of hdfs
# and calculate its MD5 again to check that it is unchanged
sudo -u hdfs hdfs dfs -get /tmp/test_10G_file /tmp/
md5sum /tmp/test_10G_file

3. Start the yarn cluster

Besides hdfs for distributed storage of big data, hadoop has another important component: distributed computing (mapreduce). Now let's start the mapreduce v2 (yarn) cluster.

1. Start the resourcemanager service on the primary server (192.168.1.1)

/etc/init.d/hadoop-yarn-resourcemanager start

2. Start all nodemanager services

# log in to the saltstack master and execute:
salt -v "dn*" cmd.run "/etc/init.d/hadoop-yarn-nodemanager start"

3. Check the yarn task tracking interface (http://192.168.1.1:9081/) to see whether all nodes have been added.
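The registered nodemanagers can also be listed from the command line; a quick sketch:

# show the nodemanagers that have registered with the resourcemanager
yarn node -list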

4. Hadoop ships with benchmark mapreduce examples, which we can use to test whether the yarn environment works properly.

# TestDFSIO tests the read and write performance of HDFS, writing 10 files of 1G each
su - hdfs
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-cdh5.2.1-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
# Sort test for MapReduce
## write output data to the random-data directory
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar randomwriter random-data
## run the sort program
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar sort random-data sorted-data
## verify that sorted-data is in order
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-cdh5.2.1-tests.jar testmapredsort -sortInput random-data -sortOutput sorted-data

At this point, I believe you have a deeper understanding of how to deploy a Hadoop cluster. You might as well try it out in practice.
