
Hadoop pseudo-distributed deployment


1. Environmental preparation

System: CentOS 6.5

JDK 1.8

Create a hadoop installation directory

mkdir /bdapps
tar xf hadoop-2.7.5.tar.gz -C /bdapps/
cd /bdapps/
ln -sv hadoop-2.7.5 hadoop
cd hadoop

Create an environment script

vim /etc/profile.d/hadoop.sh

The contents are as follows:

export HADOOP_PREFIX=/bdapps/hadoop
export PATH=$PATH:${HADOOP_PREFIX}/bin:${HADOOP_PREFIX}/sbin
export HADOOP_YARN_HOME=${HADOOP_PREFIX}
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_HDFS_HOME=${HADOOP_PREFIX}

2. Create users and groups

For security and other reasons, the different hadoop daemons are usually run as dedicated users. Here, hadoop is used as the group, and three users, yarn, hdfs, and mapred, run the corresponding processes.

groupadd hadoop
useradd -g hadoop yarn
useradd -g hadoop hdfs
useradd -g hadoop mapred

Create data and log directories

Hadoop requires data and log directories with different permissions. Here, /data/hadoop/hdfs is the hdfs data storage directory.

Ensure that the hdfs user has permissions on the /data/ directory.

mkdir -pv /data/hadoop/hdfs/{nn,dn,snn}
chown -R hdfs:hadoop /data/hadoop/hdfs/

Then, create the logs directory in the hadoop installation directory and modify the owner and group of all hadoop files

cd /bdapps/hadoop/
mkdir logs
chmod g+w logs
chown -R yarn:hadoop ./*

3. Configure hadoop

etc/hadoop/core-site.xml

The core-site.xml file contains information such as the NameNode host address and its listening RPC port, which is localhost for the pseudo-distributed model installation. The default RPC port used by NameNode is 8020. The brief configuration is as follows

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:8020</value>
        <final>true</final>
    </property>
</configuration>

etc/hadoop/hdfs-site.xml

hdfs-site.xml is mainly used to configure HDFS-related attributes, such as the replication factor (that is, the number of copies of each data block) and the directories where the NN and DN store their data. For a pseudo-distributed Hadoop the number of copies of a block should be 1, and the directories used by the NN and DN to store data are the paths created specifically for them in the previous step. The SNN directories created in the previous step are also configured here.

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///data/hadoop/hdfs/nn</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///data/hadoop/hdfs/dn</value>
    </property>
    <property>
        <name>fs.checkpoint.dir</name>
        <value>file:///data/hadoop/hdfs/snn</value>
    </property>
    <property>
        <name>fs.checkpoint.edits.dir</name>
        <value>file:///data/hadoop/hdfs/snn</value>
    </property>
</configuration>

Explanation:

dfs.replication: the number of replicas is 1; the pseudo-distributed deployment runs all roles locally, so only one copy is kept.

dfs.namenode.name.dir: the path where the namenode stores its data.

dfs.datanode.data.dir: the path where the datanode stores its data.

fs.checkpoint.dir: the checkpoint file storage path.

fs.checkpoint.edits.dir: the checkpoint edits directory.

Note: if you need other users to have write access to hdfs, you also need to add an attribute definition to hdfs-site.xml.

<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>

This setting means that HDFS permissions are not strictly checked, so other users gain write permission.

etc/hadoop/mapred-site.xml

The mapred-site.xml file is used to configure the MapReduce framework of the cluster; here you should specify the use of yarn (the other available values are local and classic). A template is provided as mapred-site.xml.template; just copy it to mapred-site.xml.

cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml

The configuration example is as follows

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

etc/hadoop/yarn-site.xml

yarn-site.xml configures the properties of the YARN daemons. First, you need to specify the host and listening port of the ResourceManager daemon; for the pseudo-distributed model, the host is localhost and the default port is 8032. Second, you need to specify the scheduler used by ResourceManager and the auxiliary services of NodeManager. A brief configuration example is as follows:

<configuration>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>localhost:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>localhost:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>localhost:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>localhost:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>localhost:8088</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.auxservices.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>
</configuration>

etc/hadoop/hadoop-env.sh and etc/hadoop/yarn-env.sh

The hadoop daemons depend on the JAVA_HOME environment variable. If JAVA_HOME is already defined globally, for example through /etc/profile.d/java.sh as in the previous step, it can be used as is. However, if you want hadoop to depend on a specific Java environment, you can edit these two script files, uncomment their JAVA_HOME lines and set the appropriate values. In addition, the heap size used by most hadoop daemons is 1GB by default; in real-world applications you may need to adjust the heap size of the various processes, which only requires editing the values of the relevant environment variables in these two files, for example HADOOP_HEAPSIZE, HADOOP_JOB_HISTORY_HEAPSIZE, JAVA_HEAP_SIZE, YARN_HEAP_SIZE, and so on.
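
For instance, a minimal sketch of the relevant lines in etc/hadoop/hadoop-env.sh, assuming the JDK is installed under /usr/java/jdk1.8.0 (adjust the path to your own installation):

export JAVA_HOME=/usr/java/jdk1.8.0    # assumed JDK location; point it at your JDK
export HADOOP_HEAPSIZE=1000            # daemon heap size in MB; 1000 is the default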

The slaves file

The slaves file stores the list of all slave nodes in the current cluster. For the pseudo-distributed model, the file content should be only localhost, which is in fact the default value of this file, so its contents can be kept as they are.
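
For instance, the default content can be verified like this (the path assumes the installation directory used above):

$ cat /bdapps/hadoop/etc/hadoop/slaves
localhost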

4. Format HDFS

Before HDFS's NN starts, you need to initialize the directory in which it stores the data. If the directory specified by the dfs.namenode.name.dir attribute in hdfs-site.xml does not exist, the formatting command will automatically create it; if it exists in advance, make sure that its permissions are set correctly, and the format operation will erase all its internal data and re-establish a new file system. You need to execute the following command as the hdfs user

hdfs namenode -format

hdfs command examples

Check what files and directories exist under the / directory of the hdfs file system. It is empty by default.

$ hdfs dfs -ls /

Create a test directory test on the hdfs file system

$ hdfs dfs -mkdir /test
$ hdfs dfs -ls /
Found 1 items
drwxr-xr-x   - hdfs supergroup          0 2018-03-26 13:48 /test

Note: the directory just created belongs to the group supergroup rather than hadoop, so other users in the hadoop group have no write permission to it. The hdfs-site.xml file above contains the configuration

dfs.permissions, which, if set to false, gives write permission to users belonging to the hadoop group.

Upload the local /etc/fstab file to the /test directory on the hdfs file system.

$ hdfs dfs -put /etc/fstab /test/fstab
$ hdfs dfs -ls /test
Found 1 items
-rw-r--r--   1 hdfs supergroup        223 2018-03-26 13:55 /test/fstab

View the contents of a file on the hdfs file system using the cat command.

$ hdfs dfs -cat /test/fstab
UUID=dbcbab6c-2836-4ecd-8d1b-2da8fd160694 /    ext4    defaults        1 1
tmpfs       /dev/shm    tmpfs   defaults        0 0
devpts      /dev/pts    devpts  gid=5,mode=620  0 0
sysfs       /sys        sysfs   defaults        0 0
proc        /proc       proc    defaults        0 0
dev/vdb1    none        swap    sw              0 0

5. Start hadoop

Switch to hdfs user

su - hdfs

In hadoop 2, startup and related operations are carried out through dedicated scripts located in the sbin directory:

NameNode: hadoop-daemon.sh (start|stop) namenode

DataNode: hadoop-daemon.sh (start|stop) datanode

Secondary NameNode: hadoop-daemon.sh (start|stop) secondarynamenode

ResourceManager: yarn-daemon.sh (start|stop) resourcemanager

NodeManager: yarn-daemon.sh (start|stop) nodemanager

Start the HDFS service

HDFS has three daemons: namenode, datanode, and secondarynamenode, all of which are started or stopped through the hadoop-daemon.sh script. Execute the relevant commands as the hdfs user, as shown below:

Start namenode

$ hadoop-daemon.sh start namenode
starting namenode, logging to /bdapps/hadoop/logs/hadoop-hdfs-namenode-SRV-OPS01-LINTEST01.out
$ jps
99466 NameNode
99566 Jps

Start secondarynamenode

$ hadoop-daemon.sh start secondarynamenode
starting secondarynamenode, logging to /bdapps/hadoop/logs/hadoop-hdfs-secondarynamenode-SRV-OPS01-LINTEST01.out
$ jps
100980 SecondaryNameNode
101227 Jps
99466 NameNode

Start datanode

$ hadoop-daemon.sh start datanode
starting datanode, logging to /bdapps/hadoop/logs/hadoop-hdfs-datanode-SRV-OPS01-LINTEST01.out
$ jps
101617 DataNode
100980 SecondaryNameNode
101767 Jps
99466 NameNode

Start the yarn cluster

Log in to the system as the yarn user, and then start the services.

YARN has two daemons, resourcemanager and nodemanager, both of which can be started or stopped by yarn-daemon.sh scripts. Execute the relevant commands as the yarn user.

Start resourcemanager

$ yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /bdapps/hadoop/logs/yarn-yarn-resourcemanager-SRV-OPS01-LINTEST01.out
$ jps
110218 Jps
109999 ResourceManager

Start nodemanager

$ yarn-daemon.sh start nodemanager
starting nodemanager, logging to /bdapps/hadoop/logs/yarn-yarn-nodemanager-SRV-OPS01-LINTEST01.out
$ jps
111061 Jps
110954 NodeManager
109999 ResourceManager

6. Web UI

HDFS and the YARN ResourceManager each provide a Web interface through which you can check the status of the HDFS and YARN clusters. Their access addresses are listed below; in use, replace NameNodeHost and ResourceManagerHost with the corresponding host addresses.

HDFS-NameNode: http://NameNodeHost:50070/

YARN-ResourceManager: http://ResourceManagerHost:8088/

Note: if the value of the yarn.resourcemanager.webapp.address attribute in the yarn-site.xml file is defined as "localhost:8088", its WebUI only listens on port 8088 at the address 127.0.0.1.
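
If the Web UI has to be reachable from other machines, one option (a sketch, not part of the configuration above) is to bind the webapp address to all interfaces in yarn-site.xml:

<property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>0.0.0.0:8088</value>
</property>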

7. Run the test program

Hadoop-YARN comes with many sample programs, located in the share/hadoop/mapreduce/ directory under the hadoop installation path; among them, hadoop-mapreduce-examples can be used to test mapreduce, as shown below.
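
The wordcount example below reads /yarn/fstab on HDFS, so that file has to exist first. A minimal sketch of staging it, run as a user with write access (for example hdfs, or any user once dfs.permissions is set to false as above); the /yarn path is simply the directory the example expects:

$ hdfs dfs -mkdir /yarn
$ hdfs dfs -put /etc/fstab /yarn/fstab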

yarn jar /bdapps/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar wordcount /yarn/fstab /yarn/fstab.out

This counts the words in the /yarn/fstab file on the hdfs file system and stores the results under /yarn/fstab.out.

18/03/26 16:07:01 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
18/03/26 16:07:02 INFO input.FileInputFormat: Total input paths to process: 1
18/03/26 16:07:02 INFO mapreduce.JobSubmitter: number of splits:1
18/03/26 16:07:02 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1522044437617_0001
18/03/26 16:07:02 INFO impl.YarnClientImpl: Submitted application application_1522044437617_0001
18/03/26 16:07:02 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1522044437617_0001/
18/03/26 16:07:02 INFO mapreduce.Job: Running job: job_1522044437617_0001
18/03/26 16:07:10 INFO mapreduce.Job: Job job_1522044437617_0001 running in uber mode: false
18/03/26 16:07:10 INFO mapreduce.Job:  map 0% reduce 0%
18/03/26 16:07:15 INFO mapreduce.Job:  map 100% reduce 0%
18/03/26 16:07:20 INFO mapreduce.Job:  map 100% reduce 100%
18/03/26 16:07:20 INFO mapreduce.Job: Job job_1522044437617_0001 completed successfully
18/03/26 16:07:21 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=272
        FILE: Number of bytes written=243941
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=320
        HDFS: Number of bytes written=191
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=2528
        Total time spent by all reduces in occupied slots (ms)=2892
        Total time spent by all map tasks (ms)=2528
        Total time spent by all reduce tasks (ms)=2892
        Total vcore-milliseconds taken by all map tasks=2528
        Total vcore-milliseconds taken by all reduce tasks=2892
        Total megabyte-milliseconds taken by all map tasks=2588672
        Total megabyte-milliseconds taken by all reduce tasks=2961408
    Map-Reduce Framework
        Map input records=6
        Map output records=36
        Map output bytes=367
        Map output materialized bytes=272
        Input split bytes=97
        Combine input records=36
        Combine output records=19
        Reduce input groups=19
        Reduce shuffle bytes=272
        Reduce input records=19
        Reduce output records=19
        Spilled Records=38
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=153
        CPU time spent (ms)=1290
        Physical memory (bytes) snapshot=447442944
        Virtual memory (bytes) snapshot=4177383424
        Total committed heap usage (bytes)=293076992
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=223
    File Output Format Counters
        Bytes Written=191

View statistical results

$ hdfs dfs -cat /yarn/fstab.out/part-r-00000
/	1
/dev/pts	1
/dev/shm	1
/proc	1
/sys	1
0	10
1	2
UUID=dbcbab6c-2836-4ecd-8d1b-2da8fd160694	1
defaults	4
dev/vdb1	1
devpts	2
ext4	1
gid=5,mode=620	1
none	1
proc	2
sw	1
swap	1
sysfs	2
tmpfs	2

Questions:

1. Other servers cannot connect to hdfs's port 8020 service?

This is because localhost:8020 is configured in the core-site.xml file, so the NameNode listens only on the address 127.0.0.1; it needs to be changed to the actual IP of the server.
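
For instance, a sketch of the adjusted fs.defaultFS entry in core-site.xml, assuming the server's address is 192.168.1.10 (substitute your own IP or a resolvable hostname); the HDFS daemons need to be restarted afterwards:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.1.10:8020</value>
    <final>true</final>
</property>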

2. Other users do not have write permission in the hdfs file system?

By default, only the user that starts the HDFS service has write permission. If you want other users to have write permission, you can add the following configuration to the hdfs-site.xml file:

<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>

Or modify the permissions of a directory on the hdfs file system:

For example:
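
A minimal sketch, assuming the goal is to let members of the hadoop group write to the /test directory created earlier (the target path is only an example):

$ hdfs dfs -chown -R hdfs:hadoop /test
$ hdfs dfs -chmod -R g+w /test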
