I. Introduction to Hadoop
At the core of the Hadoop framework are two components: HDFS and MapReduce. HDFS provides storage for massive data sets, while MapReduce provides computation over them.
1. HDFS introduction
Hadoop implements a distributed file system, the Hadoop Distributed File System, referred to as HDFS.
HDFS is highly fault-tolerant and designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is well suited to applications with very large data sets. HDFS relaxes some POSIX requirements so that file system data can be accessed as a stream.
2. HDFS composition
HDFS uses a master/slave architecture: an HDFS cluster consists of a single NameNode and a number of DataNodes. The NameNode is the master server; it manages the file system namespace and regulates client access to files. The DataNodes manage the storage attached to the nodes they run on. HDFS exposes data to users in the form of files.
Internally, a file is split into one or more blocks, which are stored on a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories, and it also determines the mapping of data blocks to specific DataNodes. The DataNodes serve read and write requests from the file system clients and perform block creation, deletion, and replication under the unified scheduling of the NameNode. The NameNode manages all HDFS metadata; user data never flows through the NameNode.
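As a concrete illustration of this block-to-DataNode mapping, the standard fsck tool can list where each block of a file is stored (a minimal sketch, assuming a running cluster and the /anaconda-ks.cfg file that is uploaded to HDFS later in this article):
# list the file's blocks and the DataNodes holding each replica
hdfs fsck /anaconda-ks.cfg -files -blocks -locations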
3. MapReduce introduction
Hadoop MapReduce is an open-source implementation of Google's MapReduce.
MapReduce is a computing model for processing large amounts of data. The Map step applies a user-specified operation to independent elements of the data set and produces intermediate results in the form of key-value pairs. The Reduce step then combines all intermediate values that share the same key to produce the final result. This functional decomposition makes MapReduce well suited to data processing in a distributed, parallel environment made up of a large number of machines.
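A rough single-machine analogy for the map, shuffle/sort, reduce flow is the classic shell word-count pipeline below (an illustrative sketch only, with a hypothetical input.txt; it is not how Hadoop itself is invoked, and the real cluster run appears in the WordCount section later):
# "map": split the input into one word per line (each word is a key)
# "shuffle/sort": sort brings identical keys together
# "reduce": uniq -c aggregates the occurrences of each key
tr -s '[:space:]' '\n' < input.txt | sort | uniq -c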
4. MapReduce architecture
Hadoop MapReduce uses a master/slave (M/S) architecture and consists of the following components: Client, JobTracker, TaskTracker, and Task.
JobTracker
The JobTracker (job tracker) is a critical process that runs on the master node (the NameNode host). It is the scheduler of the MapReduce system: the daemon that handles jobs (the code submitted by users), decides which files a job will process, splits the job into small tasks, and assigns them to the child nodes where the required data resides. Hadoop's principle is to run computation close to the data: the program runs on the same physical node where the data is stored. The JobTracker performs this scheduling, monitors the tasks, and restarts failed tasks (on different nodes). Each cluster has only one JobTracker, a single point much like the NameNode, located on the master node.

TaskTracker
The TaskTracker (task tracker) is the other background process of the MapReduce system. It runs on every slave node alongside the DataNode (following the same code-next-to-data principle) and manages the tasks assigned to that node by the JobTracker. Each node runs only one TaskTracker, but a TaskTracker can start multiple JVMs to run Map Tasks and Reduce Tasks in parallel, and it reports task status back to the JobTracker.

Map Task: parses each input record, passes it to the user-written map() function, and writes the output to the local disk (for a map-only job, output is written directly to HDFS).

Reduce Task: remotely reads its input from the Map Task outputs, sorts it, and processes it with the user-written reduce() function.

II. Installation of Hadoop
1. Download and install

# download the installation package
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
# extract it and move it into place
tar xf hadoop-2.7.3.tar.gz && mv hadoop-2.7.3 /usr/local/hadoop
# create the working directories
mkdir -p /home/hadoop/{name,data,log,journal}
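An optional sanity check after extraction (hadoop is not on the PATH yet, so the full path is used):
# the extracted distribution should contain bin, sbin, etc/hadoop, share, ...
ls /usr/local/hadoop
# the working directories created above
ls /home/hadoop
# print the version to confirm the unpacked distribution works
/usr/local/hadoop/bin/hadoop version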
2. Configure Hadoop environment variables
Create the file /etc/profile.d/hadoop.sh:
# HADOOP ENV
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Make the Hadoop environment variables take effect:
source /etc/profile.d/hadoop.sh
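A quick, optional check that the variables are picked up in a new shell:
echo $HADOOP_HOME    # should print /usr/local/hadoop
hadoop version       # should now resolve without a full path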
III. Hadoop configuration
1. Configure hadoop-env.sh
Edit the file /usr/local/hadoop/etc/hadoop/hadoop-env.sh and modify the following fields:
export JAVA_HOME=/usr/java/default
export HADOOP_HOME=/usr/local/hadoop
2. Configure yarn-env.sh
Edit the file /usr/local/hadoop/etc/hadoop/yarn-env.sh and modify the following fields:
export JAVA_HOME=/usr/java/default
3. Configure the DataNode whitelist: slaves
Edit the file /usr/local/hadoop/etc/hadoop/slaves:
datanode01
datanode02
datanode03
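The hostnames used in this article (namenode01, datanode01-03, and the zk01-03 ZooKeeper quorum referenced in core-site.xml below) must be resolvable on every node. A minimal /etc/hosts sketch follows; only 192.168.1.200 for namenode01 appears later in the article, so the other addresses and the zk/cluster1 aliases are assumptions (the jps output further down shows QuorumPeerMain running on the DataNodes):
# append on every node; addresses other than 192.168.1.200 are assumptions
cat >> /etc/hosts <<'EOF'
192.168.1.200  namenode01 cluster1
192.168.1.201  datanode01 zk01
192.168.1.202  datanode02 zk02
192.168.1.203  datanode03 zk03
EOF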
4. Configure the core component: core-site.xml
Edit the file /usr/local/hadoop/etc/hadoop/core-site.xml as follows:
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://cluster1:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/data</value>
    </property>
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>zk01:2181,zk02:2181,zk03:2181</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131702</value>
    </property>
</configuration>
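Once the file is saved, the value Hadoop actually resolves for a key can be checked from the command line (optional sketch):
hdfs getconf -confKey fs.default.name    # should print hdfs://cluster1:9000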
5. Configure the file system: hdfs-site.xml
Edit the file /usr/local/hadoop/etc/hadoop/hdfs-site.xml as follows:
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/hadoop/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/hadoop/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.nameservices</name>
        <value>cluster1</value>
    </property>
</configuration>
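Note that dfs.replication only sets the default for newly written files; the replication factor of data already in HDFS can be changed per path (an illustrative sketch, using the /anaconda-ks.cfg file uploaded later in this article):
# set the replication factor of an existing file to 2 and wait for completion
hadoop fs -setrep -w 2 /anaconda-ks.cfg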
6. Configure the computing framework: mapred-site.xml
Edit the file /usr/local/hadoop/etc/hadoop/mapred-site.xml as follows:
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapred.local.dir</name>
        <value>/home/hadoop/data</value>
    </property>
    <property>
        <name>mapreduce.admin.map.child.java.opts</name>
        <value>-Xmx256m</value>
    </property>
    <property>
        <name>mapreduce.admin.reduce.child.java.opts</name>
        <value>-Xmx4096m</value>
    </property>
    <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx512m</value>
    </property>
    <property>
        <name>mapred.task.timeout</name>
        <value>1200000</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.hosts.exclude</name>
        <value>slaves.exclude</value>
    </property>
    <property>
        <name>mapred.hosts.exclude</name>
        <value>slaves.exclude</value>
    </property>
</configuration>
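dfs.hosts.exclude and mapred.hosts.exclude refer to an exclude list used when decommissioning nodes. The original article does not show its creation, so a plausible step (an assumption) is to create it empty next to the other configuration files:
# create an empty exclude list; add hostnames to it later to decommission nodes
touch /usr/local/hadoop/etc/hadoop/slaves.exclude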
7. Configure the computing framework: yarn-site.xml
Edit the file /usr/local/hadoop/etc/hadoop/yarn-site.xml as follows:
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>namenode01</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>${yarn.resourcemanager.hostname}:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>${yarn.resourcemanager.hostname}:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>${yarn.resourcemanager.hostname}:8088</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>${yarn.resourcemanager.hostname}:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>${yarn.resourcemanager.hostname}:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>8182</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>512</value>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>2048</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>12</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>2.1</value>
    </property>
    <property>
        <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
        <value>98.0</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>

8. Copy the configuration files to the other nodes

cd /usr/local/hadoop/etc/hadoop
scp * datanode01:/usr/local/hadoop/etc/hadoop
scp * datanode02:/usr/local/hadoop/etc/hadoop
scp * datanode03:/usr/local/hadoop/etc/hadoop
chown -R hadoop:hadoop /usr/local/hadoop
chmod 755 /usr/local/hadoop/etc/hadoop

IV. Hadoop startup
1. Format HDFS (executed on namenode01)

hdfs namenode -format
hadoop-daemon.sh start namenode

2. Restart Hadoop (executed on namenode01)

stop-all.sh
start-all.sh

V. Check Hadoop
1. Check the processes with jps

[root@namenode01 ~]# jps
17419 NameNode
17780 ResourceManager
18152 Jps

[root@datanode01 ~]# jps
2227 DataNode
1292 QuorumPeerMain
2509 Jps
2334 NodeManager

[root@datanode02 ~]# jps
13940 QuorumPeerMain
18980 DataNode
19093 NodeManager
19743 Jps

[root@datanode03 ~]# jps
19238 DataNode
19350 NodeManager
14215 QuorumPeerMain
20014 Jps
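Besides jps, the overall HDFS state can also be checked from the command line on namenode01 (optional sketch):
# configured and used capacity, plus the list of live DataNodes
hdfs dfsadmin -report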
2. WEB interface of HDFS
Visit http://192.168.1.200:50070/
3. WEB interface of YARN
Visit http://192.168.1.200:8088/
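The same information is also available from the YARN command line (optional sketch):
yarn node -list           # NodeManagers registered with the ResourceManager
yarn application -list    # applications submitted to the cluster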
VI. WordCount verification of MapReduce
1. Upload the file to be processed to HDFS

[root@namenode01 ~]# hadoop fs -put /root/anaconda-ks.cfg /anaconda-ks.cfg

2. Run wordcount

[root@namenode01 ~]# cd /usr/local/hadoop/share/hadoop/mapreduce/
[root@namenode01 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.3.jar wordcount /anaconda-ks.cfg /test
18/11/17 00:04:45 INFO client.RMProxy: Connecting to ResourceManager at namenode01/192.168.1.200:8032
18/11/17 00:04:45 INFO input.FileInputFormat: Total input paths to process : 1
18/11/17 00:04:45 INFO mapreduce.JobSubmitter: number of splits:1
18/11/17 00:04:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1541095016765_0004
18/11/17 00:04:46 INFO impl.YarnClientImpl: Submitted application application_1541095016765_0004
18/11/17 00:04:46 INFO mapreduce.Job: The url to track the job: http://namenode01:8088/proxy/application_1541095016765_0004/
18/11/17 00:04:46 INFO mapreduce.Job: Running job: job_1541095016765_0004
18/11/17 00:04:51 INFO mapreduce.Job: Job job_1541095016765_0004 running in uber mode : false
18/11/17 00:04:51 INFO mapreduce.Job:  map 0% reduce 0%
18/11/17 00:04:55 INFO mapreduce.Job:  map 100% reduce 0%
18/11/17 00:04:59 INFO mapreduce.Job:  map 100% reduce 100%
18/11/17 00:04:59 INFO mapreduce.Job: Job job_1541095016765_0004 completed successfully
18/11/17 00:04:59 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=1222
        FILE: Number of bytes written=241621
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1023
        HDFS: Number of bytes written=941
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=1758
        Total time spent by all reduces in occupied slots (ms)=2125
        Total time spent by all map tasks (ms)=1758
        Total time spent by all reduce tasks (ms)=2125
        Total vcore-milliseconds taken by all map tasks=1758
        Total vcore-milliseconds taken by all reduce tasks=2125
        Total megabyte-milliseconds taken by all map tasks=1800192
        Total megabyte-milliseconds taken by all reduce tasks=2176000
    Map-Reduce Framework
        Map input records=38
        Map output records=90
        Map output bytes=1274
        Map output materialized bytes=1222
        Input split bytes=101
        Combine input records=90
        Combine output records=69
        Reduce input groups=69
        Reduce shuffle bytes=1222
        Reduce input records=69
        Reduce output records=69
        Spilled Records=138
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=99
        CPU time spent (ms)=970
        Physical memory (bytes) snapshot=473649152
        Virtual memory (bytes) snapshot=4921606144
        Total committed heap usage (bytes)=441450496
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=922
    File Output Format Counters
        Bytes Written=941

3. View the result

[root@namenode01 mapreduce]# hadoop fs -cat /test/part-r-00000
#    1
#version=DEVEL    1
$6$kRQ2y1nt/B6c6ETs$ITy0O/E9P5p0ePWlHJ7fRTqVrqGEQf7ZGi5IX2pCA7l25IdEThUNjxelq6wcD9SlSa1cGcqlJy2jjiV9/lMjg/    1
%addon    1
%end    2
%packages    1
--all    1
--boot-drive=sda    1
--bootproto=dhcp    1
--device=enp1s0    1
--disable    1
--drives=sda    1
--enable    1
--enableshadow    1
--hostname=localhost.localdomain    1
--initlabel    1
--ipv6=auto    1
--isUtc    1
--iscrypted    1
--location=mbr    1
--onboot=off    1
--only-use=sda    1
--passalgo=sha512    1
@core    1
Agent    1
Asia/Shanghai    1
CDROM    1
Keyboard    1
Network    1
Partition    1
Root    1
Run    1
Setup    1
System    4
Use    2
auth    1
authorization    1
autopart    1
boot    1
bootloader    2
cdrom    1
clearing    1
clearpart    1
com_redhat_kdump    1
configuration    1
first    1
firstboot    1
graphical    2
ignoredisk    1
information    3
install    1
installation    1
keyboard    1
lang    1
language    1
layouts    1
media    1
network    2
on    1
password    1
rootpw    1
the    1
timezone    2
...

VII. The use of Hadoop
Common hadoop fs commands are listed below; a short example session follows the list.
View the fs help command: hadoop fs -help
Check HDFS disk space: hadoop fs -df -h
Create a directory: hadoop fs -mkdir
Upload local files: hadoop fs -put
View the file list: hadoop fs -ls
View file contents: hadoop fs -cat
Copy a file: hadoop fs -cp
Download an HDFS file to local: hadoop fs -get
Move a file: hadoop fs -mv
Delete a file: hadoop fs -rm -r -f
Delete a folder: hadoop fs -rm -r
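A short end-to-end session tying these commands together (a hypothetical example; the /demo and /tmp paths are assumptions, not from the original):
hadoop fs -mkdir /demo                          # create a directory in HDFS
hadoop fs -put /etc/hosts /demo/hosts           # upload a local file
hadoop fs -ls /demo                             # list the directory
hadoop fs -cat /demo/hosts                      # print the file contents
hadoop fs -cp /demo/hosts /demo/hosts.bak       # copy within HDFS
hadoop fs -get /demo/hosts.bak /tmp/hosts.bak   # download to the local file system
hadoop fs -mv /demo/hosts.bak /demo/hosts.old   # move/rename within HDFS
hadoop fs -rm /demo/hosts.old                   # delete a file
hadoop fs -rm -r /demo                          # delete the whole directory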