
Big data distributed Computing-- hadoop


Hadoop is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without knowing the underlying details of the distribution, making full use of the power of the cluster for high-speed computation and storage.

The Hadoop core project provides the basic services for building a cloud computing environment on low-end hardware, and it also provides the necessary API interface for software running in the cloud.

The two basic parts of the Hadoop core are the MapReduce framework (the cloud computing environment) and the HDFS distributed file system. In the Hadoop core framework, MapReduce is often referred to as mapred and HDFS as dfs. HDFS provides storage for massive data, and MapReduce provides computation over massive data.

The core concept of MapReduce is to divide the input data into different logical blocks. Map tasks first process each block separately and in parallel; the results from these logical blocks are then reassembled into different sorted sets, which are finally processed by Reduce tasks.
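
As a concrete illustration, the wordcount example shipped with Hadoop follows exactly this flow. The sketch below assumes the stand-alone setup described later in this article and an input directory containing text files:

# bin/hadoop jar hadoop-examples-1.2.1.jar wordcount input wcount    // map tasks emit (word, 1) pairs per block, reduce tasks sum the counts per word
# cat wcount/*    // the merged word counts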

The HDFS distributed file system has high fault tolerance and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications with very large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be read as a stream (streaming access).

Reference: hadoop.apache.org

Experimental environment: RHEL 6.5

Master: server7; slaves: server8 and server9. Note: the hostnames must resolve to each other on every machine.
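
For example, entries like the following in /etc/hosts on every machine would satisfy this; the IP addresses are the ones used in this setup, and the example.com names match those used later in this article:

# vim /etc/hosts
172.25.0.7 server7.example.com server7
172.25.0.8 server8.example.com server8
172.25.0.9 server9.example.com server9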

Installation and basic configuration

Create a hadoop user with uid 900 on each machine; the password is redhat.
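
On RHEL 6.5 this can be done roughly as follows (a sketch, run as root on each machine):

# useradd -u 900 hadoop    // create the hadoop user with uid 900
# echo redhat | passwd --stdin hadoop    // set the password to redhat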

On server7, the hadoop user's home directory is under /home/, and the following work is done there.

# tar zxf hadoop-1.2.1.tar.gz -C hadoop

# cd hadoop

# ln -s hadoop-1.2.1/ hadoop

# sh jdk-6u32-linux-x64.bin    // install Java

# ln -s jdk1.6.0_32 java

# vim .bash_profile    // configure the PATH

export JAVA_HOME=/home/hadoop/java
export CLASSPATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export PATH=$PATH:$HOME/bin:$JAVA_HOME/bin

# source .bash_profile

# echo $JAVA_HOME

It should show /home/hadoop/java.

# cd hadoop/conf

# vim hadoop-env.sh

Modify: export JAVA_HOME=/home/hadoop/java

# mkdir ../input

# cp *.xml ../input    // prepare sample input data

# cd ..

# bin/hadoop jar hadoop-examples-1.2.1.jar

This lists the example programs contained in hadoop-examples-1.2.1.jar and their usage, such as grep (search), sort, wordcount, and so on.

# bin/hadoop jar hadoop-examples-1.2.1.jar grep input output 'dfs[a-z.]+'    // search the files in input for strings starting with dfs followed by lowercase letters, and store the results in the automatically created output folder

# cd output/

# ls

# cat *

Hadoop has three working modes:

Stand-alone mode (standalone)

Stand-alone mode is Hadoop's default mode. When the Hadoop source package is unpacked for the first time, Hadoop knows nothing about the hardware environment, so it conservatively chooses a minimal configuration: all three XML configuration files are empty. With empty configuration files, Hadoop runs entirely locally. Since there is no need to interact with other nodes, stand-alone mode uses neither HDFS nor any Hadoop daemons. This mode is mainly used for developing and debugging the application logic of MapReduce programs.

Pseudo-distributed mode (Pseudo-Distributed Mode)

Pseudo-distributed mode runs Hadoop on a "single node cluster", where all daemons are running on the same machine. This mode adds code debugging to stand-alone mode, allowing you to check memory usage, HDFS input and output, and other daemon interactions.

Fully distributed mode

The Hadoop daemon runs on a cluster.

The above operation is in stand-alone mode.

Hadoop distributed deployment

Structure:

The master node runs the name node (namenode), the secondary name node (secondarynamenode), and the jobtracker daemons (the so-called master daemons), as well as the utilities and browser interface used to manage the cluster.

The slave nodes run the tasktracker and datanode daemons (the slave daemons). The difference between the two is that the master node runs the daemons that manage and coordinate the Hadoop cluster, while the slave nodes run the daemons that implement Hadoop file system (HDFS) storage and the MapReduce data-processing function.

The role of each daemon in the Hadoop framework:

Namenode is the master server in Hadoop that manages file system namespaces and access to files stored in the cluster.

The secondary namenode is not a redundant namenode daemon; it provides periodic checkpointing and housekeeping tasks.

You can find a namenode and a secondary namenode in each Hadoop cluster.

Datanode manages storage connected to nodes (there can be multiple nodes in a cluster). Each node that stores the data runs a datanode daemon.

Each cluster has a jobtracker, which is responsible for scheduling work on the datanode.

Each datanode has a tasktracker, and they do the actual work.

Jobtracker and tasktracker have a master-slave relationship: jobtracker dispatches work across the datanodes, and tasktracker executes the tasks. Jobtracker also monitors the requested work; if a datanode fails for some reason, jobtracker reschedules the previous task.

The following implements pseudo-distribution

For convenience, set ssh password-free.

As the hadoop user on server7:

# ssh-keygen

# ssh-copy-id localhost

# ssh localhost    // log in to this machine without a password

Modify the configuration file:

# cd hadoop/conf

# vim core-site.xml

Add the following:

<property>
    <name>fs.default.name</name>
    <value>hdfs://172.25.0.7:9000</value>
</property>

// specify the namenode

# vim mapred-site.xml

Add the following:

<property>
    <name>mapred.job.tracker</name>
    <value>172.25.0.7:9001</value>
</property>

// specify the jobtracker

# vim hdfs-site.xml

Add the following:

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>

// specify the number of replicas to keep of each file; since this is pseudo-distributed, one local copy is enough

# cd..

# bin/hadoop namenode -format    // format the namenode

# bin/start-dfs.sh    // start HDFS

# jps    // view the Java processes

You can see that secondarynamenode, namenode, and datanode have all started. Namenode and datanode are on the same machine, so this is pseudo-distributed.
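
The jps output looks roughly like the following; the PIDs are placeholders and will differ on your machine:

# jps
1986 NameNode
2123 DataNode
2260 SecondaryNameNode
2314 Jps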

# bin/start-mapred.sh    // start MapReduce

# bin/hadoop fs -put input test    // upload input to HDFS, stored there under the name test

Browse the network interfaces of NameNode and JobTracker, and their addresses are:

NameNode - http://172.25.0.7:50070/

JobTracker - http://172.25.0.7:50030/
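
If no graphical browser is available, a quick reachability check from the command line works as well (a sketch; expect HTTP status 200 from both):

# curl -s -o /dev/null -w "%{http_code}\n" http://172.25.0.7:50070/    // NameNode web interface
# curl -s -o /dev/null -w "%{http_code}\n" http://172.25.0.7:50030/    // JobTracker web interface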

View namenode

# bin/hadoop fs -ls test    // list the files in the test directory in HDFS

View files under test on Web

The following implements a fully distributed mode

Install nfs-utils on both the master and the slaves and start the rpcbind service (rpcbind tells NFS clients and the server which ports NFS is using; roughly speaking, rpc acts as a broker service). The slaves then use the hadoop installation from the master directly over NFS, which removes the need to install and configure it on every node.
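
On RHEL 6.5 this preparation is roughly the following on every machine (a sketch, run as root; package and service names are the RHEL 6 ones):

# yum install -y nfs-utils    // NFS client and server tools
# service rpcbind start    // rpcbind must be running before NFS is used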

On server7, start the nfs service

# vim /etc/exports

/home/hadoop    *(rw,all_squash,anonuid=900,anongid=900)

// share /home/hadoop; all_squash maps accessing users to the anonymous account, whose uid/gid is set to 900 (the hadoop user)
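
The share then has to be activated and can be verified on server7; a sketch:

# service nfs start    // start the NFS server and export /home/hadoop
# showmount -e localhost    // the /home/hadoop share should be listed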

On server8 and server9:

# mount 172.25.0.7:/home/hadoop /home/hadoop/    // mount the shared directory

On server7, as the hadoop user, edit hdfs-site.xml under hadoop/conf and change the replica count from 1 to 2.

# cd hadoop/conf

# vim slaves    // add the slaves

172.25.0.8

172.25.0.9

# vim masters    // set the master

172.25.0.7

Before starting fully distributed mode, stop the pseudo-distributed daemons and re-format the file system.

# cd ..

# bin/stop-all.sh    // stop jobtracker, namenode, and secondarynamenode

# bin/hadoop-daemon.sh stop tasktracker

# bin/hadoop-daemon.sh stop datanode    // stop tasktracker and datanode

# bin/hadoop namenode -format

# bin/start-dfs.sh    // the output shows connections to server8 and server9

# bin/start-mapred.sh

jps now also shows the jobtracker process.

On server8, jps shows three processes: Jps, DataNode, and TaskTracker.

The slave machines can also upload files, run queries, and so on:

# bin/hadoop fs -put input test

# bin/hadoop jar hadoop-examples-1.2.1.jar grep test out 'dfs[a-z]+'

On server7

# bin/hadoop dfsadmin -report    // display HDFS information

Since no files have been added yet, DFS Used% is 0%.

# dd if=/dev/zero of=bigfile bs=1M count=200

# bin/hadoop fs -put bigfile test

The web interface now shows DFS Used of about 403.33 MB (with a replica count of 2, the 200 MB file is stored once on each of the two slaves).

Note: sometimes an operational error causes hadoop to enter safe mode, in which operations such as uploading cannot be performed.

Just run the following command:

# bin/hadoop dfsadmin -safemode leave

Hadoop supports real-time expansion and can add slaves online.

Add a new slave, server10: install nfs-utils and start the rpcbind service, create the hadoop user with uid 900, mount /home/hadoop from server7, and add 172.25.0.10 to slaves under hadoop/conf (see the sketch after the note below).

Note: hostname resolution for server10 must be added on the master and the existing slaves before server10 is brought online.
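
Putting those preparation steps together, roughly (a sketch, run as root on server10; the slaves file is then edited on server7 as the hadoop user):

# yum install -y nfs-utils
# service rpcbind start
# useradd -u 900 hadoop    // same uid 900 as on the other nodes
# mount 172.25.0.7:/home/hadoop /home/hadoop/    // reuse the shared hadoop installation

On server7, 172.25.0.10 is then appended to hadoop/conf/slaves.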

On server10, as the hadoop user:

# bin/hadoop-daemon.sh start datanode

# bin/hadoop-daemon.sh start tasktracker

On server7

# bin/hadoop dfsadmin -report

You can see the information of server10

You can see that the server10 dfs used is 0, and you can move the data from server9 to server10.

Data migration:

Data migration is the process of moving rarely or unused files to a secondary storage system.

To migrate the data, remove the server9 datanode from the cluster online:

# bin/hadoop-daemon.sh stop tasktracker    // during data migration this node should not run tasktracker, otherwise exceptions will occur

Modify conf/hdfs-site.xml on the master (the dfs.hosts.exclude property is read by the namenode) and add the following:

<property>
    <name>dfs.hosts.exclude</name>
    <value>/home/hadoop/hadoop/conf/datanode-excludes</value>
</property>

Create datanode-excludes under conf and add the nodes to be removed, one per line:

# vim datanode-excludes

172.25.0.9    // remove node server9

# cd ..

# bin/hadoop dfsadmin -refreshNodes    // refresh the node list online

# bin/hadoop dfsadmin -report

You can see the server9 status: Decommission in progress

To delete a tasktracker node online

Modify conf/mapred-site.xml on server7 and add the following:

<property>
    <name>mapred.hosts.exclude</name>
    <value>/home/hadoop/hadoop/conf/tasktracker-excludes</value>
</property>

Create a tasktracker-excludes file under conf and add the hostname to be removed, one per line:

server9.example.com

# bin/hadoop mradmin -refreshNodes

When the status of this node is displayed as Decommissioned, the data migration is complete and the node can be safely shut down.
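
Once server9 reports Decommissioned, its daemons can be stopped with the same hadoop-daemon.sh script used earlier (run on server9):

# bin/hadoop-daemon.sh stop datanode    // the datanode has already been decommissioned
# rm -fr /tmp/*    // optionally clean up the local temporary data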

Hadoop 1.2.1 is quite old, and the jobtracker's scheduling capability is limited, so it easily becomes a bottleneck when there are many slaves. Using the newer version 2.6.4 is a better choice.

Stop the process and delete the file:

On server7

# bin/stop-all.sh

# cd /home/hadoop

# rm -fr hadoop java hadoop-1.2.1 jdk1.6.0_32

# rm -fr /tmp/*

On the slaves:

# bin/hadoop-daemon.sh stop datanode

# bin/hadoop-daemon.sh stop tasktracker

# rm -fr /tmp/*

The following operation is basically the same as above.

As the hadoop user on server7, under /home/hadoop/:

# tar zxf jdk-7u79-linux-x64.tar.gz -C /home/hadoop/

# ln -s jdk1.7.0_79 java

# tar zxf hadoop-2.6.4.tar.gz

# ln -s hadoop-2.6.4 hadoop

# cd hadoop/etc/hadoop

# vim hadoop-env.sh

export JAVA_HOME=/home/hadoop/java
export HADOOP_PREFIX=/home/hadoop/hadoop

# cd / home/hadoop/hadoop

# mkdir input

# cp etc/hadoop/*.xml input

# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.4.jar grep input output 'dfs[a-z.]+'

# cat output/*

Running grep prints a warning about missing native libraries, and problems may occur when the cluster is large, so the Hadoop native libraries need to be added:

# tar -xf hadoop-native-64.2.6.0.tar -C hadoop/lib/native/

# rm -rf output    // the example refuses to run if the output directory already exists

# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.4.jar grep input output 'dfs[a-z.]+'

This time no warning appears.

# cd etc/hadoop

# vim slaves

172.25.0.8

172.25.0.9

# vim core-site.xml

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://172.25.0.7:9000</value>
</property>

# vim hdfs-site.xml

<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>

# cd ../..    // back to the hadoop top-level directory

# bin/hdfs namenode -format

# sbin/start-dfs.sh

# jps

# ps ax    // the namenode and secondarynamenode processes are visible

# bin/hdfs dfs -mkdir -p /user/hadoop

# bin/hdfs dfs -put input/ test

The uploaded input can be seen in the web interface.

MapReduce's JobTracker/TaskTracker mechanism requires large-scale adjustments to fix its shortcomings in scalability, memory consumption, threading model, reliability and performance.

In order to fundamentally resolve the performance bottlenecks of the old MapReduce framework and support the longer-term development of Hadoop, the MapReduce framework has been completely rebuilt since version 0.23.0. The new Hadoop MapReduce framework is named MapReduce V2, or YARN.

# vim etc/hadoop/yarn-site.xml

<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>server7.example.com</value>
</property>

# sbin/start-yarn.sh

# jps

On server8, jps shows that the corresponding nodemanager process has started.
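
To confirm from the master as well, the yarn command can list the registered node managers (a sketch):

# bin/yarn node -list    // each slave should appear as a RUNNING node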
