The basic concepts of Hadoop, pseudo-distributed Hadoop cluster installation, and an HDFS / MapReduce demonstration

The Internet is moving from the IT era to the DT (data technology) era. Typical big data analysis applications include: 1. statistical analysis; 2. recommendation analysis; 3. machine learning (classification, clustering); 4. artificial intelligence and prediction (algorithms).

I. What is Hadoop
Official website: http://hadoop.apache.org
Hadoop is an open-source software platform under Apache for reliable, scalable, distributed computing. The Apache Hadoop platform is a framework that lets you process data with a simple programming model. It is designed to scale from a single server up to thousands of servers, each providing local computation and storage, and to detect and handle failures at the application layer (that is, high reliability and high fault tolerance), delivering highly available services on top of a cluster of computers, any of which may fail.
What Hadoop provides: using a cluster of servers, it processes massive amounts of data in a distributed way according to user-defined business logic.
Author: Doug Cutting.
Core components of Hadoop:
Hadoop Common: the common utilities that support the other Hadoop modules
Hadoop Distributed File System (HDFS): a distributed file system that solves the storage of massive data
Hadoop YARN: a compute resource scheduling system that solves resource management and scheduling
Hadoop MapReduce: a distributed computing programming framework and analysis model for massive data
After resource management was split out of MapReduce into a general framework in Hadoop 2.0, the three-layer structure of 1.0 evolved into a four-layer architecture:
1. Bottom layer - the storage layer: the HDFS file system
2. Middle layer - resource and data management: YARN, Sentry and the like
3. Upper layer - compute engines such as MapReduce, Impala and Spark
4. Top layer - high-level encapsulations and tools built on those compute engines, such as Hive, Pig and Mahout
"Hadoop" therefore usually refers to a broader concept: the Hadoop ecosystem.

II. Background of Hadoop
1. Hadoop originated from Nutch. The design goal of Nutch was to build a large-scale, web-wide search engine, including web page crawling, indexing and querying, but as the number of crawled pages grew it ran into a serious scalability problem: how to store and index billions of web pages.
2. Two papers published by Google in 2003 and 2004 offered a feasible solution: the distributed file system GFS for storing massive numbers of web pages, and the distributed computing framework MapReduce for computing the index over them.
3. The Nutch developers completed the corresponding open-source implementations, HDFS and MapReduce, which were spun off from Nutch into the independent project Hadoop. In January 2008 Hadoop became a top-level Apache project and entered a period of rapid development.

III. The position and relationship of Hadoop in big data and cloud computing
1. Cloud computing is the product of the integration of traditional computer technology and Internet technology: distributed computing, parallel computing, grid computing, multi-core computing, network storage, virtualization, load balancing and so on. It provides powerful computing capability to end users through business models such as IaaS (Infrastructure as a Service), PaaS (Platform as a Service) and SaaS (Software as a Service).
2. At the present stage, the underlying supporting technologies of cloud computing are "virtualization" and "big data technology".
3. Hadoop is one possible solution for the PaaS layer of cloud computing; it is not the same thing as PaaS, let alone cloud computing itself.

IV. Business applications of big data processing
1. Web server log analysis for large websites: the web server cluster of a large website produces roughly 800 GB of logs every 5 minutes, with a peak of 9 million clicks per second. The data is loaded into memory every 5 minutes, the site's hot URLs are computed at high speed, and this information is fed back to the front-end cache servers to improve the cache hit rate.
2. Carrier traffic analysis: daily traffic data of about 2 TB to 5 TB is copied to HDFS. Through an interactive analysis engine framework, hundreds of complex data cleaning and reporting jobs can be run, and the total time is two to three times faster than minicomputer clusters and DB2 on comparable hardware.
3. IPTV viewing statistics and video-on-demand recommendation: a real-time ratings statistics and VOD recommendation system that collects users' remote control operations in real time, provides a real-time ratings list, and implements VOD recommendation based on content-based and collaborative filtering algorithms.
4. Real-time analysis of video surveillance information from urban traffic checkpoints: based on a streaming engine, the video surveillance information collected at checkpoints across a province is analyzed in real time for alerting and statistics (computing real-time road conditions). Real-time alerts can be issued for vehicles that have not passed their annual inspection or for × × analysis within the province.

Big data is a composite field covering application development, software platforms, algorithms, data mining and so on, so employment options in the technical field are diverse. As far as Hadoop is concerned, the following skills or knowledge are usually required:
1. Building a Hadoop distributed cluster platform
2. Understanding the principles of the Hadoop distributed file system HDFS and using it
3. Understanding the principles of the Hadoop distributed computing framework MapReduce and programming against it
4. Proficient use of the Hive data warehouse tool
5. Skilled use of auxiliary tools such as Flume, Sqoop and Oozie
6. Development ability in scripting languages such as shell and Python

Hadoop's solution to massive data processing - the HDFS architecture:
Master-slave structure: the master node is the NameNode; the slave nodes are the (many) DataNodes.
The NameNode is responsible for: accepting users' operation requests; storing file metadata and maintaining the file system directory structure, such as the block list of each file and the DataNodes on which each block is located; and managing the relationship between files and blocks, and between blocks and DataNodes.
A DataNode is responsible for: storing files - block data is kept in the local file system together with a checksum file for each block; files are split into blocks and stored on disk; to keep the data safe, each file has multiple replicas.
The Secondary NameNode (2NN) is an auxiliary daemon that monitors the state of HDFS and takes snapshots of the HDFS metadata at regular intervals.
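Once the pseudo-distributed cluster built later in this article is running, this NameNode / DataNode division of labour can be observed directly from the command line. A minimal sketch (the file path /word/aa is only an illustrative example):
hdfs dfsadmin -report                           # capacity, block counts and status of every DataNode, as reported to the NameNode
hdfs fsck /word/aa -files -blocks -locations    # the blocks that make up the file and the DataNodes holding each replica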
The YARN architecture:
1) ResourceManager (RM), whose main functions are: (1) handling client requests; (2) monitoring the NodeManagers; (3) starting or monitoring the ApplicationMasters.
2) NodeManager (NM), whose main functions are: (1) managing the resources on a single node; (2) processing commands from the ResourceManager; (3) processing commands from the ApplicationMaster.
3) ApplicationMaster (AM), whose functions are: (1) splitting the input data; (2) requesting resources for the application and assigning them to its internal tasks; (3) task monitoring and fault tolerance.
4) Container: the resource abstraction in YARN. It encapsulates the multi-dimensional resources of a node, such as memory, CPU, disk and network.

How is the computation over massive data solved? The MapReduce architecture consists of two programs:
Map: process the input data locally and in parallel.
Reduce: summarize the locally processed results and then produce the global result.
The differences between the hadoop 1.x and hadoop 2.x versions show up in installation and deployment, operation and maintenance, development, and testing.
Three cores: HDFS, MapReduce, YARN.
Four modules:
hadoop common: provides the infrastructure for the other hadoop modules
hadoop dfs: a highly reliable, high-throughput distributed file system
hadoop mapreduce: a distributed offline parallel computing framework
hadoop yarn: a new MapReduce framework for task scheduling and resource management

Hadoop installation - the three installation modes of Hadoop:
1. Hadoop standalone mode is the default installation mode. It requires little more than the conservative defaults in a few configuration files, does not interact with other nodes, and does not use the HDFS file system; it is mainly used for debugging MapReduce programs.
2. Hadoop pseudo-distributed mode requires configuring the five regular configuration files (XML). It involves interaction between the NameNode and the DataNode, which sit on the same node, and it also requires configuring passwordless (mutual trust) SSH. Strictly speaking, a pseudo-distributed cluster is already a real cluster - it contains all the components of HDFS and MapReduce - it is just that all the components run on the same node.
3. Hadoop fully distributed mode is divided into the conventional fully distributed cluster and the Hadoop HA cluster (the distinction mainly concerns the number of NameNodes and the NameNode high-availability mechanism). Compared with a pseudo-distributed cluster, the processing daemons of a fully distributed cluster are spread across multiple nodes rather than living on a single one.

Building a pseudo-distributed cluster
I. Environment setup
1) System environment - platform: VMware Workstation 14; system: CentOS 7.4
2) Modify the host name and create the hadoop user:
hostnamectl set-hostname hadoop
useradd hadoop
passwd hadoop
visudo    # add the line: hadoop ALL=(ALL) ALL
Note: after changing the host name you need to exit and log in again for it to take effect.
3) Modify the /etc/hosts name resolution configuration file:
vi /etc/hosts
192.168.80.100 hadoop
4) Turn off the firewall and SELinux:
systemctl disable firewalld
systemctl stop firewalld
setenforce 0
To make the SELinux change permanent, set SELINUX=disabled in /etc/selinux/config.
5) Install the Java environment
1) Extract the Java package:
tar -xf jdk-8u11-linux-x64.tar.gz -C /opt
cp -rf jdk1.8.0_11/ /usr/local/java
2) Configure the Java environment variables. vi /etc/profile and add:
export JAVA_HOME=/usr/local/java
export JRE_HOME=/usr/local/java/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin
3) Make the environment variables take effect:
source /etc/profile
4) Verify with java -version. Output like the following means the Java environment was deployed successfully:
java version "1.8.0_11"
Java(TM) SE Runtime Environment (build 1.8.0_11-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.11-b03, mixed mode)

II. Official deployment and installation of Hadoop
Official documentation: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation
Download address: http://archive.apache.org/dist/hadoop/core/hadoop-3.1.0/hadoop-3.1.0.tar.gz
1. Decompress the hadoop package:
tar xf hadoop-3.1.0.tar.gz
2. Rename it:
mv hadoop-3.1.0/ /home/hadoop/hadoop
3. Configure the environment variables. vi /etc/profile and add:
export HADOOP_HOME=/home/hadoop/hadoop
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
export HADOOP_COMMON_LIB_NATIVE_DIR=/home/hadoop/hadoop/lib/native
export HADOOP_OPTS="-Djava.library.path=/home/hadoop/hadoop/lib"
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
# hadoop-3.1.0 must add the following five variables or startup will report an error; hadoop-2.x does not seem to need them
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
4. Make the environment variables take effect:
source /etc/profile
5. Test whether the configuration is successful. hadoop version should display a message like the following:
Hadoop 3.1.0
Source code repository https://github.com/apache/hadoop -r 16b70619a24cdcf5d3b0fcf4b58ca77238ccbe6d
Compiled by centos on 2018-03-30T00:00Z
Compiled with protoc 2.5.0
From source with checksum 14182d20c972b3e2105580a1ad6990
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.1.0.jar
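A quick sanity check that the new variables really took effect in the current shell (nothing Hadoop-specific, just the paths configured above):
echo $JAVA_HOME $HADOOP_HOME    # should print /usr/local/java /home/hadoop/hadoop
which hadoop                    # should resolve to /home/hadoop/hadoop/bin/hadoop
java -version                   # should still report the JDK installed earlier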
7. The Hadoop directory layout
Before modifying the configuration files, take a look at the directories under hadoop:
bin: the most basic hadoop management and usage scripts; the management scripts under sbin are implemented on top of them, and users can use these scripts directly to manage and use hadoop.
etc: the directory where the configuration files are stored, including files inherited from hadoop 1.x such as core-site.xml, hdfs-site.xml and mapred-site.xml, and new files added in hadoop 2.x such as yarn-site.xml.
include: externally provided programming header files (the concrete dynamic and static libraries are in the lib directory). These headers are defined for C++ and are usually used by C++ programs that access hdfs or write mapreduce programs.
lib: the dynamic and static libraries provided by hadoop, used together with the header files in include.
libexec: the directory of the shell configuration files for each service, which can be used to configure the log output directory, startup parameters and other information.
sbin: the hadoop management scripts, mainly the start and stop scripts of the various hdfs and yarn services.
share: the directory where the compiled jar packages of each module are located.

cd /home/hadoop/hadoop/etc/hadoop    # this directory is where the configuration files are stored
vi hadoop-env.sh                     # hadoop's environment variable script
# the JAVA_HOME line is line 54 in hadoop-3.1.0 and line 25 in hadoop-2.7.7
export JAVA_HOME=/usr/local/java
Test:
mkdir /home/input
hadoop jar /home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar wordcount /home/input /home/output
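For the test above to produce meaningful output, /home/input needs at least one text file before the jar is run, and in standalone mode the result is written to the local filesystem. A small sketch (the sample file name and the part-r-00000 output name are just the usual defaults):
echo "hello hadoop hello hdfs" > /home/input/test.txt    # put something into the input directory first
cat /home/output/part-r-00000                            # after the job finishes, the word counts appear here
Note that /home/output must not exist before the job runs, or the job will refuse to start.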
hdfs dfs -ls /
cd /home/hadoop/hadoop/etc/hadoop/
vi core-site.xml    # hadoop's global (public) configuration file; add the following lines:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>                <!-- specify the temporary data storage directory -->
        <value>/home/hadoop/hadoop/tmp</value>     <!-- a path on the local system -->
    </property>
</configuration>
Note: the documentation shipped in the hadoop installation directory contains a table of default parameters for every configuration file; users can consult it and modify values according to their actual situation. The fs.defaultFS URI uses HDFS's own protocol together with the node's own address and port.
As you can see in the /usr/local/hadoop/share/doc/hadoop/hadoop-project-dist/hadoop-common/core-default.html document, the default value of hadoop.tmp.dir is /tmp/hadoop-${user.name}, and /tmp/ is the temporary directory of the Linux system. If we do not re-specify it, the default Hadoop working directory sits in the Linux temporary directory; once the Linux system is restarted all of its files are emptied, the metadata and other information are lost, and reformatting is very troublesome.
vi hdfs-site.xml    # the hdfs site configuration file; add the following lines:
<property>
    <name>dfs.replication</name>       <!-- the number of hdfs replicas -->
    <value>1</value>
</property>
<property>
    <name>dfs.http.address</name>
    <value>192.168.80.100:50070</value>
</property>
Note: as you can see in the hdfs-default.xml document, the default value of dfs.replication is 3. Because the number of HDFS replicas cannot be greater than the number of DataNodes, and the hadoop we are installing has only one DataNode, change the dfs.replication value to 1. The default value of dfs.namenode.http-address is 0.0.0.0:9870 in hadoop-3.1.0 and 0.0.0.0:50070 in hadoop-2.7.7, so different versions access the NameNode web UI through different ports.
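After editing these files, one way to double-check which values actually take effect is the getconf tool, which reads the same configuration the daemons will use (a quick check; the expected values match the edits above):
hdfs getconf -confKey fs.defaultFS       # should print hdfs://hadoop:9000
hdfs getconf -confKey dfs.replication    # should print 1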
cp mapred-site.xml.template mapred-site.xml    # rename the template; in hadoop-3.1.0 mapred-site.xml already exists and does not need to be renamed, while hadoop-2.x needs this step
vi mapred-site.xml    # add the following lines to specify which framework MapReduce runs on; here we specify the yarn framework:
<property>
    <name>mapreduce.framework.name</name>    <!-- run MapReduce programs on yarn -->
    <value>yarn</value>
</property>
vi yarn-env.sh      # on the 2.x versions the jdk path needs to be set here (export JAVA_HOME, the same /usr/local/java configured earlier)
vi yarn-site.xml    # add the following lines:
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
8. Passwordless SSH interaction
ssh-keygen -t rsa    # generate an ssh key pair
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):    # press Enter
Enter passphrase (empty for no passphrase):                 # press Enter
Enter same passphrase again:                                # press Enter
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:9NevFFklAS5HaUGJtVrfAlbYk82bStTwPvHIWY7as38 root@hadoop
(the key's randomart image is also printed)
cd /root/.ssh/
ls
id_rsa  id_rsa.pub  known_hosts
Note: id_rsa is the private key and id_rsa.pub is the public key. Because this is a pseudo-distributed build, the namenode and datanode are on the same node.
cp id_rsa.pub authorized_keys    # allows the host to ssh to itself without a password
ssh hadoop date    # if the result is printed directly without asking for a password, passwordless login works

9. Start the hadoop cluster
1) First format the NameNode.
Note: if you have already run hadoop after formatting the NameNode and want to format it again, you need to delete the VERSION file generated by the first run of Hadoop, otherwise an error will occur. For more information, see question 4 in part 4.
cd
[root@hadoop ~]# hdfs namenode -format    # no errors plus the following message indicates that the formatting succeeded:
SHUTDOWN_MSG: Shutting down NameNode at hadoop/192.168.80.100
After the formatting is complete, the system generates the metadata information under the dfs.data.dir directory: name/current and data/current.
bin/hdfs namenode -format
sbin/hadoop-daemon.sh start namenode
sbin/hadoop-daemon.sh start datanode
or:
start-dfs.sh
start-yarn.sh
2) Or simply enter start-all.sh to launch everything:
start-all.sh
Starting namenodes on [hadoop]
Starting datanodes
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Starting secondary namenodes [hadoop]
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting resourcemanager
Starting nodemanagers
3) Execute jps to verify whether the cluster started successfully. jps should show the following processes:
96662 Jps
95273 DataNode
95465 SecondaryNameNode
95144 NameNode
95900 NodeManager
95775 ResourceManager
All five hadoop daemons (NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager) should be present.
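If one of these daemons is missing from the jps output, the first place to look is its log file under the logs directory of the installation. The naming below follows the usual hadoop-<user>-<daemon>-<hostname>.log pattern, so with the root user and the host name hadoop it would be, for example:
tail -n 50 /home/hadoop/hadoop/logs/hadoop-root-namenode-hadoop.log    # last lines of the NameNode log, which usually contain the startup error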
4) Log in to the HDFS management interface (NameNode): http://ip:50070
5) Log in to the MR / YARN resource management interface (ResourceManager): http://ip:8088
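The same information shown by these web pages can also be pulled from the command line with the yarn client, which talks to the ResourceManager described earlier (a small sketch, assuming the cluster is running):
yarn node -list           # the NodeManagers currently registered with the ResourceManager
yarn application -list    # running applications; each one has its own ApplicationMaster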
Usage - upload a file to hdfs:
hadoop fs -put aa hdfs://192.168.80.100:9000/    # aa is the name of the file to be uploaded
Shorthand: hadoop fs -put aa /
Download a file from hdfs: hadoop fs -get hdfs://192.168.80.100:9000/aa
Create a directory in hdfs: hadoop fs -mkdir hdfs://192.168.80.100:9000/word, which can also be shortened to: hadoop fs -mkdir /word
Call a MapReduce program:
hadoop jar hadoop-mapreduce-examples~ pi 5 5
Note: pi is the main class (it computes pi), the first 5 is the number of map tasks, and the second 5 is the number of samples per map.
hadoop jar hadoop-mapreduce-examples~ wordcount /word/input /word/output
hadoop fs -ls /word/output
hadoop fs -cat /word/output/part~

The idea behind HDFS:
1. hdfs stores files through a distributed cluster.
2. Files are split into blocks when they are stored in an hdfs cluster.
3. The blocks of a file are stored on several datanode nodes.
4. The mapping between files in the hdfs file system and their real blocks is managed by the namenode.
5. Each block is stored as multiple replicas in the cluster, which improves both the reliability of the data and the throughput of access.

Whenever the hadoop cluster is started or shut down, the system reports the following warning:
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Solution: first check whether the hadoop we installed is 64-bit:
[root@hadoop hadoop]# file /usr/local/hadoop/lib/native/libhadoop.so.1.0.0    # the following message indicates that our hadoop is 64-bit
/usr/local/hadoop/lib/native/libhadoop.so.1.0.0: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=8d84d1f56b8c218d2a33512179fabffbf237816a, not stripped
Permanent workaround: vi /usr/local/hadoop/etc/hadoop/log4j.properties, add the following line at the end of the file, then save and exit:
log4j.logger.org.apache.hadoop.util.NativeCodeLoader=Error

Configuration notes:
JDK: Hadoop and Spark depend on it; the officially recommended JDK version is 1.7 or above.
Scala: Spark depends on it; the recommended version is no lower than the one the spark release was built against.
Hadoop: a distributed system infrastructure.
Spark: a tool for processing big data held in distributed storage.
Zookeeper: a distributed application coordination service, required by HBase clusters.
HBase: a distributed storage system for structured data.
Hive: a data warehouse tool based on Hadoop; its current default metastore database is mysql.

Configure the history server:
vi mapred-site.xml
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop:10020</value>
</property>
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop:19888</value>
</property>
Start the history server: sbin/mr-jobhistory-daemon.sh start historyserver
Check whether the history server started: jps
View the jobhistory web page: 192.168.80.100:19888

Configure log aggregation. Log aggregation means that after an application finishes running, its run logs are uploaded to the HDFS system. Benefit: you can conveniently view the details of a program run, which is handy for development and debugging.
Note: enabling log aggregation requires restarting the NodeManager, the ResourceManager and the HistoryServer.
Steps - first stop them all:
sbin/mr-jobhistory-daemon.sh stop historyserver
sbin/yarn-daemon.sh stop nodemanager
sbin/yarn-daemon.sh stop resourcemanager
jps
vi yarn-site.xml
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
sbin/yarn-daemon.sh start resourcemanager
sbin/yarn-daemon.sh start nodemanager
sbin/mr-jobhistory-daemon.sh start historyserver
Test:
hadoop jar hadoop-mapreduce-examples~ pi 5 5
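Once log aggregation is enabled and a test job has finished, its logs can be pulled back out of HDFS with the yarn client instead of hunting through NodeManager-local files. A small sketch (the application id below is a made-up example; the real one is printed in the job output and listed by the first command):
yarn application -list -appStates FINISHED                # find the id of the completed job
yarn logs -applicationId application_1555598000000_0001   # fetch the aggregated logs for that application from HDFS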