Deploy Hadoop high-performance cluster
Server Overview
1) What is Hadoop?
Hadoop was created by Doug Cutting, the founder of Lucene. It is a basic framework for distributed file storage and large-scale data analysis and computation, modeled on designs published by Google, and it includes the MapReduce programming model, the HDFS file system, and other components.
Hadoop has two cores: a distributed storage system and a distributed computing system.
2) Distributed storage
Why does the data need to be stored in a distributed system? Can't a single computer hold it? Won't a few TB of hard drives be enough? In practice they are not. Telecom call logs, for example, are spread across many hard drives on many servers, and reading and writing that data server by server is far too cumbersome. What we want is a file system that governs many servers at once for storing data: when we store data through this file system, we do not notice that it lands on different servers, and when we read data back, we do not notice that it comes from different servers.
3) As shown in the figure: this is the distributed file system.
The distributed file system manages a server cluster. The data is stored on the nodes of the cluster (that is, the servers in the cluster), but the file system hides the differences between the servers. We can therefore use it just like an ordinary file system, even though the data is scattered across different servers.
4) Namespace
In a distributed storage system, data scattered across different nodes may belong to the same file. To organize a large number of files, files are placed into folders, and folders can be nested level by level; this form of organization is called the namespace. The namespace manages all the files in the entire server cluster. The responsibility for the namespace is separate from the responsibility for storing the actual data: the node in charge of the namespace is called the master node, and the nodes that store the actual data are called slave nodes.
5) Master / slave nodes:
The master node manages the file structure of the file system, and the slave nodes store the actual data; together they form a master-slave structure (master-slaves).
When users operate on data, they first contact the master node to find out which slave nodes hold the data, and then read the data from those slave nodes. To speed up user access, the master node keeps the entire namespace in memory, so the more files are stored, the more memory the master node needs.
(1) Block: the original data files stored on the slave nodes may be very large or very small, and files of different sizes are hard to manage uniformly, so a separate unit of storage is abstracted, called a block.
(2) Disaster recovery: when data is stored in a cluster, access may fail because of network problems or server hardware faults. It is best to use a replication mechanism that backs the data up to several servers at the same time, so that the data is safe and the probability of data loss or access failure is low. (A configuration sketch illustrating blocks and replication follows the summary in 6) below.)
(3) Workflow chart:
6) Summary:
In the master-slave structure above, the master node holds the directory structure information of the entire file system, which makes it extremely important. In addition, because the master node keeps the namespace information in memory while it runs, the more files are stored, the more memory the master node needs.
In hadoop, a distributed storage system is called HDFS (hadoop distributed file system). The master node is called the name node (namenode) and the slave node is called the data node (datanode).
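To make the block and replica ideas above concrete, here is a minimal illustrative sketch (not one of the build steps below) of how they appear in hdfs-site.xml; dfs.blocksize and dfs.replication are standard HDFS properties, and the values shown are only example assumptions:
    <property>
        <name>dfs.blocksize</name>
        <value>134217728</value>    <!-- files are split into 128MB blocks -->
    </property>
    <property>
        <name>dfs.replication</name>
        <value>3</value>    <!-- each block is copied to 3 data nodes for disaster recovery -->
    </property>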
7) Distributed computing:
To process data, we normally read it into memory. If the data is large, for example 100GB in which we want to count the total number of words, it is almost impossible to load it all into the memory of one machine, and shipping the data across the network to the program is expensive; this approach is called moving the data.
So can we instead put the program code on the servers where the data is stored? Program code is generally tiny, almost negligible compared with the original data, so distributing it saves the time of transferring the original data. Since the data lives in the distributed file system, the 100GB may be stored on many servers, so the program code can be distributed to those servers and executed on all of them at the same time; this is parallel computing, that is, distributed computing, and it greatly shortens the program's execution time. Moving the program code to the machines that hold the data and performing the calculation there is called moving the computation.
Distributed computing still needs a final result. After the code runs in parallel on many machines it produces many partial results, so another piece of code is needed to merge these intermediate results. Distributed computing in Hadoop is therefore generally done in two phases: the first phase reads the original data on each data node, does the preliminary processing, and counts the number of words in that node's data; the results are then handed to the second phase, which merges the intermediate results to produce the final answer, the total number of words in the 100GB file, as shown in the figure:
In hadoop, the distributed computing part is called MapReduce. MapReduce is a programming model for parallel operations on large data sets (larger than 1TB). The concepts "Map" and "Reduce", and the main ideas behind them, are borrowed from functional programming languages, along with features taken from vector programming languages. It makes it much easier for programmers to run their programs on a distributed system without having to write distributed parallel code themselves.
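As a concrete illustration of the two phases, the Hadoop distribution ships with a word-count example job. A minimal sketch of running it, assuming the cluster built later in this article is already up and that /input is a directory we create in HDFS (the file name is only an example):
[hadoop@xuegod63 hadoop-2.2.0]$ bin/hdfs dfs -mkdir -p /input
[hadoop@xuegod63 hadoop-2.2.0]$ bin/hdfs dfs -put /etc/hosts /input/    # upload any small text file
[hadoop@xuegod63 hadoop-2.2.0]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /input /output
[hadoop@xuegod63 hadoop-2.2.0]$ bin/hdfs dfs -cat /output/part-r-00000    # map tasks count words per block; the reduce task merges the totals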
There are the following roles in distributed computing:
The master node is called the job tracker (jobtracker).
The slave nodes are called task trackers (tasktracker).
On a task node, the code that runs the first phase is called a map task, and the code that runs the second phase is called a reduce task.
8) Hadoop terminology
(1) Hadoop: Apache's open-source distributed framework.
(2) HDFS: the Hadoop distributed file system.
(3) NameNode: the metadata master server of Hadoop HDFS. It keeps the file system's metadata, including which DataNodes hold each file's blocks; this server is a single point.
(4) JobTracker: Hadoop's Map/Reduce scheduler. It communicates with the TaskTrackers to assign computing tasks and track their progress; this server is also a single point.
(5) DataNode: a Hadoop data node, responsible for storing data.
(6) TaskTracker: the Hadoop worker daemon on each slave node, responsible for starting and executing Map and Reduce tasks.
Note: the NameNode records, for every file, which data nodes hold each of its blocks.
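A hedged example of looking at this metadata in practice: once the cluster below is running, the standard HDFS fsck tool can print the blocks of a file and the data nodes holding them (the path /input/hosts is only an assumed example):
[hadoop@xuegod63 hadoop-2.2.0]$ bin/hdfs fsck /input/hosts -files -blocks -locations    # lists each block and the DataNode addresses that store it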
One: experimental topology
Two: experimental objectives
Hands-on: build a Hadoop cluster.
Three: experimental environment
xuegod63.cn 192.168.1.63 NameNode
xuegod64.cn 192.168.1.64 DataNode1
xuegod62.cn 192.168.1.62 DataNode2
Four: experimental steps
1: Configure the basic environment as follows
1) Configure the hosts file on all three machines as follows:
[root@xuegod63 ~]# vim /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.63 xuegod63.cn
192.168.1.64 xuegod64.cn
192.168.1.62 xuegod62.cn
Copy hosts to the other two machines:
[root@xuegod63 ~]# scp /etc/hosts root@192.168.1.64:/etc/
[root@xuegod63 ~]# scp /etc/hosts root@192.168.1.62:/etc/
Note: in /etc/hosts, do not also map the machine name to the address 127.0.0.1, otherwise the data nodes will not be able to connect to the namenode; the error looks like this:
org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.1.10:9000
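An optional quick check that name resolution works after editing /etc/hosts (getent and ping are standard tools; exact output will vary):
[root@xuegod63 ~]# getent hosts xuegod64.cn    # should print 192.168.1.64 xuegod64.cn
[root@xuegod63 ~]# ping -c 1 xuegod62.cn       # should resolve to 192.168.1.62 and get a reply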
2) Passwordless ssh login
Configure this on xuegod63 so that it can log in to xuegod63, xuegod64 and xuegod62 over ssh without a password, which makes it easy to copy files and start services later. When the namenode starts, it connects to the datanodes to start the corresponding services.
(1) generate public and private keys
[root@xuegod63 ~] # ssh-keygen
(2) Import the public key to other datanode node authentication files
[root@xuegod63 ~] # ssh-copy-id root@192.168.1.62
[root@xuegod63 ~] # ssh-copy-id root@192.168.1.64
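An optional verification that key-based login now works from xuegod63 to both data nodes; no password prompt should appear:
[root@xuegod63 ~]# ssh root@192.168.1.64 hostname    # prints the remote hostname without asking for a password
[root@xuegod63 ~]# ssh root@192.168.1.62 hostname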
2: Java environment JDK should be configured and installed on all three machines:
1) Install and configure the Java runtime environment (jdk); here the jdk version is upgraded.
[root@xuegod63 ~]# rpm -ivh jdk-7u71-linux-x64.rpm
[root@xuegod63 ~]# rpm -qpl /root/jdk-7u71-linux-x64.rpm    # listing the package contents shows that jdk installs under /usr/java
[root@xuegod63 ~]# vim /etc/profile    # add the following at the end of the file:
export JAVA_HOME=/usr/java/jdk1.7.0_71
export JAVA_BIN=/usr/java/jdk1.7.0_71/bin
export PATH=${JAVA_HOME}/bin:$PATH
export CLASSPATH=.:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
2) make the configuration file effective
[root@xuegod63 ~]# source /etc/profile
3) verify that the java runtime environment is installed successfully:
[root@xuegod63 ~]# java -version
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
If the corresponding version appears, the java runtime environment has been installed successfully.
Note: only the jdk version is upgraded here, because a jdk was already installed on this system.
4) deploy jdk to the other two machines:
[root@xuegod63 ~]# scp jdk-7u71-linux-x64.rpm root@192.168.1.62:/root
[root@xuegod63 ~]# scp jdk-7u71-linux-x64.rpm root@192.168.1.64:/root
[root@xuegod63 ~]# scp /etc/profile 192.168.1.62:/etc/profile
[root@xuegod63 ~]# scp /etc/profile 192.168.1.64:/etc/profile
5) install:
[root@xuegod64 ~]# rpm -ivh jdk-7u71-linux-x64.rpm
[root@xuegod62 ~]# rpm -ivh jdk-7u71-linux-x64.rpm
6) reload the java runtime environment:
[root@xuegod64 ~]# source /etc/profile
[root@xuegod62 ~]# source /etc/profile
7) Test:
[root@xuegod64 ~]# java -version
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
[root@xuegod62 ~]# java -version
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
3: create a user account and Hadoop directory to run hadoop.
[root@xuegod63 ~]# useradd -u 8000 hadoop    # specify the UID explicitly when creating the user so that the hadoop user ID stays consistent on all servers
[root@xuegod63 ~]# echo 123456 | passwd --stdin hadoop
[root@xuegod64 ~]# useradd -u 8000 hadoop
[root@xuegod64 ~]# echo 123456 | passwd --stdin hadoop
[root@xuegod62 ~]# useradd -u 8000 hadoop
[root@xuegod62 ~]# echo 123456 | passwd --stdin hadoop
Note: when creating the hadoop user, do not use the option -s /sbin/nologin, because later we will switch to this user with su - hadoop.
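An optional check that the hadoop account is identical on every node (id is a standard command; run it on xuegod63, xuegod64 and xuegod62):
[root@xuegod63 ~]# id hadoop    # expect uid=8000(hadoop) on all three machines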
4: Install Hadoop on xuegod63 and configure it as the name node (master)
Hadoop installation directory: /home/hadoop/hadoop-2.2.0. Upload hadoop-2.2.0.tar.gz to the server using the root account.
1) create a working directory related to hadoop:
[root@xuegod63 ~]# cp hadoop-2.2.0.tar.gz /home/hadoop/
[root@xuegod63 ~]# chown -R hadoop:hadoop /home/hadoop/hadoop-2.2.0.tar.gz
[root@xuegod63 ~]# su - hadoop
[hadoop@xuegod63 ~]$ mkdir -p /home/hadoop/dfs/name /home/hadoop/dfs/data /home/hadoop/tmp
[hadoop@xuegod63 ~]$ tar zxvf hadoop-2.2.0.tar.gz
[hadoop@xuegod63 ~]$ ls
dfs hadoop-2.2.0 hadoop-2.2.0.tar.gz tmp
2) configure Hadoop: seven configuration files need to be modified.
File location: / home/hadoop/hadoop-2.2.0/etc/hadoop/
File names: hadoop-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml
(1) configuration file hadoop-env.sh, which specifies the java running environment of hadoop
This file is the configuration of the basic environment in which hadoop is running, and the location of the java virtual machine needs to be modified.
[hadoop@xuegod63 hadoop-2.2.0]$ vim /home/hadoop/hadoop-2.2.0/etc/hadoop/hadoop-env.sh
Change line 27: export JAVA_HOME=${JAVA_HOME}
To: export JAVA_HOME=/usr/java/jdk1.7.0_71
Note: this specifies the Java runtime environment variable; make sure it matches the Java version installed above.
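An optional way to confirm the edit took effect (grep is standard; the line number may differ slightly between releases):
[hadoop@xuegod63 hadoop-2.2.0]$ grep -n "^export JAVA_HOME" etc/hadoop/hadoop-env.sh    # should show the hard-coded /usr/java/jdk1.7.0_71 path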
(2) configuration file yarn-env.sh, which specifies the java running environment of the yarn framework
This file is the configuration of the running environment of the yarn framework, and you also need to modify the location of the java virtual machine.
Yarn: Hadoop's new MapReduce framework; YARN has been Hadoop's map-reduce framework since version 0.23.0.
[hadoop@xuegod63 hadoop-2.2.0]$ vim /home/hadoop/hadoop-2.2.0/etc/hadoop/yarn-env.sh
Change line 26: JAVA_HOME=$JAVA_HOME
To: JAVA_HOME=/usr/java/jdk1.7.0_71
(3) configuration file slaves, which specifies the datanode data storage server
Write all DataNode machine names to this file, one line for each hostname, and configure as follows:
[hadoop@xuegod63 hadoop-2.2.0]$ vim /home/hadoop/hadoop-2.2.0/etc/hadoop/slaves
Change: localhost
To:
xuegod64.cn
xuegod62.cn
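A quick check of the finished file; it should contain exactly the two DataNode hostnames, one per line:
[hadoop@xuegod63 hadoop-2.2.0]$ cat etc/hadoop/slaves
xuegod64.cn
xuegod62.cn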
(4) configure the file core-site.xml to specify the access path to the hadoop web interface
This is the core configuration file of hadoop. The two key properties here are fs.defaultFS (formerly fs.default.name), which names hadoop's HDFS file system and points it at port 9000 on the master host, and hadoop.tmp.dir, which sets the root of hadoop's tmp directory. That location does not exist on a fresh system, so it was created first with the mkdir command above.
[hadoop@xuegod63 hadoop-2.2.0]$ vim /home/hadoop/hadoop-2.2.0/etc/hadoop/core-site.xml
Change lines 19-20, the empty <configuration> and </configuration> element at the end of the file.
Note: insert the following content between <configuration> and </configuration>:
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://xuegod63.cn:9000</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
Note: each setting is one <property> element with a <name> and a <value>.
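Optionally, if the libxml2 command-line tools happen to be installed (not assumed by this article), the edited file can be checked for well-formed XML; the same check works for the other *-site.xml files below:
[hadoop@xuegod63 hadoop-2.2.0]$ xmllint --noout etc/hadoop/core-site.xml    # prints nothing when the XML is well formed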
(5) configuration file hdfs-site.xml
This is the configuration file for hdfs. dfs.http.address configures the http access address of hdfs.
dfs.replication configures the number of replicas of each file block, which should generally not be greater than the number of slave nodes.
[root@xuegod63 ~]# vim /home/hadoop/hadoop-2.2.0/etc/hadoop/hdfs-site.xml
Change lines 19-21, the empty <configuration> and </configuration> element at the end of the file.
Note: insert the following content between <configuration> and </configuration>:
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>xuegod63.cn:9001</value>    <!-- view HDFS status through the web interface -->
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
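The value of dfs.namenode.name.dir and the remaining properties are not shown above. A minimal sketch of how this block is commonly completed, based on the /home/hadoop/dfs/name and /home/hadoop/dfs/data directories created earlier and the two DataNodes in this cluster (an assumption, adjust to your own layout):
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/hadoop/dfs/name</value>    <!-- assumed: namenode metadata directory created earlier -->
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/hadoop/dfs/data</value>    <!-- assumed: datanode block directory created earlier -->
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>    <!-- assumed: no more than the number of DataNodes, as noted above -->
    </property>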