First, a brief introduction to Hadoop
1 Hadoop overall framework
Hadoop is composed of HDFS, MapReduce, HBase, Hive and ZooKeeper. The most fundamental element is the underlying file system HDFS, which stores the files of all storage nodes in the cluster; on top of it sits the MapReduce engine, which executes MapReduce programs.
1 Pig is a large-scale data analysis platform based on Hadoop; it provides a simple operation and programming interface for complex parallel computation over massive data.
2 Hive is a Hadoop-based tool that provides a complete SQL query interface and translates SQL statements into MapReduce jobs for execution (a small JDBC sketch follows this list).
3 ZooKeeper is an efficient, scalable coordination service used to store and coordinate critical shared state.
4 HBase is an open-source distributed database based on a column-oriented storage model.
5 HDFS is a distributed file system with high fault tolerance, suitable for applications with very large data sets.
6 MapReduce is a programming model for parallel computing over large-scale data sets.
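To make the Hive item above concrete, here is a minimal JDBC sketch (not taken from the article): it submits a SQL statement to HiveServer2, which compiles it into MapReduce jobs on the cluster. The host name server1, the port 10000, the user name and the sample table words are assumptions for illustration only.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 endpoint and credentials are assumptions for this sketch.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://server1:10000/default", "hadoop", "");
             Statement stmt = conn.createStatement();
             // Hive turns this SQL into MapReduce jobs and runs them on the cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT word, COUNT(*) FROM words GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}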
2 hadoop cluster deployment structure
3 hadoop core design
1 HDFS
HDFS is a highly fault-tolerant distributed file system that can be widely deployed on cheap PC hardware. It accesses application data in a streaming fashion, which improves the data throughput of the system, so it is well suited to applications with large data sets.
HDFS adopts a master-slave architecture. An HDFS cluster contains one namenode and multiple datanodes. The namenode is responsible for storing and managing the metadata of the whole HDFS file system, and usually only one machine in the cluster runs the namenode. The datanodes store the actual file data, and each of the other machines in the cluster runs a datanode instance. The namenode is called the name node and the datanodes are called data nodes; the datanodes communicate with the namenode periodically through a heartbeat mechanism. The namenode is equivalent to the master server in MFS, and the datanode to the chunk server in MFS.
2 HDFS read and write modes
Write
File write (as shown in the figure above):
1 The client sends a file write request to the namenode (the master server).
2 The namenode returns datanode (chunk server) information to the client based on the file size and the file block configuration.
3 The client splits the file into blocks and, using the datanode address information, writes each block in sequence to the datanodes.
Read
Steps:
1 The client sends a read request to the namenode.
2 The namenode returns a list of the file's block locations.
3 The client reads the file data according to the list.
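From application code, both flows are driven through the HDFS FileSystem API; the client library performs the namenode/datanode interactions described above. Below is a minimal Java sketch; the NameNode address hdfs://localhost:9000 and the sample path are assumptions, not values from the article.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // namenode address (assumption)
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/hadoop/demo.txt");
            // Write: the client asks the namenode for target datanodes,
            // then streams the data to those datanodes block by block.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }
            // Read: the client asks the namenode for the block locations,
            // then reads the blocks directly from the datanodes.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }
}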
2 MapReduce
MapReduce is a programming model for parallel computing over large-scale data sets, built around map (mapping) and reduce (simplification). Working in a distributed fashion, it first partitions a task and distributes the pieces to the cluster nodes, computes them in parallel, and then merges the results. Multi-node computing, task scheduling, load balancing and fault tolerance are all handled by the MapReduce framework.
The user submits a job to the JobTracker, and the JobTracker maps the map and reduce operations of the user program onto TaskTracker nodes. The input module splits the input data into small blocks and passes them to the map nodes. A map node takes each key/value pair, generates one or more intermediate key/value pairs, and writes them to temporary files. A reduce node obtains the data from the temporary files, iterates over all values with the same key, and writes the final result to a file.
The core of Hadoop is MapReduce, while the core of MapReduce lies in map and reduce functions. They are left to the user to implement, and these two functions define the task itself.
Map function: accepts a key-value pair (for example, the Splitting result in the figure above) and produces a set of intermediate key-value pairs (such as the result after Mapping in the figure above). The Map/Reduce framework passes all intermediate values that share the same key to one reduce function.
Reduce function: accepts a key and its associated set of values (for example, the result after Shuffling in the figure above) and merges this set of values to produce a smaller set of values (usually just one value, or none), such as the result after Reducing in the figure above.
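As an illustration of these two functions, here is the standard WordCount example written against the Hadoop MapReduce Java API (a generic sketch, not code from the article): map emits an intermediate (word, 1) pair for every word in its input line, and reduce merges all counts for the same word into one total. A driver class would then set the mapper, reducer, input and output paths on a Job object and submit it.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Map: take one input line and emit an intermediate (word, 1) pair per word.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: receive one word together with all of its counts and emit the merged total.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}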
However, Map/Reduce is not a panacea, and there are pre-conditions for Map/Reduce computing:
(1) the data set to be processed can be decomposed into many small data sets.
(2) and each small data set can be processed completely in parallel.
If either of the above two conditions is not satisfied, the Map/Reduce model is not suitable.
Second, setting up the environment
Software download location
Link: https://pan.baidu.com/s/1lBQ0jZC6MGj9zfV-dEiguw
Password: 13xi
1 configure hadoop user
2 download and decompress the related software
3 modify the environment variables to make hadoop run on the Java platform
4 modify the Java environment variables so that the status of the Hadoop processes can be viewed
5 View the result
Two, single-node deployment
1 create a folder and import data to test the single node, and use Hadoop's built-in method to complete the basic configuration. The output directory is created automatically and does not need to be created manually.
2 check its statistical results
Three, pseudo-distributed deployment
1 configure the file-system settings
2 configure the number of file replicas to keep
3 set the hadoop user's password and configure passwordless ssh authentication
4 configure the datanode node
5 format the namenode
A return value of 0 indicates that the format succeeded.
6 start the service and view the process status
Test shows results
7 Test
Create a directory and upload data to it, then view it
Upload files to the server and view them
Use the command to view the results
Delete a file and view the result
8 Advanced configuration: mapred configuration
9 start the service and view
10 check to see if it is successful
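One way to check the result from client code, as a rough sketch: connect to the pseudo-distributed file system and print its capacity and the contents of the root directory. The NameNode address hdfs://localhost:9000 is an assumption; use whatever fs.defaultFS was configured above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;
import org.apache.hadoop.fs.Path;

public class PseudoClusterCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumption
        try (FileSystem fs = FileSystem.get(conf)) {
            FsStatus status = fs.getStatus();
            System.out.printf("capacity=%d used=%d remaining=%d%n",
                    status.getCapacity(), status.getUsed(), status.getRemaining());
            for (FileStatus f : fs.listStatus(new Path("/"))) {
                System.out.println(f.getPath() + "  replication=" + f.getReplication());
            }
        }
    }
}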
Four, distributed configuration
1 stop the previous pseudo-distributed configuration first
2 as the superuser, install the service used for shared storage
3 start the service
4 configure shared storage
5 refresh to see if it is successful
6 start the service on the client and mount the shared storage
7 View configuration
8 configure the datanode node
9 configure the number of storage replicas
10 format the namenode
11 set up passwordless authentication
12 start the service and view
13 view the service on the datanode node
14 check whether the datanode node exists and is mounted normally
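A programmatic way to do this check, sketched below under the assumption that the NameNode is reachable at hdfs://server1:9000 (the article does not give the address): DistributedFileSystem exposes the same datanode report as hdfs dfsadmin -report, and the same call can later be filtered by state (for example DECOMMISSIONING) when following the data-migration section below.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DatanodeReportSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // NameNode URI is an assumption; use the cluster's actual fs.defaultFS.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://server1:9000"), conf)) {
            if (fs instanceof DistributedFileSystem) {
                DistributedFileSystem dfs = (DistributedFileSystem) fs;
                // One entry per datanode that has registered with the namenode.
                for (DatanodeInfo dn : dfs.getDataNodeStats()) {
                    System.out.println(dn.getHostName() + "  " + dn.getDatanodeReport());
                }
            }
        }
    }
}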
Five, add a node online
1 install and configure the basic environment
2 start the service and mount it
3 configure datanode node
4 configure passwordless authentication
5 start the service and view its processes
6 check whether it is added to the storage system
Six, node data migration
1 create a data directory and upload data
2 check whether the upload is successful
3 View the storage status of each node
4 configure the node to go offline (decommission)
Configure server3 as the node to be taken offline
5 make the configuration take effect
6 check the status of server3: if it is normal, the migration is complete; otherwise it has not finished yet
7 check the storage on the other nodes; the used storage has increased, which indicates that the data migration is complete
8 shut down the datanode; the node has gone offline successfully
9 enable the nodemanager service on the other nodes
Seven, high availability
Brief introduction:
In a typical HA cluster, two separate machines act as the NameNode (NN). At any time only one of them is in the active state while the other is in standby. The active NN handles all client operations in the cluster, while the standby NN simply maintains enough state to provide fast failover when necessary.
To keep the standby NN's state synchronized with the active NN and its metadata consistent, both communicate with the JournalNode (JN) daemons. Whenever the active NN makes any change to the namespace, it must persist the change to a majority of the JournalNodes (as an edits log). The standby NN watches for changes to the edits log, reads the edits from the JNs, and applies them to its own namespace. When the active NN fails, the standby NN makes sure that it has read all of the edits from the JNs before switching to the active state; this guarantees that its namespace state is fully synchronized with the active NN before a failover occurs.
To provide fast failover, the standby NN also needs to know the location of every block in the cluster. To achieve this, all datanodes in the cluster are configured with the addresses of both the active and the standby NN and send block locations and heartbeats to both of them.
In order to deploy an HA cluster, you need to prepare the following:
(1) NameNode machines: the machines running the active NN and the standby NN should have equivalent hardware.
(2) JournalNode machines: the machines that run the JN daemons. The JN daemons are relatively lightweight, so they can run on the same machines as other daemons such as the NN or the YARN ResourceManager. A cluster needs at least 3 JN daemons, which gives the system some fault tolerance. You can run more than 3, but to increase fault tolerance you should run an odd number of JNs (3, 5, 7, and so on); when running N JNs, the system can tolerate at most (N-1)/2 JN failures. In an HA cluster the standby NN also performs checkpoints of the namespace state, so it is not necessary to run a Secondary NN, CheckpointNode or BackupNode; in fact, running them would be an error.
1 HDFS high availability
1 shut down the previous services
2 View the configuration of each node
3 configure the service
4 delete the original configuration
5 clear the original configuration to prevent it from causing interference
6 install and configure the zookeeper service
ZooKeeper needs at least three nodes, and the total number of nodes must be odd.
7 start the service
8 check which node is the leader
9 start the service on the leader and view the relevant configuration
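A minimal Java connectivity check against the ensemble, as a sketch; the addresses server2:2181, server3:2181 and server4:2181 follow the hosts named in the next step, and the port is the ZooKeeper default. Which node is leader or follower is still checked on the nodes themselves, as in step 8.

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkConnectCheck {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Ensemble address is an assumption based on the hosts used in this article.
        ZooKeeper zk = new ZooKeeper("server2:2181,server3:2181,server4:2181", 5000,
                event -> {
                    if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                });
        connected.await();
        System.out.println("connected, session id = 0x" + Long.toHexString(zk.getSessionId()));
        zk.close();
    }
}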
10 configure the cluster
1 specify the namenode of hdfs as master (an arbitrary name) and specify the zookeeper cluster host addresses (the IP addresses of server2, server3 and server4)
2 Edit the hdfs-site.xml file:
A specify that the nameservice of hdfs is master
B define the namenode nodes (server1 and server5)
C specify where the namenode metadata is stored on the journalnodes
D specify where the journalnodes store their data on the local disk
E enable automatic failover for the namenode, configure how automatic failover is implemented, and configure the fencing mechanism; using the sshfence fencing mechanism requires passwordless ssh, a fencing timeout and other parameters
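The properties behind steps A to E are normally written into core-site.xml and hdfs-site.xml on the servers. The sketch below shows the standard property names on a client-side Hadoop Configuration object; the nameservice master and the hosts server1 to server5 follow the article, while the RPC port 9000, the namenode ids and the local paths are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://master");                        // logical nameservice
        conf.set("dfs.nameservices", "master");                           // step A
        conf.set("dfs.ha.namenodes.master", "nn1,nn2");                   // step B (ids assumed)
        conf.set("dfs.namenode.rpc-address.master.nn1", "server1:9000");  // step B (port assumed)
        conf.set("dfs.namenode.rpc-address.master.nn2", "server5:9000");  // step B (port assumed)
        conf.set("dfs.client.failover.proxy.provider.master",             // client-side failover
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        // On the servers, hdfs-site.xml would additionally carry steps C to E, for example:
        //   dfs.namenode.shared.edits.dir = qjournal://server2:8485;server3:8485;server4:8485/master
        //   dfs.journalnode.edits.dir     = a path on the local disk
        //   dfs.ha.automatic-failover.enabled = true, dfs.ha.fencing.methods = sshfence

        try (FileSystem fs = FileSystem.get(conf)) {
            // The client addresses the nameservice rather than a single namenode,
            // so it keeps working after a failover.
            System.out.println(fs.exists(new Path("/")));
        }
    }
}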
11 configure server5 to mount
12 start the journalnode (log) service on server2, server3 and server4
13 format the namenode
14 copy the generated metadata to the other highly-available node
15 configure passwordless authentication
16 start the zkfc service
17 View Services
18 verify high availability by shutting down the service
2 YARN high availability
1 specify that mapreduce runs on the yarn framework
2 configure the nodemanager so that mapreduce programs can run on it
3 activate RM high availability
4 specify the cluster ID of RM
5 define RM node
6 activate RM automatic recovery
7 configure how the RM state information is stored; the options include MemStore and ZKStore
8 when zookeeper (ZKStore) storage is configured, specify the address of the zookeeper cluster
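For reference, a sketch of the usual property names behind steps 1 to 8; in a real deployment they go into mapred-site.xml and yarn-site.xml on the nodes. The RM hosts server1 and server5, the cluster id and the rm ids are assumptions; the zookeeper addresses mirror the HDFS HA section above.

import org.apache.hadoop.conf.Configuration;

public class YarnHaPropsSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");                    // step 1
        conf.set("yarn.nodemanager.aux-services", "mapreduce_shuffle");  // step 2
        conf.set("yarn.resourcemanager.ha.enabled", "true");             // step 3
        conf.set("yarn.resourcemanager.cluster-id", "rmcluster");        // step 4 (name assumed)
        conf.set("yarn.resourcemanager.ha.rm-ids", "rm1,rm2");           // step 5
        conf.set("yarn.resourcemanager.hostname.rm1", "server1");        // step 5 (host assumed)
        conf.set("yarn.resourcemanager.hostname.rm2", "server5");        // step 5 (host assumed)
        conf.set("yarn.resourcemanager.recovery.enabled", "true");       // step 6
        conf.set("yarn.resourcemanager.store.class",                     // step 7 (ZKStore)
                "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore");
        conf.set("yarn.resourcemanager.zk-address",                      // step 8
                "server2:2181,server3:2181,server4:2181");
        System.out.println(conf.get("yarn.resourcemanager.store.class"));
    }
}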
9 start the yarn service and view
10 another node needs to start the service manually
11 View cluster status
12 testing
Disconnect the master node and observe the result
The active role then switches to server5.
View server5 status
Start server1
View server1 status
3 how to shut down the service