First, a brief introduction to Hadoop
1 Hadoop overall framework
Hadoop is composed of HDFS, MapReduce, HBase, Hive and ZooKeeper. The most fundamental element is the underlying file system HDFS, which stores the files of all storage nodes in the cluster; on top of it sits the MapReduce engine, which executes MapReduce programs.
1 Pig is a large-scale data analysis platform based on Hadoop; it provides a simple operation and programming interface for complex parallel computation over massive data.
2 Hive is a Hadoop-based tool that provides a complete SQL query interface and translates SQL statements into MapReduce jobs for execution (a small JDBC sketch follows this list).
3 ZooKeeper is an efficient, scalable coordination service used to store and coordinate critical shared state.
4 HBase is an open-source distributed database based on a column-oriented storage model.
5 HDFS is a distributed file system with high fault tolerance, suitable for applications with very large data sets.
6 MapReduce is a programming model for parallel computing over large-scale data sets.
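To make the Hive item above concrete, here is a minimal JDBC sketch (not taken from the article): it submits a SQL statement to HiveServer2, which compiles it into MapReduce jobs on the cluster. The host name server1, the port 10000, the user name and the sample table words are assumptions for illustration only.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 endpoint and credentials are assumptions for this sketch.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://server1:10000/default", "hadoop", "");
             Statement stmt = conn.createStatement();
             // Hive turns this SQL into MapReduce jobs and runs them on the cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT word, COUNT(*) FROM words GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}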
2 hadoop cluster deployment structure
3 hadoop core design
1 HDFS
HDFS is a highly fault-tolerant distributed file system that can be widely deployed on cheap PC hardware. It accesses application data in a streaming fashion, which improves the data throughput of the system, so it is well suited to applications with large data sets.
HDFS adopts a master-slave architecture. An HDFS cluster contains one namenode and multiple datanodes. The namenode is responsible for storing and managing the metadata of the whole HDFS file system, and usually only one machine in the cluster runs the namenode. The datanodes store the actual file data, and each of the other machines in the cluster runs a datanode instance. The namenode is called the name node and the datanodes are called data nodes; the datanodes communicate with the namenode periodically through a heartbeat mechanism. The namenode is equivalent to the master server in MFS, and the datanode to the chunk server in MFS.
2 HDFS read and write modes
Write
File write (as shown in the figure above):
1 The client sends a file write request to the namenode (the master server).
2 The namenode returns datanode (chunk server) information to the client based on the file size and the file block configuration.
3 The client splits the file into blocks and, using the datanode address information, writes each block in sequence to the datanodes.
Read
Steps:
1 The client sends a read request to the namenode.
2 The namenode returns a list of the file's block locations.
3 The client reads the file data according to the list.
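From application code, both flows are driven through the HDFS FileSystem API; the client library performs the namenode/datanode interactions described above. Below is a minimal Java sketch; the NameNode address hdfs://localhost:9000 and the sample path are assumptions, not values from the article.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // namenode address (assumption)
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/hadoop/demo.txt");
            // Write: the client asks the namenode for target datanodes,
            // then streams the data to those datanodes block by block.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }
            // Read: the client asks the namenode for the block locations,
            // then reads the blocks directly from the datanodes.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }
}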
2 MapReduce
MapReduce is a programming model for parallel computing over large-scale data sets, built around map (mapping) and reduce (simplification). Working in a distributed fashion, it first partitions a task and distributes the pieces to the cluster nodes, computes them in parallel, and then merges the results. Multi-node computing, task scheduling, load balancing and fault tolerance are all handled by the MapReduce framework.
The user submits a job to the JobTracker, and the JobTracker maps the map and reduce operations of the user program onto TaskTracker nodes. The input module splits the input data into small blocks and passes them to the map nodes. A map node takes each key/value pair, generates one or more intermediate key/value pairs, and writes them to temporary files. A reduce node obtains the data from the temporary files, iterates over all values with the same key, and writes the final result to a file.
The core of Hadoop is MapReduce, while the core of MapReduce lies in map and reduce functions. They are left to the user to implement, and these two functions define the task itself.
Map function: accepts a key-value pair (for example, the Splitting result in the figure above) and produces a set of intermediate key-value pairs (such as the result after Mapping in the figure above). The Map/Reduce framework passes all intermediate values that share the same key to one reduce function.
Reduce function: accepts a key and its associated set of values (for example, the result after Shuffling in the figure above) and merges this set of values to produce a smaller set of values (usually just one value, or none), such as the result after Reducing in the figure above.
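As an illustration of these two functions, here is the standard WordCount example written against the Hadoop MapReduce Java API (a generic sketch, not code from the article): map emits an intermediate (word, 1) pair for every word in its input line, and reduce merges all counts for the same word into one total. A driver class would then set the mapper, reducer, input and output paths on a Job object and submit it.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Map: take one input line and emit an intermediate (word, 1) pair per word.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: receive one word together with all of its counts and emit the merged total.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}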
However, Map/Reduce is not a panacea, and there are pre-conditions for Map/Reduce computing:
(1) the data set to be processed can be decomposed into many small data sets.
(2) and each small data set can be processed completely in parallel.
If either of the above two conditions is not satisfied, the Map/Reduce model is not suitable.
Second, setting up the environment
Software download location
Link: https://pan.baidu.com/s/1lBQ0jZC6MGj9zfV-dEiguw
Password: 13xi
1 configure hadoop user
2 download and decompress the related software
3 modify the environment variables to make hadoop run on the Java platform
4 modify the Java environment variables so that the status of the Hadoop processes can be viewed
5 View the result
Two, single-node deployment
1 create a folder and import data to test the single node, and use Hadoop's built-in method to complete the basic configuration. The output directory is created automatically and does not need to be created manually.
2 check its statistical results
Three, pseudo-distributed deployment
1 configure the file-system settings
2 configure the number of file replicas to keep
3 set the hadoop user's password and configure passwordless ssh authentication
4 configure the datanode node
5 format the namenode
A return value of 0 indicates that the format succeeded.
6 start the service and view the process status
Test shows results
7 Test
Create a directory and upload data to it, then view it
Upload files to the server and view them
Use the command to view the results
Delete a file and view the result
8 Advanced configuration: mapred configuration
9 start the service and view
10 check to see if it is successful
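One way to check the result from client code, as a rough sketch: connect to the pseudo-distributed file system and print its capacity and the contents of the root directory. The NameNode address hdfs://localhost:9000 is an assumption; use whatever fs.defaultFS was configured above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;
import org.apache.hadoop.fs.Path;

public class PseudoClusterCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumption
        try (FileSystem fs = FileSystem.get(conf)) {
            FsStatus status = fs.getStatus();
            System.out.printf("capacity=%d used=%d remaining=%d%n",
                    status.getCapacity(), status.getUsed(), status.getRemaining());
            for (FileStatus f : fs.listStatus(new Path("/"))) {
                System.out.println(f.getPath() + "  replication=" + f.getReplication());
            }
        }
    }
}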
Four, distributed configuration
1 stop the previous pseudo-distributed configuration first
2 as the superuser, install the service used for shared storage
3 start the service
4 configure shared storage
5 refresh to see if it is successful
6 start the service on the client and mount the shared storage
7 View configuration
8 configure the datanode node
9 configure the number of storage replicas
10 format the namenode
11 set up passwordless authentication
12 start the service and view
13 view the service on the datanode node
14 check whether the datanode node exists and is mounted normally
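A programmatic way to do this check, sketched below under the assumption that the NameNode is reachable at hdfs://server1:9000 (the article does not give the address): DistributedFileSystem exposes the same datanode report as hdfs dfsadmin -report, and the same call can later be filtered by state (for example DECOMMISSIONING) when following the data-migration section below.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DatanodeReportSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // NameNode URI is an assumption; use the cluster's actual fs.defaultFS.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://server1:9000"), conf)) {
            if (fs instanceof DistributedFileSystem) {
                DistributedFileSystem dfs = (DistributedFileSystem) fs;
                // One entry per datanode that has registered with the namenode.
                for (DatanodeInfo dn : dfs.getDataNodeStats()) {
                    System.out.println(dn.getHostName() + "  " + dn.getDatanodeReport());
                }
            }
        }
    }
}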
Five, add a node online
1 install and configure the basic environment
2 start the service and mount it
3 configure datanode node
4 configure passwordless authentication
5 start the service and view its processes
6 check whether it is added to the storage system
Six, node data migration
1 create a data directory and upload data
2 check whether the upload is successful
3 View the storage status of each node
4 configure the node to go offline (decommission)
Configure server3 as the node to be taken offline
5 make the configuration take effect
6 check the status of server3: if it is normal, the migration is complete; otherwise it has not finished yet
7 check the storage on the other nodes; the used storage has increased, which indicates that the data migration is complete
8 shut down the datanode; the node has gone offline successfully
9 enable the nodemanager service on the other nodes
Seven, high availability
Brief introduction:
In a typical HA cluster, two separate machines act as the NameNode (NN). At any time only one of them is in the active state while the other is in standby. The active NN handles all client operations in the cluster, while the standby NN simply maintains enough state to provide fast failover when necessary.
To keep the standby NN's state synchronized with the active NN and its metadata consistent, both communicate with the JournalNode (JN) daemons. Whenever the active NN makes any change to the namespace, it must persist the change to a majority of the JournalNodes (as an edits log). The standby NN watches for changes to the edits log, reads the edits from the JNs, and applies them to its own namespace. When the active NN fails, the standby NN makes sure that it has read all of the edits from the JNs before switching to the active state; this guarantees that its namespace state is fully synchronized with the active NN before a failover occurs.
To provide fast failover, the standby NN also needs to know the location of every block in the cluster. To achieve this, all datanodes in the cluster are configured with the addresses of both the active and the standby NN and send block locations and heartbeats to both of them.
In order to deploy an HA cluster, you need to prepare the following:
(1) NameNode machines: the machines running the active NN and the standby NN should have equivalent hardware.
(2) JournalNode machines: the machines that run the JN daemons. The JN daemons are relatively lightweight, so they can run on the same machines as other daemons such as the NN or the YARN ResourceManager. A cluster needs at least 3 JN daemons, which gives the system some fault tolerance. You can run more than 3, but to increase fault tolerance you should run an odd number of JNs (3, 5, 7, and so on); when running N JNs, the system can tolerate at most (N-1)/2 JN failures. In an HA cluster the standby NN also performs checkpoints of the namespace state, so it is not necessary to run a Secondary NN, CheckpointNode or BackupNode; in fact, running them would be an error.
1 HDFS high availability
1 shut down the previous services
2 View the configuration of each node
3 configure the service
4 delete the original configuration
5 clear the original configuration to prevent it from causing interference
6 install and configure the zookeeper service
ZooKeeper needs at least three nodes, and the total number of nodes must be odd.
7 start the service
8 check which node is the leader
9 start the service on the leader and view the relevant configuration
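A minimal Java connectivity check against the ensemble, as a sketch; the addresses server2:2181, server3:2181 and server4:2181 follow the hosts named in the next step, and the port is the ZooKeeper default. Which node is leader or follower is still checked on the nodes themselves, as in step 8.

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkConnectCheck {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Ensemble address is an assumption based on the hosts used in this article.
        ZooKeeper zk = new ZooKeeper("server2:2181,server3:2181,server4:2181", 5000,
                event -> {
                    if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                });
        connected.await();
        System.out.println("connected, session id = 0x" + Long.toHexString(zk.getSessionId()));
        zk.close();
    }
}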
10 configure the cluster
1 specify the namenode of hdfs as master (an arbitrary name) and specify the zookeeper cluster host addresses (the IP addresses of server2, server3 and server4)
2 Edit the hdfs-site.xml file:
A specify that the nameservice of hdfs is master
B define the namenode nodes (server1 and server5)
C specify where the namenode metadata is stored on the journalnodes
D specify where the journalnodes store their data on the local disk
E enable automatic failover for the namenode, configure how automatic failover is implemented, and configure the fencing mechanism; using the sshfence fencing mechanism requires passwordless ssh, a fencing timeout and other parameters
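The properties behind steps A to E are normally written into core-site.xml and hdfs-site.xml on the servers. The sketch below shows the standard property names on a client-side Hadoop Configuration object; the nameservice master and the hosts server1 to server5 follow the article, while the RPC port 9000, the namenode ids and the local paths are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://master");                        // logical nameservice
        conf.set("dfs.nameservices", "master");                           // step A
        conf.set("dfs.ha.namenodes.master", "nn1,nn2");                   // step B (ids assumed)
        conf.set("dfs.namenode.rpc-address.master.nn1", "server1:9000");  // step B (port assumed)
        conf.set("dfs.namenode.rpc-address.master.nn2", "server5:9000");  // step B (port assumed)
        conf.set("dfs.client.failover.proxy.provider.master",             // client-side failover
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        // On the servers, hdfs-site.xml would additionally carry steps C to E, for example:
        //   dfs.namenode.shared.edits.dir = qjournal://server2:8485;server3:8485;server4:8485/master
        //   dfs.journalnode.edits.dir     = a path on the local disk
        //   dfs.ha.automatic-failover.enabled = true, dfs.ha.fencing.methods = sshfence

        try (FileSystem fs = FileSystem.get(conf)) {
            // The client addresses the nameservice rather than a single namenode,
            // so it keeps working after a failover.
            System.out.println(fs.exists(new Path("/")));
        }
    }
}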
11 configure server5 to mount
12 start the journalnode (log) service on server2, server3 and server4
13 format the namenode
14 copy the generated metadata to the other highly-available node
15 configure passwordless authentication
16 start the zkfc service
17 View Services
18 verify high availability by shutting down the service
2 YARN high availability
1 specify that mapreduce runs on the yarn framework
2 configure the nodemanager so that mapreduce programs can run on it
3 activate RM high availability
4 specify the cluster ID of RM
5 define RM node
6 activate RM automatic recovery
7 configure how the RM state information is stored; the options include MemStore and ZKStore
8 when zookeeper (ZKStore) storage is configured, specify the address of the zookeeper cluster
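For reference, a sketch of the usual property names behind steps 1 to 8; in a real deployment they go into mapred-site.xml and yarn-site.xml on the nodes. The RM hosts server1 and server5, the cluster id and the rm ids are assumptions; the zookeeper addresses mirror the HDFS HA section above.

import org.apache.hadoop.conf.Configuration;

public class YarnHaPropsSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");                    // step 1
        conf.set("yarn.nodemanager.aux-services", "mapreduce_shuffle");  // step 2
        conf.set("yarn.resourcemanager.ha.enabled", "true");             // step 3
        conf.set("yarn.resourcemanager.cluster-id", "rmcluster");        // step 4 (name assumed)
        conf.set("yarn.resourcemanager.ha.rm-ids", "rm1,rm2");           // step 5
        conf.set("yarn.resourcemanager.hostname.rm1", "server1");        // step 5 (host assumed)
        conf.set("yarn.resourcemanager.hostname.rm2", "server5");        // step 5 (host assumed)
        conf.set("yarn.resourcemanager.recovery.enabled", "true");       // step 6
        conf.set("yarn.resourcemanager.store.class",                     // step 7 (ZKStore)
                "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore");
        conf.set("yarn.resourcemanager.zk-address",                      // step 8
                "server2:2181,server3:2181,server4:2181");
        System.out.println(conf.get("yarn.resourcemanager.store.class"));
    }
}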
9 start the yarn service and view
10 another node needs to start the service manually
11 View cluster status
12 testing
Disconnect the master node and observe the result
The active role then switches to server5.
View server5 status
Start server1
View server1 status
3 how to shut down the service