Filtering algorithm:
Focus on the weight formula:
W = TF * log(N / DF)
TF: the number of times the current keyword appears in this record
N: the total number of records
DF: the number of records in which the current keyword appears
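As a quick worked example with made-up numbers: if there are N = 1,000,000 records, the keyword appears in DF = 1,000 of them, and it occurs TF = 3 times in the current record, then W = 3 * log(1,000,000 / 1,000) = 3 * log(1,000) = 9 with a base-10 logarithm (about 20.7 with the natural logarithm); a keyword that appears in almost every record gets a weight close to zero.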
NameNode HA and NameNode Federation in HDFS
(1) Solving the single point of failure:
Use HDFS HA: the problem is solved with an active/standby (master/slave) NameNode pair; if the active NameNode fails, the cluster switches to the standby.
(2) Addressing the memory limit:
Use HDFS Federation: scale horizontally with multiple NameNodes that are independent of each other but share all DataNodes.
The details are described below:
NameNode HA: every modification a NameNode makes to its metadata goes through the JournalNodes and is backed up on the QJM cluster, so the metadata on QJM is the same as the metadata on the NameNode (the NameNode's metadata is mirrored on QJM). After the active NameNode dies, the standby NameNode finds the metadata on the QJM cluster and continues to work. If NameNode Federation is used, the shared data of each NameNode is likewise kept on the JournalNode cluster; each NameNode effectively keeps its image on the JournalNode cluster, and the NameNode's metadata reads and writes are recorded and looked up on the JN cluster.
When a client first requests HDFS, it goes to ZooKeeper to find out which NameNode is dead and which is alive, and uses that to decide which NameNode to access. Every NameNode has a corresponding FailoverController (ZKFC) that competes for a lock; after a NameNode dies, the competition for the lock decides which NameNode becomes active. A voting mechanism is used here, which is why the ZooKeeper ensemble has an odd number of nodes.
NameNode Federation: a cluster has several independent NameNodes, which is equivalent to several independent clusters that share the DataNodes. When clients access these NameNodes, they first choose which NameNode to use and then access it.
Adding HA to Federation means adding HA to each NameNode, with each HA pair independent of the others.
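As a rough sketch of how the active/standby roles can be checked from the command line (the nn1/nn2 names are the ones used in the hdfs-site.xml example later in these notes), the haadmin sub-command reports which NameNode currently holds the active role:
# ask each NameNode for its current HA state (active or standby)
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
# a manual failover can also be requested (when automatic failover is not in charge)
# hdfs haadmin -failover nn1 nn2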
YARN:
YARN is a resource management system: it manages the resources on top of the HDFS data and knows where the data is, so computing frameworks apply to YARN for the resources they need; as a result, resources are not wasted, jobs can run concurrently, and YARN is compatible with other third-party parallel computing frameworks.
On the resource-management side:
ResourceManager: responsible for resource management and scheduling for the entire cluster.
ApplicationMaster: responsible for application-level matters such as task scheduling, task monitoring, and fault tolerance. When a job runs, every node has a NodeManager, and the job's ApplicationMaster runs inside one of them.
NodeManager should preferably run on the DataNode machines, because that keeps the computation close to the data.
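As a small illustration (my addition, not part of the original notes), once YARN is running the ResourceManager's view of the registered NodeManagers and of running applications can be listed like this:
# list the NodeManagers registered with the ResourceManager
yarn node -list
# list the applications currently known to the ResourceManager
yarn application -list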
Configuring and starting a Hadoop cluster with NameNode HA
Configure hdfs-site.xml (with notes):
This file holds all HDFS-specific configuration, such as which node performs which role.
dfs.name.dir = /root/data/namenode
dfs.data.dir = /root/data/datanode
dfs.tmp.dir = /root/data/tmp
dfs.replication = 1
// nameservices is the name of the cluster and the unique mark ZooKeeper uses to identify it; mycluster is the name and can be changed to something else
dfs.nameservices = mycluster
// the NameNodes under this nameservice and their names
dfs.ha.namenodes.mycluster = nn1,nn2
// the RPC address of each NameNode above, used to transfer data; this is the port clients use to upload and download
dfs.namenode.rpc-address.mycluster.nn1 = hadoop11:4001
dfs.namenode.rpc-address.mycluster.nn2 = hadoop22:4001
dfs.namenode.servicerpc-address.mycluster.nn1 = hadoop11:4011
dfs.namenode.servicerpc-address.mycluster.nn2 = hadoop22:4011
// the HTTP addresses, so that HDFS can be checked over the network, e.g. from a browser
dfs.namenode.http-address.mycluster.nn1 = hadoop11:50070
dfs.namenode.http-address.mycluster.nn2 = hadoop22:50070
// the hosts running JournalNodes (an odd number), i.e. which machines in the cluster have a JournalNode; this is the address the NameNode requests when it reads and writes metadata. The JournalNodes record the edits in real time: when the outside world accesses the NameNode, the NameNode responds to the request and at the same time asks the JournalNodes to record the read/write as a backup
dfs.namenode.shared.edits.dir = qjournal://hadoop11:8485;hadoop22:8485;hadoop33:8485/mycluster
// the working directory where the JournalNode keeps its files on the machine
dfs.journalnode.edits.dir = /root/data/journaldata/
// the class called by external connections to find the active NameNode
dfs.client.failover.proxy.provider.mycluster = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
// automatically switch the NameNode
dfs.ha.automatic-failover.enabled = true
// fencing method: one machine logs in to the other over ssh
dfs.ha.fencing.methods = sshfence
// location of the private key file
dfs.ha.fencing.ssh.private-key-files = /root/.ssh/id_dsa
Configure core-site.xml
// this is the unified entrance to HDFS; mycluster is the unified service ID of the cluster we configured, and external access goes to the cluster through it
fs.defaultFS = hdfs://mycluster
// HDFS is managed by ZooKeeper; this is the address and port list of the ZooKeeper cluster. Note that the number of nodes must be odd and not less than three
ha.zookeeper.quorum = hadoop11:2181,hadoop22:2181,hadoop33:2181
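As a quick sanity check (an addition to the original notes), hdfs getconf can confirm that the HA settings above are actually picked up from the configuration files:
# the unified entrance configured in core-site.xml
hdfs getconf -confKey fs.defaultFS
# the RPC addresses of the NameNodes under the nameservice
hdfs getconf -nnRpcAddresses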
If a NameNode that is not HA is converted to HA, execute hdfs namenode -initializeSharedEdits on the host of the NameNode being converted; this copies that NameNode's local edits metadata onto the JournalNodes.
The .bashrc file under /root is a configuration file for environment variables and applies only to the root user.
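A minimal sketch of what such entries might look like in /root/.bashrc; the install path /usr/local/hadoop is only an assumption for illustration:
# hypothetical install location
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# reload the file in the current shell
source /root/.bashrc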
Configuration of zookeeper:
First, configure the dataDir path where ZooKeeper stores its files, so that ZooKeeper's information is not lost after it is shut down.
server.1=hadoop11:2888:3888
server.2=hadoop22:2888:3888
server.3=hadoop33:2888:3888
The "1" in server.1 is the number of that zookeeper node in the cluster.
The zookeeper configuration file also contains dataDir=/root/data/zookeeper, and that directory holds a myid file.
[root@hadoop11 data]# cd zookeeper/
[root@hadoop11 zookeeper]# ls
myid  version-2
This myid file indicates the number of the current zookeeper in the cluster.
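Assuming the dataDir above, creating the myid file on each machine might look like this (the number must match the server.N entry for that host in zoo.cfg):
# on hadoop11, which is server.1
echo 1 > /root/data/zookeeper/myid
# on hadoop22 (server.2): echo 2 > /root/data/zookeeper/myid
# on hadoop33 (server.3): echo 3 > /root/data/zookeeper/myid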
Brief description of the configuration process:
Now start zookeeper (zk) on every machine; once it is up, leave it alone.
Then run hadoop-daemon.sh start journalnode to start the JournalNodes. On one of the NameNode machines, run hdfs namenode -format to create the NameNode's metadata and start that NameNode; then, on the other NameNode, which acts as the backup, run hdfs namenode -bootstrapStandby so that the metadata files of the two NameNodes are the same.
If you are setting up a fresh HDFS cluster, you should first run the format command (hdfs namenode -format) on one of the NameNodes.
If you have already formatted the NameNode, or are converting a non-HA-enabled cluster to be HA-enabled, you should now copy over the contents of your NameNode metadata directories to the other, unformatted NameNode by running the command "hdfs namenode -bootstrapStandby" on the unformatted NameNode. Running this command will also ensure that the JournalNodes (as configured by dfs.namenode.shared.edits.dir) contain sufficient edits transactions to be able to start both NameNodes.
If you are converting a non-HA NameNode to be HA, you should run the command "hdfs namenode -initializeSharedEdits", which will initialize the JournalNodes with the edits data from the local NameNode edits directories.
There is a ZKFC on each NameNode; it is the failover mechanism that interacts with ZooKeeper.
The following command is executed on one NameNode to associate ZKFC with ZooKeeper
so that ZKFC can start normally.
Initializing HA state in ZooKeeper
After the configuration keys have been added, the next step is to initialize required state in ZooKeeper. You can do so by running the following command from one of the NameNode hosts.
$ hdfs zkfc -formatZK
This will create a znode in ZooKeeper inside of which the automatic failover system stores its data.
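Putting the steps above together, a first-time HA startup might look like the following sketch; the host names and the order follow the configuration in these notes, so treat it as an outline rather than an exact script:
# 1. on hadoop11, hadoop22 and hadoop33: start zookeeper
zkServer.sh start
# 2. on the journalnode machines: start the journalnodes
hadoop-daemon.sh start journalnode
# 3. on the first namenode (hadoop11): format and start it
hdfs namenode -format
hadoop-daemon.sh start namenode
# 4. on the second namenode (hadoop22): copy the metadata and start it
hdfs namenode -bootstrapStandby
hadoop-daemon.sh start namenode
# 5. on one namenode: create the znode used by automatic failover
hdfs zkfc -formatZK
# 6. start the rest of HDFS (datanodes and the zkfc daemons)
start-dfs.sh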
Some features of hdfs:
The hadoop-daemon.sh script in the sbin directory (hadoop-daemon.sh start [daemon]) can be used to start a single daemon on the local machine.
A daemon on a node can likewise be stopped by killing its process with kill -9.
start-dfs.sh starts HDFS for the whole cluster.
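For example, a sketch using the datanode daemon and the standard jps tool (the pid shown is hypothetical):
# start a single daemon on this machine
hadoop-daemon.sh start datanode
# find its process id
jps
# kill it directly if needed
kill -9 12345
# or stop it cleanly
hadoop-daemon.sh stop datanode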
After adding Hadoop's bin and sbin directories to the PATH environment variable, you can use hdfs to do a lot of things, as follows:
[root@hadoop11 ~]# hdfs
Usage: hdfs [--config confdir] COMMAND
       where COMMAND is one of:
  dfs                  run a filesystem command on the file systems supported in Hadoop.
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  journalnode          run the DFS journalnode
  zkfc                 run the ZK Failover Controller daemon
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  haadmin              run a DFS HA admin client
  fsck                 run a DFS filesystem checking utility
  balancer             run a cluster balancing utility
  jmxget               get JMX exported values from NameNode or DataNode.
  oiv                  apply the offline fsimage viewer to an fsimage
  oev                  apply the offline edits viewer to an edits file
  fetchdt              fetch a delegation token from the NameNode
  getconf              get config values from configuration
  groups               get the groups which users belong to
  snapshotDiff         diff two snapshots of a directory or diff the
                       current directory contents with a snapshot
  lsSnapshottableDir   list all snapshottable dirs owned by the current user
                       Use -help to see options
  portmap              run a portmap service
  nfs3                 run an NFS version 3 gateway
  cacheadmin           configure the HDFS cache
Configuration of YARN
In mapred-site.xml
// this indicates which framework mapreduce uses
mapreduce.framework.name = yarn
In yarn-site.xml
// the following is configured the same on every node, because it indicates which machine in the cluster is used as the resourcemanager
// this is the resourcemanager address, used by external clients to connect to the resource manager
yarn.resourcemanager.address = hadoop1:9080
// the address through which the application master communicates with the resource manager's scheduler
yarn.resourcemanager.scheduler.address = hadoop1:9081
// the port through which the node managers communicate with the resource manager; if this is also configured on hadoop2, the nodemanager on hadoop2 can find hadoop1's resourcemanager
yarn.resourcemanager.resource-tracker.address = hadoop1:9082
// the list of auxiliary services run by the node manager
yarn.nodemanager.aux-services = mapreduce_shuffle
Each machine can start its own nodemanager with yarn-daemon.sh start nodemanager; a nodemanager started this way finds its resourcemanager from the configuration in the yarn-site.xml file. In the cluster, however, the nodemanager runs on the datanode machines to manage them, so if the slaves file specifies which machines carry datanodes, running start-yarn.sh on the host starts the resourcemanager on that host and launches a nodemanager on every node listed in the slaves file.
Every node has YARN on it, and the nodes form a cluster in an orderly manner according to their own yarn configuration, centered on the resourcemanager.
Start YARN on the host given as the resourcemanager address so that the resourcemanager is started there.
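A sketch of the two ways of starting YARN described above (the host name follows the yarn-site.xml example):
# on the resourcemanager host (hadoop1): start the resourcemanager and
# a nodemanager on every machine listed in the slaves file
start-yarn.sh
# or, on an individual machine, start only its own nodemanager
yarn-daemon.sh start nodemanager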
To run mapreduce on hadoop:
Package the mapreduce program into a jar and put it on the hadoop cluster.
Use the command: hadoop jar [jar file, e.g. web.jar] [fully qualified class name of the main function] [input file path] [output file path]
For example: hadoop jar web.jar org.shizhen.wordcount /test /output
Then you can check the result under the output path.
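For instance, the output directory can be inspected with the hdfs dfs commands; the part-r-00000 file name assumes the default single reducer:
# list the job output
hdfs dfs -ls /output
# print the word counts
hdfs dfs -cat /output/part-r-00000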
The hadoop and yarn commands of the Hadoop cluster itself correspond to many sub-commands;
use these sub-commands to operate on individual processes and nodes.
[root@hadoop11 ~]# hadoop
Usage: hadoop [--config confdir] COMMAND
       where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
Using these yarn sub-commands, you can operate on the mapreduce-related daemons and monitor the progress of programs, for example with application.
[root@hadoop11 ~]# yarn
Usage: yarn [--config confdir] COMMAND
       where COMMAND is one of:
  resourcemanager      run the ResourceManager
  nodemanager          run a nodemanager on each slave
  historyserver        run the application history server
  rmadmin              admin tools
  version              print the version
  jar <jar>            run a jar file
  application          prints application(s) report/kill application
  applicationattempt   prints applicationattempt(s) report
  container            prints container(s) report
  node                 prints node report(s)
  logs                 dump container logs
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
After the configuration is complete:
Start ZooKeeper first with zkServer.sh start, then start Hadoop with start-all.sh.
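A short sketch of that daily start-up, with a quick verification using jps; the daemon names listed are what one would typically expect, and the exact set depends on which roles run on each machine:
# on each zookeeper machine
zkServer.sh start
# on the main host
start-all.sh
# verify the daemons on each machine
jps
# typical processes: NameNode, DataNode, JournalNode, DFSZKFailoverController,
# ResourceManager, NodeManager, QuorumPeerMain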