This article explains common Hadoop commands and the attributes that control safe mode. The content is simple and clear and easy to learn; work through it section by section.
The namenode (HDFS) and the jobtracker (MapReduce) can share one machine, and a datanode and a tasktracker can share one machine, but the secondary namenode needs a machine of its own. The jobtracker's local directories usually share partitions with the datanode's storage directories (the directories are best spread across different disks, one directory per disk). Namenode storage directories must be formatted before use; datanode storage directories need no formatting and are created automatically at startup.
Blocks are never duplicated across the disks of a single datanode; blocks are only replicated between different datanodes.
Descriptions of the main configuration files:
1. dfs.hosts: lists the machines allowed to join the cluster as datanodes.
2. mapred.hosts: lists the machines allowed to join the cluster as tasktrackers.
3. dfs.hosts.exclude and mapred.hosts.exclude: list the machines to be removed from the cluster.
4. masters: lists the machines that run the secondary namenode.
5. slaves: lists the machines that run a datanode and a tasktracker (see the sketch after this list).
6. hadoop-env.sh: environment variables used by the scripts that run Hadoop.
7. core-site.xml: configuration for the Hadoop core, such as I/O settings common to HDFS and MapReduce.
8. hdfs-site.xml: configuration for the HDFS daemons: the namenode, the secondary namenode, and the datanodes.
9. mapred-site.xml: configuration for the MapReduce daemons: the jobtracker and the tasktrackers.
10. hadoop-metrics.properties: controls how metrics are published in Hadoop.
11. log4j.properties: properties of the system log files, the namenode audit log, and the task logs of the tasktracker child processes.
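For illustration, the host-list files are plain text with one hostname per line. A minimal sketch, assuming a $HADOOP_HOME/conf directory and made-up hostnames:
# masters names the secondary namenode host; slaves names the worker hosts
cat > $HADOOP_HOME/conf/masters <<'EOF'
secondarynn-host
EOF
cat > $HADOOP_HOME/conf/slaves <<'EOF'
worker1
worker2
worker3
EOF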
I. Key attributes of the HDFS daemons
1. fs.default.name type: URI default: file:/// description: the default filesystem. The URI defines the hostname and the port of the namenode's RPC server; the default port is 8020. Configured in core-site.xml.
2. dfs.name.dir type: comma-separated directory names default: ${hadoop.tmp.dir}/dfs/name description: the list of directories where the namenode stores its persistent metadata; the namenode stores an identical copy of the metadata files in every directory in the list.
3. dfs.data.dir type: comma-separated directory names default: ${hadoop.tmp.dir}/dfs/data description: the list of directories where a datanode stores its blocks; each block is stored in only one of these directories.
4. fs.checkpoint.dir type: comma-separated directory names default: ${hadoop.tmp.dir}/dfs/namesecondary description: the list of directories where the secondary namenode stores checkpoints; a copy of the checkpoint files is stored in each listed directory.
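As a minimal sketch of how these attributes fit into the configuration files; the hostname namenode-host and the /disk1, /disk2 paths are assumptions:
cat > $HADOOP_HOME/conf/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- default filesystem: the namenode RPC server on port 8020 -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:8020</value>
  </property>
</configuration>
EOF
cat > $HADOOP_HOME/conf/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- namenode metadata is mirrored to every listed directory -->
  <property>
    <name>dfs.name.dir</name>
    <value>/disk1/hdfs/name,/disk2/hdfs/name</value>
  </property>
  <!-- each block lands in only one of these directories -->
  <property>
    <name>dfs.data.dir</name>
    <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/disk1/hdfs/namesecondary</value>
  </property>
</configuration>
EOF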
II. Key attributes of the MapReduce daemons
1. mapred.job.tracker type: hostname and port default: local description: the hostname and port of the jobtracker's RPC server. If left at the default value local, the jobtracker runs in-process on demand when a MapReduce job is run (in other words, the user does not need to start the jobtracker; actually trying to start a jobtracker in this mode raises an error).
2. mapred.local.dir type: comma-separated directory names default: ${hadoop.tmp.dir}/mapred/local description: the list of directories where a job's intermediate data is stored; the data is cleared when the job ends.
3. mapred.system.dir type: URI default: ${hadoop.tmp.dir}/mapred/system description: the directory, relative to fs.default.name, where shared files are stored while a job runs.
4. mapred.tasktracker.map.tasks.maximum type: int default: 2 description: the maximum number of map tasks that may run on a tasktracker at any one time.
5. mapred.tasktracker.reduce.tasks.maximum type: int default: 2 description: the maximum number of reduce tasks that may run on a tasktracker at any one time.
6. mapred.child.java.opts type: string default: -Xmx200m description: the JVM options used to launch the tasktracker child processes that run map and reduce tasks. The property can be set per job; for example, JVM properties can be set to enable debugging.
7. mapred.child.ulimit limits the maximum virtual memory (in kilobytes) of child processes started by the tasktracker; it must be greater than the value of item 6.
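A corresponding sketch for mapred-site.xml; the hostname, directory paths, and slot counts are assumptions, not recommendations:
cat > $HADOOP_HOME/conf/mapred-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- jobtracker RPC server; port 8021 is the commonly used choice -->
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker-host:8021</value>
  </property>
  <!-- intermediate data, cleared when each job ends -->
  <property>
    <name>mapred.local.dir</name>
    <value>/disk1/mapred/local,/disk2/mapred/local</value>
  </property>
  <!-- two map slots and two reduce slots per tasktracker -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <!-- heap for each task child JVM -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx200m</value>
  </property>
</configuration>
EOF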
III. Attributes of the RPC servers
1. dfs.datanode.ipc.address default: 0.0.0.0:50020 description: the address and port of the datanode's RPC server.
2. mapred.job.tracker default: local description: when set to a hostname and port, this attribute specifies the address and port of the jobtracker's RPC server; port 8021 is commonly used.
3. mapred.task.tracker.report.address default: 127.0.0.1:0 description: the address and port of the tasktracker's RPC server, which the tasktracker's child JVMs use to communicate with their tasktracker. Any free port can be used here because the server is bound to the loopback address; the default only needs changing on a machine that has no loopback address.
A datanode also runs a TCP/IP server for block transfers; it is set by dfs.datanode.address and defaults to 0.0.0.0:50010.
IV. HTTP server attributes
1. mapred.job.tracker.http.address default: 0.0.0.0:50030 description: the address and port of the jobtracker's HTTP server.
2. mapred.task.tracker.http.address default: 0.0.0.0:50060 description: the address and port of the tasktracker's HTTP server.
3. dfs.http.address default: 0.0.0.0:50070 description: the address and port of the namenode's HTTP server.
4. dfs.datanode.http.address default: 0.0.0.0:50075 description: the address and port of the datanode's HTTP server.
5. dfs.secondary.http.address default: 0.0.0.0:50090 description: the address and port of the secondary namenode's HTTP server.
A particular network interface can be chosen to supply the IP address that each datanode and tasktracker reports (for its HTTP and RPC servers). The relevant properties are dfs.datanode.dns.interface and mapred.tasktracker.dns.interface, whose default value is default.
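A quick way to confirm which addresses and ports the daemons actually bound, assuming a Linux host and the default ports listed above:
# show listening TCP sockets for the Hadoop RPC/HTTP ports
# (adjust the port list to your configuration)
netstat -tln | grep -E ':(8020|8021|50010|50020|50030|50060|50070|50075|50090)'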
V. Attributes of safe mode
1. dfs.replication.min type: int default: 1 description: sets the minimum replication level, i.e. the minimum number of replicas that must be written for a write operation to succeed.
2. dfs.safemode.threshold.pct type: float default: 0.999 description: the proportion of blocks in the system that must meet the minimum replication level (defined by the previous item) before the namenode exits safe mode. Setting it to 0 or less keeps the namenode from entering safe mode at all; setting it above 1 means safe mode is never exited.
3. dfs.safemode.extension type: int default: 30000 description: the time, in milliseconds, that the namenode stays in safe mode after the minimum replication condition (defined by the previous item) has been met. For small clusters (around ten nodes) it can be set to 0.
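As a worked example of the defaults: in a filesystem with 100,000 blocks, the namenode waits until 99,900 of them (99.9%) meet the minimum replication level, then remains in safe mode for a further 30 seconds before leaving it.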
Selected core-site.xml settings:
1. io.file.buffer.size sets the I/O buffer size. The default is 4 KB; 64 KB or 128 KB are common choices.
2. fs.trash.interval sets how long, in minutes, files stay in the trash before being deleted. The default is 0, which disables the trash feature. The trash is a user-level feature: when it is enabled, each user has an independent trash directory, the .Trash directory under the user's home directory. To restore a deleted file, find it under this directory and move it out. HDFS deletes files in the trash automatically; other filesystems do not, and for them the trash must be emptied with the command hadoop fs -expunge.
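A sketch of day-to-day trash use, assuming fs.trash.interval has been set to a nonzero value; the user tom and the file name are assumptions:
hadoop fs -rm /user/tom/data.txt
# the file is moved under the user's trash rather than deleted outright
hadoop fs -mv /user/tom/.Trash/Current/user/tom/data.txt /user/tom/data.txt
# force-empty the trash (needed on filesystems without automatic expiry)
hadoop fs -expunge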
Selected hdfs-site.xml settings:
1. dfs.block.size sets the HDFS block size. The default is 64 MB; 128 MB or 256 MB are common choices.
2. dfs.balance.bandwidthPerSec sets the bandwidth the balancer may use to copy data between nodes.
3. dfs.datanode.du.reserved sets the amount of space, in bytes, reserved for use by other programs.
4. fs.checkpoint.period sets how often, in seconds, the secondary namenode creates a checkpoint.
5. fs.checkpoint.size sets the size the edit log (edits) must reach for a checkpoint to be created regardless of the period; the system checks the edit log size every 5 minutes.
6. dfs.datanode.numblocks sets how many blocks a datanode directory may hold before a new subdirectory is created.
7. dfs.datanode.scan.period.hours sets the period of the datanode block scanner; by default every block is scanned once every three weeks.
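A sketch of these tuning properties as they would appear inside the <configuration> element of hdfs-site.xml; the values are illustrative assumptions, not recommendations:
cat <<'EOF'
<!-- 128 MB blocks -->
<property><name>dfs.block.size</name><value>134217728</value></property>
<!-- let the balancer use up to 10 MB/s between nodes -->
<property><name>dfs.balance.bandwidthPerSec</name><value>10485760</value></property>
<!-- reserve 10 GB per volume for non-HDFS use -->
<property><name>dfs.datanode.du.reserved</name><value>10737418240</value></property>
<!-- checkpoint hourly, or once edits reaches 64 MB -->
<property><name>fs.checkpoint.period</name><value>3600</value></property>
<property><name>fs.checkpoint.size</name><value>67108864</value></property>
EOF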
Hadoop commands:
1. hadoop fs -mkdir /user/username creates a user's home directory
2. hadoop fs -chown user:user /user/username sets its owner and group
3. hadoop dfsadmin -setSpaceQuota 1t /user/username limits the space the directory may consume
4. hadoop dfsadmin -saveNamespace creates a checkpoint: the in-memory filesystem metadata (the filename-to-block mapping) is saved as a new fsimage file and the edits file is reset. This operation can only be performed in safe mode.
5. hadoop dfsadmin -safemode get checks whether the namenode is in safe mode
6. hadoop dfsadmin -safemode wait waits for safe mode to be exited before executing a command in a script (see the sketch after this list)
7. hadoop dfsadmin -safemode enter enters safe mode
8. hadoop dfsadmin -safemode leave leaves safe mode
9. hadoop dfsadmin -report shows filesystem statistics and information about each connected datanode
10. hadoop dfsadmin -metasave saves information to a file in the Hadoop log directory, including blocks that are being replicated or deleted and the list of connected datanodes
11. hadoop dfsadmin -refreshNodes refreshes the list of datanodes that are allowed to connect to the namenode
12. hadoop dfsadmin -upgradeProgress gets information about the progress of an HDFS upgrade, or forces the upgrade to proceed
13. hadoop dfsadmin -finalizeUpgrade removes the legacy data from the datanode and namenode storage directories after an upgrade
14. hadoop dfsadmin -setQuota sets a quota on the number of files and subdirectories a directory may contain
15. hadoop dfsadmin -clrQuota clears the file-and-subdirectory-count quota on the specified directories
16. hadoop dfsadmin -clrSpaceQuota clears the space quota on the specified directories
17. hadoop dfsadmin -refreshServiceAcl refreshes the namenode's service-level authorization policy file
18. hadoop fsck / checks the health of the files in HDFS; it looks for blocks that are missing from all datanodes, as well as under- or over-replicated blocks
19. hadoop fsck /user/tom/part-007 -files -blocks -racks the -files option shows the file name, size, block count, and health; -blocks shows one line of information for each block of the file; -racks shows the rack location of each block and the addresses of its datanodes
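A sketch tying several of the commands above together; the user name tom and the 1t quota are assumptions:
# provision a home directory with ownership and a space quota
hadoop fs -mkdir /user/tom
hadoop fs -chown tom:tom /user/tom
hadoop dfsadmin -setSpaceQuota 1t /user/tom
# in a maintenance script: wait out any startup safe mode, then take a
# checkpoint (-saveNamespace only works inside safe mode)
hadoop dfsadmin -safemode wait
hadoop dfsadmin -safemode enter
hadoop dfsadmin -saveNamespace
hadoop dfsadmin -safemode leave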
Decommissioning a datanode named datanodename: bin/hadoop dfsadmin -decommission datanodename
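A sketch of the route via the exclude file, building on the dfs.hosts.exclude file and the -refreshNodes command described above; the file path and hostname are assumptions:
echo worker3 >> $HADOOP_HOME/conf/excludes   # the file named by dfs.hosts.exclude
hadoop dfsadmin -refreshNodes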
To keep restarts of the primary namenode from being slow, the secondary namenode maintains checkpoints (it can also be started with the -importCheckpoint option and used as the new primary namenode). The checkpoint process works as follows:
1. The secondary namenode asks the primary namenode to stop using its edits file (the file that records the operation log); new operations are temporarily written to a new file.
2. The secondary namenode fetches the fsimage (the persistent checkpoint of the metadata) and the edits file from the primary namenode (over HTTP GET).
3. The secondary namenode loads the fsimage into memory, replays the operations in the edits file one by one, and creates a new fsimage file.
4. The secondary namenode sends the new fsimage file back to the primary namenode (over HTTP POST).
5. The primary namenode replaces its old fsimage file with the one received from the secondary namenode and replaces the old edits file with the new file started in step 1. It also updates the fstime file to record the time the checkpoint was taken.
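A minimal sketch of the recovery path mentioned above, run on the machine that holds the checkpoint:
# start a namenode that loads the latest checkpoint from fs.checkpoint.dir
# (only used when dfs.name.dir holds no valid metadata)
hadoop namenode -importCheckpoint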
Balancer program:
start-balancer.sh -threshold specifies the threshold, as a percentage, within which each datanode's utilization must fall relative to the cluster average; the default is 10%. Only one balancer may run in the cluster at any time. The bandwidth the balancer uses to copy data between nodes is limited; the default is 1 MB/s (set by dfs.balance.bandwidthPerSec).
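A usage sketch; the 5% threshold is an example value, not a recommendation:
start-balancer.sh -threshold 5
# a running balancer can be stopped with
stop-balancer.sh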
That covers the common Hadoop commands and the safe mode attributes. Specific usage is best verified in practice.