Excerpts from the authoritative guide to Hadoop-1

2025-01-30 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

1. Why not use RAID

The inter-node block replication provided by HDFS already meets the need for data redundancy, so a RAID redundancy mechanism is unnecessary on datanode disks.

RAID 0 is also slower than JBOD (Just a Bunch Of Disks). JBOD distributes HDFS blocks across all disks in round-robin fashion, so each disk operates independently; the read and write throughput of RAID 0 is limited by the slowest disk in the array, whereas JBOD's average speed is higher than that of its slowest disk.

2. Can multiple services share one server?

For a small cluster (a few dozen nodes), it is usually fine to run the namenode and the jobtracker together on a single master machine (as long as at least one copy of the namenode's metadata is stored on a remote filesystem). But as the cluster and the number of files in HDFS grow, the namenode needs more memory, and it becomes best to put the namenode and the jobtracker on separate machines.

The secondary namenode can run on the same machine as the namenode, but again for memory reasons (the secondary namenode has the same memory requirements as the primary namenode) it is best run on a separate server, especially in large clusters.

3. Hadoop configuration files

Each node in a Hadoop cluster keeps its own set of configuration files rather than reading them from a single global location, and the administrator is responsible for keeping them synchronized. Hadoop's control scripts can use rsync as a basic synchronization mechanism; parallel shell tools such as dsh or pdsh can also do the job.
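As a minimal sketch of the rsync approach (the hostnames and configuration path below are hypothetical examples, not part of the original text), an administrator might push one configuration directory to every worker like this:

```shell
# Sketch: distribute a Hadoop configuration directory with rsync.
# Hostnames and the conf path are made-up examples.
sync_conf() {
  conf_dir=$1; shift
  for host in "$@"; do
    # Printed rather than executed so the loop can be dry-run first;
    # drop the leading 'echo' to perform the actual sync.
    echo rsync -az --delete "$conf_dir/" "$host:$conf_dir/"
  done
}

sync_conf /etc/hadoop/conf worker01 worker02 worker03
```

On a large cluster, a parallel shell such as pdsh can replace the sequential loop.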

Hadoop also supports using a single set of configuration files for all master and worker machines. The biggest advantage of this approach is simplicity. However, this one-size-fits-all model does not suit every cluster. For example, when you expand a cluster by adding new machines whose hardware specification differs from the existing machines, you need a second set of configuration files to take full advantage of the extra resources of the new hardware.

In such cases the concept of a "machine class" is introduced, with a separate set of configuration files maintained for each class. Hadoop provides no tool for this; external configuration-management tools are needed.

4. Benefits of installing MapReduce and HDFS independently

The prerequisite for separating the two installations is that compatibility constraints between them are relaxed. This makes upgrades easier: for example, you can upgrade MapReduce (perhaps to apply a patch) while HDFS keeps running.

Note that even when HDFS and MapReduce are installed separately, they can still share configuration information by pointing the --config option (used when starting the daemons) at the same configuration directory. And since the log files they produce have different names, they can write logs to the same directory without conflict.
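A hedged sketch of the idea (the installation paths and configuration directory are invented for illustration; the script names follow the classic start-dfs.sh/start-mapred.sh convention):

```shell
# Sketch: separate HDFS and MapReduce installs sharing one config
# directory via --config. Paths are hypothetical; 'echo' is kept so
# the commands are only displayed, not executed.
CONF_DIR=/etc/hadoop/conf

echo /opt/hadoop-hdfs/bin/start-dfs.sh --config "$CONF_DIR"
echo /opt/hadoop-mapred/bin/start-mapred.sh --config "$CONF_DIR"
```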

5. The masters file

To run the built-in Hadoop scripts that start and stop cluster services and daemons, you need to tell them in advance which machines make up the cluster. Two files serve this purpose: masters and slaves. Each lists machine names or IP addresses, one per line. The name of the masters file is somewhat misleading: it actually records the machines intended to run the secondary namenode.
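For illustration (the hostnames are invented), the two files might look like this; the snippet writes them out and counts the entries:

```shell
# Sketch: masters lists the secondary-namenode host(s); slaves lists
# the worker hosts. One hostname per line; all names are hypothetical.
printf '%s\n' snn01 > masters
printf '%s\n' worker01 worker02 worker03 > slaves

wc -l masters slaves
```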

The namenode holds the metadata for the entire namespace and all block metadata in memory, which requires a lot of RAM. The secondary namenode is idle most of the time, but while creating a checkpoint its memory requirement is similar to the namenode's. Once the filesystem contains a large number of files, a single machine's physical memory can no longer accommodate both the primary and the secondary namenode.

The secondary namenode keeps an up-to-date checkpoint of the filesystem metadata. Backing these checkpoint files up to other nodes helps restore the namenode's metadata in the event of data loss (or a system crash).

On a heavily loaded cluster running many MapReduce jobs, the jobtracker consumes a lot of memory and CPU resources, so it is best run on a dedicated node.

© 2024 shulou.com SLNews company. All rights reserved.
