This article describes how Facebook runs its Hadoop clusters and the AvatarNode. The coverage is fairly detailed, and readers interested in the topic should find it a useful reference.
Facebook stores data in a data warehouse built on Hadoop and Hive. The warehouse has 4,800 cores and 5.5 PB of storage capacity, and each node can hold up to 12 TB of data. It also uses a two-layer network topology. Facebook's MapReduce cluster is dynamic: work can be relocated dynamically based on load and on the configuration information shared among cluster nodes.
In Facebook's data warehouse architecture, web servers and internal services generate log data. Facebook uses an open-source log collection system that can store hundreds of log data sets on NFS servers, but most of the log data is copied to HDFS instances in the same data center, and the data stored in HDFS feeds the warehouse built on Hive. Hive provides a SQL-like language that works with MapReduce to create and publish a variety of summaries and reports, and to run historical analyses on top of them. A browser-based interface on top of Hive lets users execute Hive queries. Oracle and MySQL databases are used to publish the summaries; they are relatively small, but they are queried frequently and must respond in real time. Older data is archived in time and moved to cheaper storage.
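To make the "SQL-like summaries" concrete, here is a minimal sketch of the kind of roll-up query such a pipeline might run. The table, columns, and partition value are hypothetical, not Facebook's actual schema; the query is submitted through the standard hive command line with -e.

```python
import subprocess

# Hypothetical HiveQL summary: aggregate one day's page views per country.
# The small result set is the sort of roll-up that would later be published
# into a MySQL or Oracle serving database.
query = """
    SELECT country, COUNT(*) AS views
    FROM page_views
    WHERE dt = '2011-05-01'
    GROUP BY country
"""

# The hive CLI accepts an inline query string via -e.
subprocess.run(["hive", "-e", query], check=True)
```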
The following covers some of Facebook's work on the AvatarNode and on scheduling policies. The AvatarNode is used mainly for HDFS recovery and startup. If HDFS crashes, it takes about 10 minutes to read and replay a 12 GB filesystem image, 20 to 30 minutes to process block reports from 2,000 DataNodes, and 40 minutes to restore the crashed NameNode and its software. Table 3-1 illustrates the differences between the BackupNode and the AvatarNode. The AvatarNode starts up as a normal NameNode and handles all messages from the DataNodes. The AvatarDataNode, like a DataNode, supports multiple threads and multiple queues for multiple primary nodes, but it does not distinguish between the primary and the backup. Manual failover is performed with the AvatarShell command-line tool; AvatarShell carries out the failover and updates the zNode in ZooKeeper. The failover process is transparent to users. The distributed Avatar file system is implemented on top of the existing file system.
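As an illustration of the ZooKeeper side of a manual failover, here is a minimal Python sketch using the kazoo client. The zNode path, the hostnames, and the use of kazoo are assumptions for illustration only; the real AvatarShell is a command-line tool inside Facebook's Hadoop branch.

```python
from kazoo.client import KazooClient

# Hypothetical zNode path and hostnames.
ZNODE = "/hdfs/primary-avatarnode"
NEW_PRIMARY = b"avatarnode-standby.example.com:9000"

zk = KazooClient(hosts="zk1.example.com:2181")
zk.start()

# Publish the address of the node that should now act as primary.
# Clients that consult this zNode will redirect their requests to it.
zk.ensure_path(ZNODE)
zk.set(ZNODE, NEW_PRIMARY)

zk.stop()
```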
A location-based scheduling strategy has problems in practice: a task that needs a lot of memory may be assigned to a TaskTracker with little memory, CPU resources are sometimes underutilized, and it is difficult to configure TaskTrackers that run on different hardware. Facebook therefore adopted a resource-based scheduling strategy, a fair-share scheduler that monitors the system in real time and collects CPU and memory usage. The scheduler analyzes real-time memory consumption and then shares memory fairly among tasks. It walks each task's process tree by reading the /proc/ directory, collects the CPU and memory usage of the whole process tree, and then sends that information via TaskCounters in the heartbeat.
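A rough Python sketch of the /proc-based collection described above. The real collector runs inside the TaskTracker in Java and reports the figures through TaskCounters in the heartbeat, so this is only a simplified illustration of the idea.

```python
import os

def rss_kb(pid):
    """Resident set size (kB) of one process, read from /proc/[pid]/status."""
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except FileNotFoundError:
        pass  # process exited between listing and reading
    return 0

def children(pid):
    """Direct children, found by scanning the PPid field of every /proc entry."""
    kids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/status") as f:
                for line in f:
                    if line.startswith("PPid:") and int(line.split()[1]) == pid:
                        kids.append(int(entry))
                        break
        except FileNotFoundError:
            continue
    return kids

def tree_rss_kb(pid):
    """Total memory of a task's whole process tree, as the scheduler would sum it."""
    total = rss_kb(pid)
    for child in children(pid):
        total += tree_rss_kb(child)
    return total

if __name__ == "__main__":
    print(tree_rss_kb(os.getpid()), "kB")
```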
Facebook's data warehouse uses Hive, and HDFS there supports three file formats: text files (TextFile), which other applications can read and write easily; sequence files (SequenceFile), which only Hadoop can read and which support block compression; and RCFile, which stores data in sequence-file-style blocks with each block laid out by column, giving better compression and query performance. Facebook plans further improvements to Hive, such as support for indexes, views, and subqueries.
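For illustration, a hedged sketch of how a Hive table might be declared with one of these storage formats. The table name and columns are hypothetical; only the STORED AS clause is the point.

```python
import subprocess

# Illustrative DDL only. STORED AS TEXTFILE or STORED AS SEQUENCEFILE would
# select the other two formats; RCFILE stores each block in columnar form,
# which is why it compresses and scans better.
ddl = """
    CREATE TABLE page_views_rc (
        user_id BIGINT,
        url     STRING,
        country STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS RCFILE
"""

subprocess.run(["hive", "-e", ddl], check=True)
```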
The challenges Facebook currently faces in using Hadoop are:
Quality of service and isolation: larger jobs can affect cluster performance.
Security: what happens if a software bug corrupts the NameNode transaction log?
Data archiving: how to choose which data to archive, and how to archive it.
Performance: how to resolve bottlenecks effectively, and so on.
Solving the stubborn NameNode problem
Google created MapReduce in 2004, and one reason for the MapReduce system's success is that it provides a simple programming model for writing code that requires massive parallel processing. A MapReduce cluster can include thousands of machines working in parallel, and MapReduce lets programmers quickly transform and process data on a cluster of that size. It was inspired by the functional programming features of Lisp and other functional languages. MapReduce and cloud computing are a natural fit. The key feature of MapReduce is its ability to hide the operational parallel semantics, that is, the details of how the parallelism actually runs, from developers.
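As a minimal illustration of that programming model (not Facebook's code), here is the classic word-count example written as a Hadoop Streaming mapper and reducer in Python.

```python
#!/usr/bin/env python3
# Classic word-count mapper/reducer pair for Hadoop Streaming.
import sys

def mapper():
    # Emit (word, 1) for every word on stdin; Hadoop shuffles and sorts by key.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives grouped by key; sum the counts for each word.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

With Hadoop Streaming the same script would be passed as both the mapper ("python3 wordcount.py map") and the reducer ("python3 wordcount.py reduce"); the framework handles the parallel execution and the shuffle between the two phases, which is exactly the parallel machinery the model hides from the developer.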
HDFS (Hadoop Distributed File System) is designed for large-scale distributed data processing under the MapReduce framework. HDFS can store a big data set (terabyte scale) as a single file, something most file systems cannot do. (Editor's note: NTFS5's theoretical maximum file size is 2^64 bytes, i.e. 16 exabytes, minus 1 KB; 1 EB = 1,000,000 TB.) This is also an important reason why HDFS is so popular around the world.
Currently, the HDFS deployments in Facebook's Hadoop clusters hold more than 100 PB of data on physical disk, distributed across clusters in different data centers. Because HDFS stores the data that Hadoop applications process, optimizing HDFS has become a crucial factor in Facebook's ability to provide efficient and reliable service to its users.
How does the HDFS NameNode work?
HDFS clients perform file system metadata operations through a single server node called the NameNode, while DataNodes communicate with other DataNodes and replicate data blocks for redundancy, so that the failure of a single DataNode does not cause data loss in the cluster.
A NameNode failure, however, is intolerable. The NameNode's main responsibilities are to track how files are divided into blocks, which nodes store those blocks, and whether the distributed file system as a whole is healthy. If the NameNode stops running, the DataNodes can no longer be coordinated and clients can no longer read from or write to HDFS; in effect, the whole system stops working.
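One way to see this metadata in practice is the standard hdfs fsck command, which asks the NameNode to list a file's blocks and the DataNodes holding each replica. The path below is hypothetical.

```python
import subprocess

# fsck reports exactly the metadata the NameNode is responsible for:
# how the file is split into blocks and where each replica lives.
subprocess.run(
    ["hdfs", "fsck", "/user/example/weblog.txt",
     "-files", "-blocks", "-locations"],
    check=True,
)
```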
The HDFS NameNode is a single point of failure (SPOF)
Facebook is well aware of the seriousness of the "NameNode as SPOF" problem, so it set out to build a system that eliminates this hidden danger. But before looking at that system, let's review the problems Facebook encountered while using and deploying HDFS.
Usage of the Facebook data warehouse
The largest HDFS cluster is deployed in Facebook's data warehouse and serves traditional Hadoop MapReduce workloads: MapReduce batch jobs running on a small number of very large clusters.
Because the cluster is so large, clients and the many DataNodes send enormous amounts of metadata traffic to the NameNode, which places it under a very heavy load. Pressure on CPU, memory, disk, and network makes a highly loaded NameNode routine in the data warehouse cluster. In practice, Facebook found that HDFS-related failures accounted for 41% of all failures in its data warehouse.
The HDFS NameNode is not only a key part of HDFS but also a key part of the whole data warehouse. Although a highly available NameNode would only prevent about 10% of the data warehouse's unplanned downtime, eliminating the NameNode as a single point of failure is a major win because it lets Facebook carry out scheduled hardware and software maintenance. In fact, Facebook estimates that resolving the NameNode issue would eliminate 50% of the cluster's planned downtime.
So what would a highly available NameNode look like, and how would it work? Let's look at a diagram of the highly available NameNode.
In this structure, clients can communicate with both the Primary NameNode and the Standby NameNode, and the many DataNodes can likewise send block reports to both the Primary NameNode and the Standby NameNode. In essence, the AvatarNode developed by Facebook is a highly available NameNode solution.
AvatarNode: a NameNode failover solution
To address the design flaw of the single NameNode, Facebook began working on the AvatarNode internally about two years ago.
The AvatarNode provides a highly available NameNode with hot failover and failback, and Facebook has contributed it to the open source community. After extensive testing and bug fixing, the AvatarNode now runs stably in Facebook's largest Hadoop data warehouse, thanks in large part to Facebook engineer Dmytro Molkov.
In the event of a failure, the AvatarNode's two highly available NameNode nodes can be failed over manually. The AvatarNode wraps the existing NameNode code and places it behind a ZooKeeper layer.
The basic concepts of AvatarNode are as follows:
1. There is a Primary NameNode and a Standby NameNode.
2. The hostname of the current master is saved in ZooKeeper.
3. Modified DataNodes send block reports to both the Primary NameNode and the Standby NameNode.
4. The modified HDFS client checks ZooKeeper before each transaction begins, and if a failure occurs it retries the transaction against the other node, as sketched below. If an AvatarNode failover happens in the middle of a write, the AvatarNode's mechanism ensures that the full write still completes.
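To illustrate the client-side lookup in item 4, here is a minimal Python sketch, again using the kazoo ZooKeeper client. The zNode path and hostnames are hypothetical; the real logic lives inside the distributed Avatar file system client mentioned earlier.

```python
from kazoo.client import KazooClient

# Hypothetical zNode that holds the current master's hostname (see item 2).
ZNODE = "/hdfs/primary-avatarnode"

zk = KazooClient(hosts="zk1.example.com:2181")
zk.start()

data, _stat = zk.get(ZNODE)
primary = data.decode()
zk.stop()

# The lookup precedes every transaction, so after a failover the client
# simply picks up the new primary's address and retries.
print(f"connect HDFS client to hdfs://{primary}")
```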
AvatarNode client
That covers how Facebook runs its Hadoop and AvatarNode clusters. I hope the content above is helpful and that you were able to learn something from it. If you found the article useful, feel free to share it so that more people can see it.