What are the basic concepts of HDFS in Hadoop? This article analyzes and answers that question in detail, in the hope of helping readers who want to understand it find a simple, practical path.
One: A simple understanding of Hadoop
The core design of Hadoop framework is: HDFS (Hadoop Distributed File System) and MapReduce.
HDFS provides storage for massive data, while MapReduce provides computing for massive data.
Two: HDFS architecture
HDFS has a master/slave (Master/Slave) architecture in which files can be created, read, updated, and deleted through directory paths. Because storage is distributed, an HDFS cluster consists of one NameNode and a number of DataNodes: the NameNode manages the file system's metadata, while the DataNodes store the actual data. Clients access the file system by interacting with both: a client contacts the NameNode to obtain a file's metadata, while the actual file I/O goes directly to the DataNodes.
The NameNode, as the master server, manages the file system namespace, records the location and replica information of the file blocks on each DataNode, coordinates client access to files, and records changes to the namespace or to its attributes.
Each DataNode manages the storage on its physical node. HDFS exposes the file system namespace so that users can store data as files. HDFS data is "written once, read many times": files are split into data blocks (Blocks), 128 MB by default, and each data block is stored on a different DataNode wherever possible.
The NameNode performs namespace operations on the file system, such as opening, closing, and renaming files or directories, and also determines the block-to-DataNode mapping. DataNodes handle client read and write requests and create, replicate, and otherwise manage data blocks on instruction from the NameNode.
For example, when a client reads a file, it first obtains from the NameNode the list of block locations that make up the file, that is, which DataNodes hold the blocks, and then reads the file data directly from those DataNodes. The NameNode does not take part in the data transfer.
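To make this read path concrete, here is a minimal sketch using Hadoop's Java FileSystem API; the path /user/demo/sample.txt is a hypothetical example, and the Configuration is assumed to point at a running cluster:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // client handle to HDFS
        Path file = new Path("/user/demo/sample.txt"); // hypothetical path

        // Metadata lookup: ask the NameNode where the blocks of the file live.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("block at offset " + block.getOffset()
                    + " stored on " + String.join(",", block.getHosts()));
        }

        // The actual bytes are then streamed directly from the DataNodes.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}

The getFileBlockLocations call corresponds to the metadata exchange with the NameNode, while opening and reading the stream pulls the data straight from the DataNodes.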
There is only one NameNode in a cluster, and each of the other machines in the cluster runs one DataNode. It is also possible to run a DataNode on the machine that runs the NameNode, or to run multiple DataNodes on one machine.
The NameNode uses a transaction log (EditLog) to record changes to HDFS metadata and an image file (FsImage) to store the file system namespace, including the file-to-block mappings, file attributes, and so on. Both are stored in the NameNode's local file system. When the NameNode starts, it reads the image file and the transaction log from disk, applies all transactions in the log to the in-memory image, and then flushes the new metadata out to a new image file on the local disk, after which the old transaction log can be truncated. This process is known as a Checkpoint.
The SecondaryNameNode assists the NameNode in processing image files and transaction logs. The NameNode merges the image file and the transaction log only when it starts up, so the SecondaryNameNode periodically copies the image file and transaction log from the NameNode to a temporary directory, merges them to generate a new image file, and uploads it back to the NameNode; the NameNode then cleans up its transaction log, which keeps the transaction log's size under control.
1. NameNode
The NameNode is the HDFS daemon that records how files are divided into blocks and which data nodes store those blocks. Its main function is the centralized management of memory and I/O.
Because the NameNode is a single point of failure in a Hadoop cluster, the whole system cannot run once the NameNode server goes down.
NameNode stores metadata.
2. DataNode
Each slave server in the cluster runs a DataNode daemon, which reads and writes HDFS blocks on the local file system. When a client needs to read or write some data, the NameNode first tells the client which DataNode to use for the operation, and the client then communicates directly with the daemon on that DataNode server to read or write the relevant data blocks.
DataNode stores actual data.
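A matching write-path sketch, again using Hadoop's Java API with a hypothetical output path: the NameNode chooses the target DataNodes, and the client then pipelines the bytes straight to them.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The NameNode only allocates blocks and picks DataNodes; the data
        // itself flows from the client directly to those DataNodes.
        Path file = new Path("/user/demo/output.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
            out.writeBytes("hello hdfs\n");
        }
        fs.close();
    }
}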
3. SecondaryNameNode
The SecondaryNameNode is a secondary daemon used to monitor the state of HDFS. Like the NameNode, each cluster has one SecondaryNameNode, and it is deployed on a separate server. Unlike the NameNode, however, the SecondaryNameNode does not receive or record any real-time data changes; instead, it communicates with the NameNode at regular intervals to keep snapshots of the HDFS metadata. Because the NameNode is a single point, these snapshots minimize downtime and data loss when the NameNode fails, and if the NameNode is damaged, the SecondaryNameNode can step in as a backup NameNode.
The most important task of the SecondaryNameNode is not to keep a hot backup of the NameNode's metadata, but to merge the fsimage and edits logs periodically and transfer the result to the NameNode. Note that, to reduce the pressure on the NameNode, the NameNode does not merge fsimage and edits itself and store the files on disk; it leaves that work to the SecondaryNameNode.
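How often this merge is triggered is configurable. A minimal sketch of reading the relevant settings through the Java Configuration API, assuming the Hadoop 2.x property names (verify against the hdfs-default.xml of your version):

import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Hadoop 2.x names: checkpoint every N seconds, or after N transactions,
        // whichever comes first (defaults shown as fallbacks).
        long periodSecs = conf.getLong("dfs.namenode.checkpoint.period", 3600);
        long txnLimit = conf.getLong("dfs.namenode.checkpoint.txns", 1000000);
        System.out.println("checkpoint every " + periodSecs + " s or " + txnLimit + " transactions");
    }
}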
4. ResourceManager
ResourceManager mainly consists of the following parts:
User interaction
YARN provides three external services, for ordinary users, administrators, and the Web respectively, corresponding to ClientRMService, AdminService, and WebApp:
ClientRMService
ClientRMService is the service for ordinary users. It handles various RPC requests from clients, such as submitting an application, terminating an application, and getting an application's running state.
AdminService
YARN provides administrators with a separate set of service interfaces, to prevent a flood of ordinary user requests from starving the management commands sent by administrators. Through these interfaces administrators can manage the cluster, for example by dynamically updating node lists, updating ACL lists, or updating queue information.
WebApp
To display information such as cluster resource usage and application status in a friendlier way, YARN provides a Web interface.
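As an illustration of the kind of requests ClientRMService handles, here is a minimal sketch using the YarnClient API from org.apache.hadoop.yarn.client.api (available in Hadoop 2.x and later); it simply lists the applications known to the ResourceManager:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListApplications {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration()); // picks up yarn-site.xml
        yarnClient.start();

        // This RPC is served by ClientRMService on the ResourceManager.
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId() + " "
                    + report.getName() + " " + report.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}

The same client object also exposes calls for submitting and killing applications, which map onto the other ClientRMService requests described above.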
5. NodeManager (NM) management
NMLivelinessMonitor
Monitors whether each NM is alive. If a NodeManager does not report a heartbeat within a certain period (10 minutes by default), it is considered dead and is removed from the cluster (the timeout is configurable; see the sketch at the end of this section).
NodesListManager
Maintains the lists of normal and abnormal nodes and manages the exclude (blacklist-like) and include (whitelist-like) node lists, both of which are set in configuration files and can be reloaded dynamically.
ResourceTrackerService
Handles requests from NodeManagers, which are mainly of two kinds: registration and heartbeat. Registration happens when a NodeManager starts; its request contains information such as the node ID and the upper limit of available resources. Heartbeats are periodic; each one carries the running state of every Container, the list of running Applications, and the node's health status (which can be set through a script). In response, ResourceTrackerService returns to the NM the lists of Containers and Applications to be released.
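The heartbeat timeout and the node lists above are all driven by configuration. A hedged sketch of the properties involved, with names as they appear in yarn-default.xml for Hadoop 2.x (verify against your version):

import org.apache.hadoop.conf.Configuration;

public class NodeManagementSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // NMLivelinessMonitor: how long an NM may stay silent before it is
        // declared dead (default 600000 ms = 10 minutes).
        long expiryMs = conf.getLong("yarn.nm.liveness-monitor.expiry-interval-ms", 600000);
        // NodesListManager: files holding the include (whitelist-like) and
        // exclude (blacklist-like) node lists; empty means no list configured.
        String includePath = conf.get("yarn.resourcemanager.nodes.include-path", "");
        String excludePath = conf.get("yarn.resourcemanager.nodes.exclude-path", "");
        System.out.println("NM expiry: " + expiryMs + " ms, include: '"
                + includePath + "', exclude: '" + excludePath + "'");
    }
}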
6. Application management
ApplicationACLsManager
Manages application access permissions, which cover two kinds of operations: view and modify. View mainly means looking at an application's basic information; modify mainly means changing an application's priority, killing an application, and so on.
RMAppManager
Manage the startup and shutdown of applications.
ContainerAllocationExpirer
YARN does not allow an AM to hold a Container for a long time without using it, because that lowers the utilization of the whole cluster. When an AM receives a newly assigned Container from the RM, the Container must be launched on the corresponding NM within a certain period (10 minutes by default); otherwise, the RM reclaims the Container.
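This expiry interval is configurable as well; a minimal sketch, assuming the Hadoop 2.x property name (check yarn-default.xml for your version):

import org.apache.hadoop.conf.Configuration;

public class ContainerExpirySettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // ContainerAllocationExpirer: how long an allocated Container may sit
        // unused before the RM reclaims it (default 600000 ms = 10 minutes).
        long expiryMs = conf.getLong(
                "yarn.resourcemanager.rm.container-allocation.expiry-interval-ms", 600000);
        System.out.println("container allocation expiry: " + expiryMs + " ms");
    }
}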
Security management
ResourceManager has a fairly comprehensive permission management mechanism, implemented mainly by modules such as ClientToAMSecretManager, ContainerTokenSecretManager, and ApplicationTokenSecretManager.
Resource allocation
ResourceScheduler
ResourceScheduler is the resource scheduler: it allocates cluster resources to individual applications subject to certain constraints (such as queue capacity limits). ResourceScheduler is a pluggable module; the default implementation is FIFO, and YARN also ships two multi-tenant schedulers, Fair Scheduler and Capacity Scheduler.
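Which scheduler the RM loads is itself just a configuration choice. A minimal sketch, assuming the standard yarn.resourcemanager.scheduler.class property and the stock CapacityScheduler class name (normally this is set in yarn-site.xml; it is shown programmatically here only for illustration):

import org.apache.hadoop.conf.Configuration;

public class SchedulerSelection {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Swap in the Capacity Scheduler instead of the default FIFO scheduler.
        conf.set("yarn.resourcemanager.scheduler.class",
                "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler");
        System.out.println(conf.get("yarn.resourcemanager.scheduler.class"));
    }
}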
This is the answer to the question of what the basic concepts of HDFS in Hadoop are. I hope the above content has been of some help to you. If you still have questions, you can follow the industry information channel to learn more.