1.1.1 core-site.xml (tool module)
Contains the utility classes commonly used across Hadoop, renamed from the original Hadoop Core module. It mainly includes the system configuration tool Configuration, the remote procedure call (RPC) mechanism, the serialization mechanism, and the Hadoop abstract file system FileSystem. These provide basic services for building a cloud computing environment on commodity hardware, and supply the APIs needed by software developed to run on the platform.
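As a rough illustration (not from the original text), the sketch below uses the Configuration and FileSystem classes this module provides; the NameNode URI hdfs://localhost:9000 is a placeholder assumption.

```java
// Minimal sketch: Common-module Configuration + FileSystem usage.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class CommonExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // loads core-site.xml from the classpath
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // placeholder NameNode URI
        FileSystem fs = FileSystem.get(conf);              // abstract FileSystem, backed here by HDFS
        System.out.println("Working directory: " + fs.getWorkingDirectory());
        fs.close();
    }
}
```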
1.1.2 hdfs-site.xml (data storage module)
A distributed file system that provides high-throughput, highly scalable, and highly fault-tolerant access to application data. It is the foundation of data storage management in the Hadoop system: a highly fault-tolerant system that can detect and respond to hardware failures, designed to run on low-cost commodity hardware. HDFS simplifies the file consistency model, provides high-throughput access to application data through streaming reads, and is suitable for applications with large datasets.
NameNode + DataNode + Secondary NameNode
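As a hedged sketch, two of the key properties usually set in hdfs-site.xml can be read back through the Common module's Configuration class; the property names are standard, while the defaults shown (3 replicas, 128 MB blocks) match stock Hadoop 2.x and are only assumptions about any particular cluster.

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsSiteExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml"); // load HDFS settings from the classpath, if present
        // Stock defaults: 3 replicas per block, 128 MB block size (Hadoop 2.x).
        int replication = conf.getInt("dfs.replication", 3);
        long blockSize = conf.getLong("dfs.blocksize", 128L * 1024 * 1024);
        System.out.println("dfs.replication = " + replication);
        System.out.println("dfs.blocksize   = " + blockSize + " bytes");
    }
}
```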
1.1.3 mapred-site.xml (data processing module)
A system for parallel processing of large datasets, built on YARN. MapReduce is a computing model for processing large amounts of data. Hadoop's MapReduce implementation, together with Common and HDFS, made up the three components of Hadoop in its early development. MapReduce divides an application into two phases, Map and Reduce: Map performs a specified operation on each independent element of the dataset, producing intermediate results as key-value pairs, and Reduce then combines all values that share the same key to produce the final result. This functional division makes MapReduce very well suited to data processing in a distributed parallel environment composed of a large number of machines.
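To make the Map/Reduce split concrete, here is a minimal word-count sketch against Hadoop's mapreduce API; it is illustrative only, not code from the original article.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: emit (word, 1) for every word in an input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // intermediate key-value pair
            }
        }
    }

    // Reduce: sum all counts that share the same word (key).
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum)); // final result per key
        }
    }
}
```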
1.1.4 yarn-site.xml (job scheduling + resource management platform)
Task scheduling and cluster resource management.
ResourceManager + NodeManager
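For illustration, a driver that submits the word-count sketched above onto YARN might look roughly like this; mapreduce.framework.name is the standard switch, while the input and output paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn"); // run on YARN rather than locally
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input"));    // placeholder path
        FileOutputFormat.setOutputPath(job, new Path("/output")); // placeholder path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```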
1.2 Hadoop's five node roles:
1.2.1 NameNode (management node)
The NameNode manages the namespace (Namespace) of the file system. It maintains the file system tree (filesystem tree) and the metadata (metadata) of all files and directories in that tree. Two files hold this information: the namespace image file (fsimage) and the edit log file (edits). The edit log records the changes made to HDFS; the image file records the structure of the HDFS file tree. This information is cached in RAM, and both files are also persisted on the local disk. The NameNode also records, for each file, which DataNode holds each of its blocks, but it does not persist this information, because it is rebuilt from DataNode reports when the system starts.
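As a rough sketch, the kind of per-file metadata the NameNode serves can be inspected through the FileSystem API; the path /tmp/example.txt is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeMetadataExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // These answers come from the NameNode's in-memory metadata;
        // no DataNode is contacted for this call.
        FileStatus status = fs.getFileStatus(new Path("/tmp/example.txt")); // placeholder path
        System.out.println("length      = " + status.getLen());
        System.out.println("replication = " + status.getReplication());
        System.out.println("block size  = " + status.getBlockSize());
        fs.close();
    }
}
```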
1.2.2 DataNode (worker node)
DataNodes are the working nodes of the file system. They store and retrieve data as directed by clients or by the NameNode, and they periodically send the NameNode a list of the blocks (block) they store.
The file system cannot be used without the NameNode. In fact, if the server running the NameNode service were destroyed, all the files on the file system would be lost, because there is no way to rebuild the files from the DataNodes' blocks alone. Fault tolerance and redundancy mechanisms for the NameNode are therefore very important.
Every slave server in the cluster runs a DataNode daemon that reads and writes HDFS blocks on the local file system. When a client needs to read or write data, the NameNode first tells the client which DataNodes to use for the specific read or write operations; the client then communicates directly with the daemons on those DataNode servers to read or write the data blocks.
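This two-step flow (ask the NameNode, then stream from DataNodes) is exactly what a plain read through the client API performs; a minimal sketch, with /tmp/example.txt again a placeholder path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // open() asks the NameNode for the file's block locations;
        // the returned stream then pulls bytes directly from DataNodes.
        FSDataInputStream in = fs.open(new Path("/tmp/example.txt")); // placeholder path
        try {
            IOUtils.copyBytes(in, System.out, 4096, false); // stream contents to stdout
        } finally {
            IOUtils.closeStream(in);
            fs.close();
        }
    }
}
```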
1.2.3 Secondary NameNode (roughly analogous to a replication slave in a MySQL database)
The Secondary NameNode is an auxiliary daemon used to monitor the state of HDFS. Like the NameNode, each cluster has one Secondary NameNode, and it is deployed on a separate server. Unlike the NameNode, the Secondary NameNode does not accept or record any real-time data changes; instead, it communicates with the NameNode to take periodic snapshots of the HDFS metadata, merging the edit log into the image file. Because the NameNode is a single point, the Secondary NameNode's snapshots minimize the downtime and data loss a NameNode failure can cause: if the NameNode fails, its metadata can be restored from the most recent checkpoint. Note that it is a checkpointing helper rather than a hot standby NameNode.
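The checkpoint cadence is controlled by two standard properties; the sketch below merely reads them back, and the defaults shown (3600 s, 1,000,000 transactions) are the stock values.

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml"); // load HDFS settings from the classpath, if present
        // Stock defaults: checkpoint every 3600 s, or every 1,000,000 edit-log transactions.
        long period = conf.getLong("dfs.namenode.checkpoint.period", 3600L);
        long txns = conf.getLong("dfs.namenode.checkpoint.txns", 1_000_000L);
        System.out.println("dfs.namenode.checkpoint.period = " + period + " s");
        System.out.println("dfs.namenode.checkpoint.txns   = " + txns);
    }
}
```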
1.2.4 ResourceManager
The ResourceManager handles resource management. In YARN, the ResourceManager is responsible for the unified management and allocation of all resources in the cluster. It receives resource reports from each node (from the NodeManagers) and allocates resources to each application (more precisely, to each application's ApplicationMaster) according to a configured policy.
The RM comprises the Scheduler and the ApplicationsManager (application manager). The Scheduler is responsible for allocating resources to applications; it does not monitor or track application status, and it makes no guarantee of restarting tasks that fail because of application errors or hardware faults. The ApplicationsManager is responsible for accepting new jobs and for coordinating and restarting the ApplicationMaster container when it fails. The ApplicationMaster (AM) of each application is responsible for requesting resources from the Scheduler and for monitoring the use of those resources and the scheduling of tasks.
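As one hedged illustration of the RM's role as the cluster-wide resource authority, the YarnClient API can ask it for per-node capacities; this assumes Hadoop 2.8+ (for Resource.getMemorySize()) and a reachable RM configured in yarn-site.xml.

```java
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterResourceExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration()); // reads yarn-site.xml from the classpath
        yarn.start();
        // The RM is the single point that aggregates every NodeManager's report.
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + "  memory=" + node.getCapability().getMemorySize() + " MB"
                    + "  vcores=" + node.getCapability().getVirtualCores());
        }
        yarn.stop();
    }
}
```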
1.2.5 NodeManager
The NM is the ResourceManager's agent on each slave machine. It is responsible for managing containers, monitoring their resource usage, and reporting resource usage to the ResourceManager/Scheduler.
HDFS file storage mechanism:
The HDFS cluster is divided into two major roles: NameNode and DataNode (plus the Secondary NameNode).
The NameNode is responsible for managing the metadata of the entire file system.
The DataNodes are responsible for managing users' file blocks.
Files are cut into blocks of a fixed size (dfs.blocksize, 128 MB by default in Hadoop 2.x) and distributed across several DataNodes.
Each file block can have multiple replicas, stored on different DataNodes.
DataNodes regularly report the blocks they hold to the NameNode, and the NameNode is responsible for maintaining the number of replicas of each file.
The internal working mechanism of HDFS is transparent to the client; every client request to HDFS begins with a request to the NameNode.
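Putting the mechanism together, a hedged write sketch can request an explicit replica count and block size for a single file; the path and the values are placeholders, and the block splitting and replica placement then happen inside HDFS.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/demo.txt"); // placeholder path
        // Ask for 3 replicas and a 128 MB block size for this file; HDFS
        // cuts the data into blocks and places the replicas on DataNodes.
        FSDataOutputStream out = fs.create(path, true, 4096,
                (short) 3, 128L * 1024 * 1024);
        out.writeUTF("hello hdfs");
        out.close();
        fs.close();
    }
}
```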