
Example Analysis of each Module of Hadoop


This article walks through the main modules of Hadoop with examples. It is fairly detailed and should be a useful reference; readers who are interested are encouraged to read it to the end.

Hadoop cluster architecture

A Hadoop cluster consists of one Master node and several Slave nodes. The NameNode and JobTracker daemons run on the Master node, while the DataNode and TaskTracker daemons run on the Slave nodes.

Hadoop divides the hosts in a cluster into two roles from each of three perspectives:

Role division of Hadoop cluster hosts

1. From the perspective of host services

From the perspective of host services, the hosts in the cluster are divided into Master and Slave. Master hosts are responsible for metadata management of the distributed file system and for job scheduling of the distributed computation; a cluster usually has one or a few Master hosts. Slave hosts are responsible for the actual data storage and computation.

2. From the perspective of the HDFS file system

From the perspective of the HDFS file system, hosts are divided into NameNode and DataNode. The NameNode stores the metadata of the distributed file system, in other words the namespace: it maintains the file system tree and all the files and directories within it. This information is stored permanently on the local disk as two files, the namespace image file and the edit log file. The NameNode also records which data nodes hold the blocks of each file, but it does not store that information persistently, because it is rebuilt from the data nodes when the system starts. DataNodes store the actual file data; they store and retrieve blocks as requested (by clients or by the NameNode) and periodically report the list of blocks they hold back to the NameNode (together with their heartbeats).

In other words, the safety of the NameNode is critical: if the NameNode is lost or corrupted, every file in the file system is lost, because there is no way to reconstruct the files from the DataNode blocks alone. Hadoop provides two protection mechanisms:

One is to make redundant backups of the files that make up the persistent state of the NameNode's metadata.

The other is to run a secondary NameNode, which periodically merges the namespace image with the edit log to keep the edit log from growing too large.

We will come back to these two approaches later. More importantly, the NameNode HA mechanism is very well designed; see the article "Detailed explanation of the NameNode HA principle" for more on this.

In fact, Hadoop's file system layer is much more than HDFS: Hadoop integrates many file systems. It first provides a high-level file system abstraction, org.apache.hadoop.fs.FileSystem. This abstract class represents a file system in Hadoop and has several concrete implementations, as shown in the following table:

Hadoop file systems

Each entry lists the URI scheme, the Java implementation (all classes are under org.apache.hadoop), and a short description:

Local: scheme "file", fs.LocalFileSystem. A file system for a locally attached disk with client-side checksums.
HDFS: scheme "hdfs", hdfs.DistributedFileSystem. Hadoop's distributed file system.
HFTP: scheme "hftp", hdfs.HftpFileSystem. Read-only access to HDFS over HTTP; often used together with distcp to copy data between HDFS clusters.
HSFTP: scheme "hsftp", hdfs.HsftpFileSystem. Read-only access to HDFS over HTTPS.
HAR: scheme "har", fs.HarFileSystem. A file system layered on another file system for archiving files; Hadoop archives are mainly used to reduce the memory usage of the NameNode.
KFS: scheme "kfs", fs.kfs.KosmosFileSystem. A distributed file system similar to GFS and HDFS, written in C++.
FTP: scheme "ftp", fs.ftp.FTPFileSystem. A file system backed by an FTP server.
S3 (native): scheme "s3n", fs.s3native.NativeS3FileSystem. A file system backed by Amazon S3.
S3 (block-based): scheme "s3", fs.s3.S3FileSystem. A file system backed by Amazon S3 that stores files in blocks to work around S3's 5 GB file-size limit.
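To make the abstraction concrete, here is a minimal Java sketch that obtains FileSystem instances for two of the schemes above and lists the HDFS root directory. The NameNode address and the paths are illustrative assumptions, not something taken from the original article.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: the same abstract FileSystem API works for any scheme; only the URI changes.
public class FileSystemSchemes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Local file system (fs.LocalFileSystem)
        FileSystem local = FileSystem.get(URI.create("file:///"), conf);

        // HDFS (hdfs.DistributedFileSystem); the NameNode address is an assumption.
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // List the root directory of HDFS through the same generic API.
        for (FileStatus status : hdfs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }

        local.close();
        hdfs.close();
    }
}
```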

3. From the MapReduce perspective

From the MapReduce perspective, hosts are divided into JobTracker and TaskTracker. The JobTracker schedules and manages jobs and belongs to the master; the TaskTracker actually executes the tasks and belongs to the slaves.

So HDFS is just one of the file systems Hadoop supports.

Introduction to HDFS

HDFS is the standard storage system used by Hadoop and is a distributed file system designed for a networked environment. It was developed for streaming data access and for processing very large files; distributed file systems as such are nothing new and have existed since around the 1980s. Data files stored on HDFS are first split into blocks, and multiple replicas of each block are created and stored on different nodes in the cluster, so that Hadoop MapReduce programs can process the data on any of those nodes. The process of storing and processing data on HDFS is shown below.

Data Storage and processing Model on Hadoop

Two main components of HDFS

HDFS follows a master/slave architecture: each HDFS cluster has one name node (NameNode) and several data nodes (DataNode). HDFS stores a file by splitting it into blocks; a block is the smallest unit the file system reads or writes, and the file system can only handle data in multiples of the block size at a time. The blocks are then placed on data nodes according to a placement policy. The NameNode manages the file system namespace and file and directory operations, and is also responsible for the mapping between data nodes and file blocks. The DataNodes handle read and write requests from clients, as well as block creation, deletion, and other block-level commands from the name node.
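As a small illustration of the block-to-DataNode mapping maintained by the NameNode, the following Java sketch asks for the block locations of a file. The file path and NameNode address are assumptions made for the example.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: print which hosts hold each block of a file (path and address are assumptions).
public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/user/hadoop/input/data.txt");

        FileStatus status = fs.getFileStatus(file);
        // The NameNode answers this query from the block map it keeps in memory.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```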

The functional differences between the NameNode and the DataNodes follow directly from these roles.

System Architecture of HDFS

The architecture of HDFS is shown in the figure:

HDFS read file flow

The process for the client to read the file in HDFS is as follows:

HDFS read file flow

The specific process of the read operation is as follows:

1. The client opens the file by calling the open() method of FileSystem.

2. DistributedFileSystem communicates with the NameNode over RPC to obtain the file's block information and the addresses of the nodes that hold those blocks.

3. DistributedFileSystem wraps the block information in an FSDataInputStream and returns it to the client for reading.

4. Using the information returned by the NameNode, the client connects to the data nodes that store the blocks and calls the read() method on the stream to start reading.

5. DFSInputStream connects to the nearest data node that holds the first block of the file.

6. Data is streamed from the data node back to the client.

7. When the end of the block is reached, DFSInputStream closes the connection to that data node and connects to the nearest data node holding the next block of the file.

8. When the client has finished reading, it calls the close() method of FSDataInputStream.

9. If the client encounters an error while communicating with a data node, it tries the next data node that holds the block.

10. Failed data nodes are remembered and are not contacted again for later blocks.
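The read flow above corresponds to only a few lines of client code. Below is a minimal sketch (the path and NameNode URI are assumptions) that opens a file through the FileSystem API and copies it to standard output; the DFSInputStream work described in steps 2 to 10 happens behind open() and read().

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch: read a file from HDFS and print it (path and address are assumptions).
public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/user/hadoop/input/data.txt");

        // open() returns an FSDataInputStream backed by DFSInputStream, which
        // asks the NameNode for block locations and then reads each block
        // from the nearest DataNode.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```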

HDFS write file flow

The HDFS write file process is as follows.

HDFS write file flow

The specific process of the write operation is as follows:

1. The client calls create() to create the file.

2. DistributedFileSystem calls the metadata node (NameNode) over RPC to create a new file in the file system namespace.

3. The metadata node first checks that the file does not already exist and that the client has permission to create it, and then creates the new file.

4. DistributedFileSystem returns a DFSOutputStream, which the client uses to write data.

5. The client starts writing data; DFSOutputStream splits the data into packets and writes them to a data queue.

6. The data queue is read by the DataStreamer, which asks the metadata node to allocate data nodes to store the blocks (each block is replicated 3 times by default). The allocated data nodes form a pipeline.

7. The DataStreamer writes each packet to the first data node in the pipeline; the first data node forwards it to the second, and the second forwards it to the third.

8. DFSOutputStream also maintains an ack queue of packets that have been sent, waiting for the data nodes in the pipeline to confirm that the data has been written successfully.

If a data node fails during the write process:

The pipeline is closed, and the packets in the ack queue are placed back at the front of the data queue.

The current block on the healthy data nodes is given a new identity by the metadata node, so that when the failed node recovers it can detect that its copy of the block is stale and delete it.

The failed data node is removed from the pipeline, and the rest of the block's data is written to the two remaining data nodes in the pipeline.

The metadata node notices that the block is under-replicated and arranges for another replica to be created later.

9. When the client has finished writing data, it calls the close() method of the stream. This flushes all remaining packets to the data nodes in the pipeline and waits for the ack queue to report success; finally, the metadata node is notified that the file is complete.
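On the client side, the write flow is just as simple to trigger. The following is a minimal sketch (path and NameNode URI are assumptions); the data queue, pipeline, and ack handling described in steps 5 to 9 happen inside the stream returned by create().

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: write a small file to HDFS (path and address are assumptions).
public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/user/hadoop/output/hello.txt");

        // create() asks the NameNode to add the file to the namespace and
        // returns an FSDataOutputStream; data is split into packets, queued,
        // and pipelined to the DataNodes chosen by the NameNode.
        try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
            out.writeUTF("hello hdfs");
        }
        // Closing the stream flushes the remaining packets, waits for the
        // pipeline acks, and tells the NameNode the file is complete.
        fs.close();
    }
}
```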

Some additions:

1. After the NameNode starts, it enters a special state called safe mode. While in safe mode, the NameNode does not replicate blocks; it receives heartbeats and block reports from all DataNodes, where a block report lists the blocks held by a DataNode. Each block has a specified minimum number of replicas, and a block is considered safely replicated once the NameNode has confirmed that it has reached that number. When a configured percentage of blocks is safely replicated, the NameNode leaves safe mode and then determines which blocks still have fewer replicas than required and replicates them.

2. Advantages and disadvantages of Hadoop

Advantages:

Dealing with very large files

Streaming access to data

Runs on cheap clusters of commodity machines

Disadvantages:

Not suitable for low-latency data access

Unable to store a large number of small files efficiently

MapReduce

Entities involved in the execution of a MapReduce job

Client

The client submits MapReduce jobs to the JobTracker on the cluster's master node. The client may or may not be a node in the cluster; when a job is submitted from outside the cluster, the address of the JobTracker must be specified.

Compute master node JobTracker

The JobTracker is mainly responsible for assigning tasks and for managing and monitoring the computation.

Slave node TaskTracker

Runs the tasks assigned to it by the JobTracker.

HDFS

At the start of the computation, the program reads its input data from HDFS; when the computation finishes, the results are written back to HDFS.

The workflow of MapReduce

The workflow of a complete MapReduce job is as follows:

MapReduce workflow

1. Job submission

The client calls the runJob() method of JobClient, which creates a JobClient instance and then calls its submitJob() method to submit the job to the cluster. This mainly involves the following steps:

1) Request a new job ID from the JobTracker via getNewJobId().

2) Check the output specification of the job (for example, an exception is thrown if no output directory was specified or if it already exists), to avoid overwriting existing files.

3) Compute the input splits for the job; for efficiency, the split size should match the HDFS block size (an exception is thrown if the splits cannot be computed, for example because the input path does not exist).

4) Copy the resources needed to run the job (the job JAR file, the configuration file, the computed input splits, and so on) to a directory named after the job ID. (They are stored with multiple replicas in the cluster so that TaskTrackers can access them.)

5) Tell the JobTracker that the job is ready for execution by calling its submitJob() method.

6) The JobTracker schedules the tasks onto worker nodes; runJob() polls the job's progress once per second and reports it until the job has completed.
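For reference, here is a minimal word-count job written against the classic mapred API that this description uses (JobConf, JobClient, JobTracker). The input and output paths are illustrative assumptions; runJob() performs the submission and the once-per-second progress polling described in step 6.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch: classic (pre-YARN) word-count driver; paths are assumptions.
public class WordCountJob {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountJob.class);
        conf.setJobName("word-count");

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(conf, new Path("/user/hadoop/input"));
        // The output directory must not already exist, otherwise submission fails (step 2 above).
        FileOutputFormat.setOutputPath(conf, new Path("/user/hadoop/output"));

        // runJob() submits the job and polls the JobTracker for progress until completion.
        JobClient.runJob(conf);
    }
}
```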

2. Job initialization

1) After the JobTracker receives the call to its submitJob() method, it places the job in an internal queue and hands it to the job scheduler (for example the FIFO scheduler, the capacity scheduler, or the fair scheduler).

2) Initialization mainly means creating an object that represents the running job; it encapsulates the tasks and bookkeeping information so that the status and progress of the tasks can be tracked.

3) To build the list of tasks to run, the job scheduler first retrieves from HDFS the input split information computed by JobClient.

4) It then creates one MapTask per split, as well as the ReduceTasks. Each task is assigned an ID at this point; note the difference between the job ID and the task IDs.

3. Task assignment

1) The TaskTracker communicates with the JobTracker periodically through heartbeats, mainly to tell the JobTracker that it is still alive and whether it is ready to run new tasks.

2) Before the JobTracker assigns a task to a TaskTracker, it selects a job according to priority and then picks a task from the highest-priority job. In Hadoop's default job scheduling, Map tasks have a higher priority than Reduce tasks.

3) A TaskTracker runs a certain number of map tasks and reduce tasks according to policy; the number of task slots is determined by the TaskTracker's processing capacity and memory size.

4. Task execution

1) After a TaskTracker is assigned a task, it copies the job's JAR file from HDFS to its local file system (the JAR is localized so that a new JVM can be started), and it copies any files the application needs from the distributed cache to the local disk.

2) The TaskTracker creates a local working directory for the task and unpacks the contents of the JAR into that directory.

3) The TaskTracker starts a new JVM to run each task (MapTask or ReduceTask), so that a crash or hang in the user's MapReduce code does not affect the TaskTracker daemon.

The child process communicates with its parent through the umbilical interface, and the child informs the parent of the task's progress every few seconds until the task is complete.

5. Progress and status updates

A job and each of its tasks have status information, including the running state of the job or task, the progress of the Map and Reduce phases, the values of counters, and status messages or descriptions (which can be set by user code). How is this status, which changes over the course of the job, communicated back to the client?

While a task is running, it keeps track of its progress, that is, the fraction of the task that has completed. For a MapTask, this is the fraction of the input that has been processed. For a ReduceTask the situation is a little more complicated, but the system still estimates the fraction of the reduce input that has been processed.

These messages are aggregated at intervals along the path Child JVM -> TaskTracker -> JobTracker. The JobTracker builds a global view of all running jobs and the status of their tasks, which can be browsed through the Web UI. At the same time, JobClient queries the JobTracker every second for the latest status and prints it to the console.

Shuffle and Sort

The Shuffle phase begins with the output of the Map: it covers the process by which the system sorts the map output and transfers it to the Reduce side as input. The Sort phase refers to sorting the keys emitted by the map side. Different maps may emit the same key, and all records with the same key must be sent to the same reduce task for processing. Shuffle can be divided into the map-side shuffle and the reduce-side shuffle. The Shuffle and Sort phases work as follows:
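The guarantee that identical keys reach the same reduce task comes from partitioning the map output by key. The small sketch below mirrors the logic of Hadoop's default HashPartitioner (the key values are made up for illustration): identical keys always land in the same partition and therefore go to the same reducer.

```java
import org.apache.hadoop.io.Text;

// Sketch: the default hash-partitioning rule; key values are illustrative.
public class PartitionDemo {
    static int partitionFor(Text key, int numReduceTasks) {
        // Same logic as Hadoop's HashPartitioner: identical keys hash to the same partition.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        for (String k : new String[] {"apple", "apple", "banana", "cherry"}) {
            System.out.println(k + " -> reducer " + partitionFor(new Text(k), reducers));
        }
    }
}
```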

1. Shuffle on the map side

When the map function starts producing output, it does not simply write the data to disk, because frequent disk operations would cause a serious performance penalty. The process is more involved: the data is first written to an in-memory buffer, and some pre-sorting is done there for efficiency.

Each MapTask has a circular memory buffer for its output (100 MB by default). When the amount of data in the buffer reaches a threshold (80% by default), a background thread starts writing the buffer contents to disk (the spill phase). While the spill is in progress, map output continues to be written to the buffer, but if the buffer fills up during that time, the map blocks until the spill completes.

Before writing to disk, the thread first divides the data into partitions corresponding to the reducers the data will ultimately be sent to. Within each partition, the background thread sorts by key (quicksort), and if a Combiner (a "mini reducer") is configured, it is run on the sorted output.

Each time the buffer reaches the spill threshold a new spill file is created, so there may be several spill files by the time the MapTask writes its last output record. Before the MapTask finishes, the spill files are merged (a multi-way merge sort) into a single data file and index file (the Sort phase).

Once the spill files have been merged, the map deletes all the temporary spill files and informs the TaskTracker that the task is complete. As soon as any MapTask completes, the ReduceTasks start copying its output (the Copy phase).

The map output file sits on the local disk of the TaskTracker that ran the MapTask and is the input the ReduceTasks need; the reduce output is different, however, and is generally written to HDFS (the Reduce phase).
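A few classic (MRv1) configuration knobs correspond directly to the buffer, spill, and combiner behaviour described above. The sketch below is only illustrative: the property names are the pre-YARN ones, and reusing the stock LongSumReducer as a combiner is an assumption made for the example, not something the article prescribes.

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;

// Sketch: MRv1 map-side shuffle knobs (values shown are the defaults described above).
public class ShuffleTuning {
    public static void main(String[] args) {
        JobConf conf = new JobConf(ShuffleTuning.class);

        // Size of the circular in-memory buffer each map task writes into (MB).
        conf.setInt("io.sort.mb", 100);
        // Fraction of the buffer that triggers a background spill to disk.
        conf.setFloat("io.sort.spill.percent", 0.80f);

        // A combiner (the "mini reducer") runs on the sorted spill output to
        // shrink the amount of data shipped to the reduce side.
        conf.setCombinerClass(LongSumReducer.class);

        System.out.println("sort buffer: " + conf.get("io.sort.mb") + " MB");
    }
}
```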

2. Shuffle on the reduce side

Copy phase: the reduce task starts several data-copy threads and fetches the map output files over HTTP from the TaskTrackers on which the MapTasks ran.

Merge phase: data copied from the maps is first placed in a memory buffer. Merging happens in three forms: memory to memory, memory to disk, and disk to disk. The first form is disabled by default; the second runs until the copying of map output ends, and then the third, disk-to-disk merge produces the final file.

Reduce phase: the final file may be on disk or in memory, but by default it is on disk. Once the input to the reduce has been prepared, the shuffle is over; the reduce function then runs and writes its results to HDFS.

HBase

Characteristics of HBase tables

1. Large: a table can have billions of rows and millions of columns

2. Schemaless: each row has a sortable primary key and any number of columns; columns can be added dynamically as needed, and different rows in the same table can have completely different columns.

3. Column-oriented: storage and access control are organized by column family, and column families are retrieved independently.

4. Sparse: empty (null) columns take up no storage space, so tables can be designed to be very sparse.

5. Multiple versions of data: each cell can hold multiple versions of its data. By default, the version number is assigned automatically and is the timestamp at which the cell was written.

6. Single data type: all data in HBase is stored as strings (uninterpreted bytes); there are no other types.
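These characteristics show up directly in the client API. The following sketch uses the classic (pre-1.0) HBase Java client; the table name, column family, row key, and Zookeeper quorum are assumptions made for illustration. Every value is written as raw bytes, and reading with setMaxVersions() exposes the multi-version cells described above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: put and get a cell with the classic HBase client (names are assumptions).
public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3"); // assumed Zookeeper hosts

        HTable table = new HTable(conf, "user_actions");
        try {
            // Values are stored as uninterpreted bytes; each write creates a
            // new timestamped version of the cell.
            Put put = new Put(Bytes.toBytes("row-001"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Beijing"));
            table.put(put);

            Get get = new Get(Bytes.toBytes("row-001"));
            get.setMaxVersions(3); // ask for up to three versions of each cell
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));
        } finally {
            table.close();
        }
    }
}
```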

Physical Cluster Architecture of HBase

An HBase cluster physically consists of a Zookeeper cluster, a Master server, and multiple RegionServer slave servers. Underneath, HBase depends on an HDFS cluster. The overall system architecture is as follows:

Physical Cluster Architecture of HBase

The operation of an HBase cluster depends on Zookeeper; by default HBase manages a Zookeeper instance as the cluster "authority". If a server crashes during region assignment, Zookeeper is needed to coordinate the reassignment, and clients also consult Zookeeper to learn about the cluster before reading or writing data in HBase.

At the lowest layer, all HBase data is stored in HDFS; HBase is a distributed database built on top of HDFS.

Storage architecture of HBase

The storage architecture of HBase is as follows:

Storage architecture of HBase

The key roles in an HBase cluster are its daemons: the HMaster process that runs on the cluster's Master node, the HRegionServer process that runs on each RegionServer slave node, and, in addition, several Zookeeper processes. The specific components are as follows:

1. Client

The HBase client uses HBase's RPC mechanism to communicate with the HMaster (for administrative operations) and with the HRegionServers (for data read and write operations).

2. HMaster

The HMaster is the main process on the cluster's Master node. It is not a single point of failure: multiple HMasters can be started in HBase, and Zookeeper's Master Election mechanism ensures that one Master is always running. Functionally, the HMaster is mainly responsible for managing Tables and Regions, including creating, deleting, and altering tables and distributing Regions across RegionServers.

3. HRegionServer

The HRegionServer process runs on the cluster's slave nodes and is mainly responsible for responding to users' I/O requests and for reading and writing data to HDFS. Its internal structure is as follows:

Internal functional structure diagram of HRegionServer

As the figure shows, an HRegionServer manages a series of HRegion objects, and each HRegion consists of multiple HStores. An HRegion corresponds to a Region of a Table, and an HStore corresponds to a column family of the Table; each column family is a storage unit.

An HStore consists of two parts: a MemStore and StoreFiles. Data written by a user is first put into the MemStore; when the MemStore fills up, it is flushed to disk as a StoreFile.

4. Zookeeper

Besides the address of the -ROOT- table and the address of the Master, Zookeeper also stores information about the HRegionServers, so that the HMaster can know the status of every HRegionServer at any time.

5. HLog

Each HRegionServer has an HLog object. When a user writes data, a copy is also written to the HLog, so that the data can be recovered if something unexpected happens.

HBase Runtime Features

1. Regions in HBase

As the number of records in an HBase table keeps growing and reaches a threshold, the table is split by row into multiple parts. Each part is a Region, denoted by [startkey, endkey]. The cluster's Master assigns the different Regions to the appropriate RegionServers to manage.

2. Two special tables in HBase

A running HBase instance keeps two special catalog tables: -ROOT- and .META. The -ROOT- table holds the list of all Regions of the .META. table, while the .META. table holds the list of all user-space Regions together with the addresses of the RegionServers that serve them.

Two special tables in HBase

Brief introduction to Hive

Hive is a data warehouse tool built on top of Hadoop. It provides a set of tools that help users extract, transform, and load (ETL) large-scale data. Hive defines a simple SQL-like query language called HiveQL. In essence, Hive is a SQL interpreter: it converts the HiveQL statements entered by the user into MapReduce jobs and runs them on the Hadoop cluster.
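As a small illustration of how Hive is used, the sketch below submits a HiveQL query through the HiveServer2 JDBC driver; the server address, credentials, and table name are assumptions, and the GROUP BY query is the kind of statement Hive compiles into MapReduce jobs as described above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: run a HiveQL query over JDBC (host, user, and table are assumptions).
public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-server:10000/default", "hadoop", "");
             Statement stmt = conn.createStatement();
             // Hive translates this SQL-like statement into MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                 "SELECT word, COUNT(*) AS cnt FROM word_table GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}
```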

The architecture of Hive

That is all for "Example Analysis of each Module of Hadoop". Thank you for reading, and we hope the content has been helpful.
