
Related concepts of Hadoop

2025-02-27 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

What Hadoop is

Hadoop is an open source big data framework.

Hadoop is a distributed computing solution.

Hadoop = HDFS (distributed file system) + MapReduce (distributed computing)

Hadoop core

HDFS distributed file system: storage is the foundation of big data technology.

MapReduce programming model: distributed computing is the solution for big data applications.

Hadoop infrastructure

 HDFS concept

  data block

  NameNode

  DataNode

Block: HDFS uses abstract blocks rather than entire files as its storage unit. The default block size was 64 MB in Hadoop 1.x and is 128 MB in Hadoop 2.x and later; each block is stored as 3 replicas by default.

NameNode: manages the namespace of the file system, stores file metadata; maintains all files and directories of the file system, and maps files to data blocks; records information about the data nodes where each block in each file is located.

DataNode: stores and retrieves data blocks; updates NameNode with a list of stored blocks.
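The three concepts above can be illustrated with a little arithmetic. The following is a minimal sketch, not HDFS code: the file path, the DataNode names, and the round-robin placement are assumptions made for illustration (real HDFS placement is rack-aware). It shows how a 1000 MB file maps to blocks and what kind of mapping the NameNode keeps.

```python
BLOCK_SIZE_MB = 128  # typical block size mentioned above
REPLICATION = 3      # default number of copies per block


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would occupy."""
    full, rest = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full
    if rest:
        sizes.append(rest)  # the last block only holds the remaining data
    return sizes


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement: block index -> list of DataNode names.

    Assumes replication <= len(datanodes) so the replicas land on distinct nodes.
    """
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


blocks = split_into_blocks(1000)
# Hypothetical NameNode metadata: file path -> {block index: DataNodes holding it}
namenode_metadata = {
    "/logs/big.log": place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
}
raw_storage_mb = sum(blocks) * REPLICATION

print(len(blocks), blocks[-1], raw_storage_mb)
```

With these numbers the file occupies 7 full blocks plus one 104 MB block, and the three replicas consume 3000 MB of raw disk across the cluster.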

HDFS advantages: it is suitable for large file storage, supports TB- and PB-scale data, and keeps multiple copies of each block under a replication policy. It can be built on cheap machines and has fault tolerance and recovery mechanisms. It supports streaming data access; write-once, read-many workloads are the most efficient.

HDFS disadvantages: it is not suitable for storing large numbers of small files, is not suitable for concurrent writes to the same file, and does not support random modification of files. It is not designed for low-latency access patterns such as frequent random reads.

Understanding of Hadoop's functional modules

1. HDFS module

HDFS is responsible for the storage of large data. By dividing large files into blocks for distributed storage, HDFS breaks through the limitation of server hard disk size and solves the problem that a single machine cannot store large files. HDFS is a relatively independent module that can provide services for YARN and other modules such as HBase.

2. YARN module

YARN is a general resource coordination and task scheduling framework, created to solve the JobTracker overload and other problems of MapReduce in Hadoop 1.x.

YARN is a general purpose framework that can run not only MapReduce, but also other computing frameworks such as Spark and Storm.

3. MapReduce module

MapReduce is a computational framework that provides a way to process data in a distributed way through Map and Reduce phases. It is only suitable for offline processing of big data, and is not suitable for applications with high real-time requirements.

How to store small files with Hadoop?

a. Merge small files into large files on the client side.

Hadoop normally creates at least one mapper per input file, so a job over many small files starts a very large number of mappers, each processing very little data, and the application runs inefficiently. The main purpose of solving the small file problem is to speed up the execution of Hadoop programs by merging small files into larger files. This reduces the number of mappers that have to be started and correspondingly improves the overall performance of Hadoop jobs.
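To make the effect of merging concrete, here is a rough sketch that counts mappers before and after merging. It is only a model of the idea: it assumes one mapper per input file, a 128 MB merge target, and a simple greedy packing strategy, none of which is taken from Hadoop's own code.

```python
def count_mappers(file_sizes_mb):
    """Model assumption: one mapper per input file (each file is below block size)."""
    return len(file_sizes_mb)


def merge_small_files(file_sizes_mb, target_mb=128):
    """Greedily pack small files into merged files of at most target_mb each."""
    merged, current = [], 0
    for size in file_sizes_mb:
        # Close the current merged file if adding this one would exceed the target.
        if current and current + size > target_mb:
            merged.append(current)
            current = 0
        current += size
    if current:
        merged.append(current)
    return merged


small_files = [2] * 1000  # 1000 files of 2 MB each
merged = merge_small_files(small_files)

print(count_mappers(small_files), "mappers before merging")
print(count_mappers(merged), "mappers after merging")
```

In this model the same 2000 MB of input goes from 1000 mappers down to 16.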

b. Use Hadoop's CombineFileInputFormat to merge small files.

Use the Hadoop API (the abstract class CombineFileInputFormat) to solve the small file problem. The basic idea of CombineFileInputFormat is to pack multiple small files into a single input split through a custom InputFormat, so that one mapper processes many files.

How does the cluster continue to provide service when there is a node failure, and how does it read and write?

What factors affect MapReduce performance?

a. Hardware (or resource) factors such as CPU, disk I/O, network bandwidth, and memory size.

b. Underlying storage system.

c. The size of input data, shuffle data, and output data, which is closely related to the running time of the job.

d. Job algorithms (or programs) such as map, reduce, partition, combine, and compress. Some algorithms are difficult to conceptualize in MapReduce or may be less efficient in MapReduce.

HDFS Write Flow

1. The client sends a write request to the NameNode.

2. The client writes the data to DataNodes block by block; the DataNodes automatically replicate each block.

3. The DataNodes report storage completion to the NameNode, which notifies the client.

HDFS Read Flow

1. The client sends a read request to the NameNode.

2. The NameNode looks up the blocks of the file and returns the nearest DataNodes that hold them.

3. The client downloads the file from the DataNodes block by block.
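The write and read flows above can be sketched as a small in-memory model. This is not the HDFS client API: the class `ToyHdfs`, the 4-byte block size, the round-robin choice of target DataNodes, and treating "nearest" as "first reachable replica" are all assumptions chosen to keep the example short.

```python
BLOCK_SIZE = 4  # bytes; tiny on purpose so the example stays readable


class ToyHdfs:
    """Hypothetical in-memory model of one NameNode and several DataNodes."""

    def __init__(self, datanode_names, replication=3):
        self.datanodes = {name: {} for name in datanode_names}  # name -> {block id: bytes}
        self.namespace = {}  # NameNode metadata: path -> [(block id, [DataNode names])]
        self.replication = replication

    def write(self, path, data):
        names = list(self.datanodes)
        entries = []
        for offset in range(0, len(data), BLOCK_SIZE):
            index = offset // BLOCK_SIZE
            block_id = f"{path}#{index}"
            # Write step 1: the NameNode chooses target DataNodes for the block.
            targets = [names[(index + r) % len(names)] for r in range(self.replication)]
            # Write step 2: the block is stored on every target (replication).
            for target in targets:
                self.datanodes[target][block_id] = data[offset:offset + BLOCK_SIZE]
            # Write step 3: the block locations are recorded in the NameNode metadata.
            entries.append((block_id, targets))
        self.namespace[path] = entries

    def read(self, path, unavailable=()):
        chunks = []
        # Read steps 1 and 2: ask the NameNode where each block of the file lives.
        for block_id, locations in self.namespace[path]:
            # Read step 3: fetch the block from the first replica that is reachable.
            source = next(name for name in locations if name not in unavailable)
            chunks.append(self.datanodes[source][block_id])
        return b"".join(chunks)


fs = ToyHdfs(["dn1", "dn2", "dn3", "dn4"])
fs.write("/demo.txt", b"hello hadoop")
print(fs.read("/demo.txt"))
print(fs.read("/demo.txt", unavailable={"dn1"}))
```

Because every block has three replicas, the second read still succeeds when `dn1` is treated as failed; the client simply falls back to another DataNode that holds the same block.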

MapReduce

MapReduce is a programming model, a programming method, and an abstract theory.

YARN concept (generic resource coordination and task scheduling framework)

ResourceManager

Allocates and schedules resources, starts and monitors ApplicationMasters, and monitors NodeManagers.

ApplicationMaster

Requests resources for MR-type programs and allocates them to internal tasks; responsible for data slicing, monitoring task execution, and fault tolerance.

NodeManager

Manages resources for a single node, processes commands from the ResourceManager, and processes commands from the ApplicationMaster.
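The division of labour between the three roles can be sketched as follows. This is a deliberately simplified model, not YARN's API: the class and method names, the memory-only notion of a container, and the "most free memory first" policy are assumptions for illustration, and real YARN schedulers are considerably more involved.

```python
class NodeManager:
    """Tracks the free memory of a single node."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb


class ResourceManager:
    """Grants containers across all nodes of the cluster."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb):
        """Grant a container on the node with the most free memory, or None."""
        node = max(self.nodes, key=lambda n: n.free_mb)
        if node.free_mb < memory_mb:
            return None  # the cluster cannot satisfy the request right now
        node.free_mb -= memory_mb
        return node.name


class ApplicationMaster:
    """Requests one container per task of a single application."""

    def __init__(self, resource_manager):
        self.resource_manager = resource_manager

    def run(self, num_tasks, memory_per_task_mb):
        """Return task index -> node name, or None if no container was granted."""
        return {
            task: self.resource_manager.allocate(memory_per_task_mb)
            for task in range(num_tasks)
        }


rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 2048)])
assignments = ApplicationMaster(rm).run(num_tasks=4, memory_per_task_mb=2048)
print(assignments)
```

In this run the cluster has room for three 2048 MB containers, so the fourth task receives no container and would have to wait until resources are released.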

MapReduce has four stages.

1. Split stage

2. Map stage (coding required)

3. Shuffle stage

4. Reduce stage (coding required)

MapReduce programming model

- Input a large file and split it into multiple fragments.
- Each file fragment is processed by a separate machine, which is the Map method.
- Summarize the results of each machine's calculation and get the final result, which is the Reduce method.
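The four stages can be illustrated with a word count that runs in a single Python process. This is a sketch of the programming model only, not Hadoop code: the function names and the way the input is split are assumptions, and in a real cluster each split would be processed on a separate machine.

```python
from collections import defaultdict


def split_input(text, num_splits):
    """Split stage: divide the input lines into num_splits fragments."""
    lines = text.splitlines()
    return [lines[i::num_splits] for i in range(num_splits)]


def map_phase(lines):
    """Map stage: emit a (word, 1) pair for every word in a fragment."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped_outputs):
    """Shuffle stage: group the values from all mappers by key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce stage: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


text = "big data\nbig cluster\ndata node"
splits = split_input(text, 2)
counts = reduce_phase(shuffle([map_phase(fragment) for fragment in splits]))
print(counts)
```

Only `map_phase` and `reduce_phase` correspond to code a MapReduce programmer writes; the split and shuffle steps are handled by the framework, which matches the "coding required" notes above.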

