2025-02-27 Update. From: SLTechnology News&Howtos (shulou)
Shulou(Shulou.com)06/03 Report--
[Introduction] Hadoop is a distributed system infrastructure developed by the Apache Foundation.
1. What is the Apache Foundation?
A: The Apache Software Foundation (ASF for short) is a non-profit organization dedicated to supporting open source software projects. Software products released by the Apache projects and subprojects it supports are distributed under the Apache License.
[Role]
Hadoop implements a distributed file system, HDFS. HDFS is highly fault-tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access to application data, making it suitable for applications with very large data sets.
The core of Hadoop's framework design consists of HDFS (storage for massive data) and MapReduce (computation over that stored data).
[core]
① At the bottom of Hadoop is HDFS, which stores files across all storage nodes in the Hadoop cluster.
② The layer above HDFS is the MapReduce engine, which consists of JobTrackers and TaskTrackers.
③ On top of HDFS and MapReduce sit higher-level tools such as the data warehouse Hive and the distributed database HBase.
2. What is HDFS?
A: To external clients, HDFS looks like a traditional hierarchical file system in which files can be created, deleted, moved, renamed, and so on.
The architecture of HDFS is based on a specific set of nodes, including:
NameNode (only one per cluster): provides metadata services within HDFS. It is software that typically runs on a dedicated machine in an HDFS instance, and it is responsible for managing the file system namespace and controlling access by external clients. It determines the mapping of files to replicated blocks on DataNodes.
DataNode: provides storage blocks for HDFS. Files stored in HDFS are divided into blocks, which are then replicated to multiple computers (DataNodes). All internal communication is based on the standard TCP/IP protocol. A DataNode is also software that usually runs on a separate machine in an HDFS instance. A Hadoop cluster contains one NameNode and many DataNodes; DataNodes are organized into racks, which connect all systems through switches.
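The division of labor above, with one NameNode holding only metadata while many DataNodes hold the actual bytes, can be illustrated with a small sketch. This is not a real Hadoop API; the class names, the round-robin placement policy, and the `create_file` method are simplified stand-ins for illustration only (real HDFS placement is rack-aware).

```python
class DataNode:
    """Holds actual block bytes; metadata lives only on the NameNode."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

class NameNode:
    """Single metadata service: maps file paths to block placements."""
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.namespace = {}  # path -> list of (block_id, [datanode names])

    def create_file(self, path, block_ids, replication=3):
        # Simple round-robin placement across DataNodes; each block
        # gets `replication` distinct targets.
        placements = []
        for i, bid in enumerate(block_ids):
            targets = [self.datanodes[(i + r) % len(self.datanodes)].name
                       for r in range(replication)]
            placements.append((bid, targets))
        self.namespace[path] = placements
        return placements

dns = [DataNode(f"dn{i}") for i in range(4)]
nn = NameNode(dns)
layout = nn.create_file("/logs/2024.txt", ["blk_1", "blk_2"])
print(layout)  # [('blk_1', ['dn0', 'dn1', 'dn2']), ('blk_2', ['dn1', 'dn2', 'dn3'])]
```

The point of the sketch is that clients never fetch file data from the NameNode: they ask it where the blocks live, then talk to the DataNodes directly.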
3. How are file operations performed in HDFS?
A:
① HDFS is not a general-purpose file system; its main purpose is streaming access to large files.
② When a client writes a file to HDFS, it first caches the data in a temporary local file.
③ Once the cached data reaches the HDFS block size, the client sends a file-creation request to the NameNode. The NameNode responds to the client with the DataNode identity and the target block.
④ The NameNode also notifies the DataNodes that will hold replicas of the file block. When the client starts sending the temporary file to the first DataNode, the block contents are immediately pipelined to the replica DataNodes.
⑤ The client is responsible for creating a checksum file, which is stored in the same HDFS namespace. After the last file block is sent, the NameNode commits the file creation to its persistent metadata store.
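The write path above, client-side buffering into blocks and then a replication pipeline through the DataNodes, can be sketched as follows. This is a toy model, not Hadoop code: the 8-byte block size, the `pipeline_write` helper, and the plain lists standing in for DataNodes are all illustrative assumptions (real HDFS defaults to a 128 MB block size).

```python
BLOCK_SIZE = 8  # bytes; tiny for illustration, real HDFS defaults to 128 MB

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Client-side buffering: cut the byte stream into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def pipeline_write(block, datanodes):
    """Send the block to the first DataNode; each node stores a copy
    and forwards it to the next replica in the pipeline."""
    for dn in datanodes:
        dn.append(block)

data = b"hello hdfs, streaming write!"
blocks = split_into_blocks(data)

replicas = [[], [], []]  # three DataNodes' local block storage
for blk in blocks:
    pipeline_write(blk, replicas)

# After the pipeline runs, every replica holds the complete file.
assert b"".join(replicas[0]) == data
```

Pipelining is why a triple-replicated write costs the client only one upload: the first DataNode streams each block onward while still receiving it.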
4. The application of Hadoop in practice?
A: Hadoop is widely used on the Internet. For example, Yahoo uses a 4,000-node Hadoop cluster to support its advertising systems and web search.
Facebook uses a 1,000-node Hadoop cluster to store logs, supporting data analysis and machine learning.
Baidu uses Hadoop to process 200 TB of data per week for search log analysis and web page data mining.
Taobao's Hadoop system is used to store and process data related to e-commerce transactions.
5. How do MapReduce and Hadoop compare?
A:
Hadoop is a framework for distributed data storage and computation. It is very good at storing large amounts of semi-structured data. Data is stored redundantly, so the failure of a single disk does not result in data loss. Hadoop is also very good at distributed computing: quickly processing large data sets across multiple machines.
MapReduce is a programming model for processing large semi-structured data sets. A programming model is a way of structuring and solving a specific class of problems.
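The MapReduce programming model is easiest to see in the classic word-count example. The sketch below simulates all three phases (map, shuffle, reduce) in plain Python on a single machine; in real Hadoop these phases run distributed across the cluster, and the function names here are illustrative, not framework APIs.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """map: emit a (word, 1) pair for every word in one input record."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between
    the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """reduce: combine all counts for one word into a total."""
    return key, sum(values)

lines = ["hadoop stores data", "mapreduce processes data"]
intermediate = chain.from_iterable(map_phase(l) for l in lines)
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'hadoop': 1, 'stores': 1, 'data': 2, 'mapreduce': 1, 'processes': 1}
```

Because each map call sees only one record and each reduce call sees only one key, the framework can partition both phases freely across machines, which is the whole appeal of the model.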
6. What is the basic principle of HDFS?
A:
① When HDFS stores data, it first cuts the data into blocks and assigns each block an ordered number.
② Each block is then replicated for backup.
③ The replicated backups are placed on different DataNodes.
④ When a DataNode goes down, the replicas stored on other DataNodes are used to restore the lost data.
⑤ The NameNode manages the DataNodes through heartbeats (which report each node's state) and block reports (which report the data each node stores).
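Steps ①-④ above can be sketched as a small fault-tolerance exercise: number the blocks, place each one on distinct DataNodes, then check that every block survives a node failure. The placement policy and helper names below are illustrative assumptions, not real HDFS internals.

```python
def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each numbered block to `replication` distinct DataNodes
    (round-robin here; real HDFS placement is rack-aware)."""
    n = len(datanodes)
    return {block_no: [datanodes[(block_no + r) % n] for r in range(replication)]
            for block_no in range(num_blocks)}

def surviving_replicas(placement, dead):
    """After one DataNode fails, list the live replicas of every block."""
    return {b: [dn for dn in dns if dn != dead] for b, dns in placement.items()}

nodes = ["dn1", "dn2", "dn3", "dn4"]
plan = place_replicas(4, nodes, replication=3)
after_failure = surviving_replicas(plan, dead="dn2")

# With 3 replicas on distinct nodes, losing any single DataNode
# still leaves at least 2 live copies of every block.
assert all(len(dns) >= 2 for dns in after_failure.values())
```

In real HDFS the NameNode notices the failure via missed heartbeats and re-replicates the under-replicated blocks from the surviving copies, restoring the target replication factor.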