Hadoop: A Distributed Storage and Computing Platform for Big Data (Lecture 3)

Updated 2025-04-27. Source: SLTechnology News&Howtos

1. Hadoop:

Author: Doug Cutting

Inspired by three Google papers (GFS, MapReduce, and Bigtable)

2. Versions:

Apache: the official release (1.1.2 in this lecture), used for learning

Cloudera: adds features on top of the Apache version for commercial use

Yahoo: now focuses on the Apache version

3. Core projects of Hadoop

HDFS (Hadoop Distributed File System): a distributed file system

MapReduce: a parallel computing framework
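To make the MapReduce idea concrete, here is a minimal single-process sketch of a word count written in the map / shuffle / reduce style. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this illustration, and a real job would run the map and reduce steps on many nodes in parallel.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key (done by the framework between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hello hadoop", "hello hdfs"]))
    # prints {'hello': 2, 'hadoop': 1, 'hdfs': 1}
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across machines, which is what makes the model suitable for parallel computing.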

4. HDFS architecture (in a master-slave structure, the master node is responsible for management and the slave nodes carry out the actual operations)

Master-slave structure (there is only one master node, the NameNode, and there can be many slave nodes, the DataNodes)

The NameNode is responsible for:

Receiving users' operation requests

Maintaining the directory structure of the file system

Managing the mapping between files and blocks, and between blocks and DataNodes

The DataNodes are responsible for:

Storing files

Files are split into blocks, which are stored on disk

To protect against data loss, multiple copies (replicas) of each file are kept
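The division of labor above can be sketched with a toy model: a file is cut into fixed-size blocks, and a NameNode-like table records which DataNodes hold a replica of each block. The block size of 64 and the replication factor of 3 are assumptions for this example, as are the DataNode names `dn1` to `dn4`. The round-robin placement is a deliberate simplification and is not HDFS's actual placement policy.

```python
import math


def split_into_blocks(file_size, block_size):
    """Return the size of each block for a file of `file_size` units."""
    count = math.ceil(file_size / block_size)
    return [min(block_size, file_size - i * block_size) for i in range(count)]


def place_replicas(num_blocks, datanodes, replication):
    """Map each block index to `replication` distinct DataNodes (simple round-robin)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    n = len(datanodes)
    return {
        block: [datanodes[(block + r) % n] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(150, 64)  # a 150 MB file with 64 MB blocks
    print(blocks)                        # prints [64, 64, 22]
    table = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], 3)
    print(table[0])                      # prints ['dn1', 'dn2', 'dn3']
```

The dictionary returned by `place_replicas` plays the role of the block-to-DataNode mapping that the NameNode manages, while the DataNodes only hold the block data itself.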

5. MapReduce architecture

Master-slave structure (there is only one master node, the JobTracker, and there can be many slave nodes, the TaskTrackers)

The JobTracker is responsible for:

Receiving computing jobs submitted by clients

Assigning computing tasks to TaskTrackers for execution

Monitoring the execution status of the TaskTrackers

The TaskTrackers are responsible for:

Executing the computing tasks assigned by the JobTracker
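The JobTracker / TaskTracker relationship can be illustrated with a small in-memory model. The class below is a hypothetical toy, not Hadoop's API: it assigns each task to the least-loaded tracker and, when a tracker is reported as failed, hands its tasks to the survivors. In a real cluster the failure would be detected through missed heartbeats rather than an explicit call.

```python
class JobTracker:
    """Toy master: hands tasks to trackers and re-queues work from failed ones."""

    def __init__(self, trackers):
        self.trackers = list(trackers)                      # names of live TaskTrackers
        self.assignments = {t: [] for t in self.trackers}   # tracker -> assigned tasks

    def submit(self, tasks):
        """Assign each task to the currently least-loaded live tracker."""
        if not self.trackers:
            raise RuntimeError("no live TaskTrackers available")
        for task in tasks:
            target = min(self.trackers, key=lambda t: len(self.assignments[t]))
            self.assignments[target].append(task)

    def report_failure(self, tracker):
        """Remove a failed tracker and redistribute its tasks to the survivors."""
        orphaned = self.assignments.pop(tracker)
        self.trackers.remove(tracker)
        self.submit(orphaned)


if __name__ == "__main__":
    jt = JobTracker(["tt1", "tt2"])
    jt.submit(["map-0", "map-1", "map-2"])
    print(jt.assignments)   # prints {'tt1': ['map-0', 'map-2'], 'tt2': ['map-1']}
    jt.report_failure("tt1")
    print(jt.assignments)   # prints {'tt2': ['map-1', 'map-0', 'map-2']}
```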

6. Characteristics of Hadoop:

Scalable: can reliably store and process petabytes (PB) of data

Economical (low cost): data can be distributed and processed on a cluster of ordinary commodity servers

Efficient: by distributing the data, Hadoop can process it in parallel on the nodes where the data is located

Reliable: Hadoop automatically maintains multiple copies of the data and automatically reschedules computing tasks after a task fails
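The "Efficient" point, processing data on the node where it already lives, comes down to a scheduling preference. The following is a minimal sketch of that preference, with a hypothetical helper `pick_node` and made-up node names. It only shows the idea of preferring a data-local node and is not how Hadoop's scheduler is actually implemented.

```python
def pick_node(block_locations, free_nodes):
    """Choose where to run a task that reads one block.

    Returns (node, is_local). A free node that already stores the block is
    preferred; otherwise any free node is used and the block must be read
    over the network.
    """
    if not free_nodes:
        raise ValueError("no free nodes to run the task on")
    for node in block_locations:
        if node in free_nodes:
            return node, True               # data-local: no network transfer needed
    return sorted(free_nodes)[0], False     # remote: fall back to any free node


if __name__ == "__main__":
    print(pick_node(["dn2", "dn3"], {"dn1", "dn3"}))   # prints ('dn3', True)
    print(pick_node(["dn2"], {"dn1", "dn4"}))          # prints ('dn1', False)
```

Because each block has several replicas, there are usually several candidate nodes for a local run, which is one reason replication helps performance as well as reliability.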

7. Physical distribution of a Hadoop cluster

Description:

a. The racks in the figure are two cabinets, each holding multiple servers. Each rack is connected to its own switch, and the two rack switches are connected to a core switch, so servers in either rack can reach one another.

b. The two master nodes each occupy a server of their own, while the slave nodes are paired up and run together on the same server.
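The rack layout described above implies a simple notion of network distance: count the links a message crosses between two servers. The sketch below assumes nodes are written as `/rack/host` paths (names such as `/rack1/node1` are invented for the example) and the two-level switch layout from the description. It follows the distance convention commonly described for Hadoop's rack awareness but is not taken from Hadoop's source.

```python
def network_distance(a, b):
    """Count the network links between two nodes given as '/rack/host' paths.

    0 = same server
    2 = same rack (server -> rack switch -> server)
    4 = different racks (server -> rack switch -> core switch -> rack switch -> server)
    """
    rack_a, host_a = a.strip("/").split("/")
    rack_b, host_b = b.strip("/").split("/")
    if rack_a != rack_b:
        return 4
    return 0 if host_a == host_b else 2


if __name__ == "__main__":
    print(network_distance("/rack1/node1", "/rack1/node1"))  # prints 0
    print(network_distance("/rack1/node1", "/rack1/node2"))  # prints 2
    print(network_distance("/rack1/node1", "/rack2/node3"))  # prints 4
```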

8. Physical structure of a single node

Description: the left and right diagrams show the master node and the slave node, respectively. Both are Linux servers, and the Hadoop processes on them run inside a Java virtual machine, because Hadoop is written in Java.

9. Hadoop deployment modes

Local (standalone) mode (rarely used)

Pseudo-distributed mode (used for learning)

Cluster mode (used by companies)

10. Software to prepare before installation

VirtualBox

CentOS

jdk-6u24-linux-xxx.bin

hadoop-1.1.2.tar.gz

11. Pseudo-distributed mode installation steps (6 steps):

Turn off the firewall

Modify the IP address

Modify the hostname

Set up passwordless SSH login

Install the JDK

Install Hadoop
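As an illustration of what the final "install Hadoop" step typically involves, the sketch below renders the kind of `*-site.xml` settings a Hadoop 1.x pseudo-distributed setup uses. The hostname `hadoop0` and the ports 9000 and 9001 are assumptions chosen for the example, not values from this lecture. The helper `render_site_xml` is hypothetical, and a complete setup needs further steps (such as environment variables and formatting the NameNode) that are not shown here.

```python
from xml.sax.saxutils import escape


def render_site_xml(properties):
    """Render a dict of settings in Hadoop's <configuration> XML layout."""
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name, value in properties.items():
        lines += [
            "  <property>",
            f"    <name>{escape(name)}</name>",
            f"    <value>{escape(str(value))}</value>",
            "  </property>",
        ]
    lines.append("</configuration>")
    return "\n".join(lines)


HOSTNAME = "hadoop0"  # assumed hostname, as set in the "modify the hostname" step

CONFIGS = {
    # Address of the NameNode (port 9000 is an assumed, conventional choice)
    "core-site.xml": {"fs.default.name": f"hdfs://{HOSTNAME}:9000"},
    # Only one DataNode exists in pseudo-distributed mode, so keep one replica
    "hdfs-site.xml": {"dfs.replication": 1},
    # Address of the JobTracker (port 9001 is likewise an assumption)
    "mapred-site.xml": {"mapred.job.tracker": f"{HOSTNAME}:9001"},
}

if __name__ == "__main__":
    for filename, props in CONFIGS.items():
        print(f"--- {filename} ---")
        print(render_site_xml(props))
```

Setting the replication factor to 1 reflects that a single machine plays every role in pseudo-distributed mode, so there is nowhere else to put additional copies.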
