Getting started with Hadoop

2025-02-28 Update · From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/03 report

Introduction to MapReduce & HDFS

1. Brief introduction to Hadoop:

Structured data: tables in relational databases // with strict constraints

Semi-structured data: HTML, JSON, YAML, metadata // constrained, but lacking strict schemas

Unstructured data: no predefined model, no metadata // log data, etc.

Search engine: a search component plus an index component

Web crawler: most of the crawled content is semi-structured or unstructured data

Build an inverted index [supporting exact search, or fuzzy search based on relevance matching] and store it in a storage system [non-RDBMS].
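As a minimal sketch of the idea (the function and variable names are my own, not from any particular search engine), an inverted index simply maps each term to the documents that contain it:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it.

    `docs` is {doc_id: text}. A real engine would also normalize,
    tokenize properly, and store positions for phrase queries.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "hadoop stores big data", 2: "hadoop runs mapreduce"}
index = build_inverted_index(docs)
# Looking up a term returns every document that mentions it,
# e.g. index["hadoop"] contains both document ids.
```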

2003: The Google File System // how Google implements file storage; it does not support random or real-time access to data and is only suitable for storing a small number of large files.

If a crawled HTML page changes, it must be modified in place, and GFS cannot meet that requirement.

2004: MapReduce: Simplified Data Processing on Large Clusters // a programming model in which a task runs on each node and the results are then collected

2006: BigTable: A Distributed Storage System for Structured Data // distributed storage of structured data

The open-source counterparts:

GFS -> HDFS

MapReduce -> MapReduce

BigTable -> HBase

HDFS + MapReduce = Hadoop // named after a toy elephant belonging to the author's son

HBase: Hadoop's database

Nutch: a web crawler that feeds data to Lucene

A flaw of Hadoop: MapReduce is a batch program.

HDFS uses a storage layout with a central (metadata) node.

Client
  |
Metadata node
  |
=====================================
node1   node2   node3   ...   nodeN
 1       1'      2             2'

Data query process: client -> metadata node [which nodes hold the data] -> [node1, node2, node3 ... nodeN] -> client. A user query [written as code] first calls the MapReduce development framework, and the framework then runs it.

Map: the code runs on node1 and node2 respectively; each node processes its own portion

// node1 holds block 1, node2 holds block 2

Reduce: finally, merge the results from node1 and node2

Final speed: determined by the slowest node
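As a single-process sketch of that flow (function and variable names are my own, not Hadoop APIs): each "node" maps over only its own blocks, and reduce merges the partial results.

```python
def map_on_node(blocks):
    # Per-node work: here, just sum the records held on this node.
    return sum(blocks)

def reduce_partials(partials):
    # Merge the per-node partial results into one final answer.
    return sum(partials)

# Each node sees only its own data, mirroring blocks 1/1'/2/2' above.
node_blocks = {"node1": [1, 2], "node2": [3, 4]}
partials = [map_on_node(b) for b in node_blocks.values()]
total = reduce_partials(partials)  # reduce waits for every node's result
```

Note that `reduce_partials` cannot finish until all partials arrive, which is exactly why the overall speed is set by the slowest node.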

MapReduce:

1. Develop API

2. Operation framework

3. Provide a runtime environment

Disadvantages of NAS and SAN: there is only one storage system, and in the face of massive data access, disk I/O and network I/O come under great pressure.

Hence distributed storage.

II. HDFS and MapReduce

1. No central node

2. With a central node [HDFS] // the metadata node is both the bottleneck and the core // GFS, HDFS

Metadata node: NN (NameNode) // needs HA; its data is held in memory, persisted on the side.

// a transaction log is written to persistent storage and replayed after a crash, reducing data loss.

Backend hosts must keep both the service available and the data available // DN: DataNode

After a crash, a file-system check may be needed, and with a large data volume a lot of time is wasted.

// put simply: once the NN crashes, restarting can take half an hour, because the Hadoop 1.x NN does not support HA

Later the SNN (SecondaryNameNode) became available.

The NN must constantly update its in-memory data, write the transaction log, and merge the log into the image file.

SNN: responsible for that merging. If the NN crashes, the SNN loads the files from shared storage and takes over.

This saves time, but checking the file system still takes no less time; you merely avoid having to repair the NN immediately.

LB: distributes different requests to different hosts

Client
  |
NN -- SNN
  |
[mirror] [SHM]
  |
=====================================
node1   node2   node3   ...   nodeN
 1       1'      2             2'

After HDFS 2.0, the NN can be highly available.

Metadata is no longer kept in local storage but in shared, memory-based storage.

For example: NFS [prone to split-brain, not commonly used], or ZooKeeper.

The update operations of NN1 and NN2 are synchronized through ZooKeeper, so each node gets the same data from ZooKeeper.

// ZooKeeper: a distributed coordination tool (distributed locks); Google's equivalent is Chubby (not open source)

http://www.cnblogs.com/wuxl360/p/5817471.html // reference

DataNode: the node that stores the data blocks (chunks)

Each block is replicated to other nodes; 3 copies are stored by default.

On a write, the client stores one copy, and HDFS finds two other nodes to store the rest.

Each storage node periodically reports its stored block information plus its own status to the metadata node.

The NN keeps two tables:

1. Block-centric: which nodes each block is distributed on

2. Node-centric: which blocks each node holds
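A toy model of those two tables (the class and field names are mine, not HDFS internals) shows how one block report updates both views at once:

```python
from collections import defaultdict

class ToyNameNode:
    def __init__(self):
        self.block_to_nodes = defaultdict(set)  # table 1: block -> nodes
        self.node_to_blocks = defaultdict(set)  # table 2: node -> blocks

    def report(self, node, blocks):
        """A DataNode's periodic block report updates both tables."""
        for b in blocks:
            self.block_to_nodes[b].add(node)
            self.node_to_blocks[node].add(b)

nn = ToyNameNode()
nn.report("node1", ["blk_1", "blk_2"])
nn.report("node2", ["blk_1"])
# blk_1 is now known to live on both node1 and node2.
```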

How the data is processed: // the cluster that runs the program

MapReduce: works in the cluster.

Map: scattered computation

Reduce: merging

A task can be split into several Map tasks; this is controlled by the MapReduce framework.

A master node is needed for scheduling: the JobTracker.

Ideally, the node that holds the requested data runs the task // but that node may already be busy

If the nodes holding the data are busy, the options are:

1. Wait

2. Find the node holding a replica // the replica node may also be busy

3. Find an idle node and run the task there // the blocks may have to be copied to this idle node
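That preference order can be sketched as a small scheduling function (a hedged illustration with invented names; the real JobTracker scheduler is far more involved):

```python
def pick_node(data_nodes, replica_nodes, busy, idle_nodes):
    """Prefer a node that holds the data, then one holding a replica,
    then any idle node (which must first copy the blocks over)."""
    for n in data_nodes:          # option 0: data-local execution
        if n not in busy:
            return n, "data-local"
    for n in replica_nodes:       # option 2: run where a replica lives
        if n not in busy:
            return n, "replica-local"
    for n in idle_nodes:          # option 3: copy data to an idle node
        if n not in busy:
            return n, "needs-copy"
    return None, "wait"           # option 1: everyone is busy, wait
```

For example, `pick_node(["n1"], ["n2"], busy={"n1"}, idle_nodes=["n3"])` falls through to the replica node `"n2"`.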

For HDFS, the data nodes are shared by storage (HDFS) and processing (MapReduce).

Client
  |
JobTracker
  |
=====================================
node1   node2   node3   ...   nodeN
 1       1'      2             2'

// at this point these nodes are no longer called just DataNodes but also TaskTrackers.

These nodes run two kinds of processes: DataNode and TaskTracker // responsible for data storage and data processing respectively

So Hadoop is a combination of two overlaid clusters: the same nodes both store and process the data.

III. Data processing model

Conventionally, the node running the program loads the data onto itself // the data moves to the program

Hadoop is data-centric and moves the program to run on the node where the data lives // the program moves to the data

The work of the JobTracker and the NameNode does not conflict, so they can be deployed on the same node.

[JobTracker/NameNode]
  |
=====================================
TaskTracker/DataNode1   TaskTracker/DataNode2   TaskTracker/DataNode3 ...

A submitted task does not necessarily run on all nodes; it is likely to run on only a few of them.

You can limit the maximum number of tasks a node may run concurrently.

IV. Functional programming:

Lisp, ML: functional programming languages with higher-order functions:

map, fold

Map: maps one task onto multiple tasks

Fold: folding (reduction)

Example: map(f(), list) // map runs f as multiple copies, each on one of several nodes.

Map: takes a function as an argument and applies it to every element of a list.

Example list: 1, 2, 3, 4, 5

To get everyone's age, f is executed on 1, 2, 3, 4 and 5 respectively.

For example, the result after map is: 22, 33, 44, 12, 34.

Fold: accepts two parameters: 1. a function g, 2. an initial value init

fold(g(), init)

g combines init with the first element; that result is then combined with the second element by g, and so on down the list.

Example, on 22, 33, 44, ...: fold(g, init) => g(g(g(init, 22), 33), 44) => ...

When g picks the larger of its two arguments, the fold finally yields the largest element.
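In Python these two higher-order functions are the built-in `map` and `functools.reduce` (Python's fold); a small sketch using the ages above:

```python
from functools import reduce  # Python's fold

ages = [22, 33, 44, 12, 34]

# map: apply a function to every element of the list
doubled = list(map(lambda x: x * 2, ages))

# fold: g(g(g(g(g(init, 22), 33), 44), 12), 34); with g = max and
# init = 0 this yields the largest element
oldest = reduce(lambda acc, x: max(acc, x), ages, 0)
```

Here `oldest` works out to 44, matching the "find the biggest one" example.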

MapReduce: // after a program calls the API, it is split into two phases

Mapper: instances that run on TaskTrackers, each producing a list of results

Reducer: merges the many results produced by the mappers

Counting how many times each word appears in a book:

Mapper: one unit per 100 pages; 5 mappers each split their pages into words and count them.

Say the pages break down into 10,000 words, many of them repeated.

The mappers must ensure that duplicate words are sent to the same reducer.

This step is called shuffle and sort // the transfer-and-sort process

Reducer

Reducer1, reducer2 // start two reducers; mappers send to reducer1 and reducer2 in turn, but the same word always goes to the same reducer, ensuring that each reducer counts a disjoint set of words.

Final merge

Reducer1:

this: 500

is: 10

Reducer2:

how: 30

do: 20

Merging the two gives the final result.

MapReduce operates on key-value data; input that is not in key-value form must first be converted.

Mapper: // converts the input into kv data

this: 1, is: 1, this: 1, is: 1 // each occurrence is marked as 1

Data with the same key must be sent to the same reducer.

Reducer: // input and output are also kv data

For example, add up all the values whose key is "this":

this: 500

is: 20

Mapper -> reducer may need to run in several passes to reach the final result, each pass with a different goal.

A mapper can also aggregate its local data first and send the partial counts to the reducer:

this: 500

is: 20

// reducers can be started alongside the mappers, or run only after the mappers finish

Who guarantees that the same key reaches the same reducer?

The MapReduce framework decides // how many reducers to start is defined by the programmer
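A runnable single-process sketch of this word count (the function names and the routing are my own illustration; real Hadoop jobs implement Mapper/Reducer classes against the Java API):

```python
from collections import defaultdict

def mapper(page_text):
    # Emit (word, 1) for every word on this mapper's pages.
    return [(w, 1) for w in page_text.lower().split()]

def shuffle_and_sort(all_pairs, n_reducers):
    # Route every pair with the same key to the same reducer bucket,
    # using a deterministic hash of the key.
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for key, value in all_pairs:
        buckets[sum(map(ord, key)) % n_reducers][key].append(value)
    return buckets

def reducer(bucket):
    # Sum the 1s for each word this reducer owns.
    return {key: sum(values) for key, values in bucket.items()}

pages = ["this is this", "how do this"]          # two mappers' inputs
pairs = [kv for page in pages for kv in mapper(page)]
counts = {}
for bucket in shuffle_and_sort(pairs, 2):        # two reducers
    counts.update(reducer(bucket))               # final merge
```

Because the routing hash is the same for every mapper, all copies of "this" land in one bucket, so each reducer counts a disjoint set of words.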

MapReduce:

1. Develop API

2. Operation framework

3. Provide a runtime environment

Hadoop can realize parallel processing.

HDFS + MapReduce = Hadoop

A program that calls the MapReduce API
  |
[NameNode] [JobTracker]
  |
=====================================
TaskTracker/DataNode1   TaskTracker/DataNode2   TaskTracker/DataNode3 ...

Typical applications of Hadoop include search, log processing, recommendation systems, data analysis, video and image analysis, data archiving, etc.

Figure 1: MapReduce framework

V. MapReduce working model

=====================================
[k1|m] [k2|n] [k3|r] [k4|s]        [k5|m] [k6|t] [k7|m]
     \    |    /                        \    |    /
       [mapper]                           [mapper]
          |                                  |
          v                                  v
[ik1|3] [ik3|1] [ik1|6] [ik3|2]    [ik1|1] [ik1|4] [ik2|3] [ik2|6]
      [partitioner]                      [partitioner]
          +                                  +
      // shuffle & sort: aggregate values by key //
          +
[ik1|3,6,1,4]        [ik2|3,6]        [ik3|1,2]
      |                  |                |
      v                  v                v
   reducer            reducer          reducer
      |                  |                |
      v                  v                v
  [ok1|14]            [ok2|9]          [ok3|3]
=====================================

// mapper: reads key-value pairs and generates key-value pairs

// combiner: runs right after its mapper and merges the identical keys produced by that one mapper, nothing more; its input and output key types must be the same.

// partitioner: responsible for routing the same key to the same reducer; every mapper runs an identical partitioner.

It is up to the partitioner to decide which reducer each generated key-value pair is sent to.
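A minimal sketch of such a routing function (CRC32 is my choice for a deterministic illustration; Hadoop's default `HashPartitioner` uses the key's `hashCode` modulo the reducer count):

```python
import zlib

def partitioner(key, n_reducers):
    # Deterministic hash of the key: because every mapper runs this
    # same function, a given key always routes to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % n_reducers

# The same key yields the same reducer index on every call,
# regardless of which mapper emitted it.
r1 = partitioner("hadoop", 3)
r2 = partitioner("hadoop", 3)
```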

=====================================
[k1|m] [k2|n] [k3|r] [k4|s]        [k5|m] [k6|t] [k7|m]
     \    |    /                        \    |    /
       [mapper]                           [mapper]
          |                                  |
          v                                  v
[ik1|3] [ik3|1] [ik1|6] [ik3|2]    [ik1|1] [ik1|4] [ik2|3] [ik2|6]
       [combiner]                         [combiner]
          |                                  |
          v                                  v
[ik1|9] [ik3|3]                    [ik1|5] [ik2|9]
      [partitioner]                      [partitioner]
          +                                  +
      // shuffle & sort: aggregate values by key //
          +
[ik1|9,5]            [ik2|9]          [ik3|3]
      |                  |                |
      v                  v                v
   reducer            reducer          reducer
      |                  |                |
      v                  v                v
  [ok1|14]            [ok2|9]          [ok3|3]
=====================================

// both the combiner and the partitioner can be written by the programmer

The node where a mapper starts may not hold the target shard; if it must process n shards, the missing shards are first copied over from other nodes, and only then does the mapper run.
