Introduction to MapReduce & HDFS
I. Brief introduction to Hadoop
Structured data: tables, relational databases // with strict constraints
Semi-structured data: html, json, yaml, metadata // constrained, but without strict constraints
Unstructured data: no predefined model, no metadata // log data, etc.
Search engine: a search component plus an index component
Web crawler: most of the crawled content is semi-structured or unstructured data
Build an inverted index [for exact search, or fuzzy search based on relevance matching] and store it in a storage system [non-RDBMS].
2003: "The Google File System" // how Google stores files; it does not support random or real-time access to data and is only suited to storing a small number of large files.
If a crawled html page changes, the stored copy must be modified, and GFS cannot meet that requirement.
2004: "MapReduce: Simplified Data Processing on Large Clusters" // the MapReduce programming model, in which a task runs on each node and the results are then collected
2006: "BigTable: A Distributed Storage System for Structured Data" // distributed storage of structured data
GFS -> its open-source clone, HDFS
MapReduce -> MapReduce
BigTable -> HBase
HDFS + MapReduce = Hadoop // named after a toy elephant belonging to the author's son
HBase: Hadoop's database
Nutch: a web crawler that feeds crawled data to Lucene
A flaw of Hadoop: MapReduce is a batch-processing program.
HDFS uses a storage model with a central node:
            Client
              |
        Metadata node
              |
  =================================
  node1    node2    node3 ... noden
   1        1'       2         2'
Data query process: client -> metadata node [which nodes hold the data] -> [node1, node2, node3 ... noden] -> client. A user query [written as code] -> calls the MapReduce development framework -> the framework runs it.
Map: the code runs on node1 and node2 respectively, and each node handles its own part
// node1 holds block 1, node2 holds block 2
Reduce: finally, merge the results produced on node1 and node2
Final speed: depends on the slowest node
MapReduce provides three things (sketched below):
1. A development API
2. An execution framework
3. A runtime environment
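To make that division of labor concrete, here is a minimal, single-process sketch of the programming contract: the programmer writes a map function and a reduce function, and the framework does the grouping and invocation. The names (run_job, the word-count callbacks) are illustrative only, not Hadoop's API.

# A minimal, single-process sketch of the MapReduce programming contract.
# run_job is a stand-in for the framework, not Hadoop's real API.
from collections import defaultdict

def map_fn(key, value):
    """User code: emit intermediate (key, value) pairs for one input record."""
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    """User code: merge all values that share one intermediate key."""
    yield key, sum(values)

def run_job(records, map_fn, reduce_fn):
    """Framework stand-in: runs map, then shuffle/sort, then reduce."""
    groups = defaultdict(list)
    for k, v in records:
        for ik, iv in map_fn(k, v):      # map phase
            groups[ik].append(iv)        # shuffle: group values by key
    results = []
    for ik in sorted(groups):            # sort by key
        results.extend(reduce_fn(ik, groups[ik]))
    return results

print(run_job([(0, "this is a book"), (1, "this is fun")], map_fn, reduce_fn))
# [('a', 1), ('book', 1), ('fun', 1), ('is', 2), ('this', 2)]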
Disadvantages of NAS and SAN: there is only one storage system, and with massive data every access goes through it, so disk IO and network IO face great pressure.
Hence distributed storage.
II. HDFS and MapReduce
1. Without a central node
2. With a central node: HDFS // the metadata node is both the bottleneck and the core // GFS, HDFS
Metadata node: NN (NameNode) // needs HA; its in-memory data must be persisted, because the data it serves is held in memory.
// transaction log: updates are written to persistent storage and replayed after a crash, to reduce data loss.
The back-end hosts must keep both the service available and the data available // DN: DataNode
A filesystem check may be needed after a crash, and with a large amount of data this wastes a lot of time.
// simply put: once the NN crashes, restarting it can take half an hour, because the Hadoop 1.x NN does not support HA
Later, the SNN (SecondaryNameNode) became available.
The NN must constantly update its in-memory data, write the edit log, merge the log with the image file, and so on.
SNN: responsible for the merging. If the NN crashes, the SNN loads the files from shared storage and takes over.
This saves some time, but the filesystem check still takes just as long; you merely avoid having to repair the NN immediately.
LB: distribute different requests to different hosts
            Client
              |
           NN -- SNN
           |       |
      [image]  [shared storage]
              |
  =================================
  node1    node2    node3 ... noden
   1        1'       2         2'
After HDFS 2.0, the NN can be made highly available.
Metadata is no longer kept in local storage but in shared, memory-based storage.
For example: NFS [prone to split-brain, not commonly used], ZooKeeper
The update operations of NN1 and NN2 are synchronized through ZooKeeper, so each node can get the same data from ZooKeeper.
// ZooKeeper: a distributed coordination tool (distributed lock); Google's equivalent is Chubby (not open source)
http://www.cnblogs.com/wuxl360/p/5817471.html // reference
Data node: stores the individual data blocks (chunks)
Each block is replicated to other nodes; 3 copies are stored by default.
When a client stores one copy, HDFS finds two other nodes to store the remaining copies.
Each storage node periodically reports its stored block information plus its own state to the metadata node.
The NN keeps two tables (see the sketch below):
1. Block-centric: which nodes each block is distributed on
2. Node-centric: which blocks each node holds
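As a rough illustration (the names are made up, not HDFS code), the two tables can be pictured as two dictionaries kept consistent by the periodic block reports:

# Sketch of the NameNode's two in-memory views, rebuilt from the
# periodic block reports each DataNode sends (illustrative names only).
from collections import defaultdict

block_to_nodes = defaultdict(set)   # table 1: which nodes hold each block
node_to_blocks = defaultdict(set)   # table 2: which blocks each node holds

def on_block_report(node, blocks):
    """Handle one DataNode's report: update both tables together."""
    node_to_blocks[node] = set(blocks)
    for b in blocks:
        block_to_nodes[b].add(node)

# Default replication factor 3: block "blk_1" is reported by three DataNodes.
on_block_report("node1", ["blk_1", "blk_2"])
on_block_report("node2", ["blk_1"])
on_block_report("node3", ["blk_1", "blk_2"])

print(block_to_nodes["blk_1"])   # the three nodes holding blk_1 (order varies)
print(node_to_blocks["node1"])   # the blocks stored on node1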
Processing the data: // the cluster that runs the programs
MapReduce: works in a cluster.
Map: scattered, distributed execution
Reduce: merging the results
A task can be divided into several maps, controlled by the MapReduce framework.
A master node is needed for scheduling: the JobTracker.
Ideally, a task runs on the node that holds the data it requests // but that node may already be busy
If the node holding the data is busy, the options are (see the sketch below):
1. Wait
2. Find the node where a replica is located // the replica's node may also be busy
3. Find an idle node and run the task there // the block may have to be copied to this idle node
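A minimal sketch of that preference order, with illustrative names only (this is not JobTracker code): prefer an idle node that already holds the block or a replica, otherwise any idle node that the block can be copied to, otherwise wait.

# Illustrative sketch of the locality preference described above.
def pick_node(block_nodes, busy, all_nodes):
    """block_nodes: nodes holding the block or a replica; busy: busy nodes."""
    for n in block_nodes:            # 1) prefer an idle node that has the data
        if n not in busy:
            return n, "local"
    for n in all_nodes:              # 2) otherwise any idle node; the block
        if n not in busy:            #    must then be copied over the network
            return n, "copy block here"
    return None, "wait"              # 3) everyone is busy: queue the task

print(pick_node({"node1", "node2"}, busy={"node1", "node2"},
                all_nodes=["node1", "node2", "node3"]))
# ('node3', 'copy block here')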
For HDFS and MapReduce,
the data nodes are shared by both:
            Client
              |
          JobTracker
              |
  =================================
  node1    node2    node3 ... noden
   1        1'       2         2'
// from the task-scheduling point of view, these nodes are called TaskTrackers rather than DataNodes.
These nodes run two types of processes: DataNode + TaskTracker // responsible for storing and for processing the data
So a Hadoop cluster is a combination of two overlaid clusters: the same nodes both store and process the data.
III. Data processing model
Traditional model: the node running the program loads the data onto itself // the data moves to the program
Hadoop is data-centric: the program is shipped to run on the node where the data lives // the program moves to the data
The work of the JobTracker and the NameNode does not conflict, so they can be deployed on the same node:

            [JobTracker/NameNode]
                     |
  =============================================================
  TaskTracker/DataNode1  TaskTracker/DataNode2  TaskTracker/DataNode3 ...
A submitted task does not necessarily run on all nodes; it is likely to run on only a few of them.
You can limit the maximum number of tasks a node may run.
IV. Functional programming
Lisp, ML: functional programming languages with higher-order functions:
map, fold
Map: maps one task onto multiple tasks
Fold: folding (accumulating)
Example: map(f(), list) // map runs the function f as multiple copies, each on one of multiple nodes.
Map: takes a function as an argument and applies it to all elements in a list
Example list: 1, 2, 3, 4, 5
To get everyone's age, execute the function on 1, 2, 3, 4, 5 respectively.
For example, the result after map is: 22, 33, 44, 12, 34.
Fold: accepts two parameters, 1: a function g, 2: an initial value init
fold(g(), init)
g first combines init with the first element; that result is then combined by g with the second element, and so on down the list.
Example: over 22, 33, 44, ...: fold computes g(22, init) => g(33, g(22, init)) => g(44, g(33, ...)) ...
and finally yields the largest one.
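The same two ideas in Python's own terms: the built-in map applies a function to every element, and functools.reduce is a left fold that threads an accumulator through the list. The sample data below is our own.

# map and fold (reduce) over the age example above.
from functools import reduce

people = [("tom", 22), ("ann", 33), ("joe", 44), ("amy", 12), ("bob", 34)]

# map: apply one function f to every element of the list
ages = list(map(lambda p: p[1], people))             # [22, 33, 44, 12, 34]

# fold: g takes (accumulator, element); init seeds the accumulator,
# and each result is fed into the next call, element by element
oldest = reduce(lambda acc, x: max(acc, x), ages, 0)  # 44

print(ages, oldest)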
MapReduce: // any program that calls the API is split into two phases
Mapper: an instance running on a TaskTracker that produces a list of results
Reducer: merges the many result lists obtained from the mappers
Count the number of times each word appears in a book:
Mapper: one unit per 100 pages; 5 mappers split their pages into words and count them
Say the text has been broken into 10,000 words, many of which are repeated.
The mappers must ensure that duplicate words are sent to the same reducer.
This is called shuffle and sort // the transfer-and-sort phase
Reducer:
Reducer1, reducer2 // two reducers are started; the mappers send words to reducer1 and reducer2 in turn, always sending the same word to the same reducer, which guarantees that the words counted by each reducer are disjoint.
Final merge:
Reducer1:
this: 500
is: 10
Reducer2:
how: 30
do: 20
Merging the two gives the final result.
The data MapReduce processes are all key-value pairs; data that is not already in kv form must first be converted to kv.
Mapper: // converts to kv data (see the word-count sketch below)
this:1, is:1, this:1, is:1 ... // each occurrence is marked with a 1
Data with the same key must be sent to the same reducer.
Reducer: // the output is also kv data
For example, add up all the values for the key "this":
this: 500
is: 20
Mapper -> reducer may need to run multiple times to reach the final result, each pass with a different goal.
A mapper can also aggregate its local data first and send only the partial counts to the reducer:
this: 500
is: 20
// a reducer can start together with the mappers, or only run after the mappers finish
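A common way to run exactly this word count on Hadoop is the Hadoop Streaming interface, where the mapper and reducer are plain scripts that read stdin and write tab-separated kv lines to stdout. A sketch follows; the file names mapper.py and reducer.py are our own.

#!/usr/bin/env python3
# mapper.py: word-count mapper in the Hadoop Streaming style; reads
# lines on stdin and emits "word<TAB>1" for every occurrence.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py: Streaming delivers mapper output sorted by key, so all
# counts for one word arrive consecutively and can be summed in one pass.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")   # key changed: flush this word
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")           # flush the last word

Locally, the same pipeline can be simulated with: cat book.txt | python3 mapper.py | sort | python3 reducer.py, where the sort command plays the role of shuffle and sort.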
Who guarantees that the same key is sent to the same reducer?
It is determined by the MapReduce framework // how many reducers to start is defined by the programmer
MapReduce provides:
1. A development API
2. An execution framework
3. A runtime environment
Hadoop makes parallel processing possible.
HDFS + MapReduce = Hadoop

  A program that calls the MapReduce API
               |
     [NameNode]   [JobTracker]
               |
  =============================================================
  TaskTracker/DataNode1  TaskTracker/DataNode2  TaskTracker/DataNode3 ...
Typical applications of Hadoop include: search, log processing, recommendation systems, data analysis, video and image analysis, data archiving, etc.
Figure 1: the MapReduce framework
V. MapReduce working model
==================================================================
 [k1|m] [k2|n] [k3|r] [k4|s]         [k5|m] [k6|t] [k7|m]
    \      |      |      /              \      |      /
          [mapper]                         [mapper]
              |                                |
              v                                v
 [ik1|3] [ik3|1] [ik1|6] [ik3|2]   [ik1|1] [ik1|4] [ik2|3] [ik2|6]
              |                                |
         [partitioner]                    [partitioner]
              +----------------+---------------+
         // shuffle & sort: aggregate values by key //
              +----------------+---------------+
  [ik1|3,6,1,4]           [ik2|3,6]           [ik3|1,2]
        |                     |                   |
        v                     v                   v
    [reducer]             [reducer]           [reducer]
        |                     |                   |
        v                     v                   v
    [ok1|14]              [ok2|9]             [ok3|3]
==================================================================
// mapper: reads key-value pairs and generates intermediate key-value pairs
// combiner: runs after the mapper and merges identical keys on that mapper only; its input and output key types must be the same
// partitioner: responsible for sending the same key to the same reducer; all partitioners run the same logic
The partitioner decides where each generated key-value pair is sent.
==================================================================
 [k1|m] [k2|n] [k3|r] [k4|s]         [k5|m] [k6|t] [k7|m]
    \      |      |      /              \      |      /
          [mapper]                         [mapper]
              |                                |
              v                                v
 [ik1|3] [ik3|1] [ik1|6] [ik3|2]   [ik1|1] [ik1|4] [ik2|3] [ik2|6]
              |                                |
         [combiner]                       [combiner]
              |                                |
              v                                v
      [ik1|9] [ik3|3]                  [ik1|5] [ik2|9]
              |                                |
         [partitioner]                    [partitioner]
              +----------------+---------------+
         // shuffle & sort: aggregate values by key //
              +----------------+---------------+
   [ik1|9,5]              [ik2|9]             [ik3|3]
        |                     |                   |
        v                     v                   v
    [reducer]             [reducer]           [reducer]
        |                     |                   |
        v                     v                   v
    [ok1|14]              [ok2|9]             [ok3|3]
==================================================================
// both the combiner and the partitioner are written by the programmer
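For intuition, the default rule a partitioner applies is essentially hash-and-mod (Hadoop's HashPartitioner works this way on the key's hashCode). A Python sketch:

# Sketch of the default partitioning rule: hashing the key modulo the
# number of reducers sends every record with the same key to the same
# reducer, regardless of which mapper produced it.
def partition(key, num_reducers):
    return hash(key) % num_reducers   # stable within one run of the job

for k in ["this", "is", "this", "how"]:
    print(k, "-> reducer", partition(k, 2))
# "this" maps to the same reducer index both times it appears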
The node where a mapper starts may not hold the target split; if it has to process n splits, the missing splits must be copied from other nodes to this node before the mapper program runs.