I. Basic concepts and models
1. Big data
Structured data: conforms to a strictly defined schema
Semi-structured data: HTML, JSON, XML, etc.; documents that carry structure but are not bound by a rigid schema
Unstructured data: no metadata, e.g. log files
Search engine: e.g. ELK, composed of a search component and an index component, used to search data and store it in distributed storage
Crawlers: collect semi-structured and unstructured data
Together these create a need for efficient storage capacity plus an efficient analysis and processing platform
2. Hadoop
Hadoop is developed in Java and is essentially an open-source implementation of three Google papers:
2003: The Google File System -> HDFS
2004: MapReduce: Simplified Data Processing on Large Clusters -> MapReduce
2006: Bigtable: A Distributed Storage System for Structured Data -> HBase
HDFS + MapReduce = Hadoop
HDFS: Hadoop's distributed file system, with a central node coordinating data storage
MapReduce: a computing model, framework, and platform for parallel processing of big data
HBase: Hadoop's database
Official website: hadoop.apache.org
II. HDFS
1. The HDFS name-node problem
There is a name node (NN: NameNode) and a secondary name node (SNN: Secondary NameNode).
NN metadata lives in memory, because it changes so quickly that hard-disk storage cannot keep up. An image file is therefore persisted to disk through a mechanism similar to a database transaction log: changes are continuously appended to an edit log, which is periodically merged into the image and cleared. So if the NN server crashes, at worst not too many files are lost; however, because file metadata may be inconsistent after the crash, file verification is required, and with large volumes of data it takes a long time.
The SNN is an auxiliary name node: the NN's append-only edit log is placed on shared storage so that the SNN can access it. When the NN goes down, the SNN can start up promptly from that log, which at least improves on the verification time required when there is only one NN.
Since the Hadoop 2 and HBase 2 releases, this shared state can be stored in ZooKeeper, through which several nodes can obtain the same view simultaneously. This solves the problem above and also allows a high-availability setup.
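As a concrete illustration, here is a minimal sketch of such a high-availability setup in Java, assuming the standard HDFS HA property names (normally placed in hdfs-site.xml rather than set in code); the nameservice id and host names are hypothetical:

import org.apache.hadoop.conf.Configuration;

public class HaConfSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Two NameNodes behind one logical nameservice ("mycluster" is a made-up id).
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "host1:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "host2:8020");
        // Shared edit log on a JournalNode quorum, so the standby NN can replay it.
        conf.set("dfs.namenode.shared.edits.dir",
                 "qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster");
        // ZooKeeper coordinates automatic failover between the two NNs.
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
        conf.set("ha.zookeeper.quorum", "zk1:2181,zk2:2181,zk3:2181");
    }
}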
2. How HDFS data nodes work
Data nodes (DN: DataNode) hold the actual data.
When data is stored, the HDFS file system writes it to one DN and then finds two more data nodes to hold copies. The data nodes are chained together: only the node with the first copy writes the second copy, and only the node with the second copy writes the third. After each write, every node reports its status and block list to the metadata server (the name node). When data is lost at some point, the chain is replayed to make up the missing data blocks.
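A minimal sketch of triggering such a replicated write from the client side, using the stock FileSystem API; the path, buffer size, and block size here are arbitrary examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicatedWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Ask for 3 copies per block: the client streams the block to the first
        // DataNode, which forwards it down the replication pipeline.
        short replication = 3;
        FSDataOutputStream out = fs.create(new Path("/tmp/demo.txt"),
                true,                 // overwrite if the file exists
                4096,                 // io buffer size in bytes
                replication,          // copies per block
                128 * 1024 * 1024L);  // block size in bytes
        out.writeUTF("hello hdfs");
        out.close();
    }
}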
III. MapReduce
1. JobTracker: job tracker
Each node responsible for running jobs is called a task tracker (TaskTracker) in MapReduce.
2. Each node runs two types of processes:
DataNode: responsible for data management operations such as storing or deleting data
TaskTracker: responsible for processing tasks from the job queue, as part of the Hadoop cluster
3. Program characteristics
Traditional approach: the data is loaded to wherever the program is.
Hadoop approach: the program runs wherever the data is.
4. Hadoop's distributed processing framework
Tasks may be submitted by N people at the same time, and a given job does not necessarily run on all nodes; it may run on some nodes, or even on a single node. To limit how many tasks can pile up on one node, Hadoop uses task slots: the slot count determines the maximum number of tasks a node can run concurrently.
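A minimal sketch of capping those slots, assuming the classic MRv1 property names (normally placed in each TaskTracker's mapred-site.xml rather than set in code):

import org.apache.hadoop.conf.Configuration;

public class SlotConfSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Cap the number of map and reduce tasks one TaskTracker runs at once.
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 2);
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);
    }
}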
5. Functional programming: MapReduce borrows this operating mechanism.
Lisp and ML are functional programming languages built around higher-order functions:
map and fold.
map: maps one task onto many; it takes a function as a parameter and applies it to all elements of a list, generating a list of results: map(f()).
fold: repeatedly folds results into a function; it takes two parameters, a function and an initial value: fold(g(), init). First the initial value init is passed through the function g() to give g(init); then g(init) is taken as the initial value for the second round, giving g(g(init)); continuing in this way produces a final result.
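To make map and fold concrete, here is a minimal sketch in Java (Hadoop's own language) using streams; the list and functions are arbitrary examples:

import java.util.List;
import java.util.stream.Collectors;

public class MapFoldDemo {
    public static void main(String[] args) {
        List<Integer> xs = List.of(1, 2, 3, 4);

        // map: apply f to every element, producing a new list of results.
        List<Integer> squares = xs.stream()
                .map(x -> x * x)
                .collect(Collectors.toList());

        // fold: combine an accumulator with each element in turn, starting
        // from an initial value (here 0): g(g(g(0, x1), x2), x3)...
        int sum = xs.stream().reduce(0, (acc, x) -> acc + x);

        System.out.println(squares + " " + sum); // [1, 4, 9, 16] 10
    }
}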
6. MapReduce process
Mapper: each mapper is an instance; every mapper produces a list after processing, which corresponds to the map function. If the data a mapper receives is already key-value data, it is used directly; if not, it is first converted into key-value form.
Reducer: the reducers do not run until all mappers have finished, which corresponds to the fold function. There may be more than one reducer. A reducer handles only key-value data, folding the data it receives.
The data a reducer folds is still key-value data, and the process that delivers it, called shuffle and sort, is very important.
An easy way to understand this is counting how many times each word appears in a book:
Mapper: one unit per, say, 100 pages, giving e.g. 5 mappers, each splitting its pages into words, for example (this, 1), (is, 1), (this, 1), (how, 1); the words are split out one by one, and the data a mapper emits is k-v data.
Reducer: a reducer handles only key-value data; the split-out words are passed into the reducers for counting and sorting, and data with the same key is sent to the same reducer. The final result, e.g. (this, 500), (is, 200), and so on, is still KV data.
Shuffle and sort: the process between the mappers' output and the reducers' counting of word occurrences by key is called shuffle and sort.
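The stock Hadoop WordCount program implements exactly this example; a minimal self-contained sketch along those lines (input and output paths are passed as arguments):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit ("this", 1) for every word
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get(); // fold the 1s for one word
            context.write(key, new IntWritable(sum));    // e.g. ("this", 500)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}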
7. Hadoop data flow
Hadoop only provides the data storage platform; any job, any data-processing program, must be written by developers against MapReduce to be of use. What a particular mapper does, and what its reducer is used for, depend entirely on the developer's definitions.
(1) Partitioner: decides, through the shuffle and sort process, which reducer the key-value pairs emitted by a mapper are sent to (see the sketch after item (2)).
(2) Combiner: if keys in the key-value data produced by a mapper are identical, their values are merged before distribution; otherwise they are distributed unmerged. It too is written by Hadoop developers, and its input keys and output keys must be consistent.
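A minimal sketch of a custom partitioner; the alphabet-based routing rule is an arbitrary example, but the Partitioner base class and its getPartition contract are the standard MapReduce API:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route words starting with a-m to reducer 0 and the rest to reducer 1,
// so all values for a given key still meet at the same reducer.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        char first = s.isEmpty() ? 'a' : Character.toLowerCase(s.charAt(0));
        int partition = (first >= 'a' && first <= 'm') ? 0 : 1;
        return partition % numPartitions; // stay within [0, numPartitions)
    }
}

In the WordCount job above this would be wired in with job.setPartitionerClass(AlphabetPartitioner.class) together with job.setNumReduceTasks(2).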
(3) With multiple reducers:
Sort: each map's local sorting of its output is called sort.
(4) With a single reducer:
(5) The shuffle and sort stage:
(6) The job submission request process:
(7) Internal structure of the JobTracker
Role: job scheduling, management, monitoring, and so on. The JobTracker is therefore very busy at runtime and became a performance bottleneck; as of MRv2, the job scheduling and the management/monitoring functions have been split apart.
(8) Version changes
MRv1 (Hadoop 1) -> MRv2 (Hadoop 2)
MRv1: both the cluster resource manager and the data processor
MRv2:
YARN: cluster resource manager
MRv2: data processor
Tez: execution engine
MR: batch jobs
RT Stream Graph: real-time and streaming graph processing, using the data structures of graph algorithms
(9) How resources and tasks run in second-generation Hadoop
MapReduce v2 separates resource management from task execution. A program is run by its own Application Master, while resource allocation is handled by the Resource Manager. When a client submits a task, the Resource Manager asks each Node Manager whether it has a free container in which to run the program; if one does, the program's master process, the Application Master, is started on that node. The App Mstr then applies to the Resource Manager for task resources; once the Resource Manager has allocated them, it notifies the App Mstr, which can then use the containers to run the job.
While the job runs, each container reports its own progress back to the App Mstr. When a task in a container finishes, the App Mstr also reports to the Resource Manager so that the resources can be reclaimed.
RM: Resource Manager
NM: Node Manager
AM: Application Master
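From the client's point of view this whole RM/AM/container negotiation is hidden behind the Job API; a minimal sketch of submitting a job and watching its progress (the job name is arbitrary, and mapper/reducer/path setup as in the WordCount example above is elided):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitAndWatch {
    public static void main(String[] args) throws Exception {
        // Configure mapper, reducer, and input/output paths as in WordCount above.
        Job job = Job.getInstance(new Configuration(), "yarn demo");

        job.submit(); // client -> Resource Manager: launch an Application Master
        while (!job.isComplete()) {
            // Progress flows container -> App Mstr and is polled by the client.
            System.out.printf("map %.0f%% reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
        System.out.println(job.isSuccessful() ? "done" : "failed");
    }
}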