This article describes what the Hadoop architecture looks like. It goes into a fair amount of detail and should be a useful reference for interested readers.
I. The Hadoop 1.0 model:
// INPUT HDFS                                                      // OUTPUT HDFS
split 0 --> map --> [sort] --> [1,3..] --\
                                          +--> merge --> reducer --> part 0 --> HDFS replication
split 1 --> map --> [sort] --> [2,6..] --+
                                          +--> merge --> reducer --> part 1 --> HDFS replication
split 2 --> map --> [sort] --> [4,2..] --/
// 3 map tasks are started here, but only 2 reducers.
// sort: each map output is sorted locally before it is sent to a reducer;
//       records with the same key are always sent to the same reducer.
// merge: multiple incoming data streams are combined into one data stream on the reducer side.
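As a minimal sketch of how "same key goes to the same reducer" is decided, here is the hash-partitioning idea written against the Hadoop Java API; it mirrors what the default HashPartitioner does, and the class name KeyHashPartitioner is just an illustrative name:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Every record with the same key produces the same partition number,
// so it is always pulled by the same reducer.
public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}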
Workflow:
Client --> Job --> Hadoop MapReduce master
                          |
                   splits the job
                    /           \
               job parts     job parts
                    |            |
                    v            v
[Input]  -- map1 --\
[Data]   -- map2 ---+--> reduceA --> [Output]
         \_ map3 --/ \-> reduceB --> [Data]
// map1, map2, map3 and reduceA, reduceB are cross-connected: map1 can feed both reduceA
// and reduceB at the same time, and the same holds for the other maps.
// MapReduce divides the work to be processed into two phases, Map and Reduce.
Client App
(MapReduce Client) --> Job Tracker
                        /        |        \
          [task tracker1] [task tracker2] [task tracker3]
           map  reduce     reduce  map     map  reduce
JobTracker: keeps a task list and status information
JobA ----> [map task 1]
JobB       [map task 2]
JobC       [map task 3]
...        [reduce task 1]
           [reduce task 2]
// The number of tasks any task tracker can run is limited and configurable.
// Task slot: determines how many tasks can run concurrently on a tracker.
JobTracker:
1. Responsible for task distribution
2. Checks Task Tracker status; restarts failed trackers, etc.
3. Monitors the status of tasks
The JobTracker is a single point of failure. From Hadoop 2.0 onwards these functions are implemented separately:
MapReduce 2.0 splits the JobTracker's work into two parts (resource management and job scheduling/monitoring).
II. Hadoop 1.0 vs 2.0
1.0: Pig: data flow, Hive: SQL
2.0: MR: batch, Pig: data flow, Hive: SQL
     RT: stream, graph: real-time stream and graph processing
     Tez: execution engine
MRv1: cluster resource manager + data processing
MRv2:
1. YARN: cluster resource manager
   Yet Another Resource Negotiator
2. MRv2: data processing
   MR: batch jobs
   Tez: execution engine // provides a runtime environment
Programs that can run directly on top of YARN include:
MapReduce (batch), Tez, HBase, Streaming, Graph, Spark, HPC MPI (high performance), Weave
Hadoop 2.0
Client --> RM --> node1 / node2 / node n ...
Resource Manager (RM): the RM runs independently.
Each node runs [Node Manager + App Master + Container].
Node Manager (NM): runs on every node and periodically reports node information to the RM.
When a client submits a job, the Application Master on a node decides how many mappers and how many reducers to start.
The mappers and reducers are called containers // the tasks run inside the containers.
There is only one Application Master per job; the APP M for a given job lives on one node, but its containers run on multiple nodes and periodically report their processing status to the APP M.
The APP M reports the health of the job to the RM, and the RM shuts down the APP M once the job is complete.
When a task fails, it is handled by the APP M, not by the RM.
2.0 working model:
A --- [NM / Container 1 / APP M (B)]
        \
  [RM] --- [NM / Container 2 / APP M (A)]
        /
B --- [NM / Container 3 / A]
// Task A runs 3 containers on two nodes.
// Task B runs 1 container on one node.
MapReduce status: containers report to the APP M // the containers include both map and reduce tasks.
Job submission: the client submits the job to the RM.
Node status: the NM reports periodically to the RM.
Resource request: the APP M applies to the RM for resources; once granted, the APP M can use containers on other nodes.
Client request --> the RM finds an idle node and runs the APP M on it --> the APP M applies to the RM for container resources; the RM asks the NMs for containers, allocates them, and then tells the APP M.
The APP M uses the containers to run tasks. While running, the containers continuously report their status and progress to the APP M, and the APP M reports the job's running status to the RM.
When the APP M reports that the run is complete, the RM reclaims the containers and shuts down the APP M.
RM: resource manager
NM: node manager
AM: application master
Container: where the MR tasks actually run
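A hedged sketch of the client side of this submission flow, using the YARN Java client API from Hadoop 2.x; the application name, queue, and resource sizes are illustrative, and the ApplicationMaster launch command is left as a placeholder:

import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();                               // the client talks to the RM

        // Ask the RM for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");               // illustrative name

        // Container spec for the ApplicationMaster (launch command is a placeholder).
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        // amContainer.setCommands(...);                  // real AM launch command would go here
        ctx.setAMContainerSpec(amContainer);
        ctx.setResource(Resource.newInstance(1024, 1));   // 1 GB, 1 vcore for the AM (illustrative)
        ctx.setQueue("default");

        // The RM finds a free node and starts the AM there; the AM then asks the RM
        // for the containers that will actually run the map/reduce work.
        ApplicationId appId = yarnClient.submitApplication(ctx);
        System.out.println("Submitted " + appId);
        yarnClient.stop();
    }
}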
Hadoop development timeline:
2003: Nutch // a web crawler program
2004-2006: the MapReduce and GFS papers
2011: Hadoop 1.0.0
2013: Hadoop 2.0
http://hadoop.apache.org/
III. Hadoop 2.0 ecosystem and basic components
// Components that depend on YARN run on top of YARN; the rest can be used independently.
2. HDFS (Hadoop distributed file system)
The GFS paper, which originated from Google, was published in October 2003. HDFS is a GFS clone.
HDFS is the foundation of data storage management in Hadoop system. It is a highly fault-tolerant system that can detect and respond to hardware failures and is used to run on low-cost general-purpose hardware.
HDFS simplifies the file consistency model, provides high-throughput application data access through streaming data access, and is suitable for applications with large datasets.
It provides a write-once, read-many access model, and data is split into blocks that are distributed across different physical machines in the cluster.
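A small sketch of the write-once / read-many pattern through the HDFS Java API (FileSystem); the cluster address and file path are illustrative:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // illustrative address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/events.log");           // illustrative path

        // Write once: the file is split into blocks and replicated across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("first and only write\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many: any number of readers can stream the blocks back.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}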
3. MapReduce (distributed computing framework)
The MapReduce paper, also from Google, was published in December 2004. Hadoop MapReduce is a clone of Google's MapReduce.
MapReduce is a distributed computing model used to process large amounts of data. It hides the details of the distributed computing framework and abstracts the computation into two phases: map and reduce.
Map performs a specified operation on each independent element of the dataset and generates intermediate results in the form of key-value pairs. Reduce then aggregates all values that share the same key in the intermediate results to produce the final result.
MapReduce is very suitable for data processing in a distributed parallel environment composed of a large number of computers.
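The classic word-count job is a minimal sketch of this map/reduce split, written against the standard Hadoop MapReduce Java API:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: operate on independent input records and emit intermediate (key, value) pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);               // e.g. ("hello", 1), ("world", 1), ...
            }
        }
    }

    // Reduce: all values for the same key arrive at the same reducer; aggregate them.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(key, new IntWritable(sum));   // e.g. ("hello", 42)
        }
    }
}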
4. HBase (distributed column-store database)
Based on the Bigtable paper from Google, published in November 2006. HBase is a Google Bigtable clone.
HBase is a scalable, highly reliable, high-performance, distributed, column-oriented dynamic-schema database for structured data, built on HDFS.
HBase uses Bigtable's data model: an enhanced sparse, sorted mapping table (key/value), where keys are made up of a row key, a column key, and a timestamp.
HBase provides random, real-time read and write access to large-scale data. At the same time, the data stored in HBase can be processed by MapReduce, which neatly combines data storage with parallel computing.
HBase: an open-source clone of Bigtable; column-oriented storage (conventional SQL databases are row-oriented).
Column family: several commonly used columns are stored together.
Cell: the intersection of a row and a column. Each cell can hold multiple versions at the same time; older versions are not deleted when a new one is written, so you can trace back to earlier versions.
You can specify how many versions to keep. Every cell is a key-value pair, and any row may have one field more or one field fewer than another; there is no strong schema constraint.
HBase works on top of HDFS, where the data is stored as blocks (chunks).
When a large data block is needed, it is read into HBase, read and modified there, and then written back (or overwritten) to HDFS,
thereby providing random reads and writes; HDFS itself does not support random read/write.
HBase interface:
HBase is itself a distributed system: it needs its own cluster and relies heavily on ZooKeeper to avoid split-brain.
HDFS provides its own redundancy; each block is stored as multiple copies.
HBase runs on top of HDFS as a column-oriented database; HDFS lacks real-time read and write operations, which is why HBase exists.
HBase is modeled on Google Bigtable and stores data as key-value pairs. The project's goal is to quickly locate and access the required rows among billions of rows of data.
HBase is a database, a NoSQL database. Like other databases it provides real-time reads and writes; Hadoop alone cannot meet real-time requirements, but HBase can.
If you need to access some data in real time, store it in HBase.
You can use Hadoop as a static data warehouse and HBase as the store for data that changes as a result of your operations.
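A hedged sketch of the cell / column family / versions idea through the HBase Java client; the table name, column family, and ZooKeeper quorum are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");    // illustrative quorum

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {   // illustrative table

            // A cell is addressed by (row key, column family, column qualifier, timestamp).
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Beijing"));
            table.put(put);   // writing again later simply adds a newer version of the cell

            // Read back several versions of the same cell instead of only the latest one.
            Get get = new Get(Bytes.toBytes("row-001"));
            get.setMaxVersions(3);                            // read up to 3 versions of each cell
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));
        }
    }
}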
5. Zookeeper (distributed collaboration Service)
Chubby paper from Google, published in November 2006. Zookeeper is a Chubby clone.
It solves data-management problems in a distributed environment: unified naming, state synchronization, cluster management, configuration synchronization, and so on.
Many Hadoop components depend on ZooKeeper, which runs on a cluster of machines and is used to manage Hadoop operations.
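A minimal sketch of the configuration-synchronization / unified-naming idea using the plain ZooKeeper Java client; the znode path and connect string are illustrative:

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigDemo {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (illustrative address, 30 s session timeout).
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30_000, event -> {});

        String path = "/app-config";                           // illustrative root-level znode
        byte[] value = "batch-size=128".getBytes(StandardCharsets.UTF_8);

        // Publish a piece of configuration under a well-known name...
        if (zk.exists(path, false) == null) {
            zk.create(path, value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // ...and any node in the cluster can read (and watch) the same value.
        byte[] read = zk.getData(path, false, null);
        System.out.println(new String(read, StandardCharsets.UTF_8));
        zk.close();
    }
}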
6. Hive (data warehouse)
Open-sourced by Facebook, originally built to handle statistics over massive amounts of structured log data.
Hive defines a SQL-like query language (HQL) that converts SQL into MapReduce tasks and executes them on Hadoop. It is usually used for offline analysis.
HQL is used to run query statements over data stored on Hadoop; Hive lets developers who are unfamiliar with MapReduce write data-query statements, which are then translated into MapReduce tasks on Hadoop.
Hive: converts queries into MapReduce tasks // MapReduce is a batch program, so it is slow.
HQL resembles SQL statements, so it is suited to offline data processing; real-time online queries or operations would be very slow in a real production environment.
Hive acts as the data warehouse in Hadoop.
You can use HiveQL for SELECT, JOIN, and so on.
If you have data-warehouse requirements, are good at writing SQL, and do not want to write MapReduce jobs, you can use Hive instead.
Anyone familiar with SQL can use Hive to process and analyze offline data.
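A hedged sketch of running an HQL query through the HiveServer2 JDBC interface; the connection URL, user, table, and columns are illustrative, and it assumes the Hive JDBC driver is on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (assumes hive-jdbc is on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 JDBC endpoint (illustrative host/database).
        String url = "jdbc:hive2://hiveserver:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = conn.createStatement()) {

            // Hive translates this SQL-like query into MapReduce (or Tez) jobs,
            // so it behaves like an offline/batch query, not an interactive one.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}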
7. Pig (ad-hoc scripting)
Open-sourced by Yahoo!, its design motivation is to provide an ad-hoc (the computation happens at query time), MapReduce-based data analysis tool.
Pig defines a data flow language-Pig Latin, which is an abstraction of the complexity of MapReduce programming. The Pig platform includes a running environment and a scripting language (Pig Latin) for analyzing Hadoop data sets.
The compiler translates Pig Latin into MapReduce program sequence and converts scripts into MapReduce tasks to be executed on Hadoop. It is usually used for offline analysis.
Pig: a scripting-language interface; a lightweight scripting language for driving Hadoop, originally introduced by Yahoo, but now in decline.
I personally recommend using Hive instead.
8. Sqoop (data ETL / synchronization tool)
Sqoop is short for SQL-to-Hadoop and is mainly used to transfer data between traditional databases and Hadoop. The data import and export is essentially a MapReduce program, so it takes full advantage of MR's parallelism and fault tolerance.
Sqoop uses database technology to describe data architecture, which is used to transfer data between relational database, data warehouse and Hadoop.
9. Flume (log collection tool)
A log collection system open-sourced by Cloudera; it is distributed, highly reliable, highly fault tolerant, and easy to customize and extend.
It abstracts the process of generating, transmitting, processing, and finally writing data to a target path into a data flow. Within a concrete data flow, the data source supports custom data senders in Flume, so data arriving over different protocols can be collected.
At the same time, Flume data stream provides the ability to deal with log data simply, such as filtering, format conversion and so on. In addition, Flume has the ability to write logs to various data targets (customizable).
Generally speaking, Flume is a massive log collection system that is scalable and suitable for complex environments. Of course, it can also be used to collect other types of data.
10. Mahout (data mining algorithm library)
Mahout originated in 2008 as a sub-project of Apache Lucene. It made great progress in a very short time and is now a top-level Apache project.
The main goal of Mahout is to create scalable implementations of classic algorithms in the field of machine learning, which are designed to help developers create smart applications more easily and quickly.
Mahout now includes widely used data mining methods, such as clustering, classification, recommendation engine (collaborative filtering) and frequent set mining.
In addition to algorithms, Mahout also includes data input / output tools, data mining support architectures such as integration with other storage systems such as databases, MongoDB, or Cassandra.
11. Oozie (workflow scheduler)
Oozie is a scalable workflow system integrated into the Hadoop stack that coordinates the execution of multiple MapReduce jobs. It can manage complex systems driven by external events, including the timing and the arrival of data.
An Oozie workflow is a set of actions (for example, Hadoop Map/Reduce jobs, Pig jobs, and so on) arranged in a control-dependency DAG (Directed Acyclic Graph), which specifies the order in which the actions are executed.
Oozie uses hPDL, an XML process-definition language, to describe this graph.
12. YARN (distributed resource manager)
YARN is the next-generation MapReduce, i.e. MRv2, evolved from the first-generation MapReduce. It was proposed mainly to fix the poor scalability of the original Hadoop and its lack of support for multiple computing frameworks. YARN is the next-generation Hadoop computing platform and a general-purpose runtime framework: users can write their own computing framework and run it in this environment. A user-written framework is used as a library on the client side and is packaged when the job is submitted. The framework provides the following:
- Resource management: including application management and machine resource management
- Two-tier resource scheduling
- Fault tolerance: considered in every component
- Scalability: scales to tens of thousands of nodes
13. Mesos (distributed Resource Manager)
Mesos was born in a research project of UC Berkeley and has become an Apache project. At present, some companies use Mesos to manage cluster resources, such as Twitter.
Similar to YARN, Mesos is a platform for unified resource management and scheduling, and it also supports a variety of computing frameworks, such as MR, streaming, and so on.
14. Tachyon (distributed memory file system)
Tachyon ("tachyon" refers to a hypothetical faster-than-light particle) is a memory-centric distributed file system with high performance and fault tolerance.
Can provide reliable file sharing services at memory-level speed for cluster frameworks such as Spark and MapReduce.
Tachyon was born in AMPLab of UC Berkeley.
15. Tez (DAG Computing Model)
Tez is Apache's latest open source computing framework that supports DAG jobs. It originates directly from the MapReduce framework. The core idea is to further split the Map and Reduce operations.
That is, Map is split into Input, Processor, Sort, Merge and Output, and Reduce is split into Input, Shuffle, Sort, Merge, Processor and Output, etc.
In this way, these decomposed meta-operations can be flexibly combined to generate new operations. After some control programs are assembled, these operations can form a large DAG job.
Currently Hive supports both the MR and Tez computation models; Tez is compatible with existing MR programs and improves computing performance.
16. Spark (in-memory DAG computing model)
Spark is an Apache project, which is billed as "lightning fast cluster computing". It has a thriving open source community and is by far the most active Apache project.
The earliest Spark is a general parallel computing framework like Hadoop MapReduce, which is open source by UC Berkeley AMP lab.
Spark provides a faster and more general data-processing platform. Compared with Hadoop, Spark can make your programs run up to 100 times faster in memory, or 10 times faster on disk.
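For comparison with the MapReduce word count shown earlier, here is a short sketch of the same workload in Spark's Java API (Spark 2.x style); the file paths are illustrative:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark-wordcount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/demo/events.log");

            // Intermediate RDDs can stay in memory, which is where the speedup
            // over disk-based MapReduce comes from.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.saveAsTextFile("hdfs://namenode:8020/demo/wordcount-out");
        }
    }
}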
17. Giraph (graph computing model)
Apache Giraph is a scalable, distributed, iterative graph-processing system built on the Hadoop platform, inspired by BSP (bulk synchronous parallel) and Google's Pregel.
It first came from Yahoo. Yahoo developed Giraph based on the principles of "Pregel: a system for large-scale graph processing", published by Google engineers in 2010, and later donated Giraph to the Apache Software Foundation.
At present, everyone can download Giraph, which has become an open source project of the Apache Software Foundation, supported by Facebook and improved in many ways.
18. GraphX (graph computing model)
Spark GraphX started as a distributed graph-computing framework project at Berkeley AMPLab and is now integrated into the Spark runtime, providing it with BSP-style large-scale parallel graph computation.
19. MLlib (machine learning library)
Spark MLlib is a machine learning library, which provides a variety of algorithms for classification, regression, clustering, collaborative filtering and so on.
20. Streaming (stream computing model)
Spark Streaming supports real-time processing of streaming data, computing over the live data in micro-batches.
21. Kafka (distributed message queuing)
Kafka is an open-source messaging system developed by LinkedIn and released in December 2010; it is mainly used to handle active streaming data.
Active streaming data is very common in web applications, and includes things like site page views (PV), what users visit, what content they search for, and so on.
These data are usually recorded as logs and then processed statistically at regular intervals.
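A minimal sketch of publishing such activity events to Kafka with the Java producer client; the broker address, topic, and record contents are illustrative:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PageViewProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");               // illustrative broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // One event per page view; downstream jobs consume the stream
            // either in periodic batches or in real time.
            producer.send(new ProducerRecord<>("page-views", "user-42", "/index.html"));
            producer.flush();
        }
    }
}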
22. Phoenix (HBase SQL interface)
Apache Phoenix is the SQL driver for HBase. Phoenix makes HBase accessible through JDBC and converts your SQL queries into HBase scans and the corresponding actions.
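A hedged sketch of querying HBase through the Phoenix JDBC driver; the ZooKeeper quorum, table, and columns are illustrative, and it assumes the Phoenix client jar is on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixQueryDemo {
    public static void main(String[] args) throws Exception {
        // The Phoenix connection string points at the HBase cluster's ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181");
             Statement stmt = conn.createStatement()) {

            // Phoenix turns this SQL into HBase scans under the hood.
            ResultSet rs = stmt.executeQuery("SELECT city, COUNT(*) FROM users GROUP BY city");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}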
23. Ranger (Security Management tool)
Apache Ranger is a permissions framework for Hadoop clusters that provides operation, monitoring, and management of complex data permissions. It offers a centralized management mechanism for all data permissions in the YARN-based Hadoop ecosystem.
24. Knox (Hadoop security gateway)
Apache Knox is a REST API gateway for accessing a Hadoop cluster. It provides a single access point for all REST access and can handle 3A (Authentication, Authorization, Auditing) and SSO (single sign-on).
25. Falcon (data Lifecycle Management tool)
Apache Falcon is a new data processing and management platform for Hadoop, designed for data movement, data pipeline coordination, life cycle management and data discovery. It enables end users to quickly "onboard" their data and related processing and management tasks to the Hadoop cluster.
26. Ambari (installation, deployment, and configuration management tool)
The function of Apache Ambari is to create, manage and monitor the cluster of Hadoop. It is a web tool to make Hadoop and related big data software easier to use.
Note: try not to run Hadoop on virtual machines, because virtualization has a large impact on I/O.
Hadoop Distribution:
Community Edition: Apache Hadoop
Third-party distribution:
Cloudera: founded by contributors to the Hadoop source code: CDH // ISO image, the most mature distribution
Hortonworks: founded by original Hadoop staff: HDP // ISO image, not open source
Intel:IDH
MapR:
Amazon Elastic Map Reduce (EMR)
Apache Hadoop or CDH is recommended.
That concludes this overview of "What is the Hadoop architecture like?". Thank you for reading, and I hope it helps.