
Overview of Hadoop ecosystem and version evolution

This article surveys the components of the Hadoop ecosystem and the function of each, describing the system at a macro level so as to lay a foundation for further study of Hadoop.

The Hadoop ecosystem has several attractive aspects:

1. It is open source and free, so there are no licensing worries about using it.

2. The community is active and easy to communicate with.

3. It covers all aspects of distributed storage and computing.

4. It has a proven track record in enterprise production use.

The difference between Hadoop 1.0 and 2.0

Hadoop has reached its second generation, which is a great improvement over the first:

The main difference is the addition of a new component, YARN, between HDFS and MapReduce.

To set the stage: the two most important parts of Hadoop are distributed storage (HDFS) and big data computation (MapReduce). In the first generation of Hadoop, MapReduce was the only computing framework that could run on HDFS, which greatly limited the range of applications. In the second generation, YARN arrived to remove that limitation.

An easy way to understand it: YARN is like an extension socket into which many computing frameworks can be plugged and run, while YARN takes care of the unified management and scheduling of cluster resources.

Introduction to Hadoop (overview):

1. Distributed storage system HDFS (Hadoop Distributed File System):

Distributed storage system

Provides data storage services with high reliability, high scalability, and high throughput

2. Resource management system YARN (Yet Another Resource Negotiator):

Responsible for unified management and scheduling of cluster resources

3. Distributed computing framework MapReduce:

Distributed computing framework

Offers easy programming, high fault tolerance, and high scalability

A simple layered model: MapReduce (and other computing frameworks) sit on top of YARN, which in turn sits on top of HDFS.

HDFS

HDFS features:

Good scalability

High fault tolerance

Suitable for storing massive data at the PB scale and above

Basic principles:

Files are divided into large data blocks that are stored across multiple machines.

Data splitting, fault tolerance, load balancing, and similar functions are transparent to users (developers do not need to care how they are implemented), so HDFS can be regarded as a single disk with large capacity and high fault tolerance.

In terms of division of work, HDFS consists of two main parts:

the NameNode (NN) and the DataNodes (DN).

An HDFS cluster runs in master-slave mode with two types of nodes: a single NameNode (the master) and multiple DataNodes (the slaves). The NameNode manages the file system namespace: it maintains the filesystem tree and the metadata of all files and directories in that tree.

Here, ZooKeeper acts as a coordinating manager, and also monitors the status of the HDFS nodes.

We will not go deeper than that, since the main purpose of this article is to look at Hadoop from a macro point of view.
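To make the "one big disk" idea concrete, here is a minimal sketch (not from the original article) of writing and reading a file through the HDFS Java API. The NameNode address and file path are hypothetical; block splitting, replication, and placement all happen transparently behind these calls.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/hello.txt"); // hypothetical path
        // Write: blocks, replication, and placement are handled for us.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        // Read it back as if it lived on one large local disk.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
            in.readFully(buf);
            System.out.println(new String(buf, StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```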

YARN:

A new component added in Hadoop 2.0

Responsible for resource management and scheduling of the cluster

Enables multiple computing frameworks to run in one cluster

Characteristics of YARN:

Good scalability and high availability

Unified management and scheduling of multiple types of applications

Comes with a variety of multi-user schedulers suited to shared cluster environments

Within one cluster, the resource demands of Hadoop (mainly MapReduce), Spark (an in-memory computing framework), and MPI (another computing framework) vary across different periods of time. Running them through YARN lets them complement one another: for example, when MapReduce cannot make full use of its share of resources, the spare resources can be allocated to Spark. YARN coordinates the resource scheduling of the whole cluster.

The working mechanism of YARN again follows the master-slave mode of operation.

Put even more intuitively, YARN is like a strip of sockets into which a variety of distributed computing frameworks can be plugged:

for example, MapReduce for batch processing, Tez for interactive processing, Storm for stream processing, Giraph for graph processing, Spark for in-memory processing, and so on.
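As a small, hedged illustration of YARN's role as the cluster's resource manager (not part of the original article), the sketch below uses the YarnClient API to list the running nodes and the applications currently sharing the cluster; the ResourceManager address is assumed to be supplied by yarn-site.xml on the classpath.

```java
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnInfo {
    public static void main(String[] args) throws Exception {
        // Reads the ResourceManager address from yarn-site.xml on the classpath.
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // The NodeManagers whose resources YARN schedules.
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport n : nodes) {
            System.out.println(n.getNodeId() + " capacity=" + n.getCapability());
        }
        // The applications (MapReduce, Spark, ...) sharing the cluster via YARN.
        List<ApplicationReport> apps = yarn.getApplications();
        for (ApplicationReport a : apps) {
            System.out.println(a.getApplicationId() + " " + a.getName());
        }
        yarn.stop();
    }
}
```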

MapReduce

Good scalability

High fault tolerance

Suitable for offline processing of massive data at the PB scale and above

MapReduce work is divided into two phases, Map and Reduce:

First, the user writes the program and the master assigns the work. The input data submitted by the user is split into several blocks, and each block goes to a Map worker, so the data can be processed in parallel; this stage can be regarded as distributed processing. The intermediate results are first written to local disk, then processed by the workers of the Reduce phase, and the final output is written to HDFS. It is worth noting that the input files, the intermediate result files, and the final output files all use a consistent format.

Consider the classic MapReduce program that counts the occurrences of each word in a file: the file's contents are first split into several parts; the Map jobs process each part and emit per-word counts; the shuffle phase then "reshuffles the cards", regrouping the intermediate data by word; and the regrouped data is handed to the Reduce jobs. At this point the intermediate data has already been merged by key, so the final result only needs to be tallied per group before being output.
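The word-count flow just described is the canonical MapReduce example. Below is a minimal sketch of it against the org.apache.hadoop.mapreduce API; input and output paths are taken from the command line, and the combiner line is an optional map-side pre-merge.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce phase: after the shuffle groups values by word, sum the counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional map-side pre-merge
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```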

Here are two overview diagrams of the Hadoop ecosystem comparing the differences between 1.0 and 2.0:

Hadoop 1.0

Hadoop 2.0

A brief description:

Each component's function is shown in the diagrams, so they are not repeated here.

Let's talk about Hive, Pig, and Mahout.

To process data, we generally write MapReduce programs over the input, and that code is written in Java. This raises the bar for people who lack a Java background but still want to work with big data.

This is where Hive and Pig come on stage; you can think of them as two translators.

Hive (originally developed at Facebook) provides HQL, a language similar to traditional SQL, for expressing the data processing to be done. Hive automatically converts code written in HQL into MapReduce programs.

In the same way, Pig provides a language called Pig Latin for writing data-processing programs, which can likewise be converted into MapReduce programs.
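As a hedged sketch of Hive acting as a "translator", the snippet below submits one HQL query to a HiveServer2 instance over JDBC. The server address, credentials, and the words table are hypothetical, and the hive-jdbc driver is assumed to be on the classpath; Hive compiles the query into MapReduce job(s) behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Documented Hive JDBC driver class; older setups need the explicit load.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint and credentials.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // One SQL-like statement; Hive translates it into MapReduce work.
            ResultSet rs = stmt.executeQuery(
                "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word");
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}
```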

Mahout provides a data-mining library containing a large number of algorithms in three major categories:

Recommendation

Clustering

Classification

This way, developers do not have to spend large amounts of time building these computations themselves, which improves productivity.

Oozie enables developers to schedule recurring jobs, with e-mail notifications, for work written in a variety of languages and tools such as Java, UNIX Shell, Apache Hive, Apache Pig, and Apache Sqoop.

At present there are a variety of computing frameworks and job types:

MapReduce, Storm, Hive, Pig, and others.

This gives rise to the following problems:

1. Different jobs depend on one another (forming a DAG).

2. Some jobs run periodically.

3. Some jobs must run at scheduled times.

4. Job execution status needs to be monitored, with alerts (e-mail, SMS).

This is where Oozie comes in, providing unified management and scheduling.

HBase: Hadoop's database

High reliability

High performance

Column oriented

Good scalability

Composition (a minimal client sketch follows this list):

Table: similar to a table in a traditional database

ColumnFamily: column family

A table is composed horizontally of one or more column families

A column family can contain any number of columns

RowKey: row key

The primary key of a table

Records in a table are sorted by RowKey

TimeStamp: timestamp

Each piece of data carries a timestamp

It serves as the version number
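To tie table, column family, row key, and timestamp together, here is a minimal sketch using the HBase Java client; the user table and its info column family are hypothetical and assumed to already exist, and connection settings come from hbase-site.xml on the classpath.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        // The ZooKeeper quorum etc. are read from hbase-site.xml on the classpath.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user"))) { // hypothetical table
            // Write one cell: row key "u001", column family "info", column "name".
            Put put = new Put(Bytes.toBytes("u001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put); // HBase stamps the cell with a timestamp (its version)

            // Read it back by row key; rows are kept sorted by row key.
            Result r = table.get(new Get(Bytes.toBytes("u001")));
            byte[] name = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```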

ZooKeeper: software that provides coordination and consistency services for distributed applications. Its functions include configuration maintenance, naming services, distributed synchronization, group services, and so on.

The goal of ZooKeeper is to encapsulate complex and error-prone coordination services and give users a simple, easy-to-use interface backed by an efficient and stable system.

Used by: HDFS, YARN, Storm, HBase, Flume, Dubbo (Alibaba), Metaq (Alibaba)
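As a minimal sketch of the primitive those services are built on (a small, consistent tree of "znodes"), the snippet below stores and reads back one piece of configuration; the ensemble address and znode path are hypothetical, and the create call would fail if the znode already existed.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Hypothetical 3-node ensemble address; 15-second session timeout.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a piece of configuration in a znode (configuration maintenance).
        String path = "/demo-config"; // hypothetical znode
        zk.create(path, "maxConn=100".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.PERSISTENT);
        // Any client in the cluster reads back the same, consistent value.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```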

Sqoop (data synchronization tool):

A bridge between Hadoop and traditional databases

Supports a variety of databases, including MySQL, DB2, etc.

Pluggable: users can add support for new databases as needed

Essentially a MapReduce program:

It takes full advantage of MapReduce's distributed parallelism and fault tolerance (a minimal invocation sketch follows)
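Sqoop is normally driven from the command line, but since it is a Java program it can also be invoked through its runTool entry point. Below is a hedged sketch (Sqoop 1.x assumed; database host, table, and credentials are all placeholders) of importing one table into HDFS.

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportDemo {
    public static void main(String[] args) {
        // Hypothetical MySQL source; table and credentials are placeholders.
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost:3306/shop",
            "--username", "etl",
            "--password", "secret",
            "--table", "orders",
            "--target-dir", "/data/orders", // HDFS output directory
            "-m", "4"                       // 4 parallel map tasks
        };
        // Sqoop generates and runs a map-only MapReduce job to do the transfer.
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.exit(exitCode);
    }
}
```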

Flume (log collection tool; a minimal client sketch follows this list):

Distributed system

High reliability

High fault tolerance

Easy to customize and extend
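As a hedged illustration (not from the original article), the sketch below uses Flume's RPC client API to send a single log event to an agent; the agent is assumed to be running an Avro source on localhost:41414.

```java
import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeSend {
    public static void main(String[] args) throws EventDeliveryException {
        // Hypothetical Flume agent with an Avro source listening on 41414.
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            Event event = EventBuilder.withBody(
                "app started".getBytes(StandardCharsets.UTF_8));
            client.append(event); // deliver one log event to the agent
        } finally {
            client.close();
        }
    }
}
```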
