2025-01-17 Update From: SLTechnology News&Howtos
Overview of Hadoop ecosystem and version evolution
This article gives a macro-level overview of the components of the Hadoop ecosystem and what each one does, laying a foundation for further study of Hadoop.
Some notable strengths of the Hadoop ecosystem:
1. Open source and free, so there is no cost barrier to adopting it.
2. The community is active and easy to engage with.
3. It covers all aspects of distributed storage and computing.
4. It has a proven track record in enterprise production use.
The difference between Hadoop 1.0 and 2.0
Hadoop has reached its second generation, which is a great improvement over the first:
The main difference is the addition of a new component, YARN, between HDFS and MapReduce.
To start with, the two most important parts of Hadoop are distributed storage (HDFS) and big data computation (MapReduce). In the first generation of Hadoop, the only computing framework that ran on HDFS was MapReduce, which greatly limited its applications. In the second generation, YARN appeared.
YARN can be understood as an extension socket: many computing frameworks can be plugged in and run on top of YARN, which is responsible for the unified management and scheduling of cluster resources.
Introduction to Hadoop (Overview):
1 distributed storage system HDFS (Hadoop Distributed FileSystem)
Distributed storage system
Provides data storage services with high reliability, high scalability and high throughput
2. Resource management system YARN (Yet Another Resource Negotiator):
Responsible for unified management and scheduling of cluster resources
3. Distributed Computing Framework MapReduce
Distributed computing framework
It offers easy programming, high fault tolerance and good scalability.
The simple hierarchical model is shown in the following figure:
HDFS
HDFS features:
Good scalability
High fault tolerance
Suitable for storing massive data at the PB scale and above
Basic principles:
Files are divided into large data blocks and stored across multiple machines.
Data splitting, fault tolerance, load balancing and similar functions are transparent to users (developers need not worry about how they are implemented). HDFS can thus be regarded as a single disk with large capacity and high fault tolerance.
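The block-splitting and replication idea can be sketched in a few lines of Python. This is a conceptual simulation, not the real HDFS API; the block size, replication factor and DataNode names are made up for illustration (HDFS defaults to roughly 128 MB blocks and 3 replicas).

```python
# Conceptual sketch (not the real HDFS API): cut a file into fixed-size
# blocks and assign each block to several machines, the way HDFS does.

BLOCK_SIZE = 4       # tiny for illustration; HDFS defaults to ~128 MB
REPLICATION = 3      # HDFS keeps 3 copies of each block by default

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut file content into consecutive blocks of block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, datanodes, replication: int = REPLICATION):
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    return {
        idx: [datanodes[(idx + r) % len(datanodes)] for r in range(replication)]
        for idx in range(len(blocks))
    }

blocks = split_into_blocks(b"hello hdfs world!")   # 17 bytes -> 5 blocks
placement = place_blocks(blocks, ["dn1", "dn2", "dn3", "dn4"])
```

The point of the sketch is the transparency mentioned above: a client writes one file, and the splitting and placement happen behind the scenes.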
The figure above shows the division of work in HDFS. As you can see, HDFS consists of two main kinds of nodes:
the NameNode (NN) and the DataNode (DN).
An HDFS cluster runs in master-slave mode with two types of nodes: one NameNode (the master) and multiple DataNodes (the slaves). The NameNode manages the file system namespace: it maintains the filesystem tree and the metadata of all files and directories in that tree.
As can be seen above, ZooKeeper acts as a coordinator here, monitoring the status of each node of the HDFS cluster.
We will not go deeper than that here, because the main purpose of this article is to discuss Hadoop from a macro point of view.
YARN:
A new component added in Hadoop 2.0
Responsible for resource management and scheduling of cluster
Enables multiple computing frameworks to run in one cluster
Characteristics of Yarn:
Good scalability and high availability
Unified management and scheduling of multiple types of applications
Comes with a variety of multi-user schedulers, making it suitable for shared cluster environments
The picture above shows more intuitively that, in this cluster, at different times,
Hadoop (here mainly meaning MapReduce), Spark (an in-memory computing framework) and MPI (another computing framework) may each be unable to make full use of the cluster on their own; through YARN they can share resources and complement each other. For example, when MapReduce cannot fully use its resources, spare resources can be allocated to Spark, with YARN coordinating resource scheduling across the cluster.
The figure above shows YARN's working mechanism; again we can see the master-slave mode of operation.
This picture is more intuitive still: YARN acts like a socket board into which a variety of distributed computing frameworks can be plugged,
such as MapReduce for batch processing, Tez for interactive processing, Storm for stream processing, Giraph for graph processing, Spark for in-memory processing, and so on.
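The resource-sharing role described above can be sketched as a toy scheduler. This is a minimal simulation in the spirit of YARN's ResourceManager, not its real API; the class name, the single-resource (memory-only) model and the application names are all invented for illustration.

```python
# Conceptual sketch (not the real YARN API): a minimal first-come,
# first-served resource manager that hands out slices of cluster memory
# so several frameworks can share one cluster.

class SimpleScheduler:
    def __init__(self, total_memory_gb: int):
        self.free = total_memory_gb   # unallocated cluster memory
        self.allocations = {}         # application name -> granted memory

    def request(self, app: str, memory_gb: int) -> bool:
        """Grant the request if enough memory is free, otherwise reject it."""
        if memory_gb <= self.free:
            self.free -= memory_gb
            self.allocations[app] = self.allocations.get(app, 0) + memory_gb
            return True
        return False

    def release(self, app: str):
        """Return an application's memory to the free pool when it finishes."""
        self.free += self.allocations.pop(app, 0)

rm = SimpleScheduler(total_memory_gb=100)
rm.request("mapreduce-job", 60)   # both frameworks share the same cluster,
rm.request("spark-job", 30)       # instead of each needing its own
```

Real YARN schedulers (FIFO, Capacity, Fair) are far richer, but the core idea is the same: one authority grants and reclaims cluster resources across frameworks.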
MapReduce
Good scalability
High fault tolerance
Suitable for offline processing of massive data at the PB scale and above
In the figure above, you can see that a MapReduce job is divided into two parts, Map and Reduce:
First, the user writes the program and the master assigns the work. The data submitted by the user is first split into several blocks, each handed to a Map worker, so the data can be processed in parallel; this stage can be regarded as distributed processing. The intermediate results are first written to local disk; then the workers in the Reduce phase process them, and the final output is written to HDFS. It is worth noting that the input file format, the intermediate result format and the final output format are kept consistent.
This chart shows a MapReduce program that counts the words appearing in a file. The words in the file are first divided into several parts; the Map jobs then process and count each part; the shuffle step regroups the intermediate data; and the job is then handed over to Reduce. By this point the intermediate data has changed: it has been merged by key, so counting over the merged groups yields the final result.
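The word-count flow just described (Split, Map, Shuffle, Reduce) can be sketched in plain Python. This is a conceptual simulation of the dataflow, not the Hadoop API; the sample input chunks are invented for illustration.

```python
# Conceptual sketch of word count: Split -> Map -> Shuffle -> Reduce,
# using plain Python instead of the Hadoop API.

from collections import defaultdict

def map_phase(chunk: str):
    """Map: emit a (word, 1) pair for every word in this chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, merging output from every mapper."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Split the input into chunks, as the framework would:
chunks = ["deer bear river", "car car river", "deer car bear"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
result = reduce_phase(shuffle(mapped))
```

In real Hadoop the mappers and reducers run on different machines and the shuffle moves data over the network, but the logical transformation is exactly this.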
Here are two general maps of the Hadoop ecosystem to compare the differences between 1.0 and 2.0:
Hadoop1.0
Hadoop2.0
A brief description:
Each component's function is shown in the diagram, so I won't repeat them here.
Let's talk about Hive, Pig and Mahout.
First of all, to process data we generally write MapReduce programs over the input, and that code is written in Java. This makes things difficult for people who want to do big data processing but have no Java background.
This is where Hive and Pig come on stage; they can be understood as two translators.
Hive (originally developed at Facebook) provides HQL, an SQL-like language resembling traditional SQL, for writing data-processing code. Through Hive, code written in HQL is automatically transformed into MapReduce programs.
In the same way, Pig provides a language called Pig Latin for writing data-processing programs, which can likewise be converted into MapReduce programs.
Mahout provides a data mining library, which contains a large number of algorithms, including three major categories:
Recommend (Recommendation)
Clustering (Clustering)
Classification (Classification)
In this way, developers do not have to spend a lot of time building these computations themselves, which improves productivity.
Oozie enables developers to schedule repetitive jobs, with e-mail notifications, for work written in various languages and tools such as Java, UNIX Shell, Apache Hive, Apache Pig and Apache Sqoop.
At present, there are a variety of computing frameworks and job types:
MapReduce, Storm, Hive, Pig and others.
This gives rise to the following problems:
1. There are dependencies between different jobs (forming a DAG).
2. Some jobs run periodically.
3. Some jobs need to run at scheduled times.
4. The execution status of jobs needs to be monitored, with alerts (email, SMS).
At this time, Oozie is needed to carry out unified management and scheduling.
HBase:Hadoop 's database
High reliability
High performance
Column oriented
Good scalability
Composition:
Table: similar to a table in a traditional database
ColumnFamily: column family
A Table consists horizontally of one or more Column Families
A Column Family can contain any number of Columns
RowKey: row key
Primary key of Table
Records in Table are sorted by RowKey
TimeStamp: timestamp
Each piece of data carries a timestamp, which serves as its version number
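The data model just listed (row keys, column families, timestamps as versions, rows sorted by RowKey) can be sketched as a tiny in-memory table. This is a conceptual simulation, not the HBase client API; the class name and sample rows are invented for illustration.

```python
# Conceptual sketch (not the HBase API): cells addressed by
# (row key, "family:qualifier"), each carrying a timestamp as its
# version number, with rows kept sorted by row key.

class MiniTable:
    def __init__(self):
        self.rows = {}   # row key -> {"cf:col": [(timestamp, value), ...]}

    def put(self, row_key, column, value, timestamp):
        cells = self.rows.setdefault(row_key, {}).setdefault(column, [])
        cells.append((timestamp, value))
        cells.sort(reverse=True)   # keep the newest version first

    def get(self, row_key, column):
        """Return the newest version of the cell (the default in HBase)."""
        return self.rows[row_key][column][0][1]

    def scan(self):
        """Yield rows in row-key order: records are sorted by RowKey."""
        for key in sorted(self.rows):
            yield key, self.rows[key]

t = MiniTable()
t.put("row2", "info:name", "bob", timestamp=1)
t.put("row1", "info:name", "alice", timestamp=1)
t.put("row1", "info:name", "alice-updated", timestamp=2)   # newer version
```

Note how an update does not overwrite the old value: both versions remain, distinguished by timestamp, and reads return the latest one.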
ZooKeeper: software that provides coordination and consistency services for distributed applications. Its functions include configuration maintenance, naming service, distributed synchronization, group services and so on.
The goal of ZooKeeper is to encapsulate complex and error-prone key services, providing users with simple, easy-to-use interfaces and a system with efficient performance and stable functionality.
Used by: HDFS, YARN, Storm, HBase, Flume, Dubbo (Alibaba), Metaq (Alibaba)
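The configuration-maintenance idea can be sketched as a tiny in-memory znode store with watches. This is a conceptual simulation, not the real ZooKeeper API; the class, paths and one-shot watch behavior are simplified for illustration (real ZooKeeper watches are also one-shot, but its API, sessions and replication are far richer).

```python
# Conceptual sketch (not the ZooKeeper API): a tiny znode store where
# clients can watch a path and get notified when its data changes.

class MiniZooKeeper:
    def __init__(self):
        self.znodes = {}    # path -> data
        self.watches = {}   # path -> callbacks to fire on the next change

    def create(self, path: str, data: str):
        self.znodes[path] = data

    def get(self, path: str, watch=None) -> str:
        """Read a znode; optionally register a one-shot watch on it."""
        if watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return self.znodes[path]

    def set(self, path: str, data: str):
        """Update a znode and fire (then clear) any registered watches."""
        self.znodes[path] = data
        for callback in self.watches.pop(path, []):
            callback(path)

zk = MiniZooKeeper()
zk.create("/config/master", "node-1")
events = []
zk.get("/config/master", watch=events.append)   # watch for a change
zk.set("/config/master", "node-2")              # the watcher is notified
```

This is the pattern HDFS, HBase and the other systems listed above rely on: shared state lives in ZooKeeper, and components react to change notifications rather than polling.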
Sqoop: (data synchronization tool)
The bridge between Hadoop and traditional database
Supports a variety of databases, including MySQL, DB2, etc.
Pluggable, users can support new databases according to their needs
Essentially a MapReduce program:
it makes full use of MapReduce's fault tolerance and distributed parallelism.
Flume (log collection tool):
Distributed system
High reliability
High fault tolerance
Easy to customize and extend