This article focuses on how to understand the principles of the Hadoop architecture; interested readers may wish to take a look. The approach introduced here is simple, fast, and practical. Let's walk through how to understand the principles of the Hadoop architecture.
I. Concept
Hadoop, created in 2006, is an open-source software framework that supports data-intensive distributed applications and is released under the Apache 2.0 license. It supports applications that run on large clusters built from commodity hardware. Hadoop is an implementation based on the MapReduce and Google File System papers published by Google.
Hadoop is a made-up name with no special meaning; it comes from a toy elephant belonging to the creator's son. From a technical point of view, Hadoop is a distributed system infrastructure developed under the Apache Foundation. Its main goal is to process "big data" in a distributed environment in a reliable, efficient, and scalable way.
The Hadoop framework transparently provides reliability and data movement for applications. It implements a programming paradigm called MapReduce: the application is divided into many small parts, each of which can be executed or re-executed on any node in the cluster.
Hadoop also provides a distributed file system to store data across all computing nodes, which gives the entire cluster very high aggregate bandwidth. The design of MapReduce and the distributed file system lets the framework handle node failures automatically, enabling applications to work with thousands of independent computers and petabyte-scale data.
II. Composition
1. Core components of Hadoop
Analysis: the core components of Hadoop are HDFS (the distributed file system), MapReduce (the distributed computing programming framework), and YARN (the computing resource scheduling system).
2. The HDFS file system
HDFS
1. Definition
The Hadoop architecture relies mainly on HDFS (the Hadoop Distributed File System) to provide the underlying support for distributed storage, and on MapReduce to provide support for distributed parallel task processing.
HDFS is the foundation of data storage management in the Hadoop system. It is a highly fault-tolerant system that can detect and respond to hardware failures and is designed to run on low-cost, general-purpose hardware. HDFS simplifies the file consistency model, provides high-throughput data access for applications through streaming data access, and is suitable for applications with large datasets.
2. Composition
HDFS adopts a master/slave structure: an HDFS cluster consists of one NameNode and several DataNodes. The NameNode acts as the primary server, managing the file system namespace and client access to files. The DataNodes manage the stored data. HDFS exposes data in the form of files.
Internally, a file is divided into several blocks, which are stored on a set of DataNodes. The NameNode performs file system namespace operations such as opening, closing, and renaming files or directories, and is also responsible for mapping data blocks to specific DataNodes. The DataNodes handle read and write requests from file system clients, and perform block creation, deletion, and replication under the unified scheduling of the NameNode. The NameNode manages all HDFS metadata, and user data never passes through the NameNode.
Analysis: the NameNode is the manager, the DataNodes are the file stores, and the Client is the application that needs to access the distributed file system.
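To make the Client/NameNode/DataNode interaction concrete, here is a minimal sketch using the standard Hadoop FileSystem Java API. The NameNode address and file path are illustrative assumptions, not part of the original article; in a real deployment they come from the cluster configuration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally supplied by core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        // The client asks the NameNode for metadata; the actual bytes flow to and from DataNodes.
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/demo/hello.txt");

            // Write: the NameNode allocates blocks, the client streams data to DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the NameNode returns block locations, the client reads from DataNodes.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }
}
```

Note that the application code never ships file contents through the NameNode; only namespace and block-location metadata is exchanged with it.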
MapReduce
1. Definition
Hadoop MapReduce is an open-source clone of Google's MapReduce.
MapReduce is a computing model used to process large amounts of data. Map performs a specified operation on each independent element of the dataset and generates intermediate results in the form of key-value pairs. Reduce then combines all the values that share the same key to produce the final result. This functional decomposition makes MapReduce very well suited to data processing in a distributed parallel environment composed of a large number of computers.
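As an illustration of the map and reduce steps described above, here is a minimal sketch of the classic word-count job using the Hadoop MapReduce Java API. The input and output paths are hypothetical placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all counts that share the same key (word).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Hypothetical HDFS paths; replace with real directories.
        FileInputFormat.addInputPath(job, new Path("/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/demo/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Each map task processes one input split in parallel on the node holding the data, and each reduce task receives all intermediate pairs for its share of the keys.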
2. Composition
Analysis:
(1) JobTracker
(1) JobTracker is the job tracker, a very important process that runs on the master node (where the NameNode runs). It is the scheduler of the MapReduce system. This daemon processes jobs (the code submitted by users): it determines which files are involved in a job, splits the job into small tasks, and assigns them to the child nodes where the required data is located.
A principle of Hadoop is to run computation close to the data: the data and the program should be on the same physical node, so the program runs where the data is. JobTracker does this work, and also monitors tasks and restarts failed tasks (on different nodes). Each cluster has only one JobTracker, similar to the single NameNode, and it is located on the master node.
(2) TaskTracker
TaskTracker is the task tracker, the other background process of the MapReduce system. It runs on each slave node, co-located with the DataNode (following the code-near-data principle), and manages the tasks on that node (assigned by the JobTracker).
There is only one TaskTracker per node, but a TaskTracker can launch multiple JVMs to run Map Tasks and Reduce Tasks; it also interacts with the JobTracker and reports task status.
Map Task: parses each data record, passes it to the user-written map() function, executes it, and writes the output to the local disk (or, in the case of a map-only job, directly to HDFS).
Reduce Task: remotely reads its input from the execution results of the Map Tasks, sorts the data, and executes the user-written reduce function on it.
Hive
1. Definition
Hive is a data warehouse tool built on top of Hadoop. It can map structured data files to database tables, provides SQL-style query capabilities, and translates SQL statements into MapReduce jobs for execution.
Hive is a data warehouse infrastructure built on Hadoop. It provides a series of tools that can be used for data extraction, transformation, and loading (ETL), and a mechanism to store, query, and analyze large-scale data stored in Hadoop.
Hive defines a simple SQL-like query language called HQL, which allows users who are familiar with SQL to query data. The language also allows developers who are familiar with MapReduce to plug in custom mappers and reducers to handle complex analytical work that the built-in mappers and reducers cannot do.
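As a sketch of how a familiar SQL-style statement is submitted to Hive, the following assumes a running HiveServer2 at a hypothetical address and a hypothetical page_views table; an aggregation like the one below is typically compiled by Hive into a MapReduce job. The hive-jdbc driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 URL, user, and table name.
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Hive compiles this HQL into a MapReduce job and runs it on the cluster.
            ResultSet rs = stmt.executeQuery(
                "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url");
            while (rs.next()) {
                System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```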
2. Composition
Analysis: the Hive architecture includes the CLI (Command Line Interface), JDBC/ODBC, Thrift Server, WEB GUI, Metastore, and Driver (Compiler, Optimizer, and Executor). These components fall into two categories: server components and client components.
3. Client and server components
(1) Client components:
CLI: Command Line Interface, the command-line interface.
Thrift client: the Thrift client is not listed above as a separate component, but many of Hive's client interfaces are built on top of it, including the JDBC and ODBC interfaces.
WEB GUI: Hive provides a way to access its services through web pages. This interface corresponds to Hive's HWI component (Hive Web Interface); the HWI service must be started before it can be used.
(2) Server components:
Driver component: this component includes the Compiler, Optimizer, and Executor. Its role is to parse, compile, and optimize HiveQL (SQL-like) statements, generate an execution plan, and then call the underlying MapReduce computing framework.
Metastore component: the metadata service component, which stores Hive's metadata. Hive metadata is kept in a relational database; Hive supports relational databases such as Derby and MySQL. Metadata is very important to Hive, so Hive allows the Metastore service to be separated out and installed on a remote server cluster, decoupling the Hive service from the Metastore service and improving Hive's robustness.
Thrift service: Thrift is a software framework developed by Facebook for building scalable, cross-language services. Hive integrates this service, which allows different programming languages to call Hive's interfaces.
4. Similarities and differences between Hive and traditional databases
(1) query language
Because SQL is widely used in data warehouses, a SQL-like query language, HQL, was designed around the characteristics of Hive. Developers who are familiar with SQL can easily use Hive for development.
(2) data storage location
Hive is built on top of Hadoop, and all Hive data is stored in HDFS. The database, on the other hand, can save the data in a block device or a local file system.
(3) data format
Hive does not define a specific data format; the format can be specified by the user. A user-defined data format needs to specify three things: the column delimiter (usually a space, "\t", or "\x001"), the row delimiter ("\n"), and the method for reading file data (Hive has three built-in file formats: TextFile, SequenceFile, and RCFile).
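To illustrate how these delimiters and file formats are declared, here is a minimal sketch that issues the DDL and a load statement through a HiveServer2 JDBC connection. The server URL, table name, columns, and HDFS path are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveTableFormatExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://hiveserver:10000/default"; // hypothetical HiveServer2 URL
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Declare the column and row delimiters explicitly; store the table as plain text files.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS page_views ("
              + "  view_time STRING, user_id BIGINT, url STRING) "
              + "ROW FORMAT DELIMITED "
              + "FIELDS TERMINATED BY '\\t' "   // column delimiter
              + "LINES TERMINATED BY '\\n' "    // row delimiter
              + "STORED AS TEXTFILE");          // TEXTFILE / SEQUENCEFILE / RCFILE
            // Loading simply moves the file into the table's HDFS directory;
            // the data is not parsed or indexed at load time.
            stmt.execute("LOAD DATA INPATH '/demo/page_views.tsv' INTO TABLE page_views");
        }
    }
}
```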
(4) data update
Because Hive is designed for data warehouse applications, where the content is read often and written rarely, Hive does not support rewriting or updating data: all data is determined at load time. The data in a database, by contrast, usually needs to be modified frequently, so you can use INSERT INTO ... VALUES to add data and UPDATE ... SET to modify data.
(5) Index
Hive does not process or even scan the data during loading, so it does not build indexes on any keys in the data. When Hive needs to access specific values that satisfy a condition, it must brute-force scan the entire dataset, so access latency is high. Because MapReduce is introduced, Hive can access the data in parallel, so even without indexes Hive still shows an advantage when accessing large amounts of data. In a database, indexes are usually built on one or more columns, so the database can achieve high efficiency and low latency when accessing small amounts of data under specific conditions. Because of this high data-access latency, Hive is not suitable for online data queries.
(6) execution
Most queries in Hive are executed through the MapReduce framework provided by Hadoop (queries like SELECT * FROM tbl do not require MapReduce). Databases usually have their own execution engines.
(7) delay in execution
When querying data, Hive has to scan the whole table because there are no indexes, so latency is high. Another factor contributing to Hive's high execution latency is the MapReduce framework: MapReduce itself has high latency, so executing Hive queries with MapReduce also incurs high latency. By contrast, database execution latency is low. Of course, this low latency is conditional on the data being small; when the data scale grows beyond what the database can handle, Hive's parallel computing clearly shows its advantage.
(8) scalability
Because Hive is built on top of Hadoop, its scalability is consistent with that of Hadoop (the largest Hadoop cluster in the world was at Yahoo!, with roughly 4,000 nodes in 2009). By contrast, because of the strict constraints of ACID semantics, database scalability is very limited. Even the most advanced parallel database, Oracle, has a theoretical scaling limit of only about 100 nodes.
(9) data scale
Because Hive is built on a cluster and can use MapReduce for parallel computing, it can support large-scale data; correspondingly, the database can support a smaller scale of data.
Hbase
1. Definition
HBase (the Hadoop Database) is a highly reliable, high-performance, column-oriented, and scalable distributed storage system. Using HBase, large-scale structured storage clusters can be built on inexpensive PC servers.
HBase is an open-source implementation of Google Bigtable: just as Google Bigtable uses GFS as its file storage system, HBase uses Hadoop HDFS as its file storage system.
Google runs MapReduce to process the massive data in Bigtable, while HBase uses Hadoop MapReduce to process the massive data in HBase.
Google Bigtable uses Chubby as its coordination service, and HBase uses ZooKeeper as its coordination service.
2. Composition
Analysis: HBase is mainly composed of the Client, ZooKeeper, HMaster, and HRegionServer, with HStore as its storage system.
Client
The HBase Client uses HBase's RPC mechanism to communicate with the HMaster and the HRegionServers. For management operations, the Client performs RPC with the HMaster; for data read and write operations, the Client performs RPC with the HRegionServer.
Zookeeper
In addition to the ZooKeeper Quorum storing the address of the -ROOT- table and the address of the HMaster, each HRegionServer also registers itself in ZooKeeper as an ephemeral node, so that the HMaster can sense the health status of each HRegionServer at any time.
HMaster
HMaster has no single-point-of-failure problem: multiple HMasters can be started in HBase, and ZooKeeper's Master Election mechanism ensures that exactly one Master is always running. Functionally, HMaster is mainly responsible for managing Tables and Regions:
Managing user operations that add, delete, modify, and query Tables
Managing HRegionServer load balancing and adjusting the Region distribution
Assigning the new Regions after a Region Split
Migrating the Regions of a failed HRegionServer after that HRegionServer goes down
HStore is the core of HBase storage. It consists of two parts: the MemStore and the StoreFiles.
The MemStore is a sorted memory buffer; data written by the user is first placed in the MemStore. When the MemStore is full, it is flushed into a StoreFile (the underlying implementation is HFile). When the number of StoreFiles reaches a certain threshold, a Compact (merge) operation is triggered: multiple StoreFiles are merged into one, and version merging and data deletion are carried out during the merge.
Therefore, HBase only ever appends data; all updates and deletions are carried out during the subsequent compaction process. This allows a user's write operation to return as soon as it enters memory, ensuring high HBase I/O performance.
As StoreFiles are compacted, larger and larger StoreFiles are gradually formed. When the size of a single StoreFile exceeds a certain threshold, a Split operation is triggered: the current Region is split into two Regions, the parent Region is taken offline, and the two new child Regions are assigned to the appropriate HRegionServers by the HMaster, so that the load of the original Region is spread across two Regions.
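To show the write path described above from the client side, here is a minimal sketch using the HBase Java client API. The ZooKeeper quorum address, table name, column family, and row key are illustrative assumptions; the table is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The client locates regions through ZooKeeper; hypothetical quorum address.
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_profile"))) {

            // Write: the Put is sent to the HRegionServer hosting this row,
            // buffered in its MemStore, and flushed to a StoreFile later.
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Read: the Get goes directly to the HRegionServer, not through the HMaster.
            Result result = table.get(new Get(Bytes.toBytes("row-001")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```

Note that the HMaster is not on the data path at all; it only manages table metadata and Region assignment, which is why reads and writes stay fast even during Master failover.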
III. Application examples of Hadoop
1. Review the overall architecture of Hadoop
2. Application of Hadoop: a traffic query system
(1) The overall framework of the traffic query system
(2) The overall flow of the traffic query system
(3) The data preprocessing functional framework of the traffic query system
(4) The data preprocessing process of the traffic query system
(5) The functional framework of the traffic query NoSQL database
(6) The functional framework of the traffic query service
(7) The data processing flow chart of real-time streaming computing
At this point, I believe you have a deeper understanding of the principles of the Hadoop architecture. You might as well try it in practice. Keep following along and continue learning!