
What is the core architecture of Hadoop


Shulou (Shulou.com) 06/01 Report --

This article mainly talks about "what is the core architecture of Hadoop". Interested readers may wish to have a look: the approach introduced here is simple, fast and practical. Now let the editor take you through "what is the core architecture of Hadoop".

By introducing the core distributed file system HDFS and the MapReduce processing framework of the Hadoop distributed computing platform, together with the data warehouse tool Hive and the distributed database HBase, this article covers essentially all of the technical cores of the Hadoop distributed computing platform.

Based on the investigation and summary carried out at this stage, this article analyzes in detail, from the perspective of internal mechanisms, how HDFS, MapReduce, HBase and Hive run, as well as how a data warehouse is built on Hadoop and how the distributed database is implemented internally. Any deficiencies will be corrected in a timely manner.

Architecture of HDFS

The Hadoop architecture as a whole relies on HDFS to provide the underlying support for distributed storage, and on MapReduce (MR) to provide the programming support for distributed parallel task processing.

HDFS adopts a master/slave structure. An HDFS cluster is composed of one NameNode and several DataNodes (support for multiple NameNodes was introduced in Hadoop 2.2; before that, some large companies implemented this themselves by modifying the Hadoop source code). The NameNode acts as the master server, managing the file system namespace and client access to files. DataNodes manage the data stored on them. HDFS exposes data in the form of files.

Internally, a file is divided into several blocks, which are stored on a set of DataNodes. The NameNode executes file system namespace operations, such as opening, closing, and renaming files or directories, and is also responsible for mapping data blocks to specific DataNodes. DataNodes handle read and write requests from file system clients and, under the unified scheduling of the NameNode, create, delete and replicate data blocks. The NameNode is the manager of all HDFS metadata; user data never flows through the NameNode.

(Figure: HDFS architecture diagram)

Three roles appear in the figure: NameNode, DataNode, and Client. The NameNode is the manager, DataNodes store the files, and the Client is the application that needs to access the distributed file system.

File write:

1) The Client initiates a file write request to the NameNode.

2) The NameNode, based on the file size and the block configuration, returns to the Client information about the DataNodes it manages.

3) The Client divides the file into multiple blocks and writes them to the DataNodes in sequence, according to the DataNode addresses returned.

File read:

1) The Client initiates a file read request to the NameNode.

2) The NameNode returns information about the DataNodes that store the file's blocks.

3) The Client reads the file data from those DataNodes (a minimal client sketch follows below).
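To make these steps concrete, the following minimal sketch uses the standard Hadoop FileSystem API to write and then read a file. The NameNode address (hdfs://namenode:9000) and the file path are illustrative assumptions, not values from this article.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/demo/hello.txt"); // illustrative path

            // Write: the client asks the NameNode for target DataNodes,
            // then streams blocks to them through this output stream.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the client asks the NameNode for block locations,
            // then reads the blocks directly from the DataNodes.
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
        }
    }
}
```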

As a distributed file system, HDFS's approach to data management is worth noting:

File block placement: a Block has three replicas: one on the DataNode specified by the NameNode, one on a DataNode that is not on the same rack as the specified DataNode, and one on a DataNode on the same rack as the specified DataNode. The purpose of replication is data safety; this placement takes into account both the failure of an entire rack and the performance cost of copying data between different racks.
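As a small client-side illustration of that replication policy, the sketch below sets and inspects a file's replication factor through the FileSystem API. The file path is an illustrative assumption, and the actual placement of the three replicas is still decided by the NameNode.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3"); // default replication for newly written files

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/demo/hello.txt"); // assumed existing file

            // Ask HDFS to keep three replicas of this file;
            // the NameNode decides the actual node/rack placement.
            fs.setReplication(file, (short) 3);

            FileStatus status = fs.getFileStatus(file);
            System.out.println("replication = " + status.getReplication());
        }
    }
}
```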

MapReduce architecture

The MR framework consists of a single JobTracker running on the master node and a TaskTracker running on each slave node of the cluster. The master node is responsible for scheduling all of the tasks that make up a job; these tasks are distributed across different slave nodes. The master node monitors their execution and re-executes tasks that have previously failed, while the slave nodes are only responsible for the tasks assigned to them by the master. When a Job is submitted, the JobTracker receives the job and its configuration information, distributes the configuration to the slave nodes, schedules the tasks, and monitors the TaskTrackers' execution.

The JobTracker can run on any computer in the cluster. A TaskTracker is responsible for executing tasks and must run on a DataNode, so a DataNode is both a storage node and a compute node. The JobTracker distributes map and reduce tasks to idle TaskTrackers, which run them in parallel while the JobTracker monitors their progress. If a TaskTracker fails, the JobTracker transfers its tasks to another idle TaskTracker to run again.

HDFS and MR together constitute the core of Hadoop's distributed system architecture. HDFS implements the distributed file system on the cluster, while MR implements distributed computing and task processing on top of it. HDFS provides the file operation and storage support needed during MR task processing; MR, on the basis of HDFS, handles task distribution, tracking and execution, and collects the results. The two interact to complete the main work of the distributed cluster.

Parallel application development on Hadoop is based on the MR programming framework. The principle of the MR programming model is to take an input set of key-value pairs and produce an output set of key-value pairs. The MR library implements this model through two functions, Map and Reduce. A user-defined map function accepts an input key-value pair and produces a set of intermediate key-value pairs. MR groups together all intermediate values associated with the same key and passes them to a reduce function. The reduce function accepts a key and its associated set of values and merges those values to form a smaller set of values. The intermediate values are usually supplied to the reduce function through an iterator (whose purpose is to collect the values), which makes it possible to handle collections of values too large to fit entirely in memory.
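The canonical example of this model is word counting. The sketch below is a minimal WordCount, shown here only to illustrate the map and reduce functions just described; it is not code from the original article.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // map: (offset, line) -> a set of intermediate (word, 1) pairs
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce: (word, [1, 1, ...]) -> (word, total), iterating over the grouped values
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```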

In short, a large data set is split into many small blocks, each of which is processed on a node in the cluster to produce intermediate results. For the task on a single node, the map function reads the data line by line as (k1, v1) pairs; the output enters a buffer, where it is sorted by key (the framework sorts the output of map), producing intermediate (k2, v2) pairs. Every machine performs the same operation. The sorted outputs of the different machines are then merged (the shuffle process, which can be understood as the step before reduce) and combined by reduce into (k3, v3) pairs, which are written out to HDFS files.

Before reduce runs, the intermediate data can be combined: pairs in the intermediate output that share the same key are merged. The Combine process is similar to reduce, but Combine is part of the map task and runs only after the map function has finished. Combine reduces the number of intermediate key-value pairs and therefore reduces network traffic.

The intermediate results of a map task are saved as files on the local disk after Combine and Partition. The locations of these intermediate files are reported to the master JobTracker, which then tells each reduce task which DataNodes to fetch its intermediate results from. The intermediate results produced by all map tasks are divided by a hash function on their keys into R partitions, and each of the R reduce tasks is responsible for one key range. Each reduce task fetches, from the many map task nodes, the intermediate results that fall within its key range and then executes the reduce function to form a final result. If there are R reduce tasks, there will be R final results. In many cases these R results do not need to be merged into one, because they can serve as the input to another parallel computing task. This is why the figure above shows multiple pieces of output data (HDFS replicas).
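To show how these pieces fit together in practice, here is a sketch of a job driver that wires the WordCount classes above with a Combiner, a hash Partitioner, and R = 4 reduce tasks. The input and output paths, and the choice of R, are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        // Combiner: a local "mini reduce" on each map node, cutting network traffic.
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);

        // Partition intermediate keys by hash into R = 4 partitions,
        // one per reduce task; each reduce task produces one output file.
        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(4);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/demo/input"));    // assumed input dir
        FileOutputFormat.setOutputPath(job, new Path("/demo/output")); // assumed output dir

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```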

HBase data management

HBase is the Hadoop database. How does it differ from traditional MySQL and Oracle? In other words, what is the difference between column-oriented and row-oriented data, and how do NoSQL databases differ from traditional relational databases?

HBase vs. Oracle

1. HBase is suited to workloads with a large number of inserts and, at the same time, many reads: given a key, retrieve a value, or given a set of keys, retrieve a set of values.

2. HBase's bottleneck is hard disk transfer speed. HBase can insert data and it can also update data, but an update is really an insert: it just inserts a new row with a new timestamp. Deleting data is also an insert, just an inserted row carrying a delete marker. All HBase operations are append-style inserts. HBase is a log-structured database, stored much like log files: data is written to disk in bulk, usually as files, so read and write speed depends on how fast data can be transferred between the disk and the machine. Oracle's bottleneck, by contrast, is disk seek time. Its operations involve frequent random reads and writes: to update a piece of data, it must first find the block on disk, read it into memory, modify it in the in-memory cache, and write it back later. Because the block being sought differs each time, the reads are random. Disk seek time is mainly determined by rotational speed, and seek technology has barely improved, so seek time has become the bottleneck.

3. Data in HBase can be kept in many versions with different timestamps (that is, many versions of the same data may be retained; this data redundancy is allowed and is in fact an advantage). The data is sorted by time, so HBase is particularly suitable for finding the Top N items in chronological order: for example, someone's most recently browsed news, their most recent N blog posts, or their N most recent actions. This is why HBase is widely used on the Internet (a client sketch appears at the end of this section).

4. The limitations of HBase. It supports only very simple key-value queries, and it suits the extreme scenario of high-speed inserts combined with a large volume of reads. Not every company has this kind of requirement; in many companies the workload is ordinary OLTP (online transaction processing) with random reads and writes, and in that case Oracle's reliability and the robustness of the overall system far exceed HBase's. Another limitation is that HBase has only a primary key (row key) index, which makes modeling difficult: if you want to query many columns of a table under certain conditions, you can still only build fast queries on the primary key. So it cannot be said in general terms that one technology is superior to the other.

5. Oracle is a row-oriented database, while HBase is column-oriented. The advantage of a column-oriented database shows up in data analysis scenarios. Data analysis differs from traditional OLTP: in data analysis, a column is often used as the query condition, and the result returned is often a few columns rather than all of them. In that case a row-oriented database responds very inefficiently.

Row-oriented database: taking Oracle as an example, the basic unit of a data file is the block/page, and the data within a block is written row by row. The problem is that when we want to read only certain columns within a block, we cannot read just those columns; we must read the whole block into memory and then extract the contents of those columns. In other words, to read some of a table's columns, you have to read all the rows of the entire table. This is the worst aspect of a row-oriented database.

Column-oriented database: columns are the unit of storage, and elements of the same column are packed into one block. When you want to read certain columns, you only need to read the relevant column blocks into memory, so far less IO is required. In addition, the data elements of the same column usually have a similar format, which means the data can be compressed very effectively. Column-oriented databases therefore have a great advantage in data compression, which saves both storage space and IO. (At that point you can exploit such optimizations across data queries to improve performance, depending on the scenario, once the data reaches the millions or tens of millions of records.)
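Before moving on to Hive, here is a minimal sketch of the key-value and versioned access described in points 1-3 above, using the HBase 2.x Java client. The table name, row key, and column names are illustrative assumptions, and the table is assumed to already exist with a column family cf configured to keep multiple versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseVersionsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_actions"))) { // assumed table

            // "Update" is really an append: each put writes a new cell with a new
            // timestamp, and older versions remain until compaction removes them.
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("last_page"), Bytes.toBytes("/news/123"));
            table.put(put);

            // Read back the most recent versions, newest first by timestamp.
            Get get = new Get(Bytes.toBytes("user-42"));
            get.readVersions(3); // HBase 2.x API; older clients use setMaxVersions(3)
            Result result = table.get(get);
            result.getColumnCells(Bytes.toBytes("cf"), Bytes.toBytes("last_page"))
                  .forEach(cell -> System.out.println(cell.getTimestamp()));
        }
    }
}
```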

Hive data management

Hive is a data warehouse infrastructure built on top of Hadoop. It provides a series of tools for extracting, transforming and loading data, and it is a mechanism for storing, querying and analyzing large-scale data kept in Hadoop. Structured data files in Hadoop can be mapped to tables in Hive, which provides sql-like query capabilities; apart from not supporting updates, indexes and transactions, it supports almost everything else in sql. It can translate sql statements into MapReduce tasks and run them, acting as a sql-to-MapReduce mapper, and it provides shell, JDBC/ODBC, Thrift, Web and other interfaces. Its advantage is low cost: simple MapReduce statistics can be produced quickly through sql-like statements. As a data warehouse, Hive's data management can be introduced from three aspects, according to how it is used: metadata storage, data storage and data exchange.
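As a small illustration of the sql-to-MapReduce idea, the sketch below runs a simple aggregation through the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver). The HiveServer2 address, the credentials, and the page_views table are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Assumed HiveServer2 endpoint and database.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // A simple group-by: Hive compiles this into a MapReduce job.
             ResultSet rs = stmt.executeQuery(
                 "SELECT city, COUNT(*) FROM page_views GROUP BY city")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```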

(1) metadata storage

Hive stores its metadata in an RDBMS, and there are three ways to connect to that database:

Embedded mode: metadata is kept in the embedded Derby database. This mode is generally used for unit testing and allows only one session connection at a time.

Multi-user mode: install MySQL locally and store the metadata in MySQL.

Remote mode: metadata is stored in a remote MySQL database.

(2) data storage

First of all, Hive has no special data storage format and does not build indexes on the data, which lets you organize tables in Hive very freely. You only need to tell Hive the column delimiter and row delimiter used in your data when you create a table, and Hive can then parse the data.

Second, all the data in Hive is stored in HDFS, and Hive has four data models: Table, External Table, Partition, and Bucket.

Table: similar to a table in a traditional database. Each Table has a corresponding directory in Hive where its data is stored. For example, a table zz has the HDFS path /wh/zz, where wh is the data warehouse directory specified in hive-site.xml; all Table data (excluding External Table data) is stored under this directory.

Partition: similar to partitioning columns by index in a traditional database. In Hive, each Partition of a table corresponds to a directory under the table's directory, and all of that Partition's data is stored there. For example, if the zz table has the two Partitions ds and city, then the HDFS subdirectory corresponding to ds=20140214, city=Beijing is /wh/zz/ds=20140214/city=Beijing.

Buckets: data is split according to the hash value calculated on a specified column, in order to facilitate parallelism; each Bucket corresponds to one file. For example, to spread the user column across 32 Buckets, the hash of the user column's value is computed first: the HDFS directory corresponding to hash=0 is /wh/zz/ds=20140214/city=Beijing/part-00000, and the directory corresponding to hash=20 is /wh/zz/ds=20140214/city=Beijing/part-00020. (A DDL sketch combining Table, Partition and Bucket follows the data models below.)

External Table: points to data that already exists in HDFS, and Partitions can be created for it. Its metadata has the same structure as a Table's, but the actual storage differs considerably. For a Table, creation and data loading can be done in a single statement; the actual data is moved into the data warehouse directory, and subsequent access goes directly to that directory. When the table is deleted, both its data and its metadata are deleted. An External Table involves only one step, because loading the data and creating the table happen at the same time: the external data stays in the HDFS path specified by LOCATION and is not moved into the data warehouse.
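Tying the data models above back to the zz example, the sketch below issues an illustrative CREATE TABLE through the Hive JDBC driver, declaring the delimiters at creation time, the ds/city partitions, and 32 buckets on a user_id column (called user in the article). The connection URL and column names are assumptions; the layout mirrors the article's example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveTableLayoutSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", ""); // assumed endpoint
             Statement stmt = conn.createStatement()) {

            // Table data lands under the warehouse directory, e.g. /wh/zz;
            // each (ds, city) partition becomes a subdirectory such as
            // /wh/zz/ds=20140214/city=Beijing, and each bucket a file part-000NN.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS zz (user_id STRING, page STRING) " +
                "PARTITIONED BY (ds STRING, city STRING) " +
                "CLUSTERED BY (user_id) INTO 32 BUCKETS " +
                "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'");
        }
    }
}
```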

(3) data exchange

User interface: including client, Web interface and database interface

Metadata storage: usually stored in a relational database, such as MySQL, Derby, etc.

Hadoop: use HDFS for storage and MapReduce for calculation.

Key point: Hive stores metadata in databases such as MySQL and Derby. The metadata in Hive includes the table name, the table's columns and partitions and their properties, the table's own properties (such as whether it is an external table), the directory where the table's data resides, and so on.

The data of Hive is stored in HDFS, and most of the queries are done by MapReduce.

At this point, I believe you have a deeper understanding of "what is the core architecture of Hadoop". You might as well try it out in practice. For more related content, follow us and keep learning!
