
The implementation principle and basic usage of Hadoop


There is plenty of material online about installing and deploying Hadoop, so this article does not cover installation and deployment. Instead, it focuses on the basic principles behind Hadoop's implementation, so that the rest of the Hadoop ecosystem will be easier to pick up later.

What is Hadoop?

First, we need to know what Hadoop is. According to the official description, Hadoop is a distributed system infrastructure developed by the Apache Foundation that provides highly available, highly scalable, efficient, and low-cost services.

The services provided by Hadoop include HDFS, MapReduce, and YARN: HDFS is used for massive data storage, MapReduce for massive data analysis and computation, and YARN for resource management and scheduling.

Hadoop's high availability, high scalability, and low cost come from these three services. Hadoop is suited to processing large volumes of data; it is not recommended when the data volume is small.

The Hadoop ecosystem includes a number of commonly used components, of which HDFS, YARN, and MapReduce are the core services provided by Hadoop itself; the others are applications built on top of Hadoop.

HDFS introduction

HDFS is the Hadoop Distributed File System, which is used to store huge amounts of data. HDFS maintains a virtual file system through which users manage the stored data. HDFS can run on ordinary PCs, which is one reason the cost of processing big data with Hadoop is relatively low. When a file is stored in HDFS, Hadoop keeps multiple copies of it according to the configured replication factor, so that if one machine goes down, access to the file is not affected and high availability is achieved.

HDFS mainly provides two services: NameNode and DataNode. The NameNode receives client requests and stores the metadata; the DataNode is where the data is actually stored. After HDFS starts successfully, running the jps command shows that the NameNode and DataNode services are running, as shown in the following figure.

The metadata in the NameNode includes the virtual directory of an HDFS file, the number of file replicas, the names of the file's blocks, and the machines on which each block resides. For example, the metadata /test/a.log, 2, {blk_1, blk_2}, [{blk_1: [10.2.200.1]}, {blk_2: [10.2.200.2]}] indicates that the file a.log exists under the /test directory in HDFS and is split into two blocks, blk_1 and blk_2 (with no additional replicas); blk_1 is stored on machine 10.2.200.1 and blk_2 on 10.2.200.2. The metadata file is saved on the machine where the NameNode runs, with a file name of the form fsimage_*.
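To see this block-to-machine mapping from a client program, one option (a minimal sketch, assuming an already obtained org.apache.hadoop.fs.FileSystem handle fs and an existing file) is to ask the NameNode for a file's block locations:

// query the NameNode for the block locations of /test/a.log
Path path = new Path("/test/a.log");
FileStatus status = fs.getFileStatus(path);
BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
for (BlockLocation block : blocks) {
    // each entry gives the byte range of a block and the DataNodes that hold it
    System.out.println(block.getOffset() + "-" + (block.getOffset() + block.getLength())
            + " on " + String.join(",", block.getHosts()));
}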

The DataNode is where file data is actually stored. A DataNode receives requests from clients and completes the reading or writing of files. HDFS splits large files into blocks before storing them; by default, the block size is 128 MB.
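Both the block size and the replication factor are ordinary HDFS configuration values (dfs.blocksize and dfs.replication). A minimal sketch of setting and inspecting them from a client program, using classes from org.apache.hadoop.conf and org.apache.hadoop.fs (the concrete values are only examples):

// ask for a 128 MB block size and 2 replicas for files created by this client
Configuration conf = new Configuration();
conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
conf.setInt("dfs.replication", 2);

FileSystem fs = FileSystem.get(conf);
// print the defaults this client will actually use
System.out.println("block size: " + fs.getDefaultBlockSize(new Path("/")));
System.out.println("replication: " + fs.getDefaultReplication(new Path("/")));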

The approximate flow of HDFS processing client requests is shown in the following figure:

The process of uploading a file to Hadoop (write) is:

1. The client asks the NameNode to upload the file. The NameNode maintains the virtual directory information, calculates how many blocks are needed based on the file size, and assigns DataNode nodes.
2. The NameNode returns the block information and the assigned DataNode nodes to the client.
3. The client communicates with the assigned DataNode nodes and uploads the file blocks to them. The actual location where a DataNode stores the data on disk is configured through dfs.datanode.data.dir.

The process of downloading a file from Hadoop (read) is:

1. The client asks the NameNode to download the file.
2. The NameNode returns the metadata for the virtual directory path provided.
3. The client downloads the file blocks from the specific DataNode nodes according to the returned metadata.

Principle of HDFS implementation

The HDFS services communicate between server and client through RPC calls. Specifically, both the DataNode and the NameNode expose a socket service, and the client and server implement the same interface; following the methods defined by that interface, the client invokes them through Java dynamic proxies. (If you know Dubbo, this will be easy to understand; the mechanism is similar.)

HDFS's RPC framework is used as follows:

// start the service
RPC.Builder builder = new RPC.Builder(new Configuration());
// set the bind address, port, protocol, and implementation instance of the service
builder.setBindAddress("hadoop").setPort(18080).setProtocol(ILogin.class).setInstance(new LoginImpl());
// build and start the server
RPC.Server server = builder.build();
server.start();

// client: obtain a proxy for the interface and invoke the remote method
ILogin login = RPC.getProxy(ILogin.class, 1L, new InetSocketAddress("192.168.1.1", 18080), new Configuration());
String returnMsg = login.login("tom");
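The ILogin interface and LoginImpl class used above are not shown in the original article; a minimal sketch of what they might look like (the method name, version number, and return value are assumptions for illustration):

public interface ILogin {
    // by convention, a Hadoop RPC protocol interface declares a version number
    // that matches the client version passed to RPC.getProxy (1L above)
    long versionID = 1L;

    String login(String name);
}

public class LoginImpl implements ILogin {
    @Override
    public String login(String name) {
        // trivial implementation for demonstration only
        return "Welcome, " + name;
    }
}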

HDFS related API and shell

HDFS provides a Java API for operating on HDFS programmatically. Through the HDFS API, we can upload, download, delete, modify, and compress files on HDFS, among other operations.

// download a file: open the HDFS file and copy it to a local file
FSDataInputStream inputStream = fs.open(new Path("/test.txt"));
FileOutputStream fio = new FileOutputStream("/home/hadoop/hdfs.txt");
IOUtils.copy(inputStream, fio);

// upload a file: create the HDFS file and copy a local file into it
FSDataOutputStream os = fs.create(new Path("/test.txt"));
FileInputStream is = new FileInputStream("/home/hadoop/hdfs.txt");
IOUtils.copy(is, os);

// delete a file or directory (the second argument enables recursive deletion)
fs.delete(new Path("/testHadoop"), true);
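The snippet above assumes an existing FileSystem handle fs. One way to obtain it (the NameNode address and user name here are placeholders, not values from this article):

// build an HDFS client pointing at the NameNode, acting as user "hadoop"
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(new URI("hdfs://192.168.1.1:9000"), conf, "hadoop");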

Frequently used HDFS shell commands include:

hadoop fs -ls /hadoop                 # list the contents of the /hadoop directory
hadoop fs -mkdir /test                # create the directory /test
hadoop fs -put ./test.txt /test       # upload a file to HDFS (or: hadoop fs -copyFromLocal ./test.txt /test)
hadoop fs -get /test/test.txt         # download a file (or: hadoop fs -copyToLocal /test/test.txt)
hadoop fs -cp /test/test.txt /test1   # copy a file
hadoop fs -rm /test1/test.txt         # delete a file
hadoop fs -mv /test/test.txt /test1   # move or rename a file

MapReduce introduction

In Hadoop, HDFS stores the data, MapReduce processes it, and running a MapReduce program depends on YARN to allocate resources. A MapReduce program consists mainly of two phases, Map and Reduce; as long as the client implements a map() function and a reduce() function, distributed computing can be achieved.

The map task works as follows: it reads the file line by line, processes each line according to the specific business logic, and outputs the processing results as key-value pairs.

The reduce task works as follows: before the reduce operation there is a shuffle step, which mainly merges and sorts the map outputs; reduce then processes the key-value pairs output by the map phase and saves its results, also as key-value pairs, to a file.

The task submission process of MapReduce is roughly shown in the following figure:

MapReduce Java API

MapReduce provides a Java API. Based on this API, we can implement MapReduce programs and process big data according to actual business requirements. An example of using the API is as follows:

// map function handler
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    // convert the contents of the line to a string
    String strValue = value.toString();
    String[] words = StringUtils.split(strValue, ' ');
    // traverse the array and output the data (in key-value form)
    for (String word : words) {
        context.write(new Text(word), new LongWritable(1));
    }
}

// reduce function handler
@Override
protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
    int count = 0;
    for (LongWritable value : values) {
        count += value.get();
    }
    context.write(key, new LongWritable(count));
}

// set the mapper and reducer classes used by Map and Reduce
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReduce.class);
// set the output types of the reducer
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
// set the output types of the mapper
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
// set the path of the data processed by map
FileInputFormat.setInputPaths(job, new Path("/upload/"));
// set the output path of the data processed by reduce
// (note: do not create the wordcount folder yourself; the program creates it automatically and reports an error if it already exists)
FileOutputFormat.setOutputPath(job, new Path("/wordcount/"));
job.waitForCompletion(true);
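The driver code above assumes an already created job object. A minimal sketch of how such a Job might be constructed (the class name WordCountRunner is an assumption, not from the original article):

// create the job from a Hadoop configuration and point it at the jar containing the mapper/reducer
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(WordCountRunner.class);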

Introduction to Hive

From the introduction above, we know that processing big data requires writing MapReduce programs, which imposes coding requirements on whoever processes the data. Hive appeared in order to simplify big data processing; in essence, Hive is a wrapper around MapReduce. Hive parses a SQL statement, turns the relevant parts of the statement into input parameters for pre-built MapReduce tasks, and invokes those tasks to complete the data processing. Processing data by writing SQL frees developers from hand-coding MapReduce, which greatly reduces the difficulty of data processing and improves development efficiency.

Creating a Hive database creates a folder on HDFS; creating a table creates a subfolder under the folder corresponding to the database, with the table name as the folder name; a partition of a table is a subfolder under the table's folder; and a bucketed table splits a large file into several smaller files according to certain rules. Executing Hive SQL is therefore the process of processing the files under the folders corresponding to the relevant tables.
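As a rough illustration (the warehouse path and the database, table, and partition names below are hypothetical examples, not taken from this article), a database db1 containing a table t1 partitioned by dt would typically appear on HDFS as a nested folder structure:

/user/hive/warehouse/db1.db                            # database folder
/user/hive/warehouse/db1.db/t1                         # table folder
/user/hive/warehouse/db1.db/t1/dt=2020-01-01           # partition subfolder
/user/hive/warehouse/db1.db/t1/dt=2020-01-01/000000_0  # data file read when the table is queried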

One additional note: Hive actually processes data by running MapReduce, and MapReduce execution involves a large number of disk reads and writes, so processing big data with Hive is relatively slow. Spark emerged to address this problem. The principle by which Spark processes data is broadly similar to MapReduce, but Spark interacts with the disk far less and does most of its computation in memory, so Spark processes data much more efficiently than MapReduce.

Introduction to HBase

HBase can be understood as a NoSQL database built on top of HDFS. HBase stores large amounts of data as key-value pairs, and its data files are saved on HDFS. An HBase table is split in the row direction into multiple HRegions; each region is identified by [startKey, endKey], and regions are distributed across RegionServers.
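As a brief illustration of this key-value access model (a minimal sketch: the table name, column family, and values are assumptions, the table is presumed to already exist, and the classes come from org.apache.hadoop.hbase and org.apache.hadoop.hbase.client):

// connect to HBase using the configuration found on the classpath
Configuration conf = HBaseConfiguration.create();
Connection conn = ConnectionFactory.createConnection(conf);
Table table = conn.getTable(TableName.valueOf("test"));

// write one cell: row key "row1", column family "cf", qualifier "col", value "hello"
Put put = new Put(Bytes.toBytes("row1"));
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("hello"));
table.put(put);

// read the row back by its key
Result result = table.get(new Get(Bytes.toBytes("row1")));
byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));

table.close();
conn.close();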

