1. Related concepts
1.1 Big data
Big data is not just a concept but also a technology: broadly, the techniques for performing all kinds of data analysis on top of big data platforms, of which Hadoop is the representative framework.
Big data covers the basic frameworks represented by Hadoop and Spark, as well as real-time data processing, offline data processing, data analysis, data mining, and predictive analysis with machine learning algorithms.
1.2 Hadoop
Hadoop is an open source big data framework and a distributed computing solution.
Hadoop has two cores, which solve the data storage problem (HDFS, the distributed file system) and the distributed computing problem (MapReduce).
Example 1: a user wants to read data under a certain path, and that data is stored on many machines. The user does not have to care which machine the data is on; HDFS handles this automatically.
Example 2: suppose you want to filter out the lines containing the string "Hadoop" from a 100 PB file. In this scenario, HDFS distributed storage breaks through the limit of a single server's disk size and solves the problem that one machine cannot store such a large file, while MapReduce distributed computing first splits the large job into slices that are computed separately and finally aggregates and outputs the results.
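As a concrete illustration of Example 2 (a minimal sketch, not taken from the article: the paths, file names, and jar version are assumptions), the grep program bundled with the Hadoop examples jar scans the input blocks in parallel map tasks and aggregates the match counts in the reduce phase:

# Upload some hypothetical text files and count occurrences of the string "Hadoop".
hdfs dfs -mkdir -p /input
hdfs dfs -put ./logs/*.txt /input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar \
    grep /input /output-grep 'Hadoop'
hdfs dfs -cat /output-grep/part-r-00000    # one line per match count, e.g. "12345  Hadoop"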
2. Characteristics of Hadoop
Advantages
1. Large files are supported. Files stored in HDFS can support data at the TB and PB levels.
2. Hardware failures are detected and handled quickly. Data is replicated, and the NameNode uses a heartbeat mechanism to detect whether each DataNode is still alive.
3. High scalability. Clusters can be built on inexpensive machines and scaled out linearly (horizontally); when a new node joins the cluster, the NameNode notices it and can distribute and replicate data to it.
4. Mature ecosystem. Thanks to open source, many tools have grown up around Hadoop.
Disadvantages
1. Low latency cannot be achieved. Hadoop is optimized for high data throughput at the expense of data access latency.
2. It is not suitable for a large number of small files.
3. The efficiency of file modification is low. HDFS is suitable for scenarios where you write once and read multiple times.
3. Introduction to HDFS
3.1 Analysis of the HDFS framework
HDFS has a master/slave architecture. It is mainly composed of the NameNode, the Secondary NameNode, and the DataNodes.
NameNode
Manages the HDFS namespace and the block mappings; this is where the metadata and the file-to-block mappings are stored.
If the NameNode goes down, files can no longer be reassembled from their blocks. What can be done? What fault-tolerance mechanisms are there?
Hadoop can be configured as an HA (highly available) cluster. Such a cluster has two NameNode nodes: an active master node and a standby backup node whose data is kept consistent with the active node at all times. When the primary node becomes unavailable, the standby node takes over automatically and users notice nothing, which avoids the NameNode single point of failure.
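As a quick illustration (assuming an already configured HA cluster whose two NameNodes are named nn1 and nn2; these names are placeholders), the state of the NameNodes can be inspected and a manual failover triggered with the hdfs haadmin tool:

hdfs haadmin -getServiceState nn1     # prints "active" or "standby"
hdfs haadmin -getServiceState nn2
hdfs haadmin -failover nn1 nn2        # hand the active role over from nn1 to nn2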
Secondary NameNode
Assists the NameNode, shares part of the NameNode's work, and can help restore the NameNode in an emergency.
DataNode
The slave nodes that actually store the data; they read and write data blocks and report their storage information to the NameNode.
3.2 Reading and writing HDFS files
Files are stored on DataNodes as blocks: the block, not the whole file, is the abstract unit of storage and transfer.
Why should files be stored in blocks?
First, blocks hide the notion of a file and simplify the design of the storage system: a 100 TB file, for example, is larger than any single disk, so it has to be split into multiple data blocks and stored across multiple disks. Second, blocks are well suited to replication, which improves the fault tolerance and availability of the data.
How do you consider the block size setting?
If the block size is set too small, an ordinary file is split into many blocks, so a read has to visit many block addresses, which is inefficient and consumes more NameNode memory; if the block size is set too large, parallelism suffers, and when the system has to reload data after a restart, the larger the blocks, the longer recovery takes.
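A small sketch of how the block size plays out in practice (file and directory names are made up): the block size can be overridden per file at upload time, and fsck shows how the file was split into blocks.

hdfs dfs -mkdir -p /data
# dfs.blocksize is given in bytes; 268435456 = 256 MB.
hdfs dfs -D dfs.blocksize=268435456 -put ./big.log /data/big.log
hdfs fsck /data/big.log -files -blocks    # lists every block of the file and its size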
3.2.1 HDFS file reading process
1. The client queries the NameNode for metadata (the DataNode nodes where each block is located) and finds the DataNode servers holding the file's blocks.
2. It selects a DataNode server (nearest first, then random) and requests a socket stream.
3. The DataNode starts sending data (reading it from disk into the stream and verifying it packet by packet).
4. The client receives the data packet by packet, caches it locally, and then writes it to the target file; each subsequent block is effectively appended to the previous one until the complete file has been assembled.
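All of these steps happen transparently behind an ordinary read; a minimal sketch with an assumed path:

hdfs dfs -cat /user/admin/input/README.txt | head -n 5          # stream the first lines of a file
hdfs dfs -get /user/admin/input/README.txt ./local-copy.txt     # copy the whole file to local disk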
3.2.2 HDFS file writing process
1. The client asks the NameNode to upload a file; the NameNode checks whether the target file already exists and whether its parent directory exists.
2. The NameNode replies that the upload may proceed.
3. The client first splits the file into blocks. For example, with a 128 MB block size, a 300 MB file is divided into three blocks of 128 MB, 128 MB, and 44 MB. The client then asks which DataNode servers the first block should be transferred to.
4. The NameNode returns a list of DataNode servers.
5. The client asks the first DataNode to accept the upload; that DataNode calls the second, the second calls the third, and so on until the whole pipeline is established and acknowledged back to the client step by step.
6. The client starts uploading the first block to the first DataNode (which verifies the data as it is written); the first DataNode forwards it to the second, and the second to the third.
7. When one block has been transferred, the client again asks the NameNode for the DataNode servers for the next block.
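The whole pipeline is driven by a single client command; a minimal sketch with assumed file names and paths:

hdfs dfs -mkdir -p /user/admin/data
hdfs dfs -put ./300MB-file.dat /user/admin/data/
# Show how the file was stored: size in bytes, block size, and replication factor.
hdfs dfs -stat "size %b, block size %o, replication %r" /user/admin/data/300MB-file.dat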
4. Introduction to MapReduce
4.1 Concept
MapReduce is a programming model, a programming method, and an abstract framework built on the idea of divide and conquer. The core steps of the MapReduce framework fall into two parts, Map and Reduce: each file split is processed by a separate machine, which is the Map step, and the results from all the machines are then aggregated into the final result, which is the Reduce step.
4.2 Workflow
When a computing job is submitted to the MapReduce framework, it is first split into several Map tasks, which are assigned to different nodes for execution; each Map task processes part of the input data. When a Map task completes, it produces intermediate files, which serve as the input data of the Reduce tasks. The main goal of a Reduce task is to aggregate the outputs of the preceding Map tasks and write the result.
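The Hadoop Streaming interface makes this split easy to see, because ordinary shell commands can play the two roles. The sketch below is illustrative only (the jar version and HDFS paths are assumptions): the mapper emits one word per line, the framework sorts the intermediate records by key between the two phases, and the reducer therefore only has to count adjacent duplicates.

# mapper.sh: turn every run of whitespace into a newline, i.e. one word per line.
cat > mapper.sh <<'EOF'
#!/bin/bash
tr -s '[:space:]' '\n'
EOF
# reducer.sh: input arrives sorted by word, so uniq -c yields per-word counts.
cat > reducer.sh <<'EOF'
#!/bin/bash
uniq -c
EOF
chmod +x mapper.sh reducer.sh
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.1.0.jar \
    -files mapper.sh,reducer.sh \
    -input /user/admin/input \
    -output /user/admin/output-streaming \
    -mapper mapper.sh \
    -reducer reducer.sh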
4.3 Running the MapReduce example
Run wordcount, the classic MapReduce example that ships with Hadoop, to count which words appear in a text and how many times each appears. First, submit the task to the Hadoop framework.
View the output file directory and result contents after the end of the MapReduce run.
You can see the result of counting the number of times words appear.
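The commands behind these steps look roughly like this (the jar version, input text, and paths are assumptions):

hdfs dfs -mkdir -p /user/admin/wc-input
hdfs dfs -put ./some-text.txt /user/admin/wc-input/
# Submit the bundled wordcount job.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar \
    wordcount /user/admin/wc-input /user/admin/wc-output
hdfs dfs -ls /user/admin/wc-output                  # a _SUCCESS marker plus part-r-00000
hdfs dfs -cat /user/admin/wc-output/part-r-00000    # one "word<TAB>count" line per word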
5. Hadoop installation
Strongly recommended: the most detailed Hadoop environment setup guide (https://blog.csdn.net/hliq5399/article/details/78193113)
5.1 Hadoop deployment modes
Local mode
Pseudo-distributed mode
Fully distributed mode
The deployment modes above are distinguished by how many JVM processes, and how many machines, the NameNode, DataNode, ResourceManager, NodeManager, and other modules run on.
5.2 Installation steps (pseudo-distributed mode as an example)
Hadoop is usually learned in pseudo-distributed mode. In this mode, each Hadoop module runs in its own process on a single machine. It is called pseudo-distributed because, although each module runs in a separate process, everything runs on one operating system; it is not truly distributed.
5.2.1 Download and unpack the JDK and configure the Java environment variables
export JAVA_HOME=/home/admin/apps/jdk1.8.0_151
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH
5.2.2 Download and unpack Hadoop and configure the Hadoop environment variables
export HADOOP_HOME=/zmq/modules/hadoop/hadoop-3.1.0
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
5.2.3 Configure the JAVA_HOME parameter in the hadoop-env.sh, mapred-env.sh, and yarn-env.sh files
export JAVA_HOME=/home/admin/apps/jdk1.8.0_151
5.2.4 Configure core-site.xml: set the HDFS address and the Hadoop temporary directory
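A minimal sketch of this step for a pseudo-distributed setup (the host name, port, and temporary directory are assumptions, not values from the article):

cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Address that HDFS clients and daemons use -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <!-- Base directory for Hadoop's temporary and working files -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/data/hadoop-tmp</value>
  </property>
</configuration>
EOF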
5.2.5 Configure hdfs-site.xml: set the number of replicas for HDFS storage; for a pseudo-distributed deployment, use 1
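A minimal sketch of this step; with only one DataNode, each block is kept as a single copy:

cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Number of replicas per block; 1 is enough for pseudo-distributed mode -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF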
5.2.6 Format HDFS, start the NameNode, DataNode, and SecondaryNameNode, and check the processes
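A sketch of the commands for this step (it assumes the environment variables set above and passphrase-less ssh to localhost):

hdfs namenode -format                  # format HDFS; only needed the first time
$HADOOP_HOME/sbin/start-dfs.sh         # starts the NameNode, DataNode, and SecondaryNameNode
jps                                    # should list NameNode, DataNode, SecondaryNameNode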
5.2.7 Setup complete: operate on HDFS (create directories, upload and download files, etc.) and run a MapReduce job
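A sketch of typical commands for this step (all paths and the jar version are illustrative):

hdfs dfs -mkdir -p /user/admin/input              # create a directory
hdfs dfs -put ./README.txt /user/admin/input      # upload a local file
hdfs dfs -ls /user/admin/input                    # list the directory
hdfs dfs -get /user/admin/input/README.txt ./     # download the file back
# Run a bundled MapReduce job as a quick smoke test (estimate pi with 2 maps, 10 samples each).
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar pi 2 10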
6. More Hadoop
The above is only a first look at learning and using Hadoop. Hadoop's fully distributed HA deployment, the YARN resource scheduler, Hadoop's high-availability and fault-tolerance mechanisms, the other components of the Hadoop ecosystem, and more are not covered here; Hadoop runs deep.
About the author: Mengqin, 2+ years of testing experience, currently mainly responsible for internal platform product testing and some externally delivered project testing.