2025-02-23 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
Hadoop is a distributed storage and computing platform for massive data.
Data can be roughly divided into three categories:
Structured data (handled well by an RDBMS; fast queries are achieved by building indexes)
Semi-structured data (typically tagged with a markup language such as XML)
Unstructured data
In fact, unstructured data accounts for the largest share, and storing and computing over it is correspondingly harder.
Hadoop was inspired by two Google papers: one on the Google File System (GFS) and one on MapReduce. Hadoop can be seen as an open-source implementation of these ideas, and it is written in Java.
First of all, a large amount of data must be stored before it can be analyzed. Hadoop addresses both needs:
HDFS: the Hadoop Distributed File System
MapReduce: a framework for parallel data processing
Roughly, Hadoop = HDFS + MapReduce; that is, a Hadoop cluster is an HDFS cluster plus a MapReduce cluster.
How does HDFS accomplish distributed storage?
An HDFS cluster usually has one master node (newer versions of Hadoop support multiple master nodes), called the NameNode (NN for short).
There are n slave nodes, called DataNodes (DN).
The DataNodes store the actual data. The NameNode splits files into blocks and assigns the blocks to DataNodes for storage. The NameNode also receives client requests, manages the slave nodes, maintains the file-system directory tree, and tracks the mapping from files to blocks and from blocks to DataNodes. Together, this achieves distributed storage of massive data.
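The bookkeeping described above can be sketched in plain Python. This is a purely illustrative, single-process model of a NameNode-style master (not Hadoop's actual API): a file is cut into fixed-size blocks, and each block is placed on several DataNodes for replication. All names and sizes here are invented for illustration; real HDFS defaults to 128 MB blocks and 3 replicas.

```python
import itertools

BLOCK_SIZE = 4          # bytes per block (toy value; real HDFS defaults to 128 MB)
REPLICATION = 2         # copies of each block (real HDFS defaults to 3)
DATANODES = ["dn1", "dn2", "dn3"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut a file's contents into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, datanodes=DATANODES, replication=REPLICATION):
    """Assign each block to `replication` DataNodes round-robin, producing
    the block -> [DataNode, ...] mapping a master node maintains."""
    ring = itertools.cycle(datanodes)
    placement = {}
    for block_id, _ in enumerate(blocks):
        placement[block_id] = [next(ring) for _ in range(replication)]
    return placement

blocks = split_into_blocks(b"hello distributed world")
placement = place_blocks(blocks)
```

The key point the sketch shows: the master holds only metadata (which blocks exist and where their replicas live); the block contents themselves would be sent to the DataNodes.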
Features of HDFS:
HDFS is designed for storing large files; it is not well suited to huge numbers of small files.
HDFS is a user-space file system (the data ultimately lives on an underlying file system such as ext3, but HDFS adds another layer of abstraction on top).
HDFS does not support modifying existing data (newer versions support appending).
HDFS cannot be mounted or accessed through ordinary system calls; it can only be accessed through dedicated interfaces, such as its command-line tools and APIs.
The term MapReduce generally carries three meanings:
a programming model
a computing framework
a concrete implementation of the MapReduce programming idea
The MapReduce idea is roughly divided into two stages, Map and Reduce:
Map splits the processing of a large file into blocks so that the computation can be distributed.
Reduce aggregates the results computed for each block.
The computation works by extracting key-value pairs. When Map output is handed to Reduce, all values extracted under the same key must be sent to the same Reduce process for the final merge.
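The two stages above can be sketched as a word count, the canonical MapReduce example. This is a single-process illustration of the model, not Hadoop's Java API: map emits (key, value) pairs, a shuffle step groups pairs by key (so the same key always reaches the same reducer), and reduce merges each group.

```python
from collections import defaultdict

def map_phase(chunk: str):
    """Map: emit a (word, 1) pair for every word in one input block."""
    for word in chunk.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, so every pair with the same key
    is guaranteed to reach the same reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: merge all the values recorded for one key."""
    return (key, sum(values))

chunks = ["big data big", "data big deal"]   # two "blocks" of input
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster each chunk would be mapped on the DataNode that holds it, and the shuffle would move data across the network; the logic, however, is exactly this.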
To compute over the data, developers must write MapReduce programs that follow the MapReduce programming model and fit the processing goal, so using HDFS + MapReduce directly to compute over massive data is quite restrictive.
HDFS + MapReduce form the core of Hadoop, and a number of additional components make up the Hadoop ecosystem:
Hive: developed at Facebook, Hive wraps the framework provided by MapReduce as a system. When users want to run queries, they can submit an SQL-like statement to Hive; Hive converts that statement, which is easy for users to understand, into a MapReduce program, executes it, and returns the result to the user (you can think of Hive as providing an SQL interface, though it is not fully SQL-compatible).
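As a hedged illustration of the idea, an aggregation that would otherwise require a hand-written MapReduce job can be expressed in Hive as a single HiveQL statement; the table and column names below are invented for the example:

```sql
-- Hypothetical table: access_logs(ip STRING, url STRING, status INT)
-- Count requests per URL; Hive compiles this query into MapReduce jobs.
SELECT url, COUNT(*) AS hits
FROM access_logs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```

Here GROUP BY plays the role of the shuffle-by-key step, and COUNT(*) is the reduce-side merge.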
HBase: because HDFS cannot be mounted and does not support modifying data, HBase runs on top of HDFS: an HBase cluster starts an additional process on each node, data is written to HBase first, and HBase then persists it to HDFS. HBase also keeps a version number for each record, so data can effectively be modified.
In many cases we need to analyze and compute over the logs generated by a web-server cluster, so how do we store those logs on HDFS? HDFS cannot be mounted, so logs cannot simply be written to it as to an ordinary file system. This is where log-collection tools such as Flume and Scribe come in: they collect the logs and store them on HDFS.
Similarly, we often need to analyze, compute over, and mine data stored in an RDBMS using the power of the cluster. Importing data from an RDBMS into HDFS is handled by the Sqoop tool: Sqoop exports the data from the RDBMS and stores it in HBase first, HBase persists it to HDFS, and the data can then be processed by MapReduce programs.
Mahout is a tool for data mining, that is, machine learning.
ZooKeeper: it can be understood as a coordinator, monitoring whether each node in the cluster meets the cluster's requirements.
Hadoop's HDFS remains quite good at distributed data storage, but MapReduce's computing power is comparatively weak. Hadoop can therefore be combined with the second-generation big-data solution Spark: HDFS provides distributed storage of the massive data, while Spark provides the computation over it.