
What's the use of Hadoop?


This article explains in detail what Hadoop is used for. The editor finds it very practical and shares it here for reference; I hope you get something out of it.

I. A brief introduction to Hadoop

A. What is Hadoop?

1. Hadoop is an open-source distributed computing platform with HDFS (Hadoop Distributed File System) and MapReduce at its core, providing users with a transparent distributed infrastructure at the lower layers of the system.

2. HDFS's distributed storage improves read and write speed and expands storage capacity; MapReduce integrates the data on the distributed file system to keep data analysis and processing efficient; redundant storage keeps the data safe. The high fault tolerance of HDFS allows Hadoop to be deployed on low-cost computer clusters without being tied to a particular operating system.

3. Hadoop's advantages: high reliability, high scalability, high efficiency, and high fault tolerance.

B. The Hadoop project and its structure

1. Core/Common: common utilities that support the other Hadoop subprojects, including FileSystem, RPC, and serialization libraries.

2. Avro: a data serialization system.

3. MapReduce: a programming model for the parallel processing of large data sets (larger than 1 TB).

4. HDFS: a distributed file system.

5. Chukwa: an open-source data collection system for monitoring and analyzing large distributed systems.

6. Hive: a data warehouse built on Hadoop that provides tools for data summarization, ad-hoc queries, and analysis of data sets stored in Hadoop files.

7. HBase: a distributed, column-oriented open-source database.

8. Pig: a platform for analyzing and evaluating large data sets.

C. The architecture of Hadoop

1. HDFS adopts a master/slave (Master/Slave) structure: a cluster consists of one NameNode and several DataNodes.

NameNode: the master server; it manages the file system namespace and client access to files, performs namespace operations, and is responsible for mapping data blocks to specific DataNodes.

DataNode: manages the data it stores. Files are split into blocks that are stored across a set of DataNodes, and the DataNodes handle read and write requests from file system clients.

2. MapReduce consists of a single JobTracker running on the master node and a TaskTracker running on each slave node of the cluster.

D. Hadoop and distributed development

1. Hadoop sits at the file system layer of a distributed software stack: it implements a distributed file system and part of the functionality of a distributed database.

2. The principle of the MapReduce programming model is that a set of input key/value pairs is used to generate a set of output key/value pairs. There are three main functions: map, reduce, and main (see the sketch below).
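As a concrete but hedged illustration of the map and reduce functions, here is a minimal word-count sketch using Hadoop's Java MapReduce API; the class and field names (WordCount, TokenizerMapper, SumReducer) are illustrative and not taken from the article.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // map: for each input line, emit a (word, 1) pair per token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce: sum all the counts emitted for the same word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```

The main function (the job driver) configures and submits the job; a sketch of it appears under the MapReduce computing model below.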

E. Hadoop's computing model: MapReduce

1. A MapReduce job usually splits the input data set into independent chunks, which are processed fully in parallel by the map tasks.

F. Data management in Hadoop

1. HDFS has three important components: the NameNode, DataNodes, and Clients; a Client is an application that needs to access files in the distributed file system.

2. In a distributed cluster, HBase manages data as a whole mainly through an architecture composed of HRegion servers, an HMaster, and HClients.

3. Hive is a data warehouse infrastructure built on Hadoop. It provides a series of tools for data extraction, transformation, and loading (ETL), and a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop.

II. Installation and configuration of Hadoop

1. hadoop-3.0.0-alpha3: the web interfaces default to localhost:9870 and localhost:50090.

III. Case studies of Hadoop applications

1. Large-scale data processing is often divided into three different tasks: data collection, data preparation, and data representation.

Data preparation is usually thought of as the extract, transform, load (ETL) stage, or as a data factory.

The data presentation phase generally refers to the data warehouse, which stores the products customers need; customers select the appropriate products according to their requirements.

IV. The MapReduce computing model

A. The MapReduce computing model

1. In Hadoop, two machine roles execute MapReduce tasks: the JobTracker, which schedules work and of which there is only one per cluster, and the TaskTracker, which executes the work.

2. When the JobTracker schedules a task to a TaskTracker for execution, the TaskTracker returns progress reports, which the JobTracker records. If a task on one TaskTracker fails, the JobTracker assigns it to another TaskTracker until the task completes (a driver sketch follows).
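To make the flow concrete, here is a minimal, hedged sketch of the driver (the main function) that configures and submits a job, after which the scheduler assigns its tasks to TaskTrackers. It reuses the illustrative WordCount classes sketched earlier; all names are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.SumReducer.class);   // local pre-aggregation
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submits the job and polls for progress reports until it finishes;
        // failed tasks are re-scheduled on other nodes by the framework.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```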

B. Hadoop Streaming

1. Hadoop Streaming provides an API that lets users write map or reduce functions in any scripting language, using UNIX standard streams as the interface between the program and Hadoop.

2. Hadoop Pipes provides a way to run C++ programs on Hadoop; Pipes uses sockets as the channel between the Java task and the C++ process.

V. Developing MapReduce applications

1. Hadoop's built-in web user interface: http://xxx:50030

2. Performance tuning:

Use large files as much as possible and avoid small files.

Consider compressing files (a configuration sketch follows).
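As a hedged sketch of the compression tip, the following shows one way to enable compression of both the intermediate map output and the final job output through the Java API. The property names follow the MRv2 (mapreduce.*) naming, and GzipCodec is chosen here only because it needs no extra native libraries; both are assumptions to verify against the cluster's Hadoop version.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigExample {
    public static Job newCompressedJob() throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output to reduce shuffle traffic.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                GzipCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed job");
        // Compress the final job output files as well.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        return job;
    }
}
```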

3. MapReduce workflow

Mapper is usually used to handle input format conversion, projection (selecting relevant fields), filtering (removing records that are not of interest), etc.

Hadoop Workflow Scheduler (HWS) acts as a server that allows clients to submit a workflow to the scheduler

VI. MapReduce application cases

VII. The working mechanism of MapReduce

A. The execution process of a MapReduce job

1. The general execution flow of a MapReduce job: code writing -> job configuration -> job submission -> assignment and execution of map tasks -> processing of intermediate results -> assignment and execution of reduce tasks -> job completion. Each task in turn goes through input preparation -> task execution -> output of results.

2. Four separate entities are involved:

Client: writes the MapReduce code, configures the job, and submits it.

JobTracker: initializes jobs, assigns jobs, communicates with TaskTrackers, and coordinates the execution of the whole job.

TaskTracker: keeps in contact with the JobTracker and executes map or reduce tasks on the data fragments assigned to it; a cluster can contain many TaskTrackers.

HDFS: stores the job data, configuration information, and job results.

B. Error handling mechanism

1. A cluster has only one JobTracker at any time, so a JobTracker failure is a single point of failure; it is usually mitigated by creating multiple backup JobTracker nodes.

2. TaskTracker failures are normal and are handled by the MapReduce framework.

C. Job scheduling mechanism

1. Hadoop defaults to a FIFO scheduler, and also provides schedulers that support multi-user service and fair sharing of cluster resources, namely the Fair Scheduler (Fair Scheduler Guide) and the Capacity Scheduler (Capacity Scheduler Guide).

2. The shuffle process spans both the map and reduce sides: the map side partitions, sorts, and spills its output and then merges the spills belonging to the same partition; the reduce side merges the outputs that the various maps sent to its partition, sorts the merged result, and finally hands it to reduce for processing (a partitioner sketch follows).
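To make the "divide" step of the shuffle concrete, here is a hedged sketch of a custom Partitioner, which decides which reduce task (partition) each map output record is sent to; the keying rule and class name are illustrative.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each (key, value) pair from the map side to one of the reduce partitions.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        // All keys starting with the same letter end up in the same partition,
        // and therefore at the same reducer after the shuffle.
        int firstChar = k.isEmpty() ? 0 : Character.toLowerCase(k.charAt(0));
        return (firstChar & Integer.MAX_VALUE) % numPartitions;
    }
}
// Registered on a job with: job.setPartitionerClass(FirstLetterPartitioner.class);
```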

D. Task execution

1. Speculative execution: once all tasks of a job are running, the JobTracker tracks the average progress of all tasks. If the TaskTracker node a task runs on is slower than the overall average because of weak hardware or high CPU load, the JobTracker starts a backup task; whichever of the original and backup tasks finishes first is kept and the other is killed. The drawback is that backup tasks cannot solve problems caused by code defects.

2. Task JVM reuse; skipping bad records.

VIII. Hadoop I/O operations

1. Hadoop uses CRC-32 (a cyclic redundancy check whose generated checksums are 32 bits) to verify data integrity.

2. Hadoop uses RPC for inter-process communication and the Writable serialization mechanism (a sketch follows).
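As a hedged sketch of the Writable mechanism, here is a small custom type that Hadoop can serialize for RPC or for intermediate map output; the class and field names are illustrative.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// A (count, sum) pair that Hadoop can serialize across the wire or to disk.
public class SumCountWritable implements Writable {
    private long count;
    private double sum;

    public SumCountWritable() { }            // no-arg constructor required by the framework

    public SumCountWritable(long count, double sum) {
        this.count = count;
        this.sum = sum;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialize
        out.writeLong(count);
        out.writeDouble(sum);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialize
        count = in.readLong();
        sum = in.readDouble();
    }
}
```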

IX. Detailed explanation of HDFS

A. Introduction to HDFS

1. Strengths: handling very large files; streaming data access; running on clusters of cheap commodity machines.

2. Limitations: not suited to low-latency data access; cannot store a large number of small files efficiently; does not support multiple writers or arbitrary file modification.

B. HDFS architecture

1. Files in the HDFS distributed file system are divided into blocks for storage; the block is the unit of file storage and processing, and its default size is 64 MB.

2. The NameNode is the Master that manages and schedules the cluster, and the DataNodes are the Workers that execute specific tasks.

3. An HDFS cluster is made up of one NameNode and a certain number of DataNodes. A file is actually divided into one or more data blocks, which are stored on a set of DataNodes (a client sketch follows).
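A minimal, hedged client sketch using the HDFS Java FileSystem API: the client asks the NameNode for metadata while the actual block data flows to and from the DataNodes. The file path and contents are placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS should point at the NameNode, e.g. hdfs://namenode:9000
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hello.txt");

        // Write: the NameNode chooses block locations, the data goes to DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}
```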

X. Management of Hadoop

1. Monitoring tools: Metrics, Ganglia

2. Backup tool: distcp

3. Hadoop management commands: dfsadmin, which reports HDFS status information; fsck, which checks file blocks.

XI. Detailed explanation of Hive

1. Hive is a data warehouse architecture built on the Hadoop file system. It provides the main data warehouse functions: data ETL (extract, transform, load) tools, data storage management, and query and analysis of large data sets. Hive also defines a SQL-like language, HiveQL.

2. Hive mainly has four kinds of data models: tables (Table), external tables (External Table), partitions (Partition), and buckets (Bucket). A JDBC sketch follows.
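As a hedged sketch only, here is how a table might be created and queried with HiveQL through the HiveServer2 JDBC driver; the connection URL, credentials, table name, and columns are all assumptions for illustration, and the hive-jdbc driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port, and database are assumptions.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement()) {

            // A managed (internal) table; an external or partitioned table would
            // add the EXTERNAL keyword or a PARTITIONED BY (...) clause.
            stmt.execute("CREATE TABLE IF NOT EXISTS page_views (user_id STRING, url STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");

            // HiveQL query, compiled by Hive into MapReduce-style jobs.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT url, COUNT(*) FROM page_views GROUP BY url")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}
```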

XII. Detailed explanation of HBase

A. Introduction to HBase

1. Features: provides storage downward and operations upward.

B. Basic operations of HBase

1. Stand-alone mode can be run directly, while distributed mode requires Hadoop.

C. HBase architecture

1. HBase follows a simple master-slave server architecture composed of a group of HRegion servers and an HBase Master server. The HBase Master server is responsible for managing all the HRegion servers, and all servers in HBase coordinate through ZooKeeper and handle errors that may occur while HBase is running. The HBase Master server itself does not store any HBase data: logical HBase tables may be divided into multiple HRegions, which are stored in the HRegion server group, and what the HBase Master server stores is the mapping from data to HRegion servers.

D. HBase data model

1. HBase is a distributed database similar to Bigtable: a sparse, persistent (stored on disk), multi-dimensional, sorted map. The map is indexed by row key, column key, and timestamp, and all data in HBase is stored as untyped strings.

2. A column name has the form "family:qualifier" and is made up of strings. Each table has a fixed set of column families, which can only be changed by altering the table structure; write operations are row-locked; every update carries a timestamp, each update creates a new version, and HBase retains a certain number of versions (a client sketch follows).
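A hedged sketch of the data model in code, using the HBase Java client API: a cell is addressed by row key plus "family:qualifier" (plus an implicit timestamp/version), and all values are raw bytes. The table name, family, and qualifier here are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml / ZooKeeper quorum
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("webtable"))) {

            // Write one cell: row key + family:qualifier -> value (all bytes).
            Put put = new Put(Bytes.toBytes("com.example/index.html"));
            put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                    Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // Read it back; older versions are retained up to the configured limit.
            Get get = new Get(Bytes.toBytes("com.example/index.html"));
            Result result = table.get(get);
            byte[] html = result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"));
            System.out.println(Bytes.toString(html));
        }
    }
}
```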

E. HBase and RDBMS

1. Only simple string types

2. Only very simple insert, query, delete, and truncate operations are supported; tables are independent of one another, with no complex relationships between them.

3. Storage is column-based, and each column family is saved in several files.

4. Updates keep the old versions rather than replacing values in place as a traditional relational database does.

5. Hardware can easily be added or removed, and fault tolerance is high.

6. Built for the mass storage needs of Internet applications, using cheap hardware to construct a data warehouse; it was originally developed as part of a search engine.

XIII. Detailed explanation of Mahout

A. Introduction to Mahout

1. The main goal of Apache Mahout is to build machine learning algorithms that scale to large data sets.

B. Clustering and classification in Mahout

1. Mahout has three kinds of vectors: dense vectors (DenseVector), random-access sparse vectors (RandomAccessSparseVector), and sequential-access sparse vectors (SequentialAccessSparseVector); a sketch follows.
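A hedged sketch of the three vector types using Mahout's math API (org.apache.mahout.math); the values are arbitrary, and the exact API details should be checked against the Mahout version in use.

```java
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;

public class MahoutVectorExample {
    public static void main(String[] args) {
        // Dense: every position is stored; good when most values are non-zero.
        Vector dense = new DenseVector(new double[] {1.0, 0.0, 2.5, 0.0});

        // Random-access sparse: hash-map backed; good for random writes.
        Vector sparse = new RandomAccessSparseVector(4);
        sparse.set(0, 1.0);
        sparse.set(2, 2.5);

        // Sequential-access sparse: parallel index/value arrays; good for
        // ordered iteration, e.g. distance computations in clustering.
        Vector sequential = new SequentialAccessSparseVector(sparse);

        System.out.println(dense.dot(sequential));  // 1.0*1.0 + 2.5*2.5 = 7.25
    }
}
```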

XIV. Detailed explanation of Pig

A. Introduction to Pig

1. Pig consists of a high-level language for describing data analysis programs and an infrastructure for evaluating those programs. Its outstanding feature is that its structure supports a high degree of parallelism, which lets it process very large data sets.

2. Pig uses the Pig Latin language, which is similar to SQL and places particular emphasis on querying.

XV. Detailed explanation of ZooKeeper

A. Introduction to ZooKeeper

1. ZooKeeper is an open-source coordination service designed for distributed applications; it provides synchronization, configuration management, grouping, and naming services.

2. Design goals:

Simplicity: distributed processes coordinate through a shared hierarchical namespace that closely resembles a standard file system and is made up of data registers.

Robustness: the servers that make up the ZooKeeper service must all know about one another.

Ordering: every update is assigned a version number that is globally ordered and never repeated.

Speed: it can run across thousands of machine nodes.

B. Leader election in ZooKeeper

1. ZooKeeper elects a Leader among all of its services (each of which can be understood as a server) and lets that Leader manage the cluster; the rest are Followers. When the Leader fails, ZooKeeper must be able to quickly elect the next Leader from among the Followers. This is ZooKeeper's Leader mechanism.

C. ZooKeeper lock service

1. In ZooKeeper, fully distributed locks are globally synchronized: no two different clients ever believe they hold the same lock at the same time.

E. Typical application scenarios (found on the Internet)

1. Unified naming service

2. Configuration management: configuration information can be managed by ZooKeeper by saving it in a ZooKeeper directory node and having every application machine that uses it watch that node. Once the configuration changes, each application machine receives a notification from ZooKeeper, fetches the new configuration from ZooKeeper, and applies it to its system.

3. Cluster management: each Server creates an EPHEMERAL directory node under a common parent node on ZooKeeper and then calls getChildren (String path, boolean watch) on that parent with watch set to true. Because the node is EPHEMERAL, it is deleted when the Server that created it dies; the parent's children therefore change and the watch set by getChildren fires, so the other Servers learn that some Server has died. The same principle applies when a new Server joins (see the sketch after this list).

4. Shared lock

5. Queue management
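The cluster-management pattern in item 3 can be sketched (with hedging) using the ZooKeeper Java client; the ensemble address, the /cluster parent path (which must already exist), and the node names are assumptions for illustration.

```java
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ClusterMembership {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble; a production client would wait for
        // the connection event before creating nodes.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            // Default watcher: receives connection events and, because of the
            // getChildren call below, notifications when /cluster's children change.
            System.out.println("ZooKeeper event: " + event);
        });

        // Each server registers itself as an EPHEMERAL child of /cluster;
        // if the server dies, its session expires and the node is removed.
        String me = zk.create("/cluster/server-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // Watch the parent so we are notified when members join or leave.
        List<String> members = zk.getChildren("/cluster", true);
        System.out.println("I am " + me + ", current members: " + members);
    }
}
```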

XVI. Detailed explanation of Avro

A. Introduction to Avro

1. Avro is a data serialization system that transforms data structures or objects into a format convenient for storage or transmission. It was designed from the start to support data-intensive applications and is well suited to large-scale data storage and exchange.

2. Avro schemas are defined in JSON, and Avro provides functionality similar to systems such as Thrift and Protocol Buffers (a sketch follows).
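A hedged sketch of defining a schema in JSON and serializing a record with Avro's generic API; the record type and fields are illustrative.

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // The schema is plain JSON, as described above; field names are made up.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Build a record that conforms to the schema.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // Serialize it to the compact Avro binary format.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();

        System.out.println("Serialized " + out.size() + " bytes");
    }
}
```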

XVII. Detailed explanation of Chukwa

A. Introduction to Chukwa

1. Chukwa scales out to handle a large number of clients and can aggregate the data streams of many of them. It adopts a pipelined data processing model and a modular collection system, with a simple, standard interface for each module.

B. Chukwa architecture

1. There are three main components:

Agent (client): makes internal process communication protocols compatible with local log files.

Collector and splitter (Demux): apply the Collector strategy.

HICC (Hadoop Infrastructure Care Center): data visualization page

XVIII. Common plug-ins and development of Hadoop

1. Hadoop Studio

2. Hadoop Eclipse

3. Hadoop Streaming: helps users create and run a special class of MapReduce jobs that use executable or script files as the mapper or reducer, that is, it allows languages other than Java to be used.

4. Hadoop libhdfs: a JNI-based C programming interface for the Hadoop distributed file system; it provides a C API for managing DFS files and the file system.

This concludes the article on "What's the use of Hadoop?". I hope the content above has been of some help and that you have learned something from it. If you found the article useful, please share it so more people can see it.
