What are the knowledge points of the Hadoop ecosystem?

2025-04-05 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)05/31 Report--

This article mainly explains the knowledge points of the Hadoop ecosystem. The methods introduced here are simple, fast, and practical, so interested readers may wish to take a look. Let's learn about the Hadoop ecosystem together.

1. Overview of the Hadoop ecosystem

Hadoop is a software framework for distributed processing of large amounts of data. It is characterized by reliability, efficiency, and scalability.

The core of Hadoop is HDFS and MapReduce; Hadoop 2.0 also includes YARN.

The following diagram shows the Hadoop ecosystem:

2. HDFS (Hadoop distributed file system)

HDFS originated from Google's GFS paper, published in October 2003; HDFS is a GFS clone.

It is the foundation of data storage management in the Hadoop system. It is a highly fault-tolerant system that can detect and respond to hardware failures, and it is designed to run on low-cost commodity hardware. HDFS simplifies the file consistency model, provides high-throughput access to application data through streaming data access, and is suited to applications with large datasets.

Client: splits files, accesses HDFS, interacts with the NameNode to obtain file location information, and interacts with DataNodes to read and write data.

NameNode: the master node (only one in Hadoop 1.x). It manages the HDFS namespace and block mapping information, configures the replica policy, and handles client requests.

DataNode: a slave node. It stores the actual data and reports storage information to the NameNode.

Secondary NameNode: assists the NameNode and shares its workload; periodically merges the fsimage and fsedits files and pushes them to the NameNode; can help recover the NameNode in an emergency. Note, however, that the Secondary NameNode is not a hot standby for the NameNode.
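To illustrate the client's role described above, here is a minimal sketch of splitting a file into fixed-size blocks before they are distributed across DataNodes. The function name and the tiny block size are illustrative only; HDFS's real default block size is far larger (128 MB in Hadoop 2.x).

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a byte stream into fixed-size blocks, as an HDFS client
    conceptually does before handing blocks to DataNodes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# A 300-byte "file" with a 128-byte block size yields two full blocks
# plus a smaller final block.
blocks = split_into_blocks(b"a" * 300, block_size=128)
print([len(b) for b in blocks])  # [128, 128, 44]
```

In real HDFS each of these blocks would also be replicated (three copies by default) according to the replica policy managed by the NameNode.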

3. MapReduce (distributed computing framework)

The MapReduce paper from Google was published in December 2004. Hadoop MapReduce is a clone of Google's MapReduce.


MapReduce is a computing model used to process large amounts of data. Map applies a specified operation to each independent element of the dataset, producing intermediate results in the form of key-value pairs. Reduce then combines all values with the same key in the intermediate results to produce the final result. This functional division makes MapReduce well suited to data processing in a distributed parallel environment composed of a large number of machines.
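The map/shuffle/reduce division described above can be sketched in plain Python with the classic word-count example. This is a single-process illustration of the model, not the Hadoop API; the function names are illustrative.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit (word, 1) for every word in an input split
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into the final result
    return key, sum(values)

lines = ["hadoop stores data", "hadoop processes data"]
intermediate = chain.from_iterable(map_phase(l) for l in lines)
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

On a real cluster, each input line would belong to a split processed by a separate map task, and the shuffle would move data across the network to the reducers.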

4. Hive (Hadoop-based data warehouse)

Open-sourced by Facebook, Hive was originally used to solve the problem of computing statistics over massive amounts of structured log data.

Hive defines a SQL-like query language (HQL) and converts queries into MapReduce tasks that execute on Hadoop.
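To make the HQL-to-MapReduce conversion concrete, consider how a simple GROUP BY query conceptually compiles into a map phase and a reduce phase. The query and sample rows below are illustrative assumptions, not Hive output.

```python
# A query such as:
#   SELECT status, COUNT(*) FROM logs GROUP BY status;
# is compiled by Hive into a map step (emit (status, 1)) and a
# reduce step (sum per status), conceptually like this:
rows = [{"status": 200}, {"status": 404}, {"status": 200}]

mapped = [(row["status"], 1) for row in rows]   # map: emit key-value pairs
counts = {}
for status, one in mapped:                      # reduce: aggregate per key
    counts[status] = counts.get(status, 0) + one
print(counts)  # {200: 2, 404: 1}
```

This is why Hive is convenient for analysts: they write declarative SQL while the execution engine handles the distributed map and reduce plumbing.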

5. HBase (distributed column-oriented database)

The Bigtable paper from Google was published in November 2006. HBase is a Google Bigtable clone.

HBase is a scalable, highly reliable, high-performance, distributed, column-oriented database with dynamic schemas for structured data. Unlike traditional relational databases, HBase uses Bigtable's data model: an enhanced sparse sorted mapping table (key/value), where the key consists of a row key, a column key, and a timestamp. HBase provides random, real-time read and write access to large-scale data. At the same time, data stored in HBase can be processed with MapReduce, combining data storage and parallel computing.

Data model: Schema → Table → Column Family → Column → RowKey → TimeStamp → Value
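The sparse sorted key/value model above can be sketched as a map from (row key, column family:qualifier, timestamp) to a value, with reads returning the latest version of a cell. This is a toy in-memory illustration, not the HBase API.

```python
# Minimal sketch of HBase's data model: a map from
# (row key, "family:qualifier", timestamp) to a value.
table = {}

def put(row, column, value, ts):
    table[(row, column, ts)] = value

def get_latest(row, column):
    # Return the value with the highest timestamp for this cell
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

put("user1", "info:name", "alice", ts=1)
put("user1", "info:name", "alicia", ts=2)
print(get_latest("user1", "info:name"))  # alicia
```

Because the map is sparse, rows need not share columns, which is what "dynamic schema" means in practice: new column qualifiers can appear per row without any schema change.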

6. Zookeeper (distributed collaboration service)

The Chubby paper from Google was published in November 2006. ZooKeeper is a Chubby clone.

ZooKeeper solves data-management problems in a distributed environment: unified naming, state synchronization, cluster management, configuration synchronization, and so on.
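The unified-naming idea can be illustrated with a toy version of ZooKeeper's namespace: a tree of znodes addressed by slash-separated paths, each holding a small piece of data such as a service address. The paths and data below are hypothetical; a real client would use a library such as Kazoo against a ZooKeeper ensemble.

```python
# Toy sketch of ZooKeeper's hierarchical namespace of znodes,
# used here for unified naming / service discovery.
znodes = {}

def create(path, data):
    znodes[path] = data

def get_children(path):
    # List the direct children of a znode, as ZooKeeper's getChildren does
    prefix = path.rstrip("/") + "/"
    return sorted(p[len(prefix):] for p in znodes
                  if p.startswith(prefix) and "/" not in p[len(prefix):])

create("/services/db", "10.0.0.5:3306")
create("/services/cache", "10.0.0.6:6379")
print(get_children("/services"))  # ['cache', 'db']
```

What this sketch omits is ZooKeeper's real value: the namespace is replicated across an ensemble with strong ordering guarantees, and clients can set watches to be notified when a znode changes.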

7. Sqoop (data synchronization tool)

Sqoop is short for SQL-to-Hadoop and is mainly used to transfer data between traditional databases and Hadoop.

Importing and exporting data is essentially a MapReduce program, which takes full advantage of MR's parallelism and fault tolerance.
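One way Sqoop parallelizes an import is by dividing the primary-key range of the source table into one split per map task, so each mapper pulls its own range with a bounded query. The sketch below shows only that range-splitting arithmetic; the function name is illustrative and the even-division behavior for remainders is simplified relative to Sqoop's actual splitter.

```python
def key_splits(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into one
    contiguous split per map task (remainder goes to the last split)."""
    step = (max_id - min_id + 1) // num_mappers
    bounds = [min_id + i * step for i in range(num_mappers)] + [max_id + 1]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_mappers)]

# Importing ids 1..100 with 4 mappers gives 4 disjoint ranges:
print(key_splits(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Each mapper would then run something like `SELECT ... WHERE id BETWEEN lo AND hi` against its own split, and a failed mapper can be re-run without affecting the others, which is where MR's fault tolerance comes in.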

8. Pig (data flow system based on Hadoop)

Open-sourced by Yahoo!, Pig's design motivation was to provide an ad-hoc (computation happens at query time) data analysis tool based on MapReduce.

Pig defines a dataflow language, Pig Latin, and converts scripts into MapReduce tasks that execute on Hadoop.

It is usually used for offline analysis.
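To show what a Pig Latin dataflow expresses, here is a hypothetical script (in comments) together with a plain-Python rendering of the same LOAD → FILTER → GROUP → COUNT pipeline. The file name, field names, and sample data are illustrative assumptions.

```python
# A Pig Latin script such as:
#   logs   = LOAD 'access.log' AS (url, status);
#   errors = FILTER logs BY status >= 500;
#   by_url = GROUP errors BY url;
#   counts = FOREACH by_url GENERATE group, COUNT(errors);
# describes a dataflow that conceptually reduces to:
from collections import Counter

logs = [("/a", 200), ("/b", 500), ("/b", 503), ("/c", 404)]
errors = [(url, status) for url, status in logs if status >= 500]  # FILTER
counts = Counter(url for url, _ in errors)                         # GROUP + COUNT
print(dict(counts))  # {'/b': 2}
```

Pig compiles each of these relational steps into one or more MapReduce jobs, which is why it suits offline, batch-style analysis rather than interactive queries.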

9. Mahout (data mining algorithm library)

Mahout originated in 2008 as a sub-project of Apache Lucene. It made great progress in a very short time and is now a top-level Apache project.

The main goal of Mahout is to create scalable implementations of classic machine-learning algorithms, helping developers create intelligent applications more easily and quickly. Mahout now includes widely used data-mining methods such as clustering, classification, recommendation engines (collaborative filtering), and frequent itemset mining. Beyond the algorithms, Mahout also includes data input/output tools and data-mining support infrastructure, such as integration with storage systems like MongoDB or Cassandra.
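The collaborative-filtering idea behind Mahout's recommenders can be sketched with item co-occurrence: recommend items that frequently appear alongside what a user already has. This toy example is not Mahout's API; the basket data and scoring are illustrative, and real recommenders use weighted similarities rather than raw counts.

```python
from collections import Counter
from itertools import combinations

# Each basket is one user's set of items
baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread", "eggs"}]

# Count how often each ordered pair of items co-occurs in a basket
cooccur = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        cooccur[(a, b)] += 1
        cooccur[(b, a)] += 1

def recommend(item, owned):
    # Recommend the item that co-occurs most often with `item`,
    # excluding anything the user already owns
    scores = {b: n for (a, b), n in cooccur.items()
              if a == item and b not in owned}
    return max(scores, key=scores.get) if scores else None

print(recommend("milk", owned={"milk"}))  # bread
```

At Mahout's scale, the co-occurrence counting itself becomes a MapReduce job over millions of baskets, which is exactly the "scalable implementation of a classic algorithm" the project aims for.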

10. Flume (log collection tool)

Flume, Cloudera's open-source log collection system, is distributed, highly reliable, highly fault-tolerant, and easy to customize and extend.

It abstracts the process of generating, transmitting, processing, and finally writing data to a target path as a data flow. Flume supports custom data senders at the source, so it can collect data over different protocols. Flume data flows also provide simple processing of log data, such as filtering and format conversion. In addition, Flume can write logs to a variety of (customizable) data targets. Overall, Flume is a scalable, massive log collection system suited to complex environments.
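The source → processing → channel → sink flow described above can be sketched as a tiny in-process pipeline. The stage names mirror Flume's concepts (source, interceptor, channel, sink), but the code is a single-process illustration, with a plain list standing in for an HDFS sink.

```python
from collections import deque

def source(lines):
    yield from lines                       # source: emit raw log events

def interceptor(events):
    for e in events:                       # interceptor: filter + reformat
        if "ERROR" in e:
            yield e.strip().upper()

channel = deque()                          # channel: buffers events in transit
sink_output = []                           # sink target (stand-in for HDFS)

for event in interceptor(source(["info: ok\n", "ERROR: disk full\n"])):
    channel.append(event)
while channel:
    sink_output.append(channel.popleft())  # sink: drain channel to the target

print(sink_output)  # ['ERROR: DISK FULL']
```

In real Flume the channel is the reliability boundary: events are only removed from it after the sink confirms delivery, which is how the system tolerates sink failures without losing data.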

At this point, I believe you have a deeper understanding of the knowledge points of the Hadoop ecosystem. You might as well try them out in practice. Follow us and keep learning!
