Overview of the Hadoop Ecosystem

I. Overview of the Hadoop ecosystem

Hadoop is a distributed system infrastructure developed by the Apache Software Foundation. It lets users develop distributed programs without knowing the underlying details of distribution, making full use of the power of a cluster for high-speed computation and storage. It is characterized by reliability, efficiency, and scalability.

The core of Hadoop consists of YARN, HDFS, and MapReduce.

The ecosystem diagram (omitted here) also integrates the Spark ecosystem: Hadoop and Spark will coexist going forward, and both can be deployed on resource management systems such as YARN and Mesos.

1. HDFS (Hadoop Distributed File System)

HDFS originates from Google's GFS paper, published in October 2003; HDFS is a GFS clone.

HDFS is the foundation of data storage and management in the Hadoop system. It is a highly fault-tolerant system that can detect and respond to hardware failures, and it is designed to run on low-cost commodity hardware.

HDFS simplifies the file consistency model and provides high-throughput data access through streaming reads, which makes it suitable for applications with large data sets.

It provides a write-once, read-many access model, and data is split into blocks that are distributed across different physical machines in the cluster.
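
To make this concrete, here is a minimal sketch (not from the original article) of a write-once, read-back round trip through the HDFS Java API; the NameNode address and file path are hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in practice this comes from core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/demo/hello.txt");

            // Write once: HDFS favors a write-once, read-many access pattern
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read many: stream the file back
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }
}
```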

2. MapReduce (distributed computing framework)

MapReduce originates from Google's MapReduce paper, published in December 2004; Hadoop MapReduce is a clone of Google MapReduce.

MapReduce is a distributed computing model for processing very large data sets. It hides the details of the distributed computing framework and abstracts computation into two phases: map and reduce.

Map applies a specified operation to independent elements of the data set and produces intermediate results in the form of key-value pairs. Reduce aggregates all values that share the same key in the intermediate results to obtain the final result.

MapReduce is very suitable for data processing in a distributed parallel environment composed of a large number of computers.
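
As an illustration of the map/reduce split described above, here is the classic word-count job written against the Hadoop MapReduce Java API; the input and output paths are supplied on the command line and are assumed to exist on HDFS.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all counts that share the same key (word)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```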

3. HBase (distributed column-oriented database)

HBase originates from Google's Bigtable paper, published in November 2006; HBase is a Google Bigtable clone.

HBase is a scalable, highly reliable, high-performance, distributed, column-oriented dynamic-schema database for structured data, built on top of HDFS.

HBase adopts Bigtable's data model: an enhanced sparse, sorted mapping table (key/value), where the key consists of a row key, a column key, and a timestamp.

HBase provides random, real-time read and write access to large-scale data. At the same time, data stored in HBase can be processed with MapReduce, combining data storage with parallel computing.
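
A minimal sketch of this row-key / column-family / qualifier model using the HBase Java client follows; the ZooKeeper quorum, table, and column names are hypothetical, and the table is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical ZooKeeper quorum used by the HBase client
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_profile"))) {

            // Write a cell: (row key, column family, qualifier) -> value
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Beijing"));
            table.put(put);

            // Random, real-time read of the same cell
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println(Bytes.toString(city));
        }
    }
}
```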

4. ZooKeeper (distributed coordination service)

ZooKeeper originates from Google's Chubby paper, published in November 2006; ZooKeeper is a Chubby clone.

It solves data management problems in distributed environments: unified naming, state synchronization, cluster management, configuration synchronization, and so on.

Many Hadoop components depend on ZooKeeper, which runs on a cluster of machines and is used to manage Hadoop operations.
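
For illustration, here is a minimal sketch of using ZooKeeper for unified naming and configuration storage via its Java client; the ensemble address, the znode path, and the stored value are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Hypothetical ensemble address; wait until the session is established
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Unified naming / configuration: store a config value under a znode
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "jdbc:mysql://db:3306/demo".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read it back (a watcher could be set here to react to changes)
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}
```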

5. Hive (data warehouse)

Open-sourced by Facebook, Hive was originally built to compute statistics over massive amounts of structured log data.

Hive defines a SQL-like query language (HQL) that converts SQL statements into MapReduce tasks executed on Hadoop. It is typically used for offline analysis.

HQL is used to run queries over data stored on Hadoop, so Hive lets developers who are unfamiliar with MapReduce write data queries, which are then translated into MapReduce tasks on Hadoop.
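
A minimal sketch of submitting an HQL query through Hive's JDBC interface (HiveServer2); the server address, credentials, and the web_logs table are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, port, and database are hypothetical
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // A SQL-like HQL query; Hive compiles it into MapReduce tasks on Hadoop
            ResultSet rs = stmt.executeQuery(
                    "SELECT level, COUNT(*) AS cnt FROM web_logs GROUP BY level");
            while (rs.next()) {
                System.out.println(rs.getString("level") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}
```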

6. Pig (ad-hoc scripting)

Open-sourced by Yahoo!, Pig was designed to provide a MapReduce-based ad-hoc data analysis tool (computation happens at query time).

Pig defines a data-flow language, Pig Latin, which abstracts away the complexity of MapReduce programming. The Pig platform includes a runtime environment and the Pig Latin scripting language for analyzing Hadoop data sets.

The compiler translates Pig Latin into sequences of MapReduce programs, converting scripts into MapReduce tasks that are executed on Hadoop. It is typically used for offline analysis.
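
As a sketch only, here is a small Pig Latin data flow embedded in Java via the PigServer class; the input and output paths are hypothetical, and local mode is used purely for illustration (ExecType.MAPREDUCE would run on the cluster).

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Local mode for a quick test; on a real cluster this compiles to MapReduce jobs
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Pig Latin data flow: LOAD -> GROUP -> COUNT
        pig.registerQuery(
            "logs = LOAD '/data/web_logs' USING PigStorage('\\t') AS (level:chararray, msg:chararray);");
        pig.registerQuery("by_level = GROUP logs BY level;");
        pig.registerQuery("counts = FOREACH by_level GENERATE group, COUNT(logs);");

        // Write the result to a hypothetical output path
        pig.store("counts", "/data/log_level_counts");
        pig.shutdown();
    }
}
```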

7. Sqoop (data ETL/synchronization tool)

Sqoop stands for SQL-to-Hadoop and is mainly used to transfer data between traditional databases and Hadoop. The import and export of data is essentially a MapReduce program that takes full advantage of MR's parallelism and fault tolerance.

Sqoop uses database technology to describe the data schema and transfers data between relational databases, data warehouses, and Hadoop.

8. Flume (log collection tool)

Flume is Cloudera's open-source log collection system. It is distributed, highly reliable, highly fault-tolerant, and easy to customize and extend.

Flume abstracts the process of generating, transmitting, processing, and finally writing data to a target path as a data flow. Within a data flow, the data source supports custom data senders in Flume, so data arriving over different protocols can be collected.

A Flume data flow also supports simple processing of log data, such as filtering and format conversion. In addition, Flume can write logs to a variety of (customizable) data targets.

In general, Flume is a scalable log collection system for massive volumes of logs that is suited to complex environments. It can, of course, also be used to collect other types of data.

9. Oozie (workflow scheduler)

Oozie is an extensible workflow system integrated into the Hadoop stack to coordinate the execution of multiple MapReduce jobs. It can manage complex jobs triggered by external events, including timing and the arrival of data.

An Oozie workflow is a set of actions (for example, Hadoop Map/Reduce jobs, Pig jobs, and so on) arranged in a control-dependency DAG (Directed Acyclic Graph), which specifies the order in which the actions are executed.

Oozie uses hPDL, an XML process definition language, to describe this graph.

10. YARN (distributed resource manager)

YARN is the next-generation MapReduce, i.e. MRv2, which evolved from the first-generation MapReduce. It was proposed mainly to address the original Hadoop's poor scalability and its lack of support for multiple computing frameworks.

YARN is the next-generation Hadoop computing platform and a general-purpose runtime framework: users can write their own computing frameworks and run them in this environment.

A user-written framework is used as a library on the client side and packaged when submitting jobs. YARN provides the following capabilities:

Resource management: including application management and machine resource management

Two-tier resource scheduling

Fault tolerance: fault tolerance is considered in each component

Scalability: scalable to tens of thousands of nodes
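
As a small illustration of YARN as a general resource-management layer, here is a sketch that uses the YarnClient API to list running NodeManagers and applications; it assumes the yarn-site.xml of an existing cluster is on the classpath.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnInfoExample {
    public static void main(String[] args) throws Exception {
        // Reads yarn-site.xml from the classpath; assumes an existing cluster
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // Machine resource management: list the NodeManagers that are running
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " capability=" + node.getCapability());
        }

        // Application management: list applications known to the ResourceManager
        List<ApplicationReport> apps = yarn.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " " + app.getYarnApplicationState());
        }

        yarn.stop();
    }
}
```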

11. Spark (in-memory DAG computing model)

Spark is an Apache project billed as "lightning-fast cluster computing". It has a thriving open-source community and is by far the most active Apache project.

Spark was originally a general parallel computing framework similar to Hadoop MapReduce, open-sourced by UC Berkeley's AMP Lab.

Spark provides a faster and more general data processing platform. Compared with Hadoop, Spark can run programs up to 100 times faster in memory, or 10 times faster on disk.
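
For comparison with the MapReduce word count above, here is a minimal sketch of the same job with Spark's Java API; the HDFS paths are hypothetical and the master is expected to be set by spark-submit. The intermediate RDDs stay in memory between stages where possible, which is where Spark's speedup over disk-based MapReduce comes from.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark-word-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Read lines, split into words, count occurrences per word
            JavaRDD<String> lines = sc.textFile("hdfs:///demo/input");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.saveAsTextFile("hdfs:///demo/output");
        }
    }
}
```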

12. Kafka (distributed message queue)

Kafka is an open-source messaging system released by LinkedIn in December 2010, used mainly to handle active streaming data.

Active streaming data is very common in web applications: site page views (PV), what users visit, what content they search for, and so on.

Such data is usually recorded in the form of logs and then processed statistically at regular intervals.
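
A minimal sketch of publishing such log-style events to Kafka with the Java producer client; the broker list and the page-views topic are hypothetical.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PageViewProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker list
        props.put("bootstrap.servers", "kafka1:9092,kafka2:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Publish one page-view event keyed by user; consumers aggregate such events later
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("page-views", "user-42", "/products/123");
            producer.send(record);
        }
    }
}
```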

13. Ambari (installation, deployment, and configuration management tool)

Apache Ambari is used to create, manage, and monitor Hadoop clusters. It is a web-based tool that makes Hadoop and related big data software easier to use.
