Big Data Hadoop Series, Part 1: Introduction to the Hadoop Ecosystem

Tags: hadoop, ecosystem, yarn, spark, getting started

1. Overview of the Hadoop ecosystem

Hadoop is a distributed system infrastructure developed by the Apache Foundation.

Users can develop distributed programs without knowing the underlying details of distribution, making full use of the cluster's power for high-speed computation and storage.

It is characterized by high reliability, efficiency, and scalability.

The core of Hadoop consists of YARN, HDFS, and MapReduce.

The Hadoop ecosystem, which now also incorporates the Spark ecosystem, is outlined below. For some time to come, Hadoop and Spark will coexist, and both can be deployed on resource management systems such as YARN and Mesos.

Below is a brief introduction to each of these components; they will be covered in more detail in later posts in this series.

2. HDFS (Hadoop Distributed File System)

It originates from Google's GFS paper, published in October 2003; HDFS is a GFS clone.

HDFS is the foundation of data storage management in Hadoop system. It is a highly fault-tolerant system that can detect and respond to hardware failures and is used to run on low-cost general-purpose hardware.

HDFS simplifies the file consistency model, provides high-throughput application data access through streaming data access, and is suitable for applications with large datasets.

It provides a write-once, read-many access model, and data is split into blocks that are distributed across the cluster's physical machines.
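
As an illustration, here is a minimal sketch of the write-once / read-many pattern using the HDFS Java client API (org.apache.hadoop.fs.FileSystem). The NameNode address and the file path are placeholder values, not taken from the article.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write a file once, then read it back through the HDFS client.
public class HdfsWriteReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/hello.txt");

            // Write once: an HDFS file is created, written sequentially, then closed.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read many: blocks are streamed back from the DataNodes that hold them.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
        }
    }
}
```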

3. MapReduce (distributed computing framework)

It originates from Google's MapReduce paper, published in December 2004; Hadoop MapReduce is a clone of Google's MapReduce.

MapReduce is a distributed computing model used for processing large volumes of data. It hides the details of the distributed framework and abstracts computation into two phases: map and reduce.

Map applies a specified operation to independent elements of the dataset, producing intermediate results as key-value pairs. Reduce then aggregates all intermediate values that share the same key to produce the final result.

MapReduce is very suitable for data processing in a distributed parallel environment composed of a large number of computers.
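
To make the map/reduce split concrete, here is the classic word-count job written against the Hadoop MapReduce Java API: map emits (word, 1) pairs and reduce sums the values for each key. Input and output paths come from the command line; this is a standard illustrative example rather than anything specific to this article.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic WordCount: map emits (word, 1), reduce sums the values for each key.
public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // intermediate key-value pair
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum)); // final result for this key
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```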

4. HBase (distributed column-oriented database)

It originates from Google's Bigtable paper, published in November 2006; HBase is a Google Bigtable clone.

HBase is a scalable, highly reliable, high-performance, distributed, column-oriented database with a dynamic schema for structured data, built on top of HDFS.

HBase uses Bigtable's data model: an enhanced sparse, sorted mapping table (key/value), where each key is composed of a row key, a column key, and a timestamp.

HBase provides random, real-time read and write access to large-scale data. At the same time, the data stored in HBase can be processed by MapReduce, which perfectly combines data storage and parallel computing.
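
A minimal sketch of random, real-time read/write access with the HBase Java client follows. The ZooKeeper quorum, the table name "user", and the column family "info" are hypothetical placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Minimal sketch of random read/write against an HBase table.
public class HBasePutGetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host"); // hypothetical ZooKeeper address

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) {

            // A cell is addressed by (row key, column family:qualifier, timestamp).
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            Get get = new Get(Bytes.toBytes("row-1"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```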

5. ZooKeeper (distributed coordination service)

It originates from Google's Chubby paper, published in November 2006; ZooKeeper is a Chubby clone.

It solves data management problems in distributed environments: unified naming, state synchronization, cluster management, configuration synchronization, and so on.

Many components of Hadoop depend on Zookeeper, which runs on a cluster of computers and is used to manage Hadoop operations.
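
As a small illustration of configuration synchronization, the sketch below publishes a configuration value as a znode and reads it back with the ZooKeeper Java client. The connection string and znode path are placeholders.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Minimal sketch: store a shared configuration value in a znode and read it back.
public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, (WatchedEvent event) -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await(); // wait until the session is established

        // Create a persistent znode holding a configuration value.
        if (zk.exists("/app-config", false) == null) {
            zk.create("/app-config", "batch.size=100".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any node in the cluster can read the same, consistent value.
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}
```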

6. Hive (data warehouse)

Open-sourced by Facebook, Hive was originally built for statistical analysis of massive structured log data.

Hive defines a SQL-like query language (HQL); queries are converted into MapReduce tasks and executed on Hadoop. It is typically used for offline analysis.

HQL is used to query data stored on Hadoop, allowing developers unfamiliar with MapReduce to write query statements, which Hive then translates into MapReduce tasks that run on Hadoop.
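
A minimal sketch of running HQL through HiveServer2's JDBC interface from Java is shown below; behind the scenes Hive compiles the query into MapReduce (or Tez) jobs and runs them on the cluster. The JDBC URL and the access_log table are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: submit an HQL aggregation through the Hive JDBC driver.
public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hiveserver2-host:10000/default"; // hypothetical HiveServer2

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // A typical offline-analysis style aggregation over a log table.
            ResultSet rs = stmt.executeQuery(
                    "SELECT dt, COUNT(*) FROM access_log GROUP BY dt");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```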

7. Pig (ad-hoc scripting)

Open-sourced by Yahoo!, Pig was designed to provide an ad-hoc (query-time computation) data analysis tool on top of MapReduce.

Pig defines a data-flow language, Pig Latin, which abstracts away the complexity of MapReduce programming. The Pig platform includes a runtime environment and the Pig Latin scripting language for analyzing Hadoop datasets.

The compiler translates Pig Latin scripts into sequences of MapReduce jobs that are executed on Hadoop. It is typically used for offline analysis.
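
Below is a minimal sketch of a Pig Latin data flow (word count) embedded in Java via the PigServer API, assuming local execution mode; the input path, field layout, and output directory are placeholders.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// Minimal sketch: a Pig Latin word-count data flow driven from Java.
// Local mode here; MAPREDUCE mode would run the same flow on a cluster.
public class PigWordCountExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Each registered statement is one step of the Pig Latin data flow.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // STORE triggers compilation of the data flow into MapReduce jobs.
        pig.store("counts", "wordcount-out");
    }
}
```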

8. Sqoop (data ETL / synchronization tool)

Sqoop is short for SQL-to-Hadoop and is mainly used to transfer data between traditional databases and Hadoop. The import and export jobs are essentially MapReduce programs, taking full advantage of MapReduce's parallelism and fault tolerance.

Sqoop uses database technology to describe the data schema and transfers data between relational databases, data warehouses, and Hadoop.

9. Flume (log collection tool)

Flume is a log collection system open-sourced by Cloudera. It is distributed, highly reliable, highly fault-tolerant, and easy to customize and extend.

Flume abstracts the process of generating, transmitting, processing, and finally writing data to a target path as a data flow. Within a data flow, the data source supports custom data senders in Flume, so data arriving over different protocols can be collected.

At the same time, a Flume data flow can perform simple processing on log data, such as filtering and format conversion. In addition, Flume can write logs to a variety of (customizable) data targets.

Generally speaking, Flume is a massive log collection system that is scalable and suitable for complex environments. Of course, it can also be used to collect other types of data.

10. Mahout (data mining algorithm library)

Mahout originated in 2008 as a sub-project of Apache Lucene. It made great progress in a very short period of time and is now a top-level Apache project.

The main goal of Mahout is to create scalable implementations of classic algorithms in the field of machine learning, which are designed to help developers create smart applications more easily and quickly.

Mahout now includes widely used data mining methods, such as clustering, classification, recommendation engine (collaborative filtering) and frequent set mining.

In addition to algorithms, Mahout includes data input/output tools and data mining support infrastructure, such as integration with other storage systems like relational databases, MongoDB, and Cassandra.

11. Oozie (workflow scheduler)

Oozie is an extensible workflow system integrated into the Hadoop stack to coordinate the execution of multiple MapReduce jobs. It can manage complex jobs driven by external events, including time-based triggers and the arrival of data.

An Oozie workflow is a set of actions (for example, Hadoop Map/Reduce jobs, Pig jobs, and so on) arranged in a control-dependency DAG (Directed Acyclic Graph), which specifies the order in which the actions are executed.

Oozie uses hPDL, an XML process definition language, to describe this graph.

12. YARN (distributed resource manager)

YARN is the next generation of MapReduce, i.e. MRv2, which evolved from the first-generation MapReduce. It was proposed mainly to address the original Hadoop's poor scalability and lack of support for multiple computing frameworks. YARN is the next-generation Hadoop computing platform: a general-purpose runtime framework on which users can write their own computing frameworks and run them. A user-written framework is shipped as a client-side library and packaged with the job at submission time. YARN provides the following capabilities:

- Resource management: including application management and machine resource management
- Two-tier resource scheduling
- Fault tolerance: fault tolerance is considered in each component
- Scalability: scales to tens of thousands of nodes

13. Mesos (distributed Resource Manager)

Mesos originated in a UC Berkeley research project and has since become an Apache project. Some companies, such as Twitter, use Mesos to manage cluster resources.

Like YARN, Mesos is a unified resource management and scheduling platform that supports multiple computing frameworks, such as MapReduce and streaming frameworks.

14. Tachyon (distributed memory file system)

Tachyon (named after the faster-than-light particle) is a memory-centric distributed file system with high performance and fault tolerance.

It provides reliable file sharing at memory speed for cluster frameworks such as Spark and MapReduce.

Tachyon was born in AMPLab of UC Berkeley.

15. Tez (DAG Computing Model)

Tez is Apache's latest open source computing framework that supports DAG jobs. It originates directly from the MapReduce framework. The core idea is to further split the Map and Reduce operations.

That is, Map is split into Input, Processor, Sort, Merge and Output, and Reduce is split into Input, Shuffle, Sort, Merge, Processor and Output, etc.

In this way, these decomposed meta-operations can be flexibly combined to generate new operations. After some control programs are assembled, these operations can form a large DAG job.

At present, Hive supports both the MapReduce and Tez computing models; Tez is compatible with existing MapReduce programs and improves computing performance.

16. Spark (memory DAG Computing Model)

Spark is an Apache project, which is billed as "lightning fast cluster computing". It has a thriving open source community and is by far the most active Apache project.

Spark started as a general parallel computing framework, similar to Hadoop MapReduce, open-sourced by the UC Berkeley AMP Lab.

Spark provides a faster and more general data processing platform. Compared with Hadoop, Spark can run programs up to 100 times faster in memory, or 10 times faster on disk.
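
For comparison with the MapReduce version above, here is a minimal in-memory word count using the Spark Java API (assuming Spark 2.x and local mode); on a real cluster the master would be YARN or Mesos, and the input path is a placeholder.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Minimal in-memory word count with the Spark Java API (local mode).
public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("word-count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("input.txt"); // placeholder input path

            // Transformations are lazy and intermediate data stays in memory across
            // stages, which is where much of Spark's speed-up over MapReduce comes from.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.collect().forEach(t -> System.out.println(t._1() + ": " + t._2()));
        }
    }
}
```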

17. Giraph (graph computing model)

Apache Giraph is a scalable distributed iterative graph processing system based on Hadoop platform, inspired by BSP (bulk synchronous parallel) and Google's Pregel.

It first came from Yahoo!, which developed Giraph following the principles of "Pregel: A System for Large-Scale Graph Processing," published by Google engineers in 2010. Later, Yahoo! donated Giraph to the Apache Software Foundation.

Today, anyone can download Giraph; it has become an open-source project of the Apache Software Foundation, supported by Facebook and improved in many ways.

18. GraphX (graph computing model)

Spark GraphX was originally a distributed graph computing framework project at Berkeley AMPLab, and is now integrated into the Spark runtime to provide large-scale BSP parallel graph computing capabilities.

19. MLlib (machine learning library)

Spark MLlib is a machine learning library, which provides a variety of algorithms for classification, regression, clustering, collaborative filtering and so on.
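
As a small illustration, the sketch below clusters a few 2-D points with the RDD-based KMeans algorithm from Spark MLlib; the data points and parameters (k = 2, 20 iterations) are made up for the example.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Minimal sketch: cluster a handful of 2-D points with MLlib's KMeans.
public class MLlibKMeansExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("kmeans").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Vector> points = sc.parallelize(Arrays.asList(
                    Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
                    Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)));

            // k = 2 clusters, at most 20 iterations (illustrative values).
            KMeansModel model = KMeans.train(points.rdd(), 2, 20);
            for (Vector center : model.clusterCenters()) {
                System.out.println(center);
            }
        }
    }
}
```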

20. Spark Streaming (stream computing model)

Spark Streaming supports real-time processing of streaming data, computing over it in micro-batches.
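
The micro-batch model can be illustrated with the following sketch, which counts words arriving on a TCP socket in 5-second batches using the Spark Streaming Java API; the host, port, and batch interval are placeholders.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

// Minimal micro-batch sketch: word count over a socket stream in 5-second batches.
public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("streaming-word-count").setMaster("local[2]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        // Each 5-second micro-batch is processed as a small RDD job.
        JavaPairDStream<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.print();
        ssc.start();
        ssc.awaitTermination();
    }
}
```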

21. Kafka (distributed message queue)

Kafka is an open source messaging system developed by Linkedin in December 2010, which is mainly used to deal with active streaming data.

Active streaming data is very common in web applications, including site page views (PV), what users visit, what they search for, and so on.

These data are usually recorded in the form of logs and then processed statistically at regular intervals.
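
A minimal sketch of publishing activity-stream events (for example, page views) to Kafka with the Java producer API follows; the broker address, topic name, and record contents are hypothetical.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal sketch: publish page-view events to a Kafka topic.
public class PageViewProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092"); // hypothetical broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key = user id, value = the page viewed; downstream consumers
            // (for example, a streaming job) aggregate these events periodically.
            producer.send(new ProducerRecord<>("page-views", "user-42", "/index.html"));
            producer.flush();
        }
    }
}
```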

22. Phoenix (HBase SQL interface)

Apache Phoenix is a SQL driver for HBase. Phoenix makes HBase accessible through JDBC and converts SQL queries into HBase scans and the corresponding operations.
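
A minimal sketch of SQL over HBase through the Phoenix JDBC driver is shown below; the ZooKeeper quorum in the URL and the users table are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: create, upsert, and query an HBase-backed table via Phoenix JDBC.
public class PhoenixExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
        String url = "jdbc:phoenix:zk-host:2181"; // hypothetical ZooKeeper quorum

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {

            stmt.execute("CREATE TABLE IF NOT EXISTS users (id BIGINT PRIMARY KEY, name VARCHAR)");
            stmt.execute("UPSERT INTO users VALUES (1, 'alice')");
            conn.commit(); // Phoenix batches mutations until commit

            // Phoenix translates this query into HBase scans under the hood.
            ResultSet rs = stmt.executeQuery("SELECT id, name FROM users");
            while (rs.next()) {
                System.out.println(rs.getLong(1) + "\t" + rs.getString(2));
            }
        }
    }
}
```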

23. Ranger (Security Management tool)

Apache Ranger is a permissions framework for Hadoop clusters that provides fine-grained data access control along with operation, monitoring, and management. It offers a centralized mechanism to manage all data permissions across the YARN-based Hadoop ecosystem.

24. Knox (Hadoop security gateway)

Apache Knox is a REST API gateway for accessing a Hadoop cluster. It provides a single access point for all REST access and handles AAA (Authentication, Authorization, Auditing) and SSO (single sign-on).

25. Falcon (data Lifecycle Management tool)

Apache Falcon is a new data processing and management platform for Hadoop, designed for data movement, data pipeline coordination, life cycle management and data discovery. It enables end users to quickly "onboard" their data and related processing and management tasks to the Hadoop cluster.

26. Ambari (installation, deployment, and configuration management tool)

Apache Ambari is used to create, manage, and monitor Hadoop clusters. It is a web-based tool that makes Hadoop and related big data software easier to use.

