
Hadoop Ecosystem Components


1. Hadoop Common is the lowest-level module of the Hadoop stack. It provides the shared utilities that every other Hadoop submodule builds on, such as the Configuration system for cluster settings, the RPC remote-call mechanism, the serialization framework, and logging. It is the foundation for the other modules.
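
To make Hadoop Common's role concrete, here is a minimal sketch (not from the original article) of reading cluster settings with the Configuration tool mentioned above. The site-file path is an assumption; fs.defaultFS and dfs.replication are standard Hadoop keys.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class ConfDemo {
        public static void main(String[] args) {
            // Loads core-default.xml / core-site.xml from the classpath by default.
            Configuration conf = new Configuration();
            // Illustrative: add an extra site file (this path is an assumption).
            conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
            // Read typed values, supplying defaults if the keys are unset.
            String fsUri = conf.get("fs.defaultFS", "file:///");
            int replication = conf.getInt("dfs.replication", 3);
            System.out.println(fsUri + " replication=" + replication);
        }
    }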

2. HDFS (Hadoop Distributed File System) is the cornerstone of Hadoop. It is a highly fault-tolerant file system designed to run on inexpensive commodity hardware, and it provides high-throughput data access, which makes it well suited to applications that work with very large data sets.
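
A minimal sketch of the HDFS Java API, writing and then listing a file. The NameNode URI and the paths are illustrative assumptions, not from the article.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode:9000"), new Configuration());
            // Write a small file; in practice HDFS favors large, streaming writes.
            try (FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"))) {
                out.writeBytes("hello hdfs\n");
            }
            // List the directory we just wrote to.
            for (FileStatus st : fs.listStatus(new Path("/demo"))) {
                System.out.println(st.getPath() + " " + st.getLen());
            }
        }
    }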

3. YARN is Hadoop's unified resource-management and scheduling platform. It addresses shortcomings of the previous generation, such as low resource utilization and incompatibility with heterogeneous computing frameworks, and it provides a resource-isolation scheme along with more than one scheduler implementation.
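
As a small illustration of talking to YARN's ResourceManager, here is a sketch using the YarnClient API. It is not from the article; cluster details are taken from the local yarn-site.xml.

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnDemo {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration());  // reads yarn-site.xml
            yarn.start();
            // Ask the ResourceManager for every application it knows about.
            for (ApplicationReport app : yarn.getApplications()) {
                System.out.println(app.getApplicationId()
                        + " " + app.getYarnApplicationState());
            }
            yarn.stop();
        }
    }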

4. MapReduce is a programming model. Borrowing ideas from functional programming, it splits the processing of a data set into two phases, Map and Reduce, a model that fits distributed computing very naturally. Hadoop provides a MapReduce computing framework that implements this model, and users can write jobs in several languages, including Java, C++, Python, and PHP. The word-count sketch below makes the two phases concrete.
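
Word count is the canonical illustration of the Map and Reduce phases; this condensed Java sketch (class names are ours, not the article's) shows both sides.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map phase: emit (word, 1) for every token in a line.
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String tok : value.toString().split("\\s+")) {
                    word.set(tok);
                    ctx.write(word, ONE);
                }
            }
        }
        // Reduce phase: sum the counts collected for each word.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
    }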

5. Spark is a newer-generation computing framework developed by the AMP Lab at the University of California, Berkeley. It excels at iterative computation, offers a significant performance improvement over MapReduce, integrates with YARN, and provides components such as Spark SQL.
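
For contrast with the MapReduce version above, the same word count in Spark's Java API fits in a few lines. This is a sketch; the HDFS paths are assumptions, and the master is expected to be supplied by spark-submit (for example, YARN).

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("SparkWordCount");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                sc.textFile("hdfs:///demo/input.txt")
                  .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                  .mapToPair(w -> new Tuple2<>(w, 1))
                  .reduceByKey(Integer::sum)  // intermediate results stay in memory
                  .saveAsTextFile("hdfs:///demo/counts");
            }
        }
    }

Keeping intermediate data in memory rather than spilling it to HDFS between stages is what gives Spark its edge in iterative workloads.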

6. HBase grew out of Google's Bigtable paper. It is a distributed, column-oriented open-source database that adopts Bigtable's column-family data model, and it excels at random, real-time read/write access to large-scale data.
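
A minimal sketch of HBase's random read/write path using the Java client. The table name, column family, and values are illustrative assumptions.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseDemo {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                         ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                // Write one cell: row "row1", column family "info", qualifier "name".
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                        Bytes.toBytes("alice"));
                table.put(put);
                // Random, real-time read of the same row.
                Result r = table.get(new Get(Bytes.toBytes("row1")));
                System.out.println(Bytes.toString(
                        r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            }
        }
    }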

7. ZooKeeper is a distributed service framework based on the Fast Paxos algorithm. It solves the consistency problem in distributed systems and provides services such as configuration maintenance, naming, distributed synchronization, and group membership.
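
A sketch of the configuration-maintenance use case with the plain ZooKeeper client. The znode path, data, and ensemble address are illustrative assumptions.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkDemo {
        public static void main(String[] args) throws Exception {
            // Connect to the ensemble; this watcher simply ignores session events.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});
            // Store a piece of shared configuration in a znode.
            zk.create("/app-config", "v1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            // Any client in the cluster can now read the same value consistently.
            byte[] data = zk.getData("/app-config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }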

8. Hive, originally developed and used by Facebook, is a data-warehouse tool built on Hadoop. It maps structured data files to tables, offers simple SQL query functionality, and translates the SQL into MapReduce jobs for execution. One notable point is its low learning cost, which lowers the barrier to using Hadoop.
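
Hive is typically queried over JDBC against HiveServer2. A sketch follows; the host, database, table, and credentials are assumptions, and the point is that a plain SQL statement is compiled into MapReduce work behind the scenes.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveDemo {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {
                // Hive compiles this SQL into one or more MapReduce jobs.
                ResultSet rs = stmt.executeQuery(
                        "SELECT category, COUNT(*) FROM products GROUP BY category");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }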

9. Pig, like Hive, is a tool for analyzing and evaluating big data sets. Unlike Hive, Pig provides a higher-level, domain-specific abstract language, Pig Latin, and it likewise compiles Pig Latin into MapReduce jobs. Compared with SQL, Pig Latin is more flexible, but its learning cost is higher.
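
Pig Latin can also be driven from Java through the PigServer API. A sketch under illustrative assumptions (script and paths are ours) of a simple line-count expressed in Pig Latin:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigDemo {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            // Each registered statement is Pig Latin; Pig compiles the plan to MapReduce.
            pig.registerQuery("lines = LOAD '/demo/input.txt' AS (line:chararray);");
            pig.registerQuery("grouped = GROUP lines ALL;");
            pig.registerQuery("total = FOREACH grouped GENERATE COUNT(lines);");
            pig.store("total", "/demo/line_count");
            pig.shutdown();
        }
    }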

10. Impala, developed by Cloudera, provides an SQL interface for interactive queries over massive data stored in HDFS and HBase. Besides sharing the same unified storage platform as Hive, Impala uses the same metadata, SQL syntax, ODBC driver, and user interface, giving a familiar, unified platform for both batch and real-time queries. Impala's hallmark is query speed; its performance is far better than Hive's. It is not built on MapReduce: it is positioned as an OLAP engine and is an open-source implementation of Dremel, one of the papers in Google's "new troika."

11. Mahout is a machine-learning and data-mining library. It implements classic machine-learning algorithms such as k-means, Naive Bayes, and collaborative filtering on top of the MapReduce programming model, making them scalable.

12. Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting massive amounts of log data. Flume supports custom data senders in the logging system for gathering data, and it can also do simple processing on the data before writing it out to the various data receivers, as the configuration sketch below illustrates.
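
A Flume agent is wired together with a properties file rather than code. This minimal configuration is illustrative only (the agent name, log path, and HDFS path are assumptions): it tails an application log and delivers the events to HDFS.

    # One agent (a1) = one source + one channel + one sink.
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: tail an application log (the "data sender" side).
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/app.log
    a1.sources.r1.channels = c1

    # Channel: buffer events in memory between source and sink.
    a1.channels.c1.type = memory

    # Sink: write the collected events to HDFS (the "data receiver" side).
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode:9000/flume/logs
    a1.sinks.k1.channel = c1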

13. Sqoop is short for "SQL to Hadoop." Its main job is exchanging data between structured data stores and Hadoop: Sqoop can import data from a relational database into HDFS or Hive, and export it from HDFS or Hive back into a relational database. Sqoop leverages Hadoop's strengths, parallelizing the entire import or export as MapReduce jobs, which makes it very efficient.
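
A typical Sqoop run is a single command line. An illustrative import (the JDBC URL, credentials, table, and target directory are assumptions) that Sqoop fans out across four parallel map tasks:

    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl \
      --table orders \
      --target-dir /warehouse/orders \
      --num-mappers 4

The matching sqoop export command moves data in the other direction, from HDFS or Hive back into the relational database.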

14. Kafka is a high-throughput, distributed publish/subscribe messaging system. Being distributed and highly available, it is widely used in big-data systems. If a big-data platform is compared to a machine, Kafka is its front-side bus, connecting the components of the platform.
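
A sketch of publishing an event with Kafka's Java producer client. The broker address, topic name, and message contents are illustrative assumptions.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KafkaDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            // Publish one event; any number of consumers can subscribe to "logs".
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("logs", "host1", "app started"));
            }
        }
    }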

There are other components in the ecosystem as well, such as the real-time stream-processing framework Storm.
