2025-04-04 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
Hadoop is not a single program but a collection of open-source projects. Here is a brief introduction to the major ones.
Apache Hadoop: the distributed computing framework of the Apache open-source organization. It provides a distributed file system (HDFS) and a software architecture that supports MapReduce distributed computation.
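The MapReduce model that Hadoop implements can be sketched in plain Python. The following word-count example is purely illustrative (it does not use any Hadoop API); the three function names simply mirror the phases of a MapReduce job:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all pairs by key, as the framework does between phases."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (key, [v for _, v in group])

def reduce_phase(grouped):
    """Reduce: combine the values collected for each key into a final result."""
    return {key: sum(values) for key, values in grouped}

docs = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

In real Hadoop the map and reduce phases run on many machines in parallel and the shuffle moves data across the network; the structure of the computation is the same.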
Apache Hive: a data warehouse tool built on Hadoop. It maps structured data files to database tables and lets users run simple MapReduce statistics through SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to statistical analysis over a data warehouse. Anyone familiar with SQL can get started quickly.
Apache Pig: a large-scale data analysis tool based on Hadoop. It provides a SQL-like language called Pig Latin, whose compiler translates data-analysis requests into a series of optimized MapReduce operations.
Apache HBase: a highly reliable, high-performance, column-oriented, scalable distributed storage system. With HBase, large structured-storage clusters can be built on inexpensive commodity servers.
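HBase's "column-oriented" data model is easy to picture as nested maps: a row key leads to column families, a family to column qualifiers, and a cell holds multiple timestamped versions. A minimal in-memory sketch (illustrative only, not the HBase client API):

```python
# table: row key -> column family -> qualifier -> {timestamp: value}
table = {}

def put(row, family, qualifier, value, ts):
    """Write one versioned cell value."""
    table.setdefault(row, {}).setdefault(family, {}).setdefault(qualifier, {})[ts] = value

def get(row, family, qualifier):
    """Return the newest version of a cell, as HBase does by default."""
    versions = table[row][family][qualifier]
    return versions[max(versions)]

put("user1", "info", "name", "Alice", ts=1)
put("user1", "info", "name", "Alicia", ts=2)  # newer version of the same cell
print(get("user1", "info", "name"))  # Alicia
```

Storing data family by family rather than row by row is what lets HBase scan one column family across billions of rows without reading the rest.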
Apache Sqoop: a tool for transferring data between Hadoop and relational databases. Data from a relational database (MySQL, Oracle, PostgreSQL, etc.) can be imported into HDFS, and HDFS data can be exported back into relational databases.
Apache ZooKeeper: a distributed, open-source coordination service designed for distributed applications. It addresses data-management problems commonly encountered in distributed applications, simplifies coordination and management, and provides these services with high performance.
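One classic coordination problem ZooKeeper solves is leader election: each participant creates a sequential, ephemeral node, and whoever holds the smallest sequence number is the leader; when that node disappears, leadership passes to the next. A toy single-process simulation of the idea (not the real ZooKeeper client API):

```python
import itertools

counter = itertools.count()  # stands in for ZooKeeper's sequence numbers
nodes = {}                   # sequence number -> client name ("ephemeral nodes")

def create_sequential(client_name):
    """Each participant registers a sequentially numbered node."""
    seq = next(counter)
    nodes[seq] = client_name
    return seq

def current_leader():
    """The participant with the smallest sequence number leads."""
    return nodes[min(nodes)]

a = create_sequential("worker-a")
b = create_sequential("worker-b")
print(current_leader())  # worker-a

del nodes[a]             # the leader's ephemeral node vanishes when it fails
print(current_leader())  # worker-b
```

The real service adds the hard parts this sketch omits: replicated consensus among the ZooKeeper servers, watches that notify clients of changes, and session timeouts that delete ephemeral nodes automatically.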
Apache Mahout: a distributed machine learning and data mining framework based on Hadoop. Mahout implements a number of data mining algorithms on MapReduce, addressing the problem of mining large datasets in parallel.
Apache Cassandra: an open-source distributed NoSQL database system. Originally developed by Facebook, it combines Google Bigtable's data model with Amazon Dynamo's fully distributed architecture.
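The Dynamo side of that design rests on consistent hashing: nodes and keys are both hashed onto a ring, and a key belongs to the first node clockwise from it, so adding or removing a node moves only a small fraction of the keys. A short sketch of the idea (MD5 and the node names are illustrative choices, not Cassandra's actual partitioner):

```python
import bisect
import hashlib

def ring_position(key):
    """Place a string on a 0..2**32 hash ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

nodes = ["node-a", "node-b", "node-c"]
ring = sorted((ring_position(n), n) for n in nodes)
positions = [p for p, _ in ring]

def owner(key):
    """The first node clockwise from the key's position owns the key."""
    i = bisect.bisect(positions, ring_position(key)) % len(ring)
    return ring[i][1]

print(owner("user:42"))  # deterministic: the same key always maps to the same node
```

Production systems place many virtual nodes per physical node on the ring to smooth out the load, but the lookup is the same.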
Apache Avro: a data serialization system designed for data-intensive applications that exchange large volumes of data. Avro provides a compact serialization format and transport mechanism, intended to gradually replace Hadoop's original IPC mechanism.
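The idea behind schema-based serialization is that both sides share a schema, so the bytes on the wire carry only values, never field names. This standard-library sketch conveys the principle; it is NOT Avro's actual wire format:

```python
import struct

# Shared schema: field name -> struct format code (int32, float64).
schema = [("id", "i"), ("score", "d")]
fmt = "<" + "".join(code for _, code in schema)  # little-endian layout

def serialize(record):
    """Pack only the values, in schema order - no field names on the wire."""
    return struct.pack(fmt, *(record[name] for name, _ in schema))

def deserialize(data):
    """Rebuild the record by zipping schema names back onto the values."""
    values = struct.unpack(fmt, data)
    return dict(zip((name for name, _ in schema), values))

blob = serialize({"id": 7, "score": 99.5})
print(len(blob), deserialize(blob))  # 12 {'id': 7, 'score': 99.5}
```

Avro goes much further (schema resolution between writer and reader, variable-length zig-zag encoding, container files), but the compactness comes from the same decision: the schema travels once, not with every record.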
Apache Ambari: a Web-based tool that supports provisioning, management, and monitoring of Hadoop clusters.
Apache Chukwa: an open-source data collection system for monitoring large distributed systems. It collects various kinds of data into files suitable for Hadoop processing and stores them in HDFS, where Hadoop can run MapReduce jobs over them.
Apache Hama: an HDFS-based BSP (Bulk Synchronous Parallel) computing framework. Hama can be used for large-scale big-data computation, including graph, matrix, and network algorithms.
Apache Flume: a distributed, reliable, highly available system for aggregating massive volumes of log data. It can be used for log collection, processing, and transport.
Apache Giraph: a scalable, distributed, iterative graph-processing system built on the Hadoop platform, inspired by the BSP (Bulk Synchronous Parallel) model and Google's Pregel.
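The BSP model behind both Hama and Giraph alternates local computation, message exchange, and a global barrier, in units called supersteps. A toy Pregel-style computation (purely illustrative): each vertex repeatedly sends its value to its neighbours and adopts the maximum it has seen, until no vertex changes.

```python
edges = {1: [2], 2: [1, 3], 3: [2]}   # small undirected graph
values = {1: 5, 2: 1, 3: 9}           # per-vertex state

changed = True
while changed:                         # one loop iteration = one superstep
    inbox = {v: [] for v in values}
    for v, neighbours in edges.items():        # compute + send phase
        for n in neighbours:
            inbox[n].append(values[v])
    changed = False                            # barrier, then apply messages
    for v, msgs in inbox.items():
        new = max([values[v]] + msgs)
        if new != values[v]:
            values[v] = new
            changed = True

print(values)  # {1: 9, 2: 9, 3: 9} - the maximum has propagated everywhere
```

In a real BSP framework each vertex runs on a different worker and the barrier synchronizes machines over the network; the superstep structure is what makes such iterative graph algorithms deterministic and easy to reason about.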
Apache Oozie: a workflow engine server for managing and coordinating jobs that run on the Hadoop platform (HDFS, Pig, and MapReduce).
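At its core, a workflow engine takes a DAG of task dependencies and runs each task only after everything it depends on has finished. The scheduling order can be sketched with the standard library's topological sorter (the task names are made up; this is not Oozie's API or XML workflow format):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
workflow = {
    "clean":     {"ingest"},
    "aggregate": {"clean"},
    "report":    {"aggregate", "clean"},
}

order = list(TopologicalSorter(workflow).static_order())
print(order)  # a valid order, e.g. ['ingest', 'clean', 'aggregate', 'report']
```

Oozie adds what this sketch lacks: triggering workflows on time schedules or data availability, retrying failed actions, and submitting each node as an actual Hadoop job.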
Apache Crunch: a Java library, based on Google's FlumeJava, for creating MapReduce pipelines. Like Hive and Pig, Crunch provides a library of patterns for common tasks such as joining data, performing aggregations, and sorting records.
Apache Whirr: a set of libraries for running services (including Hadoop) on cloud providers in a provider-neutral way. Whirr supports Amazon EC2 and Rackspace.
Apache Bigtop: a tool for packaging, distributing, and testing Hadoop and its surrounding ecosystem.
Apache HCatalog: a table and storage management layer for Hadoop that provides centralized metadata and schema management across Hadoop and RDBMSs, exposing relational views through Pig and Hive.
Cloudera Hue: a web-based monitoring and management system providing web-based operation and management of HDFS, MapReduce/YARN, HBase, Hive, and Pig.
CDH, produced by Cloudera, is a "packaged distribution" containing many tools of the Hadoop ecosystem: Cloudera performs secondary development on top of the original Hadoop and other open-source projects, and the result is CDH.