This article surveys the open source tools of the Hadoop ecosystem. The introductions below are brief, practical, and easy to follow; interested readers may wish to take a look and then try the tools themselves.
1. Apache Mesos
Code hosting address: Apache SVN
Mesos provides efficient resource isolation and sharing across distributed applications and frameworks, and supports Hadoop, MPI, Hypertable, Spark, and more.
Mesos is an open source project incubated at Apache. It uses ZooKeeper for fault-tolerant replication, Linux Containers to isolate tasks, and supports scheduling of multiple resource types (memory and CPU). It provides Java, Python, and C++ APIs for developing new parallel applications, plus a Web-based user interface for viewing cluster state. A scheduler skeleton follows.
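To make the framework API concrete, here is a hedged skeleton of a Mesos scheduler using the Java bindings (org.apache.mesos, classic 0.x/1.x API). It registers with a master and simply declines every resource offer; a real framework would launch tasks from those offers. The master URL is an assumption.

```java
import java.util.List;

import org.apache.mesos.MesosSchedulerDriver;
import org.apache.mesos.Protos;
import org.apache.mesos.Scheduler;
import org.apache.mesos.SchedulerDriver;

public class MesosSketch implements Scheduler {
  public void registered(SchedulerDriver d, Protos.FrameworkID id,
                         Protos.MasterInfo master) {
    System.out.println("Registered as framework " + id.getValue());
  }
  public void resourceOffers(SchedulerDriver d, List<Protos.Offer> offers) {
    for (Protos.Offer offer : offers) {
      d.declineOffer(offer.getId());          // a real scheduler launches tasks here
    }
  }
  // Remaining callbacks are no-ops in this sketch.
  public void reregistered(SchedulerDriver d, Protos.MasterInfo master) {}
  public void offerRescinded(SchedulerDriver d, Protos.OfferID id) {}
  public void statusUpdate(SchedulerDriver d, Protos.TaskStatus status) {}
  public void frameworkMessage(SchedulerDriver d, Protos.ExecutorID e,
                               Protos.SlaveID s, byte[] data) {}
  public void disconnected(SchedulerDriver d) {}
  public void slaveLost(SchedulerDriver d, Protos.SlaveID id) {}
  public void executorLost(SchedulerDriver d, Protos.ExecutorID e,
                           Protos.SlaveID s, int status) {}
  public void error(SchedulerDriver d, String message) {}

  public static void main(String[] args) {
    Protos.FrameworkInfo framework = Protos.FrameworkInfo.newBuilder()
        .setUser("")                          // empty: let Mesos fill in the current user
        .setName("mesos-sketch")
        .build();
    new MesosSchedulerDriver(new MesosSketch(), framework,
        "zk://localhost:2181/mesos").run();   // assumed master address
  }
}
```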
2. Hadoop YARN
Code hosting address: Apache SVN
YARN, also known as MapReduce 2.0, borrows ideas from Mesos and proposes its own resource isolation solution, the Container, though at the time of writing it was not yet mature and provided only memory isolation for Java virtual machines.
Compared with the MapReduce 1.x architecture, YARN changes little on the client side and keeps most compatibility in its calling APIs and interfaces. Internally, however, YARN replaces the JobTracker and TaskTracker at the core of the original framework with the ResourceManager, ApplicationMaster, and NodeManager. The ResourceManager is a central service responsible for scheduling and starting the ApplicationMaster that each Job belongs to, and for monitoring that ApplicationMaster's liveness; the NodeManager maintains Container state and heartbeats to the ResourceManager; and the ApplicationMaster handles all the work in a Job's lifecycle, similar to the JobTracker in the old framework. A minimal client-side submission sketch follows.
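The sketch below shows how an application is handed to the ResourceManager through YARN's public client API. The application name, launch command, and resource sizes are illustrative assumptions; a real application would also set up local resources and environment for its ApplicationMaster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

import java.util.Collections;

public class YarnSubmitSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();                       // talks to the ResourceManager

    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("demo-app");

    // The command that launches our (hypothetical) ApplicationMaster
    // inside the container the ResourceManager allocates for it.
    ContainerLaunchContext amContainer =
        Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(
        Collections.singletonList("java -Xmx256m com.example.MyAppMaster"));
    ctx.setAMContainerSpec(amContainer);

    // Resources requested for the ApplicationMaster container.
    ctx.setResource(Resource.newInstance(512 /* MB */, 1 /* vcores */));

    ApplicationId id = yarnClient.submitApplication(ctx);
    System.out.println("Submitted application " + id);
  }
}
```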
Real-time solutions on Hadoop
As mentioned earlier, Internet companies tend to adopt several computing frameworks at once, driven by their business logic; a search company, for example, might use MapReduce for web indexing and Spark for natural language processing. This section covers three such frameworks: Storm, Impala, and Spark.
3. Cloudera Impala
Code hosting address: GitHub
Impala, developed by Cloudera, is an open source Massively Parallel Processing (MPP) query engine. It shares the same metadata, SQL syntax, ODBC driver, and user interface (Hue Beeswax) as Hive, and provides fast, interactive SQL queries directly on data in HDFS or HBase. Impala was inspired by Google's Dremel, and its first version was released in late 2012.
Instead of running slow Hive+MapReduce batch jobs, Impala queries data in HDFS or HBase directly with SELECT, JOIN, and aggregate functions through a distributed query engine (composed of a Query Planner, Query Coordinator, and Query Exec Engine) similar to those found in commercial parallel relational databases, greatly reducing latency. A hedged JDBC sketch follows.
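Because Impala speaks the HiveServer2 wire protocol, the stock Hive JDBC driver can issue interactive queries against it. In the sketch below, the host, port (21050 is Impala's documented default for HiveServer2-protocol clients), table, and the no-auth setting are all assumptions about the target cluster.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQuerySketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://impala-host:21050/default;auth=noSasl");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM logs GROUP BY page")) {
      while (rs.next()) {
        // Results stream back interactively, without a MapReduce job.
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```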
4. Spark
Code hosting address: Apache
Spark is an open source cluster computing framework for data analysis, originally developed at the University of California, Berkeley's AMPLab, on top of HDFS. Like Hadoop, Spark is used to build large-scale, low-latency data analysis applications. Spark is implemented in Scala, which also serves as its application framework language.
Spark optimizes iterative workloads and interactive queries with in-memory distributed datasets. Unlike Hadoop, Spark is tightly integrated with Scala, which makes distributed datasets feel like local collection objects. Spark supports iterative jobs over distributed datasets and can run alongside Hadoop on the Hadoop file system (through YARN, Mesos, and so on). A short sketch using Spark's Java API follows.
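This minimal sketch uses Spark's Java API to show the idea behind in-memory datasets: an RDD is cached once and then reused by several operations without rereading the source. The local master URL and sample data are illustrative only.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class SparkSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("spark-sketch")
        .setMaster("local[*]");               // or a YARN/Mesos cluster
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<Integer> nums =
          sc.parallelize(Arrays.asList(1, 2, 3, 4, 5)).cache(); // keep in memory

      // The cached dataset is reused cheaply across multiple actions,
      // which is what makes iterative workloads fast.
      int sum = nums.reduce(Integer::sum);
      long evens = nums.filter(n -> n % 2 == 0).count();
      System.out.println("sum=" + sum + " evens=" + evens);
    }
  }
}
```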
5. Storm
Code hosting address: GitHub
Storm is a distributed, fault-tolerant real-time computation system developed at BackType, which was later acquired by Twitter. Storm is a stream processing platform, mostly used for real-time computation and for updating databases. Storm can also be used for "continuous computation", running standing queries over data streams and emitting results to users as a stream, and for distributed RPC, running expensive operations in parallel. A minimal topology sketch follows.
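The sketch below wires up a tiny Storm topology: a built-in test spout emits words and a bolt transforms each one. Package names follow the Apache 1.x releases (org.apache.storm); the pre-Apache releases contemporary with this article used backtype.storm instead, but the shape of the API is the same.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class StormSketch {
  // Appends "!" to every word flowing through the stream.
  public static class ExclaimBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
      collector.emit(new Values(input.getString(0) + "!"));
    }
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("words", new TestWordSpout(), 1);
    builder.setBolt("exclaim", new ExclaimBolt(), 2).shuffleGrouping("words");

    // Runs in-process for testing; use StormSubmitter on a real cluster.
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("demo", new Config(), builder.createTopology());
    Thread.sleep(10_000);                     // let the stream run briefly
    cluster.shutdown();
  }
}
```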
Other solutions on Hadoop
As mentioned earlier, real-time business needs have led various labs to invent stream and real-time processing tools such as Storm, Impala, Spark, and Samza. In this section we share open source solutions that address performance, compatibility, and data types, including Shark, Phoenix, Apache Accumulo, Apache Drill, Apache Giraph, Apache Hama, Apache Tez, and Apache Ambari.
6. Shark
Code hosting address: GitHub
Shark stands for "Hive on Spark": a large-scale data warehouse system built on Spark that is compatible with Apache Hive. It can execute Hive QL up to 100x faster without modifying existing data or queries.
Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, and integrates seamlessly with existing Hive deployments as a faster, more powerful alternative.
7. Phoenix
Code hosting address: GitHub
Phoenix is a SQL layer built on top of Apache HBase, written entirely in Java, that provides an embeddable JDBC driver for clients. The Phoenix query engine compiles a SQL query into one or more HBase scans and orchestrates their execution to produce standard JDBC result sets. By using the HBase API, coprocessors, and custom filters directly, it achieves millisecond-scale latencies for simple queries and second-scale latencies for queries over millions of rows. Phoenix is hosted entirely on GitHub.
Phoenix's noteworthy features include: (1) an embedded JDBC driver that implements most of the java.sql interfaces, including the metadata APIs; (2) columns modeled via multi-part row keys or key/value cells; (3) DDL support; (4) a versioned schema repository; (5) DML support; (6) limited transaction support through client-side batching; and (7) close adherence to the ANSI SQL standard. A JDBC sketch follows.
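The sketch below exercises the embedded JDBC driver end to end. The ZooKeeper quorum address and the table are illustrative assumptions; note that Phoenix uses UPSERT rather than INSERT, and that client-side batching means writes are flushed on commit.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:phoenix:localhost:2181");  // ZooKeeper quorum of the HBase cluster
         Statement stmt = conn.createStatement()) {
      stmt.execute("CREATE TABLE IF NOT EXISTS metrics ("
          + "host VARCHAR NOT NULL, ts BIGINT NOT NULL, value DOUBLE "
          + "CONSTRAINT pk PRIMARY KEY (host, ts))");
      stmt.executeUpdate("UPSERT INTO metrics VALUES ('web1', 1, 0.5)");
      conn.commit();                          // client-side batch flushed here

      try (ResultSet rs = stmt.executeQuery(
               "SELECT host, MAX(value) FROM metrics GROUP BY host")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
        }
      }
    }
  }
}
```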
8. Apache Accumulo
Code hosting address: Apache SVN
Apache Accumulo is a reliable, scalable, high-performance, sorted, distributed key/value store featuring cell-level access control and customizable server-side processing. It follows Google BigTable's design and is built on Apache Hadoop, ZooKeeper, and Thrift. Accumulo was originally developed at the NSA and later donated to the Apache Foundation.
Compared with Google BigTable, Accumulo's main improvements are cell-level access control and its server-side programming mechanism; the latter allows Accumulo to modify key/value pairs at various points in the data management process. A write sketch follows.
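The sketch below writes one cell with a visibility label, the unit of Accumulo's cell-level access control, through the Accumulo 1.x client API. The instance name, ZooKeeper address, credentials, and table are all assumptions.

```java
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;

public class AccumuloSketch {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("demo-instance", "localhost:2181")
        .getConnector("user", new PasswordToken("secret"));

    try (BatchWriter writer =
             conn.createBatchWriter("records", new BatchWriterConfig())) {
      Mutation m = new Mutation("row1");
      // The visibility expression enforces cell-level access control:
      // only scanners holding the "public" authorization see this cell.
      m.put("cf", "qualifier", new ColumnVisibility("public"),
            new Value("hello".getBytes()));
      writer.addMutation(m);
    }
  }
}
```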
9. Apache Drill
Code hosting address: GitHub
In essence, Apache Drill is an open source implementation of Google's Dremel: a distributed MPP query layer. By supporting SQL and other languages over NoSQL and Hadoop data storage systems, it helps Hadoop users query massive data sets faster. At the time of writing, Drill could only be counted as a framework, containing just the initial functionality of Drill's larger vision.
Drill's purpose is to support a broader range of data sources, data formats, and query languages. It is designed as a distributed system for interactive analysis of large data sets, able to complete correlation analyses by scanning petabytes of data in roughly seconds. A hedged JDBC sketch follows.
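As a taste of the interactive-SQL goal, here is a sketch against a later Drill release that ships a JDBC driver (the early framework described above did not yet have one). "zk=local" runs an embedded Drillbit, and the query reads the sample employee.json file bundled on Drill's classpath workspace.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
             DriverManager.getConnection("jdbc:drill:zk=local");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             // Schema-free query over a JSON file, no table definition needed.
             "SELECT full_name FROM cp.`employee.json` LIMIT 5")) {
      while (rs.next()) {
        System.out.println(rs.getString("full_name"));
      }
    }
  }
}
```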
10. Apache Giraph
Code hosting address: GitHub
Apache Giraph is a scalable, distributed, iterative graph processing system inspired by BSP (bulk synchronous parallel) computing and Google's Pregel. Unlike them, it is open source and built on Hadoop.
The Giraph platform is suited to large-scale graph computations such as page ranking, shared connections, and personalized ranking. Giraph focuses on social-graph computing and is at the heart of Facebook's Open Graph tooling, processing trillions of connections between users and their behavior within minutes. A PageRank-style sketch follows.
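The sketch below shows the vertex-centric, superstep-driven model using Giraph's 1.x BasicComputation API, with a PageRank-style update; the damping factor and iteration count are illustrative.

```java
import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class PageRankSketch extends
    BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  private static final int MAX_SUPERSTEPS = 30;

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() >= 1) {
      double sum = 0;
      for (DoubleWritable msg : messages) {
        sum += msg.get();                     // rank mass from in-neighbors
      }
      vertex.setValue(new DoubleWritable(
          0.15 / getTotalNumVertices() + 0.85 * sum));
    }
    if (getSuperstep() < MAX_SUPERSTEPS) {
      // Spread this vertex's rank evenly across its outgoing edges.
      sendMessageToAllEdges(vertex,
          new DoubleWritable(vertex.getValue().get() / vertex.getNumEdges()));
    } else {
      vertex.voteToHalt();                    // computation ends when all halt
    }
  }
}
```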
11. Apache Hama
Code hosting address: GitHub
Apache Hama is a BSP (Bulk Synchronous Parallel) computing framework built on Hadoop, modeled after Google's Pregel. It is used for large-scale scientific computation, especially matrix and graph calculations. In a cluster environment, its architecture consists of BSPMaster/GroomServer (the computation engine), ZooKeeper (distributed locking), and HDFS/HBase (the storage systems). A skeleton superstep sketch follows.
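To illustrate the BSP model, here is a hedged skeleton of one Hama superstep, written against the API shape of the Hama 0.x examples (types and package paths are assumptions): each peer sends a message to every other peer, synchronizes at the barrier, then reads what it received.

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

public class HamaSketch extends
    BSP<NullWritable, NullWritable, NullWritable, NullWritable, DoubleWritable> {

  @Override
  public void bsp(
      BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable,
              DoubleWritable> peer)
      throws IOException, SyncException, InterruptedException {
    // Local computation phase: send an (illustrative) value to every peer.
    for (String other : peer.getAllPeerNames()) {
      peer.send(other, new DoubleWritable(1.0));
    }
    peer.sync();                              // barrier ends the superstep

    // Messages sent before the barrier become visible after it.
    DoubleWritable msg;
    double sum = 0;
    while ((msg = peer.getCurrentMessage()) != null) {
      sum += msg.get();
    }
    System.out.println(peer.getPeerName() + " received total " + sum);
  }
}
```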
12. Apache Tez
Code hosting address: GitHub
Apache Tez is a DAG (Directed Acyclic Graph) computing framework built on Hadoop YARN. It splits the Map/Reduce process into sub-processes and can combine several Map/Reduce tasks into one larger DAG job, reducing the intermediate files written between Map/Reduce stages; combining sub-processes sensibly also shortens job run time. Tez is developed and chiefly supported by Hortonworks. A hedged DAG-assembly sketch follows.
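The sketch below assembles a two-vertex DAG with the Tez 0.x DAG API. The processor and input/output class names (com.example.*) are hypothetical placeholders, and a real job would also wire data sources, sinks, and edge payload configuration; the point is that intermediate data flows vertex to vertex instead of through HDFS between separate MR jobs.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.EdgeProperty;
import org.apache.tez.dag.api.EdgeProperty.DataMovementType;
import org.apache.tez.dag.api.EdgeProperty.DataSourceType;
import org.apache.tez.dag.api.EdgeProperty.SchedulingType;
import org.apache.tez.dag.api.InputDescriptor;
import org.apache.tez.dag.api.OutputDescriptor;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.dag.api.Vertex;

public class TezSketch {
  public static void main(String[] args) throws Exception {
    TezConfiguration conf = new TezConfiguration(new Configuration());

    // Two stages that classic MapReduce would run as separate jobs.
    Vertex mapper = Vertex.create("tokenize",
        ProcessorDescriptor.create("com.example.TokenizeProcessor"), 4);
    Vertex reducer = Vertex.create("sum",
        ProcessorDescriptor.create("com.example.SumProcessor"), 2);

    // A shuffle-style edge between the two vertices.
    Edge shuffle = Edge.create(mapper, reducer, EdgeProperty.create(
        DataMovementType.SCATTER_GATHER, DataSourceType.PERSISTED,
        SchedulingType.SEQUENTIAL,
        OutputDescriptor.create("com.example.PartitionedOutput"),
        InputDescriptor.create("com.example.ShuffledInput")));

    DAG dag = DAG.create("wordcount-dag")
        .addVertex(mapper).addVertex(reducer).addEdge(shuffle);

    TezClient client = TezClient.create("tez-sketch", conf);
    client.start();
    client.submitDAG(dag);                    // returns a DAGClient for status
    client.stop();
  }
}
```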
13. Apache Ambari
Code hosting address: Apache SVN
Apache Ambari is an open source framework for provisioning, managing, and monitoring Apache Hadoop clusters. It provides intuitive operational tools and a robust RESTful API that hide the complexity of Hadoop operations and greatly simplify cluster administration. Its first version was released in June 2012.
Apache Ambari is now a top-level Apache project. Hortonworks brought Ambari into the Apache Incubator in August 2011 with a vision of making Hadoop cluster management radically simple. In more than two years it has grown substantially, from a small team into a developer community with contributors from organizations beyond Hortonworks. Ambari's user base has grown steadily, and many organizations rely on it to deploy and manage Hadoop clusters at scale in their large data centers.
Currently, the Hadoop components supported by Apache Ambari include HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. A hedged REST sketch follows.
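As a small taste of the management API, the sketch below lists clusters over Ambari's REST interface. The /api/v1/clusters path and basic-auth admin credentials follow Ambari's documented defaults; the host and port are assumptions about the deployment.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class AmbariSketch {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://ambari-host:8080/api/v1/clusters");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    String auth = Base64.getEncoder()
        .encodeToString("admin:admin".getBytes("UTF-8"));
    conn.setRequestProperty("Authorization", "Basic " + auth);
    // Ambari requires this header on state-changing requests; harmless on GET.
    conn.setRequestProperty("X-Requested-By", "ambari");
    try (BufferedReader in = new BufferedReader(
             new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);             // JSON description of the clusters
      }
    }
  }
}
```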
At this point, you should have a deeper understanding of the open source tools of Hadoop; the best next step is to try them in practice.