What are the open source tools for Hadoop

Updated: 2025-02-24. From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/01

This article surveys open source tools in and around the Hadoop ecosystem. For each tool it gives a short description, the supported operating systems, and a related link.

I. Hadoop-related tools

1. Hadoop

Apache's Hadoop project has become nearly synonymous with big data. It has grown into a complete ecosystem of open source tools for highly scalable distributed computing.

Supported operating systems: Windows, Linux, and OS X.

Related link: http://hadoop.apache.org

2. Ambari

As part of the Hadoop ecosystem, this Apache project provides an intuitive web-based interface for configuring, managing, and monitoring Hadoop clusters. For developers who want to integrate Ambari's functions into their own applications, it also provides a REST (Representational State Transfer) API.

Supported operating systems: Windows, Linux, and OS X.

Related link: http://ambari.apache.org

3. Avro

This Apache project provides a data serialization system with rich data structures and a compact format. Schemas are defined in JSON, which makes it easy to integrate with dynamic languages.

Supported operating system: independent of the operating system.

Related link: http://avro.apache.org
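
To make the "schemas are defined in JSON" point concrete, here is a minimal sketch of what an Avro record schema looks like. The record name, namespace, and fields are hypothetical, and the code only parses the JSON with Python's standard library; real serialization would go through an Avro library, which is not shown here.

```python
import json

# Hypothetical Avro record schema; the record and field names are illustrative only.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["null", "int"], "default": null},
    {"name": "favorite_color", "type": ["null", "string"], "default": null}
  ]
}
"""

# An Avro schema is plain JSON, so any JSON parser can read and inspect it.
schema = json.loads(SCHEMA_JSON)
field_names = [field["name"] for field in schema["fields"]]
print(schema["name"], field_names)
```

Because the schema is ordinary JSON, a dynamic language can load it at run time without generating code first.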

4. Cascading

Cascading is an application development platform based on Hadoop. Commercial support and training services are available.

Supported operating system: independent of the operating system.

Related link: http://www.cascading.org/projects/cascading/

5. Chukwa

Chukwa is based on Hadoop and can collect data from large distributed systems for monitoring. It also contains tools for analyzing and displaying data.

Supported operating systems: Linux and OS X.

Related link: http://chukwa.apache.org

6. Flume

Flume can collect log data from other applications and send that data to Hadoop. "It is powerful and fault-tolerant, with reliability mechanisms that can be adjusted and optimized and many failover and recovery mechanisms," the official website says.

Supported operating systems: Linux and OS X.

Related link: https://cwiki.apache.org/confluence/display/FLUME/Home

7. HBase

HBase is a distributed database designed for very large tables with billions of rows and millions of columns. It provides random, real-time read/write access to big data. It is similar to Google's Bigtable, but built on Hadoop and the Hadoop Distributed File System (HDFS).

Supported operating system: independent of the operating system.

Related link: http://hbase.apache.org
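
The Bigtable-style data model behind such wide tables can be pictured as a sparse map from row key and column name to versioned values. The sketch below is a toy in-memory model under that assumption; the class, method names, and row keys are hypothetical and are not the HBase client API.

```python
from collections import defaultdict


class SparseTable:
    """Toy in-memory model of a wide-column table (not the HBase API)."""

    def __init__(self):
        # row key -> column name -> list of (timestamp, value), oldest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp):
        """Store a new version of a cell and keep versions ordered by timestamp."""
        versions = self._rows[row][column]
        versions.append((timestamp, value))
        versions.sort(key=lambda cell: cell[0])

    def get(self, row, column):
        """Return the newest value for a cell, or None if it was never written."""
        versions = self._rows.get(row, {}).get(column)
        return versions[-1][1] if versions else None


table = SparseTable()
table.put("user#1001", "info:name", "Ada", timestamp=1)
table.put("user#1001", "info:name", "Ada L.", timestamp=2)
table.put("user#1002", "stats:logins", 7, timestamp=1)

print(table.get("user#1001", "info:name"))
print(table.get("user#1002", "info:name"))
```

Only cells that were actually written take up space, which is why a table can have millions of possible columns without every row storing all of them.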

8. Hadoop Distributed File System (HDFS)

HDFS is the file system built for Hadoop, but it can also be used as a standalone distributed file system. It is Java-based, fault-tolerant, highly scalable and highly configurable.

Supported operating systems: Windows, Linux, and OS X.

Related link: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html

9. Hive

Apache Hive is a data warehouse for the Hadoop ecosystem. It allows users to query and manage big data using HiveQL, a language similar to SQL.

Supported operating system: independent of the operating system.

Related link: http://hive.apache.org

10. Hivemall

Hivemall is a collection of machine learning algorithms for Hive. It includes many highly scalable algorithms for classification, regression, recommendation, k-nearest neighbor, anomaly detection and feature hashing.

Supported operating system: independent of the operating system.

Related link: https://github.com/myui/hivemall

11. Mahout

According to the official website, the purpose of the Mahout project is to "create an environment for rapidly building scalable, high-performance machine learning applications." It includes many algorithms for data mining on Hadoop MapReduce, as well as some novel algorithms for Scala and Spark environments.

Supported operating system: independent of the operating system.

Related link: http://mahout.apache.org

12. MapReduce

As an integral part of Hadoop, the MapReduce programming model provides a way to process large distributed data sets. It was originally developed by Google, and it is now also used by several other big data tools covered in this article, including CouchDB, MongoDB and Riak.

Supported operating system: independent of the operating system.

Related link: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
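
As a rough illustration of the programming model (not of Hadoop's Java API), the classic word count can be sketched in plain Python as a map phase that emits key/value pairs, a shuffle that groups values by key, and a reduce phase that combines each group. All function names here are illustrative, and everything runs in a single process.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data tools", "big data on Hadoop"])
print(counts)
```

In a real cluster the map and reduce calls run on many machines in parallel; the sketch only shows how the data flows between the phases.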

13. Oozie

This workflow scheduling tool is specially designed to manage Hadoop tasks. It can trigger tasks according to time or data availability and integrate with MapReduce, Pig, Hive, Sqoop, and many other related tools.

Supported operating systems: Linux and OS X.

Related link: http://oozie.apache.org

14. Pig

Apache Pig is a platform for distributed big data analysis. It relies on a programming language called Pig Latin, which offers simplified parallel programming, optimization and extensibility.

Supported operating system: independent of the operating system.

Related link: http://pig.apache.org

15. Sqoop

Enterprises often need to transfer data between relational databases and Hadoop, and Sqoop is a tool that can accomplish this task. It can import data into Hive or HBase and export from Hadoop to a relational database management system (RDBMS).

Supported operating system: independent of the operating system.

Related link: http://sqoop.apache.org

16. Spark

Spark is a data processing engine positioned as an alternative to MapReduce. The project claims to be up to 100 times faster than MapReduce when running in memory and up to 10 times faster when running on disk. It can be used with Hadoop and Apache Mesos, or on its own.

Supported operating systems: Windows, Linux, and OS X.

Related link: http://spark.apache.org

17. Tez

Tez is built on Apache Hadoop YARN and describes itself as "an application framework that allows you to build a complex directed acyclic graph of tasks for processing data." It allows Hive and Pig to simplify complex jobs that would otherwise require multiple steps to complete.

Supported operating systems: Windows, Linux, and OS X.

Related link: http://tez.apache.org
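
The idea of a directed acyclic graph of tasks can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The task names below are hypothetical and this is not the Tez API; the sketch only shows how dependencies between tasks determine a valid execution order.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
# Task names are illustrative only.
dag = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "write_output": {"aggregate"},
}

# static_order() yields the tasks so that every dependency comes first.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

The two scan tasks have no dependency on each other, so a scheduler is free to run them in parallel before the join.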

18. ZooKeeper

This big data management tool describes itself as "a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services." It allows nodes in a Hadoop cluster to coordinate with each other.

Supported operating systems: Linux, Windows (for development environments only) and OS X (for development environments only).

Related link: http://zookeeper.apache.org

II. Big data analysis platforms and tools

19. Disco

Disco, originally developed by Nokia, is a distributed computing framework that, like Hadoop, is based on MapReduce. It includes a distributed file system and a database that supports billions of keys and values.

Supported operating systems: Linux and OS X.

Related link: http://discoproject.org

20. HPCC

HPCC is a big data platform offered as an alternative to Hadoop, and it promises to be very fast and scalable. In addition to the free community version, HPCC Systems also offers a paid enterprise version, fee-based modules, training, consulting and other services.

Supported operating system: Linux.

Related link: http://hpccsystems.com

21. Lumify

Lumify, owned by Altamira Technologies (known for its national security technology), is an open source big data integration, analysis and visualization platform. You can try the demo version on Try.Lumify.io to see how it works.

Supported operating system: Linux.

Related link: http://www.jboss.org/infinispan.html

22. Pandas

The Pandas project provides data structures and data analysis tools for the Python programming language. It allows organizations to use Python as an alternative to R for big data analysis projects.

Supported operating systems: Windows, Linux, and OS X.

Related link: http://pandas.pydata.org

23. Storm

Storm is now an Apache project that provides real-time processing of big data (unlike Hadoop, which only provides batch processing). Its users include Twitter, The Weather Channel, WebMD, Alibaba, Yelp, Yahoo Japan, Spotify, Groupon, Flipboard and many other companies.

Supported operating system: Linux.

Related link: https://storm.apache.org

III. Databases and data warehouses

24. Blazegraph

Formerly known as "Bigdata", Blazegraph is a highly scalable, high-performance database. It is available under both open source and commercial licenses.

Supported operating system: independent of the operating system.

Related link: http://www.systap.com/bigdata

25. Cassandra

This NoSQL database was originally developed by Facebook and is now used by more than 1,500 organizations, including Apple, CERN, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit and others. It can support very large clusters; for example, Apple's Cassandra deployment includes more than 75,000 nodes and holds more than 10 PB of data.

Supported operating system: independent of the operating system.

Related link: http://cassandra.apache.org

26. CouchDB

CouchDB describes itself as "a database that fully embraces the Internet". It stores data as JSON documents that can be queried from a web browser and processed with JavaScript. It is easy to use, highly available, and highly scalable across distributed networks.

Supported operating systems: Windows, Linux, OS X and Android.

Related link: http://couchdb.apache.org

27. FlockDB

FlockDB, developed by Twitter, is a very fast and scalable graph database that is good at storing social network data. Although it is still available for download, the open source version of the project has not been updated for some time.

Supported operating system: independent of the operating system.

Related link: https://github.com/twitter/flockdb

28. Hibari

This Erlang-based project describes itself as "a distributed ordered key-value storage system that ensures strong consistency". It was originally developed by Gemini Mobile Technologies and is now used by several telecom operators in Europe and Asia.

Supported operating system: independent of the operating system.

Related link: http://hibari.github.io/hibari-doc/

29. Hypertable

Hypertable is a Hadoop-compatible big data database that promises high performance. Its users include eBay, Baidu, Groupon, Yelp and many other Internet companies. Commercial support services are available.

Supported operating systems: Linux and OS X.

Related link: http://hypertable.org

30. Impala

Cloudera claims that the SQL-based Impala database is "the leading open source analysis database for Apache Hadoop". It can be downloaded as a stand-alone product and is also part of Cloudera's commercial big data products.

Supported operating systems: Linux and OS X.

Related link: http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html

31. InfoBright Community Edition

InfoBright is a column-oriented database designed for data analysis, with a high compression ratio. InfoBright.com provides fee-based products and support services based on the same code.

Supported operating systems: Windows and Linux.

Related link: http://www.infobright.org
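
One reason column-oriented storage tends to compress well is that values within a single column are often repetitive. The sketch below illustrates this with simple run-length encoding on hypothetical data; it is a generic illustration and makes no claim about InfoBright's actual storage format or compression algorithms.

```python
from itertools import groupby

# Hypothetical row-oriented data: (date, country, amount)
rows = [
    ("2024-01-01", "DE", 10),
    ("2024-01-01", "DE", 12),
    ("2024-01-01", "FR", 7),
    ("2024-01-02", "FR", 7),
]

# Column-oriented layout: one sequence per column instead of one tuple per row.
dates, countries, amounts = zip(*rows)


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


print(run_length_encode(dates))
print(run_length_encode(countries))
```

Stored row by row, the repeated dates and country codes are interleaved with other fields, so the same runs would not appear next to each other.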

32. MongoDB

MongoDB is an extremely popular NoSQL database that has been downloaded more than 10 million times. An enterprise version, support, training and related products and services are available on MongoDB.com.

Supported operating systems: Windows, Linux, OS X, and Solaris.

Related link: http://www.mongodb.org

33. Neo4j

Neo4j claims to be the "fastest and most scalable native graph database". It promises large-scale scalability, fast Cypher query performance and improved development efficiency. Users include eBay, Pitney Bowes, Wal-Mart, Lufthansa and CrunchBase.

Supported operating systems: Windows and Linux.

Related link: http://neo4j.org

34. OrientDB

This multi-model database combines some features of a graph database with some features of a document database. Fee-based support, training and consulting services are available.

Supported operating system: independent of the operating system.

Related link: http://www.orientdb.org/index.htm

35. Pivotal Greenplum Database

Pivotal claims that Greenplum is "the best enterprise analysis database of its kind", able to analyze huge amounts of data very quickly. It is part of the Pivotal Big Data Suite.

Supported operating systems: Windows, Linux, and OS X.

Related link: http://pivotal.io/big-data/pivotal-greenplum-database

36. Riak

Riak is "fully functional" and comes in two versions: KV is a distributed NoSQL database, and S2 provides cloud-oriented object storage. It is available in both open source and commercial versions, along with add-ons that support Spark, Redis and Solr.

Supported operating systems: Linux and OS X.

Related link: http://basho.com/riak-0-10-is-full-of-great-stuff/

37. Redis

Redis, now sponsored by Pivotal, is a key-value cache and storage system. Paid support is available. Note: although the project does not officially support Windows, Microsoft maintains a Windows port on GitHub.

Supported operating system: Linux.

Related link: http://redis.io

IV. Business Intelligence

38. Talend Open Studio

Talend's open source data integration software has been downloaded more than 2 million times. The company also develops paid tools for big data, cloud, data integration, application integration and master data management. Its users include American International Group (AIG), Comcast, eBay, General Electric, Samsung, Ticketmaster and Verizon.

Supported operating systems: Windows, Linux, and OS X.

Related link: http://www.talend.com/index.php

39. Jaspersoft

Jaspersoft provides flexible, embeddable business intelligence tools for a wide range of organizations, including Groupon, Crown Technologies, USDA, Ericsson, Time Warner Cable, Olympic Steel, the University of Nebraska and General Dynamics. In addition to the open source community version, it also offers a fee-based reporting version, an Amazon Web Services (AWS) version, a professional version and an enterprise version.

Supported operating system: independent of the operating system.

Related link: http://www.jaspersoft.com

40. Pentaho

Pentaho, owned by Hitachi Data Systems, provides a range of data integration and business analysis tools. Three community editions are available on the official website; visit Pentaho.com for information on the fee-based versions.

Supported operating systems: Windows, Linux, and OS X.

Related link: http://community.pentaho.com

41. SpagoBI

Spago, described as an "open source leader" by market analysts, provides business intelligence, middleware and quality assurance software, as well as a Java EE application development framework. The software is free and open source, and paid support, consulting, training and other services are also available.

Supported operating system: independent of the operating system.

Related link: http://www.spagoworld.org/xwiki/bin/view/SpagoWorld/

42. KNIME

KNIME, short for "Konstanz Information Miner", is an open source analysis and reporting platform. Several commercial and open source extensions are available to enhance its functionality.

Supported operating systems: Windows, Linux, and OS X.

Related link: http://www.knime.org

43. BIRT

BIRT stands for "Business Intelligence and Reporting Tools". It provides a platform for creating visualizations and reports that can be embedded in applications and websites. It is part of the Eclipse community and is supported by Actuate, IBM, and Innovent Solutions.

Supported operating system: independent of the operating system.

Related link: http://www.eclipse.org/birt/

V. Data Mining

44. DataMelt

As the successor of jHepWork, DataMelt can handle mathematical operations, data mining, statistical analysis and data visualization and other tasks. It supports Java and related programming languages, including Jython, Groovy, JRuby, and Beanshell.

Supported operating system: independent of the operating system.

Related link: http://jwork.org/dmelt/

45. KEEL

KEEL stands for "Knowledge Extraction based on Evolutionary Learning". It is a Java-based machine learning tool that provides algorithms for a range of big data tasks. It also helps evaluate how well algorithms perform on regression, classification, clustering, pattern mining, and similar tasks.

Supported operating system: independent of the operating system.

Related link: http://keel.es

46. Orange

Orange believes that data mining should be "fruitful and interesting", whether you have years of experience or are new to the field. It provides visual programming and Python scripting tools for data visualization and analysis.

Supported operating systems: Windows, Linux, and OS X.

Related link: http://orange.biolab.si

47. RapidMiner

RapidMiner claims to have more than 250,000 users, including PayPal, Deloitte, eBay, Cisco and Volkswagen. It offers a wide range of open source and paid versions, but be aware that the free open source version only supports data in CSV or Excel format.

Supported operating system: independent of the operating system.

Related link: https://rapidminer.com

48. Rattle

Rattle stands for "R Analytical Tool To Learn Easily". It provides a graphical interface for the R programming language that simplifies building statistical or visual summaries of data, building models, and performing data transformations.

Supported operating systems: Windows, Linux, and OS X.

Related link: http://rattle.togaware.com

49. SPMF

SPMF now includes 93 algorithms, which can be used for sequential pattern mining, association rule mining, itemset mining, sequential rule mining and clustering. It can be used independently or integrated into other Java-based programs.

Supported operating system: independent of the operating system.

Related link: http://www.philippe-fournier-viger.com/spmf/

50. Weka

The Waikato Environment for Knowledge Analysis (Weka) is a collection of Java-based machine learning algorithms for data mining. It can perform data preprocessing, classification, regression, clustering, association rule mining and visualization.

Supported operating systems: Windows, Linux, and OS X.

Related link: http://www.cs.waikato.ac.nz/~ml/weka/

VI. Query engine

51. Drill

This Apache project lets users run SQL-based queries against Hadoop, NoSQL databases and cloud storage services. It can be used for data mining and ad hoc queries, and it supports a wide range of data stores, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage and Swift.

Supported operating systems: Windows, Linux, and OS X.

Related link: http://drill.apache.org

VII. Programming language

52. R

R is similar to the S language and environment and is designed to handle statistical calculations and graphics. It includes an integrated set of big data tools for data processing, computing and visualization.

Supported operating systems: Windows, Linux, and OS X.

Related link: http://www.r-project.org

53. ECL

Enterprise Control Language (ECL) is the language that developers use to build big data applications on the HPCC platform. The official HPCC Systems website offers an integrated development environment (IDE), tutorials, and many related tools for working with the language.

Supported operating system: Linux.

Related link: http://hpccsystems.com/download/docs/ecl-language-reference

VIII. Big data search

54. Lucene

Java-based Lucene can perform full-text search very quickly. According to the official website, it can index more than 150 GB per hour on modern hardware, and it includes powerful and efficient search algorithms. Development is sponsored by the Apache Software Foundation.

Supported operating system: independent of the operating system.

Related link: http://lucene.apache.org/core/
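
Full-text search engines are typically built around an inverted index, which maps each term to the documents that contain it. The following is a minimal sketch of that idea in plain Python; it is not Lucene's API or file format, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase term to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = {
    1: "Hadoop distributed file system",
    2: "distributed search platform",
    3: "full-text search",
}
index = build_index(docs)
print(search(index, "distributed"))
print(search(index, "distributed search"))
```

Looking up a term is a dictionary access rather than a scan over every document, which is what keeps queries fast as the collection grows.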

55. Solr

Based on Apache Lucene, Solr is a highly reliable and highly scalable enterprise search platform. Well-known users include eHarmony, Sears, StubHub, Zappos, Best Buy, AT&T, Instagram, Netflix, Bloomberg and Travelocity.

Supported operating system: independent of the operating system.

Related link: http://lucene.apache.org/solr/

IX. In-memory technology

56. Ignite

This Apache project describes itself as "a high-performance, integrated, distributed in-memory platform that can be used to perform real-time computing and processing on large data sets, several orders of magnitude faster than traditional disk-based or flash technology." The platform includes a data grid, compute grid, service grid, streaming, Hadoop acceleration, advanced clustering, a file system, messaging, events, data structures and other features.

Supported operating system: independent of the operating system.

Related link: https://ignite.incubator.apache.org

57. Terracotta

Terracotta claims that its BigMemory technology is "one of the best in-memory data management platforms in the world", with 2.1 million developers and 250 enterprise organizations deploying its software. The company also provides commercial software, as well as support, consulting and training services.

Supported operating system: independent of the operating system.

Related link: http://www.terracotta.org

58. Pivotal GemFire/Geode

Earlier this year, Pivotal announced that it would open source code for key components of its big data suite, including GemFire's in-memory NoSQL database. It has submitted a proposal to the Apache Software Foundation to manage the core engine of the GemFire database under the name "Geode". A commercial version of the software is also available.

Supported operating systems: Windows and Linux.

Related link: http://pivotal.io/big-data/pivotal-gemfire

59. GridGain

Powered by Apache Ignite, GridGain provides in-memory data structures for rapid processing of big data, as well as a Hadoop accelerator based on the same technology. It is available in a paid enterprise version and a free community version, which includes free basic support.

Supported operating systems: Windows, Linux, and OS X.

Related link: http://www.gridgain.com

60. Infinispan

As a Red Hat JBoss project, Java-based Infinispan is a distributed in-memory data grid. It can be used as a cache, as a high-performance NoSQL database, or to add clustering capabilities to many frameworks.

Supported operating system: independent of the operating system.
