2025-02-24 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article answers the question "What are the Hadoop open source tools?" Many people struggle to choose among these tools in real projects, so the list below gives a short description, the supported operating systems and a link for each one.
I. Hadoop-related tools
1. Hadoop
Apache's Hadoop project has become almost synonymous with big data. It has grown into a complete ecosystem of open source tools for highly scalable distributed computing.
Supported operating systems: Windows, Linux, and OS X.
Related link: http://hadoop.apache.org
2. Ambari
As part of the Hadoop ecosystem, this Apache project provides an intuitive web-based interface for configuring, managing and monitoring Hadoop clusters. For developers who want to integrate Ambari's functions into their own applications, it also provides an API built on REST (Representational State Transfer).
Supported operating systems: Windows, Linux, and OS X.
Related link: http://ambari.apache.org
3. Avro
This Apache project provides a data serialization system with rich data structures and a compact format. Schemas are defined in JSON, which makes it easy to integrate with dynamic languages.
Supported operating system: independent of the operating system.
Related link: http://avro.apache.org
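To make the "schemas are defined in JSON" point concrete, here is a minimal sketch using only the Python standard library. The PageView record and its fields are hypothetical, and the conforms helper is a simplified stand-in for illustration, not the Avro library's own validation.

```python
import json

# Hypothetical record schema, written the way Avro schemas are: as JSON.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "PageView",
  "fields": [
    {"name": "url", "type": "string"},
    {"name": "user_id", "type": "long"},
    {"name": "duration_ms", "type": "int"}
  ]
}
"""

# Illustrative subset of Avro primitive type names mapped to Python types.
PRIMITIVES = {"string": str, "int": int, "long": int, "double": float}


def conforms(record, schema):
    """Return True if the dict has exactly the schema's fields with matching types."""
    fields = schema["fields"]
    if set(record) != {f["name"] for f in fields}:
        return False
    return all(isinstance(record[f["name"]], PRIMITIVES[f["type"]]) for f in fields)


schema = json.loads(SCHEMA_JSON)
ok = conforms({"url": "/home", "user_id": 42, "duration_ms": 1200}, schema)
bad = conforms({"url": "/home", "user_id": "42", "duration_ms": 1200}, schema)
print(ok, bad)
```

Because the schema is plain JSON, any language with a JSON parser can read it, which is the integration benefit the description refers to.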
4. Cascading
Cascading is an application development platform built on Hadoop. Commercial support and training services are available.
Supported operating system: independent of the operating system.
Related link: http://www.cascading.org/projects/cascading/
5. Chukwa
Chukwa is based on Hadoop and can collect data from large distributed systems for monitoring. It also contains tools for analyzing and displaying data.
Supported operating systems: Linux and OS X.
Related link: http://chukwa.apache.org
6. Flume
Flume can collect log data from other applications and send that data to Hadoop. According to the official website, it is robust and fault-tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.
Supported operating systems: Linux and OS X.
Related link: https://cwiki.apache.org/confluence/display/FLUME/Home
7. HBase
HBase is a distributed database designed for very large tables with billions of rows and millions of columns, and it provides random, real-time read/write access to big data. It is similar to Google's Bigtable, but built on Hadoop and the Hadoop Distributed File System (HDFS).
Supported operating system: independent of the operating system.
Related link: http://hbase.apache.org
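One way to picture how a table can have millions of possible columns is a sparse layout in which each row stores only the cells it actually has. The sketch below is a toy in-memory model under that assumption; the SparseTable class, the row keys and the "family:qualifier" column names are hypothetical and are not the HBase client API.

```python
from bisect import bisect_left, insort


class SparseTable:
    """Toy model: sorted row keys, each row holding only the cells it has."""

    def __init__(self):
        self._keys = []  # row keys kept sorted so range scans are cheap
        self._rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row, column, value):
        if row not in self._rows:
            insort(self._keys, row)
            self._rows[row] = {}
        self._rows[row][column] = value

    def get(self, row, column):
        # Random read of a single cell; a missing cell returns None.
        return self._rows.get(row, {}).get(column)

    def scan(self, start, stop):
        """Yield (row, cells) for start <= row < stop, in key order."""
        i = bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < stop:
            yield self._keys[i], self._rows[self._keys[i]]
            i += 1


table = SparseTable()
table.put("user#0002", "info:name", "Lin")
table.put("user#0001", "info:name", "Ana")
table.put("user#0001", "stats:logins", 7)
table.put("video#0001", "info:title", "Intro")
user_rows = [row for row, _ in table.scan("user#", "user#~")]
print(user_rows)
```

Rows that never set a column pay nothing for it, so the number of distinct columns can grow without widening every row.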
8. Hadoop Distributed File System (HDFS)
HDFS is the file system built for Hadoop, but it can also be used as a standalone distributed file system. It is written in Java and is fault-tolerant, highly scalable and highly configurable.
Supported operating systems: Windows, Linux, and OS X.
Related link: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
9. Hive
Apache Hive is a data warehouse for the Hadoop ecosystem. It allows users to query and manage big data using HiveQL, a language similar to SQL.
Supported operating system: independent of the operating system.
Related link: http://hive.apache.org
10. Hivemall
Hivemall is a collection of machine learning algorithms for Hive. It includes many highly scalable algorithms for classification, regression, recommendation, k-nearest neighbor, anomaly detection and feature hashing.
Supported operating system: independent of the operating system.
Related link: https://github.com/myui/hivemall
11. Mahout
According to the official website, the purpose of the Mahout project is to "create an environment for rapidly building scalable, high-performance machine learning applications." It includes many algorithms for data mining on Hadoop MapReduce, as well as some novel algorithms for Scala and Spark environments.
Supported operating system: independent of the operating system.
Related link: http://mahout.apache.org
12. MapReduce
As an integral part of Hadoop, the MapReduce programming model provides a way to deal with large distributed data sets. It was originally developed by Google, but now it is also used by several other big data tools introduced in this article, including CouchDB, MongoDB and Riak.
Supported operating system: independent of the operating system.
Related link: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
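The programming model itself is small enough to sketch in a single process. The word count below walks through the map, shuffle and reduce steps in plain Python; it is an illustration of the model only, with hypothetical function names, and it does not use the Hadoop MapReduce API or distribute any work.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big tools", "data tools"])
print(counts)
```

In a real cluster the map calls run in parallel on different blocks of input and the shuffle moves data between machines, but the contract between the three steps is the same.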
13. Oozie
This workflow scheduling tool is specially designed to manage Hadoop tasks. It can trigger tasks according to time or data availability and integrate with MapReduce, Pig, Hive, Sqoop, and many other related tools.
Supported operating systems: Linux and OS X.
Related link: http://oozie.apache.org
14. Pig
Apache Pig is a platform for distributed big data analysis. It relies on a programming language called Pig Latin and has the advantages of simplified parallel programming, optimization and extensibility.
Supported operating system: independent of the operating system.
Related link: http://pig.apache.org
15. Sqoop
Enterprises often need to transfer data between relational databases and Hadoop, and Sqoop is a tool that can accomplish this task. It can import data into Hive or HBase and export from Hadoop to a relational database management system (RDBMS).
Supported operating system: independent of the operating system.
Related link: http://sqoop.apache.org
16. Spark
Spark is a data processing engine positioned as an alternative to MapReduce. The project claims to be up to 100 times faster than MapReduce when working in memory and up to 10 times faster when working on disk. It can be used with Hadoop and Apache Mesos, or on its own.
Supported operating systems: Windows, Linux, and OS X.
Related link: http://spark.apache.org
17. Tez
Tez is built on Apache Hadoop YARN and describes itself as "an application framework that allows you to build a complex directed acyclic graph of tasks for processing data." It lets Hive and Pig simplify complex jobs that would otherwise take multiple steps to complete.
Supported operating systems: Windows, Linux, and OS X.
Related link: http://tez.apache.org
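The idea of running work as a directed acyclic graph of tasks can be sketched with the standard library's graphlib module (Python 3.9 or later). The task names below are hypothetical and the code only computes an execution order; it is not the Tez API.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "load_logs": set(),
    "load_users": set(),
    "clean_logs": {"load_logs"},
    "join": {"clean_logs", "load_users"},
    "aggregate": {"join"},
}


def run_in_stages(graph):
    """Return a list of stages; every task in a stage has all its dependencies met."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that could run in parallel
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = run_in_stages(dag)
print(stages)
```

Tasks that land in the same stage have no dependencies on each other, which is what lets a DAG engine run them together instead of as a chain of separate jobs.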
18. ZooKeeper
This big data management tool describes itself as "a centralized service for maintaining configuration information, naming, providing distributed synchronization and providing group services." It allows the nodes in a Hadoop cluster to coordinate with each other.
Supported operating systems: Linux, Windows (for development environments only) and OS X (for development environments only).
Related link: http://zookeeper.apache.org
II. Big data analysis platforms and tools
19. Disco
Disco, originally developed by Nokia, is a distributed computing framework that, like Hadoop, is based on MapReduce. It includes a distributed file system and a database that supports billions of keys and values.
Supported operating systems: Linux and OS X.
Related link: http://discoproject.org
20. HPCC
HPCC is a big data platform positioned as an alternative to Hadoop, and it promises to be very fast and scalable. In addition to the free community version, HPCC Systems offers a paid enterprise version, fee-based modules, training, consulting and other services.
Supported operating system: Linux.
Related link: http://hpccsystems.com
21. Lumify
Lumify is an open source big data integration, analysis and visualization platform owned by Altamira Technologies, a company known for its national security technology. You can try the demo version at Try.Lumify.io to see how it works.
Supported operating system: Linux.
Related link: Try.Lumify.io
22. Pandas
The Pandas project provides data structures and data analysis tools for the Python programming language. It lets organizations use Python as an alternative to R for big data analytics projects.
Supported operating systems: Windows, Linux, and OS X.
Related link: http://pandas.pydata.org
23. Storm
Storm is now an Apache project that provides real-time processing of big data (unlike Hadoop, which only provides batch processing). Its users include Twitter, The Weather Channel, WebMD, Alibaba, Yelp, Yahoo Japan, Spotify, Groupon, Flipboard and many other companies.
Supported operating system: Linux.
Related link: https://storm.apache.org
III. Databases / data warehouses
24. Blazegraph
Formerly known as "Bigdata", Blazegraph is a highly scalable, high-performance database. It is available under both open source and commercial licenses.
Supported operating system: independent of the operating system.
Related link: http://www.systap.com/bigdata
25. Cassandra
This NoSQL database was originally developed by Facebook and is now used by more than 1,500 organizations, including Apple, CERN, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit and others. It can support very large clusters; for example, Apple's Cassandra deployment includes more than 75,000 nodes and holds more than 10 PB of data.
Supported operating system: independent of the operating system.
Related link: http://cassandra.apache.org
26. CouchDB
CouchDB calls itself "a database that fully embraces the Internet". It stores data in JSON documents, which can be queried from a web browser and processed with JavaScript. It is easy to use and offers high availability and scalability across a distributed network.
Supported operating systems: Windows, Linux, OS X and Android.
Related link: http://couchdb.apache.org
27. FlockDB
FlockDB, developed by Twitter, is a very fast and scalable graph database that is good at storing social network data. Although it is still available for download, the open source version of the project has not been updated for some time.
Supported operating system: independent of the operating system.
Related link: https://github.com/twitter/flockdb
28. Hibari
This Erlang-based project calls itself "a distributed ordered key-value storage system that ensures strong consistency". It was originally developed by Gemini Mobile Technologies and is now used by several telecom operators in Europe and Asia.
Supported operating system: independent of the operating system.
Related link: http://hibari.github.io/hibari-doc/
29. Hypertable
Hypertable is a Hadoop-compatible big data database that promises high performance. Its users include eBay, Baidu, Groupon, Yelp and many other Internet companies. Commercial support services are available.
Supported operating systems: Linux and OS X.
Related link: http://hypertable.org
30. Impala
Cloudera claims that the SQL-based Impala database is "the leading open source analytic database for Apache Hadoop". It can be downloaded as a stand-alone product and is also part of Cloudera's commercial big data products.
Supported operating systems: Linux and OS X.
Related link: http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html
31. InfoBright Community Edition
InfoBright is a column-oriented database with a high compression ratio, designed for data analysis. InfoBright.com provides fee-based products and support services based on the same code.
Supported operating systems: Windows and Linux.
Related link: http://www.infobright.org
32. MongoDB
MongoDB is an extremely popular NoSQL database that has been downloaded more than 10 million times. An enterprise version, support, training and related products and services are available on MongoDB.com.
Supported operating systems: Windows, Linux, OS X, and Solaris.
Related link: http://www.mongodb.org
33. Neo4j
Neo4j claims to be the "fastest and most scalable native graph database". It promises large-scale scalability, fast Cypher query performance and improved development efficiency. Users include eBay, Pitney Bowes, Wal-Mart, Lufthansa and CrunchBase.
Supported operating systems: Windows and Linux.
Related link: http://neo4j.org
34. OrientDB
This multi-model database combines some features of a graph database with some features of a document database. Fee-based support, training and consulting services are available.
Supported operating system: independent of the operating system.
Related link: http://www.orientdb.org/index.htm
35. Pivotal Greenplum Database
Pivotal claims that Greenplum is "the best enterprise analysis database of its kind", able to analyze huge amounts of data very quickly. It is part of the Pivotal Big Data Suite.
Supported operating systems: Windows, Linux, and OS X.
Related link: http://pivotal.io/big-data/pivotal-greenplum-database
36. Riak
Riak bills itself as "fully functional" and comes in two versions: KV is a distributed NoSQL database, and S2 provides object storage for the cloud. Both open source and commercial versions are available, along with add-ons that support Spark, Redis and Solr.
Supported operating systems: Linux and OS X.
Related link: http://basho.com/riak-0-10-is-full-of-great-stuff/
37. Redis
Redis, now sponsored by Pivotal, is a key-value cache and store. Paid support is available. Note: although the project does not officially support Windows, Microsoft has a Windows fork on GitHub.
Supported operating system: Linux.
Related link: http://redis.io
IV. Business Intelligence
38. Talend Open Studio
Talend's open source data integration software has been downloaded more than 2 million times. The company also develops paid tools for big data, cloud, data integration, application integration and master data management. Its users include American International Group (AIG), Comcast, eBay, General Electric, Samsung, Ticketmaster and Verizon.
Supported operating systems: Windows, Linux, and OS X.
Related link: http://www.talend.com/index.php
39. Jaspersoft
Jaspersoft provides flexible, embeddable business intelligence tools to a wide range of organizations, including Groupon, Crown Technologies, USDA, Ericsson, Time Warner Cable, Olympic Steel, the University of Nebraska and General Dynamics. In addition to the open source community edition, it offers fee-based reporting, Amazon Web Services (AWS), professional and enterprise editions.
Supported operating system: independent of the operating system.
Related link: http://www.jaspersoft.com
40. Pentaho
Pentaho, owned by Hitachi Data Systems, provides a range of data integration and business analysis tools. Three community editions are available on the official website; visit Pentaho.com for information on the fee-based versions.
Supported operating systems: Windows, Linux, and OS X.
Related link: http://community.pentaho.com
41. SpagoBI
Spago, which market analysts have called an "open source leader", provides business intelligence, middleware and quality assurance software, as well as a Java EE application development framework. The software is free and open source, and paid support, consulting, training and other services are also available.
Supported operating system: independent of the operating system.
Related link: http://www.spagoworld.org/xwiki/bin/view/SpagoWorld/
42. KNIME
KNIME stands for Konstanz Information Miner. It is an open source analysis and reporting platform, and several commercial and open source extensions are available to enhance its functionality.
Supported operating systems: Windows, Linux, and OS X.
Related link: http://www.knime.org
43. BIRT
BIRT stands for Business Intelligence and Reporting Tools. It provides a platform for creating visualizations and reports that can be embedded in applications and websites. It is part of the Eclipse community and is supported by Actuate, IBM and Innovent Solutions.
Supported operating system: independent of the operating system.
Related link: http://www.eclipse.org/birt/
V. Data Mining
44. DataMelt
As the successor to jHepWork, DataMelt handles mathematical computation, data mining, statistical analysis and data visualization. It supports Java and related programming languages, including Jython, Groovy, JRuby and BeanShell.
Supported operating system: independent of the operating system.
Related link: http://jwork.org/dmelt/
45. KEEL
KEEL stands for Knowledge Extraction based on Evolutionary Learning. It is a Java-based machine learning tool that provides algorithms for a range of big data tasks. It also helps evaluate how well algorithms perform on regression, classification, clustering, pattern mining and similar tasks.
Supported operating system: independent of the operating system.
Related link: http://keel.es
46. Orange
Orange believes that data mining should be "fruitful and fun", whether you have years of experience or are new to the field. It provides visual programming and Python scripting tools for data visualization and analysis.
Supported operating systems: Windows, Linux, and OS X.
Related link: http://orange.biolab.si
47. RapidMiner
RapidMiner claims to have more than 250,000 users, including PayPal, Deloitte, eBay, Cisco and Volkswagen. It comes in a range of open source and paid versions, but be aware that the free open source version only supports data in CSV or Excel format.
Supported operating system: independent of the operating system.
Related link: https://rapidminer.com
48. Rattle
Rattle stands for "R Analytical Tool To Learn Easily". It provides a graphical interface for the R programming language that simplifies building statistical or visual summaries of data, building models and performing data transformations.
Supported operating systems: Windows, Linux, and OS X.
Related link: http://rattle.togaware.com
49. SPMF
SPMF now includes 93 algorithms, which can be used for sequential pattern mining, association rule mining, itemset mining, sequential rule mining and clustering. It can be used independently or integrated into other Java-based programs.
Supported operating system: independent of the operating system.
Related link: http://www.philippe-fournier-viger.com/spmf/
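Itemset mining, one of the task families listed above, can be illustrated in a few lines. The sketch below is a brute-force counter written for clarity; it is not one of SPMF's algorithms, and the frequent_itemsets function and the sample baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Count every itemset up to max_size and keep those that appear
    in at least min_support transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # dedupe and fix a canonical order
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
result = frequent_itemsets(baskets, min_support=2)
print(result)
```

Enumerating every combination grows exponentially with basket size, which is why dedicated libraries use pruning strategies instead of this direct count.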
50. Weka
The Waikato Environment for Knowledge Analysis (Weka) is a collection of Java-based machine learning algorithms for data mining. It supports data preprocessing, classification, regression, clustering, association rules and visualization.
Supported operating systems: Windows, Linux, and OS X.
Related link: http://www.cs.waikato.ac.nz/~ml/weka/
VI. Query engine
51. Drill
This Apache project lets users run SQL-based queries against Hadoop, NoSQL databases and cloud storage services. It can be used for data mining and ad hoc queries, and it supports a wide range of data sources, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage and Swift.
Supported operating systems: Windows, Linux, and OS X.
Related link: http://drill.apache.org
VII. Programming language
52. R
R is similar to the S language and environment and is designed to handle statistical calculations and graphics. It includes an integrated set of big data tools for data processing, computing and visualization.
Supported operating systems: Windows, Linux, and OS X.
Related link: http://www.r-project.org
53. ECL
Enterprise Control Language (ECL) is the language that developers use to build big data applications on the HPCC platform. The official HPCC Systems website offers an integrated development environment (IDE), tutorials and many related tools for working with the language.
Supported operating system: Linux.
Related link: http://hpccsystems.com/download/docs/ecl-language-reference
VIII. Big data search
54. Lucene
The Java-based Lucene library performs full-text search very quickly. According to the official website, it can index more than 150 GB per hour on modern hardware, and it contains powerful and efficient search algorithms. Development is sponsored by the Apache Software Foundation.
Supported operating system: independent of the operating system.
Related link: http://lucene.apache.org/core/
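Full-text search engines are commonly built around an inverted index, which maps each term to the documents that contain it. The sketch below shows that idea in plain Python under simplifying assumptions (whitespace tokenization, no ranking); the function names and sample documents are hypothetical, and this is not the Lucene API.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = {
    1: "Hadoop stores big data",
    2: "Lucene searches text quickly",
    3: "Solr builds on Lucene",
}
index = build_index(docs)
print(search(index, "lucene"), search(index, "lucene solr"))
```

A lookup touches only the posting sets for the query terms rather than every document, which is what keeps search fast as the collection grows.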
55. Solr
Based on Apache Lucene, Solr is a highly reliable and highly scalable enterprise search platform. Well-known users include eHarmony, Sears, StubHub, Zappos, Best Buy, AT&T, Instagram, Netflix, Bloomberg and Travelocity.
Supported operating system: independent of the operating system.
Related link: http://lucene.apache.org/solr/
IX. In-memory technology
56. Ignite
This Apache project calls itself "a high-performance, integrated, distributed in-memory platform that can be used to perform real-time computing and processing on large data sets, several orders of magnitude faster than traditional disk-based or flash technology." The platform includes a data grid, a compute grid, a service grid, streaming, Hadoop acceleration, advanced clustering, a file system, messaging, events, data structures and other features.
Supported operating system: independent of the operating system.
Related link: https://ignite.incubator.apache.org
57. Terracotta
Terracotta claims that its BigMemory technology is "one of the best in-memory data management platforms in the world", with 2.1 million developers and 250 enterprise organizations deploying its software. The company also provides commercial software, as well as support, consulting and training services.
Supported operating system: independent of the operating system.
Related link: http://www.terracotta.org
58. Pivotal GemFire/Geode
Earlier this year, Pivotal announced that it would open-source key components of its big data suite, including the GemFire in-memory NoSQL database. It has submitted a proposal to the Apache Software Foundation to manage the core engine of the GemFire database under the name "Geode". A commercial version of the software is also available.
Supported operating systems: Windows and Linux.
Related link: http://pivotal.io/big-data/pivotal-gemfire
59. GridGain
Powered by Apache Ignite, GridGain provides an in-memory data fabric for fast processing of big data, as well as a Hadoop accelerator based on the same technology. It comes in a paid enterprise version and a free community version, which includes free basic support.
Supported operating systems: Windows, Linux, and OS X.
Related link: http://www.gridgain.com
60. Infinispan
Infinispan is a Java-based distributed in-memory data grid and a Red Hat JBoss project. It can be used as a cache, as a high-performance NoSQL database, or to add clustering capabilities to many frameworks.
Supported operating system: independent of the operating system.
Related link: http://www.jboss.org/infinispan.html