[Big Data] 2015 Bossie Awards: The 20 Best Open Source Big Data Technologies


2015-10-10 | Zhang Xiaodong, Oriental Cloud Insight

InfoWorld has selected its 2015 open source tool winners in the fields of distributed data processing, streaming analytics, machine learning, and large-scale data analysis. Here is a brief introduction to each of the award-winning technologies.

1. Spark

Among Apache's big data projects, Spark is the most popular, and the deep involvement of heavyweight contributors such as IBM has kept its development moving very fast.

Spark's sweet spot is still machine learning. Since last year, the DataFrames API has replaced the SchemaRDD API; modeled on the data frames of R and Pandas, it makes data access much easier than the original RDD interface.

New developments in Spark include workflows for building repeatable machine learning pipelines, scalable and optimized support for various storage formats, simpler interfaces for accessing machine learning algorithms, and improved monitoring of cluster resources and task tracking.

In Spark 1.5, the Tungsten memory manager is enabled by default, delivering faster processing by fine-tuning the layout of data structures in memory. Finally, the new spark-packages.org site lists more than 100 third-party libraries that extend Spark with many useful features.
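
To make the DataFrames point concrete, here is a minimal PySpark sketch in the Spark 1.x style described above (the people.json file and its columns are hypothetical):

```python
# Minimal Spark 1.x DataFrames sketch; the input file and column
# names are hypothetical.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="bossie-demo")
sqlContext = SQLContext(sc)

# Read JSON straight into a DataFrame; the schema is inferred.
df = sqlContext.read.json("people.json")

# Column expressions replace hand-written RDD transformations.
df.filter(df.age > 21).groupBy("city").count().show()
```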

2. Storm

Storm is a distributed computation framework under the Apache umbrella, aimed mainly at real-time processing of streaming data. It is built around a low-latency interaction model for complex event processing requirements. Unlike Spark, Storm processes events one at a time rather than in micro-batches, and it needs less memory. In my experience, it has the advantage in stream processing, especially in scenarios where data must be processed quickly as it moves between two data sources.

Spark has stolen much of Storm's spotlight, but Spark is in fact unsuited to many stream processing scenarios. Storm is frequently paired with Apache Kafka.
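
Storm topologies are normally written in Java; purely as an illustration of the one-tuple-at-a-time model, here is a bolt sketched with the third-party streamparse Python wrapper (the class and field layout are hypothetical):

```python
# Hypothetical word-count bolt using streamparse, a third-party
# Python binding over Storm's multi-lang protocol.
from collections import Counter
from streamparse import Bolt

class WordCountBolt(Bolt):
    def initialize(self, conf, ctx):
        self.counts = Counter()

    def process(self, tup):
        # Each tuple is processed individually -- no micro-batching.
        word = tup.values[0]
        self.counts[word] += 1
        self.emit([word, self.counts[word]])
```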

3. H2O

H2O is a distributed, in-memory machine learning engine with an impressive array of algorithms. Earlier versions supported only R; version 3.0 added Python and Java, and H2O can also serve as a back-end execution engine for Spark.

The best way to use H2O is as a big-memory extension of your R environment. R does not operate on the large data set directly; instead it talks to the H2O cluster over a communication protocol such as the REST API, and H2O does the heavy lifting on the data.

Several useful R packages, such as ddply, have been wrapped so that you can break the memory limits of your local machine when working with large data sets. You can run H2O on EC2, on a Hadoop/YARN cluster, or in a Docker container. With Sparkling Water (Spark plus H2O) you can access Spark RDDs on the cluster in parallel after Spark has processed the data frame, and then hand the result to an H2O machine learning algorithm.
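
Here is a minimal sketch of the 3.0-era h2o-py client described above (the CSV file and column names are hypothetical):

```python
# Minimal h2o-py sketch: the Python client drives an H2O cluster over
# REST; the cluster, not the client, holds and processes the data.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()  # connect to (or start) a local H2O cluster

# Load data into the cluster's distributed memory.
frame = h2o.import_file("events.csv")

# Train a gradient boosting model server-side.
model = H2OGradientBoostingEstimator(ntrees=50)
model.train(x=["feature1", "feature2"], y="label", training_frame=frame)
print(model)
```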

4. Apex

Apex is an enterprise-grade platform for big data in motion, supporting both real-time stream processing and batch processing. It is a YARN-native, large-scale, scalable, fault-tolerant stream processing engine. It natively supports general event processing and guarantees data consistency (exactly-once, at-least-once, and at-most-once semantics).

The code, documentation, and architecture of Apex (which began as DataTorrent's commercial stream processing product) show a clean separation of application development from DevOps concerns: user code usually does not need to know that it is running in a stream processing cluster.

Malhar is a companion project providing more than 300 application templates that implement common business logic. Malhar's libraries significantly reduce the time needed to develop an Apex application and supply connectors and drivers for all kinds of storage, file systems, messaging systems, and databases; they can also be extended or customized to meet individual business requirements. All Malhar components are available under the Apache license.

5. Druid

Druid moved to a business-friendly Apache license in February this year. It is a hybrid, event-stream-based engine aimed at OLAP workloads. Initially it was used mainly for online data processing in the advertising market, where Druid lets users run arbitrary, interactive analyses over time series data. Key features include low-latency event ingestion, fast aggregation, and both approximate and exact computation.

At Druid's core is a custom data store that uses specialized node types to handle each part of the problem. Real-time analysis runs on real-time (JVM) nodes, and the data eventually lands on historical nodes responsible for older data. Broker nodes query both the real-time and historical nodes and return the complete picture of the events to the user. Tests show Druid ingesting 500,000 events per second, with peaks of one million per second. Druid is an ideal real-time platform for online advertising, network traffic, and other activity streams.
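
For a sense of the query side, here is a minimal sketch against a Druid broker's native JSON API (the broker URL, datasource, and metric names are hypothetical):

```python
# Hypothetical timeseries query against a Druid broker.
import requests

query = {
    "queryType": "timeseries",
    "dataSource": "ad_events",
    "granularity": "minute",
    "intervals": ["2015-10-01/2015-10-02"],
    "aggregations": [
        {"type": "count", "name": "events"},
        {"type": "doubleSum", "fieldName": "revenue", "name": "revenue"},
    ],
}

resp = requests.post("http://localhost:8082/druid/v2/", json=query)
for row in resp.json():
    print(row["timestamp"], row["result"])
```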

6. Flink

At Flink's core is an event-stream dataflow engine. Although superficially similar to Spark, Flink takes a different approach to in-memory processing. First, Flink was designed from the start as a stream processor; batch processing is just a special case of a stream with a beginning and an end, and Flink offers an API for each scenario: the DataSet API (batch) and the DataStream API. Developers from the MapReduce world should feel right at home with the DataSet API, and migrating applications to Flink is easy. In many ways Flink mirrors the simplicity and consistency that have made Spark popular. Like Spark, Flink is written in Scala.
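
To illustrate the streams-first model, here is a minimal PyFlink sketch; note that Flink's Python bindings arrived years after this article, so treat this as illustrative only, with invented collection contents:

```python
# Minimal PyFlink DataStream sketch; Python support postdates this
# article, so this is an illustration of the model, not 2015-era API.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A bounded collection is just a stream with a beginning and an end.
ds = env.from_collection(["spark", "flink", "storm", "flink"])
ds.map(lambda word: (word, 1)).print()

env.execute("word_pairs")
```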

7. Elasticsearch

Elasticsearch is a distributed document search server based on Apache Lucene. At its core, Elasticsearch indexes JSON documents in near real time, enabling fast full-text search. Combined with the open source Kibana dashboard tool, you can build an impressive data visualization front end.

Elasticsearch is easy to set up and to scale, automatically spreading shards onto new hardware as needed. Its query syntax isn't SQL, but as JSON it feels familiar, and most users won't interact with the data at that level anyway. Developers can use the native JSON-over-HTTP interface or one of the client libraries for common languages, including Ruby, Python, PHP, Perl, Java, and JavaScript.
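
A minimal index-and-search round trip with the official Python client looks like this (the index name and document fields are hypothetical):

```python
# Hypothetical index-and-search round trip using elasticsearch-py.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Index a JSON document; the mapping is inferred automatically.
es.index(index="logs", body={"message": "disk error on node-7", "level": "ERROR"})
es.indices.refresh(index="logs")  # force the near-real-time index to refresh

# A full-text query is itself just JSON.
result = es.search(index="logs", body={"query": {"match": {"message": "error"}}})
for hit in result["hits"]["hits"]:
    print(hit["_source"])
```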

8. SlamData

If you are looking for a user-friendly tool for understanding the data in the latest popular NoSQL databases, take a look at SlamData. SlamData lets you run nested queries over JSON data in familiar SQL syntax, with no transformation or syntax changes required.

One of its main features is its connectors. With connectors for MongoDB, HBase, Cassandra, and Apache Spark, SlamData integrates easily with most industry-standard external data sources, and data can be transformed and analyzed in place. You might ask, "wouldn't I be better off with a data lake or a data warehouse tool?" Keep in mind that this is the NoSQL domain.

9. Drill

Drill is a distributed system for interactive analysis of large data sets, inspired by Google's Dremel. Designed for low-latency analysis of nested data, Drill has a stated design goal of scaling flexibly to 10,000 servers while querying petabytes of data and trillions of records.

Nested data can be pulled from a variety of data sources (such as HDFS, HBase, Amazon S3, and Azure Blobs) and in a variety of formats (including JSON, Avro, and protocol buffers), and you do not need to declare a schema up front ("schema on read").

Drill uses an ANSI SQL:2003-based query language, so there is no learning curve for data engineers, and it lets you join data across multiple data sources (for example, joining HBase tables against log files in HDFS). Finally, Drill provides ODBC- and JDBC-based interfaces for hooking up your favorite BI tools.
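
A minimal sketch against Drill's REST query endpoint (the JSON file path is hypothetical; Drill infers its schema at read time):

```python
# Hypothetical schema-on-read SQL query via Drill's REST API.
import requests

payload = {
    "queryType": "SQL",
    "query": "SELECT level, COUNT(*) AS n "
             "FROM dfs.`/logs/events.json` GROUP BY level",
}
resp = requests.post("http://localhost:8047/query.json", json=payload)
for row in resp.json()["rows"]:
    print(row)
```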

10. HBase

HBase reached its 1.x milestone this year and continues to improve. Like other non-relational distributed data stores, HBase returns query results very quickly, which is why it is often used behind search engines, such as those at eBay, Brocade, and Yahoo. As a stable and mature product, HBase doesn't sprout brand-new features very often, but that stability is frequently what enterprises care about most.

Recent improvements include high-availability region servers, rolling upgrade support, and better YARN compatibility. Planned features include scanner updates that promise improved performance, and the ability to use HBase as the persistent store for streaming applications such as Storm and Spark. HBase also supports SQL queries through the Phoenix project, whose SQL compatibility is steadily improving; Phoenix recently added a Spark connector and support for custom functions.
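
A minimal put/get round trip through HBase's Thrift gateway, using the third-party happybase client (the table, row key, and column names are hypothetical):

```python
# Hypothetical HBase round trip via the Thrift gateway, using the
# third-party happybase client.
import happybase

conn = happybase.Connection("localhost")  # Thrift server, default port 9090
table = conn.table("metrics")

# Row keys, column families, and values are all raw bytes.
table.put(b"host7:cpu", {b"d:value": b"0.93", b"d:unit": b"ratio"})

row = table.row(b"host7:cpu")
print(row[b"d:value"])
```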

11. Hive

Having developed and matured over the years, Hive shipped its official 1.0 release this year and remains the workhorse of SQL-based data warehousing on Hadoop. The community is currently focused on improving performance, scalability, and SQL compatibility. The latest release, 1.2, brings significant improvements in ACID semantics, cross-data-center replication, and the cost-based optimizer.

Hive 1.2 also brings improved SQL compatibility, making it easier for organizations to migrate from existing data warehouses via ETL tools. Major improvements on the roadmap include LLAP's in-memory caching for speed, integration with Spark's machine learning library, and better SQL support for nested subqueries and intermediate types, among others.
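
Here is a minimal sketch of issuing HiveQL from Python through HiveServer2, using the third-party PyHive client (the table and columns are hypothetical):

```python
# Hypothetical HiveQL query against HiveServer2 via the third-party
# PyHive client.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()
cursor.execute("SELECT region, SUM(sales) FROM orders GROUP BY region")
for region, total in cursor.fetchall():
    print(region, total)
```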

12. Kylin

Kylin is an OLAP system developed at eBay for analyzing very large volumes of data. It uses standard SQL syntax, much like many familiar data analysis products. Kylin builds its cubes with Hive and MapReduce: Hive handles the pre-joins and MapReduce the pre-aggregations, HDFS stores the intermediate files produced while building cubes, HBase stores the finished cubes, and HBase coprocessors respond to queries.

Like most other analytical applications, Kylin supports multiple access methods, including JDBC, ODBC, and a REST API for programmatic access.
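
A minimal sketch against Kylin's REST API (the project and table come from Kylin's sample cube, and the default ADMIN/KYLIN credentials are an assumption about your install):

```python
# Hypothetical SQL query through Kylin's REST API.
import requests

resp = requests.post(
    "http://localhost:7070/kylin/api/query",
    json={
        "sql": "SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt",
        "project": "learn_kylin",
    },
    auth=("ADMIN", "KYLIN"),  # Kylin's out-of-the-box default account
)
for row in resp.json()["results"]:
    print(row)
```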

13. CDAP

CDAP (Cask Data Application Platform) is a framework running on top of Hadoop that abstracts away the complexity of building and running big data applications. CDAP revolves around two core concepts: data and applications. CDAP datasets are logical representations of data regardless of the underlying storage layer, and CDAP provides real-time stream processing capabilities.

Applications use CDAP services to handle concerns such as distributed transactions and service discovery, so that developers aren't drowned in Hadoop's low-level details. CDAP ships with a data ingestion framework, some prebuilt applications, and general-purpose "packs" for jobs such as ETL and website analytics, along with support for testing, debugging, and security. Like most projects that began as commercial (closed source) products, CDAP has good documentation, tutorials, and examples.

14. Ranger

Security has always been a sore spot for Hadoop. That doesn't mean (as is often reported) that Hadoop is "insecure" or "unsafe." The truth is that Hadoop has plenty of security features, though none of them are very strong; each component carries its own authentication and authorization implementation, unintegrated with the others.

In May 2014, Hortonworks acquired XA Secure, and after a rename we got Ranger. Ranger brings many of Hadoop's key components under one umbrella, letting you set a "policy" that ties your Hadoop security to your existing ACL-based Active Directory authentication and authorization. Ranger gives you one place to manage Hadoop access control, with a nice page for administration, auditing, and encryption.

15. Mesos

Mesos provides efficient resource isolation and sharing across distributed applications and frameworks, supporting Hadoop, MPI, Hypertable, Spark, and so on.

Mesos is an Apache open source project that uses ZooKeeper for fault-tolerant replicated coordination and Linux containers to isolate tasks, with support for allocating multiple resource types (memory and CPU). It provides Java, Python, and C++ APIs for developing new parallel applications, plus a web-based user interface for viewing cluster state.

Mesos applications (frameworks) negotiate for cluster resources through a two-level scheduling mechanism, so writing a Mesos application won't feel like a familiar experience to most programmers. Although Mesos is a young project, it is growing rapidly.

16. NiFi

Apache NiFi 0.2.0 has just been released; the project is still incubating at the Apache Foundation. Apache NiFi is an easy-to-use, powerful, and reliable system for processing and distributing data. Designed for data flow, it supports highly configurable directed graphs of data routing, transformation, and system mediation logic.

Apache NiFi was contributed to the Apache Foundation as open source by the National Security Agency (NSA). It is designed to automate the flow of data between systems. Built on a flow-based programming philosophy, NiFi is very easy to use, powerful, reliable, and highly configurable. Its two most important features are its powerful user interface and its data provenance tooling.

NiFi's user interface lets users understand and interact with data flows intuitively in the browser, allowing faster and safer iteration.

Its data provenance feature lets users see how an object has flowed between systems, replay it, and visualize what happened before and after key steps, including a large number of complex transformations such as schema conversions, forks, joins, and other operations.

In addition, NiFi uses a component-based extension model to quickly add capabilities to complex data flows. Out-of-the-box components handle file systems and protocols including FTP, SFTP, and HTTP, and HDFS is supported as well.

NiFi has been well received in the industry, drawing praise from the Hortonworks CEO, the Leverage CTO, and Prescient Edge's chief systems architect.

17. Kafka

In the big data world, Kafka has become the de facto standard for distributed publish-subscribe messaging. Its design lets brokers serve thousands of clients at high message throughput while maintaining durability through a distributed commit log. Because each partition's log is redundantly replicated across brokers, Kafka's data is well protected.

When consumers want to read messages, Kafka looks up their offset in the log and sends the messages from there. Because messages are not deleted immediately, adding consumers or replaying historical messages imposes little extra cost. Kafka has been benchmarked at 2 million messages per second. Although Kafka's version number is still below 1.0, it is in fact a mature, stable product running in some of the world's largest clusters.
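
A minimal publish/subscribe round trip with the third-party kafka-python client (the topic name is hypothetical):

```python
# Hypothetical publish/subscribe round trip using the third-party
# kafka-python client.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"page_view:/home")
producer.flush()

# Starting at the earliest offset replays retained history.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5s of silence
)
for msg in consumer:
    print(msg.offset, msg.value)
```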

18. OpenTSDB

OpenTSDB is a time series database built on HBase. It is designed to analyze data collected from applications, mobile devices, network equipment, and other hardware. It tailors the HBase schema for time series storage, aiming at fast aggregation and minimal storage footprint.

By using HBase as its underlying storage layer, OpenTSDB inherits HBase's distribution and reliability. Users do not interact with HBase directly; writes are handled by the Time Series Daemon (TSD), which can easily be scaled out for scenarios requiring high-speed ingestion. Prebuilt connectors publish data into OpenTSDB, and clients in Ruby, Python, and other languages can read data back out. OpenTSDB is not strong at interactive graphing, but it integrates with third-party tools. If you are already using HBase and want a simple way to store event data, OpenTSDB may be just right for you.
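
A minimal sketch of writing one data point through the TSD's HTTP API (the metric and tag names follow OpenTSDB's documentation examples and are otherwise hypothetical):

```python
# Hypothetical data-point write to a Time Series Daemon's HTTP API.
import time
import requests

point = {
    "metric": "sys.cpu.user",
    "timestamp": int(time.time()),
    "value": 42.5,
    "tags": {"host": "web01", "dc": "lga"},
}
resp = requests.post("http://localhost:4242/api/put", json=point)
resp.raise_for_status()  # the TSD replies 204 No Content on success
```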

19. Jupyter

Everyone's favorite notebook app has gone language-agnostic. Jupyter is the language-independent part of IPython spun out into its own package. Although Jupyter itself is written in Python, the system is modular, so you get the same interface as IPython for sharing code in notebooks and visualizing documents and data.

At least 50 language kernels are already supported, including Lisp, R, F#, Perl, Ruby, Scala, and more; in fact, IPython itself is now just a Python kernel for Jupyter. Communication with a language kernel happens over a REPL (read-eval-print loop) protocol, similar to nREPL or SLIME. It is nice to see such useful software receive significant funding from non-profit organizations for further development, such as parallel execution and multi-user notebook applications.
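
As a small illustration of the kernel model, this sketch lists the kernels a local Jupyter install knows about, using the jupyter_client package (which kernels appear depends entirely on what you have installed):

```python
# List the language kernels registered with a local Jupyter install;
# output depends on which kernels you have installed.
from jupyter_client.kernelspec import KernelSpecManager

for name, spec in KernelSpecManager().get_all_specs().items():
    print(name, "->", spec["spec"]["display_name"])
```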

20. Zeppelin

Zeppelin is an Apache incubator project: a web-based notebook for interactive data analytics. You can use SQL, Scala, and other languages to build data-driven, interactive, collaborative documents. Much like an IPython notebook, you write code, take notes, and share them directly in the browser.

Some basic charts are already built into Zeppelin, and visualization is not limited to Spark SQL queries: output from any back-end language can be recognized and visualized. Zeppelin also provides a URL that displays just the results, without Zeppelin's menus and buttons, so you can easily embed a notebook in your own site as an iframe.
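
Here is a Zeppelin paragraph sketch using the %pyspark interpreter (the table and column names are hypothetical); the z.show() helper hands the result to Zeppelin's built-in charting:

```python
%pyspark
# Hypothetical Zeppelin paragraph: query with Spark, then let
# Zeppelin's display system render the result as a chart.
df = sqlContext.table("page_views").groupBy("country").count()
z.show(df)
```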

Zeppelin is not yet mature. I wanted to put up a demo, but I couldn't find an easy way to disable the shell as an execution option (among other issues). Still, it already looks better visually than the IPython notebook app, and Apache Zeppelin (incubating) is Apache 2 licensed, 100% open source software.
