Let's walk into big data open source projects - Section 1

2025-04-12 Update From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/03

Recently, the hottest news in the big data field is that Pivotal has fulfilled the promise it made at the beginning of the year to open source its core big data products: GemFire, HAWQ and Greenplum DB. The news also made Pivotal very popular in the domestic technology community, because programmers can now see how a real enterprise data warehouse is designed and implemented.

At the same time, the open source community has many other excellent big data projects, covering distributed data storage and computing, data processing, data warehousing, machine learning and more. Let's take a look at some typical representatives.

When it comes to big data open source projects, the first to mention are, of course, the three sub-projects under Apache Hadoop: Apache HDFS, Apache MapReduce and Apache YARN. Together they can be regarded as the de facto international standard for big data processing and the cornerstone of the entire big data ecosystem.

Distributed storage

Distributed storage can be divided by storage model into file systems, KV storage, columnar storage, document storage and graph storage.

The distributed file system is the lowest layer of the whole distributed storage stack, and its ancestor is Google's famous GFS. Apache HDFS is an open source implementation of GFS and needs no further introduction. GlusterFS from Red Hat, a leading company in the Linux community, is also worth a look.

KV storage is the simplest storage model. Typical systems include Amazon DynamoDB, Memcached, Redis, BerkeleyDB and Google LevelDB.
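The KV model described above can be illustrated with a minimal in-memory sketch. The class name `TinyKVStore` and its methods are hypothetical and are not the API of any system listed here; real KV stores add persistence, replication and partitioning on top of this basic interface.

```python
from typing import Dict, Optional


class TinyKVStore:
    """Hypothetical in-memory KV store: an opaque key maps to an opaque value."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # Insert the key, or overwrite the value if the key already exists.
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # Return the stored value, or None when the key is absent.
        return self._data.get(key)

    def delete(self, key: str) -> bool:
        # Return True only if the key existed and was removed.
        return self._data.pop(key, None) is not None


store = TinyKVStore()
store.put("user:1", b"alice")
```

The store never interprets the value, which is what makes the model simple: every operation is addressed by key alone.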

Columnar storage is a direct extension of KV storage, in which the value corresponds to a column family or a column map. The most fundamental system of this kind is Apache HBase, the open source implementation of Bigtable, one of Google's early "troika" of foundational systems. Others include Apache Cassandra, Hypertable and Facebook HydraBase.
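To make "the value corresponds to a column family or column map" concrete, here is a rough sketch of the data model as nested maps: row key, then column family, then column qualifier. `TinyWideColumnTable` is a hypothetical name used only for illustration; it omits the timestamps, versioning and on-disk layout that systems such as HBase provide.

```python
from collections import defaultdict
from typing import Dict, Optional


class TinyWideColumnTable:
    """Hypothetical model: row key -> column family -> column qualifier -> value."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, Dict[str, str]]] = defaultdict(
            lambda: defaultdict(dict)
        )

    def put(self, row_key: str, family: str, qualifier: str, value: str) -> None:
        # Missing rows and families are created on first write.
        self._rows[row_key][family][qualifier] = value

    def get(self, row_key: str, family: str, qualifier: str) -> Optional[str]:
        # Use .get so that reads never create empty rows or families.
        row = self._rows.get(row_key)
        if row is None:
            return None
        return row.get(family, {}).get(qualifier)

    def get_family(self, row_key: str, family: str) -> Dict[str, str]:
        # Return a copy of the whole column map stored under one family.
        row = self._rows.get(row_key)
        if row is None:
            return {}
        return dict(row.get(family, {}))


table = TinyWideColumnTable()
table.put("user:1", "info", "name", "alice")
table.put("user:1", "info", "city", "Beijing")
table.put("user:1", "stats", "logins", "3")
```

Seen this way, a plain KV store is the special case where each row holds a single unnamed column.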

Document storage mainly includes MongoDB, Facebook Apollo and so on.

Most graph storage systems are based on Google's Pregel, and the main open source implementations are Apache Giraph, Apache Spark Bagel and Phoebus. In addition, Google has open sourced its own graph database, Cayley.

Distributed computing

Distributed computing is mainly reflected in a variety of computing frameworks and data processing models. Apache MapReduce is the most classic big data processing engine. Apache Spark, currently the most popular big data processing engine, can be an order of magnitude faster than MapReduce, and a whole ecosystem has been built around it: SQL, Streaming, Machine Learning and Graph. Other projects include Apache Storm, Apache Pig, Apache Tez, Apache S4, OpenMPI and so on.
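As a rough illustration of the MapReduce processing model mentioned above, the following single-process sketch counts words in three phases: map, shuffle and reduce. It does not use the Hadoop API; the function names are hypothetical, and a real framework would run the map and reduce phases in parallel across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    # Map: emit a (word, 1) pair for every word in one input record.
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Shuffle: group all emitted values by key.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    # Reduce: collapse the values of one key into a single result.
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "open source big data"])
```

Because each map call looks at one record and each reduce call looks at one key, both phases can be spread over a cluster without coordination beyond the shuffle.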

Distributed task scheduling

Systems for distributed task scheduling and cluster management mainly handle distributed task management, resource scheduling, cluster management and other basic tasks. They include Apache YARN, Apache Aurora, Apache Falcon, Apache Oozie, LinkedIn Azkaban, Apache Ambari, Apache Bigtop, Apache Mesos and so on.

SQL and SQL-like processing

This kind of system is the main form of the products Pivotal has open sourced. It essentially builds a SQL query engine on top of a distributed system, and covers traditional MPP SQL databases, SQL-on-Hadoop engines and SQL-like big data query systems. Examples include Greenplum DB, Apache Hive, Apache HAWQ, Cloudera Impala, Spark SQL, Apache Phoenix, Apache Drill, Shark, Facebook PrestoDB, CockroachDB and so on. Nowadays, more and more of these systems are moving to the cloud, including Amazon Redshift, Google BigQuery and Snowflake. Unfortunately, most of these cloud products are not open source, due to security concerns.
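One way to picture a SQL query engine built on a distributed system is a scatter-gather aggregation: each node aggregates its own partition, and a coordinator merges the partial results. The sketch below is a simplified, single-process illustration of that idea and not the execution plan of any engine named above; the `(region, amount)` schema and the function names are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical schema: each row is (region, amount).
Row = Tuple[str, int]

# Roughly equivalent to: SELECT region, SUM(amount) FROM sales GROUP BY region


def partial_sum(partition: List[Row]) -> Counter:
    # Runs on each node: aggregate only the rows stored locally.
    local: Counter = Counter()
    for region, amount in partition:
        local[region] += amount
    return local


def merge(partials: List[Counter]) -> Dict[str, int]:
    # Runs on the coordinator: add up the per-node partial sums.
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


partitions = [
    [("north", 10), ("south", 5)],
    [("north", 7)],
    [("south", 1), ("east", 4)],
]
result = merge([partial_sum(partition) for partition in partitions])
```

Only the small per-node summaries travel to the coordinator, which is why this pattern suits large tables spread over many machines.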

Distributed services and data processing (including various log processing)

The field of distributed services and data processing mainly covers the essential building blocks of distributed programming, such as data acquisition, log processing and message services. Representative projects include Apache ZooKeeper, Apache Flume, Apache Kafka, Apache Sqoop, Cloudera Morphlines, Facebook Scribe, Logstash, LinkedIn Gobblin and so on.

Services on top of distributed services

Built on distributed storage, computing, data processing and all kinds of basic components, new distributed applications keep emerging. Examples related to machine learning include Apache Mahout, Cloudera Oryx, Spark MLlib and MLbase; examples related to search include Apache Solr, ElasticSearch, HBase Coprocessor and Facebook Unicorn. With the support of these basic distributed components, building new distributed applications has become much more convenient.

That's all for this section. If you are interested, you can read my next article.
