What's the difference between Spark, HPCC and Hadoop?

This article introduces the differences between Spark, HPCC and Hadoop. Many people run into exactly this question in real projects, so let the editor walk you through how to think about it. I hope you read carefully and come away with something useful!

What's the difference between Spark, HPCC and Hadoop?

[1. Spark keeps intermediate data in memory, which makes iterative computation more efficient]

A major difference between MapReduce and Spark is that MapReduce is acyclic: data flows from a stable source, through processing, to a stable file system. Spark, by contrast, allows the same data to be revisited, so the dataflow forms a cycle when the job performs iterative computation.

Spark is better suited to machine learning (ML) and data mining (DM) workloads with many iterative operations, because Spark introduces the concept of the RDD. Resilient Distributed Datasets (RDDs) are an abstraction over the raw data, parts of which can be cached in memory for later reuse. This last point is important: by keeping the working set in RAM instead of on disk, Spark can accelerate MapReduce-style jobs by roughly 20x. RDDs are immutable and are created through parallel transformations such as map, filter, groupBy and reduce.

An RDD can be cached in memory, so the result of each operation on an RDD can stay in memory and feed directly into the next operation, saving the large amount of disk I/O that MapReduce incurs. For machine learning algorithms, where iterative computation is common, this improves efficiency greatly. However, since Spark is currently only a research project at UC Berkeley, and the largest deployment seen so far is only about 200 machines, it does not have Hadoop's deployment scale, so large-scale use should be considered carefully.
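To make the caching point concrete, here is a minimal Scala sketch against the classic RDD API. The input path, app name and update rule are hypothetical; the point is only that cache() lets each pass of the loop reuse in-memory data instead of re-reading the file from disk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeCacheSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-cache-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // cache() marks the parsed RDD so that, after the first action runs,
    // its partitions are served from memory rather than re-read from disk.
    val values = sc.textFile("hdfs:///data/points.txt")   // hypothetical input path
                   .map(_.trim.toDouble)
                   .cache()

    // A toy iterative loop: every pass works on the cached data, not the disk.
    var estimate = 0.0
    for (_ <- 1 to 10) {
      estimate += 0.5 * values.map(x => x - estimate).mean()
    }
    println(s"final estimate = $estimate")
    sc.stop()
  }
}
```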

[2. Spark is more general than Hadoop]

Spark provides many types of dataset operations, unlike Hadoop, which only provides Map and Reduce. For example: map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, partitionBy and other operations of this kind, which Spark calls Transformations. It also provides actions such as count, collect, reduce, lookup and save.
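A small spark-shell-style sketch of the transformation/action distinction may help; the log and output paths are made up, and the operations shown are only a representative subset of those listed above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ops-sketch").setMaster("local[*]"))

// Transformations are lazy: each returns a new RDD and nothing executes yet.
val words  = sc.textFile("hdfs:///logs/access.log")              // hypothetical input path
               .flatMap(_.split("\\s+"))
               .filter(_.nonEmpty)
val counts = words.map(w => (w, 1))
                  .reduceByKey(_ + _)                             // shuffles by key
val sampled = counts.sample(withReplacement = false, fraction = 0.01)

// Actions trigger the actual computation and return a result or write output.
println(counts.count())                                           // number of distinct words
counts.lookup("GET").foreach(println)                             // all values for a single key
counts.saveAsTextFile("hdfs:///out/wordcount")                    // materialize to storage
```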

These rich dataset operations are convenient for upper-layer users. The communication model between processing nodes is no longer restricted to Hadoop's single data-shuffle pattern: users can name and materialize intermediate results, control how they are partitioned, and so on. In short, the programming model is more flexible than Hadoop's.
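As a rough illustration of controlling the partitioning of intermediate results, the following sketch (reusing the sc from the previous snippet; the input files and the key layout are hypothetical) hash-partitions two pair RDDs with the same partitioner so that a later join on the shared key avoids an extra shuffle.

```scala
import org.apache.spark.HashPartitioner

// Assumes the `sc` from the previous sketch is still in scope.
val partitioner = new HashPartitioner(16)

val visits = sc.textFile("hdfs:///logs/visits.tsv")              // hypothetical (userId \t url) records
               .map(_.split("\t")).map(f => (f(0), f(1)))
               .partitionBy(partitioner)
               .persist()                                         // keep the partitioned intermediate result

val users  = sc.textFile("hdfs:///data/users.tsv")               // hypothetical (userId \t name) records
               .map(_.split("\t")).map(f => (f(0), f(1)))
               .partitionBy(partitioner)

// Both sides share the same partitioner, so the join does not re-shuffle `visits`.
val joined = visits.join(users)
println(joined.count())
```

Whether the persist() pays for itself depends on how many downstream operations actually reuse the partitioned RDD.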

However, the paper also notes that Spark is not suitable for applications with asynchronous, fine-grained state updates, such as storage for web services or incremental web crawlers and indexers. Nor does it suit the model of loading a large amount of data into memory only to modify it incrementally: once the incremental changes are applied, there is no iteration left to benefit from.

[3. Fault tolerance]

From Spark's paper, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", I cannot tell how well fault tolerance is actually handled. What the paper does say is that for distributed dataset computation there are two ways to checkpoint: checkpointing the data itself, or logging the updates that produced it. Spark appears to have adopted the latter, and the paper later argues that logging updates saves storage space. However, because the processing model is a DAG-like chain of operations, an error at one node in the graph, combined with the dependency complexity of the lineage chain, may force recomputation across all the upstream computing nodes, so the cost is not low. The authors then say it is up to the user to decide whether to store the data or store the update log, and when to checkpoint. That is tantamount to saying nothing and kicking the ball back to the user. So in my view the user has to choose the cheaper strategy according to the type of workload, the I/O and disk-space cost of storing the data, and the cost of recomputation.
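To show where the data-checkpoint option sits alongside lineage, here is a hedged sketch using the standard SparkContext.setCheckpointDir and RDD.checkpoint calls; the checkpoint directory, the toy update rule and the every-10-iterations policy are illustrative assumptions, not anything prescribed by the paper.

```scala
// Lineage-based recovery can force long recomputation chains, so Spark lets the
// user checkpoint an RDD to reliable storage, which truncates its lineage.
// Assumes the `sc` from the earlier sketch is still in scope.
sc.setCheckpointDir("hdfs:///spark/checkpoints")                  // hypothetical directory

var ranks = sc.parallelize(1 to 1000000).map(i => (i, 1.0))
for (iter <- 1 to 50) {
  ranks = ranks.mapValues(v => v * 0.85 + 0.15)                   // toy per-iteration update
  if (iter % 10 == 0) {
    ranks.checkpoint()                                            // mark this RDD for checkpointing
    ranks.count()                                                 // an action forces the checkpoint to run
  }
}
```

Checkpointing more often shortens the worst-case recovery path but costs more I/O, which is exactly the trade-off the paragraph above says is left to the user.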

[4. About the integration of Spark and Hadoop]

I don't know what the people at the Apache Foundation think, but I believe Spark should still be integrated into the Hadoop ecosystem. From the fact that Hadoop 0.23 turns MapReduce into just another library, you can see that Hadoop's goal is to support more parallel computing models beyond MapReduce, such as MPI and Spark. After all, Hadoop's single-node CPU utilization is not high, so this kind of iteration-intensive computation would complement the existing platform well. At the same time, it places higher demands on the resource-scheduling system. On that front, UC Berkeley also appears to be working on Mesos, which uses Linux containers to schedule Hadoop and other application frameworks in a unified way.

This concludes "What's the difference between Spark, HPCC and Hadoop?". Thank you for reading. If you want to learn more about the industry, follow this site; the editor will keep publishing practical, high-quality articles for you!
