How to compare Spark and MapReduce

2025-04-04 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article explains how Spark and MapReduce compare. The content is concise and easy to follow; I hope you get something out of the detailed introduction below.

Let's introduce Spark and MapReduce, and answer questions such as "What are the limitations of MapReduce compared with Spark?"

First, let's correct a common misunderstanding. Browsing the official Spark website, you will often see the following chart:

The chart claims that Spark runs as much as a hundred times faster than Hadoop (more precisely, faster than Hadoop's MapReduce compute engine). For many people, the first impression that "Spark is faster than MapReduce" comes from this chart, and some materials on the Internet copy the comparison verbatim, which seriously misleads beginners. In fact, the chart compares the running time of a machine learning algorithm (logistic regression) executed on Spark and on Hadoop. Does that mean Spark runs any type of task a hundred times faster under the same conditions? Obviously not. To use this comparison properly, we need to understand why it comes out this way.

So what lies at the core of most machine learning algorithms? Training a model means iterating over the same data set repeatedly, adjusting parameters on each pass, until a reasonably optimal model emerges. Spark, as a memory-based iterative compute engine, is well suited to this scenario: as introduced in the earlier article "Spark RDD detailed explanation", a data set can be loaded into memory on first access and read directly from memory on every subsequent access. MapReduce, by contrast, flushes intermediate results to disk, which makes it a poor fit for iterative workloads such as machine learning. (HDFS does offer a caching feature of its own; it may simply not have been configured well for the logistic regression benchmark, otherwise the performance gap would not be so large.)
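The effect described above can be sketched in plain Python (this is a toy illustration, not the real Spark or MapReduce API; the hypothetical `load_dataset` stands in for an expensive read from distributed storage):

```python
# Toy sketch of why caching helps iterative algorithms.
# A MapReduce-style job re-reads its input on every pass, while a
# Spark-style job loads the data once and keeps it in memory.

load_count = 0

def load_dataset():
    """Simulate an expensive read from distributed storage (e.g. HDFS)."""
    global load_count
    load_count += 1
    return [1.0, 2.0, 3.0, 4.0]

def train_without_cache(iterations):
    # MapReduce-style: every iteration re-reads the input from storage.
    w = 0.0
    for _ in range(iterations):
        data = load_dataset()
        grad = sum(w - x for x in data) / len(data)
        w -= 0.5 * grad
    return w

def train_with_cache(iterations):
    # Spark-style: load once (analogous to rdd.cache()), then iterate
    # over the in-memory copy.
    w = 0.0
    data = load_dataset()
    for _ in range(iterations):
        grad = sum(w - x for x in data) / len(data)
        w -= 0.5 * grad
    return w

load_count = 0
train_without_cache(10)
reads_no_cache = load_count   # 10 storage reads

load_count = 0
train_with_cache(10)
reads_cached = load_count     # 1 storage read

print(reads_no_cache, reads_cached)
```

Ten iterations cost ten storage reads without caching but only one with it; the gap grows with the iteration count, which is exactly the pattern the logistic regression benchmark exercises.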

Compared with MapReduce, why do we choose Spark? The author's summary follows.

Spark:

1. It unifies batch processing, stream processing, interactive query, machine learning, and graph computation in one engine.

2. Memory-based iterative computation makes it suitable for low-latency and iterative workloads.

3. RDDs and DataFrames can be cached and shared to improve efficiency (Spark SQL in particular can store data in memory in columnar form).

4. Intermediate results support checkpointing, allowing fast recovery when an error occurs.

5. Within a DAG, consecutive map-like stages can run in pipeline mode without flushing intermediate data to disk.

6. Multithreaded model: each worker node runs one or more executor services, each task runs as a thread inside an executor, and tasks can share resources.

7. Spark's programming model is more flexible: it supports multiple languages (Java, Scala, Python, R) and provides a rich set of transformation and action operators.

MapReduce:

1. Suited to offline batch processing; not suited to iterative computation, interactive processing, or stream processing.

2. Intermediate results must be written to disk, so heavy disk I/O and network I/O hurt performance.

3. Although MapReduce's intermediate results can be stored in HDFS and benefit from HDFS's caching feature, this is still less efficient than Spark's caching.

4. Multi-process model: task scheduling (frequently requesting and releasing resources) and process startup overhead are high, so it is not suited to low-latency jobs.

5. MR programming is not flexible: only map and reduce operations are supported. When the computation logic is complex, multiple MR jobs must be chained together, and each job's output must be persisted to disk before the next job can consume it, so heavy disk I/O inevitably degrades efficiency.
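The map/shuffle/reduce phases described in these points can be mirrored with a toy word count in plain Python (again a sketch, not the actual Hadoop API; the in-memory `shuffle_phase` dict stands in for intermediate results that a real job would spill to local disk and ship over the network):

```python
# Toy word count mirroring the MapReduce programming model:
# a map phase emitting (key, value) pairs, a shuffle grouping by key,
# and a reduce phase aggregating each group.

from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs, like a Mapper.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Group values by key. In real MapReduce this step round-trips
    # through local disk and the network -- the cost that Spark's
    # in-memory pipelining largely avoids.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word, like a Reducer.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["spark and mapreduce", "spark caches data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)
```

Note that everything here must be phrased as map and reduce steps; anything more complex (a join followed by an aggregation, say) requires chaining several such jobs, persisting each job's output before the next begins, which is precisely point 5 above.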

Although Spark has many advantages over MapReduce, that does not mean Spark can completely replace MapReduce today. The above is how Spark and MapReduce compare; I hope you picked up some knowledge or skills from it.

