
What's the difference between Spark and MR?

2025-01-24 Update From: SLTechnology News&Howtos shulou


Shulou(Shulou.com)06/02 Report--

This article looks at the question "what's the difference between Spark and MR?" The explanation is simple and practical, so let's walk through it together.

Let's start with the conclusion: Hadoop MapReduce uses the multi-process model, while Spark uses the multi-thread model.

Next, let's analyze the differences between the two models as well as their advantages and disadvantages:

Apache Spark's high performance depends in part on the asynchronous concurrency model it adopts on the server/driver side, the same approach taken by Hadoop 2.x (including YARN and MapReduce).

Hadoop 2.x implements an Actor-like asynchronous concurrency model by hand, using epoll plus a state machine, while Apache Spark directly used the open-source toolkit Akka, a high-performance implementation of the Actor model. (Newer Spark versions have since replaced Akka with Spark's own Netty-based RPC layer, but the asynchronous design is the same.)
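To make the "epoll plus state machine" idea concrete, here is a minimal, illustrative sketch in Python using the standard selectors module (which picks epoll on Linux): readable sockets are registered with a callback, and one dispatch round drains whatever is ready. Real Hadoop RPC servers and Akka actors are far more elaborate; the function name run_event_loop_once is our own.

```python
import selectors
import socket

def run_event_loop_once():
    """Minimal epoll-style event loop: register a readable socket with a
    callback, then dispatch one round of ready events."""
    sel = selectors.DefaultSelector()   # uses epoll on Linux when available
    left, right = socket.socketpair()
    received = []

    def on_readable(sock):
        received.append(sock.recv(1024))

    # The callback is stored as the key's `data`, a common event-loop idiom.
    sel.register(right, selectors.EVENT_READ, on_readable)
    left.sendall(b"ping")               # make `right` readable
    for key, _events in sel.select(timeout=1):
        key.data(key.fileobj)           # dispatch to the registered callback
    sel.close()
    left.close()
    right.close()
    return received

print(run_event_loop_once())
```

The point of the pattern is that one thread multiplexes many connections; the per-connection "state machine" lives in the callbacks rather than in blocked threads.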

Although the two adopt the same concurrency model on the server side, they use different parallel mechanisms at the task level (especially Spark tasks and MapReduce tasks): Hadoop MapReduce uses the multi-process model, while Spark uses the multi-thread model.

Note that multi-process and multi-threading in this article refer to the running mode of multiple tasks on the same node. Both MapReduce and Spark are multiprocess as a whole: MapReduce applications are made up of multiple independent Task processes; Spark applications run in an environment that consists of temporary resource pools built by multiple independent Executor processes.

The multi-process model makes it easy to control each task's resources at a fine granularity, but processes are slow to start, which makes the model a poor fit for low-latency jobs and is one of the points for which MapReduce is widely criticized. The multi-threaded model, by contrast, makes Spark well suited to low-latency jobs. In summary, tasks on the same Spark node run as multiple threads inside one JVM process, which brings the following benefits:

1) Tasks start quickly. By contrast, starting a MapReduce Task process is slow, typically taking on the order of a second.

2) All tasks on the same node run in one process and can therefore share memory. This is ideal for memory-intensive workloads, especially applications that need to load large dictionaries.

3) All tasks on the same node run in one JVM process (the Executor), and the resources held by the Executor can be reused by successive batches of tasks rather than being released as soon as a few tasks finish. This avoids the overhead of re-requesting resources for every task, and for applications with very many tasks it can greatly shorten the running time. In MapReduce, by contrast, each Task requests its resources separately and releases them immediately after use, so they cannot be reused by other tasks. You can enable JVM reuse by setting mapred.job.reuse.jvm.num.tasks to a value greater than 1 (or -1 for unlimited reuse), but this feature has a drawback: a reused JVM occupies its task slot until the whole job completes. In an "unbalanced" job where some Reduce Task takes much longer than the others, the reserved slots sit idle and cannot be used by other jobs until all tasks have finished.
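The startup-cost gap in point 1 can be felt even outside Hadoop and Spark. Below is a rough Python microbenchmark (illustrative only; real cluster task launch also includes scheduling and resource negotiation, so the absolute numbers mean nothing) comparing the cost of starting a thread versus a full OS process:

```python
import multiprocessing
import threading
import time

def noop():
    pass

# Use the "fork" start method so process creation is as cheap as the OS
# allows (Linux/macOS); "spawn" would inflate process startup even further.
FORK = multiprocessing.get_context("fork")

def avg_startup_seconds(factory, n=10):
    """Average wall-clock time to start and join n no-op workers."""
    start = time.perf_counter()
    for _ in range(n):
        w = factory(target=noop)
        w.start()
        w.join()
    return (time.perf_counter() - start) / n

thread_cost = avg_startup_seconds(threading.Thread)
process_cost = avg_startup_seconds(FORK.Process)
print(f"thread: {thread_cost * 1e3:.3f} ms  process: {process_cost * 1e3:.3f} ms")
```

Threads typically come in one or two orders of magnitude cheaper than processes here; a JVM-based Task process is heavier still, which is where MapReduce's roughly one-second task launch comes from.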

Although Spark's multi-threaded model brings many benefits, it also has some shortcomings, including:

1) Because all tasks on the same node run in the same process, they can contend seriously for resources, and it is difficult to control each task's resource consumption at a fine granularity. MapReduce, by contrast, lets users set resources separately for Map Tasks and Reduce Tasks, giving fine-grained control over how much each task occupies, which helps large jobs run smoothly.
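As a concrete illustration of that per-task-type control, here is a hedged sketch of a mapred-site.xml fragment using the MRv2 property names (the values are placeholders, not recommendations), together with the MRv1-era JVM-reuse knob mentioned earlier, which is not available under YARN:

```xml
<!-- mapred-site.xml: per-task-type resource limits (illustrative values) -->
<configuration>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1024</value>    <!-- container memory for each Map Task -->
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>2048</value>    <!-- container memory for each Reduce Task -->
  </property>
  <!-- MRv1-only JVM-reuse setting: -1 means unlimited reuse -->
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
  </property>
</configuration>
```

Spark, by comparison, sizes resources at the Executor level (for example via spark.executor.memory), and all threads inside an Executor share that single allocation, which is exactly the contention problem described above.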

At this point, I believe you have a deeper understanding of "what's the difference between Spark and MR". You might as well try it out in practice yourself.


© 2024 shulou.com SLNews company. All rights reserved.
