What are the three common misunderstandings about Apache Spark? This article addresses that question with a detailed analysis, in the hope of helping readers who share these misconceptions find a simple, practical way to clear them up.
Three common misunderstandings of Apache Spark
Misunderstanding 1: Spark is an in-memory technology
The biggest misunderstanding about Spark is that it is an in-memory technology. It is not! No Spark developer has ever officially claimed this; it is a misunderstanding of how Spark actually computes.
Let's start from the beginning. What kind of technology deserves to be called an in-memory technology? In my view, it is one that lets you persist data in RAM and process it there efficiently. Spark, however, has no option to persist data in RAM on its own. As we all know, data can be stored in systems such as HDFS, Tachyon, HBase, or Cassandra, but Spark itself has no native persistence code, whether for disk or for memory. All it can do is cache data, and caching is not persistence: cached data can be evicted at any time and recomputed later when needed.
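Here is a minimal sketch (in Scala, against the standard Spark RDD API) of that distinction: cache()/persist() is only a hint that blocks may be kept in memory, while durable persistence has to go through an external system. The file paths and application name below are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheIsNotPersistence {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-is-not-persistence")
      .master("local[*]")                    // assumption: local demo run
      .getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.textFile("hdfs:///data/input.txt")  // hypothetical path
      .map(_.toUpperCase)

    rdd.persist(StorageLevel.MEMORY_ONLY)  // a hint, not durable storage
    rdd.count()                            // first scan materializes the cache

    // Under memory pressure, cached blocks are simply dropped; the
    // map(_.toUpperCase) step is recomputed from lineage on the next action.
    rdd.unpersist()

    // Durable persistence must go through an external system instead:
    rdd.saveAsTextFile("hdfs:///data/output")        // hypothetical path

    spark.stop()
  }
}
```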
Even knowing this, some people still call Spark an in-memory technology because Spark processes data in memory. That much is true, of course, because there is no other way to process data: operating system APIs only let you load data from a block device into memory and store computed results back to the block device. We cannot compute directly on an HDD, so essentially all processing in modern systems happens in memory.
Spark does let us use an in-memory cache with LRU eviction, but how do you think current RDBMS systems such as Oracle and PostgreSQL handle data? They use a shared memory segment as a storage pool for table pages; all reads and writes go through this pool, and it too is managed with LRU eviction. All modern databases can satisfy most workloads through LRU policies, yet we do not call Oracle or PostgreSQL memory-based solutions. Consider Linux IO as well: all IO operations go through an LRU page cache.
Do you still think Spark does everything in memory? You may be disappointed. Take the core of Spark, the shuffle, which writes data to disk. If you use a groupBy statement in Spark SQL, or convert an RDD to a PairRDD and run an aggregation on it, you force Spark to redistribute the data across partitions based on the hash of each key. The shuffle has two stages: map and reduce. A map task computes the hash of each key and writes the records into separate files on the local file system, usually one file per reduce-side partition; a reduce task then pulls this data from the map side and merges it into its new partition. So if your RDD has M partitions and you convert it into a PairRDD with N partitions, the classic hash-based shuffle creates M x N files! There are optimization strategies that reduce the number of files created, but none of them change the fact that every shuffle operation writes data to disk.
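A short sketch of such a shuffle, using reduceByKey on a PairRDD (an aggregation like the groupBy mentioned above behaves the same way); the data and partition counts here are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object ShuffleTouchesDisk {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder()
      .appName("shuffle-touches-disk")
      .master("local[*]")
      .getOrCreate()
      .sparkContext

    // M = 4 map-side partitions
    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 4)

    // Converting to a PairRDD and aggregating triggers a shuffle:
    val counts = words
      .map(w => (w, 1))
      .reduceByKey(_ + _, numPartitions = 3)  // N = 3 reduce-side partitions

    // The shuffle files are written under spark.local.dir on each
    // executor's local disk, not held in memory.
    counts.collect().foreach(println)
  }
}
```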
So the conclusion is: Spark is not an in-memory technology! It is a technology that makes effective use of in-memory LRU caching.
Misunderstanding 2: Spark is 10x-100x faster than Hadoop
The benchmark chart usually cited here compares the running time of a logistic regression (LogisticRegression) machine learning algorithm on Spark and on Hadoop, and it shows Spark running hundreds of times faster than Hadoop! But is that really the case? What is the core of most machine learning algorithms? It is running the same iteration over the same data set again and again, and this is exactly where Spark's LRU cache shines: when you scan the same dataset repeatedly, you load it into memory only on the first pass, and every later pass reads it straight from memory. That capability is genuinely great! Unfortunately, the published comparison very likely did not use HDFS's caching capability when running logistic regression on Hadoop, taking an extreme case. With HDFS caching enabled, the Hadoop run would probably be only about 3x-4x slower than Spark, not the hundredfold gap the chart suggests.
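To make the caching point concrete, here is a rough sketch of an iterative job that benefits from the LRU cache. The tiny gradient-descent loop is a stand-in for illustration, not MLlib's actual logistic regression implementation, and the data is synthetic.

```scala
import org.apache.spark.sql.SparkSession

object IterativeCaching {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder()
      .appName("iterative-caching")
      .master("local[*]")
      .getOrCreate()
      .sparkContext

    // (label, feature) pairs; a real job would load these from HDFS.
    val points = sc.parallelize(Seq((1.0, 2.0), (0.0, -1.0), (1.0, 3.0)))
    points.cache()  // loaded into memory on the first pass

    var w = 0.0
    for (_ <- 1 to 100) {
      // Each iteration re-scans `points`; with the cache it reads from RAM,
      // without it every pass would re-read the data from storage.
      val gradient = points
        .map { case (label, x) => (1.0 / (1.0 + math.exp(-w * x)) - label) * x }
        .reduce(_ + _)
      w -= 0.1 * gradient
    }
    println(s"w = $w")
  }
}
```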
As a rule of thumb, benchmark reports produced by vendors are generally unreliable! Independent third-party benchmarks such as TPC-H are more trustworthy, because they cover most realistic scenarios in order to show representative results.
Generally speaking, the main reasons why Spark runs faster than MapReduce are as follows:
Faster task startup: Spark launches tasks as threads inside a running executor, while MR starts a new process for each task.
Faster shuffles: Spark writes data to disk only during shuffles, while MR also persists data to disk at every other step.
Faster workflows: a typical MR workflow consists of many MR jobs whose intermediate data must be persisted to disk between jobs, while Spark supports DAGs and pipelining, so stages that do not shuffle never need to write data to disk.
Caching: HDFS now supports caching too, but in general Spark's caching is more efficient, especially in Spark SQL, where data can be cached in memory in columnar form (see the sketch after this list).
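As promised above, a brief sketch of the last two points: narrow transformations are pipelined through a single stage without touching disk, and DataFrame.cache() stores data in Spark SQL's compressed in-memory columnar format. The data and column names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object PipelineAndColumnarCache {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pipeline-and-columnar-cache")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = spark.range(1000000).toDF("id")

    // Narrow transformations (filter + select) are pipelined into one stage:
    // each row flows through both operators without being written anywhere.
    val narrow = df.filter($"id" % 2 === 0).select($"id" * 10 as "scaled")

    // DataFrame.cache() stores the data as compressed in-memory columns,
    // which is what makes repeated Spark SQL scans cheap.
    narrow.cache()
    narrow.count()   // materializes the columnar cache
    narrow.count()   // served from memory

    spark.stop()
  }
}
```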
All of these give Spark better performance than Hadoop: it can indeed be up to 100x faster on short jobs, but in real production environments it is typically only about 2.5x-3x faster!
Misunderstanding 3: Spark introduces brand-new technology for data processing
In fact, Spark introduces no revolutionary new technology! The LRU caching and pipelined data processing it is good at have existed in MPP databases for a long time. Spark's important step was to implement them in open source, so that enterprises can use them for free; most enterprises will naturally choose open-source Spark over paid MPP technology.
That concludes the discussion of the three common misunderstandings about Apache Spark. I hope the content above is of some help; if you still have questions, follow the industry information channel for more related knowledge.