As a core engine for data processing, Spark holds an important place in the big data stack, which raises the question: can Spark replace Hadoop?
Spark is only a distributed computing platform, while Hadoop is a whole ecosystem covering distributed computing, storage, and resource management.
The component of Hadoop that actually corresponds to Spark is MapReduce. Spark can replace MapReduce as the computation engine within the Hadoop stack. Why, then, is MapReduce still in use? Because many existing applications still depend on it: it is no longer a standalone piece but an irreplaceable foundation for other projects in the ecosystem, such as Pig and Hive.
As for the advantages of Spark over Hadoop MapReduce, there are the following points:
(1) Task-scheduling overhead
Traditional MapReduce systems such as Hadoop were designed to run batch jobs lasting hours, and in extreme cases even the latency of submitting a task is very high.
Spark launches tasks through the event-driven Akka library, which avoids the overhead of starting and switching processes or threads for every task.
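As a rough illustration of how cheap Spark's per-task dispatch is, the sketch below (a hypothetical micro-benchmark, not from the original article; all names and numbers are illustrative) launches thousands of tiny tasks on a local master. Because tasks go to already-running executor threads rather than fresh processes, the whole run typically finishes in seconds.

```scala
// Hypothetical micro-benchmark: measure how quickly Spark can schedule
// many tiny tasks. All names and numbers here are illustrative.
import org.apache.spark.{SparkConf, SparkContext}

object TaskLaunchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("task-launch-sketch").setMaster("local[4]")
    val sc = new SparkContext(conf)

    val start = System.nanoTime()
    // 10,000 partitions => 10,000 tasks, each doing almost no work.
    sc.parallelize(1 to 10000, numSlices = 10000).count()
    val elapsedMs = (System.nanoTime() - start) / 1e6

    println(f"Scheduled and ran 10000 tasks in $elapsedMs%.0f ms")
    sc.stop()
  }
}
```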
(2) Data format and memory layout
MapReduce's schema-on-read processing model incurs a large processing overhead. Spark instead abstracts a distributed in-memory storage structure, the resilient distributed dataset (RDD), for data storage. RDDs support only coarse-grained write operations, but reads can be precise down to an individual record, which lets an RDD serve as a distributed index. A distinctive feature of Spark is that it controls how data is partitioned across nodes, and users can supply custom partitioning policies, such as hash partitioning. Spark SQL (and its predecessor Shark) builds columnar storage and columnar compression on top of Spark.
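To make the partitioning point concrete, here is a minimal sketch (assuming a local Spark setup; the pair data is invented for illustration) that repartitions a pair RDD with Spark's built-in HashPartitioner:

```scala
// Minimal sketch: control how a pair RDD is partitioned across nodes
// using the built-in HashPartitioner. Data and names are illustrative.
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitioningSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partitioning-sketch").setMaster("local[4]")
    val sc = new SparkContext(conf)

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

    // Hash-partition by key into 4 partitions; records with the same key
    // land in the same partition, so later joins and aggregations on that
    // key avoid an extra shuffle.
    val partitioned = pairs.partitionBy(new HashPartitioner(4))

    println(s"Number of partitions: ${partitioned.partitions.length}")
    sc.stop()
  }
}
```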
But distributed computing is only one part of Hadoop, so comparing Hadoop with Spark is really a comparison between MapReduce and Spark:
1. Faster
Spark keeps intermediate results in memory and schedules a whole job as a DAG of stages, while MapReduce writes intermediate results to disk between every map and reduce phase, so Spark is typically much faster for iterative and interactive workloads.
2. Easier to use
You are not forced into the map+reduce programming model, and configuration is very simple. Besides Java, it also supports Scala, Python, and R. Scala in particular is well suited to writing data analysis programs, whereas writing MapReduce jobs in Java is very cumbersome; the word-count sketch after this paragraph shows the difference.
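For comparison, a complete word count in Spark's Scala API fits in a few lines (a minimal sketch assuming a local master; the input path is a hypothetical placeholder), while the equivalent MapReduce job needs a mapper class, a reducer class, and a driver:

```scala
// Minimal word-count sketch in Spark's Scala API. The input path is
// hypothetical; point it at any text file to try it.
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wordcount-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("file:///tmp/input.txt")   // hypothetical path
      .flatMap(_.split("\\s+"))                         // split lines into words
      .map(word => (word, 1))                           // pair each word with 1
      .reduceByKey(_ + _)                               // sum counts per word

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```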
3. Very useful libraries
Spark ships with built-in libraries that all share the same engine: Spark SQL for structured data, Spark Streaming, MLlib for machine learning, and GraphX for graph processing.
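As a small taste of Spark SQL (a minimal sketch; the toy data is invented for illustration):

```scala
// Minimal Spark SQL sketch: build a DataFrame from toy data and query it.
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy data, invented for illustration.
    val people = Seq(("Alice", 34), ("Bob", 29), ("Carol", 41)).toDF("name", "age")

    people.filter($"age" > 30).show()   // declarative filtering, no MapReduce code
    spark.stop()
  }
}
```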
4. Easy to run
Spark can also run separately from Hadoop; for example, it can read data from a database or from local files. But in the big data era most people combine Spark and Hadoop, running Spark on the cluster through Mesos or YARN and mainly using Hadoop's HDFS for storage; components layered on top of HDFS, such as HBase and Hive, are supported by Spark as well. The sketch below shows both modes.
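The following sketch contrasts the two modes (the paths, the namenode address, and the submit command are hypothetical placeholders):

```scala
// Sketch: Spark reading from a local file versus from HDFS.
// Paths and the namenode address are hypothetical placeholders.
import org.apache.spark.{SparkConf, SparkContext}

object StandaloneVsHadoopSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("standalone-vs-hadoop").setMaster("local[*]"))

    // Works without any Hadoop cluster at all:
    val localLines = sc.textFile("file:///tmp/local-input.txt")

    // Works against a Hadoop cluster's HDFS:
    val hdfsLines = sc.textFile("hdfs://namenode:8020/data/input.txt")

    println(s"local: ${localLines.count()}, hdfs: ${hdfsLines.count()}")
    sc.stop()

    // To run on a Hadoop cluster via YARN instead of locally, submit with:
    //   spark-submit --master yarn --deploy-mode cluster app.jar
  }
}
```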
Therefore, Spark cannot replace Hadoop; we need to keep the role and position of each straight in order to apply them well. I also like to follow WeChat official accounts such as "big data cn"; some of their introductory articles are quite good, and reading them regularly helps round out your knowledge.