IMF preliminary knowledge: what Spark is, explained in detail (four major features)

Updated 2025-07-13. Source: SLTechnology News&Howtos (Shulou)

The official Spark website describes Spark in one concise sentence: "Apache Spark is a fast and general engine for large-scale data processing."

We can extract the following information:

Spark is an engine.

It is fast.

It is general-purpose.

It is used to process data.

The data is large-scale.

Spark itself does not provide data storage; it is only a computing framework.

How fast is it?

When the data is in memory, Spark runs programs up to 100 times faster than Hadoop MapReduce; when the data is on disk, it is still about 10 times faster.

Why is it fast? Spark processes data with an advanced execution engine based on a DAG (directed acyclic graph), and it computes in memory.
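To make the DAG idea more concrete, here is a minimal sketch in plain Python, not Spark's actual API or implementation. The class LazyDataset and its methods are hypothetical names chosen for illustration. It only shows the general pattern: transformations are recorded as a plan first, and nothing is computed until a result is requested, which gives an engine the chance to see the whole plan before running it.

```python
# Toy illustration only: this is NOT Spark's API or implementation.
class LazyDataset:
    """Records transformations as a chain and runs them only on collect()."""

    def __init__(self, source=None, parent=None, name="source", op=None):
        self._source = source  # input data, used only by the root node
        self._parent = parent  # upstream node (None for the root)
        self._op = op          # function applied to the parent's iterator
        self.name = name       # label shown in the plan

    def map(self, fn):
        # Record the step; fn is not called yet.
        return LazyDataset(parent=self, name="map",
                           op=lambda it: (fn(x) for x in it))

    def filter(self, pred):
        # Record the step; pred is not called yet.
        return LazyDataset(parent=self, name="filter",
                           op=lambda it: (x for x in it if pred(x)))

    def plan(self):
        """Return the recorded steps, from the source to this node."""
        node, steps = self, []
        while node is not None:
            steps.append(node.name)
            node = node._parent
        return list(reversed(steps))

    def _iterate(self):
        if self._parent is None:
            return iter(self._source)
        return self._op(self._parent._iterate())

    def collect(self):
        """Run the whole recorded chain and return the results as a list."""
        return list(self._iterate())


calls = []  # records every value that double() actually processes

def double(x):
    calls.append(x)
    return x * 2

ds = LazyDataset(source=range(1, 6)).map(double).filter(lambda x: x > 4)
print(ds.plan())     # ['source', 'map', 'filter']
print(calls)         # [] because nothing has run yet
print(ds.collect())  # [6, 8, 10]
```

The chain here is linear, which is the simplest possible DAG. A real engine also has to handle steps with several inputs, such as joins, which this sketch leaves out.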

Easy to use:

You can develop applications quickly in Scala, Java, Python, and other languages. Spark provides more than 80 high-level operators that make it easy to build parallel applications. A word count takes only a few lines of code.
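As a rough illustration of how short a word count can be, the sketch below reproduces its three usual steps in plain Python, so it runs without a Spark installation. In PySpark these steps are typically written with the RDD operations flatMap, map, and reduceByKey. The helper reduce_by_key and the input lines are assumptions made for this example, not Spark code.

```python
from itertools import chain
from operator import add

# Hypothetical input; in Spark this would usually be read from a file.
lines = ["spark is fast", "spark is general", "hadoop is batch"]

# Step 1 (like flatMap): split every line into words, flattened into one sequence.
words = chain.from_iterable(line.split() for line in lines)

# Step 2 (like map): pair each word with an initial count of 1.
pairs = ((word, 1) for word in words)

# Step 3 (like reduceByKey): merge the values of identical keys with fn.
def reduce_by_key(pairs, fn):
    result = {}
    for key, value in pairs:
        result[key] = fn(result[key], value) if key in result else value
    return result

counts = reduce_by_key(pairs, add)
print(sorted(counts.items()))
# [('batch', 1), ('fast', 1), ('general', 1), ('hadoop', 1), ('is', 3), ('spark', 2)]
```

The difference in Spark is that each step runs in parallel across the partitions of a distributed dataset, while this version runs in a single process.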

Versatility:

Spark provides a one-stack solution for big data. It covers stream processing, graph computing, machine learning, SQL, and more.

This greatly reduces development, maintenance, and learning costs.

Run anywhere:

Spark can run on Hadoop YARN, on Mesos, in standalone mode, or in the cloud.

Spark can process data stored in HDFS, Cassandra, HBase, S3, and other systems.

Spark has developed very quickly.

[Figure: Spark development timeline]

After Spark joined Apache, it developed very rapidly, with new versions released frequently.

Spark ecosystem (BDAS, the Berkeley Data Analytics Stack)

MapReduce is one component of the Hadoop ecosystem, while Spark is one component of the BDAS ecosystem.

Hadoop includes MapReduce, HDFS, HBase, Hive, ZooKeeper, Pig, Sqoop, etc.

BDAS includes Spark, Shark (comparable to Hive), BlinkDB, Spark Streaming (a real-time stream-processing framework, similar to Storm), and so on.

[Figure: BDAS ecosystem map]

Comparison between MapReduce and Spark

Similarities and differences:

1. Basic principle

MapReduce performs disk-based batch processing of big data.

Spark processes data based on the RDD (resilient distributed dataset), and an RDD can be stored in memory or on disk.

2. Model

MapReduce is suited to batch processing of very large datasets, and to long-running jobs with few iterations.

Spark is suited to workloads with many iterations, such as data mining and machine learning.

3. Fault tolerance

In MapReduce, the result of each iteration must be written to disk and then read back from disk for the next computation. If one step fails, the whole job can fail.

Spark uses a DAG to split a job into many steps, and between iterations the intermediate data can be kept in memory. Spark also provides fault tolerance: a lost RDD partition can be recomputed from the record of how it was derived (its lineage).
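The idea of keeping results in memory while still recovering from a loss can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Spark's internals: the Partition class and its methods are hypothetical names. Each partition remembers the function and the parent it was derived from, so a lost in-memory copy can be rebuilt instead of failing the job.

```python
# Toy model only: names and structure are hypothetical, not Spark internals.
class Partition:
    """One slice of a dataset that remembers how it was derived."""

    def __init__(self, compute, parent=None):
        self._compute = compute  # function: parent data (or None) -> list
        self._parent = parent    # upstream partition, if any
        self._cache = None       # in-memory copy of the computed data
        self.computations = 0    # how many times this partition was built

    def get(self):
        # Build the data only if no in-memory copy exists.
        if self._cache is None:
            parent_data = self._parent.get() if self._parent else None
            self._cache = self._compute(parent_data)
            self.computations += 1
        return self._cache

    def lose(self):
        """Simulate losing the in-memory copy, e.g. because a worker died."""
        self._cache = None


base = Partition(lambda _: [1, 2, 3, 4])
squared = Partition(lambda data: [x * x for x in data], parent=base)

print(squared.get())         # [1, 4, 9, 16]
squared.lose()               # the cached result disappears
print(squared.get())         # [1, 4, 9, 16], rebuilt from base
print(squared.computations)  # 2
print(base.computations)     # 1
```

Only the lost partition is rebuilt; the parent is still in memory and is reused, so its computation count stays at 1.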
