Every year, the market sees a variety of distributed systems that differ in the scale, type, and speed of the data they manage. Of these systems, Spark and Hadoop are the two that get the most attention. But how can you tell which one is right for you?
Is it more sensible to batch-process traffic data and import it into HDFS, or to use Spark Streaming? If you want to do machine learning and predictive modeling, will Mahout or MLlib better meet your needs?
In depth: a head-to-head comparison of Hadoop and Spark across five dimensions!
To add to the confusion, Spark and Hadoop often work together, with Spark processing data that sits in HDFS, Hadoop's file system. However, they are separate and distinct entities, each with its own pros and cons and specific business use cases.
This article compares Spark and Hadoop from the following perspectives: architecture, performance, cost, security, and machine learning.
What is Hadoop?
Hadoop got its start as a Yahoo project in 2006 and has since become a top-level Apache open source project. It is a general-purpose form of distributed processing with several components:
HDFS (the distributed file system), which stores files in a Hadoop-native format and parallelizes them across the cluster
YARN, the scheduler that coordinates the application runtime
MapReduce, an algorithm that actually processes data in parallel.
Hadoop is built in Java and is accessible through many programming languages (including Python) for writing MapReduce code, via a Thrift client.
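As a concrete illustration, below is a minimal word-count sketch in Python. One assumption to flag: it targets Hadoop Streaming, the common stdin/stdout route for Python MapReduce jobs, rather than the Thrift client mentioned above, and the two file names and the job itself are purely illustrative.

    # mapper.py -- Hadoop Streaming mapper: emit "word<TAB>1" per word on stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(word + "\t1")

    # reducer.py (a separate file) -- Streaming sorts mapper output by key,
    # so identical words arrive on consecutive lines and can be summed
    # with a running counter
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, _, n = line.rstrip("\n").partition("\t")
        if word == current:
            count += int(n)
        else:
            if current is not None:
                print(current + "\t" + str(count))
            current, count = word, int(n)
    if current is not None:
        print(current + "\t" + str(count))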
In addition to these basic components, Hadoop includes:
Sqoop, which moves relational data into HDFS
Hive, a SQL-like interface that allows users to run queries on HDFS
Mahout, for machine learning.
In addition to using HDFS for file storage, Hadoop can now be configured to use S3 buckets or Azure blob as input.
It is available through the open source Apache distribution, or from vendors such as Cloudera (the largest Hadoop vendor), MapR, or HortonWorks.
What is Spark?
Spark is a newer project, initially developed in 2009 in the AMPLab at the University of California, Berkeley. It, too, is a top-level Apache project focused on processing data in parallel across a cluster, but the biggest difference is that it works in memory.
Whereas Hadoop reads and writes files to HDFS, Spark processes data in RAM using a concept known as an RDD, a Resilient Distributed Dataset. Spark can run in standalone mode, with a Hadoop cluster serving as the data source, or in conjunction with Mesos. In the latter case, the Mesos master replaces the Spark master or YARN for scheduling purposes.
Spark is built around Spark Core, the engine that drives scheduling, optimizations, and the RDD abstraction, and that connects Spark to the right file system (HDFS, S3, an RDBMS, or Elasticsearch). Several libraries run on top of Spark Core, including Spark SQL, which lets you run SQL-like commands on distributed datasets, MLlib for machine learning, GraphX for graph problems, and Spark Streaming, which allows the processing of continuously streamed record data.
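To make that concrete, here is a minimal PySpark sketch that touches Spark Core (through a SparkSession) and Spark SQL; the input path and table name are assumptions for the example:

    from pyspark.sql import SparkSession

    # Spark Core is driven through a SparkSession, which wraps the SparkContext
    spark = SparkSession.builder.appName("example").getOrCreate()

    # Read from any supported store -- a hypothetical HDFS path here
    df = spark.read.json("hdfs:///data/events.json")

    # Spark SQL: expose the distributed dataset as a table and query it
    df.createOrReplaceTempView("events")
    spark.sql("SELECT COUNT(*) AS n FROM events").show()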
Spark has several APIs. The original interface was written in Scala, and Python and R endpoints were added in response to heavy use by data scientists. Java is another option for writing Spark jobs.
Databricks, founded by Spark creator Matei Zaharia, is dedicated to providing Spark-based cloud services that can be used for data integration, data pipelines, and other tasks.
1. Architecture
Hadoop
First, all files passed into HDFS are split into blocks. Each block is replicated a specified number of times across the cluster, based on the configured block size and replication factor. That information is passed to the NameNode, which keeps track of everything across the cluster. The NameNode assigns the files to a number of data nodes, on which they are then written. High availability was implemented in 2012, allowing the NameNode to fail over to a backup node that tracks all the files in the cluster.
The MapReduce algorithm sits on top of HDFS and consists of a JobTracker. Once an application is written in one of the accepted languages, the JobTracker picks it up and allocates the work (which can include anything from counting words and cleaning log files to running a HiveQL query on data stored in the Hive warehouse) to TaskTrackers listening on other nodes.
YARN allocates the resources that the JobTracker spins up and monitors them, moving processes around for greater efficiency. All the results from the MapReduce phase are then aggregated and written to disk in HDFS.
Spark
Spark handles work in a similar way to Hadoop, except that computations are carried out in memory and stored there until the user actively persists them. Initially, Spark reads from a file in HDFS, S3, or another file store into an established mechanism called the SparkContext. From that, Spark creates a structure called an RDD, or Resilient Distributed Dataset, which represents an immutable collection of elements that can be operated on in parallel.
As the RDD and related operations are created, Spark also builds a DAG, or Directed Acyclic Graph, to visualize the order of operations and the relationships between them. Each DAG has stages and steps; in this way, it is similar to an explain plan in SQL.
On an RDD you can perform transformations, which are intermediate steps, and actions, which are final steps. The result of a given transformation goes into the DAG but is not persisted to disk, whereas the result of an action persists all the data in memory to disk.
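A small PySpark sketch of that distinction (the numbers are arbitrary): transformations only extend the DAG, and nothing executes until an action is called.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(10))
    doubled = rdd.map(lambda x: x * 2)            # transformation: recorded in the DAG, nothing runs yet
    evens = doubled.filter(lambda x: x % 4 == 0)  # another transformation, still lazy

    print(evens.collect())                        # action: the DAG executes here
    # evens.saveAsTextFile("hdfs:///out")         # an action that would persist results to disk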
A newer abstraction in Spark is DataFrames, which were developed as a companion interface to RDDs in Spark 2.0. The two are very similar, but DataFrames organize data into named columns, similar to Python's pandas or R's data frames. This makes them more user-friendly than RDDs, which have no comparable notion of column-level header references. SparkSQL also allows users to query DataFrames much like SQL tables in a relational data store.
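A brief sketch of that difference in practice (the rows and column names are invented for the example):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("df-demo").getOrCreate()

    # A DataFrame carries named columns, unlike a bare RDD of tuples
    df = spark.createDataFrame(
        [("alice", 34), ("bob", 29)],
        ["name", "age"],
    )
    df.select("name").where(df.age > 30).show()

    # SparkSQL treats the same DataFrame as a queryable table
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()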
2. Performance
Spark has been found to run 100 times faster in memory and 10 times faster on disk. It has also been used to sort 100 TB of data 3 times faster than Hadoop MapReduce, on one-tenth of the machines. Spark has particularly been found to be faster on machine learning applications, such as Naive Bayes and k-means.
Spark performance, measured by processing speed, has been found to be better than Hadoop for the following reasons:
Spark is not bound by input-output concerns every time it runs a selected part of a MapReduce task. Applications have proven to be much faster as a result.
Spark's DAGs enable optimization between steps. Hadoop has no cyclical connection between MapReduce steps, which means no performance tuning can occur at that level.
However, if Spark runs on YARN alongside other shared services, performance may degrade and RAM-overhead memory leaks can occur. For this reason, if a user has a batch-processing use case, Hadoop is considered the more efficient system.
3. Cost
Both Spark and Hadoop are freely available as open source Apache projects, which means you could potentially run them at zero installation cost. However, it is important to consider the total cost of ownership, which includes maintenance, hardware and software purchases, and hiring a team that understands cluster administration. The general rule of thumb for on-premises installations is that Hadoop requires more disk space, while Spark requires more RAM, which means that setting up a Spark cluster can be more expensive. In addition, because Spark is the newer system, experts in it are rarer and therefore more costly. Another option is to install with a vendor such as Cloudera for Hadoop or Databricks for Spark, or to run EMR/MapReduce processes in the cloud with AWS.
Because Hadoop and Spark are run in tandem, even on EMR instances configured with Spark installed, teasing apart a pricing comparison is difficult. For a very high-level comparison, assuming you choose a compute-optimized EMR cluster for Hadoop, the smallest instance, c4.large, costs $0.026 per hour. The smallest memory-optimized cluster for Spark costs $0.067 per hour. Therefore, Spark is more expensive on a per-hour basis, but since it optimizes for compute time, similar tasks should take less time on a Spark cluster.
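A back-of-the-envelope way to read those two rates (a sketch using only the hourly prices quoted above; a real EMR bill also depends on instance counts, EMR surcharges, and storage):

    # Hypothetical break-even: comparing the two quoted per-instance rates,
    # Spark is cheaper per job only if it finishes in under
    # 0.026 / 0.067 ~ 39% of the Hadoop runtime.
    hadoop_rate, spark_rate = 0.026, 0.067  # USD per instance-hour, from above

    hadoop_hours = 10.0  # assumed runtime of a batch job on the Hadoop cluster
    breakeven_hours = hadoop_hours * hadoop_rate / spark_rate
    print(f"Spark must finish in under {breakeven_hours:.1f} hours to cost less")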
4. Security
Hadoop is highly fault-tolerant because it was designed to replicate data across many nodes. Each file is split into blocks and replicated numerous times across many machines, ensuring that if a single machine goes down, the file can be rebuilt from other blocks elsewhere.
Spark's fault tolerance is achieved mainly through RDD operations. Initially, data at rest is stored in HDFS, which is fault-tolerant through Hadoop's architecture. As an RDD is built, so is a lineage, which remembers how the dataset was constructed and, since it is immutable, can rebuild it from scratch if need be. Data across Spark partitions can also be rebuilt across data nodes based on the DAG. Data is replicated across executor nodes, and it can generally be corrupted if a node or the communication between executors and drivers fails.
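That lineage can be inspected directly; a minimal sketch (the exact output format varies by Spark version):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(100)).map(lambda x: x * x).filter(lambda x: x > 10)

    # toDebugString() shows the chain of transformations Spark would replay
    # to rebuild a lost partition from the original data
    print(rdd.toDebugString().decode("utf-8"))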
Both Spark and Hadoop can support Kerberos authentication, but Hadoop has more fine-grained security controls for HDFS. Apache Sentry, a system for enforcing fine-grained metadata access, is another project aimed specifically at HDFS-level security.
Spark's security model is currently sparse, but it does allow authentication via a shared secret.
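For instance, that mechanism is switched on through Spark's spark.authenticate configuration properties; a sketch follows (the secret value is a placeholder, these settings normally live in spark-defaults.conf, and on YARN the secret is generated automatically):

    from pyspark.sql import SparkSession

    # spark.authenticate turns on shared-secret authentication between Spark
    # processes; spark.authenticate.secret supplies the key itself.
    spark = (SparkSession.builder
             .appName("auth-demo")
             .config("spark.authenticate", "true")
             .config("spark.authenticate.secret", "CHANGE-ME")  # placeholder secret
             .getOrCreate())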
5. Machine learning
Hadoop uses Mahout for processing data. Mahout includes clustering, classification, and batch-based collaborative filtering, all of which run on top of MapReduce. This is gradually being superseded by Samsara, a Scala-backed DSL that allows for in-memory and algebraic operations, and that lets users write their own algorithms.
Spark has a machine learning library, MLlib, used for iterative in-memory machine learning applications. It is available in Java, Scala, Python, or R, and includes classification and regression, as well as the ability to build machine learning pipelines with hyperparameter tuning.
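A minimal sketch of such a pipeline with hyperparameter tuning in PySpark (the toy rows, column names, and grid values are all assumptions):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

    # Toy training data: two numeric features and a binary label
    train = spark.createDataFrame(
        [(0.0, 1.1, 0), (0.2, 1.0, 0), (0.1, 1.3, 0), (0.3, 0.9, 0),
         (2.0, 0.1, 1), (2.2, 0.2, 1), (1.9, 0.3, 1), (2.1, 0.0, 1)],
        ["f1", "f2", "label"],
    )

    # Pipeline: assemble feature columns into a vector, then fit a classifier
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    pipeline = Pipeline(stages=[assembler, lr])

    # Hyperparameter tuning: grid search over regParam with cross-validation
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
    cv = CrossValidator(estimator=pipeline,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(),
                        numFolds=2)

    model = cv.fit(train)
    print(model.avgMetrics)  # mean AUC for each point in the grid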
Summary
So, is it Hadoop or Spark? These are two of the most prominent distributed systems for processing data on the market today. Hadoop is used mainly for disk-heavy operations with the MapReduce paradigm, while Spark is a more flexible, but more costly, in-memory processing architecture. Both are top-level Apache projects, are often used together, and have similarities, but it is important to understand the features of each when deciding to implement them.