Introduction: Like Hadoop, Spark provides a Map/Reduce API (distributed computing) and distributed storage. The main difference between the two is that Spark stores data in the cluster's memory, while Hadoop stores data on the cluster's disks.
This article is excerpted from "Spark GraphX in Action".
Big data is a major challenge for data science teams because a single machine lacks the capacity and scalability required for large-scale data processing. Moreover, even systems designed specifically for big data, such as Hadoop, struggle to handle graph data efficiently because of certain properties of that data, as we will see elsewhere in this chapter.
Apache Spark is similar to Hadoop in that data is distributed and stored across a cluster, or "nodes", of servers. The difference is that Spark stores data in memory (RAM), while Hadoop stores data on disk (mechanical hard drives or SSDs).
Definition: in the context of graphs and cluster computing, the word "node" has two distinct meanings. Graph data consists of vertices and edges, and in that context "node" is a synonym for vertex. In cluster computing, the physical machines that make up the cluster are also called "nodes". To avoid confusion, we call the nodes of a graph vertices, which is also the proper term in Spark. In this book, the word "node" refers strictly to a single physical computing node in the cluster.
Big data cannot be processed on a single computer because the volume of data is too large. Both Hadoop and Spark are distributed frameworks that spread data across cluster nodes. Spark keeps distributed datasets in memory, so it is much faster than Hadoop, which stores data on disk.
Aside from where the data being processed is kept (memory versus disk), Spark's API is easier to use than Hadoop's Map/Reduce API. Spark uses the concise and expressive Scala as its native programming language; the ratio of lines of Java code written for Hadoop Map/Reduce to lines of Scala written for Spark is typically around 10:1.
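To give a feel for that conciseness, here is a minimal word-count sketch written against Spark's classic SparkContext API in Scala; the HDFS paths are placeholders, not part of the original text.

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

    sc.textFile("hdfs:///data/input.txt")        // placeholder input path
      .flatMap(_.split("\\s+"))                  // Map: split each line into words
      .map(word => (word, 1))                    // Map: emit (word, 1) pairs
      .reduceByKey(_ + _)                        // Reduce: sum the counts per word
      .saveAsTextFile("hdfs:///data/wordcount")  // placeholder output path

    sc.stop()
  }
}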
Although this book mainly uses Scala, you don't need to worry if you are not familiar with it. Chapter 3 provides a quick introduction, including Scala's quirky, obscure, and concise syntax. Familiarity with at least one programming language such as Java, C++, C#, or Python is assumed.
The vague definition of big data
The current concept of "big data" has been greatly hyped. The concept can be traced back to Google's Google File System paper published in 2003 and its MapReduce paper published in 2004.
The term "big data" has many different definitions, and some of them have lost the meaning the term should carry. But the simple, essential core is this: data is "big" when it is too large to be processed on a single machine.
The amount of data has exploded. Data comes from website clicks, server logs, and sensor-equipped hardware; these are called data sources. Some of this data is graph data, meaning it is made up of edges and vertices, such as that of collaborative sites (part of "Web 2.0" social media). Large graph datasets are effectively crowdsourced, for example Wikipedia's interconnected knowledge, Facebook's friend data, LinkedIn's connection data, or Twitter's follower data.
Hadoop: the World before Spark
Before discussing Spark, let's summarize how Hadoop solves the big data problem, because Spark builds on the core Hadoop concepts described below.
Hadoop provides a framework for fault-tolerant parallel processing on a cluster of machines. Hadoop has two key capabilities:
HDFS: distributed storage
MapReduce: distributed computing
HDFS provides distributed, fault-tolerant storage. A single large file is split into blocks, typically 64 MB or 128 MB, and these blocks are scattered across different machines in the cluster. Fault tolerance comes from replicating each block to a number of machine nodes (3 different nodes by default; set to 2 in the following figure for simplicity). If one machine node fails and all the file blocks on that machine become unavailable, other machine nodes can supply the missing blocks. This is a key idea of the Hadoop architecture: machine failure is part of normal operation.
Three distributed data blocks, each maintained in two copies, in the Hadoop Distributed File System (HDFS).
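As a rough illustration of those block and replication settings, the sketch below uses Hadoop's FileSystem API from Scala to print a file's block size and replication factor; it assumes a reachable HDFS cluster and a hypothetical file path.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsBlockInfo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()                               // reads core-site.xml / hdfs-site.xml from the classpath
    val fs = FileSystem.get(conf)
    val status = fs.getFileStatus(new Path("/data/server.log"))  // hypothetical file
    println(s"Block size:  ${status.getBlockSize / (1024 * 1024)} MB")
    println(s"Replication: ${status.getReplication}")
    fs.close()
  }
}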
MapReduce is Hadoop's parallel processing framework, providing parallel and distributed computation, as shown in the following figure.
MapReduce is the data processing paradigm used by both Hadoop and Spark. The figure shows counting the number of times "error" appears in a server log file, a typical MapReduce operation. The Map operation is usually one-to-one, producing one transformed item for each source data item. Reduce is a many-to-one operation that aggregates the output of the Map phase.
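For example, the log count in the figure could be expressed in Spark's Scala API roughly as follows (in the spark-shell, which provides the SparkContext as sc; the log path is hypothetical).

// Map: one-to-one, each line becomes a 0 or a 1
// Reduce: many-to-one, sum the 1s into a single count
val errorCount = sc.textFile("hdfs:///logs/server.log")
  .map(line => if (line.contains("error")) 1 else 0)
  .reduce(_ + _)
println(s"'error' appeared $errorCount times")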
Using the MapReduce framework, programmers write code that encapsulates the map and reduce functions to process datasets stored on HDFS. To take advantage of data locality, the packaged code (a jar file) is shipped to the data nodes, and the Map operations run on those nodes, avoiding the network bandwidth cost of moving the data across the cluster. For the Reduce aggregation, the Map results are transferred to multiple Reduce nodes (a step called shuffling). The Map phase runs in parallel, and Hadoop provides a resilience mechanism that restarts the computation on another machine node when a node or a process fails.
The MapReduce programming framework abstracts the dataset into a stream of key-value pairs, processes those pairs, and writes the results back to HDFS. This is a limited paradigm, but it has been used to solve many data-parallel problems by chaining MapReduce jobs into read-process-write cycles. For simple tasks like the one in the figure above, the paradigm is a good fit. But for iterative algorithms such as machine learning algorithms, this MapReduce paradigm is painful to use, which is the reason to choose Spark.
Spark: MapReduce processing in memory
Let's take a look at an alternative distributed processing system: Spark, which builds on Hadoop. In this section you will learn about resilient distributed datasets (RDDs), which play an important role in how Spark processes graph data, and about two classes of problems for which Hadoop falls short:
Interactive queries
Iterative algorithms
Hadoop works well for a single query over a large dataset, but in many real-world scenarios, once we have an answer we want to ask the data another question; this is interactive querying. With Hadoop, that means waiting for the data to be reloaded from disk and processed again. It makes no sense to have to repeat the same set of computations as a prerequisite for every follow-up analysis.
Iterative algorithms are widely used in machine learning tasks, such as stochastic gradient descent, and in graph algorithms such as PageRank, which we will see later. An iterative algorithm applies a set of computations to a dataset over and over until some criterion (the loop's termination condition) is met. Implementing such an algorithm in Hadoop generally requires a series of MapReduce jobs that each load the data, and these jobs run again in every iteration. For very large datasets, each iteration can take hundreds or thousands of seconds, making the whole computation very slow.
Below you will see how Spark solves these problems. Like Hadoop, Spark runs on a cluster of machines with commodity hardware. One of the core abstractions in Spark is the resilient distributed dataset (RDD). RDDs are created by the Spark application (on the Spark driver) and managed by the cluster, as shown in the figure below.
Spark provides the resilient distributed dataset, which can be thought of as a distributed, memory-resident array.
The data partitions that make up an RDD are loaded onto machines across the cluster.
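A minimal sketch of that picture, again in the spark-shell: load a file from distributed storage into an RDD and inspect how many partitions it was split into (the path and partition count are illustrative).

val lines = sc.textFile("hdfs:///data/big-file.txt", minPartitions = 8)  // illustrative path
println(lines.partitions.length)  // number of partitions spread across the cluster's nodes
println(lines.count())            // runs a distributed job over those partitions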
Memory-based data processing
Most of the operations Spark performs happen in random access memory (RAM). Spark is memory-based, while Hadoop Map/Reduce processes data sequentially from disk, so Spark is better suited than Hadoop to the random access patterns of graph data.
The key benefit of Spark is that RDDs can be cached in memory during interactive queries and iterative processing. A cached RDD avoids reprocessing its parent RDD chain on every access; the cached result of the parent RDD computation is simply returned.
Naturally, taking advantage of Spark's in-memory processing requires that the machines in the cluster have enough memory. If there is not enough memory available, Spark gracefully spills data to disk so that processing can continue.
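The caching pattern looks roughly like this: persist() keeps the RDD in memory across passes, and StorageLevel.MEMORY_AND_DISK lets Spark spill partitions to disk if memory runs short (the path and keywords below are illustrative).

import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("hdfs:///logs/server.log")   // illustrative path
  .filter(_.contains("error"))
  .persist(StorageLevel.MEMORY_AND_DISK)              // keep in memory, spill to disk if needed

// Each pass reuses the cached RDD instead of recomputing its parent chain
for (keyword <- Seq("timeout", "refused", "disk full")) {
  println(s"$keyword: " + errors.filter(_.contains(keyword)).count())
}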
Of course, a Spark cluster also needs somewhere to persist data, and it must be a distributed storage system; options include HDFS, Cassandra, and Amazon S3.