
A Case Study of Spark Principles


This article walks through the principles of Spark by way of examples. I found it quite practical, so I am sharing it here in the hope that you take something away after reading it.

Hadoop has some shortcomings:

It is disk-based: whether under MapReduce or YARN, data is loaded from disk, flows through the DAG, and is then written back to disk.

Intermediate data produced during a computation has to be written to temporary files on HDFS.

All of this makes Hadoop "slow" for big data workloads, and Spark arose to address it.

The architectural design of Spark:

ClusterManager is responsible for allocating resources, much like the ResourceManager in YARN. In the construction-project analogy it is the steward who holds all the labor resources and belongs to Party B (the contractor).

WorkerNode is a node that can actually do the work. It is dispatched by the ClusterManager and is the one that really owns the resources for the job.

Executor is a process started on a WorkerNode, the equivalent of a foreman: it prepares the runtime environment for Tasks, executes them, and manages memory and disk usage.

Task is an individual piece of work in the construction project.

Driver is in charge of generating Tasks and sending them to Executors; it is the commander on Party A's (the client's) side.

SparkContext deals with the ClusterManager, applies (and pays) for resources, and is Party A's interface.

The whole interaction process goes like this:

1. Party A brings in a project and creates a SparkContext. The SparkContext goes to the ClusterManager to apply for resources and gives a quotation: how much CPU and how much memory are needed. The ClusterManager goes to the WorkerNodes, starts Executors, and introduces those Executors to the Driver.

2. The Driver splits the work into batches of Tasks according to the construction drawings (the application code) and sends the Tasks to the Executors for execution.

3. After receiving a Task, an Executor prepares the Task's runtime dependencies, executes it, and returns the result to the Driver.

4. The Driver directs the next step based on the returned Task status until all Tasks have finished. A minimal code sketch of this flow follows.
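As a concrete illustration of this exchange, here is a minimal word-count driver program; it is only a sketch, and the master URL, resource settings and HDFS paths are placeholders. The SparkContext created in main() is what applies to the ClusterManager for Executors, and the actions at the end are what the Driver turns into Tasks and ships to those Executors.

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster("spark://master:7077")    // cluster manager address (placeholder)
      .set("spark.executor.memory", "2g")  // the "quotation": memory per Executor
      .set("spark.executor.cores", "2")    // and CPU cores per Executor

    val sc = new SparkContext(conf)        // Driver-side entry point; applies for resources

    // The Driver turns this lineage into Tasks and sends them to Executors.
    val counts = sc.textFile("hdfs:///input/logs")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///output/wordcount") // results flow back through the Driver
    sc.stop()
  }
}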

Let's take a look at the following picture to deepen our understanding:

The core part of the picture concerns RDDs, built on top of the task-scheduling architecture introduced above; it is explained in more detail below.

SparkStreaming:

Scalable, high-throughput, fault-tolerant processing of real-time data streams built on Spark Core. It can consume data from sources such as Kafka and Flume, and the processed results can be stored in HDFS or a database, or pushed to a dashboard.
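A minimal Spark Streaming sketch follows, assuming a socket source on localhost:9999 purely for illustration; Kafka or Flume sources plug into the same DStream API, and the results could just as well be written to HDFS or a database instead of printed.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))     // 5-second micro-batches

    val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source; Kafka/Flume work similarly
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()         // could instead be saved to HDFS, a database, or a dashboard feed
    ssc.start()
    ssc.awaitTermination()
  }
}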

MLlib:

Spark's machine-learning library. As for machine learning itself, I still hope to spend some time studying the various algorithms systematically. Here is a set of Python-based ML blog posts: http://blog.csdn.net/yejingtao703/article/category/7365067.

SparkSQL:

Spark exposes a SQL-style API for connecting to various data channels such as Hive, JDBC and HBase. From a Java developer's point of view, this is interface-oriented programming and decoupling; O/R mapping, Spring Cloud Stream and the like follow a similar idea.
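As a rough sketch of this interface-oriented style (the JSON path and view name below are made up, and newer releases expose the API through SparkSession), the same SQL query works regardless of whether the data behind the view came from a file, a Hive table or a JDBC source:

import org.apache.spark.sql.SparkSession

object SqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SqlExample").getOrCreate()

    // The source could equally be spark.read.jdbc(...), a Hive table, etc.
    val users = spark.read.json("hdfs:///data/users.json")
    users.createOrReplaceTempView("users")

    // Code is written against the SQL interface, decoupled from the storage channel.
    spark.sql("SELECT city, COUNT(*) AS n FROM users GROUP BY city ORDER BY n DESC").show()

    spark.stop()
  }
}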

GraphX:

An API for graphs and graph-parallel computation; I have not used it yet.

RDD (Resilient Distributed Dataset):

RDDs support two kinds of operations: transformations and actions.

A transformation creates a new dataset from an existing one, for example map; an action computes over the dataset and returns the result to the Driver, for example reduce.

Transformations on an RDD are lazy and only actually run when an action is triggered. This design lets Spark run more efficiently, because only the result the action asks for has to be sent to the Driver rather than the entire, potentially huge, intermediate dataset.
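A small sketch of this laziness, assuming an existing SparkContext sc and a placeholder input path: the three transformations only record the lineage, and nothing runs until the count() action.

val lines  = sc.textFile("hdfs:///input/logs")      // transformation: no job submitted yet
val errors = lines.filter(_.contains("ERROR"))      // transformation: still nothing runs
val codes  = errors.map(_.split(" ")(1))            // transformation: still nothing runs

val n = codes.count()                               // action: the whole chain executes now
println(s"error lines: $n")                         // only this small result travels to the Driver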

Caching (not limited to memory: disk and distributed storage components also work) is the key to Spark's iterative algorithms and fast interactive queries. When an RDD is persisted, each node keeps its partitions of the result in cache and reuses them in later actions on that dataset, which makes subsequent actions much faster (a rule of thumb is around 10x). For example, if the results of RDD1 and RDD2 are already in memory after an earlier execution, then a new lineage RDD0 → RDD1 → RDD3 only needs to compute the last step.
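A sketch of the same idea with persist(), again assuming an existing sc and a placeholder path: after the first action the cached rdd1 stays on the nodes, so the branch to rdd3 only computes the final step.

import org.apache.spark.storage.StorageLevel

val rdd0 = sc.textFile("hdfs:///input/logs")
val rdd1 = rdd0.map(_.toLowerCase)
  .persist(StorageLevel.MEMORY_AND_DISK)       // not limited to memory: spill to disk if needed

rdd1.count()                                   // first action: rdd0 -> rdd1 is computed and cached

val rdd3 = rdd1.filter(_.contains("error"))    // a new branch hanging off the cached rdd1
rdd3.count()                                   // only the last step rdd1 -> rdd3 is computed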

Wide and narrow dependencies between RDDs:

Narrow dependency: each Partition of the parent RDD is only used by one Partition of the child RDD.

Wide dependency: each Partition of the parent RDD is used by multiple Partitions of the child RDD.

"Wide" and "narrow" can be pictured as a trouser belt: pulled tight, there is only one child; worn loose, it cannot help producing a whole brood of them.

For narrow dependencies, a parent partition and its child partition can be handled together on a single node, and these partitions can be computed in parallel, independently of one another; for wide dependencies the opposite is true, because data must be shuffled across partitions.

Narrow dependencies are also more efficient for fault recovery: if a child partition is lost, it can be rebuilt by recomputing its parent, and since that parent feeds only this one child, none of the recomputation is wasted. For wide dependencies recovery is much less efficient, as shown in the following figure:

The code the user writes is expressed as RDD Objects. The DAGScheduler splits the DAG into Stages at the RDDs' wide dependencies; each Stage contains a set of Tasks with exactly the same logic that can be executed concurrently. The TaskScheduler then pushes those Tasks to the Executors launched on the Workers obtained from the ClusterManager.

DAGScheduler (this part is unified; Spark itself is in charge of it, whichever cluster manager is used):

For a detailed case study of how Stages are divided, see the following figure:
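Besides the figure, the lineage itself can be inspected; in the sketch below (input path assumed), map is a narrow dependency while reduceByKey is a wide one, so toDebugString shows the DAG cut into two Stages at the shuffle.

val words  = sc.textFile("hdfs:///input/logs").flatMap(_.split(" "))
val pairs  = words.map(word => (word, 1))   // narrow dependency: no shuffle, same Stage
val counts = pairs.reduceByKey(_ + _)       // wide dependency: shuffle, DAGScheduler starts a new Stage

// Indentation in the printed lineage marks the Stage boundary at the wide dependency.
println(counts.toDebugString)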

Mesos mode:

The Worker part adopts a Master/Slave model. The Master is the core component of the whole system, so ZooKeeper is used to reinforce it for high availability. The Slaves are what actually create Executors and execute Tasks; they report their physical computing resources to the Master, and the Master is responsible for allocating the Slaves' resources to frameworks according to policy.

Mesos resource scheduling is divided into coarse-grained and fine-grained methods:

In the coarse-grained mode, all the resources needed to execute the Tasks are requested from the Master directly at startup and released only after all computing tasks have completed; in the fine-grained mode, resources are continuously requested and returned according to what each Task needs. Each approach has its pros and cons: coarse-grained scheduling overhead is small, but because of the "bucket effect" (everything waits for the slowest task) resources stay occupied for a long time; fine-grained mode has no bucket effect, but its scheduling and management overhead is high.
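As a configuration sketch (the master URL is a placeholder, and the exact defaults depend on the Spark version in use), the choice between the two Mesos modes was historically driven by the spark.mesos.coarse setting:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MesosModeExample")
  .setMaster("mesos://zk://zk1:2181,zk2:2181/mesos") // HA Mesos master behind ZooKeeper (placeholder)
  .set("spark.mesos.coarse", "true")                 // coarse-grained: hold resources for the whole job
  // .set("spark.mesos.coarse", "false")             // fine-grained: request/return resources per Task
  .set("spark.cores.max", "8")                       // cap total cores to limit long-held resources

val sc = new SparkContext(conf)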

YARN mode:

Shuffle module:

Hash Based Shuffle:

During the shuffle, each Task of StageA hashes its own Partition's records according to StageB's partitioning requirement, producing as many output partitions as StageB has Tasks, and this happens for every Task in StageA, so the intermediate process generates len(stageA.tasks) * len(stageB.tasks) small files. If the RDD results need to be cached in memory, the downstream StageB also has to merge them, which again involves network overhead and reads of scattered files. So the Hash-based mode copes poorly once a job exceeds a certain scale.

Later Spark versions introduced Consolidate to optimize the Hash-based model, but it only reduces the number of block files to a certain extent and does not fundamentally solve the defects above.

Sort Based Shuffle (the default since Spark 1.2):

In Sort mode, each Task of StageA produces two files: a content file and an index file. The content file is written after sorting the records according to StageB's partitioning requirement, giving one large file per Task; the index file is an auxiliary description of the content file that records the boundaries between the different sub-partitions, and StageB's Tasks use it to extract their data. In this way the number of files generated by the intermediate process drops from len(stageA.tasks) * len(stageB.tasks) to 2 * len(stageA.tasks), and StageB reads the content files sequentially. Another benefit of sorting is that one large file is more convenient to compress and decompress than many scattered small files, and compression reduces network IO. (PS: compression and decompression do eat CPU, so this should be weighed sensibly.)

Sort and Hash modes are configured through spark.shuffle.manager.
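For the Spark 1.x releases this article targets, the switch looked roughly like the sketch below (a local master is used only to make it self-contained); later releases dropped the hash manager and kept sort as the only built-in shuffle.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ShuffleManagerExample")
  .setMaster("local[*]")                     // placeholder so the sketch is runnable standalone
  .set("spark.shuffle.manager", "sort")      // the default since Spark 1.2
  // .set("spark.shuffle.manager", "hash")   // the legacy hash-based shuffle
  .set("spark.shuffle.compress", "true")     // compress shuffle output; saves network IO, costs CPU

val sc = new SparkContext(conf)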

Storage module:

Storage media: memory, disk, and Tachyon (a distributed in-memory file system; unlike Redis, which is a distributed in-memory database). A storage level is one of these media, used alone or in combination, together with strategies such as fault tolerance and serialization, for example memory + disk.

The component responsible for storage is the BlockManager. There is a BlockManager on both the Master (Driver) side and the Slave (Executor) side, with a different division of labor: the Slave side registers its own BlockManager with the Master and manages the actual blocks, while the Master side is only responsible for management and scheduling.

By default the Storage module's working memory accounts for 60% of the memory allocated to an Executor, so allocating Executor memory sensibly and choosing an appropriate storage level is a matter of balancing Spark's performance against its stability.
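A sketch of how these knobs fit together in Spark 1.x style configuration (paths and values are illustrative; newer releases replaced spark.storage.memoryFraction with unified memory management):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("StorageExample")
  .setMaster("local[*]")                        // placeholder so the sketch is runnable standalone
  .set("spark.storage.memoryFraction", "0.6")   // ~60% of Executor memory for cached blocks (old default)

val sc = new SparkContext(conf)
val data = sc.textFile("hdfs:///input/logs")    // placeholder path

// A combined storage level: serialized blocks in memory, overflow spilled to disk.
data.persist(StorageLevel.MEMORY_AND_DISK_SER)
println(data.count())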

The emergence of Spark makes up for Hadoop's shortcomings in big-data processing. At the same time, as its ecosystem keeps branching out, the many derived modules and interfaces enrich Spark's application scenarios and lower the barrier to adopting Spark and related technologies.

The above is a case study of Spark's principles. Some of these points are likely to come up in everyday work, and I hope you can take something away from this article.
