This article gives a conceptual overview of Spark, the high-speed big data computing engine. It aims to be concise and easy to follow, and I hope you take something useful away from it.
Spark Core
Section 1 Spark Overview
1.1 What is Spark
Spark is a fast, general-purpose computing engine. Spark's characteristics:
Fast. Compared with MapReduce, Spark's memory-based operations are more than 100 times faster and its disk-based operations are more than 10 times faster. Spark implements an efficient DAG execution engine that processes data flows efficiently in memory.
Easy to use. Spark provides APIs for Scala, Java, Python, and R, along with more than 80 high-level operators, so users can quickly build many kinds of applications. Spark also ships interactive Python and Scala shells, which make it easy to try out solutions against a Spark cluster.
General-purpose. Spark offers a unified solution: it can be used for batch processing, interactive queries (Spark SQL), real-time stream processing (Spark Streaming), machine learning (Spark MLlib), and graph computation (GraphX), and these different kinds of processing can be combined seamlessly within the same application. This unified stack is very attractive to enterprises, which want a single platform for the problems they face in order to reduce development and maintenance costs as well as platform deployment costs.
Good compatibility. Spark integrates easily with other open-source products. It can use YARN or Mesos as its resource manager and scheduler, and it can process all the data Hadoop supports, including HDFS, HBase, and Cassandra. This is especially important for users who already have Hadoop clusters deployed: no data migration is required to take advantage of Spark's processing power. Spark can also run independently of third-party resource managers; its built-in Standalone framework provides resource management and scheduling, which further lowers the barrier to adoption and makes it easy to deploy and use Spark. In addition, Spark provides tools for deploying Standalone Spark clusters on EC2.
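As a quick illustration of the interactive shells mentioned above, here is a minimal sketch of a spark-shell session; it assumes a working Spark installation, and the variable name is purely for illustration.

```scala
// Inside spark-shell a SparkContext is already created and bound to `sc`,
// so a small computation can be verified interactively in a couple of lines.
val evens = sc.parallelize(1 to 1000).filter(_ % 2 == 0)  // RDD of the even numbers
println(evens.count())                                    // prints 500
```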
1.2 Spark and Hadoop
From a narrow perspective, Hadoop is a distributed framework composed of storage (HDFS), resource scheduling (YARN), and computation (MapReduce), while Spark is a distributed computing engine: a fast, general-purpose, scalable, memory-based big data analytics framework written in Scala. From a broad perspective, Spark is an integral part of the Hadoop ecosystem.
Shortcomings of MapReduce:
Limited expressive power
High disk I/O overhead
High latency: there is I/O overhead when bridging between tasks, and a downstream task cannot start until the previous task has finished
Difficult to use for complex, multi-stage computations
Spark, while drawing on the advantages of MapReduce, solves the problems faced by MapReduce well.
Note: Spark's computing model still follows the MapReduce pattern; the Spark framework is an optimization of the MR framework.
In practical applications, big data applications mainly include the following three types:
Batch processing (offline processing): time spans typically range from tens of minutes to several hours
Interactive queries: time spans usually range from tens of seconds to a few minutes
Stream processing (real-time processing): time spans typically range from hundreds of milliseconds to seconds
When all three scenarios are present, a traditional Hadoop stack requires three different systems to be deployed side by side. For example:
MapReduce / Hive or Impala / Storm
This inevitably leads to some problems:
Input and output data cannot be shared seamlessly between the different systems, and data format conversion is usually required.
Each system needs its own development and maintenance team, which raises operating costs.
It is difficult to coordinate and allocate resources uniformly among systems in the same cluster.
The Spark ecosystem covers all three of these scenarios, namely batch processing, interactive queries, and stream processing:
Spark's design follows the idea of "one software stack for different application scenarios" (all in one), and it has gradually grown into a complete ecosystem.
It supports SQL ad hoc queries, real-time stream computing, machine learning, and graph computation.
On top of the YARN resource manager, Spark provides a one-stop big data solution.
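To make the "one stack" idea concrete, here is a minimal, hypothetical sketch (the object name OneStackDemo and the sample data are made up) in which batch-style DataFrame processing and an interactive-style SQL query share the same SparkSession and the same data within a single application:

```scala
import org.apache.spark.sql.SparkSession

object OneStackDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("OneStackDemo").getOrCreate()
    import spark.implicits._

    // Batch-style processing and SQL queries share the same engine and the same data,
    // so no format conversion is needed between components.
    val events = Seq(("pageview", 3), ("click", 1), ("pageview", 5)).toDF("event", "cnt")
    events.createOrReplaceTempView("events")

    // DataFrame API, batch style ...
    events.groupBy("event").sum("cnt").show()

    // ... and an interactive-style SQL query over the very same view.
    spark.sql("SELECT event, SUM(cnt) AS total FROM events GROUP BY event").show()

    spark.stop()
  }
}
```

Streaming and MLlib jobs can be added to the same application in the same way, which is exactly what removes the format conversions and separate systems described above.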
Why Spark is faster than MapReduce:
1. Spark makes aggressive use of memory. A Job in the MR framework contains only one map stage and one reduce stage; if the processing logic is complex and cannot be expressed with a single map and reduce, multiple jobs have to be combined, and the results of the previous job must be written to HDFS before the next job can pick them up. A complex pipeline in the MR framework therefore involves a great deal of writing and reading. Spark can string multiple map and reduce steps together and execute them continuously, and the intermediate results never have to land on disk (a sketch illustrating this appears after point 2 below).
Complex MR tasks: mr + mr + mr ...
Complex Spark tasks: mr -> mr -> mr ...
2. Multi-process model (MR) vs. multi-threaded model (Spark). Map and Reduce tasks in the MR framework are process-level: each one runs in its own JVM process and must request resources every time it starts, which wastes time. Spark Tasks are based on a thread model; by reusing threads from a thread pool, Spark reduces the overhead of starting and stopping tasks.
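Point 1 can be illustrated with a small, hypothetical word-count pipeline (the object name and sample data are made up): several map- and reduce-style steps are chained into one DAG, nothing is written to HDFS in between, and only the final action triggers execution.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ChainedStages {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ChainedStages"))

    val lines = sc.parallelize(Seq("spark hadoop spark", "hadoop yarn", "spark mesos"))

    // Several map/reduce-style steps chained into one DAG; intermediate results are not
    // written to HDFS between steps, which is what a multi-job MR pipeline would require.
    val counts = lines
      .flatMap(_.split(" "))           // map-style step
      .map(word => (word, 1))          // map-style step
      .reduceByKey(_ + _)              // reduce-style step (shuffle)
      .filter { case (_, n) => n > 1 } // still lazy, no job has run yet
      .sortBy { case (_, n) => -n }    // another shuffle stage, still lazy

    counts.collect().foreach(println)  // a single action triggers the whole chained DAG
    sc.stop()
  }
}
```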
1.3 System Architecture
The Spark operating architecture includes:
Cluster Manager
Worker Node
Driver
Executor
The Cluster Manager manages the cluster's resources. Spark supports three cluster deployment modes:
Standalone, YARN, and Mesos;
Worker Node: manages the resources on its own node;
Driver Program: runs the application's main() method and creates the SparkContext. The Cluster Manager allocates resources, and the SparkContext sends Tasks to the Executors for execution;
Executor: runs on a Worker Node, executes the Tasks sent by the Driver, and reports the results back to the Driver;
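To show how these components divide the work, here is a minimal, hypothetical driver program (the object name is made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The Driver runs main() and creates the SparkContext; the SparkContext negotiates
// resources with the Cluster Manager, and the work is shipped as Tasks to Executors
// running on the Worker Nodes.
object DriverSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DriverSketch"))

    // The closure below is serialized and executed as Tasks inside Executors;
    // the final result is reported back to the Driver.
    val total = sc.parallelize(1 to 10000).map(_ * 2).sum()
    println(s"total = $total")

    sc.stop()
  }
}
```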
1.4 Spark Cluster Deployment Mode
Spark supports three cluster deployment modes: Standalone, YARN, and Mesos.
Standalone mode
Standalone mode comes with a complete set of services and can be deployed on a cluster by itself, without relying on any other resource-management system. To some extent, this mode is the basis for the other two.
Cluster Manager: Master
Worker Node: Worker
Only coarse-grained resource allocation is supported
Spark on YARN mode
YARN has strong community support and has gradually become the standard resource-management system for big data clusters. Spark on YARN supports two modes:
yarn-cluster: suitable for production environments
yarn-client: suitable for interactive use and debugging, when you want to see the application's output immediately
Cluster Manager: ResourceManager
Worker Node: NodeManager
Only coarse-grained resource allocation is supported
Spark on Mesos mode
The officially recommended mode: Spark was developed with Mesos support in mind, so Spark runs more flexibly and naturally on Mesos than on YARN.
Cluster Manager: Mesos Master
Worker Node: Mesos Slave
Both coarse-grained and fine-grained resource allocation are supported
Coarse-grained mode: the runtime environment of each application consists of one Driver and several Executors; each Executor occupies a certain amount of resources and can run multiple Tasks internally. All of the resources for the runtime environment are requested before any of the application's tasks actually run, and they remain occupied for the entire run even when they are not being used; they are only released after the program finishes.
Fine-grained mode: because coarse-grained mode can waste a lot of resources, Spark on Mesos also provides a second scheduling mode, fine-grained mode. Much like today's cloud computing, its core idea is on-demand allocation.
How to choose among the three cluster deployment modes:
In production environments, YARN is the most widely used mode in China
For Spark beginners: Standalone, which is simple
For development and test environments, Standalone is a good option
If the data volume is not too large and the application is not too complex, it is recommended to start with Standalone mode. Mesos is not covered further here.
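Whichever mode is chosen, the application code itself can stay the same; the cluster manager and deploy mode are selected when the job is submitted. A minimal sketch under that assumption (the object and jar names are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

// The master URL and deploy mode are chosen at submit time rather than in the code,
// e.g. (class and jar names are hypothetical):
//   spark-submit --master yarn --deploy-mode cluster --class DeployAgnosticApp app.jar
//   spark-submit --master yarn --deploy-mode client  --class DeployAgnosticApp app.jar
object DeployAgnosticApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DeployAgnosticApp")   // no setMaster() here: the cluster manager is supplied by spark-submit
      .getOrCreate()

    println(spark.range(0, 1000).count())
    spark.stop()
  }
}
```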
1.5 Related Terms
Application: a Spark application submitted by a user, consisting of one Driver and many Executors in the cluster
Application jar: a jar containing a Spark application; the jar should not bundle the Spark or Hadoop jars, which are added at runtime
Driver program: runs the application's main() method and creates the SparkContext
Cluster manager: a service that manages cluster resources, such as Standalone, Mesos, or YARN
Deploy mode: distinguishes where the Driver process runs; in cluster mode the Driver runs inside the cluster, in client mode it runs outside the cluster
Worker node: a node in the cluster on which the application runs
Executor: runs the application's Tasks and keeps their data; each application has its own Executors, independent of one another
Task: the smallest unit of work of an application, sent to an Executor for execution
Job: each call to an action in the user program produces a new job, i.e. every action creates a job
Stage: a job is broken down into multiple stages, each of which is a set of Tasks
The above is a conceptual overview of Spark, the high-speed big data computing engine; I hope it has given you something to take away.