In this issue, the editor brings you the concepts, architecture, and working mechanism of Spark. The article is rich in content and analyzes the topic from a professional point of view. I hope you will gain something after reading it.
I. Comparison of the three frameworks: Hadoop, Spark, and Storm
Hadoop: disk-based batch processing of massive offline data
Spark: memory-based processing
Storm: real-time processing of streaming data
Spark features:
1. Fast: it uses a DAG execution engine that supports cyclic data flow and in-memory computation.
2. Easy to use: it supports programming in multiple languages and interactive programming through the spark-shell (see the sketch after this list).
3. General purpose: it provides a complete and powerful technology stack, including SQL query, streaming computation, machine learning, and graph-algorithm components.
4. Multiple running modes: it can run in standalone cluster mode, on Hadoop, on Amazon EC2 and other cloud environments, and it can access a variety of data sources such as HDFS, HBase, and Hive.
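As a minimal, hedged sketch of item 2, the following could be typed into an interactive spark-shell session; the HDFS path and variable names are illustrative and not from the article, and `sc` is the SparkContext that spark-shell creates automatically:

```scala
// Start with: ./bin/spark-shell
// A hypothetical input file; replace the path with one that exists on your cluster.
val lines = sc.textFile("hdfs://localhost:9000/user/hadoop/word.txt")

// filter is a lazy transformation; count() is an action that triggers execution.
val sparkLines = lines.filter(line => line.contains("Spark"))
println(sparkLines.count())
```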
Scala: multi-paradigm programming language
It supports functional programming (in the tradition of Lisp and Haskell).
It runs on the Java platform (the JVM) and is compatible with Java programs.
Scala features: strong support for concurrency, functional programming, and distributed systems.
Its concise syntax makes it possible to provide elegant APIs.
Scala is compatible with Java, runs fast, and can be integrated into the Hadoop ecosystem.
Scala is the main programming language of Spark. It provides a REPL (interactive interpreter), which improves the efficiency of program development.
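To illustrate the concise, functional style described above, here is a small sketch that could be entered into the Scala REPL; the numbers are arbitrary examples:

```scala
// Higher-order functions over an immutable collection in the Scala REPL.
val nums = List(1, 2, 3, 4, 5)

// map/filter/reduce compose cleanly, much like RDD transformations in Spark.
val sumOfSquaresOfEvens = nums.filter(_ % 2 == 0).map(n => n * n).reduce(_ + _)
// sumOfSquaresOfEvens == 20
```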
Comparison between Spark and Hadoop
The disadvantages of Hadoop MapReduce:
1. Limited expressiveness: computations can only be expressed in terms of map and reduce.
2. High disk overhead.
3. High latency, because intermediate results must be written to disk.
4. The connections between tasks involve I/O overhead.
Advantages of Spark over Hadoop MapReduce:
1. It is not limited to map and reduce; it provides many types of dataset operations, so the programming model is more flexible than Hadoop MapReduce's.
2. Spark provides in-memory computation and can keep intermediate results in memory, which is much more efficient for iterative operations (a sketch follows this list).
3. Its DAG-based task-scheduling mechanism is more efficient.
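As a hedged illustration of advantage 2, the sketch below caches a base RDD so that repeated iterations read it from memory rather than re-reading the source; the data, iteration count, and arithmetic are made up purely for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IterativeSketch").setMaster("local[*]"))

    // cache() keeps the dataset in memory across iterations instead of recomputing it from the source.
    val data = sc.parallelize(1 to 1000000).cache()

    var acc = 0L
    for (i <- 1 to 5) {
      // Each pass reuses the cached RDD; only the per-iteration map/reduce work is redone.
      acc += data.map(_.toLong * i).reduce(_ + _) % 97
    }
    println(s"acc = $acc")
    sc.stop()
  }
}
```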
II. Spark ecosystem
The Spark ecosystem mainly consists of the Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX components.
In practice, big data processing mainly involves three types of scenarios, each traditionally served by a different tool:
1. Batch processing of massive data: MapReduce
2. Interactive querying of historical data: Cloudera Impala
3. Processing of real-time data streams
Spark can be deployed on the resource manager YARN to provide a one-stop big data solution.
Spark can support massive-data batch processing, historical data analysis, and real-time data processing at the same time.
The Spark ecosystem has grown into the Berkeley Data Analytics Stack (BDAS).
Application scenarios of Spark ecosystem components
III. Spark operating architecture
1. Basic concepts: RDD, DAG, Executor, Application, Task, Job, Stage
RDD: short for Resilient Distributed Dataset; it is an abstraction of distributed memory and provides a highly restricted shared-memory model.
Executor: compared with Hadoop MapReduce, it has two advantages:
1. It uses multithreading to execute specific tasks, which reduces task start-up overhead.
2. It uses memory and disk together as storage devices, which effectively reduces I/O overhead.
2. The basic principle of Spark operation
1. Build the basic running environment: the Driver creates a SparkContext, which applies for resources and allocates and monitors their use.
2. The resource manager allocates resources to it and starts the Executor processes.
3. SparkContext builds a DAG according to the dependencies among RDDs; the DAG is submitted to the DAGScheduler, which parses it into stages and submits them to the underlying TaskScheduler. Executors apply to SparkContext for tasks, and the TaskScheduler sends tasks to the Executors to run, along with the application code.
4. Tasks run in the Executors and feed their results back to the TaskScheduler and then upward, layer by layer; finally, the resources are released.
Features of the running architecture: tasks run as multiple threads; the running process is independent of the resource manager; tasks are optimized using data locality and speculative execution.
3. RDD concept
Design background: in iterative algorithms, intermediate results are reused across steps; with MapReduce, those intermediate results are constantly written to and read from disk, which brings a lot of overhead.
Typical execution process of RDD
1) An RDD is created (and partitioned) by reading an external data source.
2) The RDD goes through a series of transformation operations; each transformation generates a new RDD that is used by the next transformation.
3) The last RDD is evaluated by an action operation and the result is output to an external data source.
Advantages: lazy evaluation, pipelining, no synchronization waiting, and no need to store intermediate results (a sketch of this pipeline follows).
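The following hedged sketch mirrors the three steps above: create an RDD from an external source, chain transformations, then trigger the computation with an action. The file paths and field handling are hypothetical, and `sc` is assumed to be an existing SparkContext (for example, from spark-shell):

```scala
// 1) Create an RDD from an external source (lazy; nothing is read yet).
val logs = sc.textFile("hdfs://localhost:9000/data/app.log")

// 2) A chain of transformations, each producing a new RDD for the next step.
val errorCounts = logs
  .filter(_.contains("ERROR"))
  .map(line => (line.split(" ")(0), 1)) // key by the first field, e.g. a date
  .reduceByKey(_ + _)

// 3) An action triggers the whole pipeline and writes the result to an external sink.
errorCounts.saveAsTextFile("hdfs://localhost:9000/output/error-counts")
```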
Reasons for efficiency:
1) Fault tolerance: existing approaches achieve fault tolerance through logging, whereas RDDs are inherently fault tolerant; a damaged RDD can be rebuilt from its parent at low cost. Each transformation of an RDD generates a new RDD, so there is a pipeline-like lineage of dependencies between RDDs. When the data of some partitions is lost, Spark can recompute just the lost partitions through this lineage instead of recomputing all partitions of the RDD.
2) Intermediate results are kept in memory, which avoids unnecessary disk read/write overhead.
3) The stored data can be Java objects, which avoids unnecessary serialization and deserialization of objects.
RDD dependencies: narrow dependencies and wide dependencies.
Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD, as with map, filter, and union ("only child"). That is, a partition of the parent RDD feeds only one partition of the child RDD. If a partition of the parent RDD feeds multiple partitions of the child RDD, it must be a wide dependency.
Wide (shuffle) dependency: a partition of the parent RDD is used by multiple partitions of the child RDD; operations such as groupByKey, reduceByKey, and sortByKey produce wide dependencies ("multiple children"). The data in each partition of the parent RDD may be partly sent to every partition of the child RDD, i.e., multiple partitions of the child RDD depend on the same parent partition. A wide dependency marks a stage boundary.
Purpose: dependencies determine how the job is divided into stages.
The overall idea of Spark's stage division is to traverse the DAG backward from the final RDD: when a wide dependency is encountered, the DAG is cut there and a new stage begins; when a narrow dependency is encountered, the current RDD is added to the current stage. In the example figure the original article refers to, RDD C, RDD D, RDD E, and RDD F are placed in one stage, RDD A is placed in a separate stage, and RDD B and RDD G are placed in the same stage. A sketch of how a shuffle cuts a stage boundary follows.
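As a hedged sketch of this division, the small job below pipelines two narrow transformations (map, filter) within one stage, while reduceByKey introduces a shuffle and therefore a new stage. The data and names are illustrative, and `sc` is assumed to be an existing SparkContext:

```scala
val words = sc.parallelize(Seq("spark", "hadoop", "spark", "storm"))

// map and filter are narrow dependencies: they are pipelined within a single stage.
val pairs = words.map(w => (w, 1)).filter { case (w, _) => w.nonEmpty }

// reduceByKey is a wide (shuffle) dependency: the DAGScheduler cuts a stage boundary here.
val counts = pairs.reduceByKey(_ + _)

// toDebugString prints the lineage; the ShuffledRDD entry marks where the new stage begins.
println(counts.toDebugString)
```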
Division of Stage:
ShuffleMapStage and ResultStage:
Simply put, the last stage of the DAG generates a ResultTask for each result partition; the number of tasks in each stage is determined by the number of partitions of the last RDD in that stage. All other stages generate ShuffleMapTasks; they are called ShuffleMapTasks because their computed results need to be shuffled to the next stage. In the MapReduce analogy (the original article's figure shows stage 1, stage 2, and stage 3), stage 1 and stage 2 correspond to the mappers, and stage 3, made up of ResultTasks, corresponds to the reducer.
IV. Spark SQL
Spark SQL is another component of Spark. First, a word about Shark (Hive on Spark): to stay compatible with Hive, Shark reuses Hive's HiveQL parsing, logical-execution-plan translation, and related logic, and translates HiveQL operations into RDD operations on Spark. It is equivalent to replacing the original MapReduce engine with Spark when the logical plan is finally converted into a physical plan.
Compared with Shark, Spark SQL no longer depends on Hive; it implements its own SQL engine and depends on Hive only for HiveQL parsing and Hive metadata. After an HQL query is parsed into an abstract syntax tree, everything that follows is Spark SQL's own and no longer relies on Hive's original components. Spark SQL adds SchemaRDD, which can encapsulate many kinds of data, making data analysis more powerful. It also supports more languages: in addition to Scala, Java, and Python, it supports R.
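As a small, hedged sketch of querying structured data with Spark SQL (shown here with the SparkSession and DataFrame API of later Spark releases rather than the original SchemaRDD; the file path, view name, and query are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkSqlSketch").master("local[*]").getOrCreate()

    // Load a hypothetical JSON file into a DataFrame and register it as a SQL view.
    val people = spark.read.json("people.json")
    people.createOrReplaceTempView("people")

    // Run SQL directly; the result is again a DataFrame.
    spark.sql("SELECT name, age FROM people WHERE age > 20").show()

    spark.stop()
  }
}
```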
V. Spark installation and deployment
Deployment modes: 1. Standalone; 2. Spark on Mesos; 3. Spark on YARN (see the sketch below).
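As a hedged illustration, the same application can target any of these cluster managers just by changing the master URL; the host names and ports below are placeholders, and in practice the master is usually set via spark-submit rather than in code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative master URLs for the three deployment modes, plus local mode for testing.
val standaloneConf = new SparkConf().setAppName("DeploySketch").setMaster("spark://master-host:7077") // Standalone
val mesosConf      = new SparkConf().setAppName("DeploySketch").setMaster("mesos://master-host:5050") // Spark on Mesos
val yarnConf       = new SparkConf().setAppName("DeploySketch").setMaster("yarn")                     // Spark on YARN
val localConf      = new SparkConf().setAppName("DeploySketch").setMaster("local[*]")                 // local testing

// Pick one configuration to create the SparkContext.
val sc = new SparkContext(localConf)
```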
Application deployment in the enterprise
VI. Spark programming
Write an application
1. Load a file into an RDD
2. Set the environment variables
3. Create a SparkContext
4. Apply transformation operations
5. Apply an action operation to compute a result
6. Create an sbt build file
7. Package the application with sbt
8. Submit the JAR package to Spark to run it (a sketch of these steps follows).
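Putting steps 1-5 together, here is a hedged sketch of a minimal self-contained application; the input path and application name are placeholders, not from the article:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Steps 2-3: configure the environment and create the SparkContext.
    val conf = new SparkConf().setAppName("SimpleApp")
    val sc = new SparkContext(conf)

    // Step 1: load a (hypothetical) file into an RDD.
    val logData = sc.textFile("hdfs://localhost:9000/user/hadoop/README.md")

    // Step 4: a lazy transformation.
    val linesWithSpark = logData.filter(_.contains("Spark"))

    // Step 5: an action forces the computation.
    println(s"Lines containing 'Spark': ${linesWithSpark.count()}")

    sc.stop()
  }
}
```

For steps 6-8, a build.sbt would declare the Scala version and a libraryDependencies entry for spark-core, `sbt package` would produce the JAR under target/, and the JAR would then be run with `spark-submit --class SimpleApp --master <master-url> <path-to-jar>`, where the master URL and JAR path depend on the deployment.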
The above is the concept, structure, and working mechanism of Spark shared by the editor. If you happen to have similar doubts, you may refer to the above analysis. If you want to know more, you are welcome to follow the industry information channel.