
How to understand spark, big data's distributed computing technology

2025-01-15 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/01 Report --

In this issue, the editor brings you an approach to understanding spark, big data's distributed computing technology. The article is rich in content and analyzes the topic from a professional point of view. I hope you get something out of reading it.

This series for systematically learning spark is called: spark2.x from shallow to deep. The so-called "shallow" refers to a concrete scenario or problem; "deep" means understanding the principles behind solving that scenario or problem; "deeper" means how those scenarios and principles are implemented at the code level, which involves reading the source code; and "in the end" comes practice, using practice to generate value.

Here are the steps for systematically learning spark:

First, spark2.x from shallow to deep series 1: a correct understanding of spark

Only with a correct understanding of spark can we determine the direction and approach of learning spark; this understanding is the street lamp that guides us.

This part is given as a video; the link is: http://edu.51cto.com/course/10932.html.

In this video, we will figure out the following questions:

1: thoroughly understand what RDD is and its characteristics

2: thoroughly understand what spark's distributed memory computing is

3: thoroughly understand how spark solves problems in various fields, and the characteristics of its solutions in each.

At the same time, we will thoroughly understand the following two questions:

1: spark is memory-based and MapReduce is disk-based, so spark is faster than MapReduce. Is that true?

2: how should spark's distributed memory be used properly, and in which scenarios is it appropriate to use it?

Second, spark2.x from shallow to deep series 2: spark core RDD api

This part systematically and thoroughly explains the usage of each scala RDD API in spark core, the points to note when using them, and the principle behind each API.

This part is given as a video; the link is: http://edu.51cto.com/course/11058.html.

The content in this video includes the following:

Chapter 1: a description of the course content and the environment the course requires.

Chapter 2: understanding scala

Understand the basic concepts of scala, including:

1. Object-oriented programming of scala

2. Functional programming of scala

3. Two features of scala: closures and the Option data structure (see the sketch below)
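
As a quick taste of those two features, here is a minimal, self-contained scala sketch (no spark required); the names and data are illustrative only:

```scala
object ScalaBasics {
  def main(args: Array[String]): Unit = {
    // Closure: the function value captures the outer variable `factor`.
    val factor = 3
    val times = (x: Int) => x * factor
    println(times(5)) // 15

    // Option: a type-safe alternative to null for possibly-absent values.
    val ages = Map("tom" -> 20)
    val maybeAge: Option[Int] = ages.get("jerry") // None: key is absent
    println(maybeAge.getOrElse(-1)) // -1, no null check needed
  }
}
```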

Chapter 3: RDD concept

Based on the characteristics of RDDs, this chapter puts forward the definition of an RDD and its advantages.

Chapter 4: the creation of RDD

The APIs for creating RDDs are explained in detail, and the principles of and differences between the parallelize and makeRDD APIs are analyzed, as sketched below.
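
As a hedged sketch of that difference (assuming a local spark 2.x setup; the app name and data are illustrative): makeRDD with a plain Seq simply delegates to parallelize, while a second overload also accepts preferred locations.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CreateRddExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("create-rdd").setMaster("local[2]"))

    // parallelize distributes a local collection over the requested partitions.
    val rdd1 = sc.parallelize(1 to 10, numSlices = 4)
    println(rdd1.getNumPartitions) // 4

    // makeRDD(seq) delegates to parallelize; another overload of makeRDD
    // additionally takes preferred locations per group of elements.
    val rdd2 = sc.makeRDD(Seq("a", "b", "c"))
    println(rdd2.collect().mkString(","))

    sc.stop()
  }
}
```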

Chapter 5: RDD dependencies

Covers the design of RDD dependencies and explains in detail why RDDs are designed around dependencies.

Chapter 6: RDD partitioning

1. The working principle of the RDD partitioner HashPartitioner, shown with schematic diagrams.

2. How to optimize performance by choosing a partitioner.

3. The working principle and usage scenarios of RangePartitioner, explained with schematic diagrams and source code.

4. Customizing an RDD Partitioner, with an example (see the sketch after this list).

5. The APIs that control the number of RDD partitions, coalesce and repartition: their usage scenarios and the difference between them.

6. The principle of coalesce, explained with schematic diagrams and source code.
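
A minimal sketch of a custom partitioner, plus the coalesce/repartition relationship from items 5 and 6. ParityPartitioner is an illustrative name, not a spark class, and the sketch assumes non-negative Int keys on a local spark 2.x setup.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Sends even keys to partition 0 and odd keys to partition 1.
class ParityPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = key.asInstanceOf[Int] % 2
}

object PartitionerExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partitioner").setMaster("local[2]"))
    val pairs = sc.parallelize(1 to 8).map(k => (k, k * k))

    val byParity = pairs.partitionBy(new ParityPartitioner)
    println(byParity.glom().collect().map(_.length).mkString(",")) // 4,4

    // coalesce(n) narrows partitions without a shuffle by default;
    // repartition(n) is simply coalesce(n, shuffle = true).
    println(byParity.coalesce(1).getNumPartitions) // 1
    sc.stop()
  }
}
```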

Chapter 7: detailed explanation of the single-type RDD APIs

1. The use of the single-type RDD transformation APIs and points to note, including map, mapPartitions, flatMap, etc.

2. Detailed explanation of the principle of MapPartitionsRDD at the source-code level.

3. Introduction to the RDD sampling APIs (sample, etc.).

4. Introduction to the RDD stratified-sampling APIs (sampleByKey, etc.).

5. How to use RDD's pipe API and points to note when using it.

6. In-depth explanation of the principle of RDD's pipe.

7. Explanation of the basic action APIs of single-type RDDs, including foreach, first, collect, etc.

8. Explanation of the aggregation action APIs of single-type RDDs, including reduce, fold and aggregate; the principles of and differences between reduce and treeReduce, and between aggregate and treeAggregate, are also analyzed (see the sketch below).
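
To make item 8 concrete, here is a hedged sketch computing a (sum, count) pair with aggregate and treeAggregate. Both take a zero value, a within-partition seqOp and a cross-partition combOp; treeAggregate merges partial results in a tree of the given depth rather than all at once on the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AggregateExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("aggregate").setMaster("local[4]"))
    val nums = sc.parallelize(1 to 100, 8)

    val seqOp  = (acc: (Long, Long), x: Int) => (acc._1 + x, acc._2 + 1)          // within a partition
    val combOp = (a: (Long, Long), b: (Long, Long)) => (a._1 + b._1, a._2 + b._2) // across partitions

    println(nums.aggregate((0L, 0L))(seqOp, combOp))        // (5050,100)
    println(nums.treeAggregate((0L, 0L))(seqOp, combOp, 2)) // same result, merged in a tree
    sc.stop()
  }
}
```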

Chapter 8: detailed explanation of the key-value RDD APIs

1. Detailed explanation of the seven parameters of combineByKey (see the sketch after this list).

2. Detailed explanation of the principle of ShuffledRDD.

3. Detailed explanation of the APIs built on combineByKey, including aggregateByKey, reduceByKey, foldByKey and groupByKey, etc.

4. Hands-on practice with combineByKey and points to note when using it.

5. Comparison of reduceByKey and groupByKey, including a comparison of reduce and fold.

6. A first look at the cogroup API, including the cogroup-based join, leftOuterJoin, rightOuterJoin, fullOuterJoin and subtractByKey APIs.

7. Detailed explanation of the principle of cogroup through schematic diagrams and source code.

8. The principle and implementation of join and related APIs.

9. The principle of subtractByKey.

10. The principle of sortByKey and how it uses RangePartitioner for optimization.

11. Counting APIs such as count and countByKey, including approximate-counting APIs.
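
As a hedged sketch of item 1, here the three core function parameters of combineByKey (createCombiner, mergeValue, mergeCombiners) compute a per-key average; the remaining parameters (such as the partitioner and the map-side-combine flag) take their defaults. Data and names are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CombineByKeyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("combine").setMaster("local[2]"))
    val scores = sc.parallelize(Seq(("a", 90), ("b", 70), ("a", 80)))

    val sumCount = scores.combineByKey(
      (v: Int) => (v, 1),                              // createCombiner: first value seen for a key
      (c: (Int, Int), v: Int) => (c._1 + v, c._2 + 1), // mergeValue: fold in another value
      (c1: (Int, Int), c2: (Int, Int)) => (c1._1 + c2._1, c1._2 + c2._2) // mergeCombiners: across partitions
    )
    sumCount.mapValues { case (sum, n) => sum.toDouble / n }
      .collect().foreach(println) // (a,85.0) (b,70.0)
    sc.stop()
  }
}
```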

Chapter 9: binary RDD operations

1. The use and principle of union.

2. The use and principle of intersection.

3. The use and principle of cartesian (Cartesian product).

4. The use and principle of zip.
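
A minimal sketch of all four operations on a local spark 2.x setup; note that zip requires both RDDs to have the same number of partitions and elements.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BinaryOpsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("binary").setMaster("local[2]"))
    val a = sc.parallelize(Seq(1, 2, 3))
    val b = sc.parallelize(Seq(3, 4))

    println(a.union(b).collect().mkString(","))        // 1,2,3,3,4 (keeps duplicates, no shuffle)
    println(a.intersection(b).collect().mkString(",")) // 3 (deduplicates, requires a shuffle)
    println(a.cartesian(b).count())                    // 6 pairs
    println(a.zip(sc.parallelize(Seq("x", "y", "z"))).collect().mkString(","))
    sc.stop()
  }
}
```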

Chapter 10: persist and checkpoint mechanisms

1. RDD's caching mechanism, namely persist (see the sketch after this list).

2. The function and implementation process of checkpoint.

3. The principle of checkpoint.

4. The advantages and disadvantages of localCheckpoint and checkpoint.
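
A hedged sketch contrasting the two mechanisms (paths and data are placeholders): persist keeps the lineage and may recompute lost blocks, while checkpoint writes the RDD to reliable storage and truncates the lineage on the next action.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistCheckpointExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("persist").setMaster("local[2]"))
    sc.setCheckpointDir("/tmp/spark-checkpoints") // placeholder; use HDFS in production

    val rdd = sc.parallelize(1 to 1000).map(_ * 2)
    rdd.persist(StorageLevel.MEMORY_AND_DISK) // cache() is shorthand for persist(MEMORY_ONLY)
    rdd.checkpoint()                          // takes effect on the next action
    println(rdd.count())                      // triggers the computation and the checkpoint
    sc.stop()
  }
}
```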

Chapter 11: Spark reads and writes external data sources

1. Read-write storage systems supported by spark

2. The principle and implementation of HadoopRDD

3. General file formats supported by spark, including text files, CSV files, SequenceFile, object files and MapFile, focusing on the data structure of SequenceFile and points to note (see the sketch after this list).

4. Code for reading and writing HBase data with spark.

5. Detailed explanation of reading and writing the row-oriented file format (Avro) and the column-oriented file format (Parquet).

6. Detailed explanation of the APIs for reading and writing binary data in spark.
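
As a hedged sketch of item 3 (all paths are placeholders and must not already exist): SequenceFile stores key-value pairs of Hadoop Writable-compatible types, while object files fall back to Java serialization.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FileFormatsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("io").setMaster("local[2]"))

    // SequenceFile: the read-side types must match what was written.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
    pairs.saveAsSequenceFile("/tmp/seq-demo")
    println(sc.sequenceFile[String, Int]("/tmp/seq-demo").collect().mkString(","))

    // Text and object files.
    val lines = sc.parallelize(Seq("hello", "world"))
    lines.saveAsTextFile("/tmp/text-demo")
    lines.saveAsObjectFile("/tmp/obj-demo")
    println(sc.objectFile[String]("/tmp/obj-demo").count()) // 2
    sc.stop()
  }
}
```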

Chapter 12: detailed explanation of the broadcast and accumulator APIs, sketched below.
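
A minimal sketch of both mechanisms on a local spark 2.x setup (data and names are illustrative): a broadcast variable ships a read-only lookup table to executors once, and a LongAccumulator carries a counter back to the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shared").setMaster("local[2]"))

    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2)) // read-only on executors
    val misses = sc.longAccumulator("misses")          // spark 2.x accumulator API

    sc.parallelize(Seq("a", "b", "c")).foreach { k =>
      if (!lookup.value.contains(k)) misses.add(1)
    }
    println(misses.value) // 1; accumulator updates are only reliable inside actions
    sc.stop()
  }
}
```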

Chapter 13: actual combat of RDD project

Connects the APIs covered in this course with a project I actually participated in.

Third, spark2.x from shallow to deep series 3: spark core: correctly submitting spark applications

We usually use spark-submit to submit a spark application, so we need to understand the usage and principle of each spark-submit parameter in order to submit spark applications correctly in various business scenarios. Beyond the parameters, we also need to understand how spark-submit itself works, which is the gateway to deeper study of spark.

This part is given as a video; the link is: http://edu.51cto.com/course/11132.html.

The content of this video includes:

Chapter 1: an introduction to the course content

Chapter 2: basic java knowledge

2.1 Starting a JVM with the java command

2.2 Starting a JVM with java's ProcessBuilder

Chapter 3: explain every parameter of spark-submit in detail.

3.1 A first look at spark-submit

3.2 The master and deploy-mode parameters in detail

3.3 The conf parameter in detail

3.4 The driver-related parameters in detail

3.5 The executor-related parameters in detail

3.6 The jars parameter in detail

3.7 The package-related parameters in detail

3.8 The files parameter in detail

3.9 The queue-related parameters in detail

3.10 Correctly submitting python spark applications

3.11 Submitting spark applications from code with SparkLauncher (see the sketch below)
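
As a hedged sketch of item 3.11 (SPARK_HOME must be set; the jar path, class name and config values are illustrative placeholders), SparkLauncher builds and launches what would otherwise be a spark-submit command line:

```scala
import org.apache.spark.launcher.SparkLauncher

object SubmitFromCode {
  def main(args: Array[String]): Unit = {
    // Roughly equivalent to:
    //   spark-submit --master yarn --deploy-mode cluster \
    //     --class com.example.MyApp --conf spark.executor.memory=2g myapp.jar
    val handle = new SparkLauncher()
      .setAppResource("myapp.jar")       // placeholder application jar
      .setMainClass("com.example.MyApp") // placeholder main class
      .setMaster("yarn")
      .setDeployMode("cluster")
      .setConf("spark.executor.memory", "2g")
      .startApplication()                // returns a SparkAppHandle

    println(handle.getState)
  }
}
```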

Chapter 4: the principle of spark-submit

4.1 spark scripting system

4.2 the principle and implementation of spark-class script

4.3 the principle and implementation of spark-daemon script

4.4 SparkSubmit principle and source code analysis

Fourth, spark2.x from shallow to deep series 4: spark core: the scheduler on the driver

From shallow to deep, this part explains how a spark application is split into tasks and how those tasks are scheduled in the cluster, as well as the locality rules, speculation mechanism and blacklist mechanism involved in task scheduling.

This part leans toward principles, but I will explain them in easy-to-understand language, and you will find that what looks very complicated is actually quite simple.

This part is given as a video: it will be released before the National Day holiday.

The general content includes (subject to change; the released video takes precedence):

Chapter 1: course content and environment

Chapter 2: scheduling in a spark application

2.1 Division of the DAG

2.2 The principle of the stage scheduling process

2.3 Task delay scheduling and performance tuning

2.4 Task speculation and its usage scenarios

2.5 Blacklist mechanism and its usage scenarios

2.6 Resource scheduling and its management

Chapter 3: scheduling of multiple spark applications

3.1 Prerequisites for dynamic resource allocation

3.2 The mechanism of dynamic resource allocation (see the sketch below)
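
A hedged sketch of the usual prerequisites from 3.1 (executor counts are illustrative): dynamic allocation must be enabled explicitly, and it normally relies on the external shuffle service so that executors can be released safely.

```scala
import org.apache.spark.SparkConf

object DynamicAllocationConf {
  val conf: SparkConf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true")      // external shuffle service
    .set("spark.dynamicAllocation.minExecutors", "1")
    .set("spark.dynamicAllocation.maxExecutors", "20")
}
```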

Fifth, spark2.x from shallow to deep series 5: spark core: shuffle implementation principles and tuning

In this course, we will thoroughly explain the implementation principle and tuning process of spark's shuffle in easy-to-understand words.

The content of this course will roughly cover: memory management, storage management, the MapOutputTracker implementation, and shuffle management.

This part is given as a video: the release time is yet to be determined.

Sixth, spark2.x from shallow to deep series 6: spark core RDD java api

This is a supplement to series 2, introducing the implementation principles and usage of the RDD java APIs in detail.

This part is given as a blog; the address is: http://7639240.blog.51cto.com/7629240/d-1

It will be updated from time to time with everything I know.

Seventh, spark2.x from shallow to deep series 7: spark core RDD python api

This is a supplement to series 2, introducing the implementation principles and usage of the RDD python APIs in detail.

This part is given as a blog; the address is: http://7639240.blog.51cto.com/7629240/d-2

It is not yet complete and will be updated from time to time.

Eighth, spark2.x from shallow to deep series 8: essential basic knowledge for spark core

This course focuses on an in-depth understanding of three basic components of spark core, plus the basic java knowledge we need in order to understand spark in depth.

Goal:

1: security management of spark

2: the serialization mechanism of spark (see the sketch after this list)

3: the RPC mechanism of spark, including some NIO knowledge points
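
For goal 2, a minimal sketch of the usual serialization tuning step (Point is an illustrative class, not part of spark): switch the serializer to Kryo and register the classes that will cross the wire.

```scala
import org.apache.spark.SparkConf

case class Point(x: Double, y: Double) // illustrative class to register

object SerializationConf {
  val conf: SparkConf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(Array(classOf[Point]))
}
```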

This part is given as a blog; the address is: TBD

Ninth, spark2.x from shallow to deep series 9: spark core cluster resource management mechanisms

This course provides a detailed understanding of the three resource management mechanisms of spark:

1: spark's built-in standalone mode, with an in-depth explanation of how standalone is implemented.

2: hadoop's YARN mode: a thorough understanding of how spark runs tasks on YARN and how to implement a client that submits applications to YARN.

3: mesos mode: a thorough understanding of how spark runs tasks on Mesos and how to implement a client that submits applications to Mesos.

That is the editor's understanding of spark, big data's distributed computing technology. If you happen to have similar doubts, you may refer to the above analysis, and if you want to know more, you are welcome to follow the industry information channel.
