Apache Spark source code analysis: job submission and execution
Many people new to Spark are unsure how a job is actually submitted and run. This article walks through job submission and execution in the Apache Spark source code and, hopefully, answers that question.
Taking WordCount as an example, it describes in detail how Spark creates and runs a job, with emphasis on how the processes and threads involved are created.
Setting up the experimental environment
Ensure that the following conditions are met before taking any further action.
1. Download the Spark 0.9.1 binary
2. Install Scala
3. Install sbt
4. Install Java
Start spark-shell in single-machine mode, that is, local mode
Local mode is the simplest to run. Assuming the current directory is $SPARK_HOME, just run the following command:
MASTER=local bin/spark-shell
"MASTER=local" indicates that you are currently running in stand-alone mode.
Run in local cluster mode
Local cluster mode is a pseudo-cluster mode that simulates a standalone cluster on a single machine. The startup sequence is as follows.
1. Start the master
2. Start the worker
3. Start spark-shell
Start the master
$SPARK_HOME/sbin/start-master.sh
Watch the output at startup; the logs are saved in the $SPARK_HOME/logs directory by default.
The master process runs the class org.apache.spark.deploy.master.Master and starts its web UI listening on port 8080.
Modify configuration
1. Enter the $SPARK_HOME/conf directory
2. Rename spark-env.sh.template to spark-env.sh
3. Modify spark-env.sh and add the following
export SPARK_MASTER_IP=localhost
export SPARK_LOCAL_IP=localhost
Run the worker
bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077 -i 127.0.0.1 -c 1 -m 512m
Once the worker has started, it connects to the master. Open the master's web UI and you can see the connected worker. The master web UI listens at http://localhost:8080.
Start spark-shell
MASTER=spark://localhost:7077 bin/spark-shell
If all goes well, you will see the following prompt.
Created spark context..
Spark context available as sc.
You can open http://localhost:4040 in a browser to view the following tabs:
1. Stages
2. Storage
3. Environment
4. Executors
Wordcount
Once the above environment is ready, let's run the simplest example. Enter the following code in spark-shell:
scala> sc.textFile("README.md").filter(_.contains("Spark")).count
The above code counts the number of lines in README.md that contain the word Spark.
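Although the section is titled WordCount, the line above is really a line count. For reference, a classic word count in the shell would look roughly like the sketch below; this is an illustrative addition (standard flatMap/map/reduceByKey RDD operations on the same README.md), not code from the original article.
scala> val counts = sc.textFile("README.md").flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.take(5).foreach(println)
The reduceByKey step introduces a shuffle, which matters when stages and dependencies are discussed below.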
Detailed explanation of the deployment process
The components that make up a Spark deployment environment are described below.
Driver Program: briefly, the WordCount statement entered in spark-shell corresponds to the Driver Program.
Cluster Manager: this corresponds to the master mentioned above, and mainly plays the role of deployment management.
Worker Node: in contrast to the master, this is a slave node. Executors run on worker nodes, and each executor can correspond to a thread. An executor handles two basic kinds of business logic: one is the driver program; the other is the tasks of a job that, after submission, is split into stages, with each stage able to run one or more tasks.
Note: in cluster mode, the Cluster Manager runs in one JVM process and each worker runs in another JVM process. In local cluster mode these JVM processes are all on the same machine; in a real standalone, Mesos, or YARN cluster, the workers and the master are distributed across different hosts.
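To make the Driver Program role concrete, here is a minimal sketch of a standalone driver, assuming the Spark 0.9.x API and the standalone master started above; the object name LineCount and the output message are made up for illustration.
import org.apache.spark.SparkContext

object LineCount {
  def main(args: Array[String]) {
    // Connect to the standalone master started earlier; "LineCount" is just the application name.
    val sc = new SparkContext("spark://localhost:7077", "LineCount")
    val n = sc.textFile("README.md").filter(_.contains("Spark")).count()
    println("Lines containing Spark: " + n)
    sc.stop()
  }
}
When spark-shell is used instead, the shell itself acts as the driver program and the SparkContext is created for you as sc.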
Job creation and execution
The simplified process of creating a job is as follows:
1. First, the application creates an instance of SparkContext, for example sc.
2. The SparkContext instance is used to create the initial RDD.
3. A series of transformations converts the original RDD into RDDs of other types.
4. When an action is applied to the transformed RDD, the runJob method of SparkContext is called.
5. The call to sc.runJob is the starting point for the chain of reactions that follows; this is where the critical transition takes place.
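As a hedged illustration of steps 4 and 5 (this snippet is not from the original article; the names follow the Spark 0.9.x API): an action such as count ultimately goes through SparkContext.runJob, and runJob can also be invoked directly in the shell.
scala> val rdd = sc.textFile("README.md").filter(_.contains("Spark"))
scala> val perPartition = sc.runJob(rdd, (iter: Iterator[String]) => iter.size.toLong)  // one result per partition
scala> val total = perPartition.sum  // same value as rdd.count()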
The calling path is roughly as follows
1. sc.runJob -> dagScheduler.runJob -> submitJob
2. DAGScheduler::submitJob creates a JobSubmitted event and sends it to the embedded eventProcessActor.
3. After receiving JobSubmitted, eventProcessActor calls the processEvent handler.
4. The job is converted into stages: the finalStage is generated and submitted for running; the key step is the call to submitStage.
5. submitStage computes the dependencies between stages, which fall into two kinds: wide dependencies and narrow dependencies (see the sketch after this list).
6. If the current stage is found to have no dependencies, or all of its dependencies are already prepared, its tasks are submitted.
7. Submitting the task is done by calling the function submitMissingTasks.
8. Which worker a task actually runs on is managed by TaskScheduler; that is, submitMissingTasks above calls TaskScheduler::submitTasks.
9. In TaskSchedulerImpl, the corresponding backend is created according to Spark's current running mode; when running on a single machine, a LocalBackend is created.
10. LocalBackend receives the ReviveOffers event passed in by TaskSchedulerImpl.
11. ReviveOffers -> executor.launchTask -> TaskRunner.run
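As mentioned in step 5, the sketch below annotates the word-count pipeline shown earlier with its dependency kinds. It is an illustrative addition, not code from the Spark source: flatMap and map keep narrow dependencies, while reduceByKey introduces a wide (shuffle) dependency and therefore a stage boundary.
val words  = sc.textFile("README.md").flatMap(_.split("\\s+"))  // narrow dependency
val pairs  = words.map(word => (word, 1))                       // narrow dependency
val counts = pairs.reduceByKey(_ + _)                           // wide (shuffle) dependency: new stage
counts.count()                                                  // action: the DAGScheduler builds and submits the stages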
Code snippet: Executor.launchTask
def launchTask(context: ExecutorBackend, taskId: Long, serializedTask: ByteBuffer) {
  val tr = new TaskRunner(context, taskId, serializedTask)
  runningTasks.put(taskId, tr)
  threadPool.execute(tr)
}
After all of this, the point is that the final processing logic actually runs inside a TaskRunner in the executor.
The computation result is wrapped in a MapStatus and then fed back to the DAGScheduler through a series of internal messages; that part of the process is not overly complex.
After reading the above, you should have a good grasp of how a job is submitted and run in Apache Spark. Thank you for reading!