Apache Spark source code analysis: job submission and execution
Many people new to Spark are unsure how a job is actually submitted and run. This article walks through job submission and execution in the Apache Spark source code and, hopefully, answers that question.
Taking WordCount as an example, it describes in detail how Spark creates and runs a job, with emphasis on how the processes and threads involved are created.
Setting up the experimental environment
Ensure that the following conditions are met before taking any further action.
1. Download the Spark 0.9.1 binary
2. Install Scala
3. Install sbt
4. Install Java
Start spark-shell in single-machine mode, that is, local mode
Local mode is the simplest to run. Assuming the current directory is $SPARK_HOME, just run the following command:
MASTER=local bin/spark-shell
"MASTER=local" indicates that you are currently running in stand-alone mode.
Run in local cluster mode
Local cluster mode is a pseudo-cluster mode that simulates a standalone cluster on a single machine. The startup sequence is as follows.
1. Start the master
2. Start the worker
3. Start spark-shell
Start the master
$SPARK_HOME/sbin/start-master.sh
Watch the output at startup; the logs are saved in the $SPARK_HOME/logs directory by default.
The master process runs the class org.apache.spark.deploy.master.Master and starts its web UI listening on port 8080.
Modify configuration
1. Enter the $SPARK_HOME/conf directory
2. Rename spark-env.sh.template to spark-env.sh
3. Modify spark-env.sh and add the following
export SPARK_MASTER_IP=localhost
export SPARK_LOCAL_IP=localhost
Run the worker
bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077 -i 127.0.0.1 -c 1 -m 512m
Once the worker has started, it connects to the master. Open the master's web UI and you can see the connected worker. The master web UI listens at http://localhost:8080.
Start spark-shell
MASTER=spark://localhost:7077 bin/spark-shell
If all goes well, you will see the following prompt.
Created spark context..
Spark context available as sc.
You can open http://localhost:4040 in a browser to view the following tabs:
1. Stages
2. Storage
3. Environment
4. Executors
Wordcount
Once the above environment is ready, let's run the simplest example. Enter the following code in spark-shell:
scala> sc.textFile("README.md").filter(_.contains("Spark")).count
The above code counts the number of lines in README.md that contain the word Spark.
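Although the section is titled WordCount, the line above is really a line count. For reference, a classic word count in the shell would look roughly like the sketch below; this is an illustrative addition (standard flatMap/map/reduceByKey RDD operations on the same README.md), not code from the original article.
scala> val counts = sc.textFile("README.md").flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.take(5).foreach(println)
The reduceByKey step introduces a shuffle, which matters when stages and dependencies are discussed below.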
Detailed explanation of the deployment process
The components that make up a Spark deployment environment are described below.
Driver Program: briefly, the WordCount statement entered in spark-shell corresponds to the Driver Program.
Cluster Manager: this corresponds to the master mentioned above, and mainly plays the role of deployment management.
Worker Node: in contrast to the master, this is a slave node. Executors run on worker nodes, and each executor can correspond to a thread. An executor handles two basic kinds of business logic: one is the driver program; the other is the tasks of a job that, after submission, is split into stages, with each stage able to run one or more tasks.
Note: in cluster mode, the Cluster Manager runs in one JVM process and each worker runs in another JVM process. In local cluster mode these JVM processes are all on the same machine; in a real standalone, Mesos, or YARN cluster, the workers and the master are distributed across different hosts.
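To make the Driver Program role concrete, here is a minimal sketch of a standalone driver, assuming the Spark 0.9.x API and the standalone master started above; the object name LineCount and the output message are made up for illustration.
import org.apache.spark.SparkContext

object LineCount {
  def main(args: Array[String]) {
    // Connect to the standalone master started earlier; "LineCount" is just the application name.
    val sc = new SparkContext("spark://localhost:7077", "LineCount")
    val n = sc.textFile("README.md").filter(_.contains("Spark")).count()
    println("Lines containing Spark: " + n)
    sc.stop()
  }
}
When spark-shell is used instead, the shell itself acts as the driver program and the SparkContext is created for you as sc.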
Job creation and execution
The simplified process of creating a job is as follows:
1. First, the application creates an instance of SparkContext, for example sc.
2. The SparkContext instance is used to create the initial RDD.
3. A series of transformations converts the original RDD into RDDs of other types.
4. When an action is applied to the transformed RDD, the runJob method of SparkContext is called.
5. The call to sc.runJob is the starting point for the chain of reactions that follows; this is where the critical transition takes place.
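As a hedged illustration of steps 4 and 5 (this snippet is not from the original article; the names follow the Spark 0.9.x API): an action such as count ultimately goes through SparkContext.runJob, and runJob can also be invoked directly in the shell.
scala> val rdd = sc.textFile("README.md").filter(_.contains("Spark"))
scala> val perPartition = sc.runJob(rdd, (iter: Iterator[String]) => iter.size.toLong)  // one result per partition
scala> val total = perPartition.sum  // same value as rdd.count()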
The calling path is roughly as follows
1. sc.runJob -> dagScheduler.runJob -> submitJob
2. DAGScheduler::submitJob creates a JobSubmitted event and sends it to the embedded eventProcessActor.
3. After receiving JobSubmitted, eventProcessActor calls the processEvent handler.
4. The job is converted into stages: the finalStage is generated and submitted for running; the key step is the call to submitStage.
5. submitStage computes the dependencies between stages, which fall into two kinds: wide dependencies and narrow dependencies (see the sketch after this list).
6. If the current stage is found to have no dependencies, or all of its dependencies are already prepared, its tasks are submitted.
7. Submitting the task is done by calling the function submitMissingTasks.
8. Which worker a task actually runs on is managed by TaskScheduler; that is, submitMissingTasks above calls TaskScheduler::submitTasks.
9. In TaskSchedulerImpl, the corresponding backend is created according to Spark's current running mode; when running on a single machine, a LocalBackend is created.
10. LocalBackend receives the ReviveOffers event passed in by TaskSchedulerImpl.
11. ReviveOffers -> executor.launchTask -> TaskRunner.run
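As mentioned in step 5, the sketch below annotates the word-count pipeline shown earlier with its dependency kinds. It is an illustrative addition, not code from the Spark source: flatMap and map keep narrow dependencies, while reduceByKey introduces a wide (shuffle) dependency and therefore a stage boundary.
val words  = sc.textFile("README.md").flatMap(_.split("\\s+"))  // narrow dependency
val pairs  = words.map(word => (word, 1))                       // narrow dependency
val counts = pairs.reduceByKey(_ + _)                           // wide (shuffle) dependency: new stage
counts.count()                                                  // action: the DAGScheduler builds and submits the stages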
Code snippet: Executor.launchTask
def launchTask(context: ExecutorBackend, taskId: Long, serializedTask: ByteBuffer) {
  val tr = new TaskRunner(context, taskId, serializedTask)
  runningTasks.put(taskId, tr)
  threadPool.execute(tr)
}
After all of this, the point is that the final processing logic actually runs inside a TaskRunner in the executor.
The computation result is wrapped in a MapStatus and then fed back to the DAGScheduler through a series of internal messages; that part of the process is not overly complex.
After reading the above, you should have a good grasp of how a job is submitted and run in Apache Spark. Thank you for reading!