How does PySpark work under the hood? Many newcomers are unsure about this, so this article explains the implementation principle of PySpark and analyzes how a PySpark application runs.
Spark is mainly developed in Scala. To make integration with other systems easier without introducing Scala dependencies, some components, such as the External Shuffle Service, are implemented in Java. In general, Spark is implemented in JVM languages and runs on the JVM. However, besides the Scala/Java APIs, Spark also provides development interfaces for Python, R, and other languages. To keep the core implementation independent, Spark only adds wrappers around the core to support these languages. The following sections introduce the implementation principle of PySpark and analyze how a PySpark application runs.
Spark Runtime Architecture
First, let's review Spark's basic runtime architecture. As shown in the figure below, the orange parts are JVM processes. A Spark application is mainly divided into a Driver and Executors: the Driver is responsible for scheduling and UI presentation, while Executors are responsible for running Tasks. Spark can be deployed on a variety of resource management systems, such as YARN and Mesos, and Spark itself also ships with a simple Standalone resource manager, so it can run without an external resource management system.
The user's Spark application runs in the Driver (to some extent, the user's program is the Spark Driver program). After Spark's scheduling, it is packaged into Tasks, and the Task information is then sent to Executors for execution. Task information includes both code logic and data information; an Executor does not run the user's code directly.
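To make the Driver/Executor split concrete, here is a minimal PySpark sketch (the application name, master setting, and data are hypothetical): the script itself runs in the Driver process, while the lambda inside map is shipped to Executors as part of Tasks.

```python
# A minimal PySpark application sketch; runs in local mode for illustration.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("driver-demo").setMaster("local[2]")  # hypothetical settings
sc = SparkContext(conf=conf)

# Everything up to here executes in the Driver process.
rdd = sc.parallelize(range(100), 4)           # 4 partitions -> 4 Tasks after scheduling
squared_sum = rdd.map(lambda x: x * x).sum()  # the lambda is shipped to Executors inside Tasks
print(squared_sum)                            # 328350

sc.stop()
```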
PySpark Runtime Architecture
In order not to break Spark's existing runtime architecture, Spark wraps a layer of Python API around it and uses Py4j to enable interaction between Python and Java, so that Spark applications can be written in Python. The runtime architecture is shown in the figure below.
The white parts are the newly added Python processes. On the driver side, calling Java from Python is implemented through Py4j; that is, the PySpark program written by the user is "mapped" into the JVM. For example, when the user instantiates a Python SparkContext in PySpark, a Scala SparkContext object is eventually instantiated in the JVM. On the executor side, Py4j is not needed, because the Task logic running there is sent by the Driver as serialized bytecode, which may contain user-defined Python functions or lambda expressions, and Py4j cannot be used here to call Python from Java. To run user-defined Python functions or lambda expressions on the executor side, a separate Python process is started for each Task, and the Python function or lambda expression is sent to that Python process over a socket for execution. The overall language-level interaction flow is shown in the figure below: solid lines represent method calls, and dotted lines represent returned results.
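As a small, hedged illustration of this driver-side mapping (it peeks at PySpark's internal attributes _jsc and _jvm, which are implementation details rather than public API): instantiating a Python SparkContext creates a JVM-side SparkContext that Python reaches through the Py4j gateway.

```python
# Sketch of the driver-side Python -> JVM mapping via Py4j.
# Note: _jsc and _jvm are PySpark internals, used here only to observe the mechanism.
from pyspark import SparkContext

sc = SparkContext("local[1]", "py4j-demo")

print(type(sc._jsc))           # a py4j JavaObject: the proxy for the JVM-side JavaSparkContext
print(sc._jsc.sc().appName())  # calls into the Scala SparkContext created in the JVM
print(sc._jvm.java.lang.System.getProperty("java.version"))  # an arbitrary Java call made from Python

sc.stop()
```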
The following is a detailed analysis of how the PySpark Driver works and how the Executor runs Tasks.
Operation Principle of the Driver Side
When we submit a PySpark program via spark-submit, the Python script and its dependencies are uploaded first, and Driver resources are requested. Once the Driver resources are allocated, a JVM is launched through PythonRunner (which contains the main method), as shown in the figure below.
The PythonRunner entry main function mainly does two things:
Start the Py4j GatewayServer
Launch the user's uploaded Python script through a Java Process
After the user's Python script starts, it first instantiates the Python version of the SparkContext object, and two things are done during instantiation:
Instantiate a Py4j GatewayClient and connect to the Py4j GatewayServer in the JVM; subsequent calls from Python to Java go through this Py4j Gateway.
Instantiate the JVM-side SparkContext object through the Py4j Gateway.
After the above two steps are completed, the SparkContext object is initialized, the Driver is up, Executor resources have been requested, and tasks can be scheduled. The processing logic defined in the user's Python script eventually triggers a Job submission when an action method is encountered. The Job is submitted by directly calling the Java method PythonRDD.runJob through Py4j, which maps into the JVM and goes to sparkContext.runJob. After the Job finishes, the JVM opens a local socket and waits for the Python process to pull the results; correspondingly, after calling PythonRDD.runJob, the Python process pulls the results through that socket.
If we pull the Driver part out of the earlier runtime architecture diagram, as shown in the figure below, the JVM and Python processes are launched through the PythonRunner entry main function; the JVM process corresponds to the orange part of the figure, and the Python process corresponds to the white part. The Python process submits the Job by calling Java methods through Py4j, and the Job's results are pulled back to the Python process through a local socket. One more point: for large data, such as broadcast variables, the Python and JVM processes exchange data through the local file system to reduce inter-process data transfer.
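A minimal sketch of this submission path from the user's point of view (the RDD and its contents are made up): transformations only build the lineage in the Python process, and it is the action, collect() here, that goes through Py4j into the JVM and then pulls the results back over the local socket.

```python
# Sketch of the driver-side job submission path described above.
from pyspark import SparkContext

sc = SparkContext("local[2]", "runjob-demo")

rdd = sc.parallelize(range(10), 2).map(lambda x: x + 1)  # only builds the lineage; no Job yet
result = rdd.collect()  # action: Python -> Py4j -> PythonRDD.runJob -> sparkContext.runJob in the JVM
print(result)           # the JVM serves the partitions on a local socket and Python pulls them here

sc.stop()
```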
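For illustration, here is a small broadcast-variable example (the lookup table is made up); the file-system handoff mentioned above is an internal detail and is transparent to user code.

```python
# A broadcast variable: large shared data that Python and the JVM exchange
# through the local file system rather than over Py4j, as described above.
from pyspark import SparkContext

sc = SparkContext("local[2]", "broadcast-demo")

lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})   # serialized once and shared with executors
mapped = sc.parallelize(["a", "b", "c"], 2).map(lambda k: lookup.value[k]).collect()
print(mapped)   # [1, 2, 3]

sc.stop()
```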
Operation Principle of the Executor Side
For convenience, take Spark on YARN as an example. After the Driver requests Executor resources, the JVM is launched through CoarseGrainedExecutorBackend (which contains a main method); it starts some necessary services and waits for the Driver to send Tasks. Before any Task has been sent, there is no Python process on the Executor. After receiving a Task from the Driver, the Executor's internal operation is shown in the figure below.
After receiving a Task, the Executor runs it through launchTask and eventually calls PythonRDD's compute method to process the data of one partition. The compute method of PythonRDD roughly consists of three steps:
If there is no pyspark.daemon background Python process yet, start it through a Java Process; note that there is only one pyspark.daemon background process per Executor. Otherwise, connect to pyspark.daemon directly through a socket and request it to start a pyspark.worker process to run the user-defined Python function or lambda expression. pyspark.daemon is a typical multi-process server: when a socket request arrives, it forks a pyspark.worker process to handle it. However many Tasks are running concurrently on an Executor, there are that many corresponding pyspark.worker processes.
Next, a separate thread feeds data to the pyspark.worker process, and pyspark.worker calls the user-defined Python function or lambda expression to process the data.
While the data is being fed on one side, the other side pulls pyspark.worker's computation results through the socket.
If we pull the Executor part out of the earlier runtime architecture diagram, as shown in the figure below, the orange parts are JVM processes and the white parts are Python processes. Each Executor has one shared pyspark.daemon process, which is responsible for receiving Task requests and forking pyspark.worker processes to handle each Task separately. During actual data processing, the pyspark.worker processes and the JVM Tasks communicate frequently over local sockets.
In general, PySpark uses Py4j to let Python call Java and drive the Spark application; in essence, the JVM does the running, and results are returned from Java to Python through a local socket. Although this architecture keeps Spark's core code independent, in big-data scenarios the frequent data communication between the JVM and Python processes causes considerable performance loss, and in bad cases the application may even hang. It is therefore recommended to be cautious about using PySpark for large-scale machine learning or Streaming scenarios and to write such applications in native Scala/Java as far as possible; for simple offline tasks on small to medium data, PySpark can be used to develop and submit jobs quickly.
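The following sketch makes the per-Task Python workers visible from user code (process IDs will vary, and workers may be reused across Tasks): each partition is handled by a pyspark.worker process, not by the JVM.

```python
# Sketch: observe the pyspark.worker processes that run user Python code.
import os
from pyspark import SparkContext

sc = SparkContext("local[4]", "worker-demo")

# One partition -> one Task; the lambda runs inside a pyspark.worker forked by pyspark.daemon.
pids = sc.parallelize(range(8), 4).mapPartitions(lambda it: [os.getpid()]).collect()
print(set(pids))   # typically several distinct worker PIDs, up to one per concurrent Task

sc.stop()
```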
After reading the above, have you grasped the implementation principle of PySpark? Thank you for reading!