
How to make the PySpark of CDSW automatically adapt to the Python version


Today I will talk about how to make CDSW's PySpark automatically adapt to the Python version. Many people may not know much about this, so to help you understand better I have summarized the following content. I hope you get something out of this article.

The Python environment of Spark2 in a CDH cluster defaults to Python2. In CDSW, when you start a Session you can choose Python2 or Python3 as the Engine Kernel version. When Python3 is selected, a PySpark job developed in that Session fails at run time with: "Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set". To solve this version-adaptation problem, we make the following adjustments so that our application automatically adapts to the Python version.
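Before making any changes, it can help to confirm what the driver session is actually using. A minimal sketch, run inside a CDSW session; it only reads the driver's interpreter version and the two variables named in the error message, either of which may be unset:

import os
import sys

# Driver-side interpreter version, e.g. (2, 7) or (3, 6)
print(tuple(sys.version_info[:2]))
# The two variables the error message asks you to check
print(os.environ.get('PYSPARK_PYTHON'))
print(os.environ.get('PYSPARK_DRIVER_PYTHON'))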

Cluster deployment of multiple Python versions

We install Python using the Anaconda Parcel packages supported by CDH, and solve the multiple-version problem by installing two Python Parcel packages on the CDH cluster at the same time. If Spark should support Python2 or Python3 by default, activate the corresponding Parcel. In my cluster the Python2 Parcel is activated by default, so the following steps mainly describe preparing the Python3 environment.

By default, Spark2 uses the Python2 environment.

1. The download addresses for the Python2 version of Anaconda are as follows:

https://repo.anaconda.com/pkgs/misc/parcels/Anaconda-2019.07-el7.parcel

https://repo.anaconda.com/pkgs/misc/parcels/Anaconda-2019.07-el7.parcel.sha

https://repo.anaconda.com/pkgs/misc/parcels/manifest.json

2. The download addresses for the Python3 version of Anaconda are as follows:

https://repo.anaconda.com/pkgs/misc/parcels/archive/Anaconda-5.1.0.1-el7.parcel

https://repo.anaconda.com/pkgs/misc/parcels/archive/Anaconda-5.1.0.1-el7.parcel.sha

https://repo.anaconda.com/pkgs/misc/parcels/archive/manifest.json

3. Deploy the downloaded Parcel packages to the cluster's private HTTP service (see the sketch after this list)

4. Log in to Cloudera Manager with an administrator account, open the Parcel package management page, and configure the Anaconda repository address

5. After the Parcel address is configured, download and distribute the corresponding version of the Parcel package.

The Python3 Parcel does not need to be activated here: if it stays inactive, PySpark defaults to the Python2 environment; if it is activated, PySpark uses the Python3 environment.

6. Confirm that both the Python2 and Python3 environments exist on every node in the cluster (a check sketch follows this list)
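For step 3, any HTTP server will do; one convenient option is Python's built-in server, started from the directory holding the Parcel files, e.g. python -m http.server 8900 on Python3 or python -m SimpleHTTPServer 8900 on Python2 (the port is arbitrary). For step 6, here is a minimal sketch for checking a node; the interpreter paths match the parcel locations used in the adaptation code below and may differ on your cluster:

import subprocess

# Parcel interpreter paths assumed in this article; adjust if your
# parcels live elsewhere under /opt/cloudera/parcels.
interpreters = [
    '/opt/cloudera/parcels/Anaconda/bin/python',          # Python2 parcel
    '/opt/cloudera/parcels/Anaconda-5.1.0.1/bin/python',  # Python3 parcel
]

for path in interpreters:
    # "python --version" writes to stderr on Python2 and stdout on Python3,
    # so merge the two streams before decoding.
    out = subprocess.check_output([path, '--version'], stderr=subprocess.STDOUT)
    print(path, '->', out.decode().strip())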

Making PySpark in CDSW adapt automatically to the Python version

To make our PySpark code automatically adapt to different Python versions, we need to set up the environment before initializing Spark: add the following code ahead of the Spark initialization so it adapts to the Python version at run time.

import os

# CDSW exposes the active conda environment via CONDA_DEFAULT_ENV;
# point the executors at the matching Anaconda parcel.
py_environ = os.environ['CONDA_DEFAULT_ENV']
if py_environ == 'python2.7':
    os.environ['PYSPARK_PYTHON'] = '/opt/cloudera/parcels/Anaconda/bin/python'
else:
    os.environ['PYSPARK_PYTHON'] = '/opt/cloudera/parcels/Anaconda-5.1.0.1/bin/python'
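The order matters: PYSPARK_PYTHON must be set before the SparkContext is created, because the executors read it at launch time. A minimal sketch of what follows the snippet above (the application name is arbitrary):

from pyspark.sql import SparkSession

# Only create the session after PYSPARK_PYTHON has been set above;
# changing the variable afterwards has no effect on running executors.
spark = SparkSession.builder.appName('python-version-adapt').getOrCreate()
print(os.environ['PYSPARK_PYTHON'])  # the interpreter the executors will use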

Verifying the automatic Python version adaptation

1. Start a Session with the Python2 environment

2. Run a PySpark job; the test runs normally

3. Start a Session with the Python3 environment

4. Run a PySpark job; the test runs normally (a sample test job follows this list)
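As a hedged example of such a test job, the following compares the driver's Python version with what the executors report; after the adaptation code, the major.minor pairs should match in both a Python2 and a Python3 Session (the partition count of 4 is arbitrary):

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('version-check').getOrCreate()
sc = spark.sparkContext

# Driver-side version, e.g. (3, 6) in a Python3 Session
print('driver :', tuple(sys.version_info[:2]))

# Executor-side versions, collected from a trivial job; distinct() should
# return a single value that matches the driver.
workers = (sc.parallelize(list(range(4)), 4)
             .map(lambda _: tuple(sys.version_info[:2]))
             .distinct()
             .collect())
print('workers:', workers)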

Summary

When multiple Python versions are deployed in the cluster at the same time, we can dynamically point PYSPARK_PYTHON at the Python environment we need from within the PySpark code itself.

After reading the above, do you have a better understanding of how to make CDSW's PySpark automatically adapt to the Python version? If you want to learn more, please follow the industry information channel. Thank you for your support.
