Today I would like to share some knowledge about how PyCharm uses pyspark to remotely connect to a Spark cluster. The content is detailed and the logic is clear. Most people probably know too little about this topic, so this article is shared for your reference. I hope you get something out of it.
0 Background
For work I need to use Spark for machine learning, which means operating on a Spark cluster. So I use PyCharm and pyspark to connect remotely to the cluster. The problems encountered and the solutions are recorded here.
The steps mainly follow existing write-ups, but specific problems still need to be analyzed case by case.
1 Method
1.1 Software configuration
Spark 2.3.3, Hadoop 2.6, Python 3
1.2 Spark configuration
The Python version must be consistent across every node of the Spark cluster. Add the following line to $SPARK_HOME/conf/spark-env.sh on each node (adjust the path to your own installation directory):
export PYSPARK_PYTHON=/home/hadoop/anaconda2/bin/python3
This step tells Spark which Python interpreter to use.
After this, entering pyspark on the server command line opens the Spark shell normally.
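A quick way to confirm that both the driver and the executors pick up Python 3 is a small check run inside the pyspark shell (sc is predefined there). This is only a sketch; the exact version strings depend on your installation.

import sys
print(sys.version)  # Python used by the driver
# Python used by an executor (returns a list with one version string)
print(sc.parallelize([1]).map(lambda _: __import__('sys').version).collect())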
1.3 Local configuration
1.3.1 First copy the Spark 2.3.3 distribution from the server to the local machine.
Note: my cluster was installed with spark-2.3.3-bin-without-hadoop, but after copying it to the local machine the program kept failing with a "Java gateway process exited" error. Copying the Hadoop 2.6 package from the server and loading it into the program also produced errors.
In the end I downloaded spark-2.3.3-bin-hadoop2.6 directly from the Spark website, and with that it worked.
The pyspark version should match the Spark version as closely as possible, for example pyspark 2.3.3 with Spark 2.3.3.
# os.environ['SPARK_HOME'] = r"F:\big_data\spark-2.3.3-bin-without-hadoop"   # useless
os.environ['SPARK_HOME'] = r"F:\big_data\spark-2.3.3-bin-hadoop2.6"   # useful
# os.environ["HADOOP_HOME"] = r"F:\big_data\hadoop-2.6.5"   # useless
# os.environ['JAVA_HOME'] = r"F:\Java\jdk1.8.0_144"   # useless
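A quick way to check that the locally installed pyspark package matches the cluster's Spark version, as mentioned above (a small sketch; it assumes pyspark is installed in the local interpreter):

import pyspark
print(pyspark.__version__)  # expect 2.3.3, matching the cluster's Spark version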
1.3.2
Add a mapping between the IP address of the Spark cluster's Master node and a hostname to the hosts file at C:\Windows\System32\...\hosts on the Windows machine. Administrator permissions are required to modify it.
Here spark_cluster is the hostname mapped to the Master's IP. (You can also write the IP directly; the hostname is just for convenience.)
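For illustration, the hosts entry might look like the line below; the IP address is a made-up example, so replace it with your Master node's real IP.

192.168.1.100    spark_cluster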
1.3.3
Add the python directory from the Spark distribution you just downloaded and extracted to the Project Structure of the PyCharm project, so that pyspark and the bundled py4j can be imported. An equivalent done in code is sketched below.
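If you prefer not to touch the PyCharm project structure, the same directories can be put on sys.path in code. This is only a sketch under the assumption of the SPARK_HOME path used above; the py4j zip name varies by Spark version, so it is matched with a glob.

import glob
import os
import sys

# Same SPARK_HOME as in the snippet above (adjust to your own path).
spark_home = r"F:\big_data\spark-2.3.3-bin-hadoop2.6"
os.environ['SPARK_HOME'] = spark_home

# Make pyspark and the bundled py4j importable.
sys.path.append(os.path.join(spark_home, 'python'))
sys.path.extend(glob.glob(os.path.join(spark_home, 'python', 'lib', 'py4j-*-src.zip')))

from pyspark import SparkConf, SparkContext  # should now import without error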
1.3.4
Create a new .py file, then open Edit Configurations and add a SPARK_HOME environment variable.
Note: in practice this seems to work even without that step; it is enough to set SPARK_HOME in the code itself, e.g. os.environ['SPARK_HOME'] = ... as above.
2 Test

import os
from pyspark import SparkConf
from pyspark import SparkContext

# os.environ['SPARK_HOME'] = r"F:\big_data\spark-2.3.3-bin-without-hadoop"
os.environ['SPARK_HOME'] = r"F:\big_data\spark-2.3.3-bin-hadoop2.6"
# os.environ["HADOOP_HOME"] = r"F:\big_data\hadoop-2.6.5"
# os.environ['JAVA_HOME'] = r"F:\Java\jdk1.8.0_144"

conf = SparkConf().setMaster("spark://spark_cluster:7077").setAppName("test")
sc = SparkContext(conf=conf)
print(1)
logData = sc.textFile("file:///opt/spark-2.3.3-bin-without-hadoop/README.md").cache()
print(2)
print("num of a", logData.filter(lambda line: "a" in line).count())
sc.stop()

These are all the contents of the article "how pycharm uses pyspark to remotely connect to spark clusters". Thank you for reading! I hope you gained something from it.