In this article I will share how to install Spark for Python. Many people are not very familiar with the process, so this walkthrough is offered for your reference; I hope you learn something useful from it.
I. Versions used
Java JDK 1.8.0_111
Python 3.9.6
Spark 3.1.2
Hadoop 3.2.2
II. Configure the environment
1. Configure the JDK
Download and install the corresponding version of the JDK from the official website, then configure the environment variables:
(1) In the system variables, create a new JAVA_HOME and set its value to the directory where you installed the JDK.
(2) Create a new CLASSPATH.
Variable value: .;%JAVA_HOME%\lib\dt.jar;%JAVA_HOME%\lib\tools.jar; (note the leading dot and the semicolons)
(3) Click Path
Create a new entry in it: %JAVA_HOME%\bin
(4) confirm after configuration.
(5) Verify: open cmd and run java -version and javac.
If both commands print version information, the JDK environment variables are configured successfully.
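As an extra sanity check, the small Python sketch below (my own addition, not part of the original walkthrough) confirms that JAVA_HOME is set and that java is reachable; it assumes a standard Windows JDK install.

import os
import subprocess

# Confirm JAVA_HOME points at an existing directory
java_home = os.environ.get("JAVA_HOME")
print("JAVA_HOME =", java_home)
if not java_home or not os.path.isdir(java_home):
    print("JAVA_HOME is missing or set incorrectly")

# `java -version` writes its output to stderr
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(result.stderr.strip())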
2. Configure Spark
(1) download and install:
Spark official website: spark-3.1.2-bin-hadoop3.2 download address
(2) Decompress it and configure the environment variables, setting SPARK_HOME to the extracted directory.
(3) Click Path, create a new entry %SPARK_HOME%\bin, and confirm.
(4) Verify: enter pyspark in cmd.
This reminds us to install Hadoop.
3. Configure Hadoop
(1) download:
Hadoop official website: Hadoop 3.2.2 download address
(2) Decompress it and configure the environment variables, setting HADOOP_HOME to the extracted directory.
Note: after extraction, the bin folder may be missing the required Windows utility files (winutils.exe and its matching hadoop.dll); they can be downloaded here:
Download address: https://github.com/cdarlint/winutils
Configure the environment variable CLASSPATH: %HADOOP_HOME%\bin\winutils.exe
(3) Click Path, create a new entry %HADOOP_HOME%\bin, and confirm.
(4) Verify: enter pyspark in cmd again.
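Before re-running pyspark, a quick hedged check from Python that the winutils file from step (2) is actually in place might look like this (it assumes HADOOP_HOME is already set as above):

import os

hadoop_home = os.environ.get("HADOOP_HOME", "")
winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
# Spark on Windows expects winutils.exe under %HADOOP_HOME%\bin
print("winutils found:" if os.path.isfile(winutils) else "winutils missing:", winutils)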
Re-running pyspark now succeeds, but the following warning appears:
WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
This is caused by changes in Spark 3.x; the warning does not appear with Spark 2.4.6.
Solutions that keep the current version (not tried here, since it is only a warning):
Method 1: solution 1
Method 2: solution 2
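If you only want to keep the warning out of your own scripts (this hides the message rather than fixing the underlying 3.x change), a minimal sketch is to raise the log level after the session is created:

import findspark
findspark.init()  # assumes SPARK_HOME is configured as above

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Only ERROR-level messages will be printed from here on
spark.sparkContext.setLogLevel("ERROR")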
III. Configure Spark in PyCharm
(1) Run- > Edit Configurations
(2) configure Environment Variables
(3) File -> Settings -> Project Structure -> Add Content Root
Add the two archives under spark-3.1.2-bin-hadoop3.2\python\lib (pyspark.zip and the py4j-*-src.zip); an alternative that avoids content roots is sketched at the end of this section.
Select the result:
(4) testing
# Add this code first to initialize findspark
import findspark
findspark.init()

from datetime import datetime, date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([
    (1, 2, 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
    (2, 3, 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
    (3, 4, 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0)),
])
df = spark.createDataFrame(rdd, schema=['a', 'b', 'c', 'd', 'e'])
df.show()
Running result:
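As mentioned in step (3), if you prefer not to rely on PyCharm's content roots, a rough alternative is to append the same two archives to sys.path at the top of the script; the path below is an assumption, so point it at your own extraction directory:

import os
import sys

# Assumed location of the extracted Spark distribution; adjust as needed
spark_home = r"C:\spark-3.1.2-bin-hadoop3.2"
lib_dir = os.path.join(spark_home, "python", "lib")

sys.path.append(os.path.join(spark_home, "python"))
for name in os.listdir(lib_dir):
    if name.endswith(".zip"):  # pyspark.zip and the py4j-*-src.zip
        sys.path.append(os.path.join(lib_dir, name))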
IV. Use an Anaconda Python environment to configure Spark
1. Create a virtual environment
conda create -n pyspark_env python==3.9.6
View the environment:
conda env list
Running result:
2. Install pyspark
Switch to the pyspark_env environment and install pyspark:
pip install pyspark
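A quick way to confirm the package landed in the active environment (run with pyspark_env activated):

import pyspark

# Prints the installed PySpark version, e.g. 3.1.2
print(pyspark.__version__)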
3. Environment configuration
When running the above example in this environment, the following error occurs:
This means we need to configure py4j and SPARK_HOME.
SPARK_HOME:
PYTHONPATH settings:
HADOOP_HOME settings:
Set in path:
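If you would rather not touch the system-wide variables, a minimal sketch that sets the same values per script before Spark is initialized is shown below; the directories are assumptions, so replace them with your own install paths:

import os

# Assumed extraction directories; change these to match your machine
os.environ["SPARK_HOME"] = r"C:\spark-3.1.2-bin-hadoop3.2"
os.environ["HADOOP_HOME"] = r"C:\hadoop-3.2.2"

import findspark
findspark.init()  # reads SPARK_HOME and makes the matching pyspark importable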
4. Running
# Add this code first to initialize findspark
import findspark
findspark.init()

from datetime import datetime, date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([
    (1, 2, 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
    (2, 3, 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
    (3, 4, 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0)),
])
df = spark.createDataFrame(rdd, schema=['a', 'b', 'c', 'd', 'e'])
df.show()
The running result is the same as above.
That is the full content of "How to install Spark for Python". Thank you for reading! I hope the article has been helpful; if you want to learn more, welcome to follow the industry information channel.