In this article I will share how to install Spark for Python. Many people are not very familiar with the process, so this walkthrough is offered for your reference; I hope you learn something useful from it.
I. Versions used
Java JDK 1.8.0_111
Python 3.9.6
Spark 3.1.2
Hadoop 3.2.2
II. Configure the environment
1. Configure the JDK
Download and install the corresponding version of the JDK from the official website, then configure the environment variables:
(1) In the system variables, create a new JAVA_HOME and set its value to the directory where you installed the JDK.
(2) Create a new CLASSPATH.
Variable value: .;%JAVA_HOME%\lib\dt.jar;%JAVA_HOME%\lib\tools.jar; (note the leading dot and the semicolons)
(3) Click Path
Create a new entry in it: %JAVA_HOME%\bin
(4) confirm after configuration.
(5) Verify: open cmd and run java -version and javac.
If both commands print version information, the JDK environment variables are configured successfully.
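As an extra sanity check, the small Python sketch below (my own addition, not part of the original walkthrough) confirms that JAVA_HOME is set and that java is reachable; it assumes a standard Windows JDK install.

import os
import subprocess

# Confirm JAVA_HOME points at an existing directory
java_home = os.environ.get("JAVA_HOME")
print("JAVA_HOME =", java_home)
if not java_home or not os.path.isdir(java_home):
    print("JAVA_HOME is missing or set incorrectly")

# `java -version` writes its output to stderr
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(result.stderr.strip())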
2. Configure Spark
(1) download and install:
Spark official website: spark-3.1.2-bin-hadoop3.2 download address
(2) Decompress it and configure the environment variables, setting SPARK_HOME to the extracted directory.
(3) Click Path, create a new entry %SPARK_HOME%\bin, and confirm.
(4) Verify: enter pyspark in cmd.
This reminds us to install Hadoop.
3. Configure Hadoop
(1) download:
Hadoop official website: Hadoop 3.2.2 download address
(2) Decompress it and configure the environment variables, setting HADOOP_HOME to the extracted directory.
Note: after extraction, the bin folder may be missing the required Windows utility files (winutils.exe and its matching hadoop.dll); they can be downloaded here:
Download address: https://github.com/cdarlint/winutils
Configure the environment variable CLASSPATH: %HADOOP_HOME%\bin\winutils.exe
(3) Click Path, create a new entry %HADOOP_HOME%\bin, and confirm.
(4) Verify: enter pyspark in cmd again.
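Before re-running pyspark, a quick hedged check from Python that the winutils file from step (2) is actually in place might look like this (it assumes HADOOP_HOME is already set as above):

import os

hadoop_home = os.environ.get("HADOOP_HOME", "")
winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
# Spark on Windows expects winutils.exe under %HADOOP_HOME%\bin
print("winutils found:" if os.path.isfile(winutils) else "winutils missing:", winutils)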
Re-running pyspark now succeeds, but the following warning appears:
WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
This is caused by changes in Spark 3.x; the warning does not appear with Spark 2.4.6.
Solutions that keep the current version (not tried here, since it is only a warning):
Method 1: solution 1
Method 2: solution 2
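If you only want to keep the warning out of your own scripts (this hides the message rather than fixing the underlying 3.x change), a minimal sketch is to raise the log level after the session is created:

import findspark
findspark.init()  # assumes SPARK_HOME is configured as above

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Only ERROR-level messages will be printed from here on
spark.sparkContext.setLogLevel("ERROR")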
III. Configure Spark in PyCharm
(1) Run- > Edit Configurations
(2) configure Environment Variables
(3) File -> Settings -> Project Structure -> Add Content Root
Add the two archives under spark-3.1.2-bin-hadoop3.2\python\lib (pyspark.zip and the py4j-*-src.zip); an alternative that avoids content roots is sketched at the end of this section.
Select the result:
(4) testing
# Add this code first to initialize findspark
import findspark
findspark.init()

from datetime import datetime, date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([
    (1, 2, 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
    (2, 3, 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
    (3, 4, 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0)),
])
df = spark.createDataFrame(rdd, schema=['a', 'b', 'c', 'd', 'e'])
df.show()
Running result:
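As mentioned in step (3), if you prefer not to rely on PyCharm's content roots, a rough alternative is to append the same two archives to sys.path at the top of the script; the path below is an assumption, so point it at your own extraction directory:

import os
import sys

# Assumed location of the extracted Spark distribution; adjust as needed
spark_home = r"C:\spark-3.1.2-bin-hadoop3.2"
lib_dir = os.path.join(spark_home, "python", "lib")

sys.path.append(os.path.join(spark_home, "python"))
for name in os.listdir(lib_dir):
    if name.endswith(".zip"):  # pyspark.zip and the py4j-*-src.zip
        sys.path.append(os.path.join(lib_dir, name))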
IV. Use an Anaconda Python environment to configure Spark
1. Create a virtual environment
conda create -n pyspark_env python==3.9.6
View the environment:
conda env list
Running result:
2. Install pyspark
Switch to the pyspark_env environment and install pyspark:
pip install pyspark
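A quick way to confirm the package landed in the active environment (run with pyspark_env activated):

import pyspark

# Prints the installed PySpark version, e.g. 3.1.2
print(pyspark.__version__)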
3. Environment configuration
When running the above example in this environment, the following error occurs:
This means we need to configure py4j and SPARK_HOME.
SPARK_HOME:
PYTHONPATH settings:
HADOOP_HOME settings:
Set in path:
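If you would rather not touch the system-wide variables, a minimal sketch that sets the same values per script before Spark is initialized is shown below; the directories are assumptions, so replace them with your own install paths:

import os

# Assumed extraction directories; change these to match your machine
os.environ["SPARK_HOME"] = r"C:\spark-3.1.2-bin-hadoop3.2"
os.environ["HADOOP_HOME"] = r"C:\hadoop-3.2.2"

import findspark
findspark.init()  # reads SPARK_HOME and makes the matching pyspark importable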
4. Running
# Add this code first to initialize findspark
import findspark
findspark.init()

from datetime import datetime, date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([
    (1, 2, 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
    (2, 3, 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
    (3, 4, 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0)),
])
df = spark.createDataFrame(rdd, schema=['a', 'b', 'c', 'd', 'e'])
df.show()
The running result is the same as above.
That is the full content of "How to install Spark for Python". Thank you for reading! I hope the article has been helpful; if you want to learn more, welcome to follow the industry information channel.