
Spark 2.x from Shallow to Deep, Part 7: Configuring a Python Development Environment for Spark


Before studying any Spark technology, please first build a correct understanding of Spark; you can refer to: a correct understanding of spark

The following describes how to configure an environment for developing Spark with Python on the mac operating system.

First, install Python

Spark 2.2.0 requires Python 2.7+ or Python 3.4+.

Please refer to:

http://jingyan.baidu.com/article/7908e85c78c743af491ad261.html

Second, download the prebuilt Spark package and configure the environment variables

1. From the official website http://spark.apache.org/downloads.html, download the package spark-2.2.0-bin-hadoop2.6.tgz.

Put it on a local disk and decompress it.

2. Set environment variables:

cd ~
vi .bash_profile

export SPARK_HOME=/Users/tangweiqun/Desktop/bigdata/spark/spark-2.2.0-bin-hadoop2.6
export PATH=$PATH:$SCALA_HOME/bin:$M2_HOME/bin:$JAVA_HOME/bin:$SPARK_HOME/bin

source .bash_profile

3. You need to run chmod 744 ./* on the files in the bin directory under SPARK_HOME; otherwise an insufficient-permissions error will be reported.

Windows machines should not need this step.

Third, install PyCharm

1. Download it from the official website https://www.jetbrains.com/pycharm/download/ and install it with the default installer options.

Fourth, write wordcount.py and run it successfully

1. Create a project

File --> New Project

2. Configure PYTHONPATH for PyCharm

Run --> Edit Configurations; the configuration is as follows.

Click the "+" button, and then fill in:

PYTHONPATH=/Users/tangweiqun/Desktop/bigdata/spark/spark-2.2.0-bin-hadoop2.6/python/:/Users/tangweiqun/Desktop/bigdata/spark/spark-2.2.0-bin-hadoop2.6/python/lib/py4j-0.10.4-src.zip

This adds the Python-related dependencies from the Spark installation package; an equivalent in-code approach is sketched below.

3. Add py4j-some-version.zip and pyspark.zip to the project

To be able to browse the Spark source code, associate the project with it as follows:

Click "+ Add Content Root" and add the two zip packages under /Users/tangweiqun/Desktop/bigdata/spark/spark-2.2.0-bin-hadoop2.6/python/lib

4. Write the Spark word count and run it successfully

Create a Python file, wordcount.py, with the following contents:

from pyspark import SparkContext, SparkConf

import os
import shutil

if __name__ == "__main__":
    # Run locally with a single worker thread
    conf = SparkConf().setAppName("appName").setMaster("local")
    sc = SparkContext(conf=conf)

    # Read the input file from the local filesystem
    sourceDataRDD = sc.textFile("file:///Users/tangweiqun/test.txt")

    # Split each line into words
    wordsRDD = sourceDataRDD.flatMap(lambda line: line.split())

    # Pair each word with an initial count of 1
    keyValueWordsRDD = wordsRDD.map(lambda s: (s, 1))

    # Sum the counts for each word
    wordCountRDD = keyValueWordsRDD.reduceByKey(lambda a, b: a + b)

    # saveAsTextFile fails if the target directory exists, so remove it first
    outputPath = "/Users/tangweiqun/wordcount"
    if os.path.exists(outputPath):
        shutil.rmtree(outputPath)

    wordCountRDD.saveAsTextFile("file://" + outputPath)

    print(wordCountRDD.collect())

Right-click the file and run it; it should complete successfully.

For a detailed and systematic understanding of the Spark core RDD APIs, refer to: a detailed explanation of the spark core RDD api principles
