How to quickly build a Spark development environment

How to quickly build a Spark development environment? This article walks through the setup in detail, in the hope of giving anyone facing the same problem a simpler and more practical way to solve it.

First, set up a local pyspark stand-alone practice environment

The following steps configure a local standalone pyspark programming environment for practice.

Note: this only sets up a practice environment; there is no need to install Hadoop or Scala.

1. Install Java 8

Download address: https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Be careful to avoid installing other JDK versions, which may be incompatible with Spark. Remember to set JAVA_HOME and add it to the default PATH.

For a detailed tutorial on installing JDK 8 under Windows, please refer to:

https://www.cnblogs.com/heqiyoujing/p/9502726.html

After the installation is successful, type java -version on the command line, and you will see a result similar to the following.
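For example, with JDK 8u172 the output looks roughly like this (the exact version and build numbers depend on the JDK release you installed):

java version "1.8.0_172"
Java(TM) SE Runtime Environment (build 1.8.0_172-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.172-b11, mixed mode)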

2. Download and decompress spark

Download from spark official website: http://spark.apache.org/downloads.html

Baidu cloud disk link: https://pan.baidu.com/s/1mUMavclShgvigjaKwoSF_A password: fixh

After downloading, unpack the archive and move it into a directory where you usually install software, for example:

/Users/liangyun/ProgramFiles/spark-3.0.1-bin-hadoop3.2
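For example, on Linux or macOS the downloaded archive can be unpacked and moved with commands like these (the archive name assumes the spark-3.0.1-bin-hadoop3.2 package from the download page, and the target directory is just an example):

tar -xzf spark-3.0.1-bin-hadoop3.2.tgz
mv spark-3.0.1-bin-hadoop3.2 /Users/liangyun/ProgramFiles/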

For Linux and macOS users, it is recommended to set the following environment variables in ~/.bashrc so that spark-submit and spark-shell can be started from the command line.

Windows users can ignore the following settings.

export PYTHONPATH="/Users/liangyun/anaconda3/bin/python"
export PATH="/Users/liangyun/anaconda3/bin:${PATH}"
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home
export SPARK_HOME="/Users/liangyun/ProgramFiles/spark-3.0.1-bin-hadoop3.2"
export PYSPARK_PYTHON=$PYTHONPATH
export PYSPARK_DRIVER_PYTHON=$PYTHONPATH
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

3. Install findspark
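findspark can be installed with pip; a minimal sketch, assuming pip is available in your Python environment:

pip install findspark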

After a successful installation, you can run the following code in jupyter:

import findspark

# specify spark_home (the decompression path) and the python path
spark_home = "/Users/liangyun/ProgramFiles/spark-3.0.1-bin-hadoop3.2"
python_path = "/Users/liangyun/anaconda3/bin/python"
findspark.init(spark_home, python_path)

import pyspark
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("test").setMaster("local[4]")
sc = SparkContext(conf=conf)

print("spark version:", pyspark.__version__)
rdd = sc.parallelize(["hello", "spark"])
print(rdd.reduce(lambda x, y: x + ' ' + y))

spark version: 3.0.1

hello spark

4. Fallback plan

If the above process fails to set up pyspark, for example because of Java environment configuration issues, you can learn pyspark directly in the cloud notebook environment of the Whale community, where pyspark is already installed.

https://www.kesci.com/home/column/5fe6aa955e24ed00302304e0

Second, various ways of running pyspark

pyspark can be run in the following main ways.

1. Enter the standalone pyspark interactive environment with the pyspark command.

This approach is generally used to test code.

You can also specify jupyter or ipython as the interactive environment.
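For example, the interactive front end can be switched by setting the driver Python before launching pyspark; a small sketch reusing the environment variables from section one:

export PYSPARK_DRIVER_PYTHON=ipython     # or jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=''     # set to 'notebook' when using jupyter
pyspark --master local[4]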

2. Submit Spark tasks to the cluster with spark-submit.

In this way, you can submit Python scripts or Jar packages to the cluster and let hundreds of machines run tasks.

This is also the way spark is commonly used in industrial production.
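A minimal sketch of a local submission (the script name reuses pyspark_demo.py from the full cluster example in section three below):

spark-submit --master local[4] pyspark_demo.py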

3. Interactive execution through a Zeppelin notebook.

Zeppelin is the Apache counterpart of the Jupyter notebook.

4. Install the findspark and pyspark libraries in Python.

The pyspark library can be invoked in jupyter and other Python environments as if it were a regular library.

This is also how this book configures the pyspark practice environment.

Third, FAQs on submitting tasks to the cluster with spark-submit

Here are some common issues related to running pyspark on a cluster.

1. Can I call a jar package developed in Scala or Java?

Answer: the jar package can only be called in the Driver, through Py4J; it cannot be called in the executors.
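A minimal sketch of calling into a jar from the Driver through Py4J; com.example.MyUtil and its hello method are hypothetical classes shipped in a jar added with --jars, and _jvm is pyspark's internal Py4J gateway to the driver JVM:

# submitted with: spark-submit --jars my_util.jar job.py   (jar and script names are hypothetical)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("py4j-demo").getOrCreate()
jvm = spark.sparkContext._jvm                     # Py4J gateway into the driver JVM
print(jvm.com.example.MyUtil.hello("spark"))      # hypothetical static method; only callable on the Driver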

2. How do I install packages such as pandas and numpy on the executors?

Answer: you can create a Python environment with conda, compress it into a zip file, upload it to hdfs, and specify that environment when submitting the task. The simplest and most direct approach is to package the whole anaconda environment you need into a zip and upload it to the cluster's hdfs. Note that the machine on which you build the package should run the same Linux operating system as the cluster machines; a sketch of the steps follows.
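A minimal sketch of the packaging and upload steps, assuming the anaconda3 directory layout and the hdfs path used in the spark-submit example in this section; adjust names and paths to your cluster:

# run on a machine with the same Linux distribution as the cluster
cd ~ && zip -r -q anaconda3.zip anaconda3                     # compress the whole anaconda3 environment
hdfs dfs -put anaconda3.zip viewfs:///user/hadoop-xxx/yyy/    # upload for use with --archives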

3. How can I add my own Python scripts to the PYTHONPATH on the executors?

Answer: you can set this with the --py-files parameter. You can add .py and .egg files, or Python scripts compressed into a .zip, and then import them in the executors; see the sketch below.
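A minimal sketch, assuming a hypothetical helper module utils.py with a square function, shipped alongside the job script:

# job.py, submitted with: spark-submit --py-files utils.py job.py   (names are hypothetical)
from pyspark import SparkContext
import utils                                        # available on the executors via --py-files
sc = SparkContext(appName="pyfiles-demo")
print(sc.parallelize(range(4)).map(utils.square).collect())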

4. How do I add configuration files to the working directory of each executor?

Answer: this can be set with the --files parameter, with different file names separated by commas; the files can then be located with SparkFiles.get(fileName) in the executors, as sketched below.
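A minimal sketch, reusing the data.csv file name from the spark-submit example below; the other names are placeholders:

# job.py, submitted with: spark-submit --files data.csv job.py
from pyspark import SparkContext, SparkFiles
sc = SparkContext(appName="files-demo")
# each executor resolves its own local copy of the distributed file
rdd = sc.parallelize(range(2)).map(lambda _: open(SparkFiles.get("data.csv")).readline())
print(rdd.collect())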

# submit a task written in Python
# spark.yarn.appMasterEnv.PYSPARK_PYTHON specifies the Python environment for the executors
# spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON is needed in cluster deploy mode
# --archives points to the Python environment uploaded to hdfs
spark-submit --master yarn \
--deploy-mode cluster \
--executor-memory 12G \
--driver-memory 12G \
--num-executors 100 \
--executor-cores 2 \
--conf spark.yarn.maxAppAttempts=2 \
--conf spark.default.parallelism=1600 \
--conf spark.sql.shuffle.partitions=1600 \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=2g \
--conf spark.task.maxFailures=10 \
--conf spark.stage.maxConsecutiveAttempts=10 \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./anaconda3.zip/anaconda3/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./anaconda3.zip/anaconda3/bin/python \
--archives viewfs:///user/hadoop-xxx/yyy/anaconda3.zip \
--files data.csv,profile.txt \
--py-files pkg.py,tqdm.py \
pyspark_demo.py

This concludes the question of how to quickly build a Spark development environment. I hope the above content is of some help to you.
