
How to use Eclipse to build a Spark integrated development environment


Many inexperienced readers do not know how to use Eclipse to build a Spark integrated development environment, so this article summarizes the problem and its solution. I hope it helps you resolve the issue after reading it.

Maven can be used to compile and generate a Spark jar package that runs directly on Hadoop 2.2.0. On this basis, this article introduces how to use Eclipse to build a Spark integrated development environment.

(1) Preparatory work

Before the formal introduction, the following hardware and software preparations are required:

Software preparation:

Eclipse Juno (version 4.2), which can be downloaded here: Eclipse 4.2

Scala 2.9.3; the Windows installer can be downloaded directly here: Scala 2.9.3

The Scala IDE plug-in for Eclipse, which can be downloaded directly here: Scala IDE (for Scala 2.9.x and Eclipse Juno)

Hardware preparation:

A machine running Linux or Windows

(2) Build the Spark integrated development environment

I work under the Windows operating system, and the process is as follows:

Step 1: install Scala 2.9.3 by simply running the installer.

Step 2: copy all the files in the features and plugins directories of the Scala IDE plug-in into the corresponding directories of the unpacked Eclipse installation.

Step 3: restart Eclipse, click the small box button in the upper-right corner of Eclipse, and after it expands, click "Other…" to check whether "Scala" is listed. If so, open it directly; otherwise go to step 4.

Step 4: in Eclipse, select "Help" -> "Install New Software…", fill in http://download.scala-ide.org/sdk/e38/scala29/stable/site in the dialog that opens, and press Enter. Select the first two items to install. (Since step 2 already copied the jar packages into Eclipse, the installation is quick; just walk through the wizard.) After installation, repeat step 3.
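As a quick sanity check that the Scala IDE is working, you can create a trivial Scala object in Eclipse and run it as a Scala application. This snippet and its name are my own illustration, not part of the original article:

object ScalaIdeCheck {
  def main(args: Array[String]) {
    // If the Scala IDE plug-in is installed correctly, this compiles and prints
    // the Scala library version used by the project (e.g. "version 2.9.3").
    println("Scala says: " + scala.util.Properties.versionString)
  }
}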

(3) Develop a Spark program in Scala

In Eclipse, select "File" -> "New" -> "Other..." -> "Scala Wizard" -> "Scala Project" to create a Scala project, and name it "SparkScala".

Right-click the "SparkScala" project, select "Properties", and in the pop-up dialog choose "Java Build Path" -> "Libraries" -> "Add External JARs…". Import spark-assembly-0.8.1-incubating-hadoop2.2.0.jar from the assembly/target/scala-2.9.3/ directory mentioned in the article "Apache Spark: deploying Spark to Hadoop 2.2.0"; the jar package can also be compiled and generated by yourself and placed in the assembly/target/scala-2.9.3/ directory under the Spark directory.
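Optionally, a minimal compile check confirms that the spark-assembly jar is visible on the build path. The object name here is my own assumption, not from the article:

import org.apache.spark.SparkContext

object BuildPathCheck {
  def main(args: Array[String]) {
    // If spark-assembly-0.8.1-incubating-hadoop2.2.0.jar is on the build path,
    // this compiles and prints the class name without starting a SparkContext.
    println(classOf[SparkContext].getName)
  }
}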

Similar to creating the Scala project, add a Scala class to the project and name it WordCount.

WordCount is the most classic word-frequency statistics program; it counts the number of occurrences of each word in the input directory. The Scala code is as follows:

import org.apache.spark._
import SparkContext._

object WordCount {
  def main(args: Array[String]) {
    if (args.length != 3) {
      println("usage is org.test.WordCount <master> <input> <output>")
      return
    }
    // args(0): master location; SPARK_HOME and SPARK_TEST_JAR are read from the environment
    val sc = new SparkContext(args(0), "WordCount",
      System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))
    // args(1): HDFS input directory; args(2): HDFS output directory
    val textFile = sc.textFile(args(1))
    val result = textFile.flatMap(line => line.split("\\s+"))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)
    result.saveAsTextFile(args(2))
  }
}
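To see what the flatMap / map / reduceByKey pipeline computes without submitting anything to a cluster, here is a rough equivalent written against plain Scala collections. This is my own illustration, not part of the original program; groupBy plus a per-word sum stands in for Spark's reduceByKey:

object WordCountLogicDemo {
  def main(args: Array[String]) {
    val lines = Seq("hello spark", "hello eclipse spark")
    val counts = lines
      .flatMap(line => line.split("\\s+"))  // split each line into words
      .map(word => (word, 1))               // pair each word with a count of 1
      .groupBy(_._1)                        // plain-collection stand-in for reduceByKey
      .mapValues(_.map(_._2).sum)           // sum the 1s for each word
    counts.foreach(println)                 // prints each (word, count) pair, e.g. (hello,2)
  }
}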

In the Scala project, right-click "WordCount.scala", select "Export", and choose "Java" -> "JAR File" in the pop-up dialog to package the program into a jar, which can be named "spark-wordcount-in-scala.jar". The jar package I exported can be downloaded here: spark-wordcount-in-scala.jar.

The WordCount program takes three parameters: the master location, the HDFS input directory, and the HDFS output directory. To pass them, you can write a run_spark_wordcount.sh script:

# set to the directory where the YARN configuration files are stored
export YARN_CONF_DIR=/opt/hadoop/yarn-client/etc/hadoop/

SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.2.0.jar \
./spark-class org.apache.spark.deploy.yarn.Client \
  --jar spark-wordcount-in-scala.jar \
  --class WordCount \
  --args yarn-standalone \
  --args hdfs://hadoop-test/tmp/input \
  --args hdfs://hadoop-test/tmp/output \
  --num-workers 1 \
  --master-memory 2g \
  --worker-memory 2g \
  --worker-cores 2

Note the following points: the input parameters of the WordCount program are specified through "--args", with each parameter given separately; the second parameter is the input directory on HDFS, which must be created in advance and have several text files uploaded into it so that word frequencies can be counted; the third parameter is the output directory on HDFS, which must not exist before the run.

You can get the result by running the run_spark_wordcount.sh script directly.

During the run, a bug was found: org.apache.spark.deploy.yarn.Client has a parameter "--name" that specifies the application name.

However, in use this parameter blocks the application. Looking at the source code shows that it is a bug, which has been submitted to the Spark JIRA:

// location: new-yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala
case ("--queue") :: value :: tail =>
  amQueue = value
  args = tail

case ("--name") :: value :: tail =>
  appName = value
  args = tail  // this line of code was missing, causing the program to block

case ("--addJars") :: value :: tail =>
  addJars = value
  args = tail

So, don't use the "--name" parameter for now, or fix the bug and recompile Spark.

(4) Develop Spark programs in Java

The method is the same as ordinary Java program development; simply use the Spark development package spark-assembly-0.8.1-incubating-hadoop2.2.0.jar as a third-party dependency library.

(5) Summary

During this preliminary trial of Spark On YARN, many problems were found; it is inconvenient to use and the barrier to entry is still high, so it is far less mature than Spark On Mesos.

After reading the above, have you mastered how to use Eclipse to build a Spark integrated development environment? If you want to learn more skills or learn more about related topics, you are welcome to follow the industry information channel. Thank you for reading!
