
How to compile Spark-2.1.0 source code based on CentOS6.4 environment


In this article, I will share with you how to compile the Spark-2.1.0 source code in a CentOS 6.4 environment. I hope you gain something from reading it; let's discuss it together!

1 Foreword

Some readers may ask: doesn't the Spark official website already provide installation packages for different versions? Why do we still need to compile the Spark source code ourselves? To answer this question, let's take a look at the Spark official website: spark.apache.org.

Spark does provide pre-built packages for several Hadoop versions, but do they meet our requirements? Not necessarily. Based on my experience developing with Spark over the past few years, I would list the following points:

In production environments, Hadoop is usually a CDH or HDP distribution, so can the officially provided Hadoop builds meet production requirements?

During development, we often need to modify the Spark source code, so how do we integrate the modified code into a Spark installation package?

For the two points listed above, the best practices in my experience are:

Compile the installation package for Spark based on the version of Hadoop running in production

After modifying Spark source code, recompile Spark

So, personally, I feel that if you want to learn and use Spark well, the first step is to compile your own installation package from the Spark source code.

2 Pre-preparation

According to the building section of the official Spark documentation (http://spark.apache.org/docs/2.1.0/building-spark.html):

"The Maven-based build is the build of reference for Apache Spark. Building Spark using Maven requires Maven 3.3.9 or newer and Java 7+. Note that support for Java 7 is deprecated as of Spark 2.0.0 and may be removed in Spark 2.2.0."

From this we learn that:

Java 7 or newer is required; Java 7 support has been deprecated since Spark 2.0.0 (it still works for now) and may be removed in Spark 2.2.0;

Maven 3.3.9 or newer is required.
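Before installing anything, you can check whether suitable versions already exist on the machine:

java -version   # should report 1.7 or newer
mvn -v          # should report 3.3.9 or newer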

2.1 Java 7 installation

2.1.1 Download

Java SE installation package download address: www.oracle.com/technetwork/java/javase/downloads/java-archive-downloads-javase7-521261.html

The JDK version used in this article is jdk1.7.0_51.

2.1.2 Installation

All of our software is installed under the app folder in the home directory of the hadoop user.
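If that directory does not exist yet, create it first so the tar commands below have somewhere to extract to:

mkdir -p ~/app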

# Unzip
tar -zxvf jdk-7u51-linux-x64.tar.gz -C ~/app/

# Add the JDK directory to the system environment variables (~/.bash_profile)
export JAVA_HOME=/home/hadoop/app/jdk1.7.0_51
export PATH=$JAVA_HOME/bin:$PATH

# Make the configuration file take effect
source ~/.bash_profile

# Run java to check the version
java -version

# If the installation succeeded, the output is:
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

2.2 Installation of Maven 3.3.9

2.2.1 Download

Maven 3.3.9 installation package download address: https://mirrors.tuna.tsinghua.edu.cn/apache//maven/maven-3/3.3.9/binaries/
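For example, you can fetch it from the mirror above with wget (the archive name matches the one used in the install step below):

wget https://mirrors.tuna.tsinghua.edu.cn/apache//maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz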

2.2.2 Installation

# Unzip
tar -zxvf apache-maven-3.3.9-bin.tar.gz -C ~/app/

# Add the Maven directory to the system environment variables (~/.bash_profile)
export MAVEN_HOME=/home/hadoop/app/apache-maven-3.3.9
export PATH=$MAVEN_HOME/bin:$PATH

# Make the configuration file take effect
source ~/.bash_profile

# Run mvn to check the version
mvn -v

# If the installation succeeded, the output contains the following information:
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T08:41:47-08:00)
Maven home: /home/hadoop/app/apache-maven-3.3.9
Java version: 1.7.0_51, vendor: Oracle Corporation
Java home: /home/hadoop/app/jdk1.7.0_51/jre
Default locale: zh_CN, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-358.el6.x86_64", arch: "amd64", family: "unix"

2.3 Spark-2.1.0 source code download

Download address: spark.apache.org/downloads.html

After downloading, decompress the archive.
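For example, assuming the downloaded source archive is named spark-2.1.0.tgz (the name may differ depending on which package you select on the downloads page):

tar -zxvf spark-2.1.0.tgz -C ~/app/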

3 Spark source code compilation

See the Building a Runnable Distribution section of the official documentation: spark.apache.org/docs/2.1.0/building-spark.html#building-a-runnable-distribution

We can use the make-distribution.sh script under the dev directory of the Spark source tree. The official compilation command is as follows:

./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.4 -Phive -Phive-thriftserver -Pmesos -Pyarn

Parameter Description:

--name: Specifies the name of the Spark installation package after compilation

--tgz: package the build as a .tgz archive

-Psparkr: the compiled Spark will support R

-Phadoop-2.4: compile against the hadoop-2.4 profile; the available profiles are defined in pom.xml in the source root directory (see the one-liner after this list)

-Phive and -Phive-thriftserver: the compiled Spark will support operations on Hive

-Pmesos: the compiled Spark will support running on Mesos

-Pyarn: the compiled Spark will support running on YARN
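As noted for -Phadoop-2.4 above, a quick way to list the profile ids defined in pom.xml is a grep over the source root (a convenience one-liner, not from the official docs):

grep -A 1 '<profile>' pom.xml | grep '<id>'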

Then we can compile Spark for our specific environment. For example, we use Hadoop version 2.6.0-cdh5.7.0 and we need Spark to run on YARN and support Hive operations, so our Spark compilation command is:

./dev/make-distribution.sh --name 2.6.0-cdh5.7.0 --tgz -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -Dhadoop.version=2.6.0-cdh5.7.0

After the compilation succeeds, the package spark-2.1.0-bin-2.6.0-cdh5.7.0.tgz appears in the root directory of the Spark source tree, and we can then install Spark from this compiled package.
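For example, following the same convention as the earlier installations (the exact archive name depends on the --name value you passed):

tar -zxvf spark-2.1.0-bin-2.6.0-cdh5.7.0.tgz -C ~/app/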

Some readers may ask: why is the compiled package named spark-2.1.0-bin-2.6.0-cdh5.7.0.tgz? To find out, we can look at the source of make-distribution.sh. Near the end of the script there is the following code:

if [ "$MAKE_TGZ" == "true" ]; then TARDIR_NAME=spark-$VERSION-bin-$NAME TARDIR="$SPARK_HOME/$TARDIR_NAME" rm -rf "$TARDIR" cp -r "$DISTDIR" "$TARDIR" tar czf "spark-$VERSION-bin-$NAME.tgz" -C "$SPARK_HOME" "$TARDIR_NAME" rm -rf "$TARDIR" fi

Here VERSION is our Spark version, i.e. 2.1.0, and NAME is the 2.6.0-cdh5.7.0 we specified at compile time, so the full name of the package the script finally outputs is spark-2.1.0-bin-2.6.0-cdh5.7.0.tgz. I hope this look at the code illustrates one point: in front of the source code, there are no secrets.
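For reference, VERSION itself is obtained earlier in the same script by asking Maven for the project version; I am quoting the 2.1.0 script from memory here, so treat the exact line as approximate:

VERSION=$("$MVN" help:evaluate -Dexpression=project.version $@ 2>/dev/null | grep -v "INFO" | tail -n 1)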

Note: during compilation, downloading a dependency package may take a very long time due to network problems. You can press Ctrl+C to stop the compilation and then re-run the compilation command, trying several times if necessary. If conditions permit, it is recommended to compile over a VPN; the whole process will go much more smoothly.
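If you want to automate the retries, a simple shell loop will do (a convenience sketch, not from the official docs):

for i in 1 2 3; do
  ./dev/make-distribution.sh --name 2.6.0-cdh5.7.0 --tgz \
    -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver \
    -Dhadoop.version=2.6.0-cdh5.7.0 && break
done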

After reading this article, I believe you now have a good understanding of how to compile the Spark-2.1.0 source code in a CentOS 6.4 environment. Thank you for reading!
