In this issue, the editor brings you a sample analysis of MaxCompute Spark development. The article is rich in content and analyzed from a professional point of view; I hope you gain something after reading it.
MaxCompute Spark development
0. Overview
MaxCompute Spark is an open-source-Spark-compatible computing service provided by MaxCompute. It provides a Spark computing framework on top of a unified system of computing-resource and dataset permissions, allowing users to submit and run Spark jobs in a familiar way of development and use, so as to cover richer data processing and analysis scenarios.
The following focuses on the application scenarios that MaxCompute Spark supports, as well as the dependencies and environment preparation needed for development, with emphasis on developing Spark jobs, submitting them to the MaxCompute cluster for execution, and diagnosing them.
1. Prerequisites
MaxCompute Spark is a Spark on MaxCompute solution provided by Aliyun, which allows Spark applications to run in a hosted MaxCompute computing environment. In order to run Spark jobs securely in the MaxCompute environment, MaxCompute provides the following SDK and MaxCompute Spark custom distribution packages.
SDK: positioned to connect open-source applications to MaxCompute.
It provides the API descriptions and related demos needed for integration. Users can build their own applications based on the Spark-1.x and Spark-2.x example projects provided in the repository, and submit them to the MaxCompute cluster.
MaxCompute Spark client release packages:
Integrated with MaxCompute authentication, they serve as client tools for submitting jobs to a MaxCompute project to run via spark-submit. Two release packages are currently provided, for Spark 1.x and Spark 2.x: spark-1.6.3 and spark-2.3.0. The SDK can be referenced by configuring Maven dependencies during development, while the Spark client must be downloaded in advance according to the Spark version you develop against: for Spark 1.x applications, download the spark-1.6.3 client; for Spark 2.x applications, download the spark-2.3.0 client.
2. Development environment preparation
2.1 MaxCompute Spark client preparation
MaxCompute Spark release packages: integrated with MaxCompute authentication, they serve as client tools for submitting jobs to a MaxCompute project to run via spark-submit. Two release packages, for Spark 1.x and Spark 2.x, are currently provided:
spark-1.6.3
spark-2.3.0
Choose and download the appropriate version of the MaxCompute Spark distribution package according to the Spark version you develop against, then extract it.
2.2 Set environment variables
JAVA_HOME setting
# use JDK 1.7+; JDK 1.8+ is best
export JAVA_HOME=/path/to/jdk
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH
SPARK_HOME setting
export SPARK_HOME=/path/to/spark_extracted_package
export PATH=$SPARK_HOME/bin:$PATH
2.3 Set spark-defaults.conf
There is a spark-defaults.conf.template file under $SPARK_HOME/conf, which can be used as a template for spark-defaults.conf. You need to set the MaxCompute account information in this file before you can submit Spark jobs to MaxCompute. The default configuration is shown below; fill in the blanks according to your actual account information, and the rest of the configuration can remain unchanged.
# MaxCompute account information
spark.hadoop.odps.project.name =
spark.hadoop.odps.access.id =
spark.hadoop.odps.access.key =
# the following configurations remain unchanged
spark.sql.catalogImplementation=odps
spark.hadoop.odps.task.major.version = cupid_v2
spark.hadoop.odps.cupid.container.image.enable = true
spark.hadoop.odps.cupid.container.vm.engine.type = hyper
spark.hadoop.odps.end.point = http://service.cn.maxcompute.aliyun.com/api
spark.hadoop.odps.runtime.end.point = http://service.cn.maxcompute.aliyun-inc.com/api
3. Dependencies required to access the MaxCompute table
If the job needs to access MaxCompute tables, it must depend on the odps-spark-datasource module. This section describes how to compile and install that dependency into the local Maven repository; if you do not need table access, you can skip it.
Clone the code; the GitHub address is https://github.com/aliyun/aliyun-cupid-sdk/tree/3.3.2-public
# git clone git@github.com:aliyun/aliyun-cupid-sdk.git
Compile the modules
# cd ${path to aliyun-cupid-sdk}
# git checkout 3.3.2-public
// compile and install cupid-sdk
# cd ${path to aliyun-cupid-sdk}/core/cupid-sdk/
# mvn clean install -DskipTests
// compile and install datasource (depends on cupid-sdk)
// for spark-2.x
# cd ${path to aliyun-cupid-sdk}/spark/spark-2.x/datasource
# mvn clean install -DskipTests
// for spark-1.x
# cd ${path to aliyun-cupid-sdk}/spark/spark-1.x/datasource
# mvn clean install -DskipTests
Add dependency
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>odps-spark-datasource_2.10</artifactId>
    <version>3.3.2-public</version>
</dependency>
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>odps-spark-datasource_2.11</artifactId>
    <version>3.3.2-public</version>
</dependency>
4. OSS dependency
If the job needs to access OSS, simply add the following dependency:
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>hadoop-fs-oss</artifactId>
    <version>3.3.2-public</version>
</dependency>
5. Application development
MaxCompute provides two application-build templates on which users can base their development. The whole project is then built as a unit, and the generated application package can be submitted directly to the MaxCompute cluster to run the Spark application.
5.1 Build applications from templates
MaxCompute Spark provides two application-build templates; to use them, first clone the code.
# git clone git@github.com:aliyun/aliyun-cupid-sdk.git
# cd aliyun-cupid-sdk
# git checkout 3.3.2-public
# cd archetypes
// for Spark-1.x
sh Create-AliSpark-1.x-APP.sh spark-1.x-demo /tmp
// for Spark-2.x
sh Create-AliSpark-2.x-APP.sh spark-2.x-demo /tmp
The above command creates a Maven project named spark-1.x-demo (or spark-2.x-demo) under the /tmp directory. Execute the following commands to compile and submit the job:
# cd /tmp/spark-2.x-demo
# mvn clean package
// submit the job
$SPARK_HOME/bin/spark-submit \
--master yarn-cluster \
--class SparkPi \
/tmp/spark-2.x-demo/target/AliSpark-2.x-quickstart-1.0-SNAPSHOT-shaded.jar
# Usage: sh Create-AliSpark-2.x-APP.sh
sh Create-AliSpark-2.x-APP.sh spark-2.x-demo /tmp/
cd /tmp/spark-2.x-demo
mvn clean package
# smoke test
# 1 use the compiled shaded jar package
# 2 download the MaxCompute Spark client as described in this document
# 3 refer to the "Set environment variables" section and fill in the relevant configuration items of the MaxCompute project
# then execute the spark-submit command as follows
$SPARK_HOME/bin/spark-submit \
--master yarn-cluster \
--class SparkPi \
/tmp/spark-2.x-demo/target/AliSpark-2.x-quickstart-1.0-SNAPSHOT-shaded.jar
5.2 Java/Scala development samples
Spark-1.x
Notes on pom.xml
Note: because applications are submitted with the Spark client provided by MaxCompute, pay attention to the scope of certain dependencies when building them:
All packages released by the Spark community, such as spark-core and spark-sql, use provided scope.
odps-spark-datasource uses the default compile scope.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>odps-spark-datasource_${scala.binary.version}</artifactId>
    <version>3.3.2-public</version>
</dependency>
Case description
WordCount
Detailed code
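For orientation only, here is a minimal WordCount sketch in the Spark-1.x RDD style. It simply counts words in a small in-memory collection; the actual com.aliyun.odps.spark.examples.WordCount class in the repository may differ in detail.
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountSketch")
    val sc = new SparkContext(conf)
    // count words in a small in-memory collection
    val lines = sc.parallelize(Seq("hello maxcompute", "hello spark"))
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach { case (word, cnt) => println(s"$word: $cnt") }
    sc.stop()
  }
}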
Submission mode
Step 1. Build aliyun-cupid-sdk
Step 2. Properly set spark-defaults.conf
Step 3. bin/spark-submit --master yarn-cluster --class \
com.aliyun.odps.spark.examples.WordCount \
${path to aliyun-cupid-sdk}/spark/spark-1.x/spark-examples/target/spark-examples_2.10-version-shaded.jar
Spark-SQL on MaxCompute Table
Detailed code
Submission mode
# running may report a Table Not Found exception, because the user's MaxCompute project does not contain the table specified in the code
# you can refer to the various APIs in the code to implement a SparkSQL application for your own tables
Step 1. Build aliyun-cupid-sdk
Step 2. Properly set spark-defaults.conf
Step 3. bin/spark-submit --master yarn-cluster --class \
com.aliyun.odps.spark.examples.sparksql.SparkSQL \
${path to aliyun-cupid-sdk}/spark/spark-1.x/spark-examples/target/spark-examples_2.10-version-shaded.jar
GraphX PageRank
Detailed code
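As a rough illustration of what such a job looks like, the sketch below builds a tiny graph with the standard GraphX API and runs PageRank on it. The repository's com.aliyun.odps.spark.examples.graphx.PageRank example may be structured differently.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PageRankSketch"))
    // a tiny directed graph: 1 -> 2, 2 -> 3, 3 -> 1, 1 -> 3
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1), Edge(1L, 3L, 1)))
    val graph = Graph.fromEdges(edges, defaultValue = 1)
    // run PageRank until convergence within the given tolerance
    val ranks = graph.pageRank(0.0001).vertices
    ranks.collect().foreach { case (id, rank) => println(s"vertex $id: rank $rank") }
    sc.stop()
  }
}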
Submission mode
Step 1. Build aliyun-cupid-sdk
Step 2. Properly set spark-defaults.conf
Step 3. bin/spark-submit --master yarn-cluster --class \
com.aliyun.odps.spark.examples.graphx.PageRank \
${path to aliyun-cupid-sdk}/spark/spark-1.x/spark-examples/target/spark-examples_2.10-version-shaded.jar
Mllib Kmeans-ON-OSS
Detailed code
Submission mode
# the OSS account information in the code needs to be filled in before compiling and submitting
conf.set("spark.hadoop.fs.oss.accessKeyId", "*")
conf.set("spark.hadoop.fs.oss.accessKeySecret", "*")
conf.set("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-zmf.aliyuncs.com")
Step 1. Build aliyun-cupid-sdk
Step 2. Properly set spark-defaults.conf
Step 3. bin/spark-submit --master yarn-cluster --class \
com.aliyun.odps.spark.examples.mllib.KmeansModelSaveToOss \
${path to aliyun-cupid-sdk}/spark/spark-1.x/spark-examples/target/spark-examples_2.10-version-shaded.jar
OSS UnstructuredData
Detailed code
Submission mode
# the OSS account information in the code needs to be filled in before compiling and submitting
conf.set("spark.hadoop.fs.oss.accessKeyId", "*")
conf.set("spark.hadoop.fs.oss.accessKeySecret", "*")
conf.set("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-zmf.aliyuncs.com")
Step 1. Build aliyun-cupid-sdk
Step 2. Properly set spark-defaults.conf
Step 3. bin/spark-submit --master yarn-cluster --class \
com.aliyun.odps.spark.examples.oss.SparkUnstructuredDataCompute \
${path to aliyun-cupid-sdk}/spark/spark-1.x/spark-examples/target/spark-examples_2.10-version-shaded.jar
Spark-2.x
Notes on pom.xml
Note: because applications are submitted with the Spark client provided by MaxCompute, pay attention to the scope of certain dependencies when building them:
All packages released by the Spark community, such as spark-core and spark-sql, use provided scope.
odps-spark-datasource uses the default compile scope.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>cupid-sdk</artifactId>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>odps-spark-datasource_${scala.binary.version}</artifactId>
    <version>3.3.2-public</version>
</dependency>
Case description
WordCount
Detailed code
Submission mode
Step 1. Build aliyun-cupid-sdk
Step 2. Properly set spark-defaults.conf
Step 3. bin/spark-submit --master yarn-cluster --class \
com.aliyun.odps.spark.examples.WordCount \
${path to aliyun-cupid-sdk}/spark/spark-2.x/spark-examples/target/spark-examples_2.11-version-shaded.jar
Spark-SQL operates the MaxCompute table
Detailed code
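For orientation, here is a minimal hedged sketch of a Spark-2.x SQL job against a MaxCompute table. It assumes spark.sql.catalogImplementation=odps is set in spark-defaults.conf (see section 2.3) and that a table named cupid_wordcount exists in the project, mirroring the PySpark example later in this article; the repository's SparkSQL example may differ.
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    // with spark.sql.catalogImplementation=odps, spark.sql() resolves MaxCompute tables directly
    val spark = SparkSession.builder().appName("SparkSqlSketch").getOrCreate()
    val df = spark.sql("select id, value from cupid_wordcount")
    df.printSchema()
    df.show(10)
    spark.stop()
  }
}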
Submission mode
# running may report a Table Not Found exception, because the user's MaxCompute project does not contain the table specified in the code
# you can refer to the various APIs in the code to implement a SparkSQL application for your own tables
Step 1. Build aliyun-cupid-sdk
Step 2. Properly set spark-defaults.conf
Step 3. bin/spark-submit --master yarn-cluster --class \
com.aliyun.odps.spark.examples.sparksql.SparkSQL \
${path to aliyun-cupid-sdk}/spark/spark-2.x/spark-examples/target/spark-examples_2.11-version-shaded.jar
GraphX PageRank
Detailed code
Submission mode
Step 1. Build aliyun-cupid-sdk
Step 2. Properly set spark-defaults.conf
Step 3. bin/spark-submit --master yarn-cluster --class \
com.aliyun.odps.spark.examples.graphx.PageRank \
${path to aliyun-cupid-sdk}/spark/spark-2.x/spark-examples/target/spark-examples_2.11-version-shaded.jar
Mllib Kmeans-ON-OSS
KmeansModelSaveToOss
Detailed code
Submission mode
# the OSS account information in the code needs to be filled in before compiling and submitting
val spark = SparkSession
  .builder()
  .config("spark.hadoop.fs.oss.accessKeyId", "*")
  .config("spark.hadoop.fs.oss.accessKeySecret", "*")
  .config("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-zmf.aliyuncs.com")
  .appName("KmeansModelSaveToOss")
  .getOrCreate()
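Continuing from the builder above, here is a hedged sketch of the rest of such a job: it trains a small KMeans model with the standard MLlib API and saves it to an OSS path. The oss:// bucket path is a placeholder, and the actual KmeansModelSaveToOss example in the repository may differ.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// train a small KMeans model and persist it to OSS (bucket path is a placeholder)
val sc = spark.sparkContext
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))
val model = KMeans.train(points, 2, 20)
model.save(sc, "oss://your-bucket/kmeans-model")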
Step 1. Build aliyun-cupid-sdk
Step 2. Properly set spark-defaults.conf
Step 3. bin/spark-submit --master yarn-cluster --class \
com.aliyun.odps.spark.examples.mllib.KmeansModelSaveToOss \
${path to aliyun-cupid-sdk}/spark/spark-2.x/spark-examples/target/spark-examples_2.11-version-shaded.jar
OSS UnstructuredData
SparkUnstructuredDataCompute
Detailed code
Submission mode
# the OSS account information in the code needs to be filled in before compiling and submitting
val spark = SparkSession
  .builder()
  .config("spark.hadoop.fs.oss.accessKeyId", "*")
  .config("spark.hadoop.fs.oss.accessKeySecret", "*")
  .config("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-zmf.aliyuncs.com")
  .appName("SparkUnstructuredDataCompute")
  .getOrCreate()
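To give a feel for what an unstructured-data job does, here is a minimal hedged sketch continuing from the session above: it reads a text file from an OSS path via the hadoop-fs-oss filesystem (section 4) and counts its lines. The oss:// path is a placeholder; the repository's SparkUnstructuredDataCompute example may differ.
// read unstructured data from OSS (path is a placeholder) and count lines
val sc = spark.sparkContext
val lines = sc.textFile("oss://your-bucket/path/to/input.txt")
println(s"line count: ${lines.count()}")
lines.take(10).foreach(println)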
Step 1. Build aliyun-cupid-sdk
Step 2. Properly set spark-defaults.conf
Step 3. bin/spark-submit --master yarn-cluster --class \
com.aliyun.odps.spark.examples.oss.SparkUnstructuredDataCompute \
${path to aliyun-cupid-sdk}/spark/spark-2.x/spark-examples/target/spark-examples_2.11-version-shaded.jar
PySpark development sample
Required files
If you need to access MaxCompute tables, refer to section 3 (dependencies required to access the MaxCompute table) and compile the datasource package.
SparkSQL application example (spark1.6)
from pyspark import SparkContext, SparkConf
from pyspark.sql import OdpsContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("odps_pyspark")
    sc = SparkContext(conf=conf)
    sql_context = OdpsContext(sc)

    df = sql_context.sql("select id, value from cupid_wordcount")
    df.printSchema()
    df.show(200)

    df_2 = sql_context.sql("select id, value from cupid_partition_table1 where pt1 = 'part1'")
    df_2.show(200)

    # create and drop a table
    sql_context.sql("create table TestCtas as select * from cupid_wordcount").show()
    sql_context.sql("drop table TestCtas").show()
Submit and run:
./bin/spark-submit \
--jars ${path to odps-spark-datasource_2.10-3.3.2-public.jar} \
example.py
SparkSQL application example (spark2.3)
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.appName("spark sql").getOrCreate()

    df = spark.sql("select id, value from cupid_wordcount")
    df.printSchema()
    df.show(10, 200)

    df_2 = spark.sql("SELECT product,category,revenue FROM (SELECT product,category,revenue, dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank FROM productRevenue) tmp WHERE rank