In this issue, the editor brings you a sample analysis of MaxCompute Spark development. The article is rich in content and analyzed from a professional point of view; I hope you gain something after reading it.
MaxCompute Spark development
0. Overview
MaxCompute Spark is an open-source-Spark-compatible computing service provided by MaxCompute. It provides a Spark computing framework on top of a unified system of computing-resource and dataset permissions, allowing users to submit and run Spark jobs in a familiar way of development and use, so as to cover richer data processing and analysis scenarios.
The following focuses on the application scenarios that MaxCompute Spark supports, as well as the dependencies and environment preparation needed for development, with emphasis on developing Spark jobs, submitting them to the MaxCompute cluster for execution, and diagnosing them.
1. Prerequisites
MaxCompute Spark is a Spark on MaxCompute solution provided by Aliyun, which allows Spark applications to run in a hosted MaxCompute computing environment. In order to run Spark jobs securely in the MaxCompute environment, MaxCompute provides the following SDK and MaxCompute Spark custom distribution packages.
SDK: positioned to connect open-source applications to MaxCompute.
It provides the API descriptions and related demos needed for integration. Users can build their own applications based on the Spark-1.x and Spark-2.x example projects provided in the repository, and submit them to the MaxCompute cluster.
MaxCompute Spark client release packages:
Integrated with MaxCompute authentication, they serve as client tools for submitting jobs to a MaxCompute project to run via spark-submit. Two release packages are currently provided, for Spark 1.x and Spark 2.x: spark-1.6.3 and spark-2.3.0. The SDK can be referenced by configuring Maven dependencies during development, while the Spark client must be downloaded in advance according to the Spark version you develop against: for Spark 1.x applications, download the spark-1.6.3 client; for Spark 2.x applications, download the spark-2.3.0 client.
2. Development environment preparation
2.1 MaxCompute Spark client preparation
MaxCompute Spark release packages: integrated with MaxCompute authentication, they serve as client tools for submitting jobs to a MaxCompute project to run via spark-submit. Two release packages, for Spark 1.x and Spark 2.x, are currently provided:
spark-1.6.3
spark-2.3.0
Choose and download the appropriate version of the MaxCompute Spark distribution package according to the Spark version you develop against, then extract it.
2.2 Set environment variables
JAVA_HOME setting
# use JDK 1.7+; JDK 1.8+ is best
export JAVA_HOME=/path/to/jdk
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH
SPARK_HOME setting
export SPARK_HOME=/path/to/spark_extracted_package
export PATH=$SPARK_HOME/bin:$PATH
2.3 Set spark-defaults.conf
There is a spark-defaults.conf.template file under $SPARK_HOME/conf, which can be used as a template for spark-defaults.conf. You need to set the MaxCompute account information in this file before you can submit Spark jobs to MaxCompute. The default configuration is shown below; fill in the blanks according to your actual account information, and the rest of the configuration can remain unchanged.
# MaxCompute account information
spark.hadoop.odps.project.name =
spark.hadoop.odps.access.id =
spark.hadoop.odps.access.key =
# the following configurations remain unchanged
spark.sql.catalogImplementation=odps
spark.hadoop.odps.task.major.version = cupid_v2
spark.hadoop.odps.cupid.container.image.enable = true
spark.hadoop.odps.cupid.container.vm.engine.type = hyper
spark.hadoop.odps.end.point = http://service.cn.maxcompute.aliyun.com/api
spark.hadoop.odps.runtime.end.point = http://service.cn.maxcompute.aliyun-inc.com/api
3. Dependencies required to access the MaxCompute table
If the job needs to access MaxCompute tables, it must depend on the odps-spark-datasource module. This section describes how to compile and install that dependency into the local Maven repository; if you do not need table access, you can skip it.
Clone the code; the GitHub address is https://github.com/aliyun/aliyun-cupid-sdk/tree/3.3.2-public
# git clone git@github.com:aliyun/aliyun-cupid-sdk.git
Compile the modules
# cd ${path to aliyun-cupid-sdk}
# git checkout 3.3.2-public
// compile and install cupid-sdk
# cd ${path to aliyun-cupid-sdk}/core/cupid-sdk/
# mvn clean install -DskipTests
// compile and install datasource (depends on cupid-sdk)
// for spark-2.x
# cd ${path to aliyun-cupid-sdk}/spark/spark-2.x/datasource
# mvn clean install -DskipTests
// for spark-1.x
# cd ${path to aliyun-cupid-sdk}/spark/spark-1.x/datasource
# mvn clean install -DskipTests
Add dependency
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>odps-spark-datasource_2.10</artifactId>
    <version>3.3.2-public</version>
</dependency>
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>odps-spark-datasource_2.11</artifactId>
    <version>3.3.2-public</version>
</dependency>
4. OSS dependency
If the job needs to access OSS, simply add the following dependency:
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>hadoop-fs-oss</artifactId>
    <version>3.3.2-public</version>
</dependency>
5. Application development
MaxCompute provides two application-build templates on which users can base their development. The whole project is then built as a unit, and the generated application package can be submitted directly to the MaxCompute cluster to run the Spark application.
5.1 Build applications from templates
MaxCompute Spark provides two application-build templates; to use them, first clone the code.
# git clone git@github.com:aliyun/aliyun-cupid-sdk.git
# cd aliyun-cupid-sdk
# git checkout 3.3.2-public
# cd archetypes
// for Spark-1.x
sh Create-AliSpark-1.x-APP.sh spark-1.x-demo /tmp
// for Spark-2.x
sh Create-AliSpark-2.x-APP.sh spark-2.x-demo /tmp
The above command creates a Maven project named spark-1.x-demo (or spark-2.x-demo) under the /tmp directory. Execute the following commands to compile and submit the job:
# cd /tmp/spark-2.x-demo
# mvn clean package
// submit the job
$SPARK_HOME/bin/spark-submit \
--master yarn-cluster \
--class SparkPi \
/tmp/spark-2.x-demo/target/AliSpark-2.x-quickstart-1.0-SNAPSHOT-shaded.jar
# Usage: sh Create-AliSpark-2.x-APP.sh
sh Create-AliSpark-2.x-APP.sh spark-2.x-demo /tmp/
cd /tmp/spark-2.x-demo
mvn clean package
# smoke test
# 1 use the compiled shaded jar package
# 2 download the MaxCompute Spark client as described in this document
# 3 refer to the "Set environment variables" section and fill in the relevant configuration items of the MaxCompute project
# then execute the spark-submit command as follows
$SPARK_HOME/bin/spark-submit \
--master yarn-cluster \
--class SparkPi \
/tmp/spark-2.x-demo/target/AliSpark-2.x-quickstart-1.0-SNAPSHOT-shaded.jar
5.2 Java/Scala development samples
Spark-1.x
Notes on pom.xml
Note: because applications are submitted with the Spark client provided by MaxCompute, pay attention to the scope of certain dependencies when building them:
All packages released by the Spark community, such as spark-core and spark-sql, use provided scope.
odps-spark-datasource uses the default compile scope.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>odps-spark-datasource_${scala.binary.version}</artifactId>
    <version>3.3.2-public</version>
</dependency>
Case description
WordCount
Detailed code
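For orientation only, here is a minimal WordCount sketch in the Spark-1.x RDD style. It simply counts words in a small in-memory collection; the actual com.aliyun.odps.spark.examples.WordCount class in the repository may differ in detail.
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountSketch")
    val sc = new SparkContext(conf)
    // count words in a small in-memory collection
    val lines = sc.parallelize(Seq("hello maxcompute", "hello spark"))
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach { case (word, cnt) => println(s"$word: $cnt") }
    sc.stop()
  }
}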
Submission mode
Step 1. Build aliyun-cupid-sdk
Step 2. Properly set spark-defaults.conf
Step 3. bin/spark-submit --master yarn-cluster --class \
com.aliyun.odps.spark.examples.WordCount \
${path to aliyun-cupid-sdk}/spark/spark-1.x/spark-examples/target/spark-examples_2.10-version-shaded.jar
Spark-SQL on MaxCompute Table
Detailed code
Submission mode
# running may report a Table Not Found exception, because the user's MaxCompute project does not contain the table specified in the code
# you can refer to the various APIs in the code to implement a SparkSQL application for your own tables
Step 1. Build aliyun-cupid-sdk
Step 2. Properly set spark-defaults.conf
Step 3. bin/spark-submit --master yarn-cluster --class \
com.aliyun.odps.spark.examples.sparksql.SparkSQL \
${path to aliyun-cupid-sdk}/spark/spark-1.x/spark-examples/target/spark-examples_2.10-version-shaded.jar
GraphX PageRank
Detailed code
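As a rough illustration of what such a job looks like, the sketch below builds a tiny graph with the standard GraphX API and runs PageRank on it. The repository's com.aliyun.odps.spark.examples.graphx.PageRank example may be structured differently.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PageRankSketch"))
    // a tiny directed graph: 1 -> 2, 2 -> 3, 3 -> 1, 1 -> 3
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1), Edge(1L, 3L, 1)))
    val graph = Graph.fromEdges(edges, defaultValue = 1)
    // run PageRank until convergence within the given tolerance
    val ranks = graph.pageRank(0.0001).vertices
    ranks.collect().foreach { case (id, rank) => println(s"vertex $id: rank $rank") }
    sc.stop()
  }
}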
Submission mode
Step 1. Build aliyun-cupid-sdk
Step 2. Properly set spark-defaults.conf
Step 3. bin/spark-submit --master yarn-cluster --class \
com.aliyun.odps.spark.examples.graphx.PageRank \
${path to aliyun-cupid-sdk}/spark/spark-1.x/spark-examples/target/spark-examples_2.10-version-shaded.jar
Mllib Kmeans-ON-OSS
Detailed code
Submission mode
# the OSS account information in the code needs to be filled in before compiling and submitting
conf.set("spark.hadoop.fs.oss.accessKeyId", "*")
conf.set("spark.hadoop.fs.oss.accessKeySecret", "*")
conf.set("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-zmf.aliyuncs.com")
Step 1. Build aliyun-cupid-sdk
Step 2. Properly set spark-defaults.conf
Step 3. bin/spark-submit --master yarn-cluster --class \
com.aliyun.odps.spark.examples.mllib.KmeansModelSaveToOss \
${path to aliyun-cupid-sdk}/spark/spark-1.x/spark-examples/target/spark-examples_2.10-version-shaded.jar
OSS UnstructuredData
Detailed code
Submission mode
# the OSS account information in the code needs to be filled in before compiling and submitting
conf.set("spark.hadoop.fs.oss.accessKeyId", "*")
conf.set("spark.hadoop.fs.oss.accessKeySecret", "*")
conf.set("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-zmf.aliyuncs.com")
Step 1. Build aliyun-cupid-sdk
Step 2. Properly set spark-defaults.conf
Step 3. bin/spark-submit --master yarn-cluster --class \
com.aliyun.odps.spark.examples.oss.SparkUnstructuredDataCompute \
${path to aliyun-cupid-sdk}/spark/spark-1.x/spark-examples/target/spark-examples_2.10-version-shaded.jar
Spark-2.x
Notes on pom.xml
Note: because applications are submitted with the Spark client provided by MaxCompute, pay attention to the scope of certain dependencies when building them:
All packages released by the Spark community, such as spark-core and spark-sql, use provided scope.
odps-spark-datasource uses the default compile scope.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>cupid-sdk</artifactId>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>odps-spark-datasource_${scala.binary.version}</artifactId>
    <version>3.3.2-public</version>
</dependency>
Case description
WordCount
Detailed code
Submission mode
Step 1. Build aliyun-cupid-sdk
Step 2. Properly set spark-defaults.conf
Step 3. bin/spark-submit --master yarn-cluster --class \
com.aliyun.odps.spark.examples.WordCount \
${path to aliyun-cupid-sdk}/spark/spark-2.x/spark-examples/target/spark-examples_2.11-version-shaded.jar
Spark-SQL operates the MaxCompute table
Detailed code
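For orientation, here is a minimal hedged sketch of a Spark-2.x SQL job against a MaxCompute table. It assumes spark.sql.catalogImplementation=odps is set in spark-defaults.conf (see section 2.3) and that a table named cupid_wordcount exists in the project, mirroring the PySpark example later in this article; the repository's SparkSQL example may differ.
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    // with spark.sql.catalogImplementation=odps, spark.sql() resolves MaxCompute tables directly
    val spark = SparkSession.builder().appName("SparkSqlSketch").getOrCreate()
    val df = spark.sql("select id, value from cupid_wordcount")
    df.printSchema()
    df.show(10)
    spark.stop()
  }
}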
Submission mode
# running may report a Table Not Found exception, because the user's MaxCompute project does not contain the table specified in the code
# you can refer to the various APIs in the code to implement a SparkSQL application for your own tables
Step 1. Build aliyun-cupid-sdk
Step 2. Properly set spark-defaults.conf
Step 3. bin/spark-submit --master yarn-cluster --class \
com.aliyun.odps.spark.examples.sparksql.SparkSQL \
${path to aliyun-cupid-sdk}/spark/spark-2.x/spark-examples/target/spark-examples_2.11-version-shaded.jar
GraphX PageRank
Detailed code
Submission mode
Step 1. Build aliyun-cupid-sdk
Step 2. Properly set spark-defaults.conf
Step 3. bin/spark-submit --master yarn-cluster --class \
com.aliyun.odps.spark.examples.graphx.PageRank \
${path to aliyun-cupid-sdk}/spark/spark-2.x/spark-examples/target/spark-examples_2.11-version-shaded.jar
Mllib Kmeans-ON-OSS
KmeansModelSaveToOss
Detailed code
Submission mode
# the OSS account information in the code needs to be filled in before compiling and submitting
val spark = SparkSession
  .builder()
  .config("spark.hadoop.fs.oss.accessKeyId", "*")
  .config("spark.hadoop.fs.oss.accessKeySecret", "*")
  .config("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-zmf.aliyuncs.com")
  .appName("KmeansModelSaveToOss")
  .getOrCreate()
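Continuing from the builder above, here is a hedged sketch of the rest of such a job: it trains a small KMeans model with the standard MLlib API and saves it to an OSS path. The oss:// bucket path is a placeholder, and the actual KmeansModelSaveToOss example in the repository may differ.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// train a small KMeans model and persist it to OSS (bucket path is a placeholder)
val sc = spark.sparkContext
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))
val model = KMeans.train(points, 2, 20)
model.save(sc, "oss://your-bucket/kmeans-model")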
Step 1. Build aliyun-cupid-sdk
Step 2. Properly set spark-defaults.conf
Step 3. bin/spark-submit --master yarn-cluster --class \
com.aliyun.odps.spark.examples.mllib.KmeansModelSaveToOss \
${path to aliyun-cupid-sdk}/spark/spark-2.x/spark-examples/target/spark-examples_2.11-version-shaded.jar
OSS UnstructuredData
SparkUnstructuredDataCompute
Detailed code
Submission mode
# the OSS account information in the code needs to be filled in before compiling and submitting
val spark = SparkSession
  .builder()
  .config("spark.hadoop.fs.oss.accessKeyId", "*")
  .config("spark.hadoop.fs.oss.accessKeySecret", "*")
  .config("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-zmf.aliyuncs.com")
  .appName("SparkUnstructuredDataCompute")
  .getOrCreate()
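To give a feel for what an unstructured-data job does, here is a minimal hedged sketch continuing from the session above: it reads a text file from an OSS path via the hadoop-fs-oss filesystem (section 4) and counts its lines. The oss:// path is a placeholder; the repository's SparkUnstructuredDataCompute example may differ.
// read unstructured data from OSS (path is a placeholder) and count lines
val sc = spark.sparkContext
val lines = sc.textFile("oss://your-bucket/path/to/input.txt")
println(s"line count: ${lines.count()}")
lines.take(10).foreach(println)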
Step 1. Build aliyun-cupid-sdk
Step 2. Properly set spark-defaults.conf
Step 3. bin/spark-submit --master yarn-cluster --class \
com.aliyun.odps.spark.examples.oss.SparkUnstructuredDataCompute \
${path to aliyun-cupid-sdk}/spark/spark-2.x/spark-examples/target/spark-examples_2.11-version-shaded.jar
PySpark development sample
Required files
If you need to access MaxCompute tables, refer to section 3 (dependencies required to access the MaxCompute table) and compile the datasource package.
SparkSQL application example (spark1.6)
from pyspark import SparkContext, SparkConf
from pyspark.sql import OdpsContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("odps_pyspark")
    sc = SparkContext(conf=conf)
    sql_context = OdpsContext(sc)

    df = sql_context.sql("select id, value from cupid_wordcount")
    df.printSchema()
    df.show(200)

    df_2 = sql_context.sql("select id, value from cupid_partition_table1 where pt1 = 'part1'")
    df_2.show(200)

    # create and drop a table
    sql_context.sql("create table TestCtas as select * from cupid_wordcount").show()
    sql_context.sql("drop table TestCtas").show()
Submit and run:
./bin/spark-submit \
--jars ${path to odps-spark-datasource_2.10-3.3.2-public.jar} \
example.py
SparkSQL application example (spark2.3)
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.appName("spark sql").getOrCreate()

    df = spark.sql("select id, value from cupid_wordcount")
    df.printSchema()
    df.show(10, 200)

    df_2 = spark.sql("SELECT product,category,revenue FROM (SELECT product,category,revenue, dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank FROM productRevenue) tmp WHERE rank