This article introduces the use of MaxCompute Spark and answers common questions about it. The content is detailed and analyzed from a practical point of view; we hope you find it useful after reading.
MaxCompute Spark is a computing service provided by MaxCompute that is compatible with open source Spark. Built on a unified system of computing resources and dataset permissions, it provides a Spark computing framework that lets users submit and run Spark jobs with the development workflow they are familiar with, supporting richer data processing and analysis scenarios.
1.1 Key Features
Support for native multi-version Spark jobs
Community-native Spark runs inside MaxCompute, is fully compatible with Spark's APIs, and supports running multiple Spark versions at the same time.
Unified computing resources
Like MaxCompute SQL/MR and other task types, Spark jobs run in the unified computing resources opened for the MaxCompute project.
Unified data and permission management
Spark jobs follow the permission model of the MaxCompute project and can only query data within the access rights of the user.
The same experience as open source systems
Provides the native open source real-time Spark UI and query history log functions.
1.2 System Architecture
Native Spark can be run in MaxCompute through the MaxCompute Cupid platform
1.3 Constraints and Limitations
Currently, MaxCompute Spark supports the following applicable scenarios:
Offline computing scenarios: GraphX, MLlib, RDD, Spark SQL, PySpark, etc.
Streaming scenarios
Reading and writing MaxCompute tables
Referencing file resources in MaxCompute
Reading and writing services in a VPC environment, such as services deployed on RDS, Redis, HBase, ECS, etc.
Reading and writing OSS unstructured storage
Usage restrictions:
Interactive usage such as Spark-Shell, Spark-SQL-Shell, PySpark-Shell, etc. is not supported.
Access to MaxCompute external tables, functions and UDFs is not supported.
Only Local mode and Yarn-cluster mode are supported.
2. Setting up the Development Environment
2.1 Running Modes
Submitting via the Spark client
Yarn-Cluster mode: submits tasks to the MaxCompute cluster
Local mode
Submitting via DataWorks
Essentially Yarn-Cluster mode: tasks are submitted to the MaxCompute cluster.
2.2 Submitting via the Client
2.2.1 Yarn-Cluster Mode
Download the MC Spark client
Spark 1.6.3
Spark 2.3.0
Environment variable configuration
## JAVA_HOME configuration
# JDK 1.8 is recommended
export JAVA_HOME=/path/to/jdk
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH

## SPARK_HOME configuration
# Download the MaxCompute Spark client mentioned above and extract it to any local path.
# Do not set SPARK_HOME to the literal path below; it is only for illustration.
# Point SPARK_HOME to the actual extracted path.
export SPARK_HOME=/path/to/spark_extracted_package
export PATH=$SPARK_HOME/bin:$PATH

## PySpark: configure the Python version
export PATH=/path/to/python/bin/:$PATH
Parameter configuration
Rename $SPARK_HOME/conf/spark-defaults.conf.template to spark-defaults.conf
Refer to the following for parameter configuration
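A minimal spark-defaults.conf sketch follows, assuming the spark.hadoop.odps.* property names commonly used by the MaxCompute Spark client; all values are placeholders and should be verified against the official documentation and your own project and region.

# MaxCompute account and project (placeholder values)
spark.hadoop.odps.project.name = your_project_name
spark.hadoop.odps.access.id = your_access_id
spark.hadoop.odps.access.key = your_access_key

# MaxCompute endpoints (placeholder values; depend on your region)
spark.hadoop.odps.end.point = http://service.cn.maxcompute.aliyun.com/api
spark.hadoop.odps.runtime.end.point = http://service.cn.maxcompute.aliyun-inc.com/api

# Use the ODPS catalog for Spark SQL (assumption based on the error message discussed in section 5)
spark.sql.catalogImplementation = odps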
Prepare the example project
git clone https://github.com/aliyun/MaxCompute-Spark.git
cd MaxCompute-Spark/spark-2.x
mvn clean package
Task submission
# Submit in a bash environment
cd $SPARK_HOME
bin/spark-submit --master yarn-cluster \
  --class com.aliyun.odps.spark.examples.SparkPi \
  /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar

# Submit in a Windows environment
cd $SPARK_HOME/bin
spark-submit.cmd --master yarn-cluster --class com.aliyun.odps.spark.examples.SparkPi \path\to\MaxCompute-Spark\spark-2.x\target\spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
Notes on debugging in IDEA
When running Local mode in IDEA, the configuration in spark-defaults.conf cannot be referenced directly; the relevant configuration must be specified manually in the code (see the sketch below).
Be sure to manually add the MaxCompute Spark client dependencies (the jars directory) in IDEA, otherwise the following error occurs:
the value of spark.sql.catalogImplementation should be one of hive, in-memory but was odps
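A minimal Scala sketch of setting the configuration in code for Local mode, assuming the MaxCompute Spark client jars are on the classpath and the spark.hadoop.odps.* property names; the credentials and endpoint are placeholders.

import org.apache.spark.sql.SparkSession

object LocalModeExample {
  def main(args: Array[String]): Unit = {
    // In Local mode, spark-defaults.conf is not picked up,
    // so the MaxCompute-related settings are passed in code instead.
    val spark = SparkSession
      .builder()
      .appName("LocalModeExample")
      .master("local[4]")
      // Property names are assumptions; replace the placeholder values.
      .config("spark.hadoop.odps.project.name", "your_project")
      .config("spark.hadoop.odps.access.id", "your_access_id")
      .config("spark.hadoop.odps.access.key", "your_access_key")
      .config("spark.hadoop.odps.end.point", "your_endpoint")
      .config("spark.sql.catalogImplementation", "odps")
      .getOrCreate()

    spark.sql("SELECT 1").show()
    spark.stop()
  }
}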
Reference documentation
2.3 Submitting via DataWorks
2.3.1 Resource Upload
Essentially, the configuration of the MC Spark node corresponds to the parameters and options of the spark-submit command
ODPS SPARK node              spark-submit
Main Java/Python resource    app jar or python file
Configuration item           --conf PROP=VALUE
Main Class                   --class CLASS_NAME
Parameters                   [app arguments]
Select JAR resource          --jars JARS
Select Python resource       --py-files PY_FILES
Select File resource         --files FILES
Select Archives resource     --archives
Upload resources:
0~50MB: resources can be created and uploaded directly in the DataWorks interface.
50MB~500MB: upload with the MaxCompute client (CMD) first, then add the resource to data development in the DataWorks interface; refer to the documentation (a command sketch follows this list).
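A minimal sketch of uploading a jar with the MaxCompute client (odpscmd); the path and resource name are placeholders, and the exact syntax should be checked against the client documentation.

-- inside odpscmd (placeholder path)
add jar /path/to/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar -f;
-- verify that the resource is visible in the project
list resources;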
Resource reference:
After the resource is uploaded and submitted, the required resources (jar/python/file/archive) can be selected in the DataWorks Spark node interface.
When the task runs, resource files are by default placed in the current working directory of the Driver and of each Executor (see the sketch below for reading such a file).
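A minimal Scala sketch of reading a selected File resource from the working directory at runtime; the file name config.json is a hypothetical example.

import scala.io.Source

// The selected File resource is placed in the current working directory of the
// Driver and of each Executor, so it can be opened with a relative path.
val source = Source.fromFile("config.json")   // hypothetical resource name
try {
  val content = source.mkString
  println(content)
} finally {
  source.close()
}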
2.3.2 Parameters and Configuration
Spark configuration items: correspond to the --conf option of the spark-submit command.
accessid, accesskey, projectname, endpoint, runtime.end.point and task.major.version do not need to be configured.
In addition, the configurations in spark-defaults.conf need to be added to the DataWorks configuration items one by one.
Passing parameters to the main class (such as bizdate):
First add the parameter under Scheduling -> Parameters, then reference it in the Parameters column of the Spark node. Multiple parameters are separated by spaces.
These parameters are passed to the user's main class, where the user can parse them in code (see the sketch below).
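A minimal Scala sketch of parsing such a parameter in the main class, assuming a single bizdate value is passed in the node's Parameters column.

object SparkMainWithArgs {
  def main(args: Array[String]): Unit = {
    // DataWorks passes the values from the Parameters column as ordinary
    // program arguments, separated by spaces.
    val bizdate = if (args.nonEmpty) args(0) else sys.error("missing bizdate argument")
    println(s"running for bizdate=$bizdate")
    // ... build the SparkSession and run the job using bizdate ...
  }
}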
Reference documentation
3. Configuration
3.1 Where Configuration Is Set
3.1.1 Locations of Spark Configuration
When using MaxCompute Spark, there are usually several places where users can add Spark configuration:
Location 1: spark-defaults.conf, i.e. the Spark configuration the user adds to the spark-defaults.conf file when submitting via the client.
Location 2: the DataWorks configuration items, i.e. the Spark configuration the user adds in the configuration items when submitting via DataWorks; this configuration is eventually added to Location 3.
Location 3: the --conf option of the spark-submit command in the startup script.
Location 4: the Spark configuration the user sets when initializing the SparkContext in user code.
Priority of Spark configuration:
user code > spark-submit options > spark-defaults.conf > spark-env.sh > default values
3.1.2 Two Kinds of Configuration to Distinguish
The first kind must be configured in spark-defaults.conf or in the DataWorks configuration items to take effect (it is required before the task is submitted) and must not be set in user code. The main characteristics of this kind of configuration are:
Related to the MaxCompute/Cupid platform: the parameter name usually contains odps or cupid, and these parameters are related to task submission and resource application.
Resources such as driver memory, cores, disk and MaxCompute resources are acquired before the task starts executing. If these parameters are set in the code, the platform cannot read them, so they must not be configured in code.
Some of these parameters, even if configured in code, will not cause the task to fail, but will simply not take effect.
Some of these parameters, if configured in code, may cause side effects, for example setting spark.master to local in Yarn-cluster mode.
Parameters for accessing a VPC:
These parameters are also platform-related; the network connection is established when the task is submitted.
The second kind of configuration can take effect in all of these locations, but configuration in code has the highest priority.
It is recommended to set the parameters related to task running and tuning in code, and to put the configurations related to resources and the platform in spark-defaults.conf or in the DataWorks configuration items (see the sketch below).
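A minimal Scala sketch of this split, with hypothetical parameter values: tuning parameters are set in code when building the SparkSession, while platform and resource parameters stay in spark-defaults.conf or the DataWorks configuration items.

import org.apache.spark.sql.SparkSession

object ConfigSplitExample {
  def main(args: Array[String]): Unit = {
    // Tuning parameters (second kind): set in code, where they have the highest priority.
    val spark = SparkSession
      .builder()
      .appName("TuningInCode")
      .config("spark.sql.shuffle.partitions", "200")   // example tuning value
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    // Platform/resource parameters (first kind), e.g. spark.executor.instances,
    // spark.executor.memory or spark.hadoop.odps.* settings, belong in
    // spark-defaults.conf or in the DataWorks configuration items, not here.
    spark.stop()
  }
}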
3.2 Resource-related Parameters
3.3 Platform-related Parameters
4. Job Diagnostics
4.1 Logview
4.1.1 Introduction to Logview
A Logview link is printed in the log when the task is submitted (keyword: logview url).
The StdErr of the Master and Workers contains the log output of the Spark engine, while the output the user's job writes to the console is printed in StdOut.
4.1.2 Troubleshooting with Logview
After obtaining the Logview, first look at the errors reported by the Driver, which usually contain the key errors.
If the Driver reports that a class or method cannot be found, it is usually a jar package problem.
If the Driver reports a connection timeout to an external VPC service or to OSS, check the related parameter configuration.
If the Driver reports errors such as being unable to connect to an Executor or unable to find a Chunk, the Executor has usually exited early; check the Executor's errors further, as there may be an OOM.
Sorting by End Time: the earlier the end time, the more likely that Executor node is where the problem occurred.
Sorting by Latency: Latency represents the lifetime of the Executor; the shorter the lifetime, the more likely the root cause lies there.
The usage of the Spark UI is the same as in the community version; refer to the documentation.
Note:
The Spark UI requires authentication; only the Owner who submitted the task can open it.
The Spark UI can only be opened while the job is running. Once the task has finished, the Spark UI can no longer be opened, and the Spark History Server UI should be checked instead.
5. FAQ
1. Problems running in Local mode
Question 1: the value of spark.sql.catalogImplementation should be one of hive, in-memory but was odps
The reason is that the jars directory of the MaxCompute Spark client was not added to the classpath as described in the documentation, so the community version of the Spark package was loaded. Add the jars directory to the classpath according to the documentation.
Question 2: IDEA Local mode cannot directly reference the configuration in spark-defaults.conf; the Spark configuration items must be written in the code.
Question 3: accessing OSS and VPC:
Local mode runs in the user's local environment, where the network is not isolated, whereas Yarn-cluster mode runs in MaxCompute's isolated network environment, so the parameters for VPC access must be configured.
In Local mode, the endpoint used to access OSS is usually the public network endpoint, while in Yarn-cluster mode the endpoint used to access OSS from the VPC is the classic network (internal) endpoint; a configuration sketch follows.
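A minimal configuration sketch of this endpoint difference, assuming the Aliyun OSS Hadoop connector property name fs.oss.endpoint (with the spark.hadoop. prefix) and a hypothetical cn-hangzhou region; verify the exact keys and endpoints against the official documentation.

# Local mode: public network endpoint (hypothetical region)
spark.hadoop.fs.oss.endpoint = oss-cn-hangzhou.aliyuncs.com

# Yarn-cluster mode: classic network / internal endpoint (hypothetical region)
spark.hadoop.fs.oss.endpoint = oss-cn-hangzhou-internal.aliyuncs.com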
2. Jar package problems
Java/Scala programs often run into missing classes or class conflicts:
Class conflict: the user's jar package conflicts with the jar packages of Spark or of the platform.
Class not found: the user's jar package was not built as a fat jar, or the class is missing because of a class conflict.
Pay attention to the following:
The difference between the provided and compile dependency scopes (see the pom sketch after this list):
provided: the code depends on the jar package, but it is only needed at compile time, not at run time; at run time the corresponding jar package is found in the cluster.
compile: the code depends on the jar package at both compile time and run time; these jar packages do not exist in the cluster and must be packed into the user's jar. They are generally third-party libraries related to the user's code logic rather than to the operation of Spark.
The jar package submitted by the user must be a fat jar:
All compile-scoped dependencies must be packed into the user's jar package, so that these classes can be loaded at run time.
Jar packages that must be set to provided:
Jar packages whose groupId is org.apache.spark
Platform-related jar packages:
cupid-sdk
hadoop-yarn-client
odps-sdk
Jar packages that must be set to compile:
OSS-related jar packages:
hadoop-fs-oss
Jar packages used to access other services, such as mysql and hbase
Third-party libraries referenced by user code
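A minimal Maven pom sketch of the provided/compile split; the version numbers are hypothetical and the exact artifact coordinates should be checked against the official documentation.

<dependencies>
  <!-- provided: available in the cluster, not packed into the fat jar -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.0</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>cupid-sdk</artifactId>
    <version>3.3.8-public</version>   <!-- hypothetical version -->
    <scope>provided</scope>
  </dependency>

  <!-- compile (the default scope): packed into the user's fat jar -->
  <dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.47</version>         <!-- hypothetical version -->
  </dependency>
</dependencies>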
3. Importing Python packages
Users often need to use external Python dependencies.
First of all, we recommend using the packaged public resources, which include some commonly used libraries for data processing and computing as well as third-party libraries for connecting to external services (mysql, redis, hbase):
## Public resource python2.7.13
spark.hadoop.odps.cupid.resources = public.python-2.7.13-ucs4.tar.gz
spark.pyspark.python = ./public.python-2.7.13-ucs4.tar.gz/python-2.7.13-ucs4/bin/python

## Public resource python3.7.9
spark.hadoop.odps.cupid.resources = public.python-3.7.9-ucs4.tar.gz
spark.pyspark.python = ./public.python-3.7.9-ucs4.tar.gz/python-3.7.9-ucs4/bin/python3

These are the MaxCompute Spark usage notes and FAQs shared in this article. If you have run into similar questions, the analysis above may help you work through them.