This article introduces the use of MaxCompute Spark and answers common questions about it. The content is detailed and analyzed from a practical point of view; we hope you find it useful after reading.
MaxCompute Spark is a computing service provided by MaxCompute that is compatible with open source Spark. Built on a unified system of computing resources and dataset permissions, it provides a Spark computing framework that lets users submit and run Spark jobs with the development workflow they are familiar with, supporting richer data processing and analysis scenarios.
1.1 Key Features
Support for native multi-version Spark jobs
Community-native Spark runs inside MaxCompute, is fully compatible with Spark's APIs, and supports running multiple Spark versions at the same time.
Unified computing resources
Like MaxCompute SQL/MR and other task types, Spark jobs run in the unified computing resources opened for the MaxCompute project.
Unified data and permission management
Spark jobs follow the permission model of the MaxCompute project and can only query data within the access rights of the user.
The same experience as open source systems
Provides the native open source real-time Spark UI and query history log functions.
1.2 System Architecture
Native Spark can be run in MaxCompute through the MaxCompute Cupid platform
1.3 Constraints and Limitations
Currently, MaxCompute Spark supports the following applicable scenarios:
Offline computing scenarios: GraphX, MLlib, RDD, Spark SQL, PySpark, etc.
Streaming scenarios
Reading and writing MaxCompute tables
Referencing file resources in MaxCompute
Reading and writing services in a VPC environment, such as services deployed on RDS, Redis, HBase, ECS, etc.
Reading and writing OSS unstructured storage
Usage restrictions:
Interactive usage such as Spark-Shell, Spark-SQL-Shell, PySpark-Shell, etc. is not supported.
Access to MaxCompute external tables, functions and UDFs is not supported.
Only Local mode and Yarn-cluster mode are supported.
2. Setting up the Development Environment
2.1 Running Modes
Submitting via the Spark client
Yarn-Cluster mode: submits tasks to the MaxCompute cluster
Local mode
Submitting via DataWorks
Essentially Yarn-Cluster mode: tasks are submitted to the MaxCompute cluster.
2.2 Submitting via the Client
2.2.1 Yarn-Cluster Mode
Download the MC Spark client
Spark 1.6.3
Spark 2.3.0
Environment variable configuration
## JAVA_HOME configuration
# JDK 1.8 is recommended
export JAVA_HOME=/path/to/jdk
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH

## SPARK_HOME configuration
# Download the MaxCompute Spark client mentioned above and extract it to any local path.
# Do not set SPARK_HOME to the literal path below; it is only for illustration.
# Point SPARK_HOME to the actual extracted path.
export SPARK_HOME=/path/to/spark_extracted_package
export PATH=$SPARK_HOME/bin:$PATH

## PySpark: configure the Python version
export PATH=/path/to/python/bin/:$PATH
Parameter configuration
Rename $SPARK_HOME/conf/spark-defaults.conf.template to spark-defaults.conf
Refer to the following for parameter configuration
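A minimal spark-defaults.conf sketch follows, assuming the spark.hadoop.odps.* property names commonly used by the MaxCompute Spark client; all values are placeholders and should be verified against the official documentation and your own project and region.

# MaxCompute account and project (placeholder values)
spark.hadoop.odps.project.name = your_project_name
spark.hadoop.odps.access.id = your_access_id
spark.hadoop.odps.access.key = your_access_key

# MaxCompute endpoints (placeholder values; depend on your region)
spark.hadoop.odps.end.point = http://service.cn.maxcompute.aliyun.com/api
spark.hadoop.odps.runtime.end.point = http://service.cn.maxcompute.aliyun-inc.com/api

# Use the ODPS catalog for Spark SQL (assumption based on the error message discussed in section 5)
spark.sql.catalogImplementation = odps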
Prepare the example project
git clone https://github.com/aliyun/MaxCompute-Spark.git
cd MaxCompute-Spark/spark-2.x
mvn clean package
Task submission
# Submit in a bash environment
cd $SPARK_HOME
bin/spark-submit --master yarn-cluster \
  --class com.aliyun.odps.spark.examples.SparkPi \
  /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar

# Submit in a Windows environment
cd $SPARK_HOME/bin
spark-submit.cmd --master yarn-cluster --class com.aliyun.odps.spark.examples.SparkPi \path\to\MaxCompute-Spark\spark-2.x\target\spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
Notes on debugging in IDEA
When running Local mode in IDEA, the configuration in spark-defaults.conf cannot be referenced directly; the relevant configuration must be specified manually in the code (see the sketch below).
Be sure to manually add the MaxCompute Spark client dependencies (the jars directory) in IDEA, otherwise the following error occurs:
the value of spark.sql.catalogImplementation should be one of hive, in-memory but was odps
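A minimal Scala sketch of setting the configuration in code for Local mode, assuming the MaxCompute Spark client jars are on the classpath and the spark.hadoop.odps.* property names; the credentials and endpoint are placeholders.

import org.apache.spark.sql.SparkSession

object LocalModeExample {
  def main(args: Array[String]): Unit = {
    // In Local mode, spark-defaults.conf is not picked up,
    // so the MaxCompute-related settings are passed in code instead.
    val spark = SparkSession
      .builder()
      .appName("LocalModeExample")
      .master("local[4]")
      // Property names are assumptions; replace the placeholder values.
      .config("spark.hadoop.odps.project.name", "your_project")
      .config("spark.hadoop.odps.access.id", "your_access_id")
      .config("spark.hadoop.odps.access.key", "your_access_key")
      .config("spark.hadoop.odps.end.point", "your_endpoint")
      .config("spark.sql.catalogImplementation", "odps")
      .getOrCreate()

    spark.sql("SELECT 1").show()
    spark.stop()
  }
}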
Reference documentation
2.3 Submitting via DataWorks
2.3.1 Resource Upload
Essentially, the configuration of the MC Spark node corresponds to the parameters and options of the spark-submit command
ODPS SPARK node              spark-submit
Main Java/Python resource    app jar or python file
Configuration item           --conf PROP=VALUE
Main Class                   --class CLASS_NAME
Parameters                   [app arguments]
Select JAR resource          --jars JARS
Select Python resource       --py-files PY_FILES
Select File resource         --files FILES
Select Archives resource     --archives
Upload resources:
0~50MB: resources can be created and uploaded directly in the DataWorks interface.
50MB~500MB: upload with the MaxCompute client (CMD) first, then add the resource to data development in the DataWorks interface; refer to the documentation (a command sketch follows this list).
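A minimal sketch of uploading a jar with the MaxCompute client (odpscmd); the path and resource name are placeholders, and the exact syntax should be checked against the client documentation.

-- inside odpscmd (placeholder path)
add jar /path/to/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar -f;
-- verify that the resource is visible in the project
list resources;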
Resource reference:
After the resource is uploaded and submitted, the required resources (jar/python/file/archive) can be selected in the DataWorks Spark node interface.
When the task runs, resource files are by default placed in the current working directory of the Driver and of each Executor (see the sketch below for reading such a file).
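A minimal Scala sketch of reading a selected File resource from the working directory at runtime; the file name config.json is a hypothetical example.

import scala.io.Source

// The selected File resource is placed in the current working directory of the
// Driver and of each Executor, so it can be opened with a relative path.
val source = Source.fromFile("config.json")   // hypothetical resource name
try {
  val content = source.mkString
  println(content)
} finally {
  source.close()
}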
2.3.2 Parameters and Configuration
Spark configuration items: correspond to the --conf option of the spark-submit command.
accessid, accesskey, projectname, endpoint, runtime.end.point and task.major.version do not need to be configured.
In addition, the configurations in spark-defaults.conf need to be added to the DataWorks configuration items one by one.
Passing parameters to the main class (such as bizdate):
First add the parameter under Scheduling -> Parameters, then reference it in the Parameters column of the Spark node. Multiple parameters are separated by spaces.
These parameters are passed to the user's main class, where the user can parse them in code (see the sketch below).
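A minimal Scala sketch of parsing such a parameter in the main class, assuming a single bizdate value is passed in the node's Parameters column.

object SparkMainWithArgs {
  def main(args: Array[String]): Unit = {
    // DataWorks passes the values from the Parameters column as ordinary
    // program arguments, separated by spaces.
    val bizdate = if (args.nonEmpty) args(0) else sys.error("missing bizdate argument")
    println(s"running for bizdate=$bizdate")
    // ... build the SparkSession and run the job using bizdate ...
  }
}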
Reference documentation
3. Configuration
3.1 Where Configuration Is Set
3.1.1 Locations of Spark Configuration
When using MaxCompute Spark, there are usually several places where users can add Spark configuration:
Location 1: spark-defaults.conf, i.e. the Spark configuration the user adds to the spark-defaults.conf file when submitting via the client.
Location 2: the DataWorks configuration items, i.e. the Spark configuration the user adds in the configuration items when submitting via DataWorks; this configuration is eventually added to Location 3.
Location 3: the --conf option of the spark-submit command in the startup script.
Location 4: the Spark configuration the user sets when initializing the SparkContext in user code.
Priority of Spark configuration:
user code > spark-submit options > spark-defaults.conf > spark-env.sh > default values
3.1.2 Two Kinds of Configuration to Distinguish
The first kind must be configured in spark-defaults.conf or in the DataWorks configuration items to take effect (it is required before the task is submitted) and must not be set in user code. The main characteristics of this kind of configuration are:
Related to the MaxCompute/Cupid platform: the parameter name usually contains odps or cupid, and these parameters are related to task submission and resource application.
Resources such as driver memory, cores, disk and MaxCompute resources are acquired before the task starts executing. If these parameters are set in the code, the platform cannot read them, so they must not be configured in code.
Some of these parameters, even if configured in code, will not cause the task to fail, but will simply not take effect.
Some of these parameters, if configured in code, may cause side effects, for example setting spark.master to local in Yarn-cluster mode.
Parameters for accessing a VPC:
These parameters are also platform-related; the network connection is established when the task is submitted.
The second kind of configuration can take effect in all of these locations, but configuration in code has the highest priority.
It is recommended to set the parameters related to task running and tuning in code, and to put the configurations related to resources and the platform in spark-defaults.conf or in the DataWorks configuration items (see the sketch below).
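A minimal Scala sketch of this split, with hypothetical parameter values: tuning parameters are set in code when building the SparkSession, while platform and resource parameters stay in spark-defaults.conf or the DataWorks configuration items.

import org.apache.spark.sql.SparkSession

object ConfigSplitExample {
  def main(args: Array[String]): Unit = {
    // Tuning parameters (second kind): set in code, where they have the highest priority.
    val spark = SparkSession
      .builder()
      .appName("TuningInCode")
      .config("spark.sql.shuffle.partitions", "200")   // example tuning value
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    // Platform/resource parameters (first kind), e.g. spark.executor.instances,
    // spark.executor.memory or spark.hadoop.odps.* settings, belong in
    // spark-defaults.conf or in the DataWorks configuration items, not here.
    spark.stop()
  }
}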
3.2 Resource-related Parameters
3.3 Platform-related Parameters
4. Job Diagnostics
4.1 Logview
4.1.1 Introduction to Logview
A Logview link is printed in the log when the task is submitted (keyword: logview url).
The StdErr of the Master and Workers contains the log output of the Spark engine, while the output the user's job writes to the console is printed in StdOut.
4.1.2 Troubleshooting with Logview
After obtaining the Logview, first look at the errors reported by the Driver, which usually contain the key errors.
If the Driver reports that a class or method cannot be found, it is usually a jar package problem.
If the Driver reports a connection timeout to an external VPC service or to OSS, check the related parameter configuration.
If the Driver reports errors such as being unable to connect to an Executor or unable to find a Chunk, the Executor has usually exited early; check the Executor's errors further, as there may be an OOM.
Sorting by End Time: the earlier the end time, the more likely that Executor node is where the problem occurred.
Sorting by Latency: Latency represents the lifetime of the Executor; the shorter the lifetime, the more likely the root cause lies there.
The usage of the Spark UI is the same as in the community version; refer to the documentation.
Note:
The Spark UI requires authentication; only the Owner who submitted the task can open it.
The Spark UI can only be opened while the job is running. Once the task has finished, the Spark UI can no longer be opened, and the Spark History Server UI should be checked instead.
5. FAQ
1. Problems running in Local mode
Question 1: the value of spark.sql.catalogImplementation should be one of hive, in-memory but was odps
The reason is that the jars directory of the MaxCompute Spark client was not added to the classpath as described in the documentation, so the community version of the Spark package was loaded. Add the jars directory to the classpath according to the documentation.
Question 2: IDEA Local mode cannot directly reference the configuration in spark-defaults.conf; the Spark configuration items must be written in the code.
Question 3: accessing OSS and VPC:
Local mode runs in the user's local environment, where the network is not isolated, whereas Yarn-cluster mode runs in MaxCompute's isolated network environment, so the parameters for VPC access must be configured.
In Local mode, the endpoint used to access OSS is usually the public network endpoint, while in Yarn-cluster mode the endpoint used to access OSS from the VPC is the classic network (internal) endpoint; a configuration sketch follows.
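A minimal configuration sketch of this endpoint difference, assuming the Aliyun OSS Hadoop connector property name fs.oss.endpoint (with the spark.hadoop. prefix) and a hypothetical cn-hangzhou region; verify the exact keys and endpoints against the official documentation.

# Local mode: public network endpoint (hypothetical region)
spark.hadoop.fs.oss.endpoint = oss-cn-hangzhou.aliyuncs.com

# Yarn-cluster mode: classic network / internal endpoint (hypothetical region)
spark.hadoop.fs.oss.endpoint = oss-cn-hangzhou-internal.aliyuncs.com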
2. Jar package problems
Java/Scala programs often run into missing classes or class conflicts:
Class conflict: the user's jar package conflicts with the jar packages of Spark or of the platform.
Class not found: the user's jar package was not built as a fat jar, or the class is missing because of a class conflict.
Pay attention to the following:
The difference between the provided and compile dependency scopes (see the pom sketch after this list):
provided: the code depends on the jar package, but it is only needed at compile time, not at run time; at run time the corresponding jar package is found in the cluster.
compile: the code depends on the jar package at both compile time and run time; these jar packages do not exist in the cluster and must be packed into the user's jar. They are generally third-party libraries related to the user's code logic rather than to the operation of Spark.
The jar package submitted by the user must be a fat jar:
All compile-scoped dependencies must be packed into the user's jar package, so that these classes can be loaded at run time.
Jar packages that must be set to provided:
Jar packages whose groupId is org.apache.spark
Platform-related jar packages:
cupid-sdk
hadoop-yarn-client
odps-sdk
Jar packages that must be set to compile:
OSS-related jar packages:
hadoop-fs-oss
Jar packages used to access other services, such as mysql and hbase
Third-party libraries referenced by user code
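A minimal Maven pom sketch of the provided/compile split; the version numbers are hypothetical and the exact artifact coordinates should be checked against the official documentation.

<dependencies>
  <!-- provided: available in the cluster, not packed into the fat jar -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.0</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>cupid-sdk</artifactId>
    <version>3.3.8-public</version>   <!-- hypothetical version -->
    <scope>provided</scope>
  </dependency>

  <!-- compile (the default scope): packed into the user's fat jar -->
  <dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.47</version>         <!-- hypothetical version -->
  </dependency>
</dependencies>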
3. Importing Python packages
Users often need to use external Python dependencies.
First of all, we recommend using the packaged public resources, which include some commonly used libraries for data processing and computing as well as third-party libraries for connecting to external services (mysql, redis, hbase):
## Public resource python2.7.13
spark.hadoop.odps.cupid.resources = public.python-2.7.13-ucs4.tar.gz
spark.pyspark.python = ./public.python-2.7.13-ucs4.tar.gz/python-2.7.13-ucs4/bin/python

## Public resource python3.7.9
spark.hadoop.odps.cupid.resources = public.python-3.7.9-ucs4.tar.gz
spark.pyspark.python = ./public.python-3.7.9-ucs4.tar.gz/python-3.7.9-ucs4/bin/python3

These are the MaxCompute Spark usage notes and FAQs shared in this article. If you have run into similar questions, the analysis above may help you work through them.