How Spark on MaxCompute accesses Phoenix data


This article shows how Spark on MaxCompute accesses Phoenix data. It is concise and easy to follow, and I hope you get something out of the detailed walkthrough.

First, purchase HBase 1.1 and set up the corresponding resources

1.1 Purchase HBase

The main HBase versions available are 2.0 and 1.1; version 1.1 is used here.

Differences between the HBase 1.1 and HBase 2.0 versions

HBase 1.1

Version 1.1 is based on the community HBase 1.1.2 release.

HBase 2.0

Version 2.0 is a completely new version based on HBase 2.0.0, released by the community in 2018. On that basis, many improvements and optimizations have been made, absorbing a great deal of successful experience within Alibaba, so it offers better stability and performance than the community HBase version.

1.2 Confirm the VPC ID and vSwitch ID

To make connectivity testing convenient and feasible, keep the VPC ID and vSwitch ID of the HBase instance as consistent as possible with those of the purchased exclusive integration resource group.

1.3 Set the HBase whitelist. Add the DataWorks whitelist for your region; a personal ECS address can also be added.

Following the documentation, select the whitelist for the region of the corresponding DataWorks instance and add it.

1.4 View the HBase version and access address

On the database connection page, you can check the HBase major version, the HBase VPC (private network) access address, and whether public network access is enabled.

Second, install the Phoenix client, create a table, and insert data

2.1 Install the client

Based on HBase version 1.1, select Phoenix version 4.12.0 and download the corresponding client file ali-phoenix-4.12.0-AliHBase-1.1-0.9.tar.gz as described in the documentation.
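As a rough sketch, unpacking the downloaded client might look like this (the extracted directory name is an assumption based on the tarball name):

# Unpack the Phoenix client archive; the directory name below is assumed
tar zxvf ali-phoenix-4.12.0-AliHBase-1.1-0.9.tar.gz
cd ali-phoenix-4.12.0-AliHBase-1.1-0.9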

Log in to the client and execute:

./bin/sqlline.py 172.16.0.13,172.16.0.15,172.16.0.12:2181

Create a table:

CREATE TABLE IF NOT EXISTS users_phonix (id INTEGER NOT NULL PRIMARY KEY, username VARCHAR, password VARCHAR);

Insert data:

UPSERT INTO users_phonix (id, username, password) VALUES (1, 'admin', 'Letmein');

2.2 Check whether the table and data were created successfully

Execute the following query on the client to check whether the table and data were created successfully:

SELECT * FROM users_phonix;

Third, write the corresponding code logic

3.1 Write the code logic

In IDEA, configure the local development environment according to the pom file below, fill in the configuration information involved in the code, and write a test. You can first use the public network access address of HBase for testing; once the code logic is verified, adjust the configuration parameters. The specific code is as follows:

package com.git.phonix

import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession
import org.apache.phoenix.spark._

/**
 * This example applies to Phoenix version 4.x
 */
object SparkOnPhoenix4xSparkSession {
  def main(args: Array[String]): Unit = {
    // ZK connection address of the HBase cluster.
    // Format: xxx-002.hbase.rds.aliyuncs.com,xxx-001.hbase.rds.aliyuncs.com,xxx-003.hbase.rds.aliyuncs.com:2181
    val zkAddress = args(0)
    // Table name on the Phoenix side; it must be created in advance on the Phoenix side.
    // For Phoenix table creation, refer to: https://help.aliyun.com/document_detail/53716.html?spm=a2c4g.11186623.4.2.4e961ff0lRqHUW
    val phoenixTableName = args(1)
    // Table name on the Spark (MaxCompute) side.
    val ODPSTableName = args(2)

    val sparkSession = SparkSession
      .builder()
      .appName("SparkSQL-on-MaxCompute")
      .config("spark.sql.broadcastTimeout", 20 * 60)
      .config("spark.sql.crossJoin.enabled", true)
      .config("odps.exec.dynamic.partition.mode", "nonstrict")
      //.config("spark.master", "local[4]") // spark.master must be set to local[N] to run directly; N is the degree of parallelism
      .config("spark.hadoop.odps.project.name", "**")
      .config("spark.hadoop.odps.access.id", "**")
      .config("spark.hadoop.odps.access.key", "**")
      //.config("spark.hadoop.odps.end.point", "http://service.cn.maxcompute.aliyun.com/api")
      .config("spark.hadoop.odps.end.point", "http://service.cn-beijing.maxcompute.aliyun-inc.com/api")
      .config("spark.sql.catalogImplementation", "odps")
      .getOrCreate()

    // The first insertion method: read the Phoenix table as a DataFrame and insert it into the MaxCompute table
    var df = sparkSession.read.format("org.apache.phoenix.spark")
      .option("table", phoenixTableName)
      .option("zkUrl", zkAddress)
      .load()
    df.show()
    df.write.mode("overwrite").insertInto(ODPSTableName)
  }
}

3.2 The corresponding pom file

The pom file consists of the Spark dependencies and the ali-phoenix-spark dependencies. Because the ODPS jars are already provided in the cluster, bundling them would cause jar conflicts, so the ODPS packages must be excluded.

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <properties>
        <spark.version>2.3.0</spark.version>
        <cupid.sdk.version>3.3.8-public</cupid.sdk.version>
        <scala.version>2.11.8</scala.version>
        <scala.binary.version>2.11</scala.binary.version>
        <phoenix.version>4.12.0-HBase-1.1</phoenix.version>
    </properties>

    <groupId>com.aliyun.odps</groupId>
    <artifactId>Spark-Phonix</artifactId>
    <version>1.0.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <dependencies>
        <dependency>
            <groupId>org.jpmml</groupId>
            <artifactId>pmml-model</artifactId>
            <version>1.3.8</version>
        </dependency>
        <dependency>
            <groupId>org.jpmml</groupId>
            <artifactId>pmml-evaluator</artifactId>
            <version>1.3.10</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
            <exclusions>
                <exclusion>
                    <groupId>org.scala-lang</groupId>
                    <artifactId>scala-library</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.scala-lang</groupId>
                    <artifactId>scalap</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.aliyun.odps</groupId>
            <artifactId>cupid-sdk</artifactId>
            <version>${cupid.sdk.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.aliyun.phoenix</groupId>
            <artifactId>ali-phoenix-core</artifactId>
            <version>4.12.0-AliHBase-1.1-0.8</version>
            <exclusions>
                <exclusion>
                    <groupId>com.aliyun.odps</groupId>
                    <artifactId>odps-sdk-mapred</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>com.aliyun.odps</groupId>
                    <artifactId>odps-sdk-commons</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>com.aliyun.phoenix</groupId>
            <artifactId>ali-phoenix-spark</artifactId>
            <version>4.12.0-AliHBase-1.1-0.8</version>
            <exclusions>
                <exclusion>
                    <groupId>com.aliyun.phoenix</groupId>
                    <artifactId>ali-phoenix-core</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <minimizeJar>false</minimizeJar>
                            <shadedArtifactAttached>true</shadedArtifactAttached>
                            <artifactSet>
                                <includes>
                                    <include>*:*</include>
                                </includes>
                            </artifactSet>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                        <exclude>**/log4j.properties</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                    <resource>reference.conf</resource>
                                </transformer>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                    <resource>META-INF/services/org.apache.spark.sql.sources.DataSourceRegister</resource>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.3.2</version>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile-first</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

Fourth, package and upload to DataWorks for a smoke test

4.1 Create the MaxCompute table to be imported into

CREATE TABLE IF NOT EXISTS users_phonix (id INT, username STRING, password STRING);

4.2 Package and upload to MaxCompute

The project should be built in IDEA as a shaded package so that all dependencies are bundled into the jar. Because the DataWorks interface limits jar uploads to 50 MB, the MaxCompute client is used to upload the jar.
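A minimal sketch of the build and upload steps, assuming the shade plugin's default "shaded" classifier (so the jar name below is an assumption) and an odpscmd client already configured for the target project:

# Build the shaded jar with all dependencies bundled
mvn clean package

# Upload the jar as a MaxCompute resource via the MaxCompute client (odpscmd),
# since the DataWorks UI rejects jars larger than 50 MB
odpscmd -e "add jar target/Spark-Phonix-1.0.0-SNAPSHOT-shaded.jar -f;"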

4.3 Select the corresponding project environment, check the uploaded resource, and click "Add to Data Development"

In the DataWorks interface, click the resource icon on the left, switch to the corresponding environment, and search for the uploaded file name. The list shows that the resource has been uploaded; click "Submit to Data Development".

Click the submit button

4.4 Configure the vpcList parameter and submit the task for testing

The vpcList configuration is shown below; adjust it according to your own HBase connection information.

{"regionId": "cn-beijing", "vpcs": [{"vpcId": "vpc-2ze7cqx2bqodp9ri1vvvk", "zones": [{"urls": [{"domain": "172.16.0.12" "port": 2181}, {"domain": "172.16.0.13", "port": 2181} {"domain": "172.16.0.15", "port": 2181}, {"domain": "172.16.0.14" "port": 2181}, {"domain": "172.16.0.12", "port": 16000} {"domain": "172.16.0.13", "port": 16000}, {"domain": "172.16.0.15" "port": 16000}, {"domain": "172.16.0.14", "port": 16000} {"domain": "172.16.0.12", "port": 16020}, {"domain": "172.16.0.13" "port": 16020}, {"domain": "172.16.0.15", "port": 16020} {"domain": "172.16.0.14", "port": 16020}]}

Configure the Spark task submission parameters: the main class and the corresponding arguments.

The arguments consist of three parts: the first is the Phoenix connection (ZK) address, the second is the Phoenix table name, and the third is the target MaxCompute table.
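Put together, a sketch of the task configuration under the assumptions of this walkthrough (ZK address taken from the vpcList above, both tables named users_phonix; the exact DataWorks field names may differ):

Main class: com.git.phonix.SparkOnPhoenix4xSparkSession
Arguments:  172.16.0.12,172.16.0.13,172.16.0.15:2181 users_phonix users_phonix

The vpcList JSON itself is normally supplied through the spark.hadoop.odps.cupid.vpc.domain.list configuration item of the Spark node; check the current MaxCompute Spark documentation to confirm the property name for your environment.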

Click the smoke test button and confirm that the task executes successfully.

Run a query in the temporary query node to confirm that the data has been written to the MaxCompute table.
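For example, a simple check in the temporary query node, assuming the target table created in step 4.1:

-- Verify that the Phoenix rows landed in the MaxCompute table
SELECT * FROM users_phonix;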

Summary:

Using Spark on MaxCompute to access Phoenix data and write it into a MaxCompute table has proved feasible. However, a few points deserve attention in practice:

1. Choose the HBase and Phoenix versions according to the actual use case; the versions must correspond, and the client used as well as the code dependencies change accordingly.

2. If you test locally in IDEA over the public network, pay attention to the HBase whitelist: you must not only set the DataWorks whitelist but also add your local address to it.

3. When packaging the code, sort out the dependencies in the pom to avoid ODPS packages brought in by transitive dependencies, which would cause jar conflicts, and build a shaded package to avoid missing dependencies.

The above is how Spark on MaxCompute accesses Phoenix data.
