This article shows how to do cross-regional data analysis based on DataLakeAnalytics. I hope you can take something away from the detailed walkthrough below.
On Aliyun, many customers deploy their applications in multiple regions: for example, one instance in Beijing (cn-beijing) so that customers in the north get faster access, and another in cn-hangzhou for customers in the south. After such a multi-region deployment the business data is split into multiple copies; the databases in each region are independent and their networks are isolated from each other, which makes it difficult to analyze the business data as a whole. Today I would like to introduce a cross-regional data analysis solution built on several Alibaba Cloud products: DataLakeAnalytics, OSS, DataX, and so on.
In fact, cloud products themselves (such as our own DataLakeAnalytics) also need cross-regional data analysis, and this solution applies there as well; it was originally developed to analyze DataLakeAnalytics's own business data.
Solution overview
We know that RDS instances in different regions cannot reach each other over the network unless you expose them to the public Internet (a significant security risk, not recommended), and even with public access it is still not easy to analyze the data of multiple databases jointly. We also do not want this kind of data analysis to take up too much of the budget.
Our solution is to synchronize the data of every region to OSS in one single region and then use DataLakeAnalytics to analyze it jointly. The advantage of this scheme is that OSS storage is very cheap, and DataLakeAnalytics is charged per query, so you pay nothing when you are not querying. The overall plan is sketched below.
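(A rough sketch; the region names and the bucket name follow the examples used later in this article.)

RDS (cn-beijing)   --DataX-->
RDS (cn-hangzhou)  --DataX-->   OSS in one region (bucket mydb-bucket)   --->   DataLakeAnalytics
RDS (cn-shanghai)  --DataX-->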
Gather data from different regions
The first step of our plan is to synchronize the RDS data of each region into that single OSS region. Alibaba has open-sourced an excellent data transfer tool, DataX, which can move data between different data sources and supports a wide variety of them: from relational databases such as MySQL and SQL Server to file systems such as HDFS and OSS. Here we need the mysqlreader plugin to read data from MySQL and the osswriter plugin to write data to OSS.
Suppose we have the following person table, which records personnel information and needs to be synchronized:
CREATE TABLE person (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(1023),
    age INT
);
Let's write a DataX task description file person.json similar to the following:
{
    "job": {
        "setting": {
            "speed": { "channel": 1, "byte": 104857600 },
            "errorLimit": { "record": 10 }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "your-user-name",
                        "password": "your-password",
                        "column": ["id", "name", "age"],
                        "connection": [
                            {
                                "table": ["person"],
                                "jdbcUrl": ["jdbc:mysql://your-rds.mysql.rds.aliyuncs.com:3306/dbname"]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "osswriter",
                    "parameter": {
                        "endpoint": "http://oss.aliyuncs.com",
                        "accessId": "your-access-id",
                        "accessKey": "your-access-secret",
                        "bucket": "mydb-bucket",
                        "object": "mydb/person/region=cn-hangzhou/person.csv",
                        "encoding": "UTF-8",
                        "fieldDelimiter": "|",
                        "writeMode": "truncate"
                    }
                }
            }
        ]
    }
}
Here, fill in the MySQL-related fields with the details of your business database, and fill in the OSS-related fields with the address of the OSS bucket we synchronize to. Pay attention to the object field in the OSS configuration: mydb holds all of your data, the person directory holds the data of your person table, and the region=cn-hangzhou directory is the interesting part: it holds your application's data for the cn-hangzhou region; similarly, you may also have data for cn-beijing, cn-shanghai, and so on.
Then execute the following command:
# make sure you have downloaded and configured DataX correctly before execution
python datax/bin/datax.py person.json
If executed correctly, you will see the following output:
... (N lines omitted) ...
2018-09-06 19:53:19.900 [job-0] INFO  JobContainer - PerfTrace not enable!
2018-09-06 19:53:19.900 [job-0] INFO  StandAloneJobContainerCommunicator - Total 251 records, 54067 bytes | Speed 5.28KB/s, 25 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.001s | All Task WaitReaderTime 0.026s | Percentage 100.00%
2018-09-06 19:53:19.902 [job-0] INFO  JobContainer -
Task start time           : 2018-09-06 19:53:09
Task end time             : 2018-09-06 19:53:19
Total task elapsed time   : 10s
Average task traffic      : 5.28KB/s
Record write speed        : 25rec/s
Total records read        : 251
Total read/write failures : 0
In this way, the data is synchronized to OSS automatically; you can download a tool such as oss-browser to view the data on OSS.
The data in the file looks like this:
9|ethan|10
10|julian|20
11|train|30
12|wally|40
After the data transfer for one region is completed, the other regions can follow the same steps. The only thing to note is that although the MySQL data comes from different regions, the OSS objects all share the same root directory person, because we need to gather the data together. After the data from the various regions has been gathered, the structure of the person directory looks roughly like this:
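(A sketch of the resulting layout, assuming the three example regions and the bucket used above.)

oss://mydb-bucket/mydb/person/
    region=cn-beijing/person.csv
    region=cn-hangzhou/person.csv
    region=cn-shanghai/person.csv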
Using DataLakeAnalytics to analyze aggregated OSS data
The analysis that follows can be handed over to DataLakeAnalytics; analyzing data on OSS is DataLakeAnalytics's specialty. Before we start, we need a DataLakeAnalytics account. DataLakeAnalytics is currently in public trial, so you can simply apply for the trial. Once the application is approved, you will receive a user name and password and can log in to the console to use it.
Or if you are a geek and prefer the command line, you can use a normal MySQL client to connect to DLA:
mysql -hservice.cn-shanghai.datalakeanalytics.aliyuncs.com -P10000 -u<your-user-name> -p<your-password>
In this article, I will use the MySQL command line to demonstrate the functionality of DLA.
First, let's create a database in DataLakeAnalytics:
CREATE DATABASE `mydb` WITH DBPROPERTIES (
    catalog = 'oss',
    location = 'oss://mydb-bucket/mydb/'
);
The oss://mydb-bucket/mydb/ here is the parent directory of the person directory where our data was aggregated earlier.
After creating the database, we create the table:
CREATE EXTERNAL TABLE IF NOT EXISTS `person` (
    `id` bigint,
    `name` varchar(1023),
    `age` int
) PARTITIONED BY (region varchar(63))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://mydb-bucket/mydb/person';
Note that this is a partitioned table and the partition key is our region. The first advantage is that synchronization stays simple for each region: every region writes only into its own partition directory, so you don't have to worry about overwriting the data of other regions. In addition, partitioning by region lets us scan less data and query faster when analyzing a single region, as in the example below.
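For example, a single-region query like the following (a sketch against the person table defined above) only reads the files under that region's partition directory:

SELECT count(*) FROM person WHERE region = 'cn-hangzhou';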
After creating the table, we run the following command to ask DataLakeAnalytics to scan the file listing on OSS and discover all the region partitions:
mysql> msck repair table person;
+------------------------------------------------------------------------------------------------------------+
| Result                                                                                                       |
+------------------------------------------------------------------------------------------------------------+
| Partitions not in metastore: person:region=cn-beijing person:region=cn-hangzhou person:region=cn-shanghai   |
| Repair: Added partition to metastore mydb.person:region=cn-beijing                                          |
| Repair: Added partition to metastore mydb.person:region=cn-hangzhou                                         |
| Repair: Added partition to metastore mydb.person:region=cn-shanghai                                         |
+------------------------------------------------------------------------------------------------------------+
Now we can happily query the data of all regions together :)
mysql> select * from person limit 5;
+------+-------+------+-------------+
| id   | name  | age  | region      |
+------+-------+------+-------------+
|    1 | james |   10 | cn-beijing  |
|    2 | bond  |   20 | cn-beijing  |
|    3 | lucy  |   30 | cn-beijing  |
|    4 | lily  |   40 | cn-beijing  |
|    5 | trump |   10 | cn-hangzhou |
+------+-------+------+-------------+
5 rows in set (0.43 sec)

mysql> select region, count(*) cnt from person group by region;
+-------------+------+
| region      | cnt  |
+-------------+------+
| cn-beijing  |    4 |
| cn-hangzhou |    4 |
| cn-shanghai |    4 |
+-------------+------+
3 rows in set (0.18 sec)
In this article we introduced a method for cross-regional data analysis built on DataLakeAnalytics, OSS, and DataX. Due to limited space, many details of the solution have not been further optimized. For example, the data could be partitioned further by day, so that each daily synchronization moves less data and runs more efficiently (see the sketch below); and we have not described how to synchronize the data periodically, whether with crontab or with some scheduling system. These are left for the reader to explore.
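To illustrate the day-partition idea only (this is not part of the setup above; the person_daily table name is hypothetical, while the columns, delimiter, and bucket follow the earlier examples):

-- a sketch: partition by region and by day (dt)
CREATE EXTERNAL TABLE IF NOT EXISTS `person_daily` (
    `id` bigint,
    `name` varchar(1023),
    `age` int
) PARTITIONED BY (region varchar(63), dt varchar(10))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://mydb-bucket/mydb/person_daily';

-- each daily DataX run would then write objects such as
--   mydb/person_daily/region=cn-hangzhou/dt=2018-09-06/person.csv
-- and running `msck repair table person_daily` afterwards would register the new partitions.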