
What is the practice of data lake based on DataLakeAnalytics?

2025-01-21 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)05/31 Report--

What does a data lake practice based on DataLakeAnalytics look like? Many newcomers are unsure where to start, so this article walks through the setup and the solutions step by step. I hope it helps you solve the problem yourself.

Activate DLA

Before we start, we need a DLA account. DLA is currently in public beta, so you just apply for a trial. Once the trial is approved, you will receive a username and password and can log in to the console to use it.


Or, if you are a geek and prefer the command line, you can connect to DLA with an ordinary MySQL client:

mysql -hservice.cn-shanghai.datalakeanalytics.aliyuncs.com -P10000 -u<username> -p<password>

In this article, I will use the MySQL command line to demonstrate the functionality of DLA.

In addition, you need some test data on your OSS bucket. What I use here is the well-known TPCH test data set.

Using DLA to analyze the data on OSS

DLA is a data lake engine that uses SQL as its query language. For DLA to query data on OSS, we must tell it the structure of that data. To keep things convenient for users, DLA reuses the traditional database concepts of databases and tables to hold this meta-information: the directory-and-file structure on OSS is mapped into DLA as a database containing a set of tables.

Take the TPCH dataset as an example. It contains entities such as customers (customer), orders (orders), and order line items (lineitem). These all belong to one business domain, so we create a database to correspond to them:

CREATE SCHEMA oss_tpch WITH DBPROPERTIES (
    CATALOG = 'oss',
    LOCATION = 'oss://public-datasets-cn-hangzhou/tpch/1x/'
);

Each entity corresponds to the files in one directory on OSS. Orders, for example, correspond to the files under the orders_text directory (in this example there is only one file; in practice there can be many).

Let's map this orders_text directory to a table under our database oss_tpch:

USE oss_tpch;
CREATE EXTERNAL TABLE IF NOT EXISTS orders (
    O_ORDERKEY INT,
    O_CUSTKEY INT,
    O_ORDERSTATUS STRING,
    O_TOTALPRICE DOUBLE,
    O_ORDERDATE DATE,
    O_ORDERPRIORITY STRING,
    O_CLERK STRING,
    O_SHIPPRIORITY INT,
    O_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://public-datasets-cn-hangzhou/tpch/1x/orders_text/';
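To make concrete what ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' means for the files themselves, here is a small Python sketch that splits one such text line into the columns of the orders table. The sample line below is illustrative, not pulled from the real dataset:

```python
import csv
import io

# Column names taken from the CREATE EXTERNAL TABLE statement for `orders`.
ORDERS_COLUMNS = [
    "o_orderkey", "o_custkey", "o_orderstatus", "o_totalprice",
    "o_orderdate", "o_orderpriority", "o_clerk", "o_shippriority",
    "o_comment",
]

def parse_orders_line(line):
    """Split one '|'-delimited TEXTFILE line into a column -> value dict."""
    reader = csv.reader(io.StringIO(line), delimiter="|")
    values = next(reader)
    return dict(zip(ORDERS_COLUMNS, values))

# A made-up sample row in the same layout as the TPCH orders files.
sample = "1|3689999|O|224560.83|1996-01-02|5-LOW|Clerk#000095055|0|nstructions sleep furiously among"
row = parse_orders_line(sample)
print(row["o_orderkey"], row["o_custkey"], row["o_totalprice"])
```

DLA performs this parsing for you at query time; the sketch only shows how the delimiter in the DDL corresponds to the file layout.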

In this way, we can analyze the data on OSS through DLA. For example, let's look at the first ten orders:

mysql> select * from orders limit 10;
+------------+-----------+---------------+--------------+-------------+-----------------+-----------------+----------------+---------------------------------------------------------------------------+
| o_orderkey | o_custkey | o_orderstatus | o_totalprice | o_orderdate | o_orderpriority | o_clerk         | o_shippriority | o_comment                                                                 |
+------------+-----------+---------------+--------------+-------------+-----------------+-----------------+----------------+---------------------------------------------------------------------------+
|          1 |   3689999 | O             |    224560.83 | 1996-01-02  | 5-LOW           | Clerk#000095055 |              0 | nstructions sleep furiously among                                         |
|          2 |   7800163 | O             |     75388.65 | 1996-12-01  | 1-URGENT        | Clerk#000087916 |              0 | foxes. Pending accounts at the pending, silent asymptot                   |
|          3 |  12331391 | F             |    255287.36 | 1993-10-14  | 5-LOW           | Clerk#000095426 |              0 | sly final accounts boost. Carefully regular ideas cajole carefully. Depos |
|          4 |  13677602 | O             |     43119.84 | 1995-10-11  | 5-LOW           | Clerk#000012340 |              0 | sits. Slyly regular warthogs cajole. Regular, regular theodolites acro    |
|          5 |   4448479 | F             |    125809.76 | 1994-07-30  | 5-LOW           | Clerk#000092480 |              0 | quickly. Bold deposits sleep slyly. Packages use slyly                    |
|          6 |   5562202 | F             |      56408.2 | 1992-02-21  | 4-NOT SPECIFIED | Clerk#000005798 |              0 | ggle. Special Final requests are against the furiously specia             |
|          7 |   3913430 | O             |    240358.24 | 1996-01-10  | 2-HIGH          | Clerk#000046961 |              0 | ly special requests                                                       |
|         32 |  13005694 | O             |    136666.23 | 1995-07-16  | 2-HIGH          | Clerk#000061561 |              0 | ise blithely bold Regular requests. Quickly unusual dep                   |
|         33 |   6695788 | F             |    183460.23 | 1993-10-27  | 3-MEDIUM        | Clerk#000040860 |              0 | uriously. Furiously final request                                         |
|         34 |   6100004 | O             |     52842.63 | 1998-07-21  | 3-MEDIUM        | Clerk#000022278 |              0 | ly final packages. Fluffily final deposits wake blithely ideas. Spe       |
+------------+-----------+---------------+--------------+-------------+-----------------+-----------------+----------------+---------------------------------------------------------------------------+
10 rows in set (0.21 sec)

Let's look at ten orders from user 36901:

mysql> select * from orders where o_custkey = '36901' limit 10;
+------------+-----------+---------------+--------------+-------------+-----------------+-----------------+----------------+------------------------------------------------------------------+
| o_orderkey | o_custkey | o_orderstatus | o_totalprice | o_orderdate | o_orderpriority | o_clerk         | o_shippriority | o_comment                                                        |
+------------+-----------+---------------+--------------+-------------+-----------------+-----------------+----------------+------------------------------------------------------------------+
|    1243264 |     36901 | F             |    103833.45 | 1992-03-23  | 2-HIGH          | Clerk#000000922 |              0 | nts haggle. Even, even theodolites are. Blithely                 |
|    1274530 |     36901 | O             |    181977.58 | 1997-04-29  | 2-HIGH          | Clerk#000000232 |              0 | bold foxes along the carefully expres                            |
|    1599527 |     36901 | F             |    322352.11 | 10-16       | 2-HIGH          | Clerk#000000674 |              0 | the slyly even dependencies.                                     |
|    1837477 |     36901 | F             |    101653.62 | 1993-05-27  | 5-LOW           | Clerk#000000891 |              0 | lyly special requests. Express foxes sleep fu                    |
|    1994082 |     36901 | O             |     77952.78 | 1995-07-05  | 3-MEDIUM        | Clerk#000000525 |              0 | luffily ironic courts. Bold, e                                   |
|    2224802 |     36901 | F             |    243852.76 | 1993-01-14  | 1-URGENT        | Clerk#000000827 |              0 | sly final requests. Pending, regular ideas among the furiously u |
|    4957636 |     36901 | F             |      5741.32 | 1992-05-20  | 5-LOW           | Clerk#000000230 |              0 | ackages. Fluffily even packages solve carefully dolphins. Unusua |
|    5078467 |     36901 | F             |    119823.03 | 1994-04-29  | 4-NOT SPECIFIED | Clerk#000000402 |              0 | regular asymptotes cajo                                          |
|    5173859 |     36901 | F             |    103624.02 | 05-28       | 3-MEDIUM        | Clerk#000000335 |              0 | regular dependencies poach quickly. Unusu                        |
|    5525574 |     36901 | O             |     136098.0 | 1998-02-16  | 4-NOT SPECIFIED | Clerk#000000425 |              0 | cial pinto beans wake. Slyly even warthogs use. Bo               |
+------------+-----------+---------------+--------------+-------------+-----------------+-----------------+----------------+------------------------------------------------------------------+
10 rows in set (1.07 sec)

Let's find the customers who placed the most orders:

mysql> select o_custkey, count(*) as cnt from orders group by o_custkey order by cnt desc limit 10;
+-----------+-----+
| o_custkey | cnt |
+-----------+-----+
|      3451 |  41 |
|    102022 |  41 |
|    102004 |  41 |
|     79300 |  40 |
|    117082 |  40 |
|    122623 |  40 |
|     69682 |  39 |
|    143500 |  39 |
|    142450 |  38 |
|     53302 |  38 |
+-----------+-----+
10 rows in set (2.69 sec)
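The GROUP BY / ORDER BY ... DESC / LIMIT pipeline that DLA executes here can be mimicked in plain Python to see the semantics. The (o_orderkey, o_custkey) pairs below are made up for illustration, not taken from the real TPCH data:

```python
from collections import Counter

# Made-up (o_orderkey, o_custkey) pairs standing in for rows of `orders`.
orders = [
    (1, 3451), (2, 3451), (3, 3451),
    (4, 102022), (5, 102022),
    (6, 79300),
]

def top_customers(rows, limit):
    """count(*) per o_custkey, ordered by count descending, truncated to `limit`."""
    counts = Counter(custkey for _orderkey, custkey in rows)
    return counts.most_common(limit)

print(top_customers(orders, 2))  # customers with the most orders first
```

Of course, the point of DLA is that this aggregation runs over the files on OSS without you ever downloading them.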

Well, these are the customers we should focus on serving. We want to write these user IDs back to the front-end RDS database so that our marketing team can run targeted campaigns. No problem: DLA supports writing analysis results back to RDS.

Flowing data back to RDS

Mapping the MySQL database into DLA

To flow the analyzed data back to RDS, we first need a way to tell DLA the destination. Thanks to DLA's unified design, this works just like mapping OSS data: we simply map the MySQL database in. Suppose we want to write the data to the following database:

mysql -habcde.mysql.rds.aliyuncs.com -P3306 -uhello -pworld -Dmarketing

So let's build a mapping library in DLA:

CREATE SCHEMA `mysql_marketing` WITH DBPROPERTIES (
    CATALOG = 'mysql',
    LOCATION = 'jdbc:mysql://abcde.mysql.rds.aliyuncs.com:3306/marketing',
    USER = 'hello',
    PASSWORD = 'world',
    INSTANCE_ID = '',
    VPC_ID = ''
);

The properties that need explanation here are VPC_ID and INSTANCE_ID. For security, an RDS instance bought on Aliyun is usually placed in its own VPC so that only you can access it. To let DLA reach our MySQL database and write data back, we have to tell DLA about this RDS instance.

Both INSTANCE_ID and VPC_ID can be found on the RDS instance's details page.

Because the RDS security group restricts the source IPs allowed to connect, we also need to add DLA's address range, 100.104.0.0/16, to the RDS whitelist.
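If you want to sanity-check which addresses that whitelist entry actually covers, Python's standard ipaddress module can test membership in the 100.104.0.0/16 range. The sample client IPs below are made up:

```python
import ipaddress

# The DLA source address range to be whitelisted on the RDS instance.
DLA_RANGE = ipaddress.ip_network("100.104.0.0/16")

def is_from_dla(ip):
    """Return True if `ip` falls inside the whitelisted DLA address range."""
    return ipaddress.ip_address(ip) in DLA_RANGE

print(is_from_dla("100.104.3.7"))   # inside the /16
print(is_from_dla("100.105.0.1"))   # outside the /16
```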

At this point the preparation is complete, and our MySQL mapping database is in place.

Map MySQL result table into DLA

The result we want to save is very simple: the top 10 users by order count. The table creation statement in the MySQL database is as follows:

create table top10_user (custkey int, order_cnt bigint);

To map this table into DLA, we create a corresponding external table; the statement is almost identical:

use mysql_marketing;
create external table top10_user (custkey int, order_cnt bigint);

ETL

Next, we can flow back the checked data:

mysql> insert into mysql_marketing.top10_user
    -> select o_custkey, count(*) as cnt from oss_tpch.orders
    -> group by o_custkey order by cnt desc limit 10;
+------+
| rows |
+------+
|   10 |
+------+
1 row in set (4.71 sec)

mysql> select * from mysql_marketing.top10_user;
+---------+-----------+
| custkey | order_cnt |
+---------+-----------+
|  143500 |        39 |
|  102004 |        41 |
|   53302 |        38 |
|    3451 |        41 |
|  122623 |        40 |
|  129637 |        38 |
|         |        41 |
|  117082 |        40 |
|   69682 |        39 |
|   79300 |        40 |
+---------+-----------+
10 rows in set (0.14 sec)

After reading the above, have you mastered the practice of building a data lake on DataLakeAnalytics? If you want to learn more skills or dig deeper, you are welcome to follow the industry information channel. Thank you for reading!
