2025-01-16 Update From: SLTechnology News&Howtos (Servers)
Shulou(Shulou.com)05/31 Report--
Many readers are unsure how to use DataWorks to schedule DLA tasks in Data Lake Analytics. This article walks through the setup step by step; by the end, you should be able to set it up yourself.
DataWorks, a popular big data development and scheduling service on Alibaba Cloud, has recently added support for Data Lake Analytics (DLA). This means every DLA customer now gets its full set of capabilities: task development, task dependency management, task scheduling, and task operations and maintenance. Today we will walk through how to use DataWorks to schedule DLA script tasks.
Activate DLA
Before we start, we need a DLA account. At present, every new DLA user gets 50T of free traffic, so you can try it out with confidence. After you activate DLA, you will receive a username and password, and you can then log in to the console to use it:
Or, if you prefer the command line, you can connect to DLA with an ordinary MySQL client:
mysql -h service.cn-region.datalakeanalytics.aliyuncs.com -P 10000 -u <username> -p

In this article, I will use the MySQL command line to demonstrate DLA's functionality.

Apply for the DataWorks + DLA trial
After activating the DLA service, you also need to activate the DataWorks service. DataWorks is still in its public trial period, so feel free to use it.
Then, ask any of the staff in your DLA service group to enable the DLA + DataWorks trial for you (at present this feature is still invitation-only and has not been fully released).
If you do not currently have a dedicated DLA service group, you can contact us through a support ticket.

Prepare DLA data, schema, and tables
To demonstrate how to schedule DLA tasks on DataWorks, we need some test data. We will use the well-known TPC-H test data set, stored on OSS.
Through the MySQL command line, we create the corresponding libraries and tables:
CREATE SCHEMA dataworks_demo WITH DBPROPERTIES (
    CATALOG = 'oss',
    LOCATION = 'oss://test-bucket/datasets/'
);

USE dataworks_demo;

CREATE EXTERNAL TABLE IF NOT EXISTS orders (
    O_ORDERKEY INT,
    O_CUSTKEY INT,
    O_ORDERSTATUS STRING,
    O_TOTALPRICE DOUBLE,
    O_ORDERDATE DATE,
    O_ORDERPRIORITY STRING,
    O_CLERK STRING,
    O_SHIPPRIORITY INT,
    O_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket/datasets/tpch/1x/text_string/orders_text/';

-- result table finished_orders
CREATE EXTERNAL TABLE IF NOT EXISTS finished_orders (
    O_ORDERKEY INT,
    O_TOTALPRICE DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket/datasets/dataworks_demo/finished_orders/';

-- result table high_value_finished_orders
CREATE EXTERNAL TABLE IF NOT EXISTS high_value_finished_orders (
    O_ORDERKEY INT,
    O_TOTALPRICE DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket/datasets/dataworks_demo/high_value_finished_orders/';
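The tables above read pipe-delimited text files from OSS (ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'). As a quick illustration of how such a row maps to columns, here is a minimal Python sketch; the sample line is made up for illustration and is not real TPC-H data:

```python
# Columns of the orders table, in the order they appear in each text row.
ORDERS_COLUMNS = [
    "O_ORDERKEY", "O_CUSTKEY", "O_ORDERSTATUS", "O_TOTALPRICE",
    "O_ORDERDATE", "O_ORDERPRIORITY", "O_CLERK", "O_SHIPPRIORITY", "O_COMMENT",
]

def parse_order_line(line: str) -> dict:
    """Split a '|'-delimited text row into a column-name -> value dict."""
    values = line.rstrip("\n").split("|")
    return dict(zip(ORDERS_COLUMNS, values))

# Hypothetical sample row:
row = parse_order_line("1|370|F|17278.2|1996-01-02|5-LOW|Clerk#000000951|0|no comment")
print(row["O_ORDERSTATUS"])  # prints: F
```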
One of the important features of task scheduling is the dependency between tasks. To demonstrate this feature, we will create two DLA tasks in DataWorks. The relationship between our tables and tasks is shown below:
Task 1: select completed orders (o_orderstatus = 'F') from the orders table and write them into the finished_orders table.
Task 2: select orders with a total price greater than 10000 (o_totalprice > 10000) from the finished_orders table and write them into the high_value_finished_orders table.
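To make the data flow between the two tasks concrete, here is a small Python sketch that emulates both filters on a few in-memory rows (the sample orders are invented for illustration):

```python
# Sample rows standing in for the orders table (made-up data).
orders = [
    {"O_ORDERKEY": 1, "O_TOTALPRICE": 17278.2, "O_ORDERSTATUS": "F"},
    {"O_ORDERKEY": 2, "O_TOTALPRICE": 4100.0,  "O_ORDERSTATUS": "F"},
    {"O_ORDERKEY": 3, "O_TOTALPRICE": 99000.0, "O_ORDERSTATUS": "O"},
]

# Task 1: orders -> finished_orders (keep completed orders only).
finished_orders = [
    {"O_ORDERKEY": o["O_ORDERKEY"], "O_TOTALPRICE": o["O_TOTALPRICE"]}
    for o in orders
    if o["O_ORDERSTATUS"] == "F"
]

# Task 2: finished_orders -> high_value_finished_orders (total price > 10000).
high_value_finished_orders = [
    o for o in finished_orders if o["O_TOTALPRICE"] > 10000
]

print([o["O_ORDERKEY"] for o in high_value_finished_orders])  # prints: [1]
```

Order 3 is dropped by task 1 (still open), and order 2 is dropped by task 2 (too cheap), so only order 1 reaches the final table.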
For more detailed information on how to use DLA to analyze OSS data, please refer to:
Data Lake Analytics + OSS data file format processing: https://yq.aliyun.com/articles/623246
Use Data Lake Analytics + OSS to analyze TPC-H datasets in CSV format: https://yq.aliyun.com/articles/623282
Create a DLA task on DataWorks
After the DataWorks + DLA feature is activated, we can create a DLA task in the DataWorks data development IDE, as shown below:
We name the first task finished_orders. Click OK to enter the SQL editing page. To write DLA SQL, you must tell DataWorks which DLA service the SQL runs against; DataWorks wraps this in the concept of a "data source":
The DataWorks convention is that a task's name matches the name of the task's output table.
When you first enter, there is no data source yet; click "New data source":
Fill in the necessary information and click OK to finish.
For security reasons, DataWorks restricts which services it can connect to, so we need to add the address and port of our DLA instance to the whitelist. This is configured in the DataWorks workspace settings:
The specific configuration is as follows (replace with your actual IP and port):
Note that only the workspace administrator has permission to change the workspace configuration.
With all that done, we can finally see the DLA data source on the editing page. Enter the following SQL in the finished_orders task and click Execute:
USE dataworks_demo;
INSERT INTO finished_orders
SELECT O_ORDERKEY, O_TOTALPRICE
FROM orders
WHERE O_ORDERSTATUS = 'F';
As shown below:
Repeat the above steps to create a second task: high_value_finished_orders:
USE dataworks_demo;
INSERT INTO high_value_finished_orders
SELECT *
FROM finished_orders
WHERE O_TOTALPRICE > 10000;

Configure task dependencies
Running a single task on demand is of limited use; the core of task scheduling is having multiple tasks run at specified times according to specified dependencies. Let's configure finished_orders to start running at 2:00 a.m. every day:
And configure high_value_finished_orders to run only after finished_orders has run successfully:
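The dependency rule configured above (a downstream task runs only if its upstream task succeeded) can be sketched in a few lines of Python. This is only an illustration of the scheduling behaviour, not DataWorks' actual implementation; the stub callables stand in for the two DLA SQL tasks:

```python
def run_pipeline(tasks, deps):
    """Run tasks respecting dependencies; skip a task if any upstream failed.

    tasks: name -> callable returning True (success) or False (failure).
    deps:  name -> list of upstream task names.
    Returns the list of task names that actually ran, in execution order.
    """
    status, ran = {}, []

    def run(name):
        if name in status:                      # already decided
            return status[name]
        if all(run(up) for up in deps.get(name, [])):
            ran.append(name)                    # all upstreams succeeded: run it
            status[name] = tasks[name]()
        else:
            status[name] = False                # an upstream failed: skip
        return status[name]

    for name in tasks:
        run(name)
    return ran

# The two tasks from this article, stubbed as always-succeeding callables.
tasks = {
    "finished_orders": lambda: True,
    "high_value_finished_orders": lambda: True,
}
deps = {"high_value_finished_orders": ["finished_orders"]}
print(run_pipeline(tasks, deps))  # prints: ['finished_orders', 'high_value_finished_orders']
```

If the finished_orders stub returned False, high_value_finished_orders would be skipped, mirroring the behaviour we just configured.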
Task release
After the tasks are configured, we can publish, operate, and maintain them. To publish a task, you must first submit it:
After submitting, we can see all tasks awaiting release in the to-be-released list:
Select the two tasks we just submitted and release them:
On the release list page, you can check whether the release we just made is successful:
After the release is successful, we can go to the task operation and maintenance page to view our tasks and conduct various operation and maintenance operations.
After reading the above, have you mastered how to use DataWorks to schedule DLA tasks in Data Lake Analytics? Thank you for reading!