Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

How to use Data Lake Analytics + OSS to analyze TPC-H datasets in CSV format

2025-04-05 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >

Share

Shulou(Shulou.com)05/31 Report--

How to use Data Lake Analytics + OSS to analyze TPC-H data sets in CSV format, many novices are not very clear about this. In order to help you solve this problem, the following editor will explain it in detail. People with this need can come and learn. I hope you can get something.

1. Activate Data Lake Analytics and OSS services

If you are already active, you can skip this step. If it is not activated, please refer to: https://help.aliyun.com/document_detail/70386.html

Apply for product activation and service.

two。 Download the TPC-H test dataset

You can download TPC-H 100MB's dataset from here:

Https://public-datasets-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/tpch_100m_data.zip

3. Upload data files to OSS

Log in to the OSS console of Aliyun's official website: https://oss.console.aliyun.com/overview

Plan the OSS bucket you want to use. After you have created or selected it, click "File Management". Because there are 8 data files, create a corresponding file directory for each data file:

Create 8 directories as follows:

Click to enter the directory and upload the corresponding data file, for example, the customer directory, then upload the customer.tbl file.

After uploading, as shown in the following picture. Then, upload the other seven data files to the corresponding directory in turn.

So far, 8 data files have been uploaded to your OSS bucket:

Oss://xxx/tpch_100m/customer/customer.tbloss://xxx/tpch_100m/lineitem/lineitem.tbloss://xxx/tpch_100m/nation/nation.tbloss://xxx/tpch_100m/orders/orders.tbloss://xxx/tpch_100m/part/part.tbloss://xxx/tpch_100m/partsupp/partsupp.tbloss://xxx/tpch_100m/region/region.tbloss://xxx/tpch_100m/supplier/supplier.tbl4. Log in to the Data Lake Analytics console

Https://openanalytics.console.aliyun.com/

Click "Log in Database", enter the user name and password assigned when activating the service, and log in to the Data Lake Analytics console.

5. Create Schema and Table

Enter the statement to create the SCHEMA and click "synchronous execution".

CREATE SCHEMA tpch_100m with DBPROPERTIES (LOCATION = 'oss://test-bucket-julian-1/tpch_100m/', catalog='oss')

(note: currently, the schema name of the same Aliyun region,Data Lake Analytics is globally unique. It is recommended that the schema name should be based on the business definition. There is already a duplicate name schema. If an error is reported during creation, please change the schema name. )

After the Schema is created, in the drop-down box of Database, select the schema you just created. Then enter the table creation statement in the SQL text box and click synchronous execution.

Syntax reference for building table statements: https://help.aliyun.com/document_detail/72006.html

The table-building statements of the eight tables corresponding to TPC-H are as follows, which are respectively pasted into the document box (the location of the data file in the LOCATION clause should be modified according to your actual OSS bucket directory). (note: currently, multiple SQL statements are not supported in the console. Please execute a single statement. )

CREATE EXTERNAL TABLE nation (N_NATIONKEY INT, N_NAME STRING, N_ID STRING, N_REGIONKEY INT, N_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY'| 'STORED AS TEXTFILE LOCATION' oss://test-bucket-julian-1/tpch_100m/nation' CREATE EXTERNAL TABLE lineitem (L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT, L_LINENUMBER INT, L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE, L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING, L_SHIPDATE DATE, L_COMMITDATE DATE, L_RECEIPTDATE DATE, L_SHIPINSTRUCT STRING, L_SHIPMODE STRING L_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY'| 'STORED AS TEXTFILE LOCATION' oss://test-bucket-julian-1/tpch_100m/lineitem' CREATE EXTERNAL TABLE orders (O_ORDERKEY INT, O_CUSTKEY INT, O_ORDERSTATUS STRING, O_TOTALPRICE DOUBLE, O_ORDERDATE DATE, O_ORDERPRIORITY STRING, O_CLERK STRING, O_SHIPPRIORITY INT, O_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY'| 'STORED AS TEXTFILE LOCATION' oss://test-bucket-julian-1/tpch_100m/orders' CREATE EXTERNAL TABLE supplier (S_SUPPKEY INT, S_NAME STRING, S_ADDRESS STRING, S_NATIONKEY INT, S_PHONE STRING, S_ACCTBAL DOUBLE, S_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY'| 'STORED AS TEXTFILE LOCATION' oss://test-bucket-julian-1/tpch_100m/supplier' CREATE EXTERNAL TABLE partsupp (PS_PARTKEY INT, PS_SUPPKEY INT, PS_AVAILQTY INT, PS_SUPPLYCOST DOUBLE, PS_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY'| 'STORED AS TEXTFILE LOCATION' oss://test-bucket-julian-1/tpch_100m/partsupp' CREATE EXTERNAL TABLE customer (C_CUSTKEY INT, C_NAME STRING, C_ADDRESS STRING, C_NATIONKEY INT, C_PHONE STRING, C_ACCTBAL DOUBLE, C_MKTSEGMENT STRING, C_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY'| 'STORED AS TEXTFILE LOCATION' oss://test-bucket-julian-1/tpch_100m/customer' CREATE EXTERNAL TABLE part (P_PARTKEY INT, P_NAME STRING, P_MFGR STRING, P_BRAND STRING, P_TYPE STRING, P_SIZE INT, P_CONTAINER STRING, P_RETAILPRICE DOUBLE, P_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY'| 'STORED AS TEXTFILE LOCATION' oss://test-bucket-julian-1/tpch_100m/part' CREATE EXTERNAL TABLE region (R_REGIONKEY INT, R_NAME STRING, R_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY'| 'STORED AS TEXTFILE LOCATION' oss://test-bucket-julian-1/tpch_100m/region'

See more.

After the table is built, refresh the page, and you can see the 8 tables under schema in the left navigation bar.

6. Execute TPC-H query

TPC-H has a total of 22 queries, as follows:

Q1:

SELECT l_returnflag, l_linestatus, Sum (l_quantity) AS sum_qty, Sum (l_extendedprice) AS sum_base_price, Sum (l_extendedprice * (1-l_discount)) AS sum_disc_price Sum (l_extendedprice * (1-l_discount) * (1 + l_tax)) AS sum_charge, Avg (l_quantity) AS avg_qty, Avg (l_extendedprice) AS avg_price Avg (l_discount) AS avg_disc, Count (*) AS count_orderFROM lineitemWHERE l_shipdate date '1995-03-31'GROUP BY l_orderkey, o_orderdate, o_shippriorityORDER BY revenue DESC, o_orderdateLIMIT 10

Q4:

SELECT o_orderpriority, Count (*) AS order_countFROM orders, lineitemWHERE o_orderdate > = date '1997-10-01'AND o_orderdate

< date '1997-10-01' + INTERVAL '3' monthAND l_orderkey = o_orderkeyAND l_commitdate < l_receiptdateGROUP BY o_orderpriorityORDER BY o_orderpriorityLIMIT 1; Q5: SELECT n_name, Sum(l_extendedprice * (1 - l_discount)) AS revenueFROM customer, orders, lineitem, supplier, nation, regionWHERE c_custkey = o_custkeyAND l_orderkey = o_orderkeyAND l_suppkey = s_suppkeyAND c_nationkey = s_nationkeyAND s_nationkey = n_nationkeyAND n_regionkey = r_regionkeyAND r_name = 'ASIA'AND o_orderdate >

= date '1995-01-01'AND o_orderdate

< date '1995-01-01' + INTERVAL '1' yearGROUP BY n_nameORDER BY revenue DESCLIMIT 1; Q6: SELECT sum(l_extendedprice * l_discount) AS revenueFROM lineitemWHERE l_shipdate >

= date '1995-01-01'AND l_shipdate

< date '1995-01-01' + interval '1' yearAND l_discount between 0.04 - 0.01 AND 0.04 + 0.01AND l_quantity < 24LIMIT 1; Q7: SELECT supp_nation, cust_nation, l_year, Sum(volume) AS revenueFROM ( SELECT n1.n_name AS supp_nation, n2.n_name AS cust_nation, Extract(year FROM l_shipdate) AS l_year, l_extendedprice * (1 - l_discount) AS volume FROM supplier, lineitem, orders, customer, nation n1, nation n2 WHERE s_suppkey = l_suppkey AND o_orderkey = l_orderkey AND c_custkey = o_custkey AND s_nationkey = n1.n_nationkey AND c_nationkey = n2.n_nationkey AND ( ( n1.n_name = 'GERMANY' AND n2.n_name = 'INDIA') OR ( n1.n_name = 'INDIA' AND n2.n_name = 'GERMANY') ) AND l_shipdate BETWEEN date '1995-01-01' AND date '1996-12-31' ) AS shippingGROUP BY supp_nation, cust_nation, l_yearORDER BY supp_nation, cust_nation, l_yearLIMIT 1; 查看更多 Q8: SELECT o_year, Sum( CASE WHEN nation = 'INDIA' THEN volume ELSE 0 end) / Sum(volume) AS mkt_shareFROM ( SELECT Extract(year FROM o_orderdate) AS o_year, l_extendedprice * (1 - l_discount) AS volume, n2.n_name AS nation FROM part, supplier, lineitem, orders, customer, nation n1, nation n2, region WHERE p_partkey = l_partkey AND s_suppkey = l_suppkey AND l_orderkey = o_orderkey AND o_custkey = c_custkey AND c_nationkey = n1.n_nationkey AND n1.n_regionkey = r_regionkey AND r_name = 'ASIA' AND s_nationkey = n2.n_nationkey AND o_orderdate BETWEEN date '1995-01-01' AND date '1996-12-31' AND p_type = 'STANDARD ANODIZED STEEL' ) AS all_nationsGROUP BY o_yearORDER BY o_yearLIMIT 1; 查看更多 Q9: SELECT nation, o_year, Sum(amount) AS sum_profitFROM ( SELECT n_name AS nation, Extract(year FROM o_orderdate) AS o_year, l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount FROM part, supplier, lineitem, partsupp, orders, nation WHERE s_suppkey = l_suppkey AND ps_suppkey = l_suppkey AND ps_partkey = l_partkey AND p_partkey = l_partkey AND o_orderkey = l_orderkey AND s_nationkey = n_nationkey AND p_name LIKE '%aquamarine%' ) AS profitGROUP BY nation, o_yearORDER BY nation, o_year DESCLIMIT 1; 查看更多 Q10: SELECT c_custkey, c_name, Sum(l_extendedprice * (1 - l_discount)) AS revenue, c_acctbal, n_name, c_address, c_phone, c_commentFROM customer, orders, lineitem, nationWHERE c_custkey = o_custkeyAND l_orderkey = o_orderkeyAND o_orderdate >

= date '1994-08-01'AND o_orderdate

< date '1994-08-01' + INTERVAL '3' monthAND l_returnflag = 'R'AND c_nationkey = n_nationkeyGROUP BY c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_commentORDER BY revenue DESCLIMIT 20; 查看更多 Q11: SELECT ps_partkey, Sum(ps_supplycost * ps_availqty) AS valueFROM partsupp, supplier, nationWHERE ps_suppkey = s_suppkeyAND s_nationkey = n_nationkeyAND n_name = 'PERU'GROUP BY ps_partkeyHAVING Sum(ps_supplycost * ps_availqty) >

(SELECT Sum (ps_supplycost * ps_availqty) * 0.0001000000 as sum_value FROM partsupp, supplier, nation WHERE ps_suppkey = s_suppkey AND s_nationkey = n_nationkey AND n_name = 'PERU') ORDER BY value DESCLIMIT 1

Q12:

SELECT l_shipmode, sum (case when o_orderpriority = '1lineitemWHERE o_orderkey' or o_orderpriority = '2murahh' then 1else 0end) AS high_line_count, sum (case when o_orderpriority '1whurgen' and o_orderpriority '2muri Higg' then 1else 0end) AS low_line_countFROM orders, lineitemWHERE o_orderkey = l_orderkeyAND l_shipmode in ('MAIL',' TRUCK') AND l_commitdate

< l_receiptdateAND l_shipdate < l_commitdateAND l_receiptdate >

= date '1996-01-01'AND l_receiptdate

< date '1996-01-01' + interval '1' yearGROUP BY l_shipmodeORDER BY l_shipmodeLIMIT 1; Q13: SELECT c_count, count(*) AS custdistFROM ( SELECT c_custkey, count(o_orderkey) AS c_count FROM customer, orders WHERE c_custkey = o_custkey AND o_comment NOT LIKE '%pending%accounts%' GROUP BY c_custkey ) AS c_ordersGROUP BY c_countORDER BY custdist DESC, c_count DESCLIMIT 1; Q14: SELECT 100.00 * sum(case when p_type like 'PROMO%' then l_extendedprice * (1 - l_discount) else 0end) / sum(l_extendedprice * (1 - l_discount)) AS promo_revenueFROM lineitem, partWHERE l_partkey = p_partkeyAND l_shipdate >

= date '1996-01-01'AND l_shipdate

< date '1996-01-01' + interval '1' monthLIMIT 1; Q15: WITH revenue0 AS(SELECT l_suppkey AS supplier_no, sum(l_extendedprice * (1 - l_discount)) AS total_revenueFROM lineitemWHERE l_shipdate >

= date '1993-01-01'AND l_shipdate

< date '1993-01-01' + interval '3' monthGROUP BY l_suppkey)SELECT s_suppkey, s_name, s_address, s_phone, total_revenueFROM supplier, revenue0WHERE s_suppkey = supplier_noAND total_revenue IN ( SELECT max(total_revenue) FROM revenue0 )ORDER BY s_suppkey; Q16: SELECT p_brand, p_type, p_size, count(distinct ps_suppkey) AS supplier_cntFROM partsupp, partWHERE p_partkey = ps_partkeyAND p_brand 'Brand#23'AND p_type NOT LIKE 'PROMO BURNISHED%'AND p_size IN (1, 13, 10, 28, 21, 35, 31, 11)AND ps_suppkey NOT IN ( SELECT s_suppkey FROM supplier WHERE s_comment LIKE '%Customer%Complaints%' )GROUP BY p_brand, p_type, p_sizeORDER BY supplier_cnt DESC, p_brand, p_type, p_sizeLIMIT 1; Q17: SELECT sum(l_extendedprice) / 7.0 AS avg_yearlyFROM lineitem, partWHERE p_partkey = l_partkey AND p_brand = 'Brand#44' AND p_container = 'WRAP PKG' AND l_quantity < ( SELECT 0.2 * avg(l_quantity) FROM lineitem, part WHERE l_partkey = p_partkey ); Q18: SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity)FROM customer, orders, lineitemWHERE o_orderkey IN ( SELECT l_orderkey FROM lineitem GROUP BY l_orderkey HAVING sum(l_quantity) >

AND c_custkey = o_custkeyAND o_orderkey = l_orderkeyGROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalpriceORDER BY o_totalprice DESC, o_orderdateLIMIT 100

Q19:

SELECT sum (lumped extendedprice* (1-l_discount)) AS revenueFROM lineitem, partWHERE (p_partkey = l_partkey and p_brand = 'Brand#12' and p_container in (' SM CASE','SM BOX','SM PACK','SM PKG') and l_quantity > = 6 and l_quantity = 10 and l_quantity = 21 and l_quantity = date '1993-01-01' and l_shipdate

< date '1993-01-01' + interval '1' year)select s_name, s_addressfrom supplier, nationwhere s_suppkey in ( select ps_suppkey from partsupp, temp_table where ps_partkey in ( select p_partkey from part where p_name like 'dark%' ) and ps_availqty >

Temp_table.col1) and s_nationkey = n_nationkey and n_name = 'JORDAN'order by s_namelimit 1

Q21:

Select s_name, count (*) as numwaitfrom supplier, lineitem L1, orders Nationwhere s_suppkey = l1.l_suppkey and o_orderkey = l1.l_orderkey and o_orderstatus ='F' and l1.l_receiptdate > l1.l_commitdate and exists (select * from lineitem L2 where l2.l_orderkey = l1.l_orderkey and l2.l_suppkey l1.l_suppkey) and not exists ( Select * from lineitem L3 where l3.l_orderkey = l1.l_orderkey and l3.l_suppkey l1.l_suppkey and l3.l_receiptdate > l3.l_commitdate) and s_nationkey = n_nationkey and n_name = 'SAUDI ARABIA'group by s_nameorder by numwait desc S_namelimit 100

See more.

Q22:

With temp_table_1 as (select avg (c_acctbal) as avg_value from customer where c_acctbal > 0.00 and substring (c_phone from 1 for 2) in ('33,'29,'37,'35,'25,'27,'43'), temp_table_2 as (select count (*) as count1 from orders, customer where o_custkey = c_custkey) select cntrycode, count (*) as numcust Sum (c_acctbal) as totacctbalfrom (select substring (c_phone from 1 for 2) as cntrycode, c_acctbal from customer, temp_table_1, temp_table_2 where substring (c_phone from 1 for 2) in ('33,'29,'37,'35, 25, 27' '43') and c_acctbal > temp_table_1.avg_value and temp_table_2.count1 = 0) as custsalegroup by cntrycodeorder by cntrycodelimit 1 7. Execute query asynchronously

Data Lake Analytics supports synchronous execution mode and asynchronous execution mode. In synchronous execution mode, the console interface waits for the execution result to be returned; in asynchronous execution mode, the ID of the query task is returned immediately.

Click "execution status" to see the execution status of the asynchronous query task, which is mainly divided into "RUNNING", "SUCCESS" and "FAILURE".

Click "Refresh". When STATUS changes to "SUCCESS", the query is successful. At the same time, you can view the query time spent "ELAPSE_TIME" and the number of data bytes scanned by the query "SCANNED_DATA_BYTES".

8. View query history

Click "execution History" to see the historical details of the query you executed, including:

1) query statement

2) query time and execution time

3) number of rows returned by query result

4) query status

5) query the number of bytes scanned

6) the target OSS file to which the result set is written back (Data Lake Analytics saves the query result set in the user's bucket).

The query result file is automatically uploaded to the OSS bucket of the user and region, including the result data file and the result set metadata description file.

{QueryLocation} / {query_name} | Unsaved} / {yyyy} / {mm} / {dd} / {query_id} / xxx.csv {QueryLocation} / {query_name} | Unsaved} / {yyyy} / {mm} / {dd} / {query_id} / xxx.csv.metadata

Where QueryLocation is:

Aliyun-oa-query-results--

Is it helpful for you to read the above content? If you want to know more about the relevant knowledge or read more related articles, please follow the industry information channel, thank you for your support.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Servers

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report