How to optimize Presto queries


This article introduces how to optimize Presto queries. It should serve as a useful reference for anyone interested in the topic.

Presto is an open-source distributed SQL query engine designed for interactive analytical queries. It supports data volumes ranging from gigabytes to petabytes, and its query language follows ANSI SQL.

I. Data storage

Set partitions reasonably

Similar to Hive, Presto reads partition data based on metadata. Reasonable partitioning reduces the amount of data Presto has to read and improves query performance.
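
For illustration, a minimal sketch of a date-partitioned Hive table (the access_log table and its fields are hypothetical):

CREATE TABLE access_log (user_id bigint, url string)
PARTITIONED BY (dt int COMMENT 'date: yyyymmdd');

A query that filters on dt then reads only the matching partitions.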

Use columnar storage

Presto has specific optimizations for reading ORC files, so when creating tables in Hive for use with Presto, it is recommended to store them in ORC format. Presto's support for ORC is better than its support for Parquet.

Use Compression

Data compression reduces the I/O bandwidth pressure caused by data transfer between nodes. For ad hoc queries, Snappy compression is recommended.
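
A minimal sketch combining ORC storage with Snappy compression (hypothetical table name; a complete, real example appears in section III):

CREATE TABLE access_log_orc (user_id bigint, url string)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");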

Pre-sorting

For sorted data, the ORC format can skip reading unnecessary data during the filtering phase of a query. For example, fields that are frequently filtered on can be pre-sorted.

INSERT INTO TABLE nation_orc PARTITION (p) SELECT * FROM nation SORT BY n_name;

If a query then filters on the n_name field, performance improves.

SELECT count(*) FROM nation_orc WHERE n_name = 'AUSTRALIA';

II. Query SQL optimization

Select only the necessary fields

Because Presto uses columnar storage, selecting only the required fields speeds up reading and reduces the amount of data processed. Avoid using * to read all fields.

[GOOD]: SELECT time, user, host FROM tbl;
[BAD]:  SELECT * FROM tbl;

Filter conditions must include partition fields

For partitioned tables, filter on the partition field in the WHERE clause first. In the example below, acct_day is the partition field and visit_time is the specific access time.

[GOOD]: SELECT time, user, host FROM tbl WHERE acct_day = 20171101;
[BAD]:  SELECT * FROM tbl WHERE visit_time = 20171101;

Group By statement optimization

Arranging the fields in a GROUP BY clause sensibly improves performance to some extent: order them in descending order of cardinality (number of distinct values). In the example, uid is a user ID, which has far more distinct values than gender.

[GOOD]: SELECT ... GROUP BY uid, gender;
[BAD]:  SELECT ... GROUP BY gender, uid;

Use LIMIT with ORDER BY

ORDER BY gathers all data onto a single worker node for sorting, which consumes a large amount of memory on that worker. If you are querying Top N or Bottom N results, adding LIMIT reduces the sorting work and memory pressure.

[GOOD]: SELECT * FROM tbl ORDER BY time LIMIT 100;
[BAD]:  SELECT * FROM tbl ORDER BY time;

Also try to reduce the number of sort fields to speed up the computation; see the sketch below.
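
An illustrative sketch, reusing the tbl table from above (the extra sort fields are hypothetical):

[GOOD]: SELECT * FROM tbl ORDER BY time LIMIT 100;
[BAD]:  SELECT * FROM tbl ORDER BY time, user, host LIMIT 100;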

Use approximate aggregate functions

Presto provides approximate aggregate functions, which can greatly improve query performance in scenarios that tolerate a small error. For example, the approx_distinct() function has a standard error of about 2.3% compared to COUNT(DISTINCT x).

SELECT approx_distinct(user_id) FROM access;

If you need exact deduplication, use COUNT combined with GROUP BY instead, as sketched below.
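
A minimal sketch of the exact variant, assuming the same access table and user_id column as in the example above:

SELECT count(*) FROM (SELECT user_id FROM access GROUP BY user_id) t;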

Replace multiple like statements with regexp_like

The Presto query optimizer does not optimize multiple LIKE clauses; using regexp_like instead greatly improves performance.

[GOOD]: SELECT ... FROM access WHERE regexp_like(method, 'GET|POST|PUT|DELETE');
[BAD]:  SELECT ... FROM access WHERE method LIKE '%GET%' OR method LIKE '%POST%' OR method LIKE '%PUT%' OR method LIKE '%DELETE%';

Place the large table on the left when using the join statement

Presto's default join algorithm is the broadcast join, which distributes the left table of the join across multiple workers and then copies the entire right table to every worker for the computation. If the right table is too large, an out-of-memory error may occur.

[GOOD]: SELECT ... FROM large_table l JOIN small_table s ON l.id = s.id;
[BAD]:  SELECT ... FROM small_table s JOIN large_table l ON l.id = s.id;

What if both the left and right tables are large? To avoid out-of-memory errors, there are two options:

1) Set the configuration property distributed-joins-enabled (Presto version >= 0.196); see the config sketch after the query example below.

2) Set the distributed_join session property at the start of each query:

SET SESSION distributed_join = 'true';
SELECT ... FROM large_table1 JOIN large_table2 ON large_table1.id = large_table2.id;
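
For option 1, a sketch of the corresponding line in the coordinator's etc/config.properties (assuming a standard Presto deployment; the property name is taken from the text above):

distributed-joins-enabled=true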

The key point is to use a distributed join: with this configuration, Presto partitions both the left and right tables by the hash of the join key, so even a large right table gets split across workers.

The disadvantage is a large increase in network data transfer, so it is slower than a broadcast join.

Use the Rank function instead of the row_number function to get Top N

In grouped Top N scenarios, the rank function performs better:

[GOOD]: SELECT checksum(rnk) FROM (SELECT rank() OVER (PARTITION BY l_orderkey, l_partkey ORDER BY l_shipdate DESC) AS rnk FROM lineitem) t WHERE rnk = 1;
[BAD]:  SELECT checksum(rnk) FROM (SELECT row_number() OVER (PARTITION BY l_orderkey, l_partkey ORDER BY l_shipdate DESC) AS rnk FROM lineitem) t WHERE rnk = 1;

Use WITH statements

When using Presto for statistical analysis, consider merging multiple queries into a single query using the WITH clause (common table expressions) that Presto provides.

This is not quite the same as the familiar MySQL usage. Note the comma between the subqueries below.

WITH subquery_1 AS (
    SELECT a1, a2, a3 FROM Table_1 WHERE a3 BETWEEN 20180101 AND 20180131
),
subquery_2 AS (
    SELECT b1, b2, b3 FROM Table_2 WHERE b3 BETWEEN 20180101 AND 20180131
)
SELECT subquery_1.a1, subquery_1.a2, subquery_2.b1, subquery_2.b2
FROM subquery_1
JOIN subquery_2 ON subquery_1.a3 = subquery_2.b3;

Try to use UNION ALL instead of UNION

Like DISTINCT, UNION deduplicates rows, which can cause memory-usage problems.

If you are simply concatenating the results of two or more queries, use UNION ALL instead, as in the sketch below.
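
An illustrative sketch (the table names are hypothetical):

[GOOD]: SELECT uid FROM access_2017 UNION ALL SELECT uid FROM access_2018;
[BAD]:  SELECT uid FROM access_2017 UNION SELECT uid FROM access_2018;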

III. Seamlessly replacing Hive tables

If an existing Hive table does not use ORC and Snappy, here is how to replace it seamlessly without affecting online applications.

For example, given the following Hive table:

CREATE TABLE bdc_dm.res_category (
    channel_id1 int COMMENT 'level-1 channel id',
    province string COMMENT 'province',
    city string COMMENT 'city',
    uv int COMMENT 'uv'
) COMMENT 'example'
PARTITIONED BY (landing_date int COMMENT 'date: yyyymmdd')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';

Create the corresponding ORC table:

CREATE TABLE bdc_dm.res_category_orc (
    channel_id1 int COMMENT 'level-1 channel id',
    province string COMMENT 'province',
    city string COMMENT 'city',
    uv int COMMENT 'uv'
) COMMENT 'example'
PARTITIONED BY (landing_date int COMMENT 'date: yyyymmdd')
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

First load the data into the ORC table, then swap the table names:

INSERT OVERWRITE TABLE bdc_dm.res_category_orc PARTITION (landing_date)
SELECT * FROM bdc_dm.res_category WHERE landing_date >= 20171001;
ALTER TABLE bdc_dm.res_category RENAME TO bdc_dm.res_category_tmp;
ALTER TABLE bdc_dm.res_category_orc RENAME TO bdc_dm.res_category;

Here res_category_tmp serves as a backup; if everything runs fine online for a while, you can drop the table, as sketched below.
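
A one-line sketch of that cleanup step (run it only once you are confident the swap is stable):

DROP TABLE bdc_dm.res_category_tmp;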

IV. Matters needing attention

Both ORC and Parquet are columnar storage formats, but Presto supports ORC better (while Impala supports Parquet better).

Columnar storage files are binary, so for tables whose fields are frequently added or deleted, columnar storage is not recommended (modifying the file metadata is expensive). In a data warehouse, ORC is therefore not recommended for the dwd layer, but is recommended for the dm layer.

Thank you for reading. I hope this article on how to optimize Presto queries has been helpful.
