This article describes how GaussDB for DWS identifies bad-smelling SQL. The explanation is concise and easy to follow, and I hope you gain something from the detailed introduction below.
SQL is the standard language of relational databases (RDBs); its role is to translate the user's intent into a language the database can understand and execute. When people communicate with each other, different ways of expressing the same meaning can produce very different effects.
Similarly, when communicating with a database, the same operation expressed in different SQL statements can run with very different efficiency. Even the same SQL statement may be executed by the database in different ways, again with different efficiency. SQL statements, and ways of executing them, that cause inefficiency are called "bad smells" in SQL.
Here is a simple example of what a bad smell in SQL is.
Figure 1-a Merging Sets with Union
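As a minimal sketch of the kind of query Figure 1-a shows (the staffs table and its columns are hypothetical):

SELECT staff_id, staff_name FROM staffs WHERE dept_id = 1
UNION
SELECT staff_id, staff_name FROM staffs WHERE dept_id > 2;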
In the query above, because union is used to merge the two result sets, a sort and de-duplication are required after merging, which adds overhead. In fact, the results of dept_id = 1 and dept_id > 2 do not overlap, so they can safely be merged with union all, as shown in the following figure.
Figure 1-b: Union all.
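A sketch of the union all form shown in Figure 1-b, on the same hypothetical table:

SELECT staff_id, staff_name FROM staffs WHERE dept_id = 1
UNION ALL
SELECT staff_id, staff_name FROM staffs WHERE dept_id > 2;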
A more efficient approach is to use an or condition to filter out the desired rows directly during the scan, which saves both the computation and the memory needed to hold intermediate results, as shown in the following figure.
Figure 1-c: Filtering results with or condition
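A sketch of the single-scan form shown in Figure 1-c, with the two predicates merged into one or condition:

SELECT staff_id, staff_name FROM staffs WHERE dept_id = 1 OR dept_id > 2;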
It can be seen that the same operation, written as different SQL statements, can differ greatly in efficiency; the first two SQL statements carry a "bad smell" to varying degrees.
For this simple example, users can easily spot the problem and choose the best solution. For some complex SQL statements, however, the performance defects may be well hidden and require in-depth analysis to uncover. This places high demands on database users; even experienced database experts sometimes struggle to pinpoint the cause of performance degradation.
GaussDB analyzes and records the performance of SQL statements as it executes them and presents this information to users through views and functions. This article briefly describes how to use this "first-hand" data provided by GaussDB to analyze and locate performance problems in SQL statements, and to identify and eliminate the "bad smells" in SQL.
◆ Self-diagnostic view to identify SQL bad smells
When executing SQL, GaussDB records and analyzes the execution plan and the resources consumed during execution. If an abnormal situation is found, it also records alarm information for "self-diagnosing" the cause. Users can query this information through the following views:
· gs_wlm_session_info
· pgxc_wlm_session_info
· gs_wlm_session_history
· pgxc_wlm_session_history
gs_wlm_session_info is the base table; the other three are views. Those whose names begin with gs_ contain information collected on the current CN node, while those beginning with pgxc_ contain information collected on all CNs in the cluster. The definitions of the table and the views are essentially the same, as shown in the following tables.
Table 1-1 Self-diagnosis table and function field definitions
Table 1-2 Self-diagnosis table and function field definitions
The query field contains the SQL statement that was executed. By analyzing the fields recorded for each query, such as execution time, memory, I/O, disk spill volume, and skew ratio, you can find suspicious SQL statements and then examine them further together with the query_plan field (the execution plan). In particular, for abnormal situations detected during execution, the warning field gives human-readable warning messages. The self-diagnostic information currently available is described item by item below.
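Before walking through the individual diagnostics, here is a minimal sketch of pulling these warnings; only the query, query_plan, and warning fields described above are used, and it is assumed that the warning field is empty or NULL when no problem was detected:

-- List recorded statements that carry self-diagnosis warnings, cluster-wide.
SELECT query, warning, query_plan
FROM pgxc_wlm_session_history
WHERE warning IS NOT NULL AND warning <> '';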
◇ Multi/single-column statistics not collected
The optimizer relies on table statistics to generate reasonable execution plans. If statistics on a table's columns are not collected in time, the optimizer's judgment may be misled, resulting in poor execution plans. If single-column or multi-column statistics of a table have not been collected when the plan is generated, the warning field gives the following warning:
Statistic Not Collect:schemaname.tablename(column name list)
In addition, if statistics have been collected (analyzed) for a table but the table's contents have changed significantly since the last analyze, the optimizer may rely on stale statistics and fail to generate an optimal query plan. In this case, the pg_total_autovac_tuples system function can be used to query how many tuples in the table have changed since the last analyze. If that number is large, it is a good idea to run analyze again so that the optimizer gets up-to-date statistics.
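A minimal sketch of refreshing statistics on a flagged table; schemaname.tablename, col1, and col2 are placeholders, and the multi-column form is an assumption about GaussDB's extended-statistics syntax:

-- Recollect single-column statistics for the table named in the warning.
ANALYZE schemaname.tablename;
-- Assumed syntax for collecting multi-column statistics on (col1, col2).
ANALYZE schemaname.tablename ((col1, col2));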
◇ SQL not pushed down
If an operator in the execution plan cannot be pushed down to the DN nodes, it can only be executed on the CN. Because the number of CNs is much smaller than the number of DNs, piling a large number of operations onto the CNs hurts overall performance. If a function or syntax that cannot be pushed down is encountered, the warning field gives the following warning:
SQL is not plan-shipping, reason : %s
◇ Hash join with a large table as the inner table
If it is found that a large table is used as an inner table when performing a Hash join, the following alarm information will be given:
PlanNode[%d] Large Table is INNER in HashJoin \"%s\"
At present, the criterion for a "large table" here is that the inner table averages more than 100,000 rows per DN and has more than 10 times as many rows as the outer table.
◇ Large-table equi-join using NestLoop
If a NestLoop join is used for an equi-join between large tables, the following alarm information is given:
PlanNode[%d] Large Table with Equal-Condition use Nestloop\"%s\"
At present, the criterion for a large-table equi-join is that the larger of the inner and outer tables' row counts exceeds the number of DNs multiplied by 100,000.
◇ Data skew
When data is unevenly distributed among the DNs, the nodes holding more data can become performance bottlenecks. If severe data skew is detected, the following warning information is given:
PlanNode[%d] DataSkew:\"%s\", min_dn_tuples:%.0f, max_dn_tuples:%.0f
At present, the criterion for data skew is that the DN with the most rows has more than 10 times as many rows as the DN with the fewest, and the maximum exceeds 100,000.
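As a rough way to check the distribution yourself (an assumption: the PGXC-style hidden column xc_node_id is available on distributed tables; staffs is a hypothetical table), per-DN row counts can be compared like this:

-- Count rows per data node to spot a skewed distribution.
SELECT xc_node_id, count(*) AS rows_on_dn
FROM staffs
GROUP BY xc_node_id
ORDER BY rows_on_dn DESC;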
◇ Cost estimates are inaccurate
GaussDB calculates the actual cost of executing a SQL statement and compares it with the estimated cost. If the optimizer's cost estimate deviates significantly from the actual cost, it is likely to produce a non-optimal plan. If the cost estimate is found to be inaccurate, the following warning information is given:
"PlanNode[%d] Inaccurate Estimation-Rows: \"%s\" A-Rows:%.0f, E-Rows:%.0f
Cost is currently measured by the number of rows returned by a plan node. If the actual or estimated number of returned rows averages more than 100,000 per DN and the two differ by more than a factor of 10, the cost estimate is considered inaccurate.
◇ Broadcast volume is too large
Broadcast is mainly suitable for small tables; for large tables, hash redistribution is usually more efficient. If a step in the plan broadcasts a large table, the following alarm information is given:
PlanNode[%d] Large Table in Broadcast \"%s\"
The current standard for large table broadcasts is that the average number of rows broadcast to each DN is greater than 100,000.
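As a hedged sketch (hypothetical tables; GaussDB for DWS distribution syntax assumed), distributing both join tables by the join key lets the join run locally on each DN, so neither side needs to be broadcast or redistributed:

-- Both tables are hash-distributed on the join key cust_id.
CREATE TABLE orders (order_id int, cust_id int) DISTRIBUTE BY HASH(cust_id);
CREATE TABLE customer (cust_id int, cust_name text) DISTRIBUTE BY HASH(cust_id);
-- The join key matches both distribution keys, so the plan should show no Broadcast step.
EXPLAIN SELECT o.order_id, c.cust_name
FROM orders o JOIN customer c ON o.cust_id = c.cust_id;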
◇ Index settings are unreasonable
Improper use of indexes can also cause poor performance, for example using a sequential scan where an index scan should be used, or an index scan where a sequential scan should be used.
The value of an index scan is that it reduces the amount of data read, so the more rows the index filters out, the better. If an index scan is used but the ratio of output rows to total scanned rows is greater than 1/1000, and the number of output rows is greater than 10,000 (for row-store tables) or 100 (for column-store tables), the following alarm message is given:
PlanNode[%d] Indexscan is not properly used:\"%s\", output:%.0f, filtered:%.0f, rate:%.5f
A sequential scan is appropriate when the rows filtered out are not a large proportion of the total. Conversely, if a sequential scan is used but only a small fraction of the scanned rows survives the filter, an index scan would likely have been more efficient.
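As a minimal sketch (hypothetical table and index; exact plan output differs by version), EXPLAIN can be used to verify which scan method the optimizer actually chooses for a given predicate:

-- Create an index on the filter column and inspect the chosen plan.
CREATE INDEX idx_staffs_dept ON staffs (dept_id);
EXPLAIN SELECT staff_id, staff_name FROM staffs WHERE dept_id = 1;
-- A Seq Scan on a highly selective predicate, or an Index Scan that returns
-- most of the table, suggests the index design should be revisited.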