Shulou (shulou.com) — SLTechnology News & Howtos > Database — 05/31 report, updated 2025-02-28
This article introduces a set of minimalist rules for SQL optimization. They are explained in detail and have reference value; readers who are interested are encouraged to read to the end.
Query optimization is a complex undertaking. It involves hardware, parameter configuration, the parsers and optimizers of different databases, the execution order of SQL statements, indexes, statistics collection, and even the overall architecture of the application and system. This article introduces several key rules that help us write efficient SQL queries; for beginners in particular, they can at least prevent us from writing poorly performing queries.
The following rules apply to a variety of relational databases, including but not limited to MySQL, Oracle, SQL Server, PostgreSQL, and SQLite.
Rule 1: only return the desired results
Always specify a WHERE condition so the query filters out the rows you do not need. Generally speaking, an OLTP system only needs to return a few records out of a large amount of data at a time; specifying query conditions lets the database return results through an index rather than a full table scan. Performance is better with an index in most cases, because index structures (B-trees, B+ trees, B* trees) support search in logarithmic rather than linear time.
For example, suppose each index node can store 100 entries; then 1 million (100³) records need only a three-level B-tree. Looking up data through the index reads index data three times (each disk IO reads a whole node), plus one more disk IO to fetch the row itself.
In contrast, a full table scan may require several orders of magnitude more disk IO. When the data volume grows to 100 million (100⁴), the B-tree index only needs one additional level, and therefore one additional IO per lookup, while the full table scan needs several orders of magnitude more IO.
By the same token, we should avoid SELECT *, which returns every column of the table. Writing queries this way usually makes the database read more data and the network transmit more data, which degrades performance.
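The effect of a WHERE condition on access paths can be observed directly. Below is a minimal sketch on SQLite (one of the databases this article covers) using an invented employee table and index; the schema and names are assumptions for illustration only.

```python
# Contrast an indexed lookup with a full table scan using SQLite's
# EXPLAIN QUERY PLAN. The table, column, and index names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, emp_name TEXT, dept_id INTEGER)")
conn.execute("CREATE INDEX idx_emp_dept ON employee (dept_id)")
conn.executemany("INSERT INTO employee (emp_name, dept_id) VALUES (?, ?)",
                 [(f"emp{i}", i % 5) for i in range(1000)])

def plan(sql):
    # Each EXPLAIN QUERY PLAN row is (id, parent, notused, detail);
    # the human-readable step is in column 3.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# With a WHERE condition on an indexed column, SQLite reports a SEARCH
# (index lookup); without any condition, it reports a SCAN of the table.
print(plan("SELECT emp_id, emp_name FROM employee WHERE dept_id = 3"))
print(plan("SELECT * FROM employee"))
```

The same idea applies in MySQL and the others, only the plan syntax differs.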
Rule 2: make sure the query uses the correct index
If a suitable index is missing, the data cannot be located through an index even when a query condition is specified. Therefore, we first need to make sure the appropriate indexes exist. In general, the following fields should be indexed:
Fields that often appear in WHERE conditions, to avoid full table scans
Fields used in ORDER BY, to avoid additional sort operations
Join fields in multi-table queries, to improve join performance
Fields used in GROUP BY, so that grouping can be completed using the index
Even with the appropriate index in place, the database will not use it if the SQL statement itself has problems. Common causes of index failure include:
Applying an expression or function to an indexed field in the WHERE clause; this also covers data type mismatches, such as comparing a string with an integer
Using LIKE with a leading wildcard, which prevents index use; for fuzzy matching over large text data, consider the database's full-text search features or even a dedicated search engine such as Elasticsearch
Allowing NULL in an indexed WHERE field; declare it NOT NULL where possible, because not all databases can use an index for IS [NOT] NULL checks
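The first two failure modes can be reproduced concretely. The sketch below runs on SQLite with an invented table and index; the names and data are assumptions, but the pattern (function on an indexed column, leading wildcard) is exactly what the list above describes.

```python
# Show predicates that defeat an index in SQLite: wrapping the indexed
# column in a function, or using a leading wildcard with LIKE, prevents
# an index seek (SEARCH) and falls back to a scan.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, emp_name TEXT)")
conn.execute("CREATE INDEX idx_name ON employee (emp_name)")

def detail(sql):
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Plain equality on the indexed column: SEARCH via idx_name.
print(detail("SELECT * FROM employee WHERE emp_name = 'Zhang Fei'"))
# Function applied to the column: the index cannot be seeked, so a SCAN.
print(detail("SELECT * FROM employee WHERE upper(emp_name) = 'ZHANG FEI'"))
# Leading wildcard: no usable prefix, so again a SCAN.
print(detail("SELECT * FROM employee WHERE emp_name LIKE '%Fei'"))
```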
The execution plan (also known as the query plan or explain plan) describes the concrete steps the database takes to execute a SQL statement, such as whether a table is accessed through an index or a full table scan, and how joins are implemented and in what order. If a SQL statement performs poorly, we should first inspect its execution plan (EXPLAIN) to make sure the query uses the correct index.
Rule 3: avoid using subqueries as much as possible
Taking MySQL as an example, the following query returns information about employees whose monthly salary is higher than the average monthly salary of their department:
EXPLAIN ANALYZE
SELECT emp_id, emp_name
  FROM employee e
 WHERE salary > (SELECT AVG(salary) FROM employee WHERE dept_id = e.dept_id);

-> Filter: (e.salary > (select #2))  (cost=2.75 rows=25) (actual time=0.232..4.401 rows=6 loops=1)
    -> Table scan on e  (cost=2.75 rows=25) (actual time=0.099..0.190 rows=25 loops=1)
    -> Select #2 (subquery in condition; dependent)
        -> Aggregate: avg(employee.salary)  (actual time=0.147..0.149 rows=1 loops=25)
            -> Index lookup on employee using idx_emp_dept (dept_id=e.dept_id)  (cost=1.12 rows=5) (actual time=0.068..0.104 rows=7 loops=25)
As the execution plan shows, MySQL executes this much like a Nested Loop Join: the subquery runs 25 times, even though the average salary of each department could be computed and cached with a single scan. The following statement replaces the subquery with an equivalent JOIN, effectively unnesting it (Subquery Unnest):
EXPLAIN ANALYZE
SELECT e.emp_id, e.emp_name
  FROM employee e
  JOIN (SELECT dept_id, AVG(salary) AS dept_average
          FROM employee
         GROUP BY dept_id) t
    ON e.dept_id = t.dept_id
 WHERE e.salary > t.dept_average;

-> Nested loop inner join  (actual time=0.722..2.354 rows=6 loops=1)
    -> Table scan on e  (cost=2.75 rows=25) (actual time=0.096..0.205 rows=25 loops=1)
    -> Filter: (e.salary > t.dept_average)  (actual time=0.068..0.076 rows=0 loops=25)
        -> Index lookup on t using (dept_id=e.dept_id)  (actual time=0.011..0.015 rows=1 loops=25)
            -> Materialize  (actual time=0.048..0.057 rows=1 loops=25)
                -> Group aggregate: avg(employee.salary)  (actual time=0.228..0.510 rows=5 loops=1)
                    -> Index scan on employee using idx_emp_dept  (cost=2.75 rows=25) (actual time=0.181..0.348 rows=25 loops=1)
The rewritten query uses materialization to turn the subquery result into an in-memory temporary table, which is then joined with the employee table. The actual execution times show that this approach is faster.
In Oracle and SQL Server, the subquery in the above example is unnested automatically, so both forms perform the same; in PostgreSQL, as in MySQL, the first statement is executed with a Nested Loop Join, and after being rewritten to a JOIN it is executed with a Hash Join, giving better performance.
In addition, similar conclusions hold for IN and EXISTS subqueries. Given the differences in optimizer capabilities across databases, we should try to avoid subqueries and consider rewriting them with JOIN.
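Before rewriting, it is worth confirming the two forms really are equivalent. The sketch below checks this on SQLite with invented data (the schema mirrors the article's employee example, but the rows are made up); both queries must name the same above-average earners.

```python
# Verify that the correlated subquery and its JOIN rewrite return the same
# employees (those paid above their department's average salary).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, emp_name TEXT, dept_id INTEGER, salary REAL)")
conn.executemany(
    "INSERT INTO employee (emp_name, dept_id, salary) VALUES (?, ?, ?)",
    [("Liu Bei", 1, 9000), ("Guan Yu", 1, 8000), ("Zhang Fei", 1, 7000),
     ("Zhuge Liang", 2, 12000), ("Zhao Yun", 2, 6000)])

subquery = """SELECT emp_id, emp_name FROM employee e
              WHERE salary > (SELECT AVG(salary) FROM employee WHERE dept_id = e.dept_id)"""
join = """SELECT e.emp_id, e.emp_name
          FROM employee e
          JOIN (SELECT dept_id, AVG(salary) AS dept_average
                  FROM employee GROUP BY dept_id) t ON e.dept_id = t.dept_id
          WHERE e.salary > t.dept_average"""

r1 = sorted(conn.execute(subquery).fetchall())
r2 = sorted(conn.execute(join).fetchall())
print(r1 == r2, r1)  # True [(1, 'Liu Bei'), (4, 'Zhuge Liang')]
```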
Rule 4: do not use OFFSET for paging
The principle of a paging query is to skip a specified number of rows and then return the top N records. Databases generally support FETCH/LIMIT together with OFFSET to implement Top-N and paging queries. This kind of paging can cause performance problems when the table holds a large amount of data. Take MySQL as an example:
-- MySQL
SELECT * FROM large_table ORDER BY id LIMIT 10 OFFSET N;
As OFFSET increases, the query becomes slower and slower: even though only 10 records are returned, the database still needs to visit and discard N (for example, 1,000,000) rows, and even with an index this involves unnecessary scanning.
For this kind of paging, it is better to remember the largest id obtained on the previous page and pass it as a condition in the next query:
-- MySQL
SELECT * FROM large_table WHERE id > last_id ORDER BY id LIMIT 10;
If the id field is indexed, the cost of this paging query is largely independent of the amount of data.
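The two styles can be compared directly. Below is a minimal sketch on SQLite with an invented large_table: both queries return the same page, but the keyset ("seek") form lets the index jump straight to the starting row instead of reading and discarding N rows first.

```python
# OFFSET paging vs keyset (seek) paging: same page, different access cost.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE large_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO large_table (id, payload) VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(1, 10001)])

page_size, last_id = 10, 5000  # last_id: the largest id seen on the previous page

# OFFSET form: the database must step over last_id rows before returning 10.
offset_page = conn.execute(
    "SELECT id FROM large_table ORDER BY id LIMIT ? OFFSET ?",
    (page_size, last_id)).fetchall()

# Keyset form: the primary-key index seeks directly to id > last_id.
keyset_page = conn.execute(
    "SELECT id FROM large_table WHERE id > ? ORDER BY id LIMIT ?",
    (last_id, page_size)).fetchall()

print(offset_page == keyset_page)  # True: ids 5001..5010 either way
```

One design caveat: keyset paging requires a stable, unique sort key and only supports "next page" navigation naturally, which is why it suits infinite-scroll interfaces best.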
Rule 5: understand the logical execution order of SQL clauses
The following is the syntactic order of the clauses in SQL, and the numbers in the preceding parentheses represent their logical execution order:
(6) SELECT [DISTINCT | ALL] col1, col2, agg_func(col3) AS alias
(1) FROM t1 JOIN t2
(2)   ON (join_conditions)
(3) WHERE where_conditions
(4) GROUP BY col1, col2
(5) HAVING having_condition
(7) UNION [ALL] ...
(8) ORDER BY col1 ASC, col2 DESC
(9) OFFSET m ROWS FETCH NEXT num_rows ROWS ONLY
That is, SQL does not execute the SELECT before the FROM clause in the order in which it is written. Logically, the SQL statements are executed in the following order:
First, FROM and JOIN are the first steps in the execution of the SQL statement. Their logical result is a Cartesian product that determines the data set to be operated on next. Note that the logical execution order does not represent the physical execution order. In fact, the database uses ON and WHERE filtering conditions to optimize access before getting the data in the table.
Secondly, the ON condition is applied to filter the results of the previous step and generate a new data set.
Then, the WHERE clause is executed to filter the dataset from the previous step again. WHERE and ON have the same effect in most cases, but outer join queries differ, as the example below shows.
Next, the grouping is based on the expression specified in the GROUP BY clause; at the same time, the result of the aggregate function agg_func is calculated for each grouping. After GROUP BY processing, the structure of the dataset changes, leaving only the results of grouped fields and aggregate functions.
If there is a GROUP BY clause, HAVING can further filter the grouped results, usually on the results of aggregate functions.
Next, SELECT specifies the columns to return; if the DISTINCT keyword is specified, the result set is deduplicated. Aliases declared with AS also take effect at this step.
If there are collection operators (UNION, INTERSECT, EXCEPT) and other SELECT statements, execute the query and merge the two result sets. For multiple SELECT statements in a collection operation, the database can usually support concurrent execution
Then, the ORDER BY clause is applied to sort the results. If there is a GROUP BY clause or DISTINCT keyword, you can only sort using grouping fields and aggregate functions; otherwise, you can sort using any field in the FROM and JOIN tables
Finally, OFFSET and FETCH (LIMIT, TOP) limit the final number of rows returned.
Understanding the order in which SQL logic executes can help us with SQL optimization. For example, the WHERE clause executes before the HAVING clause, so we should try to use WHERE for data filtering to avoid unnecessary operations, unless the business needs to filter the results of the aggregate function.
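The WHERE-before-HAVING point can be made concrete. Below is a minimal sketch on SQLite with invented data: the WHERE filter removes rows before aggregation, while the HAVING filter tests the already-computed group aggregates.

```python
# WHERE filters rows before GROUP BY; HAVING filters groups after it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_name TEXT, dept_id INTEGER, salary REAL)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [("Liu Bei", 1, 9000), ("Guan Yu", 1, 8000),
                  ("Zhuge Liang", 2, 12000), ("Zhao Yun", 2, 6000)])

# Row-level filter: executed before GROUP BY, so fewer rows are aggregated.
rows = conn.execute("""SELECT dept_id, AVG(salary) FROM employee
                       WHERE salary >= 8000 GROUP BY dept_id""").fetchall()

# Group-level filter: executed after GROUP BY, on the aggregate results.
groups = conn.execute("""SELECT dept_id, AVG(salary) FROM employee
                         GROUP BY dept_id HAVING AVG(salary) > 8500""").fetchall()

print(rows)    # [(1, 8500.0), (2, 12000.0)]
print(groups)  # [(2, 9000.0)]
```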
In addition, understanding the logical execution order of SQL can also help us avoid some common errors, such as the following statements:
-- Error example
SELECT emp_name AS empname
  FROM employee
 WHERE empname = 'Zhang Fei';
The error in this statement is that the WHERE condition references a column alias; as the logical order above shows, when the WHERE condition is evaluated, the SELECT clause has not yet been executed, so the alias does not exist yet.
Another action to note is GROUP BY, for example:
-- GROUP BY error example
SELECT dept_id, emp_name, AVG(salary)
  FROM employee
 GROUP BY dept_id;
After GROUP BY processing, the result set retains only the grouping fields and the aggregate results, so the emp_name field in the example no longer exists; in business terms, it is also meaningless to display an individual employee's name after aggregating by department. If you need to show both employee details and the department summary, use a window function instead.
There are also logic issues that do not cause an error but silently return incorrect results; for example, the placement of ON and WHERE conditions in an outer join query. The following is an example of a left outer join:
SELECT e.emp_name, d.dept_name
  FROM employee e
  LEFT JOIN department d ON (e.dept_id = d.dept_id)
 WHERE e.emp_name = 'Zhang Fei';

emp_name  | dept_name
----------|--------------------------
Zhang Fei | Administration Department

SELECT e.emp_name, d.dept_name
  FROM employee e
  LEFT JOIN department d ON (e.dept_id = d.dept_id AND e.emp_name = 'Zhang Fei');

emp_name    | dept_name
------------|--------------------------
Liu Bei     | [NULL]
Guan Yu     | [NULL]
Zhang Fei   | Administration Department
Zhuge Liang | [NULL]
The first query specifies the condition of the join in the ON clause and finds out the information about "Zhang Fei" through the WHERE clause.
The second query puts all the filter criteria in the ON clause, and the result returns every employee. This is because a left outer join returns all rows of the left table; a condition on the employee name in the ON clause only controls matching against the right table, while a WHERE condition logically filters the result after the join operation.
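This behavior is easy to verify. The sketch below reproduces it on SQLite with invented data modeled on the article's example: the WHERE form returns one row, while moving the same condition into ON keeps every left-table row.

```python
# LEFT JOIN: a filter in WHERE is applied after the join; the same filter
# in ON only controls matching and never drops left-table rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT)")
conn.execute("CREATE TABLE employee (emp_name TEXT, dept_id INTEGER)")
conn.execute("INSERT INTO department VALUES (1, 'Administration Department')")
conn.executemany("INSERT INTO employee VALUES (?, ?)",
                 [("Liu Bei", 2), ("Guan Yu", 2), ("Zhang Fei", 1), ("Zhuge Liang", 3)])

# Condition in WHERE: filters the joined result down to one employee.
where_rows = conn.execute("""SELECT e.emp_name, d.dept_name FROM employee e
    LEFT JOIN department d ON e.dept_id = d.dept_id
    WHERE e.emp_name = 'Zhang Fei'""").fetchall()

# Same condition in ON: all four employees come back, most with NULL dept.
on_rows = conn.execute("""SELECT e.emp_name, d.dept_name FROM employee e
    LEFT JOIN department d ON e.dept_id = d.dept_id AND e.emp_name = 'Zhang Fei'""").fetchall()

print(len(where_rows), len(on_rows))  # 1 4
```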
Summary
SQL optimization is essentially about understanding how the optimizer works, creating appropriate indexes and writing statements the optimizer can handle well, and stepping in manually when the optimizer is not smart enough.