Database SQL optimization

2025-07-03 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

One: optimization overview

A: Some data suggest that the longest wait users will tolerate is about 8 seconds. There are many database optimization strategies, but establishing a good data structure at design time is what matters most for later performance work. The database structure is the cornerstone of the system: if the foundation is poor, no combination of optimization strategies will produce a really good result.

B: Database optimization has several aspects.

Of these, data structure, SQL, and indexes are the cheapest and most effective places to optimize.

C: Performance optimization is never finished. Once performance meets the requirements, do not over-optimize.

Two: directions of optimization

1. SQL and index optimization

First, write well-structured SQL that fits the requirements, then create effective indexes on the tables based on that SQL. Too many indexes, however, slow down writes and can hurt queries as well.

2. Reasonable database design

Design the table structure according to the three normal forms of database design, and consider how the design can make queries more efficient.

The three normal forms:

First normal form: every field in a table must be the smallest unit that cannot be split further, which guarantees the atomicity of each column.

Second normal form: on top of first normal form, every row must be uniquely identifiable and every non-key column must depend on the primary key.

Third normal form: on top of second normal form, every column must relate directly to the primary key rather than indirectly (a foreign key counts as a direct relation), and fields must not be redundant.

Note: there is no best design, only the most suitable one, so do not get too attached to theory. Treat the three normal forms as a baseline rather than something to apply mechanically.

Sometimes it is reasonable to denormalize for a given scenario:

A: Split tables.

B: Keep redundant fields. When two or more tables are frequently joined in queries, adding a few redundant fields to one of them avoids joining too often. This is generally used when the data in the redundant columns rarely changes.

C: Add derived columns. A derived column is computed from other columns in the table. Adding derived columns reduces aggregate computation and can greatly shorten the time needed to summarize data.
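As a rough illustration of a derived column, the sketch below keeps a running total on an order row so that reads do not have to aggregate the item rows. It uses Python's built-in sqlite3 module only so that it runs anywhere; the tables `orders` and `order_items`, the column `total_amount`, and the helper `add_item` are all hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        total_amount INTEGER NOT NULL DEFAULT 0  -- derived column (hypothetical)
    );
    CREATE TABLE order_items (
        order_id INTEGER NOT NULL,
        price INTEGER NOT NULL,
        quantity INTEGER NOT NULL
    );
""")


def add_item(order_id, price, quantity):
    """Insert an item and update the derived total in the same transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO order_items (order_id, price, quantity) VALUES (?, ?, ?)",
            (order_id, price, quantity),
        )
        conn.execute(
            "UPDATE orders SET total_amount = total_amount + ? WHERE id = ?",
            (price * quantity, order_id),
        )


conn.execute("INSERT INTO orders (id) VALUES (1)")
add_item(1, 250, 2)
add_item(1, 100, 3)

# Reading the derived column avoids the aggregate over order_items.
derived_total = conn.execute(
    "SELECT total_amount FROM orders WHERE id = 1"
).fetchone()[0]
recomputed_total = conn.execute(
    "SELECT SUM(price * quantity) FROM order_items WHERE order_id = 1"
).fetchone()[0]
```

The trade-off is that every write path must keep the derived value in step with its source columns, which is why the sketch updates both inside one transaction.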

Five database constraints:

A: PRIMARY KEY: sets the primary key constraint.

B: UNIQUE: sets a uniqueness constraint; duplicate values are not allowed.

C: DEFAULT: sets a default value constraint.

D: NOT NULL: sets a non-null constraint; the field cannot be empty.

E: FOREIGN KEY: sets a foreign key constraint.

Field type selection:

A: Where the value range allows, use TINYINT, SMALLINT, or MEDIUMINT as the integer type instead of INT, and add UNSIGNED if values are non-negative

B: Give VARCHAR only the length that is really needed

C: use enumerations or integers instead of string types

D: try to use TIMESTAMP instead of DATETIME

E: do not have too many fields in a single table. It is recommended that it be less than 20.

F: avoid using NULL fields, which are difficult to query and optimize and take up extra index space

3. System configuration optimization

For example, the my.cnf file of a MySQL database.

4. Hardware optimization

Faster I/O and more memory. In general, the more memory, the better for database operations. More CPUs do not necessarily help, because a database does not use much CPU and many queries run on a single CPU. Faster I/O (SSD, RAID) also helps, but it does not reduce the database's locking. If slow queries are caused by locks inside the database, hardware optimization will not help.

Three: optimization scheme

Code optimization

Code comes first because it is what technicians most easily overlook. Given a performance requirement, many technicians immediately reach for caching, asynchrony, JVM tuning, and so on. The first step should instead be to analyze the relevant code, find the bottleneck, and only then decide on a specific optimization strategy. Some performance problems come entirely from poorly written code and can be solved by changing the code directly: too many for loops, many unnecessary conditional checks, the same logic repeated several times, and so on.

For example:

An update operation first queries the entity and then executes the update, which adds an extra database round trip. Another problem is that the update statement may touch fields that do not need updating.

Instead, we can assign the attributes submitted in the form, along with updateTime, updateUser, and so on, to the entity, and update only those fields directly through updateByPrimaryKeySelective.
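The idea of a selective update can be sketched outside any particular ORM. The snippet below is a minimal illustration, not the mapper method itself: it builds an UPDATE that sets only the fields that were actually supplied, without reading the row first. The table `user_account`, its columns, and the helper `build_selective_update` are hypothetical, and sqlite3 stands in for the real database.

```python
import sqlite3

# Columns the caller may update (hypothetical whitelist; it also keeps
# column names out of user control, since they are interpolated into SQL).
ALLOWED_COLUMNS = ("name", "email", "update_time", "update_user")


def build_selective_update(table, key_column, key_value, changes):
    """Return (sql, params) that update only the non-None fields in changes."""
    columns = [c for c in changes if c in ALLOWED_COLUMNS and changes[c] is not None]
    if not columns:
        raise ValueError("nothing to update")
    assignments = ", ".join(f"{c} = ?" for c in columns)
    sql = f"UPDATE {table} SET {assignments} WHERE {key_column} = ?"
    params = [changes[c] for c in columns] + [key_value]
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_account ("
    "id INTEGER PRIMARY KEY, name TEXT, email TEXT, "
    "update_time TEXT, update_user TEXT)"
)
conn.execute(
    "INSERT INTO user_account (id, name, email) VALUES (1, 'alice', 'old@example.com')"
)

# Only email and update_user were submitted; name is left untouched.
form = {"email": "new@example.com", "name": None, "update_user": "admin"}
sql, params = build_selective_update("user_account", "id", 1, form)
conn.execute(sql, params)  # one round trip, no prior SELECT

row = conn.execute(
    "SELECT name, email, update_user FROM user_account WHERE id = 1"
).fetchone()
```

Here `sql` contains assignments for `email` and `update_user` only, so the existing `name` value is neither read nor rewritten.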

Locate slow SQL and optimize

This is the most commonly used approach, and every technician should master the basics of SQL tuning (methods, tools, auxiliary systems, and so on). Take MySQL as an example. The usual way is to locate the problem SQL from the built-in slow query log or an open source slow query system, tune it step by step with explain, profile, and similar tools, and release it after testing confirms the improvement.
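To give a feel for what reading a plan looks like, here is a small sketch that compares two queries. It uses SQLite's `EXPLAIN QUERY PLAN` through Python's sqlite3 module purely as a stand-in; MySQL's explain output has a different format, and the table `t`, the index `idx_t_createdate`, and the helper `plan` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT, createdate TEXT);
    CREATE INDEX idx_t_createdate ON t (createdate);
""")


def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Wrapping the indexed column in a function hides it from the index lookup.
function_plan = plan(
    "SELECT id FROM t WHERE date(createdate) = '2005-11-30'"
)

# A plain range on the column lets the engine seek into the index.
range_plan = plan(
    "SELECT id FROM t WHERE createdate >= '2005-11-30' "
    "AND createdate < '2005-12-01'"
)

print(function_plan)
print(range_plan)
```

In SQLite the first plan is reported as a scan and the second as a search using the index; the exact wording varies between SQLite versions, so the sketch prints the plans rather than assuming a fixed string.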

SQL Server execution plan:

What the execution plan tells us:

A: which steps are more expensive

B: which steps produce a large amount of data; the amount is shown by the thickness of the connecting lines, which is very intuitive

C: what operation each step performs

Specific optimization techniques:

A: Use SQL Server's built-in functions on columns as little as possible, or not at all

Select id from t where substring(name, 1, 3) = 'abc'

Select id from t where datediff(day, createdate, '2005-11-30') = 0

These can be rewritten as:

Select id from t where name like 'abc%'

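Select id from t where createdate >= '2005-11-30' and createdate < '2005-12-1'

B: For a run of consecutive values, use BETWEEN rather than IN:

SELECT id FROM t WHERE num BETWEEN 1 AND 5

C: In an Update statement, if only one or two fields change, do not update every field; otherwise frequent calls cause noticeable performance overhead

D: Use numeric fields where possible; a field that holds only numeric information should not be designed as a character type

E: Avoid select * from t. Replace "*" with a specific field list and do not return any field that is not used. Avoid returning large volumes of data to the client; if the volume is too large, consider whether the requirement itself is reasonable

F: Relating tables through a redundant field performs better than using JOIN directly

G: select count(*) from table; a count with no condition like this causes a full table scan

Connection pool tuning

To obtain database connections efficiently and to throttle them, our applications usually use a connection pool: each application node manages a pool of connections to each database. As traffic or data volume grows, the original pool parameters may no longer meet the need. At that point, weigh together how the pool in use works, the pool's monitoring data, and the current business volume, and arrive at the final parameters through several rounds of testing.

Use indexes sensibly

Indexes are generally efficient. But an index trades space for time: while it speeds up queries, it slows down inserts, updates, and deletes, so tables that are written frequently are poor candidates for indexes.

Choose index columns carefully: index the columns that appear in where, group by, order by, and on clauses. There is no need to index columns with low cardinality.

The primary key is already an index, so a PRIMARY KEY column does not need an additional UNIQUE index.

Index types

Primary key index (PRIMARY KEY)

Unique index (UNIQUE)

Ordinary index (INDEX)

Composite index (INDEX)

Full-text index (FULLTEXT)

Operators that can use an index

Greater than or equal to, BETWEEN, IN, LIKE that does not start with %

Operators that cannot use an index

NOT IN, LIKE that starts with a wildcard (%, _)

How to choose index fields

A: The field appears in query conditions, and the condition can use an index

B: Indexing and searching numbers is usually more efficient than indexing and searching strings

C: The statement runs frequently, several thousand times a day or more

D: The condition on the field filters down to a small set of records

Cases where the index is not used

A: Avoid null checks on a field in the where clause; otherwise the engine abandons the index and does a full table scan

B: Avoid the != or <> operators in the where clause; otherwise the engine abandons the index and does a full table scan

C: Avoid joining conditions with or in the where clause. If one field is indexed and the other is not, the engine abandons the index and does a full table scan

select id from t where num = 10 or Name = 'admin'

This can be rewritten as:

select id from t where num = 10 union select id from t where Name = 'admin'

union all returns all rows, duplicates included; union removes duplicate rows automatically.

D: Do not compute on columns, as in where age + 1 = 10. Any operation on a column leads to a table scan, including database functions and computed expressions

E: A like query of the form '%aaa' does not use the index

Table splitting

Splitting methods

Horizontal splitting (by row) and vertical splitting (by column)

When to split

A: From experience, once a MySQL table reaches millions of rows, query efficiency becomes very low.

B: Some fields of a table hold large values and are rarely used. These fields can be moved into a separate table linked by a foreign key. With exam results, for example, we usually care about the score and not the exam details.

Horizontal splitting strategies

By time: when data is strongly time-sensitive, such as microblog posts, split by month.

By range: for example, in a user table, IDs 1 to 1,000,000 go in one table and 1,000,000 to 2,000,000 in another.

By hash: compute the name of the table that stores the data by applying a hash algorithm to the original target id or name.

Read/write splitting

When one server cannot meet demand, build a cluster using read/write splitting (writes: update/delete/add). The maximum number of connections one database supports is limited; if many users access it concurrently and one server cannot cope, use a cluster. The most common MySQL clustering technique is read/write splitting.

Master-slave synchronization: the database eventually persists data to disk, and the cluster must ensure that the data on every database server is consistent. Reads go to the slaves, writes go to the master, and the slaves synchronize data from the master.

Read/write splitting: implemented with load balancing. All writes go to the master and reads go to the slave servers.

Caching

Cache categories

Local cache: HashMap/ConcurrentHashMap, Ehcache, Guava Cache, and so on

Cache services: Redis/Tair/Memcache, and so on

Usage scenarios

The same data is queried repeatedly within a short time and is updated infrequently. In this case, query the cache first and, if the data is not there, load it from the database and write it back to the cache. This scenario suits a single-machine cache.

Hot data is queried at high concurrency and the backend database cannot bear the load. A cache can absorb it.

What caching does

It reduces pressure on the database and reduces access time.

Choosing a cache

If the data volume is small and does not grow and get cleared frequently (which would cause frequent garbage collection), a local cache is an option. Specifically, if some policy support is needed (such as eviction when the cache is full), consider Ehcache; if not, consider HashMap; if multithreaded concurrency has to be handled, consider ConcurrentHashMap.

In other cases, consider a cache service. Considering resource investment, operability, dynamic scaling, and supporting facilities, we currently prefer Tair. We consider Redis only where Tair cannot yet support the case (such as distributed locks or Hash-type values).

Cache penetration

A typical cache system looks up data by key; if no value exists for the key, it goes to the backend system (such as the DB). If the value for a key is certain not to exist and that key receives a large number of concurrent requests, the backend system comes under heavy pressure. This is called cache penetration.

Cache empty query results as well, with a short expiration time, or clear the cache entry after data for that key is inserted.

Cache concurrency

When a site has high concurrent traffic and a cache entry expires, multiple processes may query the DB and set the cache at the same time. If concurrency really is high, this can overload the DB and also cause the cache to be updated too frequently.

Lock the cache lookup. If the key does not exist, take the lock, query the DB, populate the cache, and release the lock. Other processes that find the lock wait, and after it is released either return the data or go on to query the DB.

Cache avalanche (expiry)

When the cache server restarts, or a large number of cache entries expire within the same period, the backend system (such as the DB) comes under heavy pressure at that moment.

Give different keys different expiration times, so that expiry is spread as evenly as possible.

Preventing the cache from running out of space

① Choose a suitable eviction algorithm for the cache service, such as the most common one, LRU.

② Set an appropriate warning threshold for the configured capacity. For a 10G cache, for example, start alerting when cached data reaches 8G, so that problems can be investigated or capacity expanded in advance.

③ Set expiration times for keys that do not need to be kept long-term.

A cache layer is added between the WebServer (DAO layer) and the DB. The medium chosen for this layer is usually memory: data stored in a database is persistent, so reads and writes involve disk I/O, and memory reads and writes are far faster than disk. (Choice of storage medium to improve access speed: memory >> disk; fewer disk I/O operations, fewer repeated queries, higher throughput.)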

Commonly used open source caching tools are: ehcache, memcache, redis.

Ehcache is a pure-Java in-process caching framework, used by hibernate for its second-level cache. Ehcache can also be clustered through multicast. I mainly use it as a local cache and as a cache layer on top of the database.

Memcache is a distributed cache system that provides simple key-value storage. It makes full use of multi-core CPUs and has no persistence. It can be used for session sharing and page object caching in a web cluster.

Redis is a high-performance key-value system that provides a rich set of data types, copes well with concurrency on a single CPU core, and supports persistence and master-slave replication. I mainly use Redis Sentinel, divided into several groups by business.

Considerations for Redis

A: Set an expiration time whenever possible when adding a key. Otherwise the memory used by the Redis server can grow to the limit of the system's physical memory, at which point Redis falls back on VM and system performance degrades.

B: Keep Redis keys as short as possible, and avoid complex objects as values.

C: Convert objects to JSON (using an off-the-shelf JSON library) before storing them in Redis.

D: Alternatively, convert objects to Google's open source binary protocol (Google Protobuf, a data format comparable to JSON; because it is binary, it performs better and takes less space than JSON, but its learning curve is much steeper).

E: Be sure to release the connection after using Redis.
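Points B and C can be sketched with the standard library alone. The snippet below shows one possible way to build a short key and a compact JSON value before handing them to a Redis client; the helper names (`redis_key`, `to_redis_value`, `from_redis_value`) and the `u:` prefix are hypothetical, and no Redis client call is shown.

```python
import json


def redis_key(prefix, entity_id):
    """Build a short, predictable key such as 'u:42'."""
    return f"{prefix}:{entity_id}"


def to_redis_value(obj):
    """Serialize a plain dict to compact JSON (no spaces after separators)."""
    return json.dumps(obj, separators=(",", ":"), ensure_ascii=False)


def from_redis_value(raw):
    """Restore the dict from the stored JSON string."""
    return json.loads(raw)


user = {"id": 42, "name": "alice", "roles": ["admin", "dev"]}

key = redis_key("u", user["id"])
value = to_redis_value(user)
restored = from_redis_value(value)
```

The key and value would then be written with whatever Redis client the project uses, ideally together with an expiration time as point A suggests.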

On a read, first check whether the cache holds the relevant data; if it does, return it directly. This is a cache hit.

If the cache does not hold the data, read it from the database, put it into the cache, and then return it. This is a cache miss.

Cache hit ratio = number of cache requests that hit / total cache requests = hit / (hit + miss)
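The hit/miss flow and the hit ratio can be sketched as a small in-process cache. This is a minimal illustration under stated assumptions, not a production design: the class `ReadThroughCache`, its parameters, and the `load_user` loader are hypothetical. It also folds in three mitigations from the caching section: empty results are cached briefly (penetration), expiry times get a random spread (avalanche), and loading happens under a lock (concurrency).

```python
import random
import threading
import time

_EMPTY = object()  # marker cached for keys that have no data in the database


class ReadThroughCache:
    """Illustrative in-process cache; every name here is hypothetical."""

    def __init__(self, loader, ttl=60.0, empty_ttl=5.0, jitter=0.2):
        self._loader = loader        # function that reads one key from the DB
        self._ttl = ttl              # base lifetime of a normal entry (seconds)
        self._empty_ttl = empty_ttl  # shorter lifetime for cached empty results
        self._jitter = jitter        # up to +20% random spread on each lifetime
        self._entries = {}           # key -> (expires_at, value)
        self._lock = threading.Lock()
        self.hits = 0
        self.misses = 0

    def _expires_at(self, base):
        # Spread expiry times so entries do not all expire together.
        return time.monotonic() + base * (1 + random.uniform(0, self._jitter))

    def get(self, key):
        # One coarse lock keeps the sketch simple: only one caller loads a
        # missing key. A real cache would lock per key.
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None and entry[0] > time.monotonic():
                self.hits += 1
                return None if entry[1] is _EMPTY else entry[1]
            self.misses += 1
            value = self._loader(key)
            if value is None:
                self._entries[key] = (self._expires_at(self._empty_ttl), _EMPTY)
            else:
                self._entries[key] = (self._expires_at(self._ttl), value)
            return value

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


fake_db = {"u1": "Alice"}
db_calls = []


def load_user(key):
    db_calls.append(key)  # record every trip to the "database"
    return fake_db.get(key)


cache = ReadThroughCache(load_user)
results = [cache.get(k) for k in ["u1", "u1", "missing", "missing"]]
```

In this run the loader is called once per distinct key, including the key that does not exist, and `cache.hit_ratio()` evaluates to 2 / (2 + 2) = 0.5.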


NoSQL

The difference between NoSQL and caching

To be clear, what is described here differs from the caching section. The same storage products may be used (such as Redis or Tair), but here they are used as a DB. When used as a DB, the availability and reliability of the storage scheme must be effectively guaranteed.

Usage scenarios

Look at the specific business scenario: is the data involved suitable for storing in NoSQL, do the operations on it suit NoSQL, and are any additional NoSQL features (such as atomic increment and decrement) needed?

If the business data does not need to be joined with other data, does not need transactions or foreign keys, and is likely to be written very frequently, NoSQL (such as HBase) is the better fit.

For example, Meituan Dianping has an internal monitoring system for exceptions. If an application suffers a serious failure, a large volume of exception data can be generated in a short time. Storing it in MySQL would cause a sudden surge in write pressure, which easily leads to sharply degraded MySQL server performance, master-slave replication lag, and similar problems. This scenario is better served by a NoSQL store such as HBase.

View / stored procedure

General business logic should avoid stored procedures as far as possible. Scheduled tasks or report statistics can be handled with stored procedures, depending on team resources.

JVM tuning

Monitor and alert on key machine indicators (GC time, GC count, memory size changes in each generation, machine load and CPU utilization, number of JVM threads, and so on) in a monitoring system; if no ready-made system exists, a simple reporting and monitoring system is easy to build. You can also look at the GC log and the output of commands such as jstat. Combined with the performance data and request experience of key interfaces of the online JVM service, this is basically enough to determine whether the current JVM has a problem and whether it needs tuning.

Asynchronous / multithreading

For some client requests, the server may need to do ancillary work that the user does not care about, or whose results the user does not need immediately. Such work is well suited to asynchronous processing.

Benefits of asynchrony

A: It shortens the response time of the interface, so the user's request returns quickly and the user experience is better.

B: It avoids long-running threads. Long-running threads leave the service thread pool short of available threads for long periods, so the task queue grows, more request tasks are blocked, and more requests cannot be handled in time.

C: Long-running threads can also trigger a chain of problems such as rising system load and CPU utilization, declining overall machine performance, and even an avalanche. Asynchronous processing can solve this effectively without adding machines or CPUs.

Asynchronous implementation

A: Use additional threads. Handle the task outside the I/O thread (which processes the request and response), either in a separate thread or in a thread pool, and let the I/O thread return the response first.

B: Use message queue (MQ) middleware.
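Option A can be sketched with a thread pool from Python's standard library. This is a minimal illustration, assuming the ancillary work is an audit record; `handle_request`, `record_audit`, and `background_pool` are hypothetical names, and the sleep only simulates slow work.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Pool for ancillary work, separate from the threads that serve requests.
background_pool = ThreadPoolExecutor(max_workers=4)
audit_log = []


def record_audit(user_id, action):
    """Slow ancillary work the user does not need to wait for."""
    time.sleep(0.05)  # simulate a slow write
    audit_log.append((user_id, action))


def handle_request(user_id):
    # Hand the ancillary work to the pool and build the response right away.
    future = background_pool.submit(record_audit, user_id, "login")
    response = {"status": "ok", "user_id": user_id}
    return response, future


response, pending = handle_request(7)

# Only this demo waits for the background task; a real handler would not.
pending.result(timeout=5)
background_pool.shutdown(wait=True)
```

A bounded pool is used rather than one new thread per request so that a burst of requests cannot create an unbounded number of threads. Option B moves the same hand-off out of the process and into a message queue.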
