This article shares how to resolve the avalanche effect caused by hot-spot updates. The content is quite practical, so hopefully you will take something away from it; let's walk through it together.
Part Ⅰ: Case study
The failure scenario is relatively simple: the business suddenly sees a large number of failed requests and is almost unavailable, while at the same time the corresponding MySQL database fires a large number of high-CPU-utilization alarms.
1. Log in to the database and inspect the running threads with show processlist (the screenshot of the live scene is not reproduced here).
2. The MySQL version is 5.7, and the database table structure is as follows:
CREATE TABLE `docid_generator` (
  `id` int(4) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2;
3. The business request Session is as follows:
connect
update docid_generator set id=last_insert_id(id+1);
select last_insert_id();
exit
Preliminary investigation shows that the table has only one column and a single row, and the business implements its id-allocation logic through MySQL's last_insert_id(expr) feature. Sorting processlist by execution time in descending order, the longest-running SQL statements are exactly these update requests, and SHOW ENGINE INNODB STATUS shows a large number of transactions waiting for the X lock on that record. The X lock taken by each update forces the requests to execute serially, which explains the slow responses, but what caused the first batch of update requests to get stuck in the first place?
perf analysis shows that the lock_deadlock_recursive function accounts for nearly 50% of the sampled CPU cycles. This function is called recursively (depth-first) to check whether a deadlock condition exists and, if one is found, to roll back the transaction with the lowest cost.
Looking at the innodb_trx transaction lock-wait information in information_schema, we found that there were already more than 6,100 lock-wait entries queued.
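For reference, here is a minimal sketch of how such a lock-wait backlog can be inspected from information_schema on MySQL 5.7, assuming pymysql and illustrative connection parameters (in 8.0 the innodb_lock_waits table was replaced by performance_schema views):

import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", password="***",
                       db="information_schema")
with conn.cursor() as cur:
    # Transactions currently blocked waiting for a row lock.
    cur.execute("SELECT COUNT(*) FROM innodb_trx WHERE trx_state = 'LOCK WAIT'")
    print("blocked transactions:", cur.fetchone()[0])
    # One row per (waiting transaction, blocking transaction) pair.
    cur.execute("SELECT COUNT(*) FROM innodb_lock_waits")
    print("lock-wait entries:", cur.fetchone()[0])
conn.close()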
Consulting the documentation, we found that the latest deadlock-check output of the InnoDB monitor contains "TOO DEEP OR LONG SEARCH IN THE LOCK TABLE WAITS-FOR GRAPH, WE WILL ROLL BACK FOLLOWING TRANSACTION", indicating that the waits-for list has reached its limit of 200 transactions. A wait chain of more than 200 transactions is treated as a deadlock, and the transaction that attempted to check the waits-for graph is rolled back. The same error can also occur if the checking thread has to inspect more than 1,000,000 locks held by transactions on the wait list.
Each request maintains its own lock queue. In this case the business concurrency is 200; since the X lock on the single record can only be taken serially, each session's lock queue builds up in turn, and in the extreme case the total length of the blocked lock queues on this record is (1 + 199) × 200 / 2 = 20,000. Walking queues of this size for deadlock detection is why this stage takes so long.
With that, it is easy to see why it takes so long. Because the business scenario is single-row id allocation with only one record, there is logically no possibility of a real deadlock, so deadlock detection can be turned off. Fortunately, from version 5.7.15 the innodb_deadlock_detect variable allows deadlock detection to be disabled. After turning it off we re-tested with 200 concurrent sessions: latency dropped from 10s to 0.2s, a 50-fold performance improvement.
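A minimal sketch of flipping that switch at runtime, assuming MySQL 5.7.15 or later and a pymysql connection with sufficient privileges (connection parameters are illustrative):

import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", password="***", db="test")
with conn.cursor() as cur:
    # innodb_deadlock_detect is a dynamic global variable, so no restart is required;
    # blocked transactions then rely on innodb_lock_wait_timeout instead.
    cur.execute("SET GLOBAL innodb_deadlock_detect = OFF")
    cur.execute("SHOW VARIABLES LIKE 'innodb_deadlock_detect'")
    print(cur.fetchone())   # ('innodb_deadlock_detect', 'OFF')
conn.close()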
Now that the analysis is complete, I believe you have a deeper understanding of this fault case. To keep the fault analysis coherent, we skipped over some database concepts along the way; let's pick a few and introduce them in more detail.
"deadlock" can be understood as a blocking phenomenon caused by competition for resources or communication between two or more threads during execution, and they will not be able to push forward without external force. At this point, it is said that the system is in a deadlock state or the system has a deadlock, and these processes that are always waiting for each other are called deadlock processes.
In a database, we can picture it like this (the original diagram is omitted here): transaction A is waiting for transaction B to release its lock on the row id=2, while transaction B is waiting for transaction A to release its lock on the row id=1.
This situation is a deadlock, and there are two strategies for handling it:
1) Simply wait until the lock wait times out; the timeout is set with the parameter innodb_lock_wait_timeout.
2) Run deadlock detection: once a deadlock is found, actively roll back one transaction in the deadlock chain so that the other transactions can proceed. Setting the parameter innodb_deadlock_detect to on enables this logic.
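To make the wait cycle concrete, here is a hedged two-session reproduction sketch, assuming pymysql, a test table t(id INT PRIMARY KEY, val INT) containing rows id=1 and id=2, and deadlock detection left at its default of ON; names and connection parameters are illustrative:

import threading
import time
import pymysql

def make_conn():
    return pymysql.connect(host="127.0.0.1", user="root", password="***",
                           db="test", autocommit=False)

def run(first_id, second_id, name):
    conn = make_conn()
    try:
        with conn.cursor() as cur:
            cur.execute("UPDATE t SET val = val + 1 WHERE id = %s", (first_id,))   # take the X lock
            time.sleep(1)                        # give the other session time to lock its own row
            cur.execute("UPDATE t SET val = val + 1 WHERE id = %s", (second_id,))  # now wait for the other lock
        conn.commit()
        print(name, "committed")
    except pymysql.MySQLError as exc:
        # With innodb_deadlock_detect=ON, one session receives ER_LOCK_DEADLOCK (1213)
        # and its transaction is rolled back so the other one can proceed.
        print(name, "error:", exc)
        conn.rollback()
    finally:
        conn.close()

a = threading.Thread(target=run, args=(1, 2, "session A"))
b = threading.Thread(target=run, args=(2, 1, "session B"))
a.start(); b.start(); a.join(); b.join()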
Note that it is setting innodb_deadlock_detect = OFF that disables MySQL's deadlock detection. On high-concurrency systems, deadlock detection can cause a slowdown when many threads wait for the same lock. In such cases it may be more efficient to disable deadlock detection and rely on the innodb_lock_wait_timeout setting to roll back transactions when a deadlock occurs.
MySQL turns deadlock detection on by default: InnoDB automatically detects transactions that have deadlocked and rolls back one or more of the transactions involved to break the deadlock. InnoDB picks a low-weight transaction to roll back, where the weight is determined by the number of rows the transaction has inserted, updated, or deleted.
If innodb_table_locks = 1 (the default) and autocommit = 0, InnoDB is aware of table locks and the MySQL layer above it knows about row-level locks. Otherwise, InnoDB cannot detect deadlocks that involve a table lock set by a MySQL LOCK TABLES statement or locks set by a storage engine other than InnoDB. Such situations have to be resolved by the innodb_lock_wait_timeout setting.
Part Ⅱ: Effective emergency degradation measures
If an e-commerce business hits a similar hot-spot update failure in a high-concurrency scenario such as a big promotion or flash sale, or an online-education business during registration and check-in, or a game business at launch, you will not have much time to calmly trace the root cause, and it is even harder to put together the most reasonable optimization plan in such a short window. At that moment the only requirement the users or the business side have for the database is to get the business running (recovered) first, by whatever means necessary.
For hot-spot update failures, the DBA's common emergency measures, namely restart, switchover, kill (whether via pt-kill or a home-grown kill script, it is clearly hard to keep up and may even worsen the congestion), and permission control (which is very likely to accidentally hit normal core business logic and cause business errors), mostly cannot restore the business. Even a lossy recovery is difficult to achieve without intervention from the business side.
DBbrain, Tencent Cloud's database intelligent manager, provides two features, "SQL throttling" and "hot-spot update protection", to prevent a user's database from being overwhelmed during hot-spot updates, helping users implement an effective downgrade and protection on the database side and keeping core business running normally.
1. SQL throttling
DBbrain provides "SQL current limit" function, which can help users achieve elegant temporary downgrade on the database side. By rejecting SQL before entering the database kernel, it can solve more scenarios in which kill cannot quickly recover from high concurrency failures. In addition to the "deadlock detection blocking scenarios caused by hot spot updates" described above, it is also applicable to:
A sharp spike in the concurrency of one type of SQL affects normal business, for example cache penetration or abnormal calls cause a statement that normally has low concurrency to surge suddenly.
SQL with data skew affects normal business, for example a query during a big promotion pulls an exceptionally large amount of data and makes the whole system busy.
SQL without a supporting index affects normal business, for example a newly launched statement is called very heavily but the needed index was never created, making the whole system busy.
Users can set the properties of the target SQL in the DBbrain console.
SQL type: select, update, delete, insert, replace
Maximum concurrency: SQL statements whose concurrent executions exceed the set threshold at the same moment are rejected.
Throttling duration: how long the rule stays in effect; once the time is up, the rule no longer applies.
SQL keywords: keyword matching is order-independent; each configured keyword is checked in turn to see whether it appears in the SQL text, so configuring several keywords means several matching passes.
Based on the keywords of the SQL samples, DBbrain automatically rejects matching requests to keep core business services running normally, and counts the number of SQL requests rejected while "SQL throttling" is enabled.
2. Hot-spot update protection
For flash-sale scenarios, DBbrain greatly optimizes the performance of update operations on a single row of data. When automatic hot-spot update detection is enabled, the system automatically detects whether a single-row hot-spot update exists (more than 32 subsequent transactions waiting for the row lock on the same data row). If so, the large number of concurrent updates is queued and executed in order, reducing the performance degradation caused by massive row-lock contention or by triggering large amounts of deadlock detection.
The "hot spot update protection" function provided by DBbrain supports two modes of automatic end and manual shutdown, and the automatic end time can be flexibly controlled.
3. Hot-spot update optimization suggestions
In the case above, on versions 5.7.15 and later you can improve performance by turning off deadlock detection, and you can also relieve the database load caused by massive hot-spot updates with the "SQL throttling" and "hot-spot update protection" features of Tencent Cloud's database intelligent manager DBbrain. The following sections share some suggestions from the perspective of the business implementation.
3.1) implementation based on MySQL
The table structure is as follows:
CREATE TABLE `id_allocate` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `business_tag` varchar(20) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `name` (`business_tag`)
) ENGINE=InnoDB AUTO_INCREMENT=2;
3.1.1) Similar to the case above, using MySQL's last_insert_id(expr) function:
Request logic:
connect
update id_allocate set id=last_insert_id(id+1) where business_tag='test1';
select last_insert_id();
exit
Note: on 5.7.15 and later, turn off deadlock detection via innodb_deadlock_detect.
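A minimal sketch of driving this allocation from application code, assuming pymysql and the id_allocate table above; the helper name next_id is illustrative:

import pymysql

def next_id(conn, business_tag):
    """Allocate one id using the connection-local LAST_INSERT_ID(expr) trick."""
    with conn.cursor() as cur:
        # LAST_INSERT_ID(id + 1) both writes the new value and remembers it for this
        # session, so concurrent sessions cannot read each other's value.
        cur.execute("UPDATE id_allocate SET id = LAST_INSERT_ID(id + 1) "
                    "WHERE business_tag = %s", (business_tag,))
        cur.execute("SELECT LAST_INSERT_ID()")
        (new_id,) = cur.fetchone()
    conn.commit()
    return new_id

conn = pymysql.connect(host="127.0.0.1", user="root", password="***", db="test")
print(next_id(conn, "test1"))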
3.1.2) Using MySQL's auto_increment column: drop the business_tag column, keep only the id column, and the request logic becomes:
connect
insert into id_allocate values (null);
select last_insert_id();
exit
Note: the amount of data keeps growing. Old rows can be deleted periodically during off-peak hours, or the table can be created as a partitioned table with historical partitions dropped regularly.
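A corresponding sketch for the auto_increment variant, again assuming pymysql and a stripped-down id_allocate table that keeps only the auto-increment id column; the purge helper and its retention value are illustrative:

import pymysql

def next_id(conn):
    with conn.cursor() as cur:
        cur.execute("INSERT INTO id_allocate VALUES (NULL)")  # auto_increment assigns the id
        new_id = cur.lastrowid                                # same value as SELECT LAST_INSERT_ID()
    conn.commit()
    return new_id

def purge_old_rows(conn, keep_last=10000):
    # Off-peak housekeeping so the table does not grow without bound; the derived
    # table works around MySQL's "can't reuse the target table in FROM" restriction.
    with conn.cursor() as cur:
        cur.execute(
            "DELETE FROM id_allocate WHERE id < "
            "(SELECT m FROM (SELECT MAX(id) AS m FROM id_allocate) x) - %s",
            (keep_last,))
    conn.commit()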
Relying on MySQL alone, the first method is easier to use. For high availability, a common idea is to configure different auto-increment step sizes and starting values on two MySQL instances, for example set auto-increment-increment=2 on both databases and auto-increment-offset to 1 and 2 respectively. Business requests then fetch from the two databases in turn, obtaining 1, 3, 5, 7 ... and 2, 4, 6, 8 .... This avoids the impact of a single MySQL failure, but strict monotonic increase degrades into an overall increasing trend (after a single-instance failure, a newly issued id may be smaller than an earlier one).
3.2) implementation based on Redis
Using Redis's incr and incrby commands, much higher QPS can be supported. Likewise, if high availability is a concern, two keys can be stored on two Redis instances, with correctness guaranteed by controlling the initial values and the incrby offsets. The obvious drawback is that Redis persistence is weaker and data may be lost; however, Tencent Cloud Redis supports master/slave replication and dual-data-center disaster recovery and backup, so it can also be used when the project schedule is tight and performance requirements are high.
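A rough sketch of this idea with redis-py, using two instances that interleave odd and even ids; the hosts, the key name docid, and the initial values are all illustrative:

import itertools
import redis

nodes = [redis.Redis(host="10.0.0.1", port=6379),
         redis.Redis(host="10.0.0.2", port=6379)]

# Seed the counters so that, with a step of 2, node 0 produces 1, 3, 5, ...
# and node 1 produces 2, 4, 6, ...; nx=True leaves existing counters untouched.
for offset, node in enumerate(nodes):
    node.set("docid", offset - 1, nx=True)

picker = itertools.cycle(nodes)

def next_id():
    # INCRBY is atomic on each instance; round-robin spreads the load. In a real
    # deployment a failed node would be skipped, and the surviving node simply
    # keeps producing its own parity of the sequence.
    return next(picker).incrby("docid", 2)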
3.3) Service-oriented implementation
Table structure:
CREATE TABLE `id_allocate` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `business_tag` varchar(20) NOT NULL,
  `max_id` bigint NOT NULL,
  `step` int NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `name` (`business_tag`)
) ENGINE=InnoDB AUTO_INCREMENT=2;
business_tag identifies the business line;
max_id records the largest id allocated so far;
step is the size of the id range fetched each time idallocate-server accesses the database.
Implementation idea: third parties obtain ids by calling the idallocate-server service. idallocate-server keeps at least three values in memory: the current value mid, an upper bound id1, and a second upper bound id2, where id2 is exactly one step larger than id1. At startup, idallocate-server updates the database twice to obtain the initial mid, id1, and id2:
begin;
select max_id from id_allocate where business_tag='test1' for update;   # get mid
update id_allocate set max_id=max_id+step where business_tag='test1';
select max_id from id_allocate where business_tag='test1';              # get id1
commit;

begin;
update id_allocate set max_id=max_id+step where business_tag='test1';
select max_id from id_allocate where business_tag='test1';              # get id2
commit;
As third parties request ids from idallocate-server, mid increases. Once mid reaches 90% of id1, the server checks whether id2 has already been fetched and, if not, goes to the database to fetch it; if it exists, then once mid reaches id1 the server starts handing out the id2 segment. When mid reaches 90% of id2, it checks whether the next id1 segment exists, and so on in a loop, guaranteeing that at least one step-sized buffer segment is always available in idallocate-server's memory (a simplified code sketch is given at the end of this section).
In the above scheme:
1. Availability: idallocate-server can be scaled out to avoid a single point of failure; the MySQL layer can be protected by a semi-synchronous or strongly consistent master/slave cluster, so a MySQL failure will not affect the service for a short period of time.
2. Performance: compared with allocating ids purely in MySQL, database updates are reduced to 1/step of the original (the database is updated only once per step ids), lowering the pressure on the database; at the same time, the double-segment design of id1 and id2 avoids the latency spike that occurs when a single segment is exhausted and the server has to wait for a synchronous database round trip before it can continue.
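A simplified, single-threaded sketch of this double-buffer ("segment") idea, assuming pymysql and the id_allocate table from this section; class and method names are illustrative, and a real idallocate-server would add locking around next_id and fetch standby segments asynchronously:

import pymysql

class SegmentAllocator:
    def __init__(self, conn, business_tag, refill_ratio=0.9):
        self.conn = conn
        self.tag = business_tag
        self.refill_ratio = refill_ratio        # prefetch once ~90% of a segment is used
        self.current = self._fetch_segment()
        self.standby = None                     # the buffered next segment

    def _fetch_segment(self):
        # Reserve [max_id + 1, max_id + step] in one short transaction.
        self.conn.begin()
        with self.conn.cursor() as cur:
            cur.execute("SELECT max_id, step FROM id_allocate "
                        "WHERE business_tag = %s FOR UPDATE", (self.tag,))
            max_id, step = cur.fetchone()
            cur.execute("UPDATE id_allocate SET max_id = max_id + step "
                        "WHERE business_tag = %s", (self.tag,))
        self.conn.commit()
        return {"start": max_id + 1, "next": max_id + 1, "end": max_id + step}

    def next_id(self):
        if self.current["next"] > self.current["end"]:
            # Current segment exhausted: switch to the buffered one (or fetch on demand).
            self.current = self.standby or self._fetch_segment()
            self.standby = None
        seg = self.current
        new_id = seg["next"]
        seg["next"] += 1
        used = seg["next"] - seg["start"]
        size = seg["end"] - seg["start"] + 1
        if self.standby is None and used >= size * self.refill_ratio:
            self.standby = self._fetch_segment()   # keep one step-sized buffer ready
        return new_id

conn = pymysql.connect(host="127.0.0.1", user="root", password="***", db="test")
alloc = SegmentAllocator(conn, "test1")
print([alloc.next_id() for _ in range(5)])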
The above is how to solve the avalanche effect caused by hot-spot updates. Some of these points are things you may well encounter or use in day-to-day work, and hopefully this article helps you get more out of them.