
Troubleshooting weird broken connections in the Druid connection pool

2025-04-05 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

Background

Recently we have been working through a data-center migration. Together with the DBAs, we adjusted the production line's database connection scheme: addresses that used to point directly at the read databases were switched to a unified read load balancer, haproxy, to make it easier to relocate cabinets and servers.

After the switch, "discard connection" errors started to appear online intermittently, causing the application to return HTTP 500 errors, though not on every request.

Development stack: Spring Boot + MyBatis + Druid + ShardingJDBC

Network architecture:

AppServer -> mysql (master)              writes

AppServer -> haproxy -> mysql (slave) x n    reads

Our first reaction was that the problem must have been caused by the changed database address. We assumed the keep-alive policy in the Druid connection pool was simply not working and that a configuration change would fix it. Instead, the problem took us several days to track down, and we walked straight into a once-in-a-blue-moon pit.

The problem was extremely frustrating, but we eventually located it and fixed it cleanly. Let's walk through the face-slapping process one step at a time.

Symptom

Let's start with the symptoms. The following errors appeared frequently:

discard connection
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

The last packet successfully received from the server was 72,557 milliseconds ago. The last packet sent successfully to the server was 0 milliseconds ago.

From the error log, the initial judgment was that the connection to the db had been broken, and that trying to use a broken connection caused the error. But as far as we knew, Druid has a connection validation feature, so it should never hand out an invalid connection. We set off with this clue.

Investigation

To pin down the exact lifetime of a db connection, we first confirmed haproxy's behavior: the idle timeout for a forwarded db tcp connection is 1 minute, and any connection inactive for more than 1 minute is closed. In other words, our long-lived connections to the db are torn down within 1 minute of going idle. Leaving aside whether that setting suits every high-concurrency scenario, the Druid pool's connection validation should still prevent us from ever getting an invalid connection, so we made configuration adjustments.
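A quick sanity check on the numbers, as a minimal sketch (assuming the 1-minute haproxy idle timeout described above; this is not haproxy code, just our arithmetic): the "last packet received 72,557 ms ago" from the stack trace is well past the 60 s cutoff.

```java
// Sketch: models haproxy's idle-timeout decision under the assumed 1m setting,
// to check the numbers in the error message against the proxy's behavior.
public class HaproxyIdleTimeout {
    static final long IDLE_TIMEOUT_MILLIS = 60_000L; // assumed "timeout client/server 1m"

    // true if haproxy would already have closed a connection idle this long
    public static boolean wouldBeClosed(long idleMillis) {
        return idleMillis > IDLE_TIMEOUT_MILLIS;
    }
}
```

So a connection idle for 72,557 ms is consistent with haproxy having already torn it down before the application tried to use it.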

Let's look at the Druid configuration related to connection lifetime:

datasource.druid.validationQuery=SELECT 1
datasource.druid.validationQueryTimeout=2000
datasource.druid.testWhileIdle=true
datasource.druid.minEvictableIdleTimeMillis=100000
datasource.druid.timeBetweenEvictionRunsMillis=20000

The meaning of each item of the configuration is not explained here.

We enabled testWhileIdle, which validates a connection each time one is borrowed. Per timeBetweenEvictionRunsMillis, Druid only validates when the connection's idle time exceeds that value, so one possible failure scenario is borrowing a connection that is just about to expire. Following this clue, we set the value to 20000 ms, i.e. any connection idle for more than 20 s is checked on borrow. The check is supposed to run the sql statement configured in validationQuery, but we could not find any trace of SELECT 1 being executed.
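The borrow-time decision described above can be sketched as follows (a simplified model for illustration, not Druid's actual class):

```java
// Simplified model of Druid's testWhileIdle logic: a borrowed connection is
// validated only when its idle time reaches timeBetweenEvictionRunsMillis.
public class TestWhileIdleSketch {
    public static boolean shouldValidate(long lastActiveTimeMillis,
                                         long nowMillis,
                                         long timeBetweenEvictionRunsMillis) {
        long idleMillis = nowMillis - lastActiveTimeMillis;
        return idleMillis >= timeBetweenEvictionRunsMillis;
    }
}
```

With the 20000 ms setting, a connection idle for 25 s is validated on borrow, while one idle for only 5 s is handed out unchecked.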

Why can't we find SELECT 1?

First we needed to figure out why validationQuery did not take effect, so we started debugging the Druid source code with that question in mind.

if (isTestWhileIdle()) {
    final long currentTimeMillis = System.currentTimeMillis();
    final long lastActiveTimeMillis = poolableConnection.getConnectionHolder().getLastActiveTimeMillis();
    final long idleMillis = currentTimeMillis - lastActiveTimeMillis;
    long timeBetweenEvictionRunsMillis = this.getTimeBetweenEvictionRunsMillis();
    if (idleMillis >= timeBetweenEvictionRunsMillis) {
        boolean validate = testConnectionInternal(poolableConnection.getConnection());
        if (!validate) {
            if (LOG.isDebugEnabled()) {
                LOG.debug("skip not validate connection.");
            }
            discardConnection(realConnection);
            continue;
        }
    }
}

Once the idle time exceeds timeBetweenEvictionRunsMillis, a testConnectionInternal check is initiated. Let's follow it further.

protected boolean testConnectionInternal(DruidConnectionHolder holder, Connection conn) {
    boolean valid = validConnectionChecker.isValidConnection(conn, validationQuery, validationQueryTimeout);

Internally, the validConnectionChecker object performs the actual check.

public boolean isValidConnection(Connection conn, String validateQuery, int validationQueryTimeout) throws Exception {
    if (conn.isClosed()) {
        return false;
    }
    if (usePingMethod) {
        if (conn instanceof DruidPooledConnection) {
            conn = ((DruidPooledConnection) conn).getConnection();
        }
        if (conn instanceof ConnectionProxy) {
            conn = ((ConnectionProxy) conn).getRawObject();
        }
        if (clazz.isAssignableFrom(conn.getClass())) {
            if (validationQueryTimeout < 0) {
                validationQueryTimeout = DEFAULT_VALIDATION_QUERY_TIMEOUT;
            }
            try {
                ping.invoke(conn, true, validationQueryTimeout * 1000);
            } catch (InvocationTargetException e) {
                Throwable cause = e.getCause();
                if (cause instanceof SQLException) {
                    throw (SQLException) cause;
                }
                throw e;
            }
            return true;
        }
    }
    String query = validateQuery;
    if (validateQuery == null || validateQuery.isEmpty()) {
        query = DEFAULT_VALIDATION_QUERY;
    }
    Statement stmt = null;
    ResultSet rs = null;
    try {
        stmt = conn.createStatement();
        if (validationQueryTimeout > 0) {
            stmt.setQueryTimeout(validationQueryTimeout);
        }
        rs = stmt.executeQuery(query);
        return true;
    } finally {
        JdbcUtils.close(rs);
        JdbcUtils.close(stmt);
    }
}

Debugging revealed that Druid uses mysql's ping mechanism, not the validationQuery, to check connection validity by default.

Druid uses the mysql ping protocol check by default

The next question was whether the mysql ping protocol might fail to reset the server's session idle time. With that doubt in mind, we watched the __information_schema.processlist__ process list to see whether the session time gets refreshed. Debugging showed that it does, so there was no problem here, and this clue went dead.

Does haproxy actively close upstream and downstream connections on timeout?

We then changed direction and began to suspect that some haproxy policy was invalidating the connections: does haproxy's round-robin forwarding of backend connections have a session persistence mode, and could a misconfiguration cause haproxy connections and mysql connections to be mixed up?

Admittedly this conjecture was a bit far-fetched, but when you run out of leads you have to be skeptical of everything.

To reproduce the production network path, I set up haproxy locally to understand its working principle and configuration. For convenience, I installed it with yum; the version was HA-Proxy 1.5.18. Whether because of my local environment or a bug in this version, the mode tcp health check we configured never took effect.

listen service 127.0.0.1:60020
    mode tcp
    balance roundrobin
    option tcplog
    server server1 192.168.36.66:3306 check inter 2000 rise 2 fall 3
    server server2 192.168.36.66:3306 check inter 2000 rise 2 fall 3

Because the health check kept failing, haproxy refused to forward my connections. After struggling for a long time, I gave up and manually installed an older version, HA-Proxy 1.4.14.

Complete configuration:

defaults
    mode tcp
    retries 3
    option redispatch
    option abortonclose
    maxconn 32000
    timeout connect 2s
    timeout client 5m
    timeout server 5m

listen test1
    bind 0.0.0.0:60000
    mode tcp
    balance roundrobin
    server s1 192.168.36.66:3306 weight 1 maxconn 10000 check inter 10s
    server s2 192.168.36.66:3306 weight 1 maxconn 10000 check inter 10s
    server s3 192.168.36.66:3306 weight 1 maxconn 10000 check inter 10s

Version 1.4 ran the health check successfully.

I debugged against this haproxy and found no problem. I also looked at how haproxy forwards connections: internally it maintains the relationship between the two sides through a session, so for a long-lived tcp connection there should be no issue. haproxy does have session persistence modes in http mode, but in tcp mode the two connections are simply bound together; once the timeout fires, it actively closes both the client-side and the mysql-side connections, and there is no mix-up. At this point the clue went cold again.

Logging checks with a custom ValidConnectionChecker

With nothing else left, we resorted to instrumenting Druid's validation with our own logging, recording the connectionId so we could check whether the interval between the last validation and the error matched up.

public class MySqlValidConnectionCheckerDebug extends MySqlValidConnectionChecker {
    @Override
    public boolean isValidConnection(Connection conn, String validateQuery, int validationQueryTimeout) {
        Long connId = 0L;
        try {
            Field connField = ConnectionImpl.class.getDeclaredField("connectionId");
            connField.setAccessible(true);
            connId = (Long) connField.get(((ConnectionProxyImpl) conn).getConnectionRaw());
        } catch (Exception e) {
            log.error("valid connection error", e);
        } finally {
            log.info("valid connection ok. conn:" + connId);
        }
        return true;
    }
}

The connectionId can only be obtained via reflection. Under local debugging there was no problem and the connectionId was read normally, but the code threw an error once deployed to the verification environment. Strangely, the development environment's configuration differed from verification and production: development had no read/write separation.
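For reference, the reflection trick above boils down to the following self-contained sketch (FakeConnectionImpl is a stand-in class; the real code targets ConnectionImpl's connectionId field):

```java
import java.lang.reflect.Field;

// Reading a private field via reflection. It fails (returns -1 here) when the
// runtime object is not the expected class, which is exactly what happened
// once the connection turned out to be a ReplicationConnection.
public class ConnectionIdReflection {
    static class FakeConnectionImpl {
        private final long connectionId = 42L; // hypothetical id
    }

    public static long readConnectionId(Object conn) {
        try {
            Field f = FakeConnectionImpl.class.getDeclaredField("connectionId");
            f.setAccessible(true);
            return (long) f.get(conn); // throws if conn is not a FakeConnectionImpl
        } catch (Exception e) {
            return -1L;
        }
    }
}
```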

Both the verification and production environments use mysql's replication connection mechanism, and that is what made the reflection code throw.

datasource.druid.url=jdbc:mysql:replication

Debugging showed that the connection Druid held was no longer a JDBC4Connection but a ReplicationConnection, which internally wraps two connections, a masterConnection and a slaveConnection. Something smelled off.

Further debugging showed that Druid's check still runs normally. Inside, ReplicationConnection keeps a currentConnection that switches between masterConnection and slaveConnection based on the readOnly flag.

Since Druid is unaware of the upper-layer parameters during the check, readOnly is never set, so the check pings the masterConnection. But when the application uses the connection, Spring's TransactionManager propagates readOnly down to ShardingJDBC, which sets it on the ReplicationConnection, so the connection actually used is the slaveConnection.
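The mismatch can be modeled in a few lines (hypothetical stand-in types, not the Connector/J classes): the checker sees the connection with readOnly unset and therefore pings the master, while the application flips readOnly and runs on the slave.

```java
// Minimal model of ReplicationConnection's routing: currentConnection is the
// master leg until readOnly is set, then it becomes the slave leg.
public class ReplicationRoutingSketch {
    static class Leg {
        final String name;
        Leg(String name) { this.name = name; }
    }

    static class ReplConn {
        final Leg master = new Leg("master");
        final Leg slave = new Leg("slave");
        boolean readOnly = false; // Druid's checker never sets this

        Leg current() { return readOnly ? slave : master; }
    }
}
```

So a slave leg that haproxy silently closed still passes validation, because only the master leg was ever pinged.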

After finding this, I searched druid's GitHub issues, and sure enough someone had already reported the problem; it has been fixed in newer versions of druid.

Repair

There are two ways to fix the problem. The recommended one is to upgrade druid, which now ships a MySqlReplicationValidConnectionChecker dedicated to this case. The other is to implement your own ValidConnectionChecker, at the risk of running into further bugs down the road.
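The idea behind the fixed checker can be sketched like this (stand-in interfaces for illustration, not Druid's actual MySqlReplicationValidConnectionChecker API): a replication connection is only considered valid if both legs answer a ping.

```java
// Sketch of the repair idea: validate both legs of a replication connection
// instead of only the one the checker happens to see.
public class ReplicationCheckSketch {
    interface Pingable {
        boolean ping();
    }

    static class ReplConn {
        final Pingable master;
        final Pingable slave;
        ReplConn(Pingable master, Pingable slave) {
            this.master = master;
            this.slave = slave;
        }
    }

    public static boolean isValid(ReplConn conn) {
        return conn.master.ping() && conn.slave.ping();
    }
}
```

With this rule, a dead slave leg fails validation even though the master still responds, so the pool discards the connection instead of handing it to the application.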

For brevity, this article only covered the main lines of investigation; in reality it took us more than a week, plus a dozen-odd hours over a weekend, to find the root cause.

The main reason the problem was hard to locate is how many layers were involved: the framework, the network path, the mysql server, the haproxy proxy, and so on.

Here are some technical gains from this entire investigation process.

Related technical problems

1. mysql's Connection provides a ping method for liveness checking, and the default MySqlValidConnectionChecker uses pingInternal:

   ping = clazz.getMethod("pingInternal", boolean.class, int.class);

2. Older versions of druid do not support plugging in a custom ValidConnectionChecker for personalized checks.

3. Mind how druid's test options interact: testOnBorrow checks when a connection is borrowed and is mutually exclusive with testWhileIdle:

   if (isTestOnBorrow()) {
   } else {
       if (isTestWhileIdle()) {

4. Killing a session in mysql's processlist propagates to the client side's tcp state with a delay, namely tcp's four-way close delay.

5. In haproxy 1.5.18, the mode tcp check was not performed and the health check setting was ineffective.

6. Watch the master/slave switching logic of mysql replication connections, and whether bugs appear when they are combined with upstream connection pools, especially with sharding, read/write separation, and custom shards.

7. When troubleshooting on the mysql server side, turn on the various logs: the general/operation logs and the binlog.

8. Spring transaction management's propagation behavior affects downstream data-source selection via setReadOnly and setAutoCommit.

9. Older versions of druid's MySqlValidConnectionChecker never execute ReplicationConnection's ping method.

Author: Wang Qingpei (Senior architect of Hujiang Network)
