This article explains in detail how the MySQL high-availability tool Orchestrator performs failure detection. I hope you will come away with a working understanding of the mechanisms involved.
Fault detection
Orchestrator (orch) takes a holistic approach to detecting whether the master and any intermediate masters are healthy.
Consider first a more naive method: a monitoring tool raises an alarm whenever it finds the master unreachable or unable to answer a query. This approach is vulnerable to transient network problems and produces false positives. To reduce them, the check is typically repeated n times at an interval of t; in some cases this lowers the chance of a false positive, but it also lengthens the response time to a genuine failure.
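Below is a minimal Go sketch of this naive scheme, assuming the probe is a plain TCP connect to the MySQL port (a real monitor would also run a query); the address and retry parameters are illustrative, not part of any tool:

package main

import (
	"fmt"
	"net"
	"time"
)

// masterUnreachable alarms only if n consecutive probes, spaced t apart,
// all fail: fewer false positives, but slower detection of a real outage.
func masterUnreachable(addr string, n int, t time.Duration) bool {
	for i := 0; i < n; i++ {
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err == nil {
			conn.Close()
			return false // a single success clears the alarm
		}
		time.Sleep(t)
	}
	return true
}

func main() {
	if masterUnreachable("10.10.30.6:3306", 3, 5*time.Second) {
		fmt.Println("ALERT: master unreachable after all retries")
	}
}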
Orchestrator instead takes advantage of the replication topology: it monitors not only the master but also its replicas. To diagnose a master failure, orch requires both of the following conditions to hold (see the sketch after this list):
The master itself cannot be reached.
The master's replicas can be reached, and they report that they are no longer connected to the master.
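The following is a minimal Go sketch of that holistic decision, assuming each replica's view has already been collected; the types and field names are illustrative, not orchestrator's own:

package main

import "fmt"

// ReplicaView is what we learned from one replica of the suspect master.
type ReplicaView struct {
	Reachable         bool // could we query the replica at all?
	ConnectedToMaster bool // does its IO thread still reach the master?
}

// deadMaster applies both conditions: the master itself is unreachable,
// and every reachable replica testifies that it has lost the master.
func deadMaster(masterReachable bool, replicas []ReplicaView) bool {
	if masterReachable {
		return false
	}
	sawReplica := false
	for _, r := range replicas {
		if !r.Reachable {
			continue
		}
		sawReplica = true
		if r.ConnectedToMaster {
			// some replica still sees the master: likely a network
			// problem between us and the master, not a dead master
			return false
		}
	}
	return sawReplica // require at least one replica's testimony
}

func main() {
	replicas := []ReplicaView{
		{Reachable: true, ConnectedToMaster: false},
		{Reachable: true, ConnectedToMaster: false},
	}
	fmt.Println("dead master:", deadMaster(false, replicas)) // true: fail over
}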
Orch thus classifies failures not by elapsed time but by the testimony of the servers in the replication topology themselves, which act as multiple observers. Indeed, when none of the replicas can connect to the master, replication is genuinely broken and there is a sound reason to fail over.
This holistic detection method has proven very reliable for orch in production environments.
Detection mechanism
Orch polls the status of every monitored instance at an interval of InstancePollSeconds (default 5s) and stores the result in the database_instance table of the orchestrator backend database; it then reads each instance's status back from the backend database, again every InstancePollSeconds, and displays it on the web interface.
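A minimal Go sketch of this polling cycle, assuming a hypothetical probe helper in place of the real status statements (shown next); the real orchestrator upserts the result into database_instance rather than printing it:

package main

import (
	"fmt"
	"time"
)

// InstanceStatus holds a few of the many fields the status statements return.
type InstanceStatus struct {
	Hostname string
	Port     int
	Uptime   uint
}

// probe stands in for running the SHOW/SELECT statements below against the
// monitored instance.
func probe(hostname string, port int) (InstanceStatus, error) {
	return InstanceStatus{Hostname: hostname, Port: port, Uptime: 322504}, nil
}

func main() {
	const instancePollSeconds = 5 // orchestrator's InstancePollSeconds default
	ticker := time.NewTicker(instancePollSeconds * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		status, err := probe("10.10.30.5", 3306)
		if err != nil {
			continue // failure path: see "Detecting a failed instance" below
		}
		fmt.Printf("upsert %+v into database_instance\n", status)
	}
}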
The statements used to pull an instance's status are as follows:
show variables like 'maxscale%';
show global status like 'Uptime';
select @@global.hostname, ifnull(@@global.report_host, ''), @@global.server_id, @@global.version, @@global.version_comment, @@global.read_only, @@global.binlog_format, @@global.log_bin, @@global.log_slave_updates;
show master status;
show global status like 'rpl_semi_sync_%_status';
select @@global.gtid_mode, @@global.server_uuid, @@global.gtid_executed, @@global.gtid_purged, @@global.master_info_repository = 'TABLE', @@global.binlog_row_image;
show slave status;
select count(*) > 0 and MAX(User_name) != '' from mysql.slave_master_info;
show slave hosts;
select substring_index(host, ':', 1) as slave_hostname from information_schema.processlist where command in ('Binlog Dump', 'Binlog Dump GTID');
select substring_index(@@hostname, '.', 1);
After pulling an instance's status, orch stores the values in its backend database with the following statement (the values after VALUES are the instance status just pulled):
INSERT INTO database_instance
  (hostname, port, last_checked, last_attempted_check, last_check_partial_success, uptime, server_id, server_uuid, version, major_version, version_comment, binlog_server, read_only, binlog_format, binlog_row_image, log_bin, log_slave_updates, binary_log_file, binary_log_pos, master_host, master_port, slave_sql_running, slave_io_running, replication_sql_thread_state, replication_io_thread_state, has_replication_filters, supports_oracle_gtid, oracle_gtid, master_uuid, ancestry_uuid, executed_gtid_set, gtid_mode, gtid_purged, gtid_errant, mariadb_gtid, pseudo_gtid, master_log_file, read_master_log_pos, relay_master_log_file, exec_master_log_pos, relay_log_file, relay_log_pos, last_sql_error, last_io_error, seconds_behind_master, slave_lag_seconds, sql_delay, num_slave_hosts, slave_hosts, cluster_name, suggested_cluster_alias, data_center, region, physical_environment, replication_depth, is_co_master, replication_credentials_available, has_replication_credentials, allow_tls, semi_sync_enforced, semi_sync_master_enabled, semi_sync_replica_enabled, instance_alias, last_discovery_latency, last_seen)
VALUES
  ('10.10.30.5', 3306, NOW(), NOW(), 1, 322504, 1521, 'e2685a0f-d8f8-11e9-a2c9-002590e95c3c', '5.7.22-log', '5.7', 'MySQL Community Server (GPL)', 0, 1, 'ROW', 'FULL', 1, 1, 'mysql-bin.000016', 129186924, '10.10.30.6', 3306, 1, 1, 1, 1, 0, 1, 1, '6bf30525-d8f8-11e9-808c-0cc47a74fca8', '6bf30525-d8f8-11e9-808c-0cc47a74fca8,e2685a0f-d8f8-11e9-a2c9-002590e95c3c', '6bf30525-d8f8-11e9-808c-0cc47a74fca8:1-1554568,\ne2685a0f-d8f8-11e9-a2c9-002590e95c3c:1-632541', 'ON', '', '', 0, 0, 'mysql-bin.000017', 150703414, 'mysql-bin.000017', 150703414, 'mysql-relay-bin.000052', 137056344, '', '', 0, 0, 0, 1, '[{"Hostname":"10.10.30.6","Port":3306}]', '10.10.30.6:3306', 'qhp-6', '', '', '', 1, 1, 1, 1, 0, 0, 0, 0, '', 8083748, NOW())
ON DUPLICATE KEY UPDATE
  hostname=VALUES(hostname), port=VALUES(port), last_checked=VALUES(last_checked), last_attempted_check=VALUES(last_attempted_check), last_check_partial_success=VALUES(last_check_partial_success), uptime=VALUES(uptime), server_id=VALUES(server_id), server_uuid=VALUES(server_uuid), version=VALUES(version), major_version=VALUES(major_version), version_comment=VALUES(version_comment), binlog_server=VALUES(binlog_server), read_only=VALUES(read_only), binlog_format=VALUES(binlog_format), binlog_row_image=VALUES(binlog_row_image), log_bin=VALUES(log_bin), log_slave_updates=VALUES(log_slave_updates), binary_log_file=VALUES(binary_log_file), binary_log_pos=VALUES(binary_log_pos), master_host=VALUES(master_host), master_port=VALUES(master_port), slave_sql_running=VALUES(slave_sql_running), slave_io_running=VALUES(slave_io_running), replication_sql_thread_state=VALUES(replication_sql_thread_state), replication_io_thread_state=VALUES(replication_io_thread_state), has_replication_filters=VALUES(has_replication_filters), supports_oracle_gtid=VALUES(supports_oracle_gtid), oracle_gtid=VALUES(oracle_gtid), master_uuid=VALUES(master_uuid), ancestry_uuid=VALUES(ancestry_uuid), executed_gtid_set=VALUES(executed_gtid_set), gtid_mode=VALUES(gtid_mode), gtid_purged=VALUES(gtid_purged), gtid_errant=VALUES(gtid_errant), mariadb_gtid=VALUES(mariadb_gtid), pseudo_gtid=VALUES(pseudo_gtid), master_log_file=VALUES(master_log_file), read_master_log_pos=VALUES(read_master_log_pos), relay_master_log_file=VALUES(relay_master_log_file), exec_master_log_pos=VALUES(exec_master_log_pos), relay_log_file=VALUES(relay_log_file), relay_log_pos=VALUES(relay_log_pos), last_sql_error=VALUES(last_sql_error), last_io_error=VALUES(last_io_error), seconds_behind_master=VALUES(seconds_behind_master), slave_lag_seconds=VALUES(slave_lag_seconds), sql_delay=VALUES(sql_delay), num_slave_hosts=VALUES(num_slave_hosts), slave_hosts=VALUES(slave_hosts), cluster_name=VALUES(cluster_name), suggested_cluster_alias=VALUES(suggested_cluster_alias), data_center=VALUES(data_center), region=VALUES(region), physical_environment=VALUES(physical_environment), replication_depth=VALUES(replication_depth), is_co_master=VALUES(is_co_master), replication_credentials_available=VALUES(replication_credentials_available), has_replication_credentials=VALUES(has_replication_credentials), allow_tls=VALUES(allow_tls), semi_sync_enforced=VALUES(semi_sync_enforced), semi_sync_master_enabled=VALUES(semi_sync_master_enabled), semi_sync_replica_enabled=VALUES(semi_sync_replica_enabled), instance_alias=VALUES(instance_alias), last_discovery_latency=VALUES(last_discovery_latency), last_seen=VALUES(last_seen);
Every InstancePollSeconds, orch then reads the status of each monitored instance back from the backend database and displays it on the web interface.
Detecting a failed instance
If an instance goes down, orch's poll fails every InstancePollSeconds and no fresh status can be obtained, so the INSERT above cannot be used to record the instance's status. Instead, orch updates its backend database as follows:

-- every InstancePollSeconds, update the last_checked and last_check_partial_success fields of database_instance
update database_instance set last_checked = NOW(), last_check_partial_success = 0 where hostname = '10.10.30.170' and port = 3306;
-- every InstancePollSeconds + 1s, update the last_attempted_check field of database_instance
update database_instance set last_attempted_check = NOW() where hostname = '10.10.30.170' and port = 3306;
Why introduce last_attempted_check here? Two comments from the source code explain:
// UpdateInstanceLastAttemptedCheck updates the last_attempted_check timestamp in the orchestrator backed database
// for a given instance.
// This is used as a failsafe mechanism in case access to the instance gets hung (it happens), in which case
// the entire ReadTopology gets stuck (and no, connection timeout nor driver timeouts don't help. Don't look at me,
// the world is a harsh place to live in).
// And so we make sure to note down *before* we even attempt to access the instance, and this raises a red flag when we
// wish to access the instance again: if last_attempted_check is *newer* than last_checked, that's bad news and means
// we have a "hanging" issue.
func UpdateInstanceLastAttemptedCheck(instanceKey *InstanceKey) error {
	writeFunc := func() error {
		_, err := db.ExecOrchestrator(`
			update database_instance
				set last_attempted_check = NOW()
			where hostname = ? and port = ?`,
			instanceKey.Hostname, instanceKey.Port,
		)
		return log.Errore(err)
	}
	return ExecDBWriteFunc(writeFunc)
}

// ValidSecondsFromSeenToLastAttemptedCheck returns the maximum allowed elapsed time
// between last_attempted_check to last_checked before we consider the instance as invalid.
func ValidSecondsFromSeenToLastAttemptedCheck() uint {
	return config.Config.InstancePollSeconds + 1
}
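Putting the two comments together, here is a minimal Go sketch of the failsafe they describe, assuming both timestamps have been read back from database_instance; the function is illustrative, not orchestrator's actual code:

package main

import (
	"fmt"
	"time"
)

const instancePollSeconds = 5 // mirrors config.Config.InstancePollSeconds

// isHanging raises the red flag from the comment above: last_attempted_check
// was stamped before the probe, but last_checked never caught up within the
// allowed InstancePollSeconds+1 window, so the probe is presumed stuck.
func isHanging(lastChecked, lastAttemptedCheck time.Time) bool {
	validWindow := time.Duration(instancePollSeconds+1) * time.Second
	return lastAttemptedCheck.Sub(lastChecked) > validWindow
}

func main() {
	now := time.Now()
	fmt.Println(isHanging(now.Add(-30*time.Second), now)) // true: check is hung
}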
Determining whether an instance is alive
Orch determines whether a monitored instance is healthy in the following way:
-- every InstancePollSeconds, orch reads back a validity flag for each instance:
select ifnull(last_checked <= last_seen, 0) as is_last_check_valid from database_instance where hostname = '10.10.30.170' and port = 3306;
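As a minimal Go sketch of that rule, assuming the two timestamps from the row above (field names illustrative):

package main

import (
	"fmt"
	"time"
)

// lastCheckValid mirrors ifnull(last_checked <= last_seen, 0): the most recent
// check attempt actually refreshed the instance, so it counts as alive.
func lastCheckValid(lastChecked, lastSeen time.Time) bool {
	return !lastChecked.After(lastSeen)
}

func main() {
	now := time.Now()
	fmt.Println(lastCheckValid(now.Add(-2*time.Second), now))  // true: alive
	fmt.Println(lastCheckValid(now, now.Add(-10*time.Second))) // false: checked but not seen
}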