How the MySQL High-Availability Tool Orchestrator Restores a Topology


This article shares how the MySQL high-availability tool Orchestrator performs topology recovery. The editor finds it quite practical, and shares it here for your reference; hopefully you will gain something from reading it. Without further ado, let's take a look.

Preface

The editor will walk through topology recovery in orchestrator.

Topology recovery

Orchestrator can recover from a range of failure scenarios. In particular, it can recover from the failure of a master or of an intermediate master.

Automatic and manual

Orchestrator supports:

Automatic recovery (taking action on unexpected failures).

Graceful, planned master-replica switchover.

Manual recovery.

Manual, forced failover.

Requirements

To run any type of failover, the topology must use one of the following:

Oracle GTID (master_auto_position=1)

MariaDB GTID

Pseudo-GTID

Binlog Servers

What is recovery?

Recovery is based on failure detection and consists of a series of events:

Pre-recovery hooks (a hook is an externally executed process or script).

Repair the topology.

Post-recovery hooks.

Note:

Pre-recovery hooks are configured by the user.

- They are executed sequentially.

- The failure of any hook (non-zero exit code) aborts the failover.

Topology repair is managed by orchestrator and is state-based, not configuration-based. Orchestrator does its best given the existing topology, server versions, server configuration, and other factors.

Post-recovery hooks are likewise configured by the user.
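
As a sketch of how this wiring looks in practice: hooks live in orchestrator's JSON configuration file. The key below is a documented orchestrator setting; the gate script path is a hypothetical placeholder. A non-zero exit code from any listed command aborts the failover.

{
  "PreFailoverProcesses": [
    "/usr/local/bin/can-failover-now.sh"
  ]
}

Here /usr/local/bin/can-failover-now.sh would inspect whatever internal state matters to you (ongoing backups, maintenance windows, and so on) and exit non-zero to veto the recovery.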

Recovery scenario 1: a dead intermediate master

A simple recovery case is DeadIntermediateMaster. Its replicas are orphaned, but with GTID or Pseudo-GTID they can be reconnected to the topology. We may choose to:

Find a sibling of the failed intermediate master and move the orphaned replicas under that sibling.

Promote one of the orphaned replicas to be an intermediate master of the same depth, then connect it back into the topology.

Match all orphaned replicas directly back into the topology.

Combine parts of the above.

The actual implementation depends largely on the topology setup (which instances have log-slave-updates enabled, whether there is replication lag, whether replication filters are in place, the MySQL version, and so on). Your topology most likely supports at least one of the above; in particular, matching the replicas is a simple solution unless replication filters are used. A manual example follows.
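
For a manual illustration, orchestrator-client exposes these building blocks directly. A hypothetical invocation, assuming GTID or Pseudo-GTID is available (hostnames are illustrative):

orchestrator-client -c relocate-replicas -i failed.intermediate.master:3306 -d sibling.intermediate.master:3306

This asks orchestrator to move the replicas of the failed intermediate master under the given sibling, using whatever mechanism (GTID, Pseudo-GTID, binlog servers) the topology supports.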

Recovery scenario 2: a dead master

Recovering from a dead master is a more complex operation, for several reasons:

If there is an ongoing outage (power, network), recovery should be as fast as possible.

Some servers may be lost in the process; orchestrator needs to determine which ones.

The topology may be in a state where the user wants recovery to be blocked.

Master service discovery must happen: the application must be able to talk to the new master (and potentially be told that the master has changed).

The most suitable replica must be found and promoted to master.

A naive approach is to pick the most up-to-date replica, but that is not always the right choice:

- The most up-to-date replica cannot necessarily act as a master for the other replicas (binlog format, MySQL version, replication filters, and so on). Blindly promoting the most up-to-date replica may cost you replicas, losing redundancy.

- Orchestrator tries to promote the replica that preserves the greatest serving capacity.

The promoted replica must take over its siblings.

Its siblings must be brought up to date.

A two-phase promotion may be needed: the user may have marked a specific server to be promoted (see the register-candidate command).

Hooks must be invoked.

To a large extent, master service discovery needs to be implemented by the user. Common solutions are:

DNS-based discovery: orchestrator must call a hook that updates the DNS entry.

ZooKeeper/Consul KV/etcd/other key-value-based discovery: orchestrator has built-in support for Consul KV; otherwise an external hook must update the K-V store.

Proxy-based discovery: orchestrator calls an external hook that updates the proxy configuration, or updates Consul/ZooKeeper/etcd as above, which in turn triggers a proxy configuration update.

Other approaches.

Orchestrator tries to be a general-purpose solution and places no restrictions on the user's service-discovery method. A sketch of the built-in Consul support follows.
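
As a sketch of the built-in Consul KV support, the relevant configuration looks like this (the key names are documented orchestrator settings; the address and prefix values here are illustrative):

{
  "ConsulAddress": "127.0.0.1:8500",
  "KVClusterMasterPrefix": "mysql/master"
}

With this in place, orchestrator writes the identity of a newly promoted master under the given prefix on failover, and proxies or other tooling can watch those keys to discover the current master.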

Automatic recovery

Automatic recovery is optional. It may be enabled for all (*) clusters or only for specific clusters.

Recovery takes place after detection, and assumes it is not blocked (see below).

Master recovery and intermediate-master recovery can be configured separately. A detailed breakdown of the recovery-related configuration follows below.

The analysis mechanism runs continuously and periodically checks for failure scenarios. Automatic recovery applies to:

An actionable failure scenario (not merely a suspicious or partial one).

Instances that are not in downtime.

Instances belonging to a cluster for which recovery is explicitly enabled.

Instances in a cluster that has not recently been recovered, unless those recent recoveries have been acknowledged.

Global recoveries are enabled.
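
A minimal configuration sketch for the above (the key names are documented orchestrator settings; the cluster patterns are illustrative):

{
  "RecoverMasterClusterFilters": ["*"],
  "RecoverIntermediateMasterClusterFilters": ["staging", "prod"],
  "RecoveryPeriodBlockSeconds": 3600
}

"*" enables master recovery on all clusters; otherwise the filters are matched against cluster names and aliases. RecoveryPeriodBlockSeconds is the anti-flapping block discussed later.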

Graceful master promotion

Use this to replace the master in a planned, orderly manner.

Typically the master is replaced with another server for an upgrade, host maintenance, and so on. This is a graceful master promotion.

In a graceful takeover:

You designate a server to be promoted.

Orchestrator sets the master to read-only.

Orchestrator ensures the designated server catches up on replication.

Orchestrator promotes the designated server as the new master.

Orchestrator sets the promoted server to writable.

The operation takes a few seconds, during which the master, as seen by the application, is read-only.

In addition to the standard hooks, orchestrator provides special hooks that run on a graceful takeover:

PreGracefulTakeoverProcesses

PostGracefulTakeoverProcesses

For example, you may wish to silence your pager during a planned failover. An advanced use is to hold traffic at the proxy layer. A sketch follows.
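
A sketch of wiring that up (the key names are documented orchestrator settings; the pager scripts are hypothetical placeholders):

{
  "PreGracefulTakeoverProcesses": [
    "/usr/local/bin/pager-silence.sh"
  ],
  "PostGracefulTakeoverProcesses": [
    "/usr/local/bin/pager-unsilence.sh"
  ]
}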

For a graceful master promotion, one of the following must hold:

You specify the server to promote (it must be a direct replica of the master).

The topology has exactly one direct replica under the master (in which case the identity of the designated replica is unambiguous and need not be given).

Invoke a graceful takeover in one of the following ways:

Command line: orchestrator-client -c graceful-master-takeover -alias mycluster -s designated.master.to.promote:3306

Web API:

- /api/graceful-master-takeover/:clusterHint/:designatedHost/:designatedPort

Gracefully promotes a new master (planned failover), specifying the server to promote.

- /api/graceful-master-takeover/:clusterHint

Gracefully promotes a new master (planned failover). No server is specified; this works when the master has exactly one direct replica.

Web interface:

- Drag a direct replica of the master onto the left half of the master's box.

Manual recovery

Manual recovery applies when an instance is identified as failed but automatic recovery is disabled or blocked.

You can ask orchestrator to recover a specific failed instance. The instance must be recognized as failed. You may request recovery of a downtimed instance (since this is a manual recovery, it overrides the automatic-recovery configuration). Recover in one of the following ways:

Command line: orchestrator-client -c recover -i dead.instance.com:3306 --debug

Web API: /api/recover/dead.instance.com/3306

Web interface: the instance turns black; click the recovery button.

Manual recovery is not subject to the RecoveryPeriodBlockSeconds parameter, nor to RecoverMasterClusterFilters and RecoverIntermediateMasterClusterFilters. Users can therefore always recover on demand. However, two recoveries cannot run on the same instance at the same time; a concurrent recovery request may be blocked while one is already in progress.

Manual, forced failover

A forced failover overrides orchestrator's own judgment.

Perhaps orchestrator does not consider the instance failed, or your application logic requires the master to change right now, or orchestrator is unsure about the failure type. If you want to fail over right now:

Command line: orchestrator-client -c force-master-failover -alias mycluster

or orchestrator-client -c force-master-failover -i instance.in.that.cluster

Web API: /api/force-master-failover/mycluster

or /api/force-master-failover/instance.in.that.cluster/3306

Web, API, command line

Audit recoveries in the following ways:

/web/audit-recovery

/api/audit-recovery

/api/audit-recovery-steps/:uid

Audit and control via:

/api/blocked-recoveries: list blocked recoveries.

/api/ack-recovery/cluster/:clusterHint: acknowledge recovery on a given cluster.

/api/ack-all-recoveries: acknowledge all recoveries.

/api/disable-global-recoveries: a global switch forbidding orchestrator from running any recovery.

/api/enable-global-recoveries: re-enable recoveries.

/api/check-global-recoveries: check whether global recoveries are enabled.

Run a manual recovery:

/api/recover/:host/:port: recover the given host, assuming orchestrator agrees it has failed.

/api/recover-lite/:host/:port: as above, but without invoking external hooks (useful for testing).

/api/graceful-master-takeover/:clusterHint/:designatedHost/:designatedPort: gracefully promote a new master (planned failover), specifying the server to promote.

/api/graceful-master-takeover/:clusterHint: gracefully promote a new master (planned failover). No server is specified; this works when the master has exactly one direct replica.

/api/force-master-failover/:clusterHint: force a failover on the given cluster, in case of emergency.

Some corresponding command-line invocations:

orchestrator-client -c recover -i some.instance:3306

orchestrator-client -c graceful-master-takeover -i some.instance.in.somecluster:3306

orchestrator-client -c graceful-master-takeover -alias somecluster

orchestrator-client -c force-master-takeover -alias somecluster

orchestrator-client -c ack-cluster-recoveries -alias somecluster

orchestrator-client -c ack-all-recoveries

orchestrator-client -c disable-global-recoveries

orchestrator-client -c enable-global-recoveries

orchestrator-client -c check-global-recoveries

Blocking, acknowledgments, anti-flapping

Orchestrator avoids flapping (cascading failures causing continual outages and resource drain) by introducing a blocking period. On any given cluster, orchestrator will not run an automatic recovery within the blocking period of a previous recovery, unless the user explicitly allows it.

The blocking period is set by the RecoveryPeriodBlockSeconds parameter. It only applies to recoveries on the same cluster; concurrent recoveries on different clusters are unaffected.

The block is lifted once a pending recovery is older than RecoveryPeriodBlockSeconds, or once it has been acknowledged.

You can acknowledge a recovery through the web API/interface (see the audit/recovery pages) or via the command line (orchestrator-client -c ack-cluster-recoveries -alias somealias).

Note that manual recoveries, such as orchestrator-client -c recover or orchestrator-client -c force-master-failover, ignore the blocking period.

Promotion rules

In a failover, some servers are better suited for promotion to master than others. For example:

A server has weak hardware; you would rather not promote it.

A server is in a remote data center and should not become the master.

A server serves as a backup source and always has LVM snapshots open; you would rather not promote it.

A server is well configured and makes a good candidate; you would prefer to promote it.

A server is unremarkable; no particular preference.

You can set a preference in the following way:

orchestrator -c register-candidate -i ${::fqdn} --promotion-rule ${promotion_rule}

The available promotion rules are:

prefer

neutral

prefer_not

must_not

A promotion rule is valid for one hour by default (parameter: CandidateInstanceExpireMinutes). This fits orchestrator's dynamic nature. You can keep promotion rules fresh with a cron job:

*/2 * * * * root "/usr/bin/perl -le 'sleep rand 10' && /usr/bin/orchestrator-client -c register-candidate -i this.hostname.com --promotion-rule prefer"

This setup comes from a production environment; the cron entry is deployed via puppet to carry the appropriate promotion_rule. A server may be prefer at one moment and prefer_not five minutes later. Integrate your own service-discovery methods and scripts to supply an up-to-date promotion_rule.

Downtime

All failures and recoveries are analyzed, but an instance's downtime state is also considered: an instance can be downtimed via orchestrator-client -c begin-downtime, and automatic recovery skips downtimed servers.

In fact, downtime was created precisely for this purpose: it lets the DBA prevent automatic failover on specific servers.

Note that manual recovery (for example, orchestrator-client -c recover) overrides downtime. A usage sketch follows.
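
A sketch of marking a server as downtimed ahead of maintenance (hostname illustrative; the exact flags may vary by version, so check your orchestrator help output):

orchestrator -c begin-downtime -i replica3.mycompany.com:3306 --duration=2h --owner=dba --reason="LVM backup snapshot"

orchestrator -c end-downtime -i replica3.mycompany.com:3306

While the downtime is in effect, automatic recovery will not act on replica3; a manual recover command still would.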

Recovery hooks

Orchestrator supports hooks: external scripts invoked during the recovery process. These are arrays of commands executed via the shell, specifically bash.

OnFailureDetectionProcesses: executed when a failure scenario is detected (before deciding whether to fail over).

PreGracefulTakeoverProcesses: executed during a graceful master takeover, immediately before the master is made read-only.

PreFailoverProcesses: executed immediately before orchestrator performs a recovery operation. Failure (non-zero exit code) of any of these processes aborts the recovery. Hint: this is an opportunity to abort the recovery based on some internal state of your system.

PostMasterFailoverProcesses: executed at the end of a successful master recovery.

PostIntermediateMasterFailoverProcesses: executed at the end of a successful intermediate-master recovery.

PostFailoverProcesses: executed at the end of any successful recovery (including, and in addition to, PostMasterFailoverProcesses and PostIntermediateMasterFailoverProcesses).

PostUnsuccessfulFailoverProcesses: executed at the end of any unsuccessful recovery.

PostGracefulTakeoverProcesses: executed on a planned, graceful master takeover, after the old master has been repositioned under the newly promoted master.
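
Pulling these together, a configuration sketch (the key names are documented orchestrator settings; the script paths are hypothetical, and the {curly-brace} tokens are placeholders that orchestrator expands when invoking the hook, per its documentation):

{
  "OnFailureDetectionProcesses": [
    "echo 'Detected {failureType} on {failedHost}:{failedPort}' >> /var/log/orchestrator-recovery.log"
  ],
  "PostMasterFailoverProcesses": [
    "/usr/local/bin/update-master-discovery.sh {successorHost} {successorPort}"
  ],
  "PostUnsuccessfulFailoverProcesses": [
    "/usr/local/bin/alert-dba.sh 'recovery of {failedHost} failed'"
  ]
}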

This is how the MySQL high-availability tool Orchestrator performs topology recovery. The editor believes some of these points will come up in daily work; hopefully this article has taught you something new.
