Analysis of the MySQL High Availability Architecture MHA
This article introduces the principles of the MySQL high availability architecture MHA and walks through a complete deployment. It should be a useful reference for anyone interested in the topic.
MHA role deployment
MHA has two roles, MHA Manager (management node) and MHA Node (data node):
MHA Manager: usually deployed on a dedicated machine, or directly on one of the slaves (the latter is not recommended), to manage multiple master/slave clusters. Each master/slave cluster is called an application. The Manager has two functions:
(1) running the automatic master switchover and failover commands
(2) running other helper scripts: manual master switchover, master/slave status detection
MHA Node: runs on each MySQL server (master/slave/manager). It speeds up failover with scripts that can parse and clean up logs. Its functions are:
(1) saving the binlog data of the master node
(2) identifying the differences between the relay log files of the slave nodes
(3) purging relay logs on a schedule, without needing to stop the slave's SQL thread
At present MHA mainly supports a one-master, multiple-slave architecture: an MHA replication cluster needs at least three database servers, that is, one master, one standby master, and one slave. Because at least three servers are required, Taobao modified MHA to reduce machine cost, and Taobao's TMHA already supports one master and one slave.
In fact, MHA can also be used with just one master and one slave, but then if the master host itself goes down, no switchover is possible and the binlog cannot be recovered; if only the master's mysqld process crashes, the switchover can still succeed and the binlog can still be recovered.
Official introduction: https://code.google.com/p/mysql-master-ha/
The following summarizes how MHA Manager manages multiple master-slave replication groups. MHA works roughly as follows:
(1) save the binary log events (binlog events) from the crashed master
(2) identify the slave with the latest updates
(3) apply the differential relay logs to the other slaves
(4) apply the binlog events saved from the master
(5) promote one slave to be the new master
(6) point the other slaves at the new master and resume replication
MHA component
(1) Manager tools:
-masterha_check_ssh: check the SSH configuration of MHA.
-masterha_check_repl: check MySQL replication.
-masterha_manager: start MHA.
-masterha_check_status: check the current running status of MHA.
-masterha_master_monitor: monitor whether master is down.
-masterha_master_switch: controls failover (automatic or manual).
-masterha_conf_host: add or remove configured server information.
(2) Node tools (these tools are usually triggered by MHA Manager scripts and do not require manual operation):
-save_binary_logs: save and copy the binary log of master.
-apply_diff_relay_logs: identify different relay log events and apply them to other slave.
-filter_mysqlbinlog: remove unnecessary ROLLBACK events (MHA no longer uses this tool).
-purge_relay_logs: clears the relay log (does not block SQL threads).
(3) Custom extension:
-secondary_check_script: check the availability of master through multiple network routes.
-master_ip_failover_script: update the master IP used by the application (needs to be modified before use).
-shutdown_script: force shutdown of the master node.
-report_script: send a report after failover.
-init_conf_load_script: load initial configuration parameters.
-master_ip_online_change: update the master node's IP during an online switchover (needs to be modified before use).
MHA environment preparation
OS: CentOS 6.8
MySQL: 5.7.18
MHA package: MHA 0.57

Role     IP address     Hostname   server_id   Type
Master   10.180.2.163   MHA-M1     13306       write
S1       10.180.2.164   MHA-S1     23306       read
S2       10.180.2.165   MHA-S2     33306       read, monitoring replication group

(The monitor can be deployed together with a data node, and one MHA group can have multiple monitoring nodes. The monitor generally should not be deployed on the master node; otherwise, when the master goes down, no switchover can be performed.)
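Since app1.cnf and the commands below refer to the nodes by hostname, every server needs consistent name resolution. A minimal /etc/hosts sketch for this environment (assuming no internal DNS is available):

10.180.2.163 MHA-M1
10.180.2.164 MHA-S1
10.180.2.165 MHA-S2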
Install the MHA Node package
(1) install the perl module (DBD::mysql) required by MHA Node on all nodes, and download the MHA software package:

yum install perl-DBD-MySQL -y    # the epel repository may be required

Download (MHA 0.57): https://mega.nz/#F!G4oRjARB!SWzFS59bUv9VrKwdAeIGVw

(2) install MHA Node on all nodes (including the Manager node):

tar xf mha4mysql-node-0.57.tar.gz
cd mha4mysql-node-0.57
perl Makefile.PL
make && make install
When the installation is complete, the files are as follows:
[root@MHA-S1 bin]# pwd
/usr/local/bin
[root@MHA-S1 bin]# ll
total 48
-r-xr-xr-x 1 root root 16381 Aug  7 14:06 apply_diff_relay_logs
-r-xr-xr-x 1 root root  4807 Aug  7 14:06 filter_mysqlbinlog
lrwxrwxrwx 1 root root    26 Aug  8 17:10 mysql -> /usr/local/mysql/bin/mysql
lrwxrwxrwx 1 root root    32 Aug  8 17:09 mysqlbinlog -> /usr/local/mysql/bin/mysqlbinlog
-r-xr-xr-x 1 root root  8261 Aug  7 14:06 purge_relay_logs
-rwxr-xr-x 1 root root   314 Aug  8 16:21 purge_relay.sh
-r-xr-xr-x 1 root root  7525 Aug  7 14:06 save_binary_logs
Add /usr/local/bin to the system environment variables:

echo "export PATH=\$PATH:/usr/local/bin" >> /etc/profile
source ~/.bash_profile

Install the MHA Manager package (note that this is the manager tarball, not the node one):

tar xf mha4mysql-manager-0.57.tar.gz
cd mha4mysql-manager-0.57
perl Makefile.PL
make && make install
After the installation is complete, the following script files are generated under the / usr/local/bin directory
[root@MHA-S2 bin]# pwd
/usr/local/bin
[root@MHA-S2 bin]# ll
total 140
-r-xr-xr-x 1 root root 16381 Aug  7 14:07 apply_diff_relay_logs
-r-xr-xr-x 1 root root  4807 Aug  7 14:07 filter_mysqlbinlog
-rwxr-xr-x 1 root root   166 Aug  9 17:18 manager.sh
-r-xr-xr-x 1 root root  1995 Aug  7 17:28 masterha_check_repl
-r-xr-xr-x 1 root root  1779 Aug  7 17:28 masterha_check_ssh
-r-xr-xr-x 1 root root  1865 Aug  7 17:28 masterha_check_status
-r-xr-xr-x 1 root root  3201 Aug  7 17:28 masterha_conf_host
-r-xr-xr-x 1 root root  2517 Aug  7 17:28 masterha_manager
-r-xr-xr-x 1 root root  2165 Aug  7 17:28 masterha_master_monitor
-r-xr-xr-x 1 root root  2373 Aug  7 17:28 masterha_master_switch
-r-xr-xr-x 1 root root  5171 Aug  7 17:28 masterha_secondary_check
-r-xr-xr-x 1 root root  1739 Aug  7 17:28 masterha_stop
-rwxr-xr-x 1 root root  2169 Aug  9 10:49 master_ip_failover
-rwxr-xr-x 1 root root  3648 Aug  7 17:30 master_ip_failover.old
-rwxr-xr-x 1 root root 10369 Aug 12 21:33 master_ip_online_change
-rwxr-xr-x 1 root root  9870 Aug  7 17:30 master_ip_online_change.old
lrwxrwxrwx 1 root root    26 Aug  8 17:10 mysql -> /usr/local/mysql/bin/mysql
lrwxrwxrwx 1 root root    32 Aug  8 17:09 mysqlbinlog -> /usr/local/mysql/bin/mysqlbinlog
-rw------- 1 root root     0 Aug 12 20:04 nohup.out
-rwxr-xr-x 1 root root 11867 Aug  7 17:30 power_manager
-r-xr-xr-x 1 root root  8261 Aug  7 14:07 purge_relay_logs
-rwxr-xr-x 1 root root   314 Aug  8 16:20 purge_relay.sh
-r-xr-xr-x 1 root root  7525 Aug  7 14:07 save_binary_logs
-rwxr-xr-x 1 root root  1360 Aug  7 17:30 send_report
Copy the sample scripts from the unpacked package to the /usr/local/bin directory. This is not strictly required, because these scripts are incomplete: they are templates left by the software author and must be modified by us. If you enable the configuration parameter corresponding to one of the scripts below without modifying the script itself, the checks will fail and you will be badly caught out.
[root@MHA-S2 scripts] # ll
Total 32
-rwxr-xr-x 1 root root  3443 Jan  8  2012 master_ip_failover         # VIP management during automatic switchover; not required. If keepalived is used, you can write your own script to manage the VIP, e.g. monitor mysql and stop keepalived when mysql fails, so that the VIP drifts automatically.
-rwxr-xr-x 1 root root  9186 Jan  8  2012 master_ip_online_change    # VIP management during online switchover; not required, a simple shell script can also do the job.
-rwxr-xr-x 1 root root 11867 Jan  8  2012 power_manager              # script for shutting down the host after a failure; not required.
-rwxr-xr-x 1 root root  1360 Jan  8  2012 send_report                # script that sends an alarm after failover; not required, a simple shell script can do the job.
[root@MHA-S2 scripts] # cp * / usr/local/bin/
Configure SSH login without password authentication
ssh-keygen
ssh-copy-id root@xxx    # include the host itself in xxx, or masterha_check_ssh will fail miserably later
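With three nodes this has to be done on every host for every host; a small loop sketch using the hostnames above:

for h in MHA-M1 MHA-S1 MHA-S2; do
    ssh-copy-id root@$h
done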
Build a master-slave replication environment
For the detailed build steps, refer to the earlier document on setting up a dual-master replication environment.
Ensure that both slaves are replicating successfully:

Slave_IO_Running: Yes
Slave_SQL_Running: Yes

Set read_only on both slave servers (the slaves serve reads; this is set at runtime rather than written into the configuration file, because a slave can be promoted to master at any time):

root@localhost:mysql3306.sock [(none)] > set global read_only=1;
Create a monitoring user (execute on master)
grant all privileges on *.* to root@'%' identified by '123456'; flush privileges;
At this point the replication setup is complete; MHA is configured next.
MHA environment configuration
(1) create the MHA working directory:

mkdir -p /etc/mha
Modify the app1.cnf configuration file with the following contents:
[root@MHA-S2 ~]# cat /etc/mha/app1.cnf
[server default]
manager_workdir=/var/log/masterha/app1
manager_log=/var/log/masterha/app1/manager.log
master_binlog_dir=/data/mysql
master_ip_failover_script=/usr/local/bin/master_ip_failover
master_ip_online_change_script=/usr/local/bin/master_ip_online_change
user=root
password=123456
ping_interval=1
repl_user=repl
repl_password=123456
ssh_user=root
secondary_check_script=/usr/local/bin/masterha_secondary_check -s MHA-S1 -s MHA-S2

[server1]
hostname=MHA-M1
port=3306

[server2]
hostname=MHA-S1
port=3306
# set as a candidate master: after a master-slave switchover, this slave is
# promoted to master even if it does not hold the latest events in the cluster
candidate_master=1
# by default MHA will not pick a slave as the new master if it lags behind the
# master's relay logs by more than 100MB, because its recovery would take a long
# time; check_repl_delay=0 makes MHA ignore replication delay when choosing the
# new master, which is very useful together with candidate_master=1, since the
# candidate must become the new master during the switchover
check_repl_delay=0

[server3]
hostname=MHA-S2
port=3306
(2) set the relay log purge mode (on each slave node):

set global relay_log_purge=0;
Note:
During a switchover, MHA relies on the relay logs of the slaves to recover the other slaves, so automatic relay log purging must be turned off here and the relay logs cleared manually instead. By default a slave's relay logs are deleted automatically once its SQL thread has executed them, but in an MHA environment those relay logs may be needed to restore other slaves, so automatic deletion has to be disabled. Periodic clearing of the relay logs also needs to take replication latency into account: under the ext3 file system, deleting a large file takes considerable time and can cause serious replication delay. To avoid this, a hard link is created for the relay log first, because on Linux deleting a large file through a hard link is fast. (In MySQL, hard links are likewise commonly used to drop large tables.)
The MHA Node package includes the purge_relay_logs tool, which creates hard links for the relay logs, executes SET GLOBAL relay_log_purge=1, waits a few seconds for the SQL thread to switch to a new relay log, and then executes SET GLOBAL relay_log_purge=0.
The parameters of the purge_relay_logs script are as follows:
--user: mysql user name
--password: mysql password
--port: port number
--workdir: the location where the hard links of the relay logs are created, /var/tmp by default. Creating a hard link across file systems fails, so the location should be specified explicitly; after the script completes successfully, the hard-linked relay log files are deleted.
--disable_relay_log_purge: by default, if relay_log_purge=1 the script cleans nothing and exits immediately. With this parameter, relay_log_purge is first set to 0, the relay logs are purged, and the parameter is finally set back to OFF (0).
(3) set up a scheduled relay log purge script (for example, once a day, on all servers). The script looks like this (the user, password, and directories here follow this environment; adjust them to yours):

[root@MHA-S2 bin]# cat /root/purge_relay_log.sh
#!/bin/bash
user=root
passwd=123456
port=3306
log_dir='/data/masterha/log'
work_dir='/data'
purge='/usr/local/bin/purge_relay_logs'
if [ ! -d $log_dir ]; then
    mkdir $log_dir -p
fi
$purge --user=$user --password=$passwd --disable_relay_log_purge --port=$port --workdir=$work_dir >> $log_dir/purge_relay_logs.log 2>&1
Add it to crontab:

[root@MHA-S2 bin]# crontab -l
0 4 * * * /bin/bash /root/purge_relay_log.sh
You can run the script manually first to check whether it reports any errors.
Check SSH configuration
[root@MHA-S2 bin]# masterha_check_ssh --conf=/etc/mha/app1.cnf
Mon Aug 14 18:07:02 2017 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Mon Aug 14 18:07:02 2017 - [info] Reading application default configuration from /etc/mha/app1.cnf..
Mon Aug 14 18:07:02 2017 - [info] Reading server configuration from /etc/mha/app1.cnf..
Mon Aug 14 18:07:02 2017 - [info] Starting SSH connection tests..
Mon Aug 14 18:07:02 2017 - [debug] Connecting via SSH from root@MHA-M1(10.180.2.163:22) to root@MHA-S1(10.180.2.164:22)..
Mon Aug 14 18:07:02 2017 - [debug]   ok.
Mon Aug 14 18:07:02 2017 - [debug] Connecting via SSH from root@MHA-M1(10.180.2.163:22) to root@MHA-S2(10.180.2.165:22)..
Mon Aug 14 18:07:03 2017 - [debug]   ok.
Mon Aug 14 18:07:03 2017 - [debug] Connecting via SSH from root@MHA-S1(10.180.2.164:22) to root@MHA-M1(10.180.2.163:22)..
Mon Aug 14 18:07:03 2017 - [debug]   ok.
Mon Aug 14 18:07:03 2017 - [debug] Connecting via SSH from root@MHA-S1(10.180.2.164:22) to root@MHA-S2(10.180.2.165:22)..
Mon Aug 14 18:07:03 2017 - [debug]   ok.
Mon Aug 14 18:07:03 2017 - [debug] Connecting via SSH from root@MHA-S2(10.180.2.165:22) to root@MHA-M1(10.180.2.163:22)..
Mon Aug 14 18:07:03 2017 - [debug]   ok.
Mon Aug 14 18:07:04 2017 - [debug] Connecting via SSH from root@MHA-S2(10.180.2.165:22) to root@MHA-S1(10.180.2.164:22)..
Mon Aug 14 18:07:04 2017 - [debug]   ok.
Mon Aug 14 18:07:04 2017 - [info] All SSH connection tests passed successfully.
Check the state of the entire replication environment
The check reported an error:
Tue Aug 8 17:46:31 2017 - [info] Checking master_ip_failover_script status:
Tue Aug 8 17:46:31 2017 - [info]   /usr/local/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=MHA-M1 --orig_master_ip=10.180.2.163 --orig_master_port=3306
Bareword "FIXME_xxx" not allowed while "strict subs" in use at /usr/local/bin/master_ip_failover line 93.
Execution of /usr/local/bin/master_ip_failover aborted due to compilation errors.
Tue Aug 8 17:46:31 2017 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln229] Failed to get master_ip_failover_script status with return code 255:0.
Tue Aug 8 17:46:31 2017 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations. at /usr/local/bin/masterha_check_repl line 48
Tue Aug 8 17:46:31 2017 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
Tue Aug 8 17:46:31 2017 - [info] Got exit code 1 (Not master dead).
There are two ways to handle failover: with a virtual IP address or with a global configuration file. MHA does not limit you to either one; the virtual IP approach involves extra software, such as keepalived, and requires modifying the master_ip_failover script. For now, comment out the option master_ip_failover_script=/usr/local/bin/master_ip_failover in app1.cnf so that the check can pass:
# master_ip_failover_script=/usr/local/bin/master_ip_failover

Tue Aug 8 17:49:40 2017 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.
Check the status of MHA Manager
[root@MHA-S2 mha]# masterha_check_status --conf=/etc/mha/app1.cnf
app1 is stopped(2:NOT_RUNNING).
Manual start
[root@MHA-S2 mha]# nohup masterha_manager --conf=/etc/mha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &
[1] 16774
[root@MHA-S2 mha]# ps -ef | grep masterha
root 16774 15297 4 17:52 pts/3 00:00:00 perl /usr/local/bin/masterha_manager --conf=/etc/mha/app1.cnf --remove_dead_master_conf --ignore_last_failover
[root@MHA-S2 mha]# masterha_check_status --conf=/etc/mha/app1.cnf
app1 (pid:16774) is running(0:PING_OK), master:MHA-M1
--remove_dead_master_conf: when a master-slave switchover happens, the section for the old master is removed from the configuration file. (If the old master is repaired after an abnormal switchover and should rejoin the MHA cluster, remember to add the server1 section back to app1.cnf.)
--manager_log: location of the manager log.
--ignore_last_failover: by default, if MHA detects consecutive outages less than 8 hours apart, it will not fail over again; this restriction avoids a ping-pong effect. After a switchover, MHA writes an app1.failover.complete file into the manager working directory configured above, and the next switchover is not triggered while that file exists, unless the file is deleted first. This parameter makes MHA ignore the file generated by the last switchover; it is set here for convenience.
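For reference, if you run without --ignore_last_failover, the marker file has to be deleted by hand before a second failover within 8 hours can proceed; assuming the manager_workdir configured above, something like:

rm -f /var/log/masterha/app1/app1.failover.complete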
Check the startup log
[root@MHA-S2 app1]# vi manager.log
Tue Aug 8 17:52:37 2017 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Aug 8 17:52:37 2017 - [info] Reading application default configuration from /etc/mha/app1.cnf..
Tue Aug 8 17:52:37 2017 - [info] Reading server configuration from /etc/mha/app1.cnf..
Tue Aug 8 17:52:37 2017 - [info] MHA::MasterMonitor version 0.57.
Tue Aug 8 17:52:38 2017 - [info] GTID failover mode = 1
Tue Aug 8 17:52:38 2017 - [info] Dead Servers:
Tue Aug 8 17:52:38 2017 - [info] Alive Servers:
Tue Aug 8 17:52:38 2017 - [info]   MHA-M1 (10.180.2.163:3306)
Tue Aug 8 17:52:38 2017 - [info]   MHA-S1 (10.180.2.164:3306)
Tue Aug 8 17:52:38 2017 - [info]   MHA-S2 (10.180.2.165:3306)
Tue Aug 8 17:52:38 2017 - [info] Alive Slaves:
Tue Aug 8 17:52:38 2017 - [info]   MHA-S1 (10.180.2.164:3306) Version=5.7.18-log (oldest major version between slaves) log-bin:enabled
Tue Aug 8 17:52:38 2017 - [info]     GTID ON
Tue Aug 8 17:52:38 2017 - [info]     Replicating from MHA-M1 (10.180.2.163:3306)
Tue Aug 8 17:52:38 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Tue Aug 8 17:52:38 2017 - [info]   MHA-S2 (10.180.2.165:3306) Version=5.7.18-log (oldest major version between slaves) log-bin:enabled
Tue Aug 8 17:52:38 2017 - [info]     GTID ON
Tue Aug 8 17:52:38 2017 - [info]     Replicating from MHA-M1 (10.180.2.163:3306)
Tue Aug 8 17:52:38 2017 - [info] Current Alive Master: MHA-M1 (10.180.2.163:3306)
Tue Aug 8 17:52:38 2017 - [info] Checking slave configurations..
Tue Aug 8 17:52:38 2017 - [info] Checking replication filtering settings..
Tue Aug 8 17:52:38 2017 - [info]   binlog_do_db= , binlog_ignore_db=
Tue Aug 8 17:52:38 2017 - [info]   Replication filtering check ok.
Tue Aug 8 17:52:38 2017 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Tue Aug 8 17:52:38 2017 - [info] Checking SSH publickey authentication settings on the current master..
Tue Aug 8 17:52:38 2017 - [info] HealthCheck: SSH to MHA-M1 is reachable.
Tue Aug 8 17:52:38 2017 - [info]
MHA-M1 (10.180.2.163:3306) (current master)
 +--MHA-S1 (10.180.2.164:3306)
 +--MHA-S2 (10.180.2.165:3306)
Tue Aug 8 17:52:38 2017 - [warning] master_ip_failover_script is not defined.
Tue Aug 8 17:52:38 2017 - [warning] shutdown_script is not defined.
Tue Aug 8 17:52:38 2017 - [info] Set master ping interval 1 seconds.
Tue Aug 8 17:52:38 2017 - [info] Set secondary check script: /usr/local/bin/masterha_secondary_check -s MHA-S1 -s MHA-S2
Tue Aug 8 17:52:38 2017 - [info] Starting ping health check on MHA-M1 (10.180.2.163:3306)..
Tue Aug 8 17:52:38 2017 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
Configure VIP
The VIP can be configured in two ways: keepalived can manage the floating of the virtual IP, or a script can bring the virtual IP up and down (in which case no keepalived- or heartbeat-like software is needed).
Only the script-managed approach is demonstrated here: modify the master_ip_failover script so that it manages the VIP.
First bind the VIP on the current master:
[root@MHA-M1 ~]# /sbin/ifconfig eth2:1 10.180.2.168/19
Script:
[root@MHA-S2 bin]# cat master_ip_failover
#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use Getopt::Long;

my (
    $command,          $ssh_user,       $orig_master_host,
    $orig_master_ip,   $orig_master_port,
    $new_master_host,  $new_master_ip,  $new_master_port
);
my $vip = '10.180.2.168/19';
my $key = '1';
my $ssh_start_vip = "/sbin/ifconfig eth2:$key $vip";
my $ssh_stop_vip  = "/sbin/ifconfig eth2:$key down";

GetOptions(
    'command=s'          => \$command,
    'ssh_user=s'         => \$ssh_user,
    'orig_master_host=s' => \$orig_master_host,
    'orig_master_ip=s'   => \$orig_master_ip,
    'orig_master_port=i' => \$orig_master_port,
    'new_master_host=s'  => \$new_master_host,
    'new_master_ip=s'    => \$new_master_ip,
    'new_master_port=i'  => \$new_master_port,
);

exit &main();

sub main {
    print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";
    if ( $command eq "stop" || $command eq "stopssh" ) {
        my $exit_code = 1;
        eval {
            print "Disabling the VIP on old master: $orig_master_host \n";
            &stop_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn "Got Error: $@\n";
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "start" ) {
        my $exit_code = 10;
        eval {
            print "Enabling the VIP - $vip on the new master - $new_master_host \n";
            &start_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn $@;
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "status" ) {
        print "Checking the Status of the script.. OK \n";
        exit 0;
    }
    else {
        &usage();
        exit 1;
    }
}

sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}

sub stop_vip() {
    return 0 unless ($ssh_user);
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
    print
"Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}
Uncomment the master_ip_failover_script line in the app1.cnf file and run the check again:
[root@MHA-S2 bin]# masterha_check_repl --conf=/etc/mha/app1.cnf
Wed Aug 9 10:49:42 2017 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Wed Aug 9 10:49:42 2017 - [info] Reading application default configuration from /etc/mha/app1.cnf..
Wed Aug 9 10:49:42 2017 - [info] Reading server configuration from /etc/mha/app1.cnf..
Wed Aug 9 10:49:42 2017 - [info] MHA::MasterMonitor version 0.57.
Wed Aug 9 10:49:43 2017 - [info] GTID failover mode = 1
Wed Aug 9 10:49:43 2017 - [info] Dead Servers:
Wed Aug 9 10:49:43 2017 - [info] Alive Servers:
Wed Aug 9 10:49:43 2017 - [info]   MHA-M1 (10.180.2.163:3306)
Wed Aug 9 10:49:43 2017 - [info]   MHA-S1 (10.180.2.164:3306)
Wed Aug 9 10:49:43 2017 - [info]   MHA-S2 (10.180.2.165:3306)
Wed Aug 9 10:49:43 2017 - [info] Alive Slaves:
Wed Aug 9 10:49:43 2017 - [info]   MHA-S1 (10.180.2.164:3306) Version=5.7.18-log (oldest major version between slaves) log-bin:enabled
Wed Aug 9 10:49:43 2017 - [info]     GTID ON
Wed Aug 9 10:49:43 2017 - [info]     Replicating from MHA-M1 (10.180.2.163:3306)
Wed Aug 9 10:49:43 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Wed Aug 9 10:49:43 2017 - [info]   MHA-S2 (10.180.2.165:3306) Version=5.7.18-log (oldest major version between slaves) log-bin:enabled
Wed Aug 9 10:49:43 2017 - [info]     GTID ON
Wed Aug 9 10:49:43 2017 - [info]     Replicating from MHA-M1 (10.180.2.163:3306)
Wed Aug 9 10:49:43 2017 - [info] Current Alive Master: MHA-M1 (10.180.2.163:3306)
Wed Aug 9 10:49:43 2017 - [info] Checking slave configurations..
Wed Aug 9 10:49:43 2017 - [info] Checking replication filtering settings..
Wed Aug 9 10:49:43 2017 - [info]   binlog_do_db= , binlog_ignore_db=
Wed Aug 9 10:49:43 2017 - [info]   Replication filtering check ok.
Wed Aug 9 10:49:43 2017 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Wed Aug 9 10:49:43 2017 - [info] Checking SSH publickey authentication settings on the current master..
Wed Aug 9 10:49:43 2017 - [info] HealthCheck: SSH to MHA-M1 is reachable.
Wed Aug 9 10:49:43 2017 - [info]
MHA-M1 (10.180.2.163:3306) (current master)
 +--MHA-S1 (10.180.2.164:3306)
 +--MHA-S2 (10.180.2.165:3306)
Wed Aug 9 10:49:43 2017 - [info] Checking replication health on MHA-S1..
Wed Aug 9 10:49:43 2017 - [info]  ok.
Wed Aug 9 10:49:43 2017 - [info] Checking replication health on MHA-S2..
Wed Aug 9 10:49:43 2017 - [info]  ok.
Wed Aug 9 10:49:43 2017 - [info] Checking master_ip_failover_script status:
Wed Aug 9 10:49:43 2017 - [info]   /usr/local/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=MHA-M1 --orig_master_ip=10.180.2.163 --orig_master_port=3306

IN SCRIPT TEST====/sbin/ifconfig eth2:1 down==/sbin/ifconfig eth2:1 10.180.2.168/19===
Checking the Status of the script.. OK
Wed Aug 9 10:49:43 2017-[info] OK.
Wed Aug 9 10:49:43 2017-[warning] shutdown_script is not defined.
Wed Aug 9 10:49:43 2017-[info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.
The above completes the MHA installation and configuration; below is a simple test.
(1) failover test
Manually kill the mysqld process on the master and observe the switchover.
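To simulate the crash on MHA-M1, something like the following can be used (a sketch; if mysqld was started through mysqld_safe, kill that watchdog too, or it may restart mysqld before MHA reacts):

# on MHA-M1
killall -9 mysqld_safe mysqld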
The switchover log (not reproduced here) shows that an MHA switchover mainly goes through the following steps:
1. Configuration check phase: check the configuration of the entire cluster
2. Handling of the downed master, including removing the virtual IP and shutting down the host (not implemented here yet; still to be studied)
3. Copy the differential relay log between the dead master and the most up-to-date slave, and save it to MHA Manager's working directory
4. Identify the slave with the latest updates
5. Apply the binlog events saved from the master
6. Promote one slave to be the new master
7. Point the other slaves at the new master and resume replication
Note:
1. After a switchover, you will find that the MHA Manager monitoring process exits by itself. The official site gives the following explanation and solution:
Running MHA Manager from daemontools: Currently MHA Manager process does not run as a daemon. If failover completed successfully or the master process was killed by accident, the manager stops working. To run as a daemon, daemontools or any external daemon program can be used. Here is an example to run from daemontools.
Here, we launch it with a shell script, which makes it easy to bring the monitoring program back up:
[root@MHA-S2 bin]# more manager.sh
#!/bin/sh
nohup masterha_manager --conf=/etc/mha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &
2. After the dead master has been repaired, it can also rejoin the MHA cluster and restore the original topology.
Old Master:
root@localhost:mysql3306.sock [tt] > show master status\G
*************************** 1. row ***************************
             File: 3306-binlog.000004
         Position: 194
     Binlog_Do_DB:
 Binlog_Ignore_DB:
Executed_Gtid_Set: a5757eae-7981-11e7-82c7-005056b662d3:1-322
1 row in set (0.00 sec)
Existing master:
root@localhost:mysql3306.sock [tt] > show master status\G
*************************** 1. row ***************************
             File: 3306-binlog.000003
         Position: 61945043
     Binlog_Do_DB:
 Binlog_Ignore_DB:
Executed_Gtid_Set: 1c2dc99f-7b57-11e7-a280-005056b665cb:1-3,
a5757eae-7981-11e7-82c7-005056b662d3:1-322
1 row in set (0.00 sec)
Thanks to GTID, we can switch it over directly with change master. First compare the data:
Old master:
root@localhost:mysql3306.sock [tt] > select * from t1;
+----+------+
| id | c1   |
+----+------+
|  1 | a1   |
|  2 | a2   |
|  3 | a3   |
|  4 | a4   |
+----+------+
4 rows in set (0.02 sec)
New master:
root@localhost:mysql3306.sock [tt] > select * from t1;
+----+------+
| id | c1   |
+----+------+
|  1 | a1   |
|  2 | a2   |
|  3 | a3   |
|  4 | a4   |
|  5 | a5   |
+----+------+
On the old master, run change master directly:
change master to master_host='MHA-S1', master_user='repl', master_password='123456', master_port=3306, master_auto_position=1;
Run start slave and look at the output:
root@localhost:mysql3306.sock [tt] > show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: MHA-S1
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: 3306-binlog.000003
          Read_Master_Log_Pos: 61945043
               Relay_Log_File: MHA-M1-relay-bin.000004
                Relay_Log_Pos: 715
        Relay_Master_Log_File: 3306-binlog.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
Check whether the missing data is filled in:
root@localhost:mysql3306.sock [tt] > select * from t1;
+----+------+
| id | c1   |
+----+------+
|  1 | a1   |
|  2 | a2   |
|  3 | a3   |
|  4 | a4   |
|  5 | a5   |
+----+------+
The data has caught up, so rejoining the replication is no problem.
Finally, modify app1.cnf to add the server1 section back:
[server1]
hostname=MHA-M1
port=3306
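Alternatively, the bundled masterha_conf_host tool can write the block for you; a sketch of the invocation (flags as documented for MHA 0.5x, verify against your version):

masterha_conf_host --command=add --conf=/etc/mha/app1.cnf --block=server1 --hostname=MHA-M1 --params="port=3306"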
Restart the monitor and check the MHA status
[root@MHA-S2 tmp]# masterha_check_repl --conf=/etc/mha/app1.cnf
Sat Aug 12 20:37:01 2017 - [info] Replication filtering check ok.
Sat Aug 12 20:37:01 2017 - [error][/usr/local/share/perl5/MHA/Server.pm, ln398] MHA-M1 (10.180.2.163:3306): User repl does not exist or does not have REPLICATION SLAVE privilege! Other slaves can not start replication from this host.
Sat Aug 12 20:37:01 2017 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations. at /usr/local/share/perl5/MHA/ServerManager.pm line 1403
Sat Aug 12 20:37:01 2017 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
Sat Aug 12 20:37:01 2017 - [info] Got exit code 1 (Not master dead).
The check reports a permission problem with the replication user; fix it right away:
MHA-M1:
set session sql_log_bin=OFF;
grant replication slave on *.* to repl@'%' identified by '123456';
set session sql_log_bin=ON;
Perform the MHA status check again:
masterha_check_repl --conf=/etc/mha/app1.cnf
Sat Aug 12 20:41:14 2017 - [info] Checking replication health on MHA-M1..
Sat Aug 12 20:41:14 2017 - [info]  ok.
Sat Aug 12 20:41:14 2017 - [info] Checking replication health on MHA-S2..
Sat Aug 12 20:41:14 2017 - [info]  ok.
Sat Aug 12 20:41:14 2017 - [info] Checking master_ip_failover_script status:
Sat Aug 12 20:41:14 2017 - [info]   /usr/local/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=MHA-S1 --orig_master_ip=10.180.2.164 --orig_master_port=3306

IN SCRIPT TEST====/sbin/ifconfig eth2:1 down==/sbin/ifconfig eth2:1 10.180.2.168/24===

Checking the Status of the script.. OK
Sat Aug 12 20:41:15 2017 - [info]  OK.
Sat Aug 12 20:41:15 2017 - [warning] shutdown_script is not defined.
Sat Aug 12 20:41:15 2017 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.
Finally, start the monitoring program.
[root@MHA-S2 bin]# nohup manager.sh &
[root@MHA-S2 bin]# masterha_check_status --conf=/etc/mha/app1.cnf
app1 (pid:32084) is running(0:PING_OK), master:MHA-S1
(2) Manual online switching test
In many cases it is necessary to migrate an existing master to another machine: for example, the master's hardware is failing, the RAID controller needs rebuilding, or the master is being moved to better-performing hardware. Maintenance on the master degrades performance, and a full shutdown at least stops writes; moreover, blocking or killing currently running sessions can cause data inconsistencies between hosts. MHA provides fast switching with graceful blocking of writes: the switchover takes only 0.5-2 seconds, during which writes are blocked. In many cases a 0.5-2 second write block is acceptable, so the master can be switched without scheduling a maintenance window.
The general process of online switching of MHA:
1. Detect the replication settings and identify the current master
2. Identify the new master
3. Block writes to the current master
4. Wait for all slaves to catch up with replication
5. Grant writes on the new master
6. Repoint the other slaves to the new master
Note that the application architecture needs to consider the following two issues when switching online:
1. Automatically identifying the master and the slaves (the master may move between machines): this is basically solved if a VIP is used.
2. Load balancing (you can define an approximate read-write ratio and the share of load each machine can bear, which has to be reconsidered when a machine leaves the cluster).
To guarantee full data consistency and complete the switchover as quickly as possible, MHA requires all of the following conditions to hold, otherwise the online switchover fails (a quick check command follows the list):
1. The IO threads of all slaves are running
2. The SQL threads of all slaves are running
3. Seconds_Behind_Master in the show slave status output of every slave is no more than running_updates_limit seconds (if running_updates_limit is not specified for the switch, it defaults to 1 second)
4. On the master, no update statement in the show processlist output has been running longer than running_updates_limit seconds
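A quick way to eyeball conditions 1-3 on each slave before switching (a sketch; substitute your own credentials):

mysql -uroot -p123456 -e "show slave status\G" | egrep 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'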
The online switching steps are as follows:
First stop the monitoring program:
[root@MHA-S2 app1]# masterha_stop --conf=/etc/mha/app1.cnf
Stopped app1 successfully.
Modify the master_ip_online_change script as follows:
[root@MHA-S2 bin]# more master_ip_online_change
#!/usr/bin/env perl

# Copyright (C) 2011 DeNA Co.,Ltd.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA

## Note: This is a sample script and is not complete. Modify the script based on your environment.

use strict;
use warnings FATAL => 'all';

use Getopt::Long;
use MHA::DBHelper;
use MHA::NodeUtil;
use Time::HiRes qw( sleep gettimeofday tv_interval );
use Data::Dumper;

my $_tstart;
my $_running_interval = 0.1;
my (
  $command,          $orig_master_host, $orig_master_ip,
  $orig_master_port, $orig_master_user,
  $new_master_host,  $new_master_ip,    $new_master_port,
  $new_master_user,
);

my $vip = '10.180.2.168/19';    # Virtual IP
my $key = "1";
my $ssh_start_vip = "/sbin/ifconfig eth2:$key $vip";
my $ssh_stop_vip  = "/sbin/ifconfig eth2:$key down";
my $ssh_user      = "root";
my $new_master_password  = '123456';
my $orig_master_password = '123456';

GetOptions(
  'command=s'           => \$command,
  #'ssh_user=s'          => \$ssh_user,
  'orig_master_host=s'  => \$orig_master_host,
  'orig_master_ip=s'    => \$orig_master_ip,
  'orig_master_port=i'  => \$orig_master_port,
  'orig_master_user=s'  => \$orig_master_user,
  #'orig_master_password=s' => \$orig_master_password,
  'new_master_host=s'   => \$new_master_host,
  'new_master_ip=s'     => \$new_master_ip,
  'new_master_port=i'   => \$new_master_port,
  'new_master_user=s'   => \$new_master_user,
  #'new_master_password=s' => \$new_master_password,
);

exit &main();

sub current_time_us {
  my ( $sec, $microsec ) = gettimeofday();
  my $curdate = localtime($sec);
  return $curdate . " " . sprintf( "%06d", $microsec );
}

sub sleep_until {
  my $elapsed = tv_interval($_tstart);
  if ( $_running_interval > $elapsed ) {
    sleep( $_running_interval - $elapsed );
  }
}

sub get_threads_util {
  my $dbh                    = shift;
  my $my_connection_id       = shift;
  my $running_time_threshold = shift;
  my $type                   = shift;
  $running_time_threshold = 0 unless ($running_time_threshold);
  $type                   = 0 unless ($type);
  my @threads;

  my $sth = $dbh->prepare("SHOW PROCESSLIST");
  $sth->execute();

  while ( my $ref = $sth->fetchrow_hashref() ) {
    my $id         = $ref->{Id};
    my $user       = $ref->{User};
    my $host       = $ref->{Host};
    my $command    = $ref->{Command};
    my $state      = $ref->{State};
    my $query_time = $ref->{Time};
    my $info       = $ref->{Info};
    $info =~ s/^\s*(.*?)\s*$/$1/ if defined($info);
    next if ( $my_connection_id == $id );
    next if ( defined($query_time) && $query_time < $running_time_threshold );
    next if ( defined($command)    && $command eq "Binlog Dump" );
    next if ( defined($user)       && $user eq "system user" );
    next
      if ( defined($command)
      && $command eq "Sleep"
      && defined($query_time)
      && $query_time >= 1 );

    if ( $type >= 1 ) {
      next if ( defined($command) && $command eq "Sleep" );
      next if ( defined($command) && $command eq "Connect" );
    }

    if ( $type >= 2 ) {
      next if ( defined($info) && $info =~ m/^select/i );
      next if ( defined($info) && $info =~ m/^show/i );
    }

    push @threads, $ref;
  }
  return @threads;
}

sub main {
  if ( $command eq "stop" ) {
    ## Gracefully killing connections on the current master
    # 1. Set read_only= 1 on the new master
    # 2. DROP USER so that no app user can establish new connections
    # 3. Set read_only= 1 on the current master
    # 4. Kill current queries
    # * Any database access failure will result in script die.
    my $exit_code = 1;
    eval {
      ## Setting read_only=1 on the new master (to avoid accident)
      my $new_master_handler = new MHA::DBHelper();

      # args: hostname, port, user, password, raise_error(die_on_error)_or_not
      $new_master_handler->connect( $new_master_ip, $new_master_port,
        $new_master_user, $new_master_password, 1 );
      print current_time_us() . " Set read_only on the new master.. ";
      $new_master_handler->enable_read_only();
      if ( $new_master_handler->is_read_only() ) {
        print "ok.\n";
      }
      else {
        die "Failed!\n";
      }
      $new_master_handler->disconnect();

      # Connecting to the orig master, die if any database error happens
      my $orig_master_handler = new MHA::DBHelper();
      $orig_master_handler->connect( $orig_master_ip, $orig_master_port,
        $orig_master_user, $orig_master_password, 1 );

      ## Drop application user so that nobody can connect. Disabling per-session binlog beforehand
      #$orig_master_handler->disable_log_bin_local();
      #print current_time_us() . " Dropping app user on the orig master..\n";
      #FIXME_xxx_drop_app_user($orig_master_handler);

      ## Waiting for N * 100 milliseconds so that current connections can exit
      my $time_until_read_only = 15;
      $_tstart = [gettimeofday];
      my @threads = get_threads_util( $orig_master_handler->{dbh},
        $orig_master_handler->{connection_id} );
      while ( $time_until_read_only > 0 && $#threads >= 0 ) {
        if ( $time_until_read_only % 5 == 0 ) {
          printf
"%s Waiting all running %d threads are disconnected.. (max %d milliseconds)\n",
            current_time_us(), $#threads + 1, $time_until_read_only * 100;
          if ( $#threads < 5 ) {
            print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump . "\n"
              foreach (@threads);
          }
        }
        sleep_until();
        $_tstart = [gettimeofday];
        $time_until_read_only--;
        @threads = get_threads_util( $orig_master_handler->{dbh},
          $orig_master_handler->{connection_id} );
      }

      ## Setting read_only=1 on the current master so that nobody (except SUPER) can write
      print current_time_us() . " Set read_only=1 on the orig master.. ";
      $orig_master_handler->enable_read_only();
      if ( $orig_master_handler->is_read_only() ) {
        print "ok.\n";
      }
      else {
        die "Failed!\n";
      }

      ## Waiting for M * 100 milliseconds so that current update queries can complete
      my $time_until_kill_threads = 5;
      @threads = get_threads_util( $orig_master_handler->{dbh},
        $orig_master_handler->{connection_id} );
      while ( $time_until_kill_threads > 0 && $#threads >= 0 ) {
        if ( $time_until_kill_threads % 5 == 0 ) {
          printf
"%s Waiting all running %d queries are disconnected.. (max %d milliseconds)\n",
            current_time_us(), $#threads + 1, $time_until_kill_threads * 100;
          if ( $#threads < 5 ) {
            print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump . "\n"
              foreach (@threads);
          }
        }
        sleep_until();
        $_tstart = [gettimeofday];
        $time_until_kill_threads--;
        @threads = get_threads_util( $orig_master_handler->{dbh},
          $orig_master_handler->{connection_id} );
      }

      print "Disabling the VIP on old master: $orig_master_host \n";
      &stop_vip();

      ## Terminating all threads
      print current_time_us() . " Killing all application threads..\n";
      $orig_master_handler->kill_threads(@threads) if ( $#threads >= 0 );
      print current_time_us() . " done.\n";
      #$orig_master_handler->enable_log_bin_local();
      $orig_master_handler->disconnect();

      ## After finishing the script, MHA executes FLUSH TABLES WITH READ LOCK
      $exit_code = 0;
    };
    if ($@) {
      warn "Got Error: $@\n";
      exit $exit_code;
    }
    exit $exit_code;
  }
  elsif ( $command eq "start" ) {
    ## Activating master ip on the new master
    # 1. Create app user with write privileges
    # 2. Moving backup script if needed
    # 3. Register new master's ip to the catalog database

# We don't return error even though activating updatable accounts/ip failed so that we don't interrupt slaves' recovery.
# If exit code is 0 or 10, MHA does not abort
    my $exit_code = 10;
    eval {
      my $new_master_handler = new MHA::DBHelper();

      # args: hostname, port, user, password, raise_error_or_not
      $new_master_handler->connect( $new_master_ip, $new_master_port,
        $new_master_user, $new_master_password, 1 );

      ## Set read_only=0 on the new master
      #$new_master_handler->disable_log_bin_local();
      print current_time_us() . " Set read_only=0 on the new master.\n";
      $new_master_handler->disable_read_only();

      ## Creating an app user on the new master
      #print current_time_us() . " Creating app user on the new master..\n";
      #FIXME_xxx_create_app_user($new_master_handler);
      #$new_master_handler->enable_log_bin_local();
      $new_master_handler->disconnect();

      ## Update master ip on the catalog database, etc
      print "Enabling the VIP - $vip on the new master - $new_master_host \n";
      &start_vip();
      $exit_code = 0;
    };
    if ($@) {
      warn "Got Error: $@\n";
      exit $exit_code;
    }
    exit $exit_code;
  }
  elsif ( $command eq "status" ) {
    # do nothing
    exit 0;
  }
  else {
    &usage();
    exit 1;
  }
}

# A simple system call that enables the VIP on the new master
sub start_vip() {
  `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}

# A simple system call that disables the VIP on the old master
sub stop_vip() {
  `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
  print
"Usage: master_ip_online_change --command=start|stop|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
  die;
}
Perform the switchover:
[root@MHA-S2 tmp]# masterha_master_switch --conf=/etc/mha/app1.cnf --master_state=alive --new_master_host=MHA-M1 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
The meaning of the parameters is:
--orig_master_is_new_slave: during the switchover, turn the original master into a slave of the new master. Without this parameter, the original master is simply left out of the new replication group.
--running_updates_limit=10000: during a switchover or failover, if the candidate master is delayed the switch cannot succeed; this parameter allows the switch to proceed as long as the delay is within the given range (in seconds), though the switchover time then depends on the amount of relay log to apply during recovery.
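For reference, masterha_master_switch can also perform a manual failover when the master is already dead; a sketch of that invocation (flags per the MHA documentation, host names from this environment):

masterha_master_switch --master_state=dead --conf=/etc/mha/app1.cnf --dead_master_host=MHA-M1 --dead_master_port=3306 --new_master_host=MHA-S1 --ignore_last_failover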
Check the status of each machine after switching:
S2:
root@localhost:mysql3306.sock [tt] > show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: MHA-M1
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: 3306-binlog.000004
          Read_Master_Log_Pos: 748
               Relay_Log_File: MHA-S2-relay-bin.000002
                Relay_Log_Pos: 420
        Relay_Master_Log_File: 3306-binlog.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
S1:
root@localhost:mysql3306.sock [tt] > show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: MHA-M1
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: 3306-binlog.000004
          Read_Master_Log_Pos: 748
               Relay_Log_File: MHA-S1-relay-bin.000002
                Relay_Log_Pos: 420
        Relay_Master_Log_File: 3306-binlog.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
M1:
root@localhost:mysql3306.sock [tt] > show slave status\G
Empty set (0.00 sec)
(The log of the online switchover is omitted here.)