
Implementing a zabbix high-availability cluster with pacemaker + corosync


1. What is pacemaker
   1. A brief description of pacemaker
   2. The origin of pacemaker
2. Characteristics of pacemaker
3. Internal structure of pacemaker
   1. Cluster component description
   2. Functional overview
4. CentOS 6.x + pacemaker + corosync for zabbix high availability
   1. Environment description
5. Install pacemaker and corosync (run on every node)
   1. Prerequisite: host name resolution on every node
   2. Time synchronization of each node
   3. SSH mutual trust between the nodes
   4. Turn off the firewall and SELinux
   5. Install pacemaker + corosync + pcs
6. Configure corosync
   1. Set variables
   2. Change the corosync configuration file
   3. Generate the key file
7. Install and configure cman
8. Edit cluster.conf
9. Check the configuration and enable the services at boot
10. Configure resources
11. Verification
12. Common commands
13. Zabbix startup script

1. What is pacemaker

1. A brief description of pacemaker

Pacemaker (literally, "pacemaker") is a cluster resource manager. It achieves maximum availability of cluster services (also called resources) through node- and resource-level failure detection and recovery, using the messaging and membership capabilities provided by your preferred cluster infrastructure (OpenAIS or Heartbeat).

It can manage clusters of practically any size and comes with a powerful dependency model that lets administrators express the relationships between cluster resources (including ordering and location) precisely. Almost anything that can be scripted can be managed as a resource in a pacemaker cluster.

To be clear: pacemaker is a resource manager and does not itself provide heartbeat messaging; this is a common misunderstanding worth pointing out. Pacemaker is the continuation of the CRM (the Heartbeat v2 resource manager), which was originally part of Heartbeat and has since become an independent project.

2. The origin of pacemaker

As is well known, when Heartbeat reached version 3 it was split into several projects, of which pacemaker is the resource manager.

Components after the Heartbeat 3.0 split:

Heartbeat: the original message/communication layer was separated into its own heartbeat project; the new heartbeat is only responsible for maintaining the membership information of the cluster nodes and the communication between them.

Cluster Glue: a middle layer that ties heartbeat and pacemaker together; it consists mainly of two parts, the LRM and STONITH.

Resource Agents: a collection of scripts used to start, stop, and monitor services. These scripts are called by the LRM so that the various resources can be started, stopped, monitored, and so on.

Pacemaker: the Cluster Resource Manager (CRM), the control center that manages the entire HA cluster. Clients configure, manage, and monitor the whole cluster through pacemaker.

2. Characteristics of pacemaker

Failure detection and recovery at the host and application level

Support for almost any redundant configuration

Support for multiple cluster configuration modes

Configurable policies for handling quorum loss (when multiple machines fail)

Support for application startup/shutdown ordering

Support for applications that must, or must not, run on the same machine

Support for applications with multiple modes (such as master/slave)

The cluster's response to any failure or cluster state can be tested

3. Internal structure of pacemaker

1. Cluster component description:

Stonithd: the fencing daemon.

Lrmd: the local resource management daemon. It provides a common interface to the supported resource types and calls the resource agents (scripts) directly.

Pengine: the policy engine. It computes the next state of the cluster based on the current state and the configuration, and generates a transition graph containing a list of actions and their dependencies.

CIB: the cluster information base. It contains the definitions of all cluster options, nodes, and resources, their relationships to one another, and their current status; it is synchronized to all cluster nodes.

CRMD: the cluster resource management daemon. It is mainly a message broker between the PEngine and the LRM, and it also elects a cluster leader (the DC, designated coordinator) to coordinate activities (including starting and stopping resources).

OpenAIS: the messaging and membership layer of OpenAIS.

Heartbeat: a heartbeat/messaging layer, an alternative to OpenAIS.

CCM: consensus cluster membership, the Heartbeat membership layer.

CMAN is the core of the Red Hat RHCS suite; CCS is the CMAN cluster configuration system, which manages cluster.conf. cluster.conf is in effect the configuration file for openais, and CCS maps it to openais.

2. Functional overview

The CIB uses XML to represent the configuration and current status of all resources in the cluster. The contents of the CIB are automatically synchronized across the whole cluster. The PEngine uses it to compute the ideal state of the cluster and generates a list of instructions, which is fed to the DC (designated coordinator). All nodes in a Pacemaker cluster elect a DC node as the primary decision node; if the elected DC node goes down, a new DC is quickly established on the remaining nodes. The DC passes the policies generated by the PEngine, via the cluster messaging infrastructure, to the local LRMd (local resource management daemon) or to the CRMD on the other nodes. When a node in the cluster goes down, the PEngine recalculates the ideal state. In some cases it may be necessary to power off a node to protect shared data or to complete resource recovery; for this, Pacemaker relies on stonithd. STONITH can "shoot" other nodes, usually through a remote power switch. Pacemaker models STONITH devices as resources stored in the CIB, making it easy to monitor them for failures or outages.
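As a concrete illustration of the pieces described above, the CIB and the elected DC can be inspected from any cluster node with the standard pacemaker command-line tools (a minimal sketch; the output depends on the actual cluster state):

crm_mon -1                     # one-shot overview: nodes, resources and the current DC
cibadmin --query | head -20    # the first lines of the CIB XML that the PEngine works from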

4. CentOS 6.x + pacemaker + corosync for zabbix high availability

1. Environment description

OS: CentOS 6.7 x86_64 minimal

APP: Pacemaker 1.1.15

LNMP + Zabbix 3.4.1

Corosync + pcs + cman

IP ADDR: vip - 192.168.8.47/20

zabbix01 - 192.168.8.61/20

zabbix02 - 192.168.8.63/20

zabbixdb - 192.168.8.120/20

PS: the IP addresses need to be adapted to your own environment; the VIP and the zabbix nodes should be in the same network segment.

Topological structure

PS: the following sections cover only the installation and configuration of pacemaker and corosync. For the zabbix + LNMP part of the environment, please refer to the previously published articles "zabbix3.2 compilation and installation" or "zabbix High availability".

5. Install pacemaker and corosync (run on every node)

1. Prerequisite: every node can resolve the other nodes' host names.

vim /etc/hosts

# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1       localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.8.61  zabbix01.okooo.cn zabbix01
192.168.8.63  zabbix02.okooo.cn zabbix02
192.168.8.120 zbxdb.okooo.cn zbxdb
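A quick way to confirm that every node resolves the others (a simple check, not part of the original steps):

for h in zabbix01 zabbix02 zbxdb; do ping -c 1 $h > /dev/null && echo "$h OK"; done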

2. Time synchronization of each node

ntpdate 210.72.145.44
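If the clocks tend to drift, the synchronization can also be repeated periodically from root's crontab (an optional addition, not in the original article; the NTP server is the one used above):

*/30 * * * * /usr/sbin/ntpdate 210.72.145.44 > /dev/null 2>&1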

3. Set up SSH mutual trust between the nodes

ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ''

ssh-copy-id -i ~/.ssh/id_rsa.pub root@zabbix01.okooo.cn    # repeat for zabbix02.okooo.cn and zbxdb.okooo.cn
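To verify that the trust relationship works without a password prompt (a quick check using the host names from the hosts file above):

ssh zabbix02.okooo.cn hostname
ssh zbxdb.okooo.cn hostname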

4. Turn off the firewall and SELinux

# cat /etc/selinux/config

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing  - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled   - SELinux is fully disabled.
SELINUX=disabled
# SELINUXTYPE= type of policy in use. Possible values are:
#     targeted - Only targeted network daemons are protected.
#     strict   - Full SELinux protection.
SELINUXTYPE=targeted

# /etc/init.d/iptables status

iptables: Firewall is not running.
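If iptables is still running or SELinux is still enforcing on a node, they can be turned off with the usual CentOS 6 commands (listed here for completeness; they are not shown in the original):

service iptables stop
chkconfig iptables off
setenforce 0    # immediate effect; the SELINUX=disabled setting above takes over after a reboot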

5. Install pacemaker+corosync+pcs

yum install -y pacemaker corosync pcs

6. Configure corosync

1. Set variables

export ais_port=4000
export ais_mcast=226.94.1.1
export ais_addr=192.168.15.0
env | grep ais

2. Change the corosync configuration file

cp /etc/corosync/corosync.conf.example /etc/corosync/corosync.conf

sed -i.bak "s/.*mcastaddr:.*/mcastaddr:\ $ais_mcast/g" /etc/corosync/corosync.conf
sed -i.bak "s/.*mcastport:.*/mcastport:\ $ais_port/g" /etc/corosync/corosync.conf
sed -i.bak "s/.*bindnetaddr:.*/bindnetaddr:\ $ais_addr/g" /etc/corosync/corosync.conf
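To confirm the substitutions took effect before looking at the whole file (a quick check, not in the original):

grep -E "mcastaddr|mcastport|bindnetaddr" /etc/corosync/corosync.conf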

cat /etc/corosync/corosync.conf

# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
    version: 2
    secauth: on          # enable authentication
    threads: 2
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.15.0   # heartbeat network segment
        mcastaddr: 226.94.1.1       # multicast address used to propagate heartbeat information
        mcastport: 4000
        ttl: 1
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: no
    logfile: /var/log/cluster/corosync.log   # log location
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}

amf {
    mode: disabled
}

# enable pacemaker
service {
    ver: 0
    name: pacemaker
}

aisexec {
    user: root
    group: root
}

Note: the meaning of every option is documented in man corosync.conf.

3. Generate the key file

Note: by default corosync-keygen reads from the /dev/random device; if the system has not collected enough entropy from IRQ interrupts, key generation can take a very long time. To save time, replace /dev/random with /dev/urandom before generating the key.

mv /dev/{random,random.bak}
ln -s /dev/urandom /dev/random
corosync-keygen

PS: all of the steps above should be run on every node.
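Note that secauth: on requires the authkey to be identical on every node. A common approach (an assumption; the article only says to run the steps on all nodes) is to generate the key on one node and copy it, together with the configuration, to the others:

scp /etc/corosync/authkey /etc/corosync/corosync.conf zabbix02:/etc/corosync/
chmod 400 /etc/corosync/authkey    # on every node; keep the key readable by root only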

7. Install and configure cman

yum install -y cman

sed -i.sed "s/.*CMAN_QUORUM_TIMEOUT=.*/CMAN_QUORUM_TIMEOUT=0/g" /etc/sysconfig/cman

# cat /etc/sysconfig/cman

# CMAN_CLUSTER_TIMEOUT -- amount of time to wait to join a cluster
# before giving up.  If CMAN_CLUSTER_TIMEOUT is positive, then we will
# wait CMAN_CLUSTER_TIMEOUT seconds before giving up and failing if
# we can't join a cluster.  If CMAN_CLUSTER_TIMEOUT is zero, then we
# will wait indefinitely for a cluster join.  If CMAN_CLUSTER_TIMEOUT is
# negative, do not check to see if we have joined a cluster.
#CMAN_CLUSTER_TIMEOUT=5

# CMAN_QUORUM_TIMEOUT -- amount of time to wait for a quorate cluster on
# startup.  Quorum is needed by many other applications, so we may as
# well wait here.  If CMAN_QUORUM_TIMEOUT is zero, quorum will
# be ignored.
CMAN_QUORUM_TIMEOUT=0

# CMAN_SHUTDOWN_TIMEOUT -- amount of time to wait for cman to become a
# cluster member before calling 'cman_tool' leave during shutdown.
# The default is 60 seconds
#CMAN_SHUTDOWN_TIMEOUT=60

# CMAN_NOTIFYD_START - control the startup behaviour for cmannotifyd
# the variable can take 3 values:
# yes                   | will always start cmannotifyd
# no                    | will never start cmannotifyd
# conditional (default) | will start cmannotifyd only if scriptlets
#                         are found in /etc/cluster/cman-notify.d
#CMAN_NOTIFYD_START=conditional

# CMAN_SSHD_START - control sshd startup behaviour
# the variable can take 2 values:
# yes          | cman will start sshd as early as possible
# no (default) | cman will not start sshd
#CMAN_SSHD_START=no

# DLM_CONTROLD_OPTS -- allow extra options to be passed to dlm_controld daemon.
#DLM_CONTROLD_OPTS=""

# Allow tuning of DLM kernel config.
# do NOT change unless instructed to do so.
#DLM_LKBTBL_SIZE=""
#DLM_RSBTBL_SIZE=""
#DLM_DIRTBL_SIZE=""
#DLM_TCP_PORT=""

# FENCE_JOIN_TIMEOUT -- seconds to wait for fence domain join to
# complete.  If the join hasn't completed in this time, fence_tool join
# exits with an error, and this script exits with an error.  To wait
# indefinitely set the value to -1.
#FENCE_JOIN_TIMEOUT=20

# FENCED_MEMBER_DELAY -- amount of time to delay fence_tool join to allow
# all nodes in cluster.conf to become cluster members.  In seconds.
#FENCED_MEMBER_DELAY=45

# FENCE_JOIN -- boolean value used to control whether or not this node
# should join the fence domain. If FENCE_JOIN is set to "no", then
# the script will not attempt to join the fence domain. If FENCE_JOIN is
# set to "yes", then the script will attempt to join the fence domain.
# If FENCE_JOIN is set to any other value, the default behavior is
# to join the fence domain (equivalent to "yes").
# When setting FENCE_JOIN to "no", it is important to also set
# DLM_CONTROLD_OPTS="-f0" (at least) for correct operation.
# Please note that clusters without fencing are not
# supported by Red Hat except for MRG installations.
#FENCE_JOIN="yes"

# FENCED_OPTS -- allow extra options to be passed to fence daemon.
#FENCED_OPTS=""

# NETWORK_BRIDGE_SCRIPT -- script to use for xen network bridging.
# This script must exist in the /etc/xen/scripts directory.
# The default script is "network-bridge".
#NETWORK_BRIDGE_SCRIPT="network-bridge"

# CLUSTERNAME -- override clustername as specified in cluster.conf
#CLUSTERNAME=""

# NODENAME -- specify the nodename of this node. Default autodetected.
#NODENAME=""

# CONFIG_LOADER -- select default config parser.
# This can be:
# xmlconfig - read directly from cluster.conf and use ricci as default
#             config propagation method. (default)
#CONFIG_LOADER=xmlconfig

# CONFIG_VALIDATION -- select default config validation behaviour.
# This can be:
# FAIL - Use a very strict checking. The config will not be loaded if there
#        are any kind of warnings/errors.
# WARN - Same as FAIL, but will allow the config to load (this is temporarily
#        the default behaviour)
# NONE - Disable config validation. Highly discouraged.
#CONFIG_VALIDATION=WARN

# CMAN_LEAVE_OPTS -- allows extra options to be passed to cman_tool when leave
# operation is performed.
#CMAN_LEAVE_OPTS=""

# INITLOGLEVEL -- select how verbose the init script should be.
# Possible values:
# quiet           - only one line notification for start/stop operations
# terse (default) - show only required activity
# full            - show everything
#INITLOGLEVEL=terse

8. Edit cluster.conf

vim /etc/cluster/cluster.conf

# cat /etc/cluster/cluster.conf

Note: when running more than one cluster in the same network, change the multicast address.
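As a minimal sketch of what cluster.conf typically looks like for this two-node setup (assuming the cluster name zabbixcluster and the node names shown later by pcs config show; fencing is left unconfigured, matching stonith-enabled=false below):

<?xml version="1.0"?>
<cluster config_version="1" name="zabbixcluster">
  <!-- when running more clusters in the same network, change the multicast address -->
  <cman two_node="1" expected_votes="1">
    <multicast addr="226.94.1.1"/>
  </cman>
  <clusternodes>
    <clusternode name="zabbix01" nodeid="1"/>
    <clusternode name="zabbix02" nodeid="2"/>
  </clusternodes>
</cluster>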

9. Check the configuration and enable the services at boot

ccs_config_validate

service cman start

cman_tool nodes

service pacemaker start

chkconfig cman on

chkconfig pacemaker on

10. Configure resources

pcs cluster auth zabbix01 zabbix02    # authenticate the nodes to each other

pcs cluster start --all    # start all nodes in the cluster

pcs resource create ClusterIP IPaddr2 ip=192.168.8.47 cidr_netmask=32 op monitor interval=2s    # create a resource named ClusterIP of type IPaddr2; the VIP is 192.168.8.47/32, checked every 2 seconds

pcs property set stonith-enabled=false    # we have no STONITH device, so turn this property off for now

pcs resource create zabbix-server lsb:zabbix_server op monitor interval=5s    # create a resource named zabbix-server; the standard is lsb and the application is zabbix_server, checked every 5 seconds. lsb refers to the startup scripts under /etc/init.d/

pcs resource group add zabbix ClusterIP zabbix-server    # add the ClusterIP and zabbix-server resources to the zabbix resource group

pcs property set no-quorum-policy="ignore"    # when quorum is lost, ignore it and keep running resources

pcs property set default-resource-stickiness="100"    # resource stickiness of 100

pcs constraint colocation add zabbix-server ClusterIP    # colocation: keep the VIP and the service on the same node

pcs constraint order ClusterIP then zabbix-server    # ordering: bring up the VIP before starting the service

pcs constraint location ClusterIP prefers zabbix01    # ClusterIP prefers to run on zabbix01, so it fails back after node recovery

pcs constraint location zabbix-server prefers zabbix01    # zabbix-server prefers to run on zabbix01, so it fails back after node recovery
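With the resources and constraints in place, the resulting policy can be reviewed before testing failover (a quick check; the output should correspond to the configuration shown in section 12):

pcs constraint    # list the ordering, colocation and location constraints
pcs status        # nodes, resources and daemon status at a glance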

11. Verification

1. Stop the zabbix_server service on zabbix01

PS: the cluster restarts the service automatically, ensuring its high availability.
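One minimal way to run this test (the exact stop command is not shown in the original; the path assumes the LSB script from section 13):

/etc/init.d/zabbix_server stop    # run on zabbix01

Within the 5-second monitor interval pacemaker notices the failure and restarts the resource, so pcs resource soon reports both members of the group started on zabbix01 again, as below.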

root@zabbix01:~ # pcs resource

 Resource Group: zabbix
     ClusterIP      (ocf::heartbeat:IPaddr2):  Started zabbix01
     zabbix-server  (lsb:zabbix_server):       Started zabbix01

root@zabbix01:~ # ip a

1: lo: mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:50:56:bb:68:49 brd ff:ff:ff:ff:ff:ff
    inet 192.168.8.61/20 brd 192.168.15.255 scope global eth0
    inet 192.168.8.47/32 brd 192.168.15.255 scope global eth0
    inet6 fe80::250:56ff:febb:6849/64 scope link
       valid_lft forever preferred_lft forever

PS: both the VIP and the resources are running on zabbix01 at this point.

2. Stop the cluster services on zabbix01; the VIP leaves zabbix01 and zabbix02 takes over:

root@zabbix01:~ # ip a

1: lo: mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:50:56:bb:68:49 brd ff:ff:ff:ff:ff:ff
    inet 192.168.8.61/20 brd 192.168.15.255 scope global eth0
    inet6 fe80::250:56ff:febb:6849/64 scope link
       valid_lft forever preferred_lft forever

root@zabbix01:~ # ssh zabbix02 "pcs resource"

 Resource Group: zabbix
     ClusterIP      (ocf::heartbeat:IPaddr2):  Started zabbix02
     zabbix-server  (lsb:zabbix_server):       Started zabbix02

3. Start the cluster on zabbix01 again; the resources fail back:

root@zabbix01:~ # pcs cluster start zabbix01

zabbix01: Starting Cluster...

root@zabbix01:~ # ip a

1: lo: mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:50:56:bb:68:49 brd ff:ff:ff:ff:ff:ff
    inet 192.168.8.61/20 brd 192.168.15.255 scope global eth0
    inet 192.168.8.47/32 brd 192.168.15.255 scope global eth0
    inet6 fe80::250:56ff:febb:6849/64 scope link
       valid_lft forever preferred_lft forever

root@zabbix01:~ # pcs resource

 Resource Group: zabbix
     ClusterIP      (ocf::heartbeat:IPaddr2):  Started zabbix01
     zabbix-server  (lsb:zabbix_server):       Started zabbix01

PS: while the cluster services on zabbix01 are down, both the resources and the VIP are taken over by zabbix02; once zabbix01 is restored, the resources and the VIP return to zabbix01 because of the location constraints.

12. Common commands

1. View the cluster status

# pcs cluster status

Cluster Status:
 Stack: cman
 Current DC: zabbix01 (version 1.1.15-5.el6-e174ec8) - partition with quorum
 Last updated: Thu Sep 21 02:13:20 2017
 Last change: Wed Sep 20 09:13:10 2017 by root via cibadmin on zabbix01
 2 nodes and 2 resources configured

PCSD Status:
  zabbix01: Online
  zabbix02: Online

2. View the configuration

# pcs config show

Cluster Name: zabbixcluster
Corosync Nodes:
 zabbix02 zabbix01
Pacemaker Nodes:
 zabbix01 zabbix02

Resources:
 Group: zabbix
  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=192.168.8.47 cidr_netmask=32
   Operations: start interval=0s timeout=20s (ClusterIP-start-interval-0s)
               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
               monitor interval=2s (ClusterIP-monitor-interval-2s)
  Resource: zabbix-server (class=lsb type=zabbix_server)
   Operations: start interval=0s timeout=15 (zabbix-server-start-interval-0s)
               stop interval=0s timeout=15 (zabbix-server-stop-interval-0s)
               monitor interval=5s (zabbix-server-monitor-interval-5s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 timeout: 60s

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.15-5.el6-e174ec8
 default-resource-stickiness: 100
 have-watchdog: false
 last-lrm-refresh: 1505857479
 no-quorum-policy: ignore
 stonith-enabled: false

# pcs cluster cib

3. View the resources

# pcs resource

 Resource Group: zabbix
     ClusterIP      (ocf::heartbeat:IPaddr2):  Started zabbix01
     zabbix-server  (lsb:zabbix_server):       Started zabbix01

4. View the resource groups

# pcs resource group list

zabbix: ClusterIP zabbix-server
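A few other commands that come in handy with this setup (not from the original article; exact behaviour may vary slightly between pcs versions):

pcs status                               # full cluster status: nodes, resources, daemons
pcs resource move ClusterIP zabbix02     # move a resource (adds a temporary location constraint)
pcs constraint                           # list all configured constraints
pcs cluster stop zabbix01                # stop the cluster services on a single node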

13. Zabbix startup script

# cat /etc/init.d/zabbix_server

#!/bin/bash
# Location of the zabbix binary. Change the path as necessary.
DAEMON=/usr/local/zabbix/sbin/zabbix_server
NAME=`basename $DAEMON`
# Pid file of zabbix; should match the PidFile directive in the zabbix_server configuration file.
PIDFILE=/tmp/$NAME.pid
# This file's location.
SCRIPTNAME=/etc/init.d/$NAME
# Only run if the binary can be found.
test -x $DAEMON || exit 0
RETVAL=0

start() {
    echo $"Starting $NAME"
    $DAEMON
    RETVAL=0
}

stop() {
    echo $"Graceful stopping $NAME"
    [ -s "$PIDFILE" ] && kill -QUIT `cat $PIDFILE`
    RETVAL=0
}

forcestop() {
    echo $"Quick stopping $NAME"
    [ -s "$PIDFILE" ] && kill -TERM `cat $PIDFILE`
    RETVAL=$?
}

reload() {
    echo $"Graceful reloading $NAME configuration"
    [ -s "$PIDFILE" ] && kill -HUP `cat $PIDFILE`
    RETVAL=$?
}

status() {
    if [ -s $PIDFILE ]; then
        echo $"$NAME is running."
        RETVAL=0
    else
        echo $"$NAME stopped."
        RETVAL=3
    fi
}

# See how we were called.
case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    force-stop)
        forcestop
        ;;
    restart)
        stop
        start
        ;;
    reload)
        reload
        ;;
    status)
        status
        ;;
    *)
        echo $"Usage: $0 {start | stop | force-stop | restart | reload | status}"
        exit 1
esac

exit $RETVAL
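For pacemaker to manage the script as the lsb:zabbix_server resource defined earlier, it only needs to be executable on both nodes; it should not additionally be enabled with chkconfig, because the cluster decides where zabbix_server runs. A quick manual check before handing it over to the cluster (commands assumed, not from the original):

chmod 755 /etc/init.d/zabbix_server
/etc/init.d/zabbix_server status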

