SQL Server 2017 AlwaysOn on Linux configuration and maintenance (9)

2025-07-01 Update From: SLTechnology News&Howtos shulou

2.3.3 Configuring the cluster resource manager Pacemaker

Introduction to Pacemaker on Linux

On Windows Server operating systems, Windows Server Failover Clustering (WSFC) provides high availability, failure detection, and automatic failover for SQL Server AlwaysOn availability groups (AGs). WSFC is a cluster resource manager (CRM) that is responsible for maintaining a consistent image of the cluster on all nodes in the cluster. The purpose of a cluster resource manager is to provide high availability and fault tolerance for the resources running on the cluster.

On Linux, the cluster resource manager is the open source software Pacemaker. It is developed mainly by the ClusterLabs community, with Red Hat and SUSE driving the collaborative development. Pacemaker is available on most Linux distributions, but SQL Server AlwaysOn AG is currently supported only on Red Hat Enterprise Linux 7.3 and 7.4, SUSE Linux Enterprise Server 12 SP2, and Ubuntu 16.04.

The Pacemaker stack consists of the following components:

The Pacemaker software itself, which is similar to the Cluster service on Windows.

Corosync, a group communication system, which is similar to heartbeat and quorum on Windows (not to be confused with Heartbeat, a Linux program whose function is similar to that of Corosync); it is also responsible for restarting failed application processes.

libqb, a high-performance logging, tracing, inter-process communication, and polling system, which is similar to the way cluster.log is generated on Windows.

Resource agents, software that allows Pacemaker to manage services and resources, such as starting or stopping a SQL Server AlwaysOn AG resource; they are similar to cluster resource DLLs on Windows.

Fence agents, software that allows Pacemaker to isolate and fence off nodes whose abnormal behavior could affect cluster availability.

Install the Pacemaker packages on all nodes

sudo yum install pacemaker pcs fence-agents-all resource-agents

The installed packages provide the different components of the Pacemaker stack:

pcs, the Pacemaker Configuration System, a configuration tool for Pacemaker and Corosync.

fence-agents-all, a collection of all supported fence agents.

resource-agents, a repository of all resource agents that comply with the Open Cluster Framework (OCF) specification.
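As a quick sanity check after the installation, the sketch below reports which of the four packages named above are still missing on a node. The helper name missing_packages is hypothetical (it is not part of any tool mentioned here); it reads installed package names on standard input, so on RHEL it can be fed from rpm -qa.

```shell
# Packages installed in the step above.
REQUIRED_PACKAGES="pacemaker pcs fence-agents-all resource-agents"

# missing_packages (hypothetical helper): reads installed package names,
# one per line, on stdin and prints each required package that is absent.
missing_packages() {
    installed=$(cat)
    for pkg in $REQUIRED_PACKAGES; do
        # -x matches the whole line, -F treats the name as a literal string.
        if ! printf '%s\n' "$installed" | grep -xF -- "$pkg" >/dev/null; then
            printf '%s\n' "$pkg"
        fi
    done
}

# Usage on a RHEL node (prints nothing when all four packages are installed):
#   rpm -qa --qf '%{NAME}\n' | missing_packages
```

Matching whole lines keeps a package such as pcs from being reported as present just because another package name happens to contain the same letters.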

Set the password for the default user created when installing the Pacemaker and Corosync packages

Use the same password on all nodes.

sudo passwd hacluster

Enable and start the pcsd service and Pacemaker

Enabling these services allows nodes to rejoin the cluster after a reboot. Run the following commands on all nodes:

sudo systemctl enable pcsd
sudo systemctl start pcsd
sudo systemctl enable pacemaker
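Because the service commands have to be run on every node, one option is to loop over the nodes from a single workstation. The following is a minimal sketch under several assumptions: the node names node1 to node3 are hypothetical, ssh access to each node is already set up, and sudo on the nodes does not prompt for a password. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
# Hypothetical node names; replace them with your own hosts.
NODES="node1 node2 node3"

# Print each command by default; set DRY_RUN=0 to run it over ssh.
DRY_RUN="${DRY_RUN:-1}"

enable_cluster_services() {
    for node in $NODES; do
        for cmd in "systemctl enable pcsd" \
                   "systemctl start pcsd" \
                   "systemctl enable pacemaker"; do
            if [ "$DRY_RUN" = "1" ]; then
                printf 'ssh %s sudo %s\n' "$node" "$cmd"
            else
                # Assumes passwordless sudo on the remote node.
                ssh "$node" "sudo $cmd"
            fi
        done
    done
}

enable_cluster_services
```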

Create a cluster

First, to keep leftover configuration files from a previous cluster from affecting the new setup, you can run the following commands to destroy any existing cluster:

sudo pcs cluster destroy # On all nodes
sudo systemctl enable pacemaker

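Then create and configure the cluster, replacing the placeholders in angle brackets with your node names, the hacluster password, and a cluster name:

sudo pcs cluster auth <node1> <node2> <node3> -u hacluster -p <password for hacluster>
sudo pcs cluster setup --name <clusterName> <node1> <node2> <node3>
sudo pcs cluster start --all
sudo pcs cluster enable --all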
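The cluster creation commands above take node names, the hacluster password, and a cluster name. The sketch below collects these in variables so that the four steps can be reviewed before anything is executed. All values (node1 to node3, agcluster, the default password) are hypothetical placeholders, and the run helper is an illustration, not part of pcs. It prints the commands by default; set DRY_RUN=0 on one cluster node to execute them. Note that the dry run prints the password in clear text.

```shell
# All values below are hypothetical placeholders.
NODES="node1 node2 node3"
CLUSTER_NAME="agcluster"
HACLUSTER_PASSWORD="${HACLUSTER_PASSWORD:-changeme}"
DRY_RUN="${DRY_RUN:-1}"

# run (hypothetical helper): print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

create_cluster() {
    # $NODES is intentionally unquoted so each node becomes its own argument.
    # The && chain stops at the first step that fails.
    run sudo pcs cluster auth $NODES -u hacluster -p "$HACLUSTER_PASSWORD" &&
    run sudo pcs cluster setup --name "$CLUSTER_NAME" $NODES &&
    run sudo pcs cluster start --all &&
    run sudo pcs cluster enable --all
}

create_cluster
```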

After Pacemaker is configured, use pcs to interact with the cluster. Execute all commands on one node in the cluster.

Configure fencing

Pacemaker cluster vendors require STONITH to be enabled and a fencing device to be configured for a supported cluster setup. When the cluster resource manager cannot determine the state of a node or of a resource on a node, fencing brings the cluster back to a known state.

Resource-level fencing ensures, by configuring a resource, that no data corruption occurs in the event of an outage. For example, you can use resource-level fencing to mark the disk on a node as outdated when the communication link goes down.

Node-level fencing ensures that a node does not run any resources. This is achieved by resetting the node. Pacemaker supports a variety of fencing devices, depending on your environment. You can use intelligent power distribution units (PDUs), network switches, HP iLO devices, or plug-ins such as the VMware STONITH agent. Currently, STONITH agents for Hyper-V and Microsoft Azure are not supported.

Note: disable STONITH for testing purposes only. If you plan to use Pacemaker in a production environment, you should plan a STONITH implementation that fits your environment and keep it enabled.

For fencing in production deployments, refer to the official documentation: Red Hat High Availability Add-On with Pacemaker: Fencing.

Because the node-level fencing configuration depends heavily on your environment, you can disable node-level fencing in a test environment with the following command:

sudo pcs property set stonith-enabled=false
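Since STONITH should be disabled only for testing, a small guard can make that choice explicit in a setup script. The following is a minimal sketch; the helper name stonith_setting and the environment labels "test" and "production" are hypothetical. It only produces the property value: a production cluster still needs a fencing device configured separately.

```shell
# stonith_setting (hypothetical helper): prints the stonith-enabled value
# for the given environment label and fails on anything else.
stonith_setting() {
    case "$1" in
        test)       echo "stonith-enabled=false" ;;
        production) echo "stonith-enabled=true" ;;
        *)          echo "unknown environment: $1" >&2; return 1 ;;
    esac
}

# Usage on one cluster node:
#   sudo pcs property set "$(stonith_setting test)"
```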

Configure the cluster property cluster-recheck-interval

cluster-recheck-interval is the polling interval at which the cluster checks for changes in resource parameters, constraints, and other cluster options. If a replica fails, the cluster tries to restart it at an interval bounded by the failure-timeout value and the cluster-recheck-interval value. For example, if failure-timeout is set to 60 seconds and cluster-recheck-interval is set to 120 seconds, the restart is attempted at an interval greater than 60 seconds but less than 120 seconds. The official recommendation is to set failure-timeout to 60 seconds and cluster-recheck-interval to a value greater than 60 seconds. Setting cluster-recheck-interval to a small value is not recommended. The following command sets the property to 2 minutes:

sudo pcs property set cluster-recheck-interval=2min
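The recommendation above (failure-timeout of 60 seconds, cluster-recheck-interval greater than that) can be checked before the values are applied. The sketch below is an illustration only: to_seconds and check_intervals are hypothetical helpers, and to_seconds understands just the "s" and "min" suffixes used in this article plus bare numbers of seconds.

```shell
# to_seconds (hypothetical helper): converts "60s", "2min", or a bare
# number of seconds into seconds. Other suffixes are not handled.
to_seconds() {
    case "$1" in
        *min) echo $(( ${1%min} * 60 )) ;;
        *s)   echo "${1%s}" ;;
        *)    echo "$1" ;;
    esac
}

# check_intervals FAILURE_TIMEOUT RECHECK_INTERVAL
# Succeeds when cluster-recheck-interval is greater than failure-timeout.
check_intervals() {
    ft=$(to_seconds "$1")
    ri=$(to_seconds "$2")
    [ "$ri" -gt "$ft" ]
}

if check_intervals 60s 2min; then
    echo "ok: cluster-recheck-interval exceeds failure-timeout"
fi
```

With the values used in this article (60s and 2min) the check passes; swapping in a recheck interval of 30s would make it fail.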

Configure the cluster property start-failure-is-fatal

All distributions (including RHEL 7.3 and 7.4) that use the latest available Pacemaker package, 1.1.18-11.el7, introduce a behavior change for the start-failure-is-fatal cluster setting when its value is false. This change affects the failover workflow. If the primary replica experiences an outage, the cluster is expected to fail over to one of the available secondary replicas. Instead, users will notice that the cluster keeps trying to start the failed primary replica. If that primary replica never comes back online (because of a permanent outage), the cluster never fails over to another available secondary replica. Because of this change, the previously recommended setting for start-failure-is-fatal is no longer valid, and the property needs to be reverted to its default value of true.

sudo pcs property set start-failure-is-fatal=true

In addition, the AG resource needs to be updated to include the failure-timeout property.

Use the following command to set the failure-timeout property of the ag1 resource to 60s:

sudo pcs resource update ag1 meta failure-timeout=60s