Example Analysis of SQL SERVER Always on Monitoring script and misunderstanding 07/12 Update SLTechnology News&Howtos

Example Analysis of SQL SERVER Always on Monitoring script and misunderstanding

2025-07-12 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/01 Report--

SQL SERVER Always on monitoring scripts and misunderstanding example analysis, many novices are not very clear about this, in order to help you solve this problem, the following small series will be explained in detail for everyone, there are people who need this to learn, I hope you can gain something.

SQL SERVER Always on monitoring seems to be rarely mentioned, most of them are through the control panel, or found that some databases in some availability groups are no longer synchronized, only to realize that there is a problem.

A few questions first.

1 What is the time interval for health monitoring?

2. The working status of a machine in your current cluster, which contains how to quickly discover if there are multiple databases in your group, one of which is out of sync for some reason.

3 How long is the RTO for the master to switch over to the slave if the master fails

In fact, configuring this information in monitoring, real-time monitoring will help to discover whether the ALWAYS ON cluster works normally and whether it can provide services normally.

Before we talk about this, let's briefly talk about the ALWAYS ON architecture.

From the figure we can see, SQL SERVER Always on the general replication principle and transmission mode, here need to emphasize one thing, is Always on synchronous mode and asynchronous mode, many people think, I just choose the synchronous mode can, from the library must be real-time and the main library synchronization, data at any point in time will not have any difference pool.

This understanding is actually wrong from the absolute value. From the figure, our slave library has committed ACK in log Hardened, which means that the data is only written to the LDF file of the slave library, and it has already informed the master library that the data can be committed. However, at this time, the data has not been brushed into the data file, so even if it is synchronous mode, the synchronized data between the primary and standby databases is different in time. Look at the speed at which REDO works.

OK, now to clarify some of the misconceptions about synch nouns. SQL SERVER ALWAYS ON monitoring Script problems, follow-up can be through these Script and DMV, to some always on STATUS graphical display, rather than have to enter SQL SERVER to get these data, and status display.

The following is a script for monitoring that displays differently when executed on primary and secondary hosts.

SELECT

ag.name AS 'GroupName'

,cs.replica_server_name AS 'Replica'

,rs.role_desc AS 'Role'

,ag.health_check_timeout as health_check_timeout_ms

,case ag.failure_condition_level

when 1 then 'service down'

when 2 then 'the server is out of control'

when 3 then 'default value,or spin lock'

when 4 then 'Please check your memory resource'

when 5 then 'automatic failover'

end as failure_condition_level

,REPLACE(ar.availability_mode_desc,'_',' ') AS 'AvailabilityMode'

,ags.primary_recovery_health_desc

,ags.secondary_recovery_health_desc

,ar.failover_mode_desc AS 'FailoverMode'

,rs.recovery_health_desc

,rs.synchronization_health_desc

,ar.seeding_mode_desc AS 'SeedingMode'

,ar.endpoint_url AS 'EndpointURL'

,al.dns_name AS 'Listener'

FROM sys.availability_groups ag

JOIN sys.dm_hadr_availability_group_states ags ON ag.group_id = ags.group_id

JOIN sys.dm_hadr_availability_replica_cluster_states cs ON ags.group_id = cs.group_id

JOIN sys.availability_replicas ar ON ar.replica_id = cs.replica_id

JOIN sys.dm_hadr_availability_replica_states rs ON rs.replica_id = cs.replica_id

LEFT JOIN sys.availability_group_listeners al ON ar.group_id = al.group_id

There are mainly several fields here, which need to be introduced.

1 primary_recovery_health_desc , secondary_recovery_health_desc

Mainly show whether the relevant services are online

2 synchronization_health_desc Current synchronization status

3 recovery_health_desc Judge whether all databases in the current synchronization group are synchronized. If ONLINE_IN_PROGRESS is displayed, it means that a database in the replication group is not synchronized. If PROGRESS is displayed, it means that all databases in this synchronization group are in normal synchronization.

The script above looks something like this.

Finally, RTO.

Estimating failover time (RTO) is a term used to estimate how long it will take for a primary to fail in our cluster.

The main time required for a failover is composed of the following points

Diagnosis and decision times for host failures, redo times for data, and final switchover times

Official Micorsoft statement

All we can tell is the ratio between the local redo queue and the current redo rate

We get the relevant parameters redo_queue_size (KB/S) in sys.dm_hadr_database_replica_states, redo_rate secondary node REDO rate (KB/S) from the following system DMV

And the ratio of the two gives you an idea of how long it takes to complete the queue.

There is a script that can automatically determine if it is executed on the master node, it will automatically ignore and no data will be displayed. Only when executed on the slave node will TREDO/S of the current node be displayed.

SELECT

ag.name AS 'GroupName'

,db_name(hst.database_id) as dbname

,cs.replica_server_name AS 'Replica'

,ag.health_check_timeout as health_check_timeout_ms

,cast(hst.redo_queue_size as float) / hst.redo_rate as Tredo/S

,ags.primary_recovery_health_desc

,ags.secondary_recovery_health_desc

,ar.failover_mode_desc AS 'FailoverMode'

FROM sys.availability_groups ag

JOIN sys.dm_hadr_availability_group_states ags ON ag.group_id = ags.group_id

JOIN sys.dm_hadr_availability_replica_cluster_states cs ON ags.group_id = cs.group_id

Join sys.dm_hadr_database_replica_states as hst on ags.group_id = hst.group_id

JOIN sys.availability_replicas ar ON ar.replica_id = cs.replica_id

where hst.database_id = db_id('test') and ar.replica_metadata_id is not null and ags.primary_recovery_health is null

In fact, if such queries are implemented on the monitoring interface, it may be more efficient to find faults and solve problems in ALWAYSON.

Did reading the above help you? If you still want to have further understanding of related knowledge or read more related articles, please pay attention to the industry information channel, thank you for your support.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.