Principle Analysis of High Availability of the Skype for Business Server Front End

2025-07-01 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

Recently, we twice in a row handled the same problem at the same customer: the Skype for Business front-end service would not start. In this customer's environment the front-end servers keep shutting down unexpectedly, and every incident costs a lot of time. This article walks systematically through how the Skype for Business Server front end works and through the corresponding troubleshooting process.

The high availability of the Skype for Business Server front end is built on Windows Fabric. Windows Fabric does not need to be installed manually: it is installed and configured automatically when you plan the topology and install the Skype for Business Server components. The configuration of the entire Fabric can be viewed on each FE server, in the folder that holds the ClusterManifest.current file.

The ClusterManifest.current file is regenerated and overwritten every time Windows reboots, so do not edit it manually. The consequences of a manual change are serious: routing group quorum and pool quorum can be lost, the whole pool fails, and clients cannot sign in to any FE server.

Get-CsUserPoolInfo shows a user's current pool and which front-end server in that pool is the primary. In this environment:

The primary registrar is FE03

The primary user services server is FE03

The secondary server is FE01

The idle server is FE02

Event Viewer also shows the primary server moving from FE02 to FE03, and the user's msRTCSIP-UserRoutingGroupID attribute points to the routing group hosted on FE03. These routing groups are stored in the local SQL database on the FE servers and do not depend on the back-end SQL Server.

So the front end as a whole gets its high availability from Windows Fabric, Fabric assigns primary servers per routing group, and all of this lives on the FE servers themselves.

According to Microsoft's official documentation, the behavior depends on whether the pool has an odd or an even number of front-end servers. With an odd number, the Fabric quorum vote takes place among the front-end servers alone. With an even number, the back-end SQL Server is added as a voting member. If the back-end database uses SQL mirroring, the SQL Server still has to take part in the front-end Fabric vote, and if the principal server fails over to the mirror, the vote fails, which shuts down the whole front-end service.

So it is best to have an odd number of front-end servers in the front-end pool.
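The odd/even rule above can be put into a few lines of Python. This is only a simplified model of the description in this article, not the actual Windows Fabric voting algorithm; the function names are made up for illustration, and the strict-majority rule is an assumption.

```python
def voting_members(front_ends: int) -> int:
    """Number of voters under the rule described in the article:
    an even-sized pool adds the back-end SQL Server as one extra voter."""
    if front_ends < 1:
        raise ValueError("a pool needs at least one front-end server")
    return front_ends if front_ends % 2 == 1 else front_ends + 1


def votes_for_majority(front_ends: int) -> int:
    """Smallest number of votes that forms a strict majority (assumed rule)."""
    return voting_members(front_ends) // 2 + 1


def depends_on_sql_vote(front_ends: int) -> bool:
    """True when the back-end SQL Server takes part in the vote."""
    return front_ends % 2 == 0


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, voting_members(n), votes_for_majority(n), depends_on_sql_vote(n))
```

In this model a pool of 4 has 5 voters, one of which is the SQL Server, so a SQL failover touches the vote directly. A pool of 3 or 5 never involves SQL in the vote.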

As mentioned earlier, Fabric runs on the front-end servers, while the back-end database underpins the front-end service as a whole. If you shut down the back-end database manually, you see an interesting effect: Skype clients keep working normally for a while and then stop working. The Get-CsRegistrarConfiguration command shows that the Skype front end can survive for 30 minutes without the back-end SQL. You can change this survival time with Set-CsRegistrarConfiguration, but Microsoft does not recommend doing so.
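To make the 30-minute window concrete, here is a minimal sketch of the timing logic. The 30 minutes come from the observation above. The function name and the choice to treat the exact 30-minute mark as expired are my own assumptions, and nothing here talks to a real server.

```python
from datetime import datetime, timedelta

# Survival time without the back-end SQL, as reported in this article.
SURVIVAL_WINDOW = timedelta(minutes=30)


def clients_still_served(backend_down_at: datetime, now: datetime,
                         window: timedelta = SURVIVAL_WINDOW) -> bool:
    """Return True while the back-end outage is still inside the window.

    Assumption: the window is treated as expired at exactly `window`.
    """
    if now < backend_down_at:
        raise ValueError("'now' is earlier than the start of the outage")
    return now - backend_down_at < window


if __name__ == "__main__":
    down = datetime(2025, 7, 1, 9, 0)
    print(clients_still_served(down, down + timedelta(minutes=10)))
    print(clients_still_served(down, down + timedelta(minutes=45)))
```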

The above covers Fabric's quorum and voting, and why the back-end SQL is indispensable. Some readers will now ask: does front-end high availability really require three servers? My answer is that if you want real high availability for the front end, the pool needs at least 3 front-end servers (and it can have at most 12).

However, many engineers on the contractor side build pools with only two front-end servers when implementing a project. Such a pool cannot meet the basic requirements of Windows Fabric and cannot have primary, secondary and idle servers. Each front end is then connected directly to the back-end database, the survival time described above does not apply when the back-end database goes down, and Skype clients are disconnected immediately and become unavailable.
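The sizing rules so far (at least 3 servers for high availability, at most 12, odd preferred) fit in a small check. This is a sketch of the guidance in this article only; the function name and the return labels are hypothetical.

```python
MIN_HA_FRONT_ENDS = 3   # minimum for front-end high availability (from the article)
MAX_FRONT_ENDS = 12     # maximum pool size (from the article)


def assess_pool_size(front_ends: int) -> str:
    """Classify a planned pool size according to the guidance in the article."""
    if not 1 <= front_ends <= MAX_FRONT_ENDS:
        raise ValueError(f"pool size must be between 1 and {MAX_FRONT_ENDS}")
    if front_ends < MIN_HA_FRONT_ENDS:
        return "no-ha"      # 1 or 2 servers: no primary/secondary/idle roles
    if front_ends % 2 == 1:
        return "ok-odd"     # front ends vote among themselves
    return "ok-even"        # works, but the back-end SQL joins the vote


if __name__ == "__main__":
    for n in range(1, MAX_FRONT_ENDS + 1):
        print(n, assess_pool_size(n))
```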

Next, I would like to share how to bring the front-end service back when both the Windows Fabric routing group quorum and the pool quorum are lost, as happened recently when all three front ends at the same customer went down at the same time.

First, what happened. One day the customer's VMware virtualization platform raised a host alarm for insufficient resources, and the virtual machines had to be migrated to another host. All three front-end servers were migrated at the same time, which in itself is not a problem. The tragedy was that the physical host collapsed under the load during the migration, so all three front-end servers shut down unexpectedly, and that is the worst thing that can happen to Windows Fabric. As expected, after the front-end virtual machines were migrated manually and restarted, none of the services on the Skype front-end servers would start. The customer panicked, the project manager panicked, and the customer gave us 3 hours, because the group's entire leadership was due to use Skype for audio and video calls in 3 hours.

When I first got the news the pressure was considerable, but I had already dealt with this problem once before. The first thing I did was tell the customer that the one thing we must not do is restore the virtual machine snapshots. As soon as a snapshot is restored, the high-availability state of the entire front end is thrown into disarray, which only makes the problem harder to solve.

The steps I took were as follows:

First, I tried to start all the Skype services on every front-end server manually. None of them would start, and Event Viewer was full of errors: the back-end database could not be reached, the front-end Fabric pool could not start normally, and so on.

Next, I switched to the SQL servers (SQL Server is configured with mirroring for high availability) and fixed the SQL database services, none of which were working properly.

Then the Fabric pool on the front end has to be reset with the Reset-CsPoolRegistrarState command. Its -ResetType parameter accepts the following values:

ServiceReset: stops and restarts the RtcSrv and FabricHostSvc services

QuorumLossRecovery: reloads user data from the backup store for any routing group that has currently lost quorum

FullReset: performs the same reset as QuorumLossRecovery and, in addition, rebuilds the local Skype for Business Server database

MachineStateRemoved: removes the specified server from the pool. This type of reset should be used only if the server in question (or its database) has been permanently lost.

I have never used FullReset so far. It is the last resort, and the risk is high enough that Microsoft's official site advises consulting Microsoft support engineers before using it.
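The four reset types and the caution around the riskier ones can be captured in a small helper that only builds the command text and never runs it. The four values come from the list above. The -PoolFqdn parameter is not mentioned in this article and is taken from the cmdlet's documented syntax as I understand it, so check it with Get-Help before relying on it. The pool name, the helper name and the idea of guarding FullReset and MachineStateRemoved behind a flag are my own assumptions.

```python
# Reset types as listed in the article, from least to most drastic.
RESET_TYPES = {
    "ServiceReset": "restart the RtcSrv and FabricHostSvc services",
    "QuorumLossRecovery": "reload user data from the backup store",
    "FullReset": "QuorumLossRecovery plus a rebuild of the local database",
    "MachineStateRemoved": "remove a permanently lost server from the pool",
}

# Assumption: these two are only allowed after an explicit confirmation.
HIGH_RISK = {"FullReset", "MachineStateRemoved"}


def build_reset_command(pool_fqdn: str, reset_type: str,
                        allow_high_risk: bool = False) -> str:
    """Return the command line as text; nothing is executed."""
    if reset_type not in RESET_TYPES:
        raise ValueError(f"unknown reset type: {reset_type}")
    if reset_type in HIGH_RISK and not allow_high_risk:
        raise PermissionError(f"{reset_type} needs explicit confirmation")
    return (f"Reset-CsPoolRegistrarState -PoolFqdn {pool_fqdn} "
            f"-ResetType {reset_type}")


if __name__ == "__main__":
    # "pool01.contoso.example" is a hypothetical pool name.
    print(build_reset_command("pool01.contoso.example", "ServiceReset"))
```

Keeping the risky values behind a flag mirrors how I work in practice: try ServiceReset first, then QuorumLossRecovery, and treat the rest as a decision to take together with Microsoft support.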

My first attempt was ServiceReset, which restarts the Fabric service to rebuild the routing and then restarts the front-end services. After running the command, I was pleasantly surprised to find that every service except the Skype for Business Server Front-End service had started. Anyone in the field knows that if the Front-End service is not running, no client can sign in, so Skype cannot be used.

At this point, back in the event log, the error still shown was that the Fabric pool had not completed user initialization.

ServiceReset had reset the routing groups, but the routing groups were not loaded into the Fabric pool, so I then ran the reset with QuorumLossRecovery to reload the user data directly from the backup store.

Running this command resets the whole of Windows Fabric and rebuilds the storage service. Even so, the Skype for Business Server Front-End service still did not start. As noted at the beginning of the article, the Windows Fabric configuration, the ClusterManifest.current XML file, is regenerated every time the machine reboots. It records the primary and idle servers of the whole Windows Fabric as well as the routing and quorum information, and manual edits to it have no effect. The best and most reliable solution is therefore to wait for the Skype front-end services to stop on their own, shut down all the front-end servers, and then power them on one after another. After boot, let the three front ends carry out the Windows Fabric quorum vote and the Fabric pool reset by themselves. How long this takes depends on the number of front-end servers; with my three front-end servers it took about 30 minutes in total.

In the end, all Fabric pools and routing quorums returned to normal automatically, and the Skype for Business Server Front-End service ran normally.

The problem was solved within the time set by the customer: a scare, but no harm done.

Based on this real case, I strongly recommend the following:

1. Try to use a SQL failover cluster or AlwaysOn for the back-end databases. For SQL failover clusters and AlwaysOn, please refer to my previous article.

2. Do not put all your "eggs" in one "basket". Place the front-end servers in different data centers, or at least on different hosts.

3. On the number of front-end servers in the front-end pool: if you need only one, deploy Standard Edition directly. Two front-end servers are strongly discouraged; use at least three. An odd number of front-end servers is also recommended. 3 or 5 is the best configuration. With 7 or 9 you have to obtain the names of the first five servers through commands and restart them in that order, which makes a failure of such a large pool more troublesome to deal with.

4. For a pool of any size, when you need to install updates, be sure to take each server offline with Invoke-CsComputerFailOver -ComputerName and bring it back online with Invoke-CsComputerFailBack -ComputerName. Finish updating all the services on one front-end server before moving on to the next one, and upgrade the back-end database only after all front-end servers have been updated.
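The update order in point 4 can be written down as a generated checklist. This sketch only produces text; it does not run anything. The two cmdlets and their -ComputerName parameter are the ones named above. The server names, the function name and the wording of the comment lines are hypothetical.

```python
def patch_plan(servers):
    """Build a one-server-at-a-time update checklist for a front-end pool."""
    if not servers:
        raise ValueError("no servers given")
    if len(set(servers)) != len(servers):
        raise ValueError("duplicate server names")
    steps = []
    for name in servers:
        # Take the server out of service, update it, then bring it back.
        steps.append(f"Invoke-CsComputerFailOver -ComputerName {name}")
        steps.append(f"# install updates on {name} and check its services")
        steps.append(f"Invoke-CsComputerFailBack -ComputerName {name}")
    # The back-end database comes last, after every front end is done.
    steps.append("# all front ends updated: now upgrade the back-end database")
    return steps


if __name__ == "__main__":
    # Hypothetical server names for illustration.
    for line in patch_plan(["fe01.contoso.example",
                            "fe02.contoso.example",
                            "fe03.contoso.example"]):
        print(line)
```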

The above is my personal experience and does not represent Microsoft or any product group. Some of the wording or descriptions may be inaccurate, but I hope it helps you solve cases in real environments where the front-end service of Skype/Lync Server Enterprise Edition cannot be started, especially when the cause is an abnormal Fabric pool.