This article explains how GitHub solves MySQL high availability. The approach described is straightforward and practical; read on to learn how GitHub keeps its MySQL fleet highly available.
Introduction
GitHub uses MySQL to store all of its non-Git data, so MySQL availability is critical to GitHub's operation. The website, the API, authentication and more all need database access. GitHub runs multiple MySQL clusters for its different services and tasks. These clusters use a typical primary-replica layout: a single node (the master) accepts writes, while the other nodes (replicas) asynchronously replay the master's changes and serve reads.
The availability of the master is critical. Without a master, the entire cluster cannot accept writes: any operation that changes data, such as commits, issues, user registrations or new repositories, fails. To accept writes we obviously need an available write node, the master. Just as important, clients must be able to locate or discover that node.
When a master crashes, a new master must be promoted and advertised quickly. The total downtime is the time to detect the crash, plus the time to fail over, plus the time to broadcast the new master to the cluster, and of course, the smaller the better.
This article presents GitHub's solution for MySQL high availability and master service discovery, which lets GitHub operate reliably across data centers, tolerate data center isolation, and keep downtime short.
High availability goals
The solution described here iterates on and improves GitHub's previous high-availability setup. As we grow, MySQL high availability must adapt to the changes, and we expect to apply a similar strategy to other services at GitHub.
When considering high availability and service discovery, a few questions guide you toward an appropriate solution. An incomplete list:
How much downtime can you tolerate?
Is crash detection reliable? Can you tolerate false positives (a premature failover)?
Is failover reliable? Where can it fail?
Does the solution work well across data centers, on both low- and high-latency networks?
Does the solution tolerate the loss of an entire data center, or network isolation?
Is there a mechanism to prevent or mitigate split-brain (two nodes both claiming to be the master of a cluster, independently, unaware of each other, and both accepting writes)?
Can you afford data loss? How much loss is acceptable?
To illustrate these questions, let's first look at the previous high-availability solution and why we replaced it.
Discarding the VIP- and DNS-based discovery strategy
In previous iterations, we used:
orchestrator for crash detection and failover, and
a VIP and DNS for master discovery. Under that strategy, clients discover the write node by name, e.g. mysql-writer-1.github.net; the name resolves to a virtual IP (VIP) held by the host of the current master. So, normally, a client resolves the name, connects to the resulting IP, and finds the master.
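To make that old flow concrete, here is a minimal sketch of client-side discovery under this strategy, assuming a PyMySQL client and placeholder credentials (the hostname comes from the article; everything else is illustrative):

```python
import socket

import pymysql  # assumed client library; any MySQL driver behaves the same way

# Under the VIP/DNS strategy the client only knows a name. Resolving the name
# yields the VIP currently claimed by the master host.
WRITER_NAME = "mysql-writer-1.github.net"

vip = socket.gethostbyname(WRITER_NAME)  # whatever the VIP/DNS points at right now

# The client connects to that IP and trusts that it is the master. If the VIP
# has not moved yet after a failover (or a cached DNS answer is stale), this
# silently connects to the wrong, or a dead, host.
conn = pymysql.connect(host=vip, port=3306, user="app", password="***", database="github")
```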
Consider a replication topology that spans three data centers:
When the master fails, one of the replicas must be promoted to take its place as the new master.
orchestrator detects the failure, promotes a new master, and then re-points the name (or the VIP). Clients don't actually know the master's identity: all they have is a name, and that name must now resolve to the new master. Consider:
VIPs are cooperative: they are claimed and owned by the database servers themselves. To acquire or release a VIP, a server must issue an ARP request, and the server currently holding the VIP must release it before the new master can claim it. This has some implied effects:
An orderly failover first contacts the failed old master and asks it to release the VIP, and then contacts the new master and asks it to claim it. But what if the old master is unreachable or refuses to release the VIP? In a master-failure scenario, it may well fail to respond in time, or not respond at all.
Ignoring the old master and proceeding without its cooperation ends in split-brain: both hosts claim the same VIP, and different clients connect to one or the other depending on the shortest network path.
Part of the root cause is that two independent servers need to cooperate, and that structure is unreliable.
Even when the old master does cooperate, the workflow wastes precious time: the switch to the new master has to wait while we reach out to the old one.
And even when the VIP changes, existing client connections are not guaranteed to be disconnected from the old master, leaving us with a de facto split-brain.
VIPs are also bound by physical location; they belong to a router or a switch. So we can only reassign a VIP to a server in the same location. In particular, in some cases we cannot assign the VIP to a new master in a different data center and must change DNS instead.
DNS changes take longer to propagate. Clients cache DNS names and only refresh them after a configured TTL. A cross-data-center failover therefore implies longer downtime: it takes more time for clients to realize that the master has changed.
These limitations alone pushed us toward a better solution, but there was more to consider:
The master injected its own heartbeats via the pt-heartbeat service, used for lag measurement and throttling. The service had to be started on the newly promoted master and, if possible, shut down on the old one.
Likewise, Pseudo-GTID injection was managed by the master itself. It had to be started on the new master and preferably stopped on the old one.
The new master was made writable and, if possible, the old master was made read-only.
These extra steps added to the total downtime and were themselves subject to failures and conflicts. The solution worked, and GitHub performed successful failovers with it, but we wanted our high availability to do better in these areas:
Be agnostic to data centers.
Tolerate the failure of a data center.
Remove unreliable cooperative workflows.
Reduce total downtime.
Achieve lossless failover as far as possible.
GitHub's high-availability solution: orchestrator, Consul, GLB
Our new strategy, along with some incidental optimizations, addresses or mitigates the concerns above. The current HA setup consists of:
orchestrator for failure detection and failover (using orchestrator/raft across data centers).
Hashicorp's Consul for service discovery.
GLB/HAProxy as a proxy layer between clients and the write node.
anycast for network routing.
The new setup removes VIP and DNS changes entirely. Although we introduce more components, we are able to decouple them, simplify each task, and build on solid and stable solutions. Let's walk through it in detail.
A normal process
Normally, applications connect to the write node through GLB/HAProxy and never need to know the master's identity. As before, applications use a name: for example, the master of the cluster1 cluster is reached via mysql-writer-1.github.net. In the current setup, however, that name resolves to an anycast IP address.
With anycast, the name resolves to the same IP everywhere, but traffic is routed differently depending on the client's location. In particular, each of our data centers has GLB, our highly available load balancer, deployed in multiple boxes. Traffic to mysql-writer-1.github.net is always routed to the GLB cluster in the local data center.
Thus, all clients are served by local proxies. GLB runs on top of HAProxy. HAProxy has write-node pools: one pool per MySQL cluster, and each pool has exactly one backend server, the cluster's master. All GLB/HAProxy boxes in all data centers use the same pools, i.e. the same backend servers. So an application that writes to mysql-writer-1.github.net is always routed to the master of the cluster1 cluster, regardless of which GLB server it happens to connect to.
For applications, discovery ends at GLB, and there is never a need for rediscovery; it is up to GLB to route the traffic to the right destination. But how does GLB know which servers to use as backends, and how are changes propagated to GLB?
Discovery via Consul
Consul is well known as a service-discovery solution that also offers DNS services. In our setup, however, we use it as a highly available key-value (KV) store.
The identity of each cluster's master is stored in Consul's KV store. For each cluster, there is a set of KV entries describing the master's fqdn, port, ipv4, and ipv6.
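As a sketch of what such entries could look like (the key layout and addresses below are assumptions for illustration, not GitHub's actual schema), the master's identity can be written to and read from Consul's HTTP KV API like this:

```python
import base64

import requests  # used to talk to the local Consul agent's HTTP API

CONSUL = "http://127.0.0.1:8500"
CLUSTER = "cluster1"

# Hypothetical layout: one KV entry per attribute of the cluster's master.
master_identity = {
    "fqdn": "mysql-writer-1.github.net",
    "port": "3306",
    "ipv4": "10.0.0.11",
    "ipv6": "fd00::11",
}

# Write each attribute as its own entry.
for key, value in master_identity.items():
    requests.put(f"{CONSUL}/v1/kv/mysql/master/{CLUSTER}/{key}", data=value)

# Reading back: Consul returns a JSON list with base64-encoded values.
resp = requests.get(f"{CONSUL}/v1/kv/mysql/master/{CLUSTER}/fqdn")
entry = resp.json()[0]
print(base64.b64decode(entry["Value"]).decode())  # -> mysql-writer-1.github.net
```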
Each GLB/HAProxy node runs consul-template: a service that listens for changes in Consul data (in our case, changes to the cluster master entries), renders a valid HAProxy configuration file, and reloads HAProxy whenever the configuration changes.
Thus, a change of a master's identity in Consul is picked up by every GLB/HAProxy box, which reconfigures itself, sets the new master as the single entity in the cluster's backend pool, and reloads to reflect the change.
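In production this watch/render/reload loop is consul-template's job; the Python sketch below only illustrates the mechanism, and the key path, rendered template, and reload command are all assumptions:

```python
import base64
import subprocess
import time

import requests  # used for Consul's HTTP API, including blocking queries

CONSUL = "http://127.0.0.1:8500"
KEY = "mysql/master/cluster1/fqdn"                      # assumed key, as in the earlier sketch
HAPROXY_CFG = "/etc/haproxy/mysql-cluster1-writer.cfg"  # assumed config fragment path

def render(master_fqdn: str) -> str:
    # One writer pool per cluster; its only backend is the current master.
    return (
        "backend mysql_cluster1_writer\n"
        f"    server master {master_fqdn}:3306 check\n"
    )

index = "0"
while True:
    # Consul blocking query: this request returns only once the key changes
    # past the given index, or the wait time elapses.
    resp = requests.get(f"{CONSUL}/v1/kv/{KEY}", params={"index": index, "wait": "5m"})
    if not resp.ok:
        time.sleep(5)
        continue
    new_index = resp.headers.get("X-Consul-Index", index)
    if new_index != index:
        index = new_index
        master_fqdn = base64.b64decode(resp.json()[0]["Value"]).decode()
        with open(HAPROXY_CFG, "w") as cfg:
            cfg.write(render(master_fqdn))
        # Reload HAProxy so the new master becomes the pool's only backend.
        subprocess.run(["systemctl", "reload", "haproxy"], check=True)
```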
At GitHub, each data center has its own Consul setup, and each setup is highly available. But these setups are independent of one another: they do not replicate to each other and do not share any data.
So how does Consul learn about changes, and how is the information propagated across data centers?
orchestrator/raft
We run an orchestrator/raft setup: the orchestrator nodes communicate with each other via raft. Each data center has one or two orchestrator nodes.
orchestrator is responsible for failure detection, for the MySQL failover, and for communicating the master change to Consul. Failover is executed by the single orchestrator/raft leader node, but the news that a cluster now has a new master propagates to all orchestrator nodes through the raft mechanism.
As each orchestrator node receives the new-master message, it talks to its local Consul setup and triggers a KV write. Data centers with more than one orchestrator node may issue multiple (identical) writes to Consul.
Putting it all together
When a master crashes:
The orchestrator nodes detect the failure.
The orchestrator/raft leader initiates a recovery and promotes a new master.
orchestrator/raft advertises the master change to all raft cluster nodes.
Each orchestrator/raft node receives the master-change notification and updates its local Consul KV store with the identity of the new master.
The consul-template running on each GLB/HAProxy box observes the change in Consul's KV store, then reconfigures and reloads HAProxy.
Client traffic is redirected to the new master.
Each component in this chain has a clear responsibility, and the whole design is well decoupled and quite simple: orchestrator doesn't know about the load balancers, Consul doesn't know where the information came from, the proxies only care about Consul, and the clients only care about the proxies. Furthermore:
There are no DNS changes to propagate.
There is no TTL.
The flow requires no cooperation from the old master.
Some other details
To make the flow more robust, we also do the following:
HAProxy is configured with a very short hard-stop-after. When it reloads with a new backend server in the writer pool, it automatically terminates all existing connections to the old master.
With hard-stop-after we don't even ask for cooperation from the clients, which mitigates a split-brain scenario. Split-brain is obviously not completely eliminated, and it takes some time before all old connections are killed, but after that point in time we can feel comfortable that there will be no unpleasant surprises.
We do not strictly require Consul to be available at all times; in fact, we only need it to be available at failover time. If Consul happens to be down, GLB keeps operating with the last known values and takes no drastic action.
GLB is set up to validate the identity of the newly promoted master. Similar to our context-aware MySQL pools, a check is made against the backend server to confirm that it really is a writable node. If we happen to delete the master's identity in Consul, no problem: the empty entry is ignored. If we mistakenly write the name of a non-master server to Consul, no problem: GLB refuses to update and keeps running in its last known state.
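As a rough illustration of that kind of validation (GLB's actual check is not reproduced here; the driver, credentials, and query are assumptions), a check could refuse any backend that does not report itself as writable:

```python
import pymysql  # assumed driver for the example

def is_writable_master(host: str, port: int = 3306) -> bool:
    """Accept a backend only if the server itself reports that it is writable."""
    try:
        conn = pymysql.connect(host=host, port=port, user="check", password="***",
                               connect_timeout=2)
        try:
            with conn.cursor() as cur:
                # Replicas (and accidentally listed non-masters) normally run
                # with read_only enabled, so they fail this check and the proxy
                # keeps its previous, known-good backend instead.
                cur.execute("SELECT @@global.read_only")
                (read_only,) = cur.fetchone()
        finally:
            conn.close()
        return read_only == 0
    except pymysql.MySQLError:
        # Unreachable or broken candidate: refuse it, keep the old state.
        return False
```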
We further address our high-availability goals and concerns in the following sections.
Orchestrator/raft failure detection
orchestrator uses a holistic approach to detect failures, which is very reliable. We do not observe false positives, so we do not suffer premature failovers and the unnecessary downtime they would cause.
orchestrator/raft further handles the case of a complete data-center network isolation. Such isolation can be deceptive: servers within the data center can still talk to each other. But from their point of view, is it they who are isolated from the other data centers, or the other data centers that are isolated from them?
In an orchestrator/raft setup, the raft leader node is the one that runs the failovers. The leader is a node that has the support of a majority of the group (a quorum, much like a democratic vote). We deploy orchestrator nodes such that no single data center holds a majority on its own, while any n-1 data centers do.
In a complete data-center network isolation event, the orchestrator nodes in that data center are disconnected from their peers in the other data centers. As a result, the orchestrator nodes in the isolated data center cannot be the leader of the raft cluster. If one of them happened to be the leader, it steps down. A new leader is elected from the other data centers; that leader has the support of, and is able to communicate with, the other data centers.
Thus, the orchestrator node that calls the shots is always outside the network-isolated data center. If there is a master in an isolated data center, orchestrator triggers a failover to replace it with a server from one of the available data centers. We mitigate data-center isolation by delegating the decision to the quorum of non-isolated data centers.
Advertising the change faster
Total downtime can be further reduced if the master change is advertised sooner. How can that be done?
When orchestrator begins a failover, it looks at the fleet of servers eligible for promotion. Based on hints and constraints, its understanding of the replication rules, and what it has memorized from past events, it makes an educated choice along sensible lines.
It may recognize that the server it is about to promote is also an ideal candidate, namely that:
There is no obstacle to the server's promotion (or the user has potentially hinted that the server is a preferred candidate), and
The server can be expected to be able to take all of its siblings as replicas.
When both conditions hold, orchestrator makes the server writable first, advertises the promotion immediately (by writing to the Consul KV store), and only then starts repairing the replication tree asynchronously, an operation that usually takes a few seconds.
By the time our GLB servers have fully reloaded, the replication tree is likely already reorganized, but that isn't strictly required: the server being writable is all that matters!
Semi-synchronous replication
In MySQL, semi-synchronous replication means the master does not acknowledge a transaction commit until the change is known to have reached one or more replicas. This provides a way to achieve lossless failover: any change applied on the master has either been applied, or is in the process of being applied, on a replica.
Consistency comes with a cost, though: a risk to availability. If no replica acknowledges receipt of a change, the master blocks and writes pile up. Fortunately, there is a timeout setting, after which the master falls back to asynchronous replication mode so that writes can proceed.
We set the timeout to a reasonably low value (500 ms). That is long enough to ship changes to the replicas in the local DC, and typically even to replicas in remote DCs. With this timeout we observe perfect semi-synchronous behavior (no fallback to asynchronous replication) as well as acceptably short blocking periods when acknowledgement fails.
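For reference, enabling semi-synchronous replication with such a timeout comes down to a few settings; a minimal sketch, assuming the classic rpl_semi_sync_* plugins are already installed and using placeholder hosts and credentials:

```python
import pymysql  # assumed driver; the statements themselves are plain MySQL

# On the master: do not acknowledge a commit until one replica has received the
# change, but fall back to asynchronous replication after 500 ms so that a slow
# or dead replica cannot block writes indefinitely.
master = pymysql.connect(host="mysql-writer-1.github.net", user="admin", password="***")
with master.cursor() as cur:
    cur.execute("SET GLOBAL rpl_semi_sync_master_enabled = 1")
    cur.execute("SET GLOBAL rpl_semi_sync_master_timeout = 500")  # milliseconds

# On each local-DC replica that should acknowledge commits:
replica = pymysql.connect(host="replica1.local-dc.example", user="admin", password="***")
with replica.cursor() as cur:
    cur.execute("SET GLOBAL rpl_semi_sync_slave_enabled = 1")
```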
We enable semi-synchronous replication only on local-DC replicas, and we expect (though do not strictly require) lossless failover when the master dies. Lossless failover across an entire DC failure is too expensive and is not something we require.
While experimenting with the semi-synchronous timeout, we also observed a welcome side effect: we can influence the identity of the ideal candidate when the master fails. By enabling semi-synchronous replication on designated servers and marking them as candidates, we reduce total downtime by influencing the outcome of the failover. In our experiments, candidate selection completes more quickly, which in turn speeds up advertising the new master.
Heartbeat injection
Instead of starting and stopping the pt-heartbeat service on promoted and demoted masters, we chose to run it everywhere, all the time. This required a patch to pt-heartbeat so that it tolerates servers changing their read/write state back and forth, and even being completely down.
In our current setup, pt-heartbeat runs on masters and replicas alike. On the master it generates heartbeat events; on a replica it recognizes that the server is read-only and periodically rechecks its state. As soon as a server is promoted to master, the pt-heartbeat on it identifies the server as writable and starts injecting heartbeats.
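A rough sketch of that read/write-aware behavior (this is not pt-heartbeat's actual implementation; the heartbeat table and credentials are assumptions):

```python
import time

import pymysql  # assumed driver for the example

# Hypothetical heartbeat table: a single row whose timestamp is refreshed by
# the writable node and read by replicas to measure lag.
HEARTBEAT_SQL = "REPLACE INTO meta.heartbeat (id, ts) VALUES (1, NOW(6))"

def heartbeat_loop(host: str) -> None:
    """Runs unconditionally on every server, master and replica alike."""
    conn = pymysql.connect(host=host, user="heartbeat", password="***", autocommit=True)
    while True:
        with conn.cursor() as cur:
            cur.execute("SELECT @@global.read_only")
            (read_only,) = cur.fetchone()
            if read_only == 0:
                # Writable: this server is (or has just become) the master,
                # so inject heartbeats for lag measurement and throttling.
                cur.execute(HEARTBEAT_SQL)
            # Read-only: do nothing and simply re-check later; when the server
            # is promoted, heartbeats start without restarting the service.
        time.sleep(0.5)
```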
Orchestrator ownership delegation
We have delegated even more to orchestrator:
Pseudo-GTID generation.
Setting the promoted master as writable and clearing its replication state.
Setting the old master as read-only, when possible.
This reduces the sources of friction around the new master. The newly promoted master is clearly expected to be alive and reachable, otherwise we would not promote it, so it makes sense to let orchestrator apply those changes directly to it.
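A hedged sketch of what those promotion-time actions amount to (orchestrator's real implementation differs; the statements below, issued here through an assumed PyMySQL connection, only illustrate the intent):

```python
import pymysql  # assumed driver; the statements are standard MySQL

def promote(new_master_host: str) -> None:
    """Make the newly nominated master writable and clear its replication state."""
    conn = pymysql.connect(host=new_master_host, user="admin", password="***", autocommit=True)
    with conn.cursor() as cur:
        cur.execute("STOP SLAVE")                # it no longer replicates from anyone
        cur.execute("RESET SLAVE ALL")           # drop its replication configuration
        cur.execute("SET GLOBAL read_only = 0")  # start accepting writes

def demote_old_master(old_master_host: str) -> None:
    """Best effort: if the old master is still reachable, make it read-only."""
    try:
        conn = pymysql.connect(host=old_master_host, user="admin", password="***",
                               autocommit=True, connect_timeout=2)
        with conn.cursor() as cur:
            cur.execute("SET GLOBAL read_only = 1")
    except pymysql.MySQLError:
        pass  # the old master may be dead or unreachable; that is expected
```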
Limitations and drawbacks
The proxy layer keeps applications unaware of the master's identity, but it also hides the applications' identities from the master. All connections to the master appear to come from the proxy layer, and the true origin of each connection is lost. And, as distributed systems go, we are still left with some unhandled scenarios.
Notably, in a data-center-isolation scenario, if the master happens to be in the isolated DC, applications inside that DC can still write to it; once the network is restored, this can lead to inconsistent state. We are exploring ways to reduce this split-brain window by implementing a reliable STONITH from within the isolated DC. As things stand, it takes a while before the old master is taken down, so there is a short period of split-brain. The operational cost of avoiding split-brain entirely is very high.
More scenarios exist: a Consul outage at failover time, partial DC isolation, and others. We understand that in a distributed system it is impossible to close every loophole, so we focus on the most important cases.
Results
Our orchestrator/GLB/Consul setup gives us:
Reliable failure detection.
Data-center-agnostic failover.
Typically lossless failover.
Data-center network isolation support.
Split-brain mitigation.
No cooperation delays.
Total downtime of roughly 10 to 13 seconds in most cases, up to 20 seconds in some cases, and up to 25 seconds in extreme cases.
At this point, you should have a deeper understanding of how GitHub solves MySQL high availability. Why not try some of these ideas out in practice?