
What are the best practices for migrating GitHub to Kubernetes?


What are the best practices for migrating GitHub to Kubernetes? Many newcomers find the topic daunting, so this article walks through how GitHub moved its flagship application onto Kubernetes and what the team learned along the way. Hopefully it helps you work through the same kind of migration.

Why change: freeing up SRE engineers

Until recently, the main Ruby on Rails application (github/github) ran much as it did eight years earlier: Unicorn processes managed by a Ruby process manager called God, running on Puppet-managed servers. Likewise, ChatOps deployment worked much as it did when first introduced: Capistrano established an SSH connection to each front-end server, updated the code in place, and restarted the application. When peak request load exceeded available front-end CPU capacity, GitHub's SREs provisioned additional capacity and added it to the pool of active front-end servers.

Although this basic production approach had not changed much in recent years, GitHub itself had changed a great deal: new features, a larger community, more employees, and more demands. This created new problems: many teams wanted to extract the functionality they were responsible for out of the large application into a small service that could run and be deployed independently. As the number of running services grew, the SRE team found itself providing similar configurations for dozens of other applications, spending more and more time on server maintenance, provisioning, and other work that was not directly related to improving GitHub's overall workflow.

New services took days, weeks, or months to deploy, depending on their complexity and the SRE team's availability. Over time, one problem became clear: this approach did not give engineers the flexibility they needed to keep building world-class services.

Engineers needed a self-service platform on which to experiment with, deploy, and scale new services. The same platform also had to meet the needs of the core Ruby on Rails application, so that engineers or robots could respond to changes in demand by allocating additional compute resources in seconds rather than hours or days.

To meet these needs, the SRE, Platform, and Developer Experience teams began a joint project that led to deploying the code behind github.com and api.github.com to Kubernetes clusters dozens of times a day.

Why Kubernetes?

While evaluating "platform as a service" tools, GitHub took a closer look at Kubernetes, an open-source system from Google for automating the deployment, scaling, and management of containerized applications. Several qualities of Kubernetes stood out: the vibrant open source community supporting the project, the first-run experience (which allowed a small cluster and an application to be deployed within the first few hours), and the wealth of design experience behind it.

The experiment quickly expanded in scope: a small project was set up to build a Kubernetes cluster and deployment tooling in support of the upcoming Hack Week, to gain some hands-on practical experience. The internal reaction at GitHub to this project was very positive.

Why start with github/github?

At the earliest stages of the project, GitHub made a deliberate decision to target the migration of a critical workload: github/github. Many factors contributed to this decision, the more important being:

Self-service scaling tools were needed to handle continued growth

The team wanted to ensure that the habits and patterns it developed would suit both the large application and smaller services

Better isolation of the application across development, staging, production, enterprise, and other environments was desired

Migrating a critical, high-visibility workload would inspire confidence for adopting Kubernetes more broadly at GitHub.

Rapid iteration and confidence building with a review lab

As part of the migration, some design and prototyping work was done to verify that the front-end servers could run using basic Kubernetes primitives such as Pods, Deployments, and Services. Some validation of the new design could be done by running github/github's existing test suites in containers, but it was still necessary to observe how those containers behaved as part of a larger set of Kubernetes resources. It quickly became clear that an environment supporting exploratory testing of Kubernetes and the services intended to run on it was essential during the validation phase.
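To make those primitives concrete, here is a minimal sketch of how a Unicorn front end might be described as a Kubernetes Deployment, built with the standard k8s.io/api Go types and rendered as YAML. The image name, labels, replica count, and port are placeholders, not GitHub's actual configuration:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	replicas := int32(4)
	labels := map[string]string{"app": "unicorn"} // hypothetical label

	// A Deployment that keeps a fixed number of Unicorn pods running.
	deploy := appsv1.Deployment{
		TypeMeta:   metav1.TypeMeta{APIVersion: "apps/v1", Kind: "Deployment"},
		ObjectMeta: metav1.ObjectMeta{Name: "unicorn"},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "unicorn",
						Image: "registry.example.com/github/unicorn:latest", // hypothetical image
						Ports: []corev1.ContainerPort{{ContainerPort: 8080}},
					}},
				},
			},
		},
	}

	out, err := yaml.Marshal(deploy)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```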

At the same time, project members observed that the existing model for deploying github/github pull requests was beginning to show signs of strain: deployment speed was proportional to the number of engineers, and several additional deployment environments were in use as part of the process of validating pull requests to github/github. At peak hours, the fixed number of fully functional deployment environments slowed down the pull-request deployment process. Engineers frequently asked to test more subsystems in "Branch Lab", which allowed multiple engineers to deploy at the same time but launched only a single Unicorn process per engineer, so it was only useful for testing API and UI changes. Because these needs overlapped so much, the projects were combined, and work started on a new Kubernetes-based deployment environment for github/github, called Review Lab.

In the course of building Review Lab, several sub-projects were also shipped:

Kubernetes cluster management on an AWS VPC using Terraform and kops

A set of Bash integration tests that exercised short-lived Kubernetes clusters, used heavily at the beginning of the project to build confidence in Kubernetes

A Dockerfile for github/github

Enhancements to the internal CI platform to support building and publishing containers to a container registry

YAML representations of 50+ Kubernetes resources, checked into github/github

Enhancements to the internal deployment application to support deploying Kubernetes resources from a repository into a Kubernetes namespace, plus the creation of Kubernetes secrets from the internal secret store

A service that combines HAProxy and consul-template to route traffic from Unicorn Pods to the existing services that publish their service information there

A service that reads Kubernetes events and sends abnormal ones to the internal error-tracking system (see the sketch after this list)

A chatops-RPC-compatible service called kube-me that exposes a limited set of kubectl commands to users through chat (a rough sketch appears below)
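As one illustration of the event-forwarding service mentioned above, the following client-go sketch watches cluster events and forwards Warning events. The kubeconfig path and the reporting function are stand-ins for GitHub's internal wiring, which is not public:

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Stand-in for real client configuration.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Watch events in all namespaces.
	w, err := client.CoreV1().Events(metav1.NamespaceAll).Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for item := range w.ResultChan() {
		ev, ok := item.Object.(*corev1.Event)
		if !ok {
			continue
		}
		// Forward only abnormal events, as the internal service did.
		if ev.Type == corev1.EventTypeWarning {
			reportToTracker(ev)
		}
	}
}

// reportToTracker is a hypothetical error-tracking client; here it just logs.
func reportToTracker(ev *corev1.Event) {
	log.Printf("warning: %s/%s %s: %s",
		ev.InvolvedObject.Namespace, ev.InvolvedObject.Name, ev.Reason, ev.Message)
}
```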

The end result is a chat-based interface for creating an isolated deployment of GitHub for any pull request: once a pull request has passed all of its required CI jobs, a user can deploy it to Review Lab.
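A chat-facing service like kube-me might be structured roughly as follows. This is a guess at the shape, not GitHub's implementation; the endpoint, parameters, and allowlist are invented. It accepts a chat-originated command, checks it against an allowlist, and shells out to kubectl:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os/exec"
)

// Allowlisted kubectl verbs; anything else is rejected.
var allowed = map[string]bool{
	"get":      true,
	"describe": true,
	"logs":     true,
}

// handle services a chat-originated request such as
// "/kube-me?verb=get&resource=pods&namespace=review-lab-1234".
func handle(w http.ResponseWriter, r *http.Request) {
	verb := r.FormValue("verb")
	resource := r.FormValue("resource")
	ns := r.FormValue("namespace")
	if !allowed[verb] {
		http.Error(w, "command not allowed", http.StatusForbidden)
		return
	}
	// exec.Command passes arguments directly (no shell), so the inputs
	// cannot smuggle extra shell syntax past the allowlist.
	out, err := exec.Command("kubectl", "-n", ns, verb, resource).CombinedOutput()
	if err != nil {
		http.Error(w, fmt.Sprintf("kubectl failed: %v\n%s", err, out), http.StatusInternalServerError)
		return
	}
	w.Write(out)
}

func main() {
	http.HandleFunc("/kube-me", handle)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```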

Like "Branch Lab" before it, each lab is cleaned up some time after its last deployment. Since each lab is created in its own Kubernetes namespace, cleanup is as simple as deleting the namespace, which the deployment system performs automatically when needed.
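That cleanup really is a single API call, since Kubernetes garbage-collects everything inside a deleted namespace. A minimal client-go sketch, with a hypothetical lab-to-namespace naming convention:

```go
package lab

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupLab deletes the namespace backing a review lab; Kubernetes then
// garbage-collects every resource inside it (Pods, Deployments, Services, ...).
func cleanupLab(ctx context.Context, client kubernetes.Interface, labID string) error {
	ns := "review-lab-" + labID // hypothetical naming convention
	return client.CoreV1().Namespaces().Delete(ctx, ns, metav1.DeleteOptions{})
}
```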

Review Lab was a successful project that produced a great deal of experience and several key results. Before the environment was offered to engineers, it served as an essential proving ground and prototyping environment for the Kubernetes cluster design, as well as for the design and configuration of the Kubernetes resources that now describe the github/github workload. After release, it gave engineers confidence, and GitHub was very pleased with how this environment empowered engineers to experiment and solve problems in a self-service way.

Kubernetes on Metal Cloud

With Review Lab shipped, attention turned to github.com. To meet the performance and reliability requirements of the flagship service, which depends on low-latency access to other data services, the Kubernetes infrastructure had to support the metal cloud running in GitHub's physical data centers and POPs. Again, nearly a dozen sub-projects were involved:

For container networking, helped by a timely and detailed blog post, GitHub chose Calico, which provided the out-of-the-box functionality needed to ship a cluster quickly in IPIP mode, as well as the flexibility to explore other network infrastructure in the future.

After reading Kelsey Hightower's "Kubernetes the hard way" more than a dozen times, GitHub assembled a handful of manually provisioned servers into a temporary Kubernetes cluster, which passed the integration tests.

Small tools were also built to generate the CA and configuration required by each cluster, in a format that could be consumed by Puppet and the internal secret systems.

The configuration of two instance roles, Kubernetes nodes and Kubernetes apiservers, was Puppetized in a way that allows a user to provide the name of an already-configured cluster to join at provisioning time.

A small Go service was built to consume container logs, attach metadata in key/value format to each line, and send them to the host's local syslog endpoint (a sketch follows this list).

The internal load-balancing service was enhanced to support Kubernetes NodePort services.
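A minimal sketch of such a log-consuming service, assuming log lines arrive on stdin and using Go's standard log/syslog package; the metadata keys here are illustrative, not GitHub's actual schema:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"log/syslog"
	"os"
)

func main() {
	// Connect to the host's local syslog endpoint.
	w, err := syslog.Dial("udp", "127.0.0.1:514", syslog.LOG_INFO|syslog.LOG_LOCAL0, "container-logs")
	if err != nil {
		log.Fatal(err)
	}
	defer w.Close()

	// Illustrative metadata; the real service derived these from the container runtime.
	pod := os.Getenv("POD_NAME")
	ns := os.Getenv("POD_NAMESPACE")

	// Read container log lines from stdin, tag each with key/value metadata,
	// and forward to syslog.
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		line := fmt.Sprintf("pod=%s namespace=%s msg=%q", pod, ns, sc.Text())
		if err := w.Info(line); err != nil {
			log.Printf("syslog write failed: %v", err)
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
}
```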

These efforts paid off with a cluster that passed the internal acceptance tests, giving GitHub confidence that the same set of inputs (the Kubernetes resources used by Review Lab), the same set of data (the network services Review Lab connected to over a VPN), and the same tools would produce a similar result. In less than a week, and although most of that time was spent on internal communication and sequencing, this milestone had a very significant impact: the entire workload could be migrated from a Kubernetes cluster running on AWS to one running inside a GitHub data center.

Raising the confidence bar

With a successful and replicable Kubernetes cluster on GitHub's metal cloud, it was time to deploy "Unicorn" alongside the current front-end server pool. At GitHub, it is common practice for engineers and their teams to validate new functionality by creating a Flipper feature flag and opting into it as soon as it is viable. The deployment system was enhanced to deploy a new set of Kubernetes resources to a github-production namespace in parallel with the existing production servers, and GLB was improved to route staff requests to a different backend based on a Flipper-influenced cookie, allowing staff to opt into the experimental Kubernetes backend with a button in the mission control bar.
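The routing idea reduces to a cookie check in front of two backends. The toy Go reverse proxy below illustrates it; the cookie name and backend addresses are invented, and GLB itself is not implemented this way:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Invented backend addresses for illustration.
	legacy, _ := url.Parse("http://legacy-frontends.internal")
	kube, _ := url.Parse("http://kubernetes-frontends.internal")

	legacyProxy := httputil.NewSingleHostReverseProxy(legacy)
	kubeProxy := httputil.NewSingleHostReverseProxy(kube)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Staff who opted in via the feature flag carry this cookie.
		if c, err := r.Cookie("backend"); err == nil && c.Value == "kubernetes" {
			kubeProxy.ServeHTTP(w, r)
			return
		}
		legacyProxy.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```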

Load from internal users helped find problems, fix bugs, and build familiarity with Kubernetes in production. During this period, confidence was raised further by simulating procedures expected to be performed in the future, writing runbooks, and performing failure tests. A small amount of production traffic was also routed to the cluster to confirm performance and reliability under load, starting at 100 requests per second and later extending to 10% of github.com and api.github.com requests, pausing briefly at several points to reassess the risk of a full migration.

Cluster Groups

Some failure tests produced unexpected results. In particular, a test simulating the failure of a single apiserver node disrupted the cluster in a way that negatively impacted the availability of running workloads. Investigation of these results was not conclusive, but it helped identify that the disruption was likely an interaction between the various clients that connect to the Kubernetes apiserver (such as calico-agent, kubelet, kube-proxy, and kube-controller-manager) and the internal load balancer's behavior during an apiserver node failure. Since a degraded Kubernetes cluster could disrupt the service, the focus shifted to running the flagship application on multiple clusters per site and automating the diversion of requests away from an unhealthy cluster to the other healthy ones.
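The diversion logic amounts to a health check plus a routing decision. A toy sketch, with invented health endpoints and cluster addresses:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// Invented per-site cluster endpoints for illustration.
var clusters = []string{
	"http://kube-cluster-a.internal",
	"http://kube-cluster-b.internal",
}

// healthy reports whether a cluster's health endpoint answers 200 quickly.
func healthy(base string) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(base + "/healthz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// pickCluster returns the first healthy cluster, skipping unhealthy ones.
func pickCluster() (string, bool) {
	for _, c := range clusters {
		if healthy(c) {
			return c, true
		}
	}
	return "", false
}

func main() {
	for {
		if c, ok := pickCluster(); ok {
			log.Printf("routing to %s", c)
		} else {
			log.Print("no healthy cluster available")
		}
		time.Sleep(10 * time.Second)
	}
}
```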

Similar work was already on GitHub's roadmap to support deploying the application to multiple independently operated sites, along with other positive trade-offs. The design eventually chosen uses the deployment system's support for deploying to multiple "partitions" and enhances it with cluster-specific configuration via a custom Kubernetes resource annotation, forgoing existing federation solutions in favor of the business logic already present in GitHub's deployment system.
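The annotation-based partitioning might look roughly like the following; the annotation key and partition names are hypothetical, since GitHub has not published the exact schema:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// partitionFor reports which cluster partition a resource targets, based on
// a custom annotation that the deployment system reads before applying it.
func partitionFor(d *appsv1.Deployment) string {
	if p, ok := d.Annotations["deploy.github.example/partition"]; ok { // hypothetical key
		return p
	}
	return "default"
}

func main() {
	d := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{
			Name: "unicorn",
			Annotations: map[string]string{
				"deploy.github.example/partition": "production-partition-1", // hypothetical
			},
		},
	}
	fmt.Println(partitionFor(d))
}
```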

From 10% to 100%

With cluster groups in place, GitHub gradually converted front-end servers into Kubernetes nodes and increased the percentage of traffic routed to Kubernetes. Together with a number of other engineering teams, the front-end conversion was completed in a little over a month, while keeping performance and error rates within expected ranges throughout.

During the migration, a problem surfaced that persists to this day: under high load and/or high container utilization, some Kubernetes nodes experience kernel panics and restart. Although GitHub is not satisfied with this situation and continues to troubleshoot it with the highest priority, it is encouraging that Kubernetes automatically routes around these failures and continues to serve traffic within the target error bounds.

GitHub has run failure tests that simulate kernel panics with `echo c > /proc/sysrq-trigger`, which has proven a useful supplement.
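For completeness, the same failure injection can be scripted. A minimal sketch that must run as root on the target node, and obviously only against test capacity:

```go
package main

import (
	"log"
	"os"
)

func main() {
	// Writing "c" to sysrq-trigger immediately crashes the kernel, which is
	// exactly what this failure test wants to simulate. Root is required.
	if err := os.WriteFile("/proc/sysrq-trigger", []byte("c"), 0200); err != nil {
		log.Fatal(err)
	}
}
```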

After reading the above, have you got a handle on the best practices for migrating GitHub to Kubernetes? If you want to learn more skills or dig deeper into the topic, you are welcome to follow the industry information channel. Thank you for reading!
