
How to build K8S in a production environment


This article introduces how to build K8S in a production environment. The content is detailed; interested readers can use it for reference, and I hope it is helpful to you.

Managing services on distributed systems is one of the hardest problems operations teams face. Before relying on new software in production, it is important to learn how to operate it reliably. This article is an example of why learning to operate Kubernetes is important and why it is difficult; it draws on the post-mortem analysis of an hour-long outage caused by a Kubernetes bug.

Why choose to build on top of Kubernetes? How can Kubernetes be integrated into an existing infrastructure? The approach described here is to build (and improve) trust in the reliability of the Kubernetes cluster, as well as in the abstractions built on top of Kubernetes.

We recently built a distributed cron job scheduling system on top of Kubernetes, an exciting new platform for container orchestration. Kubernetes is very popular now and makes many exciting promises: the most exciting is that programmers do not need to know or care which machines their applications run on.

What is Kubernetes?

Kubernetes is a distributed system for scheduling programs to run in a cluster. You can tell Kubernetes to run five copies of a program, and it will schedule them dynamically onto worker nodes. Containers are scheduled automatically to increase utilization and save money, powerful deployment primitives allow new code to be rolled out step by step, and security contexts and network policies allow enterprises to run multi-tenant workloads securely.

Kubernetes has many different types of scheduling capabilities. It can schedule long-running HTTP services, daemonsets running on each machine in the cluster, hourly cron jobs, and so on.

Why Kubernetes?

Every infrastructure project starts with a business requirement; our goal was to improve the reliability and security of an existing distributed cron job system. Our requirements were:

It must be possible for a small team to set up and operate (only 2 people work full-time on the project).

Schedule about 500 different cron jobs reliably across 20 machines.

There are several reasons why we decided to build on Kubernetes:

We wanted to build on an existing open source project.

Kubernetes includes a distributed cron job scheduler that you don't have to write yourself.

Kubernetes is a very active project and regularly accepts contributions.

Kubernetes is written in Go, which is easy to learn. Almost all of our Kubernetes bug fixes were made by programmers on the team with no prior Go experience.

If we could successfully operate Kubernetes, we could build on it in the future; for example, we are currently developing a Kubernetes-based system to train machine learning models.

We had been using Chronos as the cron job scheduling system, but it no longer met our reliability requirements and is mostly unmaintained (one commit in the past 9 months, with the last merged pull request in March 2016). Because Chronos is unmaintained, we did not think it was worthwhile to keep investing in improving the existing cluster.

If you're thinking about Kubernetes, remember: don't use Kubernetes just because other companies are using it. Building a reliable cluster takes a great deal of time, and the business case for using it is not always obvious. Invest your time wisely.

What does reliability mean?

When it comes to operating services, the word "reliable" has no meaning on its own. To discuss reliability, you first need to establish an SLO (service level objective).

We have three main objectives:

99.99% of cron jobs should start running within 20 minutes of the scheduled run time. 20 minutes is a wide window, but we interviewed internal customers and no one asked for more accuracy.

Jobs should run to completion 99.99% of the time (without being terminated).

The migration to Kubernetes should not cause any customer-facing incidents.

This means:

Short downtime for the Kubernetes API is acceptable (if it is down for 10 minutes, it can be restored within 5 minutes).

Scheduling errors (a cron job run being dropped entirely and not run at all) are unacceptable. We take reports of scheduling errors very seriously.

We need to be careful with pod evictions and with terminating instances safely, so that jobs are not terminated too frequently.

A good migration plan is needed.

Set up a Kubernetes cluster

Our basic approach to building the first Kubernetes cluster was to build it from scratch rather than using tools such as kubeadm or kops, provisioning the configuration with Puppet, a common configuration management tool. Building from scratch was good for two reasons: we could deeply integrate Kubernetes into our architecture, and we developed a deep understanding of its internals.

We wanted to integrate Kubernetes into the existing infrastructure: seamless integration with our existing systems for logging, certificate management, encryption, network security, monitoring, AWS instance management, deployment, database proxies, internal DNS servers, configuration management, and more. Integrating all of these systems sometimes required a little creativity, but it was generally easier than trying to bend kubeadm / kops into doing what we wanted.

Since we already trusted and knew how to operate these existing systems, we wanted to keep using them in the new Kubernetes cluster. For example, secure certificate management is a thorny problem, and we already had a way to issue and manage certificates. With proper integration, we avoided creating a new certificate authority just for Kubernetes.

We came to understand exactly how each parameter we set affects the Kubernetes setup. For example, more than a dozen parameters were involved in configuring the certificates / CAs used for authentication. Knowing these parameters makes it much easier to debug the setup when you run into authentication problems.

Build confidence in Kubernetes

At the beginning of the project, no one on the team had used Kubernetes. How do you go from "no one has used Kubernetes" to "we are confident running Kubernetes in production"?

Strategy 0: talk to other companies

We asked other companies about their experience with Kubernetes. They all use Kubernetes in different ways or in different environments (running HTTP services, on bare metal, on Google Kubernetes Engine, etc.).

When talking about a large and complex system like Kubernetes, it's important to think carefully about your use cases, do your own experiments, build confidence in your environment, and make decisions. For example, you shouldn't read this blog post and conclude: "Stripe is successfully using Kubernetes, so it applies to us!"

Here's what we learned after communicating with several companies that operate Kubernetes clusters:

Prioritize the reliability of your etcd cluster (etcd is where all Kubernetes cluster state is stored).

Some Kubernetes features are more stable than others, so be careful with alpha features. Some companies only use a feature after it has been stable for several releases (for example, if a feature became stable in version 1.8, they wait for 1.9 or 1.10 before using it).

Consider using a managed Kubernetes service such as GKE / AKS / EKS. Building a highly available Kubernetes system from scratch is a huge task. AWS did not have a managed Kubernetes service at the time of this project, so that was not an option for us.

Watch out for the additional network latency introduced by overlay / software-defined networks.

Strategy 1: read the code

We planned to rely heavily on one component of Kubernetes, the cron job controller. This component was in the alpha phase at the time, which worried us a bit. We had tried it in a test cluster, but how could we tell whether it would be suitable for us in production?

Fortunately, all of the cron job controller's core functionality is only about 400 lines of Go. Reading quickly through the source code shows:

The cron job controller is a stateless service (like every other Kubernetes component except etcd).

Every 10 seconds, the controller calls the syncAll function: go wait.Until(jm.syncAll, 10*time.Second, stopCh)

The syncAll function fetches all the cron jobs from the Kubernetes API, iterates through the list, determines which jobs should run next, and then starts them.

The core logic seems relatively easy to understand. More importantly, if there is a bug in this controller, it may be something we can fix.
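To make this control-loop pattern concrete, here is a minimal sketch (not the controller's actual code) with the same structure, built around the wait.Until call shown above; syncCronJobs is a hypothetical stand-in for the real syncAll logic.

package main

import (
    "log"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

// syncCronJobs is a hypothetical stand-in for the controller's syncAll:
// list all cron jobs from the API server, decide which are due, start them.
func syncCronJobs() {
    log.Println("listing cron jobs, deciding which are due, starting them")
}

func main() {
    stopCh := make(chan struct{})

    // Like the cron job controller, run the sync function every 10 seconds
    // until stopCh is closed. The loop keeps no state of its own; all state
    // lives in etcd behind the API server.
    wait.Until(syncCronJobs, 10*time.Second, stopCh)
}

Because the loop keeps no state of its own, restarting it is safe: on the next tick it simply re-reads the cron jobs from the API server.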

Strategy 2: do load testing

Before we started seriously building the cluster, we did some load testing. We were not worried about how many nodes the Kubernetes cluster could handle (we planned to deploy about 20 nodes), but we did want to make sure Kubernetes could handle as many cron jobs as we wanted to run (about 50 per minute).

The test ran in a 3-node cluster with 1,000 cron jobs, each scheduled to run every minute. Each job simply ran bash -c 'echo hello world'. We chose simple jobs because we wanted to test the scheduling and orchestration capabilities of the cluster, not its total computing power.

Our test cluster could not handle 1,000 cron jobs per minute: each node would start at most one pod per second, and the cluster could run about 200 cron jobs per minute. Since we only wanted to run about 50 cron jobs per minute, we did not think these limits would be a hindrance.
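For a rough idea of how such a load test can be generated (this is a sketch, not our actual tooling), the program below writes 1,000 CronJob manifests into one file for kubectl apply; the names, the ubuntu image, and the batch/v1 API version are assumptions, since the cron job resource was still alpha at the time.

package main

import (
    "fmt"
    "os"
)

func main() {
    // Write 1,000 CronJob manifests into a single file that can be applied
    // with "kubectl apply -f cronjobs.yaml". The count and names are illustrative.
    f, err := os.Create("cronjobs.yaml")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    for i := 0; i < 1000; i++ {
        fmt.Fprintf(f, `---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: load-test-%d
spec:
  schedule: "* * * * *"   # every minute
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: hello
            image: ubuntu
            command: ["bash", "-c", "echo hello world"]
`, i)
    }
}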

Strategy 3: prioritize building and testing a highly available etcd cluster

One of the most important things when setting up Kubernetes is running etcd well. etcd is the heart of the Kubernetes cluster: it is where all of the cluster's data is stored. Everything other than etcd is stateless. If etcd is not running, no changes can be made to the Kubernetes cluster (although existing services will continue to run!).

etcd is the core of the Kubernetes cluster: the API server is a stateless REST / authentication endpoint in front of etcd, and every other component talks to etcd through the API server. When running etcd, there are two key points to keep in mind:

Set up replication so that the cluster does not die if a node is lost. We now have three etcd replicas.

Ensure adequate I/O bandwidth. A problem with our version of etcd was that a node with high fsync latency could trigger continuous leader elections, making the cluster unusable. We remedied this by ensuring that all nodes had more I/O bandwidth than the write load etcd places on them.

Setting up replication is not a set-and-forget operation. We tested carefully that an etcd node could be lost and the cluster would still recover gracefully.

Here is some of the work we did to set up the etcd cluster:

Set up replication

Monitor that the etcd service is available (see the sketch after this list)

Write simple tools so that we can easily create new etcd nodes and join them to the cluster

Patch our etcd integration so that we can run more than one etcd cluster in the production environment

Test restore from an etcd backup

Test that we can rebuild the entire cluster without downtime
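As a hedged sketch of the availability-monitoring item in the list above, the program below polls each etcd member's /health endpoint on the client port and logs the result; the endpoint names and the 30-second interval are illustrative assumptions, and a real setup would feed this into an alerting system rather than a log.

package main

import (
    "io"
    "log"
    "net/http"
    "time"
)

// Illustrative list of etcd client endpoints; replace with your own members.
var endpoints = []string{
    "http://etcd-1.internal:2379",
    "http://etcd-2.internal:2379",
    "http://etcd-3.internal:2379",
}

func checkOnce() {
    for _, ep := range endpoints {
        resp, err := http.Get(ep + "/health")
        if err != nil {
            log.Printf("etcd %s unreachable: %v", ep, err)
            continue
        }
        body, _ := io.ReadAll(resp.Body)
        resp.Body.Close()
        // etcd reports its health as a small JSON document on this endpoint.
        log.Printf("etcd %s /health -> %s", ep, body)
    }
}

func main() {
    // Poll every 30 seconds; in practice this would page on failures.
    for {
        checkOnce()
        time.Sleep(30 * time.Second)
    }
}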

We were glad we had done this testing early. One Friday morning an etcd node in our production cluster stopped responding to ping. We were alerted, terminated the node, brought up a new node, and joined it to the cluster, all while Kubernetes continued to run.

Strategy 4: gradually migrate work to Kubernetes

One of our goals is to migrate work to Kubernetes without any disruption. The secret to a successful production migration is not to avoid making mistakes (which is impossible), but to design your migration to reduce the impact of errors.

We were lucky to have a wide variety of jobs to move to the new cluster, so we could start with some low-impact jobs where one or two failures were acceptable.

Before starting the migration, we built an easy-to-use tool that could move jobs back and forth between the old system and the new system in less than five minutes if necessary. This simple tool reduced the impact of mistakes: if we migrated a job with an unplanned dependency, no big deal! We could move it back, solve the problem, and try again.

The following is our overall migration strategy:

Roughly rank the jobs according to their importance.

Repeatedly migrate batches of jobs to Kubernetes. If a new edge case is found, roll back quickly, fix the problem, and then try again.

Strategy 5: investigate Kubernetes bug and fix them

We made a rule at the beginning of the project: if Kubernetes does something unexpected, we must investigate, find out the cause, and propose a remedy.

Investigating every issue is time-consuming, but it is important. If we had simply dismissed Kubernetes's "weird behavior" as the normal workings of a complex distributed system, we would have ended up on call for a buggy cluster we did not understand.

Using this approach, we found and fixed several bugs in Kubernetes.

Here are some of the bugs we found:

Cron jobs with names longer than 52 characters could not schedule jobs.

Pods would sometimes get stuck in the Pending state forever.

The scheduler crashes every 3 hours.

Flannel's hostgw backend did not replace outdated routing table entries.

Fixing these bugs made us feel much better about using Kubernetes: not only did our cluster run better, but the project accepts patches and has a good PR review process.

Kubernetes has bugs, just like all software. In particular, we use the scheduler very heavily (cron jobs are constantly creating new pods), and the scheduler's use of caching sometimes leads to bugs, regressions, and crashes. Caching is hard! But the code base is approachable, and we have been able to handle every bug we encountered.

Kubernetes's pod eviction logic is worth mentioning. Kubernetes has a component called the node controller, which is responsible for evicting pods and moving them to another node when a node stops responding. It is possible for all nodes to become temporarily unresponsive (for example, due to a network or configuration problem), in which case Kubernetes can terminate every pod in the cluster.

If you are running a large Kubernetes cluster, read the node controller documentation carefully, think carefully about the settings (for example, the pod eviction timeout), and test extensively. Every time we tested a change to these settings by creating a network partition, something surprising happened. It is far better to discover these surprises in testing than in production.

Strategy 6: intentionally cause Kubernetes cluster problems

We have written before about running game day exercises at Stripe. The idea is to think about failures that will eventually happen in production, and then deliberately cause them in production to make sure they can be handled.

Exercises on the cluster often uncovered problems such as monitoring gaps or configuration errors. We were glad to find these problems early, rather than being surprised by them six months later.

Here are some of the game day exercises we ran:

Terminate the Kubernetes API server

Terminate all Kubernetes API servers and restore them (this is very effective)

Terminate an etcd node

Disconnect the worker nodes in the Kubernetes cluster from the API server (so that they cannot communicate). This causes all pods on those nodes to be migrated to other nodes.

It was nice to see how well Kubernetes responded to the amount of interference we threw at it. Kubernetes is designed to tolerate errors: it has an etcd cluster that stores all state, an API server that is simply a REST interface to that database, and a set of stateless controllers that coordinate all cluster management.

If any of the Kubernetes core components (API server, controller manager, or scheduler) is interrupted or restarted, then once it comes back up it reads the relevant state from etcd and continues running seamlessly. This is one of the things we hoped for, and it works well in practice.

Here are some of the problems these exercises uncovered:

"there is no paged to fix the monitoring."

"when the API server instance is destroyed and restored, human intervention is required. It is best to resolve this problem."

"sometimes when performing an etcd failover, the API server initiates a timeout request until it is restarted."

After running these tests, we developed remedies for the problems we found: we improved monitoring, fixed the configuration problems that were discovered, and submitted Kubernetes bug reports.

Make cron jobs easy to use

Let's briefly explore how we made our Kubernetes-based system easy to use.

The initial goal was to design a system for running cron jobs that our team was confident operating and maintaining. Once we had established confidence in Kubernetes, we needed our engineers to be able to easily configure and add new cron jobs. We developed a simple YAML configuration format so that users do not need to understand Kubernetes internals in order to use the system. This is the format we developed:

name: job-name-here
kubernetes:
  schedule: '15 */2 * * *'
  command:
    - ruby
    - "/path/to/script.rb"
  resources:
    requests:
      cpu: 0.1
      memory: 128M
    limits:
      memory: 1024M

Nothing special here: we wrote a simple program to convert this format into a Kubernetes cron job configuration, which we apply with kubectl.
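A minimal sketch of what the parsing half of such a converter might look like, assuming the simple format shown above; the struct layout, the input path, and the use of the gopkg.in/yaml.v2 package are illustrative assumptions, not our actual program.

package main

import (
    "fmt"
    "os"

    "gopkg.in/yaml.v2" // assumption: any YAML library would work here
)

// jobConfig mirrors the simple format shown above.
type jobConfig struct {
    Name       string `yaml:"name"`
    Kubernetes struct {
        Schedule  string   `yaml:"schedule"`
        Command   []string `yaml:"command"`
        Resources struct {
            Requests map[string]string `yaml:"requests"`
            Limits   map[string]string `yaml:"limits"`
        } `yaml:"resources"`
    } `yaml:"kubernetes"`
}

func main() {
    raw, err := os.ReadFile("job.yaml") // illustrative input path
    if err != nil {
        panic(err)
    }
    var cfg jobConfig
    if err := yaml.Unmarshal(raw, &cfg); err != nil {
        panic(err)
    }
    // A real converter would render a full CronJob manifest from these
    // fields and hand it to kubectl; here we just show the parsed values.
    fmt.Printf("job %q: schedule %q, command %v\n",
        cfg.Name, cfg.Kubernetes.Schedule, cfg.Kubernetes.Command)
}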

We also wrote a test suite to ensure that job names are not too long and that all names are unique. We do not currently use cgroups to enforce memory limits for most jobs, but we plan to introduce that in the future.
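Here is a minimal sketch of the kind of checks such a test suite performs, assuming the job names have already been loaded from the configs; loadJobNames is a hypothetical helper, and the 52-character limit corresponds to the cron job name bug mentioned earlier.

package jobs

import "testing"

// loadJobNames is assumed to return the name of every configured cron job;
// in a real test suite these would come from parsing the YAML configs.
func loadJobNames() []string {
    return []string{"sync-invoices", "rotate-keys"} // illustrative
}

// Cron jobs with names longer than 52 characters could not schedule jobs
// (the bug mentioned earlier), so enforce a maximum length.
const maxNameLength = 52

func TestJobNamesAreShortEnough(t *testing.T) {
    for _, name := range loadJobNames() {
        if len(name) > maxNameLength {
            t.Errorf("job name %q is %d characters, max is %d", name, len(name), maxNameLength)
        }
    }
}

func TestJobNamesAreUnique(t *testing.T) {
    seen := map[string]bool{}
    for _, name := range loadJobNames() {
        if seen[name] {
            t.Errorf("duplicate job name %q", name)
        }
        seen[name] = true
    }
}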

Our simple format is easy to use, and because both the Chronos and the Kubernetes cron job definitions are generated automatically from the same format, moving a job between the two systems is very simple. This was a key part of making our incremental migration work well. Whenever we migrated a job to Kubernetes, we could move it back with a simple three-line configuration change in less than ten minutes.

Monitoring Kubernetes

Monitoring the internal state of the Kubernetes cluster has been very pleasant. We use the kube-state-metrics package for monitoring, together with a small Go program called veneur-prometheus that scrapes the Prometheus metrics and publishes them as statsd metrics to our monitoring system.

For example, one chart shows the number of Pending pods in the cluster over the past hour. Pending means a pod is waiting to be assigned a worker node to run on. You can see a spike at 11:00, because many cron jobs run at the 0th minute of the hour.

There is also a monitor that checks whether any pod is stuck in the Pending state: every pod should start running on a worker node within 5 minutes, or we receive an alert.
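Our pipeline gets this signal from kube-state-metrics and veneur-prometheus, but as a rough sketch of the same idea, the program below counts Pending pods directly with client-go; the in-cluster configuration, the 60-second interval, and logging instead of emitting a statsd metric are assumptions for illustration.

package main

import (
    "context"
    "log"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    // Assumes the program runs inside the cluster with a service account
    // that is allowed to list pods.
    config, err := rest.InClusterConfig()
    if err != nil {
        log.Fatal(err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        log.Fatal(err)
    }

    for {
        // Count pods across all namespaces that are still Pending,
        // i.e. waiting to be assigned to a worker node.
        pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
            FieldSelector: "status.phase=Pending",
        })
        if err != nil {
            log.Printf("listing pending pods: %v", err)
        } else {
            // In the real setup this number would be emitted as a metric
            // and alerted on if pods stay Pending for more than 5 minutes.
            log.Printf("pending pods: %d", len(pods.Items))
        }
        time.Sleep(60 * time.Second)
    }
}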

Future plans for Kubernetes

It took five months, with three engineers working full-time, to set up Kubernetes, get it running production code smoothly, and migrate all the cron jobs to the new cluster. One important reason we invested in learning Kubernetes is that we expect to use it more widely at Stripe.

The following principles apply to running Kubernetes (or any other complex distributed system):

Define a clear business reason for your Kubernetes project, as for all infrastructure projects. Understanding the business case and the needs of users makes the project much easier.

Aggressively reduce scope. We decided to avoid many of Kubernetes's basic features in order to simplify the cluster. This let us ship more quickly; for example, because pod-to-pod networking was not a requirement for our project, we could block all network connections between nodes and defer thinking about network security in Kubernetes.

Spend a lot of time learning how to operate a Kubernetes cluster correctly. Test edge cases carefully. Distributed systems are very complex, and a lot can go wrong. Take the earlier example: depending on the configuration, the node controller can kill all the pods in the cluster if it loses contact with the API server. Learning how Kubernetes behaves after every configuration change takes time and careful attention.

By focusing on these principles, we gained the confidence to use Kubernetes in production. We will continue to expand our use of Kubernetes; for example, we are following the release of AWS EKS. We are working on another system for training machine learning models and are exploring moving some HTTP services to Kubernetes. As we continue operating Kubernetes in production, we plan to contribute to the open source project along the way.

That is all we have to share on how to build K8S in a production environment. I hope the content above has been helpful and that you have learned something new. If you think the article is good, feel free to share it so more people can see it.
