
What are the ways to improve the efficiency of Rails CI


This article introduces methods for improving the efficiency of Rails CI. The content is quite detailed; interested readers can use it as a reference, and we hope it is helpful to you.

Recently, we set a new record at Gusto: 6 minutes and 29 seconds.

This is the time it takes to run the test suite for one of Gusto's largest applications, a Rails monolith. 6 minutes and 29 seconds is the fastest time since the company's continuous integration (CI) pipeline was launched. The last time the CI suite was this fast, the company was much smaller; today, hundreds of engineers around the world work in this single Rails application, which supports 1% of small businesses in the United States.

For Gusto, a fast CI pipeline is not just for show. We see it as a competitive advantage: the faster we can deploy code, the faster we can serve our customers' businesses. As CI gets faster, so does engineering productivity; every minute shaved off CI time increases the number of pull requests per engineer per week at Gusto by 2%.

Our goal is simple. We want the speed of the test suite to be a function of a single parameter: how much we are willing to spend. Simplifying the infrastructure to this level makes cost-benefit analysis easier, for example, deciding whether it is worth the extra spend to bring the build down from 7 minutes to 5 minutes.

This article describes how we sped up the test suite, which covers a Rails monolith and a JavaScript single-page application (SPA) written mainly in React; the approach applies to any slow test suite.

My colleague Kent says that there are three steps to building software:

Make it work

Make it right

Make it fast

"Let it run" means making software that won't crash casually. The code may be obscure at this step, but it is enough to provide value to the customer and pass the test so that we can trust it. Without testing, it is difficult to judge "will it work?"

"getting it back on track" means making the code maintainable and easy to change. The code can not only run on the computer, but also be easy to understand. New engineers can easily add functionality to the code, and defects in the code should be easy to isolate and correct.

"make it run faster" means to improve the performance of the software. Why is it the last step? For financial technology companies like Gusto, if they focus on speed at the expense of quality, then our customers and ourselves are not far from bankruptcy. Not every piece of code requires excellent performance, and if a piece of code may only be executed once a day, then even if it has a "high performance" level, it is difficult to read and understand, it is also a piece of code failure.

We applied the same principles to speeding up our CI suite.

1. Make it work

Eliminate flaky tests

The first thing to do is to eliminate flaky tests from the test suite. A flaky test is a test with nondeterministic results: it sometimes passes and sometimes fails. A fast but unreliable test suite doesn't convince you that the code works; it's just a coin toss.

To allow a large engineering team to eliminate flaky tests, we adopted the following policy:

Any test that fails on the master branch is considered flaky. These tests are marked as skipped. The team that owns a flaky test can fix it and remove the skip tag when they have time.

This approach not only keeps the test suite green but also lets each team decide when to write a more deterministic test. They can start immediately, or they can wait until they next work on that feature. It also limits the damage one team's nondeterministic tests can do to other teams.
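As an illustration, here is a minimal sketch of what such a skip tag can look like in RSpec; the spec file and wording of the reason are our own assumptions, not Gusto's actual convention:

```ruby
# spec/features/payroll_run_spec.rb -- hypothetical spec, for illustration only.
RSpec.describe "Payroll run" do
  # Failed on master, so it is treated as flaky and skipped; the owning team
  # removes the skip metadata once the test is made deterministic.
  it "generates pay stubs for every employee",
     skip: "Flaky on master; owning team to fix and re-enable" do
    # ...original test body stays in place so it is easy to re-enable...
  end
end
```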

Of course, this approach raises questions; "what if we skip an important test?" is the most common. It is an important question, but it needs context. A test is marked as skipped because it fails nondeterministically, so the first thing to ask is how much confidence we really had in that test and the feature it covers. In many cases, a test is flaky because there is a bug in production!

With this approach, our rate of green builds on the main branch went from about 75% to 98%!

2. Make it right

Return to the defaults

Over time, we had gradually drifted from the default way of running RSpec tests. Staying on the defaults takes discipline. Here are some of the RSpec defaults (a minimal configuration sketch follows the list):

Reset the state between test cases. This ensures that the tests are repeatable, deterministic, and not interdependent.

Tests are executed in random order. This ensures that tests do not depend on each other and helps avoid test pollution.

Test files use the Rails autoloader. This means we load only what each test needs rather than the whole application eagerly, which helps avoid incomplete test setup.
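Here is a minimal sketch of what these defaults look like in a standard rspec-rails setup; the file layout is an assumption, not a copy of Gusto's configuration:

```ruby
# spec/rails_helper.rb -- a minimal sketch of a standard rspec-rails setup.
require File.expand_path("../config/environment", __dir__)
require "rspec/rails"

RSpec.configure do |config|
  # Wrap each example in a database transaction that is rolled back afterwards,
  # so state is reset between test cases.
  config.use_transactional_fixtures = true

  # Run examples in random order to surface hidden dependencies between tests.
  config.order = :random
  Kernel.srand config.seed
end
```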

Re-adopting these defaults was not easy. Ensuring that each test case resets its state (database, Redis values, caches, and so on) surfaced new flaky tests. Depending on the nature of each failure, we either fixed the underlying problem or marked the previously passing test as flaky.

We slowly reintroduced the RSpec defaults, and this laid the foundation for faster tests.

3. Make it fast

Introduce a cap on test file time

Our tests were unbalanced. Some test files execute in milliseconds, while others take tens of minutes. The tests that take minutes are integration tests covering some of the most important flows in our application. We want these tests to be faster, but we don't want to remove them.

Because the test suite runs in parallel across multiple nodes, we quickly hit a bottleneck on test speed.

The speed of our test suite depends on the slowest test file, so a new policy has been implemented:

The execution time of any test file cannot exceed 2 minutes.

The threshold was picked somewhat arbitrarily, but it has proven practical. Only a little over 40 files took more than 2 minutes.

After setting this boundary, we worked on the slow tests to bring them under the new threshold, and those 40-odd files all dropped below it. From then on, it is each team's responsibility to keep its test files under 2 minutes; files that take longer are marked as skipped.
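There are several ways to enforce such a budget. The following is a minimal sketch of one approach (our own illustration, not Gusto's tooling) that sums example times per spec file and warns when a file exceeds the 2-minute cap:

```ruby
# spec/support/file_time_budget.rb -- hypothetical helper, for illustration only.
FILE_BUDGET_SECONDS = 120
file_durations = Hash.new(0.0)

RSpec.configure do |config|
  config.around(:each) do |example|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    example.run
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    # Accumulate elapsed time per spec file rather than per example.
    file_durations[example.metadata[:file_path]] += elapsed
  end

  config.after(:suite) do
    file_durations.select { |_, secs| secs > FILE_BUDGET_SECONDS }
                  .each { |file, secs| warn "Over the 2-minute budget (#{secs.round}s): #{file}" }
  end
end
```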

Balance the tests according to the worst case

Now we had a reliable test suite, but it was still slow. Tests could be executed in any order, but they were assigned to nodes at random. Some nodes finished in a few seconds, while others took tens of minutes. How could we balance them?

The last problem we faced was test balancing. We evaluated two solutions:

Build a queue that feeds test cases to each node as it becomes ready. This scheme is fine in principle, but RSpec would need significant framework changes to support it, and it introduces shared state between all the parallel jobs.

Record test times in a database and, at the start of each CI run, divide the tests into buckets so that every bucket takes roughly the same amount of time.

We chose the record-and-bucket method for assigning tests to nodes because it fits well with Knapsack (https://docs.knapsackpro.com/ruby/knapsack). This approach also avoids sharing state among the many parallel jobs during a test run, which matters because a shared queue would be serving hundreds of nodes, each requesting thousands of jobs per second for a single build.

We set up a MySQL instance to record the test times of all files. At the beginning of each CI run, we generate a knapsack report based on the 99th percentile time of each test file; at the end of each run, we upload the new timings.
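A minimal sketch of how such a report might be generated; the table and column names (`test_timings`, `file_path`, `duration_seconds`) and environment variables are hypothetical, and Knapsack's report is simply a JSON map from spec file path to expected seconds:

```ruby
# Sketch only: build a Knapsack report from recorded per-file timings.
require "json"
require "mysql2"

client = Mysql2::Client.new(
  host:     ENV.fetch("TIMINGS_DB_HOST", "localhost"),
  username: ENV.fetch("TIMINGS_DB_USER", "ci"),
  database: "ci_timings"
)

report = {}
client.query("SELECT file_path, duration_seconds FROM test_timings")
      .group_by { |row| row["file_path"] }
      .each do |file, rows|
        durations = rows.map { |r| r["duration_seconds"].to_f }.sort
        # Use the 99th percentile so a few unusually slow runs set the estimate
        # rather than the average.
        report[file] = durations[(durations.length * 0.99).ceil - 1]
      end

File.write("knapsack_rspec_report.json", JSON.pretty_generate(report))
```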

Why the 99th percentile? Because we run CI on shared hardware (AWS), we have no control over the infrastructure, and the time each test file takes varies considerably between runs. We could not correlate these fluctuations with the EC2 instance type in use or with any other measurable parameter.

Instead of trying to improve the build infrastructure further, we made the system resilient to it. Organizing tests by their 99th percentile time puts a floor under test performance, rather than accepting occasional terrible runs in exchange for a better average. Even if the underlying hardware changes or the infrastructure layer misbehaves, the CI pipeline still delivers a predictable level of performance.

After implementing this strategy, we have a self-balancing system. The more tests we have, the more balanced the system becomes, and if some tests slow down over time, the buckets rebalance accordingly.

Increase parallelism

Now comes the fun part: actually making the tests faster.

The main lever here is parallelism. Since the project began, we have increased the number of parallel jobs from 40 to 130. This increases cost slightly, but it greatly increases the speed of CI. At Gusto we use Buildkite as our CI infrastructure, but the same parallelization concept applies to all major CI products.
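The Buildkite pipeline itself isn't reproduced in this article, but wiring Knapsack into N parallel jobs is mostly a matter of loading its Rake task and telling each node which slice it owns. A minimal sketch, based on Knapsack's documented Rake task and environment variables:

```ruby
# Rakefile
require "knapsack"

# Exposes the knapsack:rspec task, which reads the report, picks this node's
# bucket of spec files, and runs only those.
Knapsack.load_tasks
```

Each parallel job then runs something like `CI_NODE_TOTAL=130 CI_NODE_INDEX=$BUILDKITE_PARALLEL_JOB bundle exec rake knapsack:rspec`, so every node executes exactly one bucket of the report.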

Although we more than tripled parallelism, the cost of CI did not grow linearly with it. Why? Because we made better use of the CPU time we were already paying for: balancing the jobs across nodes left total CPU time roughly unchanged while significantly reducing wall-clock time.

Over the past few months, we have made the CI pipeline for Gusto's major applications more robust and faster.

That covers the ways to improve the efficiency of Rails CI. We hope the content above has been helpful and that you learned something from it. If you found the article useful, feel free to share it so more people can see it.
