If this is your second time seeing my articles, welcome to scan the QR code at the end to subscribe to my personal official account (Cross-border Architect).
New posts arrive on time at 11:45 every Friday, with the occasional extra serving.
This is my 85th original article.
With the Internet's vigorous growth over the past 20 years, the ceiling of access pressure that a software system may face has been raised again and again.
Even so, products serving tens or hundreds of millions of users remain the preserve of a handful of companies. Of the million-plus programmers in the industry, probably only 10% ever get to touch these "big systems."
So when capacity estimation comes up, the first reaction is often: that is a big-company problem; small systems like ours need not bother.
In fact, that is not quite true. These days marketing campaigns are everywhere and startups rack their brains to "go viral overnight," so even systems far below the tens-of-millions scale need to think about capacity estimation.
For large systems, capacity estimation is a hard requirement: it determines whether the system can hold up, and whether the invested resources are wasted to excess. At that scale, every extra percentage point is real money.
For a small system, spending a little extra to keep some redundant resources is no great problem.
Even so, Brother Z feels that whether you can do a good "capacity estimate" reflects your ability to solve problems that have no standard answer.
That is an ability many programmers lack.
So whether you are at a big company or a small one today, as long as you want to improve your architecture skills, or hope to seize an opportunity at a big company someday, this is a basic skill you must master.
Years of programming condition everyone to expect every question to be 0 or 1, true or false. But the genuinely hard problems are the ones with no standard answer, where there is no right or wrong, only suitable or unsuitable.
Moreover, people's lives are moving ever more "online." If we never pay attention to a system's load capacity, then when the long-awaited gust of opportunity finally blows, can we seize it, or only watch it slip away helplessly?
I think most people have some idea of what capacity estimation is: use data to work out the load the system must bear, then deploy a setup that meets that requirement.
For example, a big promotion is coming next month; what state must the system reach to carry the promotion smoothly?
Everyone has at least this one formula in mind:
total traffic / single-machine capacity = X machines
But I think this understanding can go a little deeper. Brother Z's view is that the essence of capacity estimation is finding a reasonable balance between technology investment and business development, pursuing a state infinitely close to "just right."
To reach "just right," you cannot work off the top of your head; you must consider as many dimensions as possible and collect data across them for reference.
Because in reality the relationship is certainly not as simply linear as the formula above; it is closer to a logarithmic curve.
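To make the contrast concrete, here is a toy sketch comparing the naive linear formula with a sub-linear capacity model. All numbers, and the 0.97 per-machine efficiency factor, are invented assumptions for illustration, not measurements:

```python
import math

def machines_linear(total_tps: float, per_machine_tps: float) -> int:
    """Naive estimate: assumes capacity scales linearly with machine count."""
    return math.ceil(total_tps / per_machine_tps)

def cluster_capacity(n: int, per_machine_tps: float, efficiency: float = 0.97) -> float:
    """Toy sub-linear model: each extra machine contributes slightly less
    than the previous one, standing in for coordination overhead."""
    return per_machine_tps * sum(efficiency ** i for i in range(n))

def machines_sublinear(total_tps: float, per_machine_tps: float) -> int:
    """Smallest n whose modeled cluster capacity reaches the target."""
    for n in range(1, 1000):
        if cluster_capacity(n, per_machine_tps) >= total_tps:
            return n
    raise ValueError("target beyond the model's reachable capacity")

print(machines_linear(10_000, 800))     # 13
print(machines_sublinear(10_000, 800))  # 16 -- overhead costs three extra machines
```

The point is only the shape of the curve: the further you are from one machine, the more the simple division undershoots.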
So how exactly should capacity planning be done?
Before we do that, we need to figure out a few concepts.
First, the metrics. We mainly focus on the following:
UV (Unique Visitor): the number of distinct visitors over a period of time; multiple visits by the same visitor in that period count only once.
PV (Page View): the number of page views over a period of time; repeated opens of the same page by the same user keep accumulating.
Response time / latency (Latency): the delay in the system's handling of a request or task (request processing time + data transfer time).
Throughput: the number of requests processed per unit of time, i.e., the total requests initiated in that window divided by the average response time; a request can be a PV, an RPC call, and so on.
TPS (Transactions Per Second): throughput measured with "one second" as the unit of time.
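These metrics fall straight out of raw request logs. A minimal sketch with made-up numbers:

```python
# Hypothetical request log over a 3-second window: (timestamp_sec, response_time_ms)
requests = [(0.1, 80), (0.4, 120), (0.9, 95), (1.2, 110), (1.8, 70), (2.5, 90)]

window_sec = 3.0
tps = len(requests) / window_sec                                # requests per second
avg_latency_ms = sum(rt for _, rt in requests) / len(requests)  # mean response time

print(f"TPS = {tps:.1f}, average latency = {avg_latency_ms:.1f} ms")
# TPS = 2.0, average latency = 94.2 ms
```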
Second, we need to know where performance overhead is incurred. It falls into three parts.
Overhead at the hardware / operating-system level, such as disk I/O, network I/O, and the CPU's switching between threads.
Overhead of running the program itself, such as code logic, locks, and so on.
Overhead of communication between processes: RPC frameworks, database access frameworks, redis/memcached access SDKs, MQ access SDKs, and so on.
Then you can start to do "capacity planning".
I usually follow the following five steps.
The first step: find out the state of the business and obtain the business metrics first.
The thing to fear most in technical work is doing it from behind the "department wall," head down in one's own little world.
So, by whatever means, first build an objective picture of the business metrics; PV and UV data are essential. You can talk to the business side, and you can also use more macro data such as the Baidu Index and WeChat Index for reference and correction.
The second step: around these business metrics, establish performance metrics for the related technical interfaces.
In essence, this means obtaining the proportional relationship between business traffic and technical performance metrics.
For example, visiting the A page once involves calling interface a twice, interface b once, and interface c three times.
There is a simple way to do it.
First, add data collection to every API interface in the system, capturing two numbers: response time and call count.
Then use a stress-testing tool such as LoadRunner to run one round of full-link load testing on the current business scenario. The number of simulated users need not be large, because we only want a ratio.
After the test ends, compare the number of requests LoadRunner recorded with the data collected at each API interface, and you get the relationship between each interface and business traffic. As a bonus, you also see the error rate, average response time, tp95, tp99, and so on under low pressure.
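A toy version of that ratio calculation, with invented counts matching the A-page example above (interface a twice, b once, c three times per view):

```python
# The load test drove 1,000 views of page A; in-system counters recorded
# how often each interface was hit during the run (hypothetical numbers).
page_views = 1_000
interface_calls = {"a": 2_000, "b": 1_000, "c": 3_000}

# Calls per page view -- the ratio we are after.
ratios = {name: calls / page_views for name, calls in interface_calls.items()}
print(ratios)  # {'a': 2.0, 'b': 1.0, 'c': 3.0}

# Projection: if the promotion is expected to drive 50,000 PV per hour,
# interface c must be sized for 50,000 * 3 = 150,000 calls per hour.
expected_pv = 50_000
print(expected_pv * ratios["c"])  # 150000.0
```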
The third step is to calibrate the standard with the help of past experience.
The real production environment is complex, because a single API interface is often called from many places.
The purpose of doing the calibration is to bring our estimates closer to the real production environment.
It is best if you have data from a case that was successfully carried in the past. Compare its UV, PV, and interface-to-traffic ratios with the current business estimates and scale in proportion.
That gives the following formula:
required TPS = TPS of the successful case * (currently estimated business traffic / traffic of the successful case) * (current interface ratio / interface ratio of the successful case)
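Plugged into code with invented numbers: suppose a past promotion peaked at 400 TPS on 200,000 PV with the interface called twice per PV, and this time we estimate 500,000 PV at 2.5 calls per PV:

```python
def required_tps(past_tps: float, est_traffic: float, past_traffic: float,
                 cur_ratio: float, past_ratio: float) -> float:
    """Calibrate the required TPS against a past, successfully handled case."""
    return past_tps * (est_traffic / past_traffic) * (cur_ratio / past_ratio)

# 400 * (500k / 200k) * (2.5 / 2.0)
print(required_tps(400, 500_000, 200_000, 2.5, 2.0))  # 1250.0
```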
If there is no successful case, you can analyze any table in the database that has a "time" field to find the highest known concurrency and the time point it occurred at (crudely: GROUP BY the time field).
Then work backwards to the PV and UV at that time point.
Then compare them with the current business estimate to see the proportional difference.
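The "GROUP BY the time field" trick can be sketched in plain Python over hypothetical order timestamps truncated to the second:

```python
from collections import Counter

# Hypothetical order-creation timestamps, truncated to the second.
timestamps = [
    "2019-06-18 20:00:01", "2019-06-18 20:00:01", "2019-06-18 20:00:01",
    "2019-06-18 20:00:02", "2019-06-18 20:00:02",
    "2019-06-18 20:00:03",
]

per_second = Counter(timestamps)          # the GROUP BY time, COUNT(*) equivalent
peak_time, peak_count = per_second.most_common(1)[0]
print(peak_time, peak_count)  # 2019-06-18 20:00:01 3
```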
required TPS = historical peak TPS (regardless of whether the system actually withstood it) * (currently estimated business traffic / business traffic at the historical peak) * (current interface ratio / interface ratio at the historical peak)
Of course, the worst case is that we never paid attention to the data before, and there is nothing at all to refer to.
Then start burying data points immediately, analyze the current system's runtime data, and obtain the current business traffic and the corresponding TPS for some period. That should not be difficult.
From this, a "required TPS" can likewise be calculated:
required TPS = measured TPS * (currently estimated business traffic / traffic in that period) * current business interface ratio
In the end, you have a performance target for each interface.
The fourth step: determine how many servers, running how many instances of the program, are needed to meet these targets.
As mentioned earlier, this is not simple division, because the relationship is not linear.
So we have to verify it.
You can load-test with one, two, and three servers respectively, and plot the measured capacity against server count as a curve (doing performance optimization along the way, of course).
From the diminishing trend of that curve, you can derive a theoretical number of program instances and servers needed to support the business.
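A minimal sketch of reading the required machine count off such measurements. All numbers are invented; beyond the measured points it extrapolates with the last observed marginal gain, which is a crude stand-in for a fitted curve:

```python
# Hypothetical load-test results: (machines, measured cluster TPS).
measurements = [(1, 800), (2, 1_520), (3, 2_170), (4, 2_760)]

def estimate_machines(measurements, required_tps):
    """Smallest measured size that meets the target; past the data,
    extrapolate with the last observed per-machine gain."""
    for n, tps in measurements:
        if tps >= required_tps:
            return n
    (n1, t1), (n2, t2) = measurements[-2], measurements[-1]
    marginal = (t2 - t1) / (n2 - n1)   # TPS the last machine added
    n, tps = n2, t2
    while tps < required_tps:
        n += 1
        tps += marginal
    return n

print(estimate_machines(measurements, 2_500))  # 4 (within measured range)
print(estimate_machines(measurements, 3_500))  # 6 (extrapolated)
```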
Of course, theory is only theory. To be safe, we still need to reserve some elastic headroom. This is the fifth step: do not cut the numbers so fine that you leave yourself no way out.
How much headroom is appropriate?
Brother Z's suggestion: analyze the business volume over a past period, look at the year-on-year growth at each peak, and take the largest year-on-year growth as the elastic proportion.
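With invented yearly peaks, taking the largest year-on-year growth as the buffer looks like:

```python
# Hypothetical peak TPS of the same promotion in past years.
yearly_peaks = {2016: 300, 2017: 420, 2018: 550, 2019: 700}

years = sorted(yearly_peaks)
growth_rates = [yearly_peaks[b] / yearly_peaks[a] - 1 for a, b in zip(years, years[1:])]
buffer_ratio = max(growth_rates)           # worst-case YoY growth

required_tps = 1_000
planned_tps = required_tps * (1 + buffer_ratio)
print(f"buffer = {buffer_ratio:.0%}, plan for {planned_tps:.0f} TPS")
# buffer = 40%, plan for 1400 TPS
```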
The elastic part does not need to be enabled 100% in advance, but it should be prepared.
At this point, you have completed the five steps of the entire capacity estimation work.
In fact, the numbers you end up with have other uses too, such as setting the program's thread count and configuring web containers (nginx, Tomcat, IIS).
In most cases these parameters get set too large; many people even set them to the maximum on a whim.
That is actually very risky: beyond the danger of resource exhaustion, it can trigger a cascade reaction in a distributed system and hurt upstream systems.
All right, let's sum it up.
This time, Brother Z first talked to you about the significance of capacity estimation.
Then I shared my own approach to capacity estimation, realized as a five-step method:
Get the business traffic metrics
Derive performance metrics for the related interfaces from call ratios
Calibrate against historical data
Estimate the number of nodes from the attenuation curve
Reserve some elastic headroom
I hope it will be helpful to you.
Recommended reading:
Concerns of Distributed Systems: All-round 360° Monitoring
Concerns of Distributed Systems: "Asynchrony"
Author: Zachary
Source: https://www.cnblogs.com/Zachary-Fan/p/capacityestimate.html
If you like this article, you can click "Recommend" in the lower left corner.
That gives me some feedback. :)
Thank you for your support.
▶ About the author: Zhang Fan (Zachary, personal WeChat: Zachary-ZF). I insist on polishing every article to be high-quality and original. Welcome to scan the QR code below.
If you are a junior programmer who wants to advance but does not know how, or a veteran programmer stuck at a bottleneck and looking to broaden your horizons, follow my official account "Cross-border Architect" and reply "Technology" to receive a mind map I have long collected and curated.
If you work in operations and feel helpless before a fast-changing market, or want to understand the mainstream operation strategies to stock your "arsenal," follow my official account "Cross-border Architect" and reply "Operation" to receive a mind map I have long collected and curated.
I regularly publish original content on architecture design, distributed systems, product, operations, and some deeper thinking.