When people first encounter availability assurance for a big promotion (a large sales event), a common question is: what exactly are we guaranteeing, and does doing the work really mean nothing will go wrong? Framed that way it sounds like a false proposition. As the people responsible for assurance, we must not only believe the work is meaningful but also have concrete things to do, which means turning an unanswerable pseudo-proposition into feasible tasks that can be refined continuously. At its root, assurance is a fight against uncertainty, and that uncertainty comes from every direction. For example: a major earthquake takes the whole data center offline, how do we respond? The engineer responsible for the core system leaves, how do we respond? A downstream interface goes down, how do we respond? A system disk fails and data is at risk of being lost, how do we respond? Most of us know, more or less, how to handle each of these cases. The process of handling this uncertainty is disaster recovery, and different 'disasters' correspond to different levels of disaster recovery.
Combating these different levels of uncertainty carries different levels of cost, so availability needs a standard. That standard is what people usually call the N nines: as N grows, the cost grows accordingly. How to meet the availability the business actually requires while spending as little as possible is therefore a topic worth thinking about. In addition, 100% minus those N nines is the unavailability, i.e. the yearly downtime budget, and that budget is consumed by every minute a fault goes unresolved. Many people care only about the nines and ignore fault-handling time (the mean time to repair, MTTR), which is a mistake: the faster you handle a fault, the higher the availability of the system.
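To make the arithmetic concrete, here is a small sketch (plain arithmetic, not figures measured from any particular system) of the yearly downtime budget behind each count of nines, and of availability expressed through MTBF and MTTR:

```python
# A small sketch of the arithmetic behind the N nines: the yearly downtime
# budget for each availability level, and availability expressed through
# MTBF (mean time between failures) and MTTR (mean time to repair).
MINUTES_PER_YEAR = 365 * 24 * 60

for n in (2, 3, 4, 5):
    availability = 1 - 10 ** (-n)                     # e.g. 3 nines -> 99.9%
    downtime = MINUTES_PER_YEAR * (1 - availability)  # the "100% minus the nines" part
    print(f"{n} nines: {availability:.5%} -> {downtime:.1f} min of downtime per year")

# Availability in terms of failure and repair times: shrinking MTTR (handling
# faults faster) raises availability just as surely as stretching MTBF does.
mtbf_hours, mttr_hours = 700.0, 0.5   # illustrative values
print(f"availability = {mtbf_hours / (mtbf_hours + mttr_hours):.5%}")
```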
Several availability-related concepts have been mentioned above; let's try to organise them around the 'event', where the event is the fault itself. The lifecycle of a fault breaks down into: beforehand (before the fault occurs), at the incident (from the fault occurring to the system or people detecting it), during the event (from detection until the fault is handled), and afterwards (after the fault is over).
Each of these stages calls for different techniques:
1. Beforehand: replicas, isolation, quotas, plans, probing
2. At the incident: monitoring and alerting
3. During the event: degradation, rollback, emergency plans
4. Afterwards: post-mortem review, reflection, technical improvements
Some of the technical concepts are explained as follows:
Replica: a stateless service cluster is an application of replicas. Because the services hold no state, they can scale horizontally, and they need a layer of proxies for unified scheduling and management, hence the reverse proxy. When the proxy detects through a heartbeat mechanism that a machine has a problem, it takes that machine offline and the other 'replica' machines continue to serve. Replica techniques are also common in the storage field, for example MySQL master/slave switchover, RabbitMQ mirrored queues, disk RAID, and partition replicas in various NoSQL systems. Almost every system that guarantees high availability carries redundant replicas.
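As an illustration, here is a minimal sketch (not any specific proxy's implementation) of how a reverse proxy might keep a pool of stateless replicas and use a heartbeat check to take unhealthy ones offline; the hostnames and the /health path are assumptions.

```python
# A minimal sketch of heartbeat-based replica management behind a reverse proxy.
import urllib.request


class ReplicaPool:
    def __init__(self, replicas, heartbeat_path="/health", timeout=1.0):
        self.replicas = list(replicas)   # all known replicas
        self.healthy = set(replicas)     # replicas currently in rotation
        self.heartbeat_path = heartbeat_path
        self.timeout = timeout
        self._rr = 0                     # round-robin cursor

    def heartbeat(self):
        """Probe every replica; drop failures, re-add recoveries."""
        for r in self.replicas:
            try:
                with urllib.request.urlopen(r + self.heartbeat_path, timeout=self.timeout):
                    pass
                self.healthy.add(r)
            except Exception:
                self.healthy.discard(r)  # take the replica offline

    def pick(self):
        """Round-robin over healthy replicas; the others keep serving."""
        alive = sorted(self.healthy)
        if not alive:
            raise RuntimeError("no healthy replica available")
        self._rr = (self._rr + 1) % len(alive)
        return alive[self._rr]


if __name__ == "__main__":
    pool = ReplicaPool(["http://app-1:8080", "http://app-2:8080"])  # hypothetical hosts
    pool.heartbeat()       # in practice this runs periodically in the background
    # print(pool.pick())   # forward the request to the chosen replica
```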
Isolation: thread isolation, process isolation, cluster isolation, machine-room isolation, read/write isolation, dynamic/static resource isolation, crawler isolation, hot-spot isolation, hardware resource isolation. All of these are forms of resource isolation: threads, processes, hardware, machine rooms and clusters are all resources; dynamic and static resources are just one way of classifying resources; hot-spot isolation separates hot resources from non-hot ones; read/write isolation is simply a matter of how resources are used, with one set of resources serving writes and another serving reads. The essence of isolation, then, is protecting resources independently. Because each resource is protected on its own, a problem with one resource does not spread to the others, which improves the availability of the service as a whole.
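The thread-isolation case, sometimes called the bulkhead pattern, can be sketched as below; the two dependency names and the pool sizes are hypothetical.

```python
# A minimal sketch of thread-pool (bulkhead) isolation: each downstream
# dependency gets its own bounded pool, so a slow dependency can exhaust
# only its own threads.
from concurrent.futures import ThreadPoolExecutor

POOLS = {
    "orders": ThreadPoolExecutor(max_workers=20),          # core path gets more threads
    "recommendations": ThreadPoolExecutor(max_workers=5),  # non-core path is capped
}


def call_isolated(dependency, fn, *args):
    """Run fn in the pool reserved for this dependency."""
    future = POOLS[dependency].submit(fn, *args)
    return future.result(timeout=2.0)  # a timeout keeps callers from piling up


# Usage: a hang in the recommendations dependency can saturate at most its own
# 5 threads, leaving the 20 "orders" threads untouched.
# call_isolated("orders", fetch_order, order_id)   # fetch_order is hypothetical
```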
Quota: quota techniques protect the system by limiting the supply of resources, thereby improving overall availability. Rate limiting is one kind of quota: by capping the level of inbound traffic it avoids outages caused by demand exceeding supply. Rate limiting comes in cluster-level and single-machine flavours; cluster-level limiting needs support from distributed infrastructure, while single-machine limiting does not. A few further points are worth considering (a single-machine sketch follows these questions):
How do we set a reasonable limit? The limit can only be decided after full-link load testing, a proper evaluation of CPU, disk, memory, IO and other capacity indicators against traffic (the relationship is not necessarily linear), combined with business estimates and operations experience.
How do we handle the traffic that gets limited? There are several approaches: first, drop it outright and soften the rejection with friendly copy; second, stay silent, commonly called lossless degradation, and serve the page from cached content; third, store the overflow and replay it asynchronously, which is typical for transactional scenarios.
Can legitimate requests be rejected by mistake? Single-machine rate limiting can reject legitimate traffic, especially when load is unevenly balanced across machines, or when the per-machine limit is set too low.
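Here is a minimal single-machine rate-limiter sketch using a token bucket; the rate and burst capacity are illustrative values, not recommendations.

```python
# A minimal single-machine rate limiter: a token bucket refilled over time.
import time


class TokenBucket:
    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec      # tokens added per second
        self.capacity = capacity      # burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        """Return True if the request may pass, False if it should be limited."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


limiter = TokenBucket(rate_per_sec=100, capacity=200)


def handle(request):
    if not limiter.allow():
        return "429: please try again later"  # dropped traffic: remind the user gently
    return f"processed {request}"             # stand-in for the real business logic


if __name__ == "__main__":
    # 500 back-to-back requests: roughly the burst capacity of 200 get through.
    print(sum(1 for i in range(500) if handle(i).startswith("processed")))
```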
Plan: plans are generally divided into advance plans (executed beforehand) and emergency plans (executed during the event). An advance plan is carried out ahead of time, for example temporarily switching the system from peak mode to an energy-saving mode; an emergency plan is executed only at the critical moment, mainly to stop the bleeding, for example a one-click disaster-recovery switchover. Plans are generally paired with switches that are pushed on and off. A plan can also be combined with rate limiting, rollback and degradation, and analysing historical faults is a good source of ideas when drafting plans.
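A minimal sketch of a switch-driven plan is below; the switch name and the idea of a config centre pushing it are assumptions, not a specific product.

```python
# A minimal sketch of a plan driven by a pushed switch: flipping the switch
# moves the system from peak mode to an energy-saving (degraded) mode.
SWITCHES = {"energy_saving_mode": False}   # in practice pushed from a config centre


def render_homepage():
    if SWITCHES["energy_saving_mode"]:
        return "cached, simplified page"   # plan engaged: shed expensive work
    return "fully personalised page"       # normal peak-mode behaviour


# Executing the plan is just pushing the switch on:
SWITCHES["energy_saving_mode"] = True
print(render_homepage())
```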
Probe: probing finds out how available the current system actually is; by itself it cannot improve availability, and done badly it can even reduce it. Load testing and drills are the most common probing techniques. Load testing is divided into full-link and single-link testing: full-link load tests are used before big promotions such as Double Eleven and require coordination across upstream and downstream systems, while single-link load tests generally verify a function or run simple scenarios to extract performance figures. A typical full-link load test proceeds as follows: set and evaluate the target, build the test environment, prepare and deploy the scripts, prepare the data, verify the link with a small amount of traffic, notify upstream and downstream system owners, warm up, run the test, evaluate and report the results, then optimise performance; the whole process is iterated until the target is reached. Drills are usually classified by scale, such as city-level disaster recovery drills, machine-room-level disaster recovery drills, cluster-level drills (DB clusters, cache clusters, application clusters, etc.), single-machine fault injection, scenario drills, and so on.
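As a rough illustration of the single-link case, here is a small load-generator sketch that fires concurrent requests at one endpoint and reports throughput and p99 latency; the target URL and request counts are placeholders.

```python
# A minimal single-link load-test sketch: N concurrent requests, then report
# throughput, mean latency and p99 latency.
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/api/ping"   # placeholder target


def one_request():
    start = time.monotonic()
    with urllib.request.urlopen(URL, timeout=5):
        pass
    return time.monotonic() - start


def run(total=200, concurrency=20):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        t0 = time.monotonic()
        latencies = sorted(pool.map(lambda _: one_request(), range(total)))
        elapsed = time.monotonic() - t0
    p99 = latencies[int(len(latencies) * 0.99) - 1]
    print(f"qps={total / elapsed:.1f}  "
          f"mean={statistics.mean(latencies) * 1000:.1f}ms  "
          f"p99={p99 * 1000:.1f}ms")


if __name__ == "__main__":
    run()
```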
Monitoring and alerting: when there is a failure, the boss usually asks three questions: why did you only discover it now? Why has it not been fixed yet? How big is the impact? Even for a failure with a large impact, if you can stop the bleeding quickly you can save some face in the review; conversely, if you fail to handle it in time, even a small failure may cost you your job. The sooner a fault is detected, the sooner it can be resolved, and the key to detecting it early is monitoring and alerting.
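A minimal sketch of a sliding-window error-rate alert is below; the 60-second window, the 5% threshold, and the notify() channel are all assumptions to be replaced by whatever the team actually uses.

```python
# A minimal sliding-window error-rate alert: record each request's outcome,
# keep only the last WINDOW_SECONDS of samples, and alert when the error
# rate in the window crosses the threshold.
import time
from collections import deque

WINDOW_SECONDS = 60
ERROR_RATE_THRESHOLD = 0.05
events = deque()   # (timestamp, is_error) pairs


def notify(message):
    print("ALERT:", message)   # stand-in for a real paging channel (SMS, IM, phone)


def record(is_error):
    now = time.time()
    events.append((now, is_error))
    while events and events[0][0] < now - WINDOW_SECONDS:
        events.popleft()       # drop samples outside the window
    errors = sum(1 for _, e in events if e)
    if events and errors / len(events) > ERROR_RATE_THRESHOLD:
        notify(f"error rate {errors}/{len(events)} over the last {WINDOW_SECONDS}s")


if __name__ == "__main__":
    for i in range(100):
        record(is_error=(i % 10 == 0))   # ~10% errors triggers the alert
```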
Degradation: the idea of degradation is rich; here we look at it only from the perspective of the call chain. The essence of degradation is sacrificing the less important to protect the essential: by temporarily giving up some functionality, the overall availability of the system is preserved. Although the system as a whole remains available after a downgrade, the trade-off means every downgrade is lossy in some way. There is no truly lossless degradation; what people usually call lossless degradation means the user experience is lossless. Degradation always happens between layers (upstream and downstream): either layer A temporarily stops calling layer B, which is fusing (circuit breaking), or layer A temporarily calls layer C instead (layer C must be reasonable
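The fusing (circuit-breaking) form of degradation just described can be sketched as follows; the failure threshold and cool-down period are illustrative.

```python
# A minimal circuit-breaker sketch: after repeated failures, layer A stops
# calling layer B for a cool-down period and serves a fallback instead.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_seconds=30):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed (calls allowed)

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_seconds:
                return fallback()                  # degraded path: skip layer B entirely
            self.opened_at = None                  # cool-down over, try layer B again
            self.failures = 0
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit: fuse the call
            return fallback()


# breaker = CircuitBreaker()
# breaker.call(call_layer_b, fallback=lambda: "cached/default result")  # call_layer_b is hypothetical
```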