How to use Hystrix to improve system availability

2025-04-05 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article looks at how to use Hystrix to improve system availability.

In today's even moderately complex Internet applications, the server side is almost always distributed: a large number of services support the whole system, those services inevitably depend on one another, and every dependency is reached over the network.

However, no service is 100% available, and the network is fragile. Will my service be dragged down when a service I depend on is unavailable? Will it be dragged down when the network is unstable? These questions do not arise in a single-machine environment, but they cannot be ignored in a distributed one. Suppose I depend on five services, each with 99.95% availability, that is, a little over 4 hours of downtime per year. If my service needs all five, its availability is at best 99.95% to the fifth power, about 99.75%, which is nearly a full day of downtime per year. Add network instability and a larger number of dependencies, and availability drops further. Given that dependencies are bound to be unavailable at some point and the network is bound to be unstable, how should we design our own services? In other words, how do we design for failure?
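The arithmetic above can be checked with a few lines of C#. This is only a sketch: the class and method names are made up for illustration, and it assumes the dependencies fail independently and that a year has 365 days.

```csharp
using System;

public static class Availability
{
    // Combined availability of a service that needs ALL of its dependencies,
    // assuming each has the same availability and they fail independently.
    public static double Combined(double perDependency, int dependencyCount)
    {
        return Math.Pow(perDependency, dependencyCount);
    }

    // Expected downtime in hours over a 365-day year for a given availability.
    public static double DowntimeHoursPerYear(double availability)
    {
        return (1.0 - availability) * 365 * 24;
    }
}
```

Availability.Combined(0.9995, 5) comes out at roughly 0.9975. DowntimeHoursPerYear gives about 4.4 hours for a single dependency and about 21.9 hours for the combined figure.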

Michael T. Nygard summarizes many patterns for improving system availability in his excellent book "Release It!". Two of them I consider especially important:

Use timeout

Use circuit breaker

First, whenever you call an external dependency over the network, you must set a timeout. When everything is healthy, a remote call within a local network returns in tens of milliseconds. When the network is congested or the dependency is unavailable, the call may take many seconds or hang indefinitely. A remote call usually occupies a thread or process. If the response is too slow or never arrives, that thread or process is stuck and will not be released any time soon, and each thread or process ties up system resources. My own service's resources can therefore be exhausted, which makes my own service unavailable. Suppose my service depends on many services and one non-core dependency becomes unavailable. Without a timeout, that non-core dependency can take my service down, even though in theory my service could work fine most of the time without it.
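As a rough illustration of the idea (not how Hystrix implements it), a blocking call can be run on a worker thread while the caller waits for only a bounded time. TimeoutGuard and CallWithTimeout are hypothetical names, and the sketch assumes it is acceptable to return a fallback value when the wait expires.

```csharp
using System;
using System.Threading.Tasks;

public static class TimeoutGuard
{
    // Runs a blocking call on a worker thread and stops waiting after the
    // timeout. The worker is NOT aborted; the caller simply stops waiting,
    // so the caller's own thread is released on time.
    public static T CallWithTimeout<T>(Func<T> remoteCall, TimeSpan timeout, T fallback)
    {
        Task<T> task = Task.Run(remoteCall);
        try
        {
            // Wait returns true only if the task finished within the timeout.
            if (task.Wait(timeout))
            {
                return task.Result;
            }
        }
        catch (AggregateException)
        {
            // The call itself threw; this sketch treats that like a timeout.
        }
        return fallback;
    }
}
```

Note the limitation: the slow call keeps occupying a worker thread until it finishes on its own, which is one reason a timeout alone is not enough and the next pattern matters.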

Circuit breakers are familiar to all of us (can you change a fuse?). Without a circuit breaker, when the current is overloaded or there is a short circuit, the circuit stays connected, the wires heat up, and the result can be a fire that burns down the house. With a circuit breaker, the fuse blows first when the current is overloaded and disconnects the circuit, so a greater disaster is avoided (although you then have to change the fuse).

When calls from our service to a dependency are timing out in large numbers, letting new requests through makes little sense; it only consumes resources needlessly. Even with a 1-second timeout, allowing more requests, say 100, to reach a dependency that is known to be unavailable wastes 1 second on each of 100 threads. A circuit breaker avoids this waste. Place a circuit breaker between our service and the dependency and have it track the outcome of calls in real time. When timeouts or failures reach a threshold (for example, 50% of requests timing out, or 20 failures in a row), the circuit breaker opens, and subsequent requests fail immediately without consuming resources. After a set interval (for example, 5 minutes) the circuit breaker tries to close again (the equivalent of replacing the fuse) to see whether the dependency is back in service.
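The mechanism just described can be sketched in a few dozen lines. This is a deliberately simplified, hypothetical SimpleCircuitBreaker, not Hystrix's implementation: it trips on consecutive failures only, treats any exception as a failure, and lets every caller through once the retry interval has elapsed instead of admitting a single trial request.

```csharp
using System;

public class SimpleCircuitBreaker
{
    private readonly int failureThreshold;
    private readonly TimeSpan retryInterval;
    private readonly object gate = new object();
    private int consecutiveFailures;
    private DateTime openedAtUtc;
    private bool isOpen;

    public SimpleCircuitBreaker(int failureThreshold, TimeSpan retryInterval)
    {
        this.failureThreshold = failureThreshold;
        this.retryInterval = retryInterval;
    }

    public bool IsOpen
    {
        get { lock (gate) { return isOpen; } }
    }

    public T Execute<T>(Func<T> call, Func<T> fallback)
    {
        bool shortCircuit;
        lock (gate)
        {
            // While open, fail fast until the retry interval has elapsed.
            shortCircuit = isOpen && DateTime.UtcNow - openedAtUtc < retryInterval;
        }
        if (shortCircuit)
        {
            return fallback();
        }

        try
        {
            T result = call();
            // Success closes the breaker and resets the failure count.
            lock (gate) { consecutiveFailures = 0; isOpen = false; }
            return result;
        }
        catch (Exception)
        {
            lock (gate)
            {
                consecutiveFailures++;
                // Trip on reaching the threshold, or re-open after a failed trial call.
                if (isOpen || consecutiveFailures >= failureThreshold)
                {
                    isOpen = true;
                    openedAtUtc = DateTime.UtcNow;
                }
            }
            return fallback();
        }
    }
}
```

With new SimpleCircuitBreaker(20, TimeSpan.FromMinutes(5)), the twentieth consecutive failure opens the breaker, and for the next five minutes Execute returns the fallback without invoking the dependency at all.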

Timeouts and circuit breakers protect our service from the unavailability of the services it depends on. For more background, see the article "Using the circuit breaker design pattern to protect software". Implementing these two patterns by hand, however, involves a fair amount of complexity. Fortunately, Netflix's open-source Hystrix framework greatly simplifies the implementation of both. Hystrix is a latency and fault tolerance library for distributed systems: it isolates points of access to remote systems, services, and third-party libraries, stops cascading failures, and keeps complex distributed systems resilient in the face of inevitable failure. A .NET port is available on CodePlex at https://hystrixnet.codeplex.com/.

To use Hystrix, you need to encapsulate calls to remote dependencies through Command:

    using System;
    using System.Net;
    using System.Xml.Linq;
    // plus the namespace of the Hystrix.NET library you reference

    public class GetCurrentTimeCommand : HystrixCommand<long>
    {
        // Last successfully fetched value, served by the fallback.
        private static long currentTimeCache;

        public GetCurrentTimeCommand()
            : base(HystrixCommandSetter.WithGroupKey("TimeGroup")
                .AndCommandKey("GetCurrentTime")
                .AndCommandPropertiesDefaults(new HystrixCommandPropertiesSetter()
                    .WithExecutionIsolationThreadTimeout(TimeSpan.FromSeconds(1.0))
                    .WithExecutionIsolationThreadInterruptOnTimeout(true)))
        {
        }

        // The remote call that Hystrix guards with a timeout and circuit breaker.
        protected override long Run()
        {
            using (WebClient wc = new WebClient())
            {
                string content = wc.DownloadString("http://tycho.usno.navy.mil/cgi-bin/time.pl");
                XDocument document = XDocument.Parse(content);
                currentTimeCache = long.Parse(document.Element("usno").Element("t").Value);
                return currentTimeCache;
            }
        }

        // Called when Run() times out or fails: return the cached value.
        protected override long GetFallback()
        {
            return currentTimeCache;
        }
    }

Then call this Command when needed:

    GetCurrentTimeCommand command = new GetCurrentTimeCommand();
    long currentTime = command.Execute();

The above is a synchronous call. If the business logic allows it and performance matters more, you can make the call asynchronously instead.

In this example, it does not matter whether WebClient.DownloadString() itself has a timeout mechanism (you may find that many remote-call APIs do not offer one). Once the call is wrapped in a HystrixCommand, the timeout is enforced. The default timeout is 1 second. You can adjust the Command's timeout in the constructor as needed, for example to 2 seconds:

    HystrixCommandSetter.WithGroupKey("TimeGroup")
        .AndCommandKey("GetCurrentTime")
        .AndCommandPropertiesDefaults(new HystrixCommandPropertiesSetter()
            .WithExecutionIsolationThreadTimeout(TimeSpan.FromSeconds(2.0))
            .WithExecutionIsolationThreadInterruptOnTimeout(true))

When a Hystrix command times out or fails, Hystrix tries to call a fallback, that is, an alternative result. To provide a fallback for a HystrixCommand, simply override the protected virtual R GetFallback() method.

By default, Hystrix allocates a dedicated thread pool to a Command, and the number of threads in the pool is fixed. This is also a protection mechanism: if you depend on many services, you do not want calls to one service to consume so many threads that none are left for calling the others. The default size of this thread pool is 10, which means at most 10 commands can run concurrently. Calls beyond that number have to queue. If the queue grows too long (by default, more than 5), Hystrix immediately falls back or throws an exception.
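The protective effect of a fixed-size pool can be approximated with a counting semaphore. The sketch below is an assumption-laden stand-in, not Hystrix's thread pool: the hypothetical Bulkhead class runs the call on the caller's own thread, has no queue, and rejects immediately when all slots are taken.

```csharp
using System;
using System.Threading;

public class Bulkhead
{
    private readonly SemaphoreSlim slots;

    public Bulkhead(int maxConcurrent)
    {
        slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);
    }

    public T Execute<T>(Func<T> call, Func<T> fallback)
    {
        // Wait(0) never blocks: it returns false at once when no slot is free.
        if (!slots.Wait(0))
        {
            return fallback();
        }
        try
        {
            return call();
        }
        finally
        {
            // Always give the slot back, even if the call throws.
            slots.Release();
        }
    }
}
```

Giving each dependency its own Bulkhead instance means a slow dependency can occupy at most its own slots, so calls to the other dependencies are unaffected.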

Depending on your needs, you may want to adjust the thread pool size of a Command. For example, if the average response time of a dependency is 200 ms and the peak load is 200 requests per second, the required concurrency is at least 0.2 x 200 = 40 (Little's Law). Allowing some headroom, the thread pool size might be set to 60:

    public GetCurrentTimeCommand()
        : base(HystrixCommandSetter.WithGroupKey("TimeGroup")
            .AndCommandKey("GetCurrentTime")
            .AndCommandPropertiesDefaults(new HystrixCommandPropertiesSetter()
                .WithExecutionIsolationThreadTimeout(TimeSpan.FromSeconds(1.0)) // 1.0 assumed from the earlier constructor
                .WithExecutionIsolationThreadInterruptOnTimeout(true))
            .AndThreadPoolPropertiesDefaults(new HystrixThreadPoolPropertiesSetter()
                .WithCoreSize(60) // size of thread pool
                .WithKeepAliveTime(TimeSpan.FromMinutes(1.0)) // minutes to keep a thread alive (though in practice this doesn't get used as by default we set a fixed size)
                .WithMaxQueueSize(100) // size of queue (but we never allow it to grow this big. This can't be dynamically changed so we use 'queueSizeRejectionThreshold' to artificially limit and reject). The value is missing in the source; 100 is an assumed placeholder.
                .WithQueueSizeRejectionThreshold(10) // number of items in queue at which point we reject (this can be dynamically changed)
                .WithMetricsRollingStatisticalWindow(10000) // milliseconds for rolling number
                .WithMetricsRollingStatisticalWindowBuckets(10)))
    {
    }
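The sizing rule used above is Little's Law: average concurrency equals arrival rate multiplied by average time in the system. A small helper makes the calculation explicit. The names are hypothetical, and the 1.5 headroom factor in the usage note is simply the ratio between the 40 and 60 in the text, not a Hystrix recommendation.

```csharp
using System;

public static class PoolSizing
{
    // Little's Law: concurrency = arrival rate x average time in the system.
    public static double RequiredConcurrency(double peakRequestsPerSecond, double averageLatencySeconds)
    {
        return peakRequestsPerSecond * averageLatencySeconds;
    }

    // Applies a headroom factor and rounds up to a whole number of threads.
    public static int SuggestedPoolSize(double peakRequestsPerSecond, double averageLatencySeconds, double headroomFactor)
    {
        double needed = RequiredConcurrency(peakRequestsPerSecond, averageLatencySeconds);
        return (int)Math.Ceiling(needed * headroomFactor);
    }
}
```

For the figures in the text, RequiredConcurrency(200, 0.2) is 40, and SuggestedPoolSize(200, 0.2, 1.5) is 60.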

So far the Hystrix circuit breaker has not been mentioned. That is because it is enabled by default and the programming interface rarely requires you to think about it; the mechanism is similar to the one described earlier. Hystrix keeps statistics on command calls and tracks the proportion of failures. By default, the circuit breaker opens when more than 50% of calls fail. For a period after that, command calls fail immediately (or go to the fallback). Five seconds later, Hystrix tries closing the circuit breaker to see whether requests are being answered properly again. The following lines of Hystrix source code show how it computes the failure counts:

    public HealthCounts GetHealthCounts()
    {
        // we put an interval between snapshots so high-volume commands don't
        // spend too much unnecessary time calculating metrics in very small time periods
        long lastTime = this.lastHealthCountsSnapshot;
        long currentTime = ActualTime.CurrentTimeInMillis;
        if (currentTime - lastTime >= this.properties.MetricsHealthSnapshotInterval.Get().TotalMilliseconds || this.healthCountsSnapshot == null)
        {
            if (Interlocked.CompareExchange(ref this.lastHealthCountsSnapshot, currentTime, lastTime) == lastTime)
            {
                // our thread won setting the snapshot time so we will proceed with generating a new snapshot
                // losing threads will continue using the old snapshot
                long success = counter.GetRollingSum(HystrixRollingNumberEvent.Success);
                long failure = counter.GetRollingSum(HystrixRollingNumberEvent.Failure); // fallbacks occur on this
                long timeout = counter.GetRollingSum(HystrixRollingNumberEvent.Timeout); // fallbacks occur on this
                long threadPoolRejected = counter.GetRollingSum(HystrixRollingNumberEvent.ThreadPoolRejected); // fallbacks occur on this
                long semaphoreRejected = counter.GetRollingSum(HystrixRollingNumberEvent.SemaphoreRejected); // fallbacks occur on this
                long shortCircuited = counter.GetRollingSum(HystrixRollingNumberEvent.ShortCircuited); // fallbacks occur on this
                long totalCount = failure + success + timeout + threadPoolRejected + shortCircuited + semaphoreRejected;
                long errorCount = failure + timeout + threadPoolRejected + shortCircuited + semaphoreRejected;
                healthCountsSnapshot = new HealthCounts(totalCount, errorCount);
            }
        }
        return healthCountsSnapshot;
    }

Here, failure counts commands that raised an error themselves, success needs no explanation, and timeout counts commands that timed out. threadPoolRejected counts command calls rejected because the thread pool was full, shortCircuited counts command calls rejected because the circuit breaker was open, and semaphoreRejected counts command calls rejected when semaphore isolation (rather than a thread pool) is in use.
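To make the connection between these counts and the 50% rule concrete, here is a minimal sketch of the comparison. HealthMath, ErrorPercentage and ShouldTrip are hypothetical names; the integer-percentage arithmetic is an assumption of this sketch, and it ignores any other conditions the real library may apply before opening the breaker.

```csharp
public static class HealthMath
{
    // Share of calls in the rolling window that count as errors, as a whole percentage.
    public static int ErrorPercentage(long totalCount, long errorCount)
    {
        if (totalCount <= 0)
        {
            return 0; // no traffic means nothing to judge
        }
        return (int)(errorCount * 100 / totalCount);
    }

    // True when the error percentage exceeds the threshold (50 in the text).
    public static bool ShouldTrip(long totalCount, long errorCount, int thresholdPercent)
    {
        return ErrorPercentage(totalCount, errorCount) > thresholdPercent;
    }
}
```

For instance, with 40 successes, 30 failures, 20 timeouts and 10 short-circuited calls, totalCount is 100 and errorCount is 60, so ShouldTrip(100, 60, 50) returns true.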

Original address: http://www.cnblogs.com/shanyou/p/4752226.html


This article is shared from the official account of Wechat-dotNET Cross platform (opendotnet).
