
Dubbo Stability Case: A Review of Nacos Registry Availability Issues

2025-01-20 Update From: SLTechnology News&Howtos

Problem description

Last Thursday night, just after I got home, I received a call from a colleague on the soft-load team saying that something was wrong in a customer's production environment. As soon as I heard the word "fault", I asked what was going on. After sorting through the details, I reconstructed how the online problem unfolded:

The customer uses Dubbo with Nacos as the registry. Starting that afternoon, some calls began to report errors. Checking the logs showed that the Nacos heartbeat request was returning 502.

2019-11-15 03:02:41.973 [com.alibaba.nacos.client.naming454]-ERROR [com.alibaba.nacos.naming.beat.sender] request xx.xx.xx.xx failed.

com.alibaba.nacos.api.exception.NacosException: failed to req API: xx.xx.xx.xx:8848/nacos/v1/ns/instance/beat. code:502 msg:

At this point the errors were not yet widespread. The user then restarted some of the machines, after which Nacos connection errors appeared on a large scale and calls began to fail with a large number of "no provider" errors.

Problem analysis

Generally speaking, a heartbeat error in Nacos has two possible causes:

There is a problem with the user's machine, such as a network failure.

Nacos Server is down.

Because the errors covered such a large area, we quickly determined that the problem was in Nacos Server itself: an aging disk had caused IO performance to drop sharply, Nacos Server could not respond to client requests, and the clients received 502 responses directly. The incident itself is not complicated: it was an outage caused by a disk failure on the registry. But it exposes a number of high-availability issues. Let's go through the details.

Reproducing the problem

Dubbo version: 2.7.4

Nacos version: 1.1.4

Goal: simulate a Nacos Server outage locally and check whether Dubbo calls are affected.

Steps to reproduce:

1. Start Nacos Server, Provider and Consumer locally, then trigger Consumer to call Provider.

2. Run kill -9 on Nacos Server to simulate an outage, then trigger Consumer to call Provider.

3. Restart Consumer, then trigger Consumer to call Provider.

Expected result:

All three calls succeed.

Actual result:

The calls in steps 1 and 2 succeeded; the call in step 3 failed.

The problem was reproduced: after Consumer is restarted, the call fails, which is exactly what the customer ran into. You may still have questions about the details, so I have framed the likely ones below and will answer them in turn.

Why do calls still succeed after Nacos goes down?

Any discussion of Dubbo involves three roles: the service provider, the service consumer, and the registry. I won't go over their relationship in detail, but the following connectivity notes give a fairly complete picture:

The registry is responsible for registering and looking up service addresses, which makes it equivalent to a directory service. Service providers and consumers interact with the registry only at startup, and the registry does not forward requests, so it is under little pressure.

The service provider registers the services it provides with the registry and reports call times to the monitoring center; these times do not include network overhead.

The service consumer obtains the provider address list from the registry, invokes providers directly according to the load-balancing algorithm, and reports call times to the monitoring center; these times include network overhead.

The registry, service providers, and service consumers are all connected by long-lived connections.

The registry detects whether a service provider is alive through the long-lived connection. If a provider goes down, the registry immediately pushes an event to notify consumers.

Registry downtime does not affect running providers and consumers, because consumers cache the provider list locally.

The registry is optional: a service consumer can connect directly to a service provider.

Focus on the second-to-last item. Dubbo caches the provider list in memory, so each call can pick an address for load balancing directly from local memory instead of querying the registry every time. The local list is updated only when the registry pushes a change because a provider node has come online or gone offline. That is why Dubbo calls still succeed after Nacos goes down.
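The caching behaviour described above can be sketched roughly as follows. This is a minimal illustration, not Dubbo's actual implementation: the class name ProviderDirectory and its methods are hypothetical, and random selection stands in for whatever load-balancing algorithm is configured.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical consumer-side directory; not one of Dubbo's real classes.
final class ProviderDirectory {
    // Last provider list pushed by the registry, held only in memory.
    private volatile List<String> providers = Collections.emptyList();

    // Called only when the registry pushes an online/offline change.
    void onRegistryPush(List<String> latest) {
        providers = Collections.unmodifiableList(new ArrayList<>(latest));
    }

    // Called on every invocation: reads local memory, never the registry,
    // so a registry outage does not affect address selection.
    String select() {
        List<String> snapshot = providers;
        if (snapshot.isEmpty()) {
            throw new IllegalStateException("no provider available");
        }
        return snapshot.get(ThreadLocalRandom.current().nextInt(snapshot.size()));
    }
}
```

Note that a freshly constructed ProviderDirectory has an empty list, which mirrors a restarted Consumer: with the registry unreachable and nothing in memory, select() fails with a "no provider" style error.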

If Nacos downtime does not affect service invocation, why are there still call errors in the log?

While the registry is down, an existing provider node may go offline unexpectedly. Because the registry cannot notify consumers, the client gets an error when it calls the offline IP.

Dubbo has safeguards for this kind of problem as well:

Dubbo performs heartbeat detection at the connection level. When a channel becomes unavailable, Dubbo disconnects it even without a notification from the registry, and sets a timer that restores its availability once the connection recovers.

EDAS, Alibaba Cloud's commercial version of Dubbo, provides an "outlier removal" feature that can promptly remove problematic nodes at the call level to keep the service available.
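The connection-level safeguard can be pictured with a small sketch. It is only an illustration under stated assumptions: ChannelHealth and its method names are hypothetical, and the heartbeat and reconnect timers are represented by plain method calls rather than Dubbo's real scheduling code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of connection-level health tracking; not Dubbo's real classes.
final class ChannelHealth {
    // Provider address -> whether its connection is currently usable.
    private final Map<String, Boolean> connected = new LinkedHashMap<>();

    synchronized void addProvider(String address) {
        connected.put(address, true);
    }

    // A missed heartbeat marks the channel unusable without any registry notification.
    synchronized void onHeartbeatTimeout(String address) {
        connected.replace(address, false);
    }

    // Would be invoked by a periodic reconnect timer; success restores availability.
    synchronized void onReconnectAttempt(String address, boolean succeeded) {
        if (succeeded) {
            connected.replace(address, true);
        }
    }

    // Only providers with a live connection are candidates for load balancing.
    synchronized List<String> usableProviders() {
        List<String> usable = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : connected.entrySet()) {
            if (entry.getValue()) {
                usable.add(entry.getKey());
            }
        }
        return usable;
    }
}
```

The point of the design is that the consumer's view of a provider can change without any push from the registry, which is what limits the damage while Nacos is down.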

Why is the call expected to succeed after Consumer restarts?

It should now be clear why Consumer can still make calls after Nacos Server goes down. But why expect calls to succeed after Consumer is restarted? Some readers may object: the registry is down, so Consumer cannot connect to it after the restart, and the call should fail. How can success be expected? This is where the Nacos local cache comes in.

The role of the Nacos local cache: when there is a network partition between the application and the service registry, or the registry is completely down, an application that restarts has no data in memory. In this case it can recover its last subscription by reading the local cache file.

For example, suppose the Dubbo application defines the service com.alibaba.edas.xml.DemoService (version 1.0.0, group DUBBO).

On this machine, the information for all services published in each namespace can be found under /home/${user}/nacos/naming/. The content has the following format:

{
  "metadata": {},
  "dom": "DEFAULT_GROUP@@providers:com.alibaba.edas.xml.DemoService:1.0.0:DUBBO",
  "cacheMillis": 10000,
  "useSpecifiedURL": false,
  "hosts": [
    {
      "valid": true,
      "marked": false,
      "metadata": {
        "side": "provider",
        "methods": "sayHello",
        "release": "2.7.4",
        "deprecated": "false",
        "dubbo": "2.0.2",
        "pid": "5275",
        "interface": "com.alibaba.edas.xml.DemoService",
        "version": "1.0.0",
        "generic": "false",
        "revision": "1.0.0",
        "path": "com.alibaba.edas.xml.DemoService",
        "protocol": "dubbo",
        "dynamic": "true",
        "category": "providers",
        "anyhost": "true",
        "bean.name": "com.alibaba.edas.xml.DemoService",
        "group": "DUBBO",
        "timestamp": "1575355563302"
      },
      "instanceId": "30.5.122.3#20880#DEFAULT#DEFAULT_GROUP@@providers:com.alibaba.edas.xml.DemoService:1.0.0:DUBBO",
      "port": 20880,
      "healthy": true,
      "ip": "30.5.122.3",
      "clusterName": "DEFAULT",
      "weight": 1.0,
      "ephemeral": true,
      "serviceName": "DEFAULT_GROUP@@providers:com.alibaba.edas.xml.DemoService:1.0.0:DUBBO",
      "enabled": true
    }
  ],
  "name": "DEFAULT_GROUP@@providers:com.alibaba.edas.xml.DemoService:1.0.0:DUBBO",
  "checksum": "69c4eb7e03c03d4b18df129829a486a",
  "lastRefTime": 1575355563862,
  "env": "",
  "clusters": ""
}

Why expect the call to succeed after the restart? Because inspection showed that the cache file on the affected production machine was intact. Although Nacos Server was down, the local cache file could still serve as a fallback, so the call was expected to succeed.
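The fallback idea can be illustrated with a small disk-backed cache. This is a sketch under assumptions, not nacos-client's implementation: SubscriptionCache, its methods, and the file-naming rule are all hypothetical, and the JSON is treated as an opaque string.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical disk-backed subscription cache; not nacos-client's real code.
final class SubscriptionCache {
    private final Path cacheDir;

    SubscriptionCache(Path cacheDir) {
        this.cacheDir = cacheDir;
    }

    // Persist the latest service data each time the registry delivers it.
    void save(String serviceName, String json) {
        try {
            Files.createDirectories(cacheDir);
            Files.write(fileFor(serviceName), json.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // After a restart memory is empty; read the last snapshot from disk if one exists.
    Optional<String> loadLast(String serviceName) {
        Path file = fileFor(serviceName);
        if (!Files.isRegularFile(file)) {
            return Optional.empty();
        }
        try {
            return Optional.of(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assumption of this sketch: replace characters that are awkward in file names.
    private Path fileFor(String serviceName) {
        return cacheDir.resolve(serviceName.replaceAll("[^A-Za-z0-9._@-]", "_"));
    }
}
```

A restarted process would construct a new SubscriptionCache over the same directory and call loadLast before it has heard anything from the registry.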

Why was the local cache file not loaded as expected after the Consumer restart?

The cache file was intact, so the problem could only lie in the logic that reads it. There were two suspects:

There might be a problem in nacos-client.

There might be a problem in Dubbo's nacos-registry.

After some investigation, and with help from the Nacos developers, we found a nacos-client parameter, namingLoadCacheAtStart, which controls whether the cache file is loaded at startup and defaults to false. In other words, nacos-client does not load local cache files by default. That pinpointed the cause of the online problem: loading of the local cache must be enabled manually before Nacos will read the local cache file.

The trade-offs of setting this parameter to true or false:

Setting it to true prioritizes availability and stability: it is better to accept data that may be stale or wrong than to have every call fail because there is no data.

Setting it to false assumes that the Server is highly available: it is better to have no data than to use data that may be wrong.

Either value is there to cover extreme situations rather than the normal case. For service registration and discovery, true is probably the more appropriate setting, so that the Nacos local cache file can be used as a fallback.
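When nacos-client is used directly rather than through Dubbo, the switch is supplied through the client's Properties. The sketch below only builds that Properties object with the JDK; the helper class NacosClientSettings is hypothetical, and the address passed in is whatever your environment uses. With nacos-client on the classpath, the result would typically be handed to NamingFactory.createNamingService(properties).

```java
import java.util.Properties;

// Hypothetical helper; only the two property keys belong to nacos-client.
final class NacosClientSettings {
    // Builds the Properties read by nacos-client when a naming service is created.
    static Properties build(String serverAddr, boolean loadCacheAtStart) {
        Properties properties = new Properties();
        properties.setProperty("serverAddr", serverAddr);
        // The parameter discussed above; it is treated as false when absent.
        properties.setProperty("namingLoadCacheAtStart", String.valueOf(loadCacheAtStart));
        return properties;
    }
}
```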

Passing registry parameters in Dubbo

Dubbo uses a unified URL model to pass parameters. Registry-related configuration parameters in the configuration file are appended to the registry address as key-value pairs. To turn on loading of the registry cache in Dubbo, the pair to append is namingLoadCacheAtStart=true.

Unfortunately, the current release of Dubbo passes only some of these parameters on to Nacos, so even if the user configures namingLoadCacheAtStart, it is not recognized and the local cache cannot be loaded. I modified Dubbo 2.7.5-SNAPSHOT locally so that the parameter is passed through, and with it the calls in steps 1, 2, and 3 all succeed, which confirms that namingLoadCacheAtStart does make Dubbo load the local cache file. This issue will be fixed in Dubbo 2.7.5, which will improve the stability of using Nacos with Dubbo.
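The difference between forwarding some parameters and forwarding all of them can be shown with a toy parser. This is not Dubbo's source code: RegistryUrlParams, the contents of the FORWARDED set, and the example address nacos://127.0.0.1:8848 are assumptions chosen purely for illustration.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

// Hypothetical illustration of registry-URL parameter handling; not Dubbo's code.
final class RegistryUrlParams {
    // Keys that the "partial" variant forwards (an illustrative choice).
    private static final Set<String> FORWARDED =
            new HashSet<>(Arrays.asList("namespace", "clusterName"));

    // Forwards only allow-listed keys; anything else in the URL is silently dropped.
    static Properties forwardSome(String registryUrl) {
        return parse(registryUrl, true);
    }

    // Forwards every key-value pair found in the query string.
    static Properties forwardAll(String registryUrl) {
        return parse(registryUrl, false);
    }

    private static Properties parse(String registryUrl, boolean allowListOnly) {
        URI uri = URI.create(registryUrl);
        Properties properties = new Properties();
        properties.setProperty("serverAddr", uri.getHost() + ":" + uri.getPort());
        String query = uri.getQuery();
        if (query == null) {
            return properties;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip malformed pairs
            }
            String key = pair.substring(0, eq);
            if (allowListOnly && !FORWARDED.contains(key)) {
                continue; // this is where a user's setting gets lost
            }
            properties.setProperty(key, pair.substring(eq + 1));
        }
        return properties;
    }
}
```

With a registry address such as nacos://127.0.0.1:8848?namingLoadCacheAtStart=true, forwardSome produces Properties without the flag, while forwardAll keeps it, which is the behavioural gap described above.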

Problem summary

This online problem shows how the availability of the Nacos registry affects Dubbo applications, and the fallback logic a system needs when one component goes down so that a single component does not paralyze the whole system.

Shortcomings in the existing code, and some best practices:

When Dubbo passes registry parameters to Nacos, only some of them are recognized, so part of the user's configuration has no effect. This will be fixed in the next version.

Parameters that affect system stability, such as the nacos-client switch for loading local cache files, should also be exposed as -D startup parameters or environment variables, so that problems can be found and the bleeding stopped quickly. In this incident, for example, the defective Dubbo code relied solely on parameter passing and could not load the local cache file; with a -D parameter, cache loading could have been forced on, greatly reducing the impact.
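One way such an override could be designed is sketched below. It is a suggestion in code form, not an existing nacos-client or Dubbo feature: StabilitySwitch, the precedence order, and the environment-variable naming rule are all assumptions of this example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Properties;

// Hypothetical switch resolver; the precedence and naming rules are assumptions.
final class StabilitySwitch {
    // Precedence: -D system property, then environment variable, then configured value.
    static boolean resolve(String key, Properties systemProps,
                           Map<String, String> env, boolean configured) {
        String fromSystem = systemProps.getProperty(key);
        if (fromSystem != null) {
            return Boolean.parseBoolean(fromSystem);
        }
        String fromEnv = env.get(envName(key));
        if (fromEnv != null) {
            return Boolean.parseBoolean(fromEnv);
        }
        return configured;
    }

    // Convenience overload that reads the real JVM properties and process environment.
    static boolean resolve(String key, boolean configured) {
        return resolve(key, System.getProperties(), System.getenv(), configured);
    }

    // Maps a camelCase key to an upper snake case environment variable name.
    static String envName(String key) {
        return key.replaceAll("([a-z0-9])([A-Z])", "$1_$2").toUpperCase(Locale.ROOT);
    }
}
```

Under this scheme an operator could start the JVM with -DnamingLoadCacheAtStart=true and force the cache on even when the value configured through the framework never reaches the client.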

Whether namingLoadCacheAtStart should be enabled by default depends on the scenario. In extreme scenarios such as a nacos-server outage, however, enabling it minimizes the impact. Incidentally, Nacos also provides a local disaster recovery file, which is different from the local cache file; interested readers can look into it.
