

How to master the high availability features of Nacos


This article introduces the high-availability features of Nacos and how to make good use of them, on both the server side and the client side.

Preface

Service registration and discovery is an enduring topic. Zookeeper, the default registry in Dubbo's early open-source days, was the first to gain widespread attention, and for a long time people simply equated "registry" with Zookeeper. The designers of Zookeeper probably never expected the product to have such a profound influence on microservices. Only when Spring Cloud became popular, bringing its own Eureka, did people realize there were other options for the registry. Later, Alibaba, always keen on open source, also turned its attention to the registry space, and Nacos was born.

Choosing a registry

Kirito's thinking when selecting a registry: in the past there was little choice; now he simply wants to pick a good one. It should preferably be open source, so that it is transparent and under his own control. Beyond being open source, it should have an active community, to ensure that features keep evolving to meet growing business needs and that problems get fixed. It should also be powerful: besides service registration and service push, it should provide the capabilities a complete microservice system needs. Most importantly, it should be stable, preferably backed by real-world use at large companies, proving it can withstand production workloads. Cloud-native features and security are, of course, also important.

Kirito's requirements for a registry may seem demanding, but with so many registries in front of users, comparisons are inevitable. As mentioned above, functionality, maturity, availability, user experience, cloud-native features, and security are all dimensions worth comparing. This article focuses on the availability of Nacos, and I hope it gives you a deeper understanding of Nacos.

Introduction to high availability

What are we talking about when we are talking about high availability?

System availability reaches 99.99%

In a distributed system, some nodes go down without affecting the overall operation of the system.

The server side is deployed as a cluster of multiple nodes.

All of these can be considered aspects of high availability. The Nacos high availability I introduce today is the set of measures Nacos takes to improve the stability of the system. It exists not only on the server side but also on the client side, and in a number of availability-related features. Put together, these points make up the high availability of Nacos.

Client retry

To unify the terminology, there are generally three roles in a microservice architecture: Consumer, Provider, and Registry. In today's registry topic, the Registry is nacos-server, while Consumer and Provider are both nacos-client.

In a production environment, we usually build a Nacos cluster and explicitly configure the cluster address in Dubbo:
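A minimal example, assuming a three-node cluster with placeholder addresses; the backup parameter is the usual Dubbo way of listing the remaining nodes (a comma-separated serverAddr works when using the Nacos client directly):

    dubbo.registry.address=nacos://192.168.0.1:8848?backup=192.168.0.2:8848,192.168.0.3:8848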

When one of the machines goes down, the client retries against the others so that the overall system is not affected.

Polling server

The logic is very simple: take the address list and try the servers one by one until a request succeeds.
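A minimal sketch of this idea in Java, assuming a hypothetical httpGet helper; this is an illustration, not the actual nacos-client source:

    import java.util.List;
    import java.util.concurrent.ThreadLocalRandom;

    // Illustrative sketch only: try each server in the address list until one call succeeds.
    public class RetrySketch {
        public String callWithRetry(List<String> servers, String api) {
            int start = ThreadLocalRandom.current().nextInt(servers.size()); // spread load across nodes
            Exception last = null;
            for (int i = 0; i < servers.size(); i++) {
                String server = servers.get((start + i) % servers.size());
                try {
                    return httpGet(server, api);
                } catch (Exception e) {
                    last = e; // this node failed, move on to the next one
                }
            }
            throw new IllegalStateException("all servers failed for " + api, last);
        }

        // Placeholder for an HTTP call to a single nacos-server node.
        private String httpGet(String server, String api) throws Exception {
            throw new UnsupportedOperationException("HTTP client omitted in this sketch");
        }
    }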

This availability guarantee exists on the nacos-client side.

The distro consistency protocol

First, a word of reassurance: don't be scared off by the phrase "consistency protocol". This section will not walk through the implementation of the protocol itself, but will focus on its availability-related aspects. Some articles describe the consistency model of Nacos as AP + CP, which is easily misunderstood. In fact, Nacos does not simply support two consistency models at once, nor does it support switching between the two models. Before introducing the consistency model, you need to understand two concepts in Nacos: ephemeral services and persistent services.

Ephemeral service: removed from the instance list when its health check fails. Commonly used in service registration and discovery scenarios.

Persistent service: marked as unhealthy, rather than removed, when its health check fails. Commonly used in DNS scenarios.

Ephemeral services use distro, a protocol Nacos designed specifically for the service registration and discovery scenario, whose consistency model is AP; persistent services use the Raft protocol, whose consistency model is CP. So rather than saying Nacos is AP + CP, it is better to qualify the statement with the type of service instance or the usage scenario.
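With the Nacos Java client, the difference shows up as the ephemeral flag on the instance being registered. A minimal sketch, assuming a local server address and a made-up service name:

    import com.alibaba.nacos.api.naming.NamingFactory;
    import com.alibaba.nacos.api.naming.NamingService;
    import com.alibaba.nacos.api.naming.pojo.Instance;

    public class RegisterDemo {
        public static void main(String[] args) throws Exception {
            NamingService naming = NamingFactory.createNamingService("127.0.0.1:8848");

            Instance instance = new Instance();
            instance.setIp("10.0.0.1");
            instance.setPort(20880);
            // true  -> ephemeral instance: removed when heartbeats stop (distro, AP)
            // false -> persistent instance: marked unhealthy instead of removed (Raft, CP)
            instance.setEphemeral(true);

            naming.registerInstance("com.example.DemoService", instance);
        }
    }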

What does the distro protocol have to do with high availability? In the previous section we mentioned that the client retries after a nacos-server node goes down, but that assumes a premise: that nacos-server can still work properly with one node missing. A stateful application like Nacos is different from an ordinary stateless web application; it is not the case that the cluster can serve requests as long as any single node survives. It has to be discussed case by case, and it is tied to the design of the consistency protocol. The distro protocol works as follows:

When a Nacos node starts, it first pulls a full copy of the data from the other nodes.

Every Nacos node is a peer; each can handle write requests and synchronizes new data to the other nodes.

Each node is responsible for only part of the data and periodically sends checksums of its data to the other nodes to keep the data consistent.

Cluster normal state

As shown in the figure above, each node is responsible for reads and writes of only part of the services, yet every node can accept read and write requests, which gives two cases (a simplified routing sketch follows this list):

If a node receives a request for a service it is responsible for, it handles the read or write directly.

If a node receives a request for a service it is not responsible for, the request is routed within the cluster and forwarded to the responsible node, which completes the read or write.
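A simplified illustration of this routing idea in Java; the real distro implementation is more involved, and all names here are made up:

    import java.util.List;

    // Simplified distro-style routing sketch, not the actual Nacos implementation.
    public class DistroRouterSketch {
        private final List<String> members; // cluster node list, same order on every node
        private final String self;

        public DistroRouterSketch(List<String> members, String self) {
            this.members = members;
            this.self = self;
        }

        // Each service is owned by exactly one node, chosen by hashing the service name.
        public String responsibleNode(String serviceName) {
            int idx = (serviceName.hashCode() & Integer.MAX_VALUE) % members.size();
            return members.get(idx);
        }

        // Handle the request locally if this node owns the service, otherwise forward it.
        public void write(String serviceName, String payload) {
            String owner = responsibleNode(serviceName);
            if (owner.equals(self)) {
                applyLocally(serviceName, payload);     // then replicate asynchronously to peers
            } else {
                forwardTo(owner, serviceName, payload); // internal cluster routing
            }
        }

        private void applyLocally(String serviceName, String payload) { /* omitted */ }
        private void forwardTo(String node, String serviceName, String payload) { /* omitted */ }
    }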

When a node goes down, responsibility for the services it used to own is transferred to other nodes, so the overall availability of the Nacos cluster is preserved.

Some nodes are down.

A more complex situation is that the node is not down, but there is a network partition, as shown in the following figure:

Network partition

This situation can hurt availability: from the client's point of view, a service may appear to exist at one moment and not exist the next.

To sum up, Nacos's distro consistency protocol ensures that in most cases the failure of machines in the cluster does not hurt overall availability. This availability guarantee exists on the nacos-server side.

Local cache files and the failover mechanism

One of the worst registry failures is the entire server cluster going down, and even then Nacos still has a high-availability mechanism.

A classic Dubbo interview question: while a Dubbo application is running, the Nacos registry goes down; does this affect RPC calls? Most people can answer this one: no, because Dubbo keeps the addresses in memory. On the one hand this is a performance decision, since it is impossible to hit the registry on every RPC call; on the other hand it also ensures availability (even if the Dubbo designers may not have had that factor in mind).

Building on that, here is another Dubbo interview question: the Nacos registry is down and the Dubbo application is restarted; does this affect RPC calls? If you understand the failover mechanism of Nacos, you will arrive at the same answer as before: no.

Nacos has a local file caching mechanism. After receiving a service push from nacos-server, nacos-client keeps a copy in memory and then stores a snapshot on disk. The default snapshot storage path is {USER_HOME}/nacos/naming/

Nacos snapshot file directory

This file serves two purposes: it lets you verify that the server pushed the service data correctly, and, when the client loads services and cannot pull the data from the server, it falls back to loading from this local file.

The fallback takes effect only if this parameter is passed when building NacosNaming: namingLoadCacheAtStart=true

Dubbo 2.7.4 and above support passing this Nacos parameter. It is enabled like this: dubbo.registry.address=nacos://127.0.0.1:8848?namingLoadCacheAtStart=true

In a production environment, it is recommended to enable this parameter to avoid the instability caused by services becoming unavailable when the registry goes down. In service registration and discovery scenarios, availability and consistency are a trade-off, and most of the time we give priority to availability.
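For applications that use the Nacos client directly rather than through Dubbo, the same parameter can be passed when building the NamingService. A minimal sketch with a placeholder server address and service name:

    import java.util.Properties;
    import com.alibaba.nacos.api.naming.NamingFactory;
    import com.alibaba.nacos.api.naming.NamingService;

    public class CacheAtStartDemo {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("serverAddr", "127.0.0.1:8848");
            // Fall back to the local snapshot files when the server cannot be reached at startup.
            props.put("namingLoadCacheAtStart", "true");

            NamingService naming = NamingFactory.createNamingService(props);
            System.out.println(naming.getAllInstances("com.example.DemoService"));
        }
    }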

An attentive reader will also notice that, besides the cache files, there is a failover folder under {USER_HOME}/nacos/naming/{namespace} whose contents mirror the snapshot folder. This is another failover mechanism of Nacos: the snapshot restores services as they looked at some point in the past, while the files in failover can be edited manually to cope with extreme scenarios.

This availability guarantee exists on the nacos-client side.

Heartbeat synchronization service

Heartbeat mechanisms are widely used in distributed systems to confirm liveness. Ordinarily a heartbeat request is designed differently from a normal request: it is kept as lean as possible so that periodic probing does not degrade performance. In Nacos, however, for the sake of availability, a heartbeat carries the full service information. Compared with sending a bare probe, this lowers throughput somewhat but improves availability. Consider the following two scenarios:

All nacos-server nodes go down and the service data is lost. Even when nacos-server comes back up, it cannot restore the services on its own; because the heartbeat carries the full service information, the services are re-created as heartbeats arrive, preserving availability.

A network partition occurs among the nacos-server nodes. Because heartbeats can create services, basic availability is still guaranteed even under extreme network failures.
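To make this concrete, here is a rough sketch of what an instance heartbeat carries: a full description of the instance rather than a bare liveness flag. The field names are illustrative, not the exact structure used by the Nacos client:

    import java.util.Map;

    // Illustrative heartbeat payload: because it describes the whole instance,
    // a server that has lost its data can recreate the service from heartbeats alone.
    public class HeartbeatPayloadSketch {
        String serviceName; // e.g. "providers:com.alibaba.edas.boot.EchoService:1.0.0:DUBBO"
        String ip;
        int port;
        double weight;
        String cluster;
        Map<String, String> metadata; // full instance metadata travels with every beat
    }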

The following is a test of heartbeat-based service synchronization, carried out on a Nacos cluster provided by Alibaba Cloud MSE.

Call the OpenAPI to delete each service in turn:

    curl -X DELETE "mse-xxx-p.nacos-ans.mse.aliyuncs.com:8848/nacos/v1/ns/service?serviceName=providers:com.alibaba.edas.boot.EchoService:1.0.0:DUBBO&groupName=DEFAULT_GROUP"

About 5 seconds later, the service is registered again, which matches our expectation that heartbeats re-register services.

High availability in the cluster deployment model

The last Nacos high-availability feature I want to share comes from its deployment architecture.

Number of nodes

We know that running Nacos in standalone mode is definitely not acceptable for a production cluster, so the first question is: how many machines should we deploy? As mentioned earlier, Nacos has two consistency protocols: distro and Raft. The distro protocol has no split-brain problem, so in theory two or more nodes suffice; Raft's voting mechanism calls for 2n+1 nodes (a 2n+1-node cluster tolerates n node failures, so 3 nodes tolerate 1 failure and 5 tolerate 2). Overall, 3 nodes is the minimum choice; for higher throughput and availability you can go to clusters of 5, 7, or even 9 nodes.

Multi-availability zone deployment

When choosing the Nacos nodes that make up the cluster, two factors should be considered:

Network latency between nodes must not be too high, otherwise data synchronization will suffer.

The data centers and availability zones hosting the nodes should be spread out as much as possible to avoid a single point of failure.

Taking Alibaba Cloud ECS as an example, placing nodes in different availability zones of the same region is a good practice.

Deployment mode

Deployment falls into two modes: Kubernetes (K8s) deployment and ECS deployment.

The advantage of ECS deployment is simplicity: buy three machines and build a cluster. If you are familiar with Nacos cluster deployment this is not difficult, but it does not solve the operations problem. If a Nacos node runs into an OOM or a disk failure, it is hard to take it out of service quickly, and the cluster cannot maintain itself.

The advantage of K8s deployment is its strong cloud-native operations capability: nodes that go down can recover automatically, keeping Nacos running smoothly. As mentioned earlier, Nacos is not a stateless web application but a stateful one, so deploying it on K8s usually relies on components such as StatefulSet and an Operator to handle deployment and day-to-day operations of the Nacos cluster.

High-availability best practices of MSE Nacos

Alibaba Cloud MSE (Microservices Engine) provides hosted Nacos clusters and implements the high-availability cluster deployment model described above.

When a multi-node cluster is created, the nodes are placed in different availability zones by default. This is transparent to users: they only need to care about Nacos functionality, while MSE takes care of availability for them.

Under the hood, MSE deploys Nacos on K8s. There have been incidents where nodes went down because of misuse of Nacos, but thanks to the self-healing operations of K8s, the failed nodes were brought back so quickly that users may not even have noticed the outage.

Let's simulate a node outage scenario to see how K8s can achieve self-recovery.

A three-node Nacos cluster:

Normal state

Execute kubectl delete pod mse-7654c960-1605278296312-reg-center-0-2 to simulate a partial node outage scenario.

Recovering

About 2 minutes later, the node recovers and the roles change: the Leader moves from node 2 to node 1.

Leader reselection after recovery

This concludes "how to master the high availability features of Nacos". Thank you for reading.
