Huya TV's practice and summary in the transformation of micro-service

Source: Shulou (Shulou.com), SLTechnology News&Howtos, 06/03 report. Updated 2025-07-19.

Compared with text and images, live streaming offers a richer form of communication between people, and it places heavy demands on platform stability. How does Huya TV (hereinafter "Huya", literally "Tiger teeth"), which advocates "technology-driven entertainment", use technology to empower entertainment? This article introduces Huya's practice in DNS, service registration, CMDB, and the service configuration center.

This article is based on a talk given by Zhang Bo (Community ID: zhangjimmy), head of the middleware team of Huya's Basic Security Department, at the Dubbo Meetup Guangzhou Station salon. It is published with authorization from Alibaba Middleware. The talk covered the following topics:

- Why choose Nacos?
- Technical value and application scenario of DNS-F
- Practice of service registration
- Application and practice of CMDB
- Practice of service configuration

Why choose Nacos?

Huya has followed Nacos since v0.2 (the latest version at the time of the talk was Pre-GA v0.8) and has also taken part in building the community, so we can be counted among its relatively early enterprise users.

Nacos is a dynamic service discovery, configuration, and service management platform that makes it easier to build cloud native applications. It provides "registry", "configuration center", and "dynamic DNS service" functions.

First, in Huya's microservice landscape there were originally multiple registries, each serving one part of the microservices. What was missing was a way to integrate these registries, connect them to one another, and build one large registry that could manage the entire microservice system.

The following is an excerpt from the comparison of service registry options we made when considering the introduction of Nacos:

Nacos provides the DNS-F function and can be integrated with K8S, Spring Cloud, Dubbo, and other open source products to implement service registration.

Second, when selecting a service configuration center, we wanted the configuration center and the registry to be connected, so that we could save some of the investment in microservice governance. We therefore also compared several open source configuration center solutions at the same time:

These included Spring Cloud Config Server, ZooKeeper, and etcd. After an overall evaluation based on the current state of our microservice architecture and our business scenarios, we decided to use Nacos as our service registration and service discovery solution. While using it, we found that as the community kept releasing new versions and Huya's practice deepened, Nacos offered far more advantages than we had identified during our research. Next, I will share Huya's practice with DNS-F, Nacos-Sync, CMDB, and load balancing.

Technical value of DNS-F

The first technical value of the DNS-F function provided by Nacos is that it gives our internal microservices the global dynamic scheduling capability they lacked. As mentioned above, Huya has multiple microservice systems, but because they are all independent of one another, none of them can schedule dynamically at a global level. So far we have integrated four microservice registries through Nacos, and the ultimate goal is to bring all microservices together to achieve global dynamic scheduling.

Second, DNS-F addresses the end-to-end challenges that services face: high latency, inaccurate resolution, and slow traffic redirection when a fault occurs.

What does this mean in practice?

When several microservice systems coexist inside a company, their maturity differs. For example, some microservice frameworks do not support same-data-center routing or CMDB-based routing. When a service is registered in multiple IDCs, a caller may be routed to a node in another data center even though a provider is available in its own. This adds needless latency and makes resolution inaccurate.

Even with resolution optimizations based on DNS, we still cannot fully eliminate the latency and inaccurate resolution. This is because DNS picks the nearest answer according to IP-based policies and cannot route according to the physical state and physical information of the service. In addition, when a core service has a problem and there is no unified registry that brings together the information of all callers and callees, it is hard to judge accurately where traffic should be redirected, so redirection is slow. With Nacos, services can connect to a unified registry and configuration center, which solves these problems. (Huya is still in the process of transforming its microservice systems, and the unified registry has not yet been fully realized.)

Third, DNS-F provides traffic steering for dedicated lines. Traffic between Huya's core data centers is carried over dedicated lines. A dedicated line (Direct Connect) is a physical resource, and ours may not be built on the scale of those at BAT (Baidu, Alibaba, and Tencent). For example, our dedicated-line capacity has only 50% redundancy. Suppose a live stream becomes extremely popular and its burst traffic is 200 times the usual level, which exceeds the capacity the dedicated line was built for. In this case a single service could bring down the network. With a global registry and scheduling capability, however, we can shift the traffic elsewhere, for example to the public network, or even to an address that does not exist. Even if one service has a problem, it will not affect our services globally.
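The overflow reasoning above can be illustrated with a small calculation. This is only a sketch: the function `split_traffic`, the Gbps figures, and the choice of the public network as the overflow target are hypothetical and are not taken from Huya's system.

```python
def split_traffic(demand_gbps: float, line_capacity_gbps: float) -> dict[str, float]:
    """Keep traffic on the dedicated line up to its capacity; steer the rest elsewhere."""
    on_line = min(demand_gbps, line_capacity_gbps)
    overflow = max(0.0, demand_gbps - line_capacity_gbps)
    return {"dedicated_line": on_line, "public_network": overflow}


# Hypothetical numbers: usual load of 10 Gbps, line built with 50% redundancy.
usual_gbps = 10.0
capacity_gbps = usual_gbps * 1.5

print(split_traffic(usual_gbps, capacity_gbps))      # everything fits on the line
print(split_traffic(usual_gbps * 3, capacity_gbps))  # a burst overflows the line
```

A scheduler with a global view could turn the overflow share into resolution weights, so that only the excess traffic leaves the dedicated line.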

Fourth, DNS-F supports a variety of scheduling requirements on the server side, including same-data-center routing, same-machine routing, and same-rack routing, all of which Nacos can accommodate. In addition, based on the DNS-F function of Nacos, we have also accelerated external domain name resolution and achieved traffic redirection within seconds when a service fails.

Application scenario of DNS-F

The figure shows the concrete implementation of Nacos DNS-F, which intercepts DNS requests at the OS layer. If the requested domain name belongs to an internal service, the result is obtained from Nacos Server. Otherwise, the request is forwarded to another LocalDNS for resolution.
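The routing decision just described can be sketched as follows. This is not the DNS-F implementation: the `REGISTRY` snapshot, the `.internal.example` suffix used to recognize internal names, and the stubbed `forward_to_local_dns` function are all assumptions made for illustration.

```python
import random

# Hypothetical snapshot of instances the agent has obtained from the registry.
REGISTRY = {
    "user-service.internal.example": ["10.0.0.11", "10.0.0.12"],
}
INTERNAL_SUFFIX = ".internal.example"  # assumption: internal names share a suffix


def forward_to_local_dns(domain: str) -> str:
    """Stand-in for forwarding the query to an upstream LocalDNS."""
    return "203.0.113.10"  # documentation-range address, not a real lookup


def resolve(domain: str) -> str:
    """Answer internal names from the registry and forward everything else."""
    if domain.endswith(INTERNAL_SUFFIX):
        instances = REGISTRY.get(domain)
        if not instances:
            raise LookupError(f"no instance registered for {domain}")
        return random.choice(instances)
    return forward_to_local_dns(domain)


print(resolve("user-service.internal.example"))
print(resolve("www.example.org"))
```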

Take high availability of the database as an example. Switching our database used to be inefficient: it relied on the business side modifying its configuration, the time required was unpredictable, and it usually took more than 10 minutes. (Note: our database already has master/slave capability, but whenever a master has a problem we have to switch IPs, and switching IPs depends on cooperation between operations and development, which is a long process.)

After the introduction of DNS-F, when a master has a problem, the domain name can quickly be pointed at another master IP to mask the fault. Node fault detection and failover complete automatically, without relying on cooperation between operations and development, which saves time. Of course, there are many solutions to this scenario. MySQL-Proxy could also solve this problem, for example, but our MySQL-Proxy is still under construction and we wanted to solve the problem as soon as possible, so we used the DNS approach.
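A minimal sketch of DNS-based database failover is shown below, assuming the resolver knows each node's role and latest health-check result. The `DbNode` type, the role names, and the preference order are hypothetical and do not describe Huya's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class DbNode:
    ip: str
    role: str      # "master" or "standby" (hypothetical role names)
    healthy: bool  # result of the latest health check


def resolve_db(nodes: list[DbNode]) -> str:
    """Return the IP that the database domain name should currently point to."""
    # Prefer a healthy master; otherwise fall back to a healthy standby.
    for role in ("master", "standby"):
        for node in nodes:
            if node.role == role and node.healthy:
                return node.ip
    raise RuntimeError("no healthy database node available")


nodes = [DbNode("10.0.1.1", "master", True), DbNode("10.0.1.2", "standby", True)]
print(resolve_db(nodes))   # the master while it is healthy
nodes[0].healthy = False   # the health check detects a fault
print(resolve_db(nodes))   # the name now resolves to the other node
```

Because applications connect by domain name, they pick up the new address on their next resolution without any configuration change on the business side.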

Next, let's focus on how we optimized LocalDNS based on DNS-F. Huya has not yet built its own LocalDNS and mostly uses public DNS services, roughly composed as follows.

This setup has a problem. Suppose a service suddenly fails and then recovers, and we cannot reproduce the failure to find its cause. In many such cases the cause is a public DNS request that timed out or even failed to resolve, but because the scene was not preserved at that moment, the problem cannot be traced.

According to our monitoring data, about 1‰ of DNS resolutions fail, and the timeout rate is even higher. This means that with public DNS there is at least a 0.1% chance that a service request will time out or fail. If the service is not fault tolerant, an exception will occur. The latency of public DNS resolution is also unstable. On some of the poorer Amazon nodes, for example, the average delay exceeds 30 to 40 milliseconds.

We then optimized LocalDNS based on DNS-F. The results are as follows:

Average resolution time fell from more than 200 milliseconds to less than 2 milliseconds.

Cache hit rate rose from 92% to more than 99%.

The resolution failure rate used to be 1‰ and is now essentially zero.

The effect of the optimization also shows in our risk control service. Its average delay fell by 10 ms, and the proportion of service timeouts fell by 25%. This lowers the risk that images or text uploaded by users go unreviewed because of delays or service timeouts.

Practice of service registration

Huya's core business runs on Tars.

Tars is an open source microservice framework from Tencent.

Tars mainly supports C++ clients, and its support for Java, PHP, and other languages is weak, which makes it awkward for our non-C++ business teams to call its services. After introducing Nacos, we use the DNS protocol that Nacos supports so that service discovery works from every language.

Of course, Nacos is not just a registry. It can integrate multiple data centers and supports synchronization from multiple data sources. For example, we already synchronize service registrations from Taf (an important microservice system within Huya), Nacos itself, ZooKeeper, and K8S.

At the same time, using the two-way synchronization function between Nacos clusters (Nacos-Sync), we synchronize data between two domestic availability zones and multiple overseas availability zones, so that a service registers once and can be read from multiple locations.

Nacos-Sync is event-driven. Synchronization tasks can be switched on and off flexibly through events. Listeners are then triggered by service changes to keep the data up to date in real time, and periodic full synchronization events ensure the eventual consistency of the service data. Nacos-Sync also supports service heartbeat maintenance: heartbeats across multiple data centers can be relayed remotely, with Nacos-Sync acting as a proxy. In addition, heartbeats can be bound to synchronization tasks, which makes them easy to control.
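The combination of event-triggered updates and periodic full synchronization can be sketched roughly as follows. This is not Nacos-Sync code: `Registry` and `SyncTask` are hypothetical in-memory stand-ins, and heartbeat handling is left out.

```python
class Registry:
    """In-memory stand-in for one registry cluster."""

    def __init__(self) -> None:
        self.services: dict[str, set[str]] = {}


class SyncTask:
    """Copies selected services from a source registry to a destination registry."""

    def __init__(self, source: Registry, dest: Registry) -> None:
        self.source, self.dest = source, dest
        self.enabled: set[str] = set()  # services currently switched on for syncing

    def on_task_event(self, service: str, enable: bool) -> None:
        """Switch synchronization of one service on or off."""
        if enable:
            self.enabled.add(service)
        else:
            self.enabled.discard(service)

    def on_service_changed(self, service: str) -> None:
        """Incremental path: push one changed service right away."""
        if service in self.enabled:
            self._copy(service)

    def full_sync(self) -> None:
        """Periodic path: re-copy every enabled service to repair missed changes."""
        for service in self.enabled:
            self._copy(service)

    def _copy(self, service: str) -> None:
        self.dest.services[service] = set(self.source.services.get(service, set()))


source, dest = Registry(), Registry()
task = SyncTask(source, dest)
task.on_task_event("order-service", enable=True)
source.services["order-service"] = {"10.0.0.1"}
task.on_service_changed("order-service")      # real-time path
source.services["order-service"].add("10.0.0.2")  # a change whose event was lost
task.full_sync()                              # the scheduled full sync catches up
print(dest.services["order-service"] == source.services["order-service"])
```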

Because tens of thousands of services are registered on Taf and the volume of synchronization is especially large, we modified Nacos-Sync to keep the synchronization of all these services available through task sharding. The steps were to first define tasks at service granularity, then distribute the task load across multiple shards, and finally run multiple replicas of each shard to keep its tasks available.
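The three steps above can be sketched as follows. The hash-based placement and the rule that the first live replica runs a shard's tasks are assumptions chosen for illustration. The source does not say how Huya assigns tasks to shards or elects a replica.

```python
import hashlib


def shard_of(service: str, shard_count: int) -> int:
    """Map a service name to a shard with a stable hash."""
    digest = hashlib.md5(service.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def assign(services: list[str], shard_count: int) -> dict[int, list[str]]:
    """Group per-service synchronization tasks by shard."""
    shards: dict[int, list[str]] = {i: [] for i in range(shard_count)}
    for service in services:
        shards[shard_of(service, shard_count)].append(service)
    return shards


def pick_worker(replicas: list[str], alive: set[str]) -> str:
    """Within one shard, let the first live replica run the tasks."""
    for worker in replicas:
        if worker in alive:
            return worker
    raise RuntimeError("all replicas of this shard are down")


services = [f"svc-{i}" for i in range(10000)]  # hypothetical service names
shards = assign(services, shard_count=4)
print({shard: len(tasks) for shard, tasks in shards.items()})
print(pick_worker(["worker-a", "worker-b"], alive={"worker-b"}))
```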

Integrating with CMDB to achieve nearest access

When services are deployed in multiple data centers or regions, the latency of cross-region service access is often high. The typical network latency between data centers in the same city is about 1 ms, while the latency between cities, such as Shanghai to Beijing, is about 30 ms. A natural idea is therefore to have service consumers access service providers in the same region.

Nacos defines an SPI interface containing the methods agreed on with third-party CMDBs. After implementing this SPI interface according to the contract, the user packages the implementation as a JAR, places it in the Nacos installation directory, and restarts Nacos. The data in Nacos and the CMDB is then connected.
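Once location data from the CMDB is available, nearest access comes down to filtering providers by shared labels. The sketch below illustrates the idea only. It is not the Nacos SPI, and the `CMDB` table, label names, and IP addresses are hypothetical.

```python
# Hypothetical CMDB data: IP address -> location labels.
CMDB = {
    "10.1.0.5": {"region": "guangzhou", "room": "gz-a"},
    "10.1.0.6": {"region": "guangzhou", "room": "gz-b"},
    "10.2.0.7": {"region": "beijing", "room": "bj-a"},
}


def nearest(consumer_ip: str, providers: list[str]) -> list[str]:
    """Prefer providers in the same room, then the same region, then any provider."""
    me = CMDB[consumer_ip]
    for label in ("room", "region"):
        same = [p for p in providers if CMDB[p][label] == me[label]]
        if same:
            return same
    return providers


# A consumer in room gz-a: no provider shares its room, one shares its region.
print(nearest("10.1.0.5", ["10.1.0.6", "10.2.0.7"]))
```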

In practice, we connected DNS-F to Taf by implementing Taf's central control interface on DNS-F, which integrates seamlessly with the Taf SDK. DNS-F caches the load balancing and instance information, while Nacos provides a query API for the load balancing information.

Practice of service configuration

Huya's domain name (www.huya.com) is served from multiple IDC data centers in South China, Central China, and North China, and Nginx is deployed in each data center for load balancing. The load-balanced traffic returns to our backend servers over dedicated lines. If we change a configuration item along this path, the change has to be delivered to the hundreds of load balancing machines across those data centers. If the configuration is not delivered in time, or delivery fails, an outage may follow. These load balancing machines also need to be highly elastic: if capacity cannot be expanded quickly during a business peak, network-wide failures become likely.

The traditional way to distribute configuration is for a server to push files to the machines, which makes configuration updates slow. The server must know the machines in the load balancing cluster in advance, so this metadata has to be synchronized before a new machine can take traffic, and scaling out takes a long time.

After introducing Nacos, we switched to listening on the configuration center. Clients actively listen for configuration updates, so a change takes effect within seconds. Newly added nodes actively pull the full configuration, which shortens the time before they can take traffic by more than 3 minutes.
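The listening model can be sketched with a small in-memory example. `ConfigCenter` below is a hypothetical stand-in rather than the Nacos client API, and the key name and configuration strings are made up.

```python
from typing import Callable

Listener = Callable[[str], None]


class ConfigCenter:
    """In-memory stand-in for a configuration center."""

    def __init__(self) -> None:
        self._values: dict[str, str] = {}
        self._listeners: dict[str, list[Listener]] = {}

    def get(self, key: str) -> str:
        """Full pull, as done by a node that has just been added."""
        return self._values[key]

    def listen(self, key: str, callback: Listener) -> None:
        """Register a client callback for changes to one key."""
        self._listeners.setdefault(key, []).append(callback)

    def publish(self, key: str, value: str) -> None:
        """Store the new value and notify every listening client."""
        self._values[key] = value
        for callback in self._listeners.get(key, []):
            callback(value)


center = ConfigCenter()
applied: list[str] = []                       # what an existing node has applied
center.listen("lb.upstreams", applied.append)
center.publish("lb.upstreams", "backend-a,backend-b")
print(applied)                                # the existing node was notified
print(center.get("lb.upstreams"))             # a new node pulls the full config
```

Compared with pushing files, the publisher no longer needs a list of target machines, because each machine subscribes to the keys it cares about.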

Huya's summary of its modifications and upgrades to Nacos

The modifications and upgrades we made while introducing Nacos are summarized below.

First, on DNS-F, we added pre-caching of external domain names. We connected the Agent's monitoring data to the company's internal monitoring system and its log output to the internal log service, integrated it with the company's CMDB, and implemented a DNS-F cluster. We built the DNS-F cluster to avoid DNS outages caused by memory, disk, or version problems. With the cluster in place, when the local Agent has a problem, DNS requests can be proxied to the cluster and resolved there.
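The fallback from the local Agent to the cluster can be sketched as follows. The resolver functions and addresses are hypothetical. The source does not describe how the failure of the local Agent is detected, so this sketch simply treats a connection error as the signal.

```python
from typing import Callable

Resolver = Callable[[str], str]


def resolve_with_fallback(domain: str, local_agent: Resolver, cluster: Resolver) -> str:
    """Ask the local agent first and use the cluster if the agent fails."""
    try:
        return local_agent(domain)
    except OSError:
        # The local agent is unreachable or broken, so proxy through the cluster.
        return cluster(domain)


def broken_agent(domain: str) -> str:
    raise ConnectionError("local agent is down")  # ConnectionError is an OSError


def cluster_resolver(domain: str) -> str:
    return "10.0.0.53"  # hypothetical answer served by the cluster


print(resolve_with_fallback("user-service.internal.example", broken_agent, cluster_resolver))
```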

Second, on Nacos-Sync, we integrated the Taf registration service and the K8S registration service, and solved the problem of ring synchronization across multiple data centers.

Third, on Nacos CMDB, we extended it to integrate Huya's own CMDB and internal load balancing strategy.

About the author: Zhang Bo (Community ID: zhangjimmy) is a Nacos Committer, head of the middleware team of Huya's Basic Security Department, and an Alibaba Cloud MVP.
