CTOnews.com, December 25: Alibaba Cloud today released its "Description of the Service Interruption in Alibaba Cloud Hong Kong Region Availability Zone C", addressing the large-scale service interruption that occurred in Availability Zone C of the Hong Kong Region on December 18. Following a post-incident review, Alibaba Cloud gives a further account of the fault, its analysis of the problem, and the improvement measures it will take.
CTOnews.com notes that in the statement Alibaba Cloud publicly apologizes to all customers affected by the failure and says it will handle compensation matters as soon as possible.
Alibaba Cloud said the service interruption in Availability Zone C of the Hong Kong Region had a significant impact on many customers' businesses and was the longest large-scale failure in more than a decade of Alibaba Cloud's operation.
The following is the full text of Alibaba Cloud's "Description of the Service Interruption in Alibaba Cloud Hong Kong Region Availability Zone C":
On December 18, 2022 (Beijing time), a large-scale service interruption occurred in Alibaba Cloud Hong Kong Region Availability Zone C. Following a post-incident review, we would like to give a further account of the fault, our analysis of the problem, and the improvement measures we will take.
Handling process

At 08:56 on December 18, Alibaba Cloud monitoring detected a temperature-control alarm in a data hall aisle of the Hong Kong Region Availability Zone C data center. Alibaba Cloud engineers began emergency handling and notified the data center service provider to investigate on site. At 09:01, Alibaba Cloud monitoring detected temperature-rise alarms in several data halls of the facility, and engineers found that the chillers were operating abnormally. At 09:09, the data center service provider performed a 4+4 active/standby switchover and restart of the abnormal chillers according to the emergency plan, but the operation failed and the chillers could not be restored to normal. At 09:17, following the fault-handling process, the emergency plan for abnormal cooling was activated, with auxiliary heat dissipation and emergency ventilation. Engineers attempted to isolate the chillers and manually restore the chiller control system one by one, but found that it could not run stably, and contacted the chiller equipment supplier for an on-site inspection. By this point, some servers had begun to be affected by the high temperature.
From 10:30 onward, to avoid fire-safety risks caused by the high temperature, Alibaba Cloud engineers progressively reduced the load on the computing, storage, network, database, and big data clusters across the entire data center. During this period, the chillers were repeatedly brought back into operation but could not run stably.
At 12:30, the chiller equipment supplier arrived on site. Under the joint diagnosis of multiple engineers, manual water replenishment and air bleeding were performed on the cooling towers, cooling-water pipelines, and chiller condensers, but the system still could not run stably. Alibaba Cloud engineers began shutting down servers in some overheating data halls. At 14:47, while the supplier was still having difficulty troubleshooting the equipment, the high temperature in one data hall triggered its fire sprinklers. At 15:20, after on-site manual adjustment and reconfiguration by the supplier's engineers, the chiller group control was unlocked so that each unit could run independently; the first chiller returned to normal and temperatures began to drop. Engineers then brought the remaining chillers back in the same way, and at 18:55 four chillers had returned to normal cooling capacity. At 19:02, servers were started in batches while the temperature was continuously monitored, and at 19:47 the temperature in the data halls stabilized. Meanwhile, Alibaba Cloud engineers began restarting services and performing the necessary data-integrity checks.
By 21:36, most of the servers in the data center had been started and checked, and the temperature remained stable. One data hall had not been powered back on because its fire sprinklers had been triggered. Because maintaining data integrity is critical, engineers carefully verified the data safety of the servers in this hall, which took additional time. At 22:50, after completing data checks and risk assessment, power restoration and server startup proceeded step by step, hall by hall, as safety was confirmed.
Service impact

At 09:23 on December 18, some ECS servers in Alibaba Cloud Hong Kong Region Availability Zone C began to go down, triggering automatic migration of the affected instances within the same availability zone. As temperatures continued to rise, the number of affected servers grew, customer business began to suffer, and the impact extended to more cloud services in Hong Kong Availability Zone C, such as EBS, OSS, and RDS.
The failure in Alibaba Cloud Hong Kong Availability Zone C did not directly affect customer workloads running in the other Hong Kong Availability Zones, but it did affect the normal use of the ECS control plane in the Hong Kong Region. Because a large number of Availability Zone C customers were purchasing ECS instances in the other Hong Kong Availability Zones, the ECS control service began throttling requests at 14:49 on December 18, and its availability dropped to as low as 20%. When customers created new ECS instances through the RunInstances / CreateInstance APIs and specified a custom image, some instances failed to start even though the purchase succeeded: the custom-image data service depends on the single-AZ redundant version of the OSS service in Availability Zone C, so the failure could not be resolved by retrying. Some DataWorks and Kubernetes user console operations were also affected. The APIs were fully restored to availability at 23:11 that day.
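The report draws a useful distinction here: throttled control-plane calls are transient and can be retried, while a failure rooted in a single-AZ image-data dependency cannot be fixed by retrying. The sketch below is a minimal, hypothetical client-side illustration of that distinction; the `run_instances` stub, the error classes, and their behavior are assumptions for illustration, not Alibaba Cloud's SDK.

```python
import random
import time

class ThrottledError(Exception):
    """Raised when the control plane is rate limiting requests."""

class ImageDataUnavailableError(Exception):
    """Raised when the custom image's underlying data cannot be read."""

def run_instances(image_id: str) -> str:
    """Hypothetical stand-in for an ECS RunInstances call; simulates throttling."""
    if random.random() < 0.5:
        raise ThrottledError("control plane is rate limiting")
    return "i-" + image_id[-4:]

def create_instance_with_backoff(image_id: str, max_attempts: int = 5) -> str:
    """Retry throttled calls with exponential backoff, but do not retry
    failures caused by unavailable image data, since retries cannot fix them."""
    for attempt in range(max_attempts):
        try:
            return run_instances(image_id)
        except ThrottledError:
            # Transient condition: back off and retry once load drops.
            time.sleep((2 ** attempt) + random.random())
        except ImageDataUnavailableError:
            # Image data sits on single-AZ storage in the failed zone; retrying cannot help.
            raise
    raise RuntimeError("control plane still throttled after retries")

print(create_instance_with_backoff("m-hypothetical1234"))
```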
At 10:37 on December 18, part of the OSS storage service in Alibaba Cloud Hong Kong Availability Zone C began to be affected by server outages. Customers would not necessarily notice this at first, but continued high temperature could cause disk failures and endanger data safety, so engineers shut down the servers and suspended the service from 11:07 until 18:26. Alibaba Cloud provides two types of OSS service in Availability Zone C of the Hong Kong Region: the locally redundant LRS service (commonly called single-AZ redundancy), which is deployed only in Availability Zone C, and the zone-redundant ZRS service for same-city redundancy (commonly called 3-AZ redundancy), which is deployed across Availability Zones B, C, and D. In this failure, the zone-redundant ZRS service was essentially unaffected. The locally redundant LRS service in Availability Zone C was interrupted for a relatively long time; because it does not support cross-AZ failover, it had to wait for the failed data center to recover. Starting at 18:26, the storage servers were restarted in batches, although some servers of the single-AZ LRS service had to remain isolated because of the fire-sprinkler incident. Before restoring the service we had to ensure data reliability, which meant spending more time on integrity checks. This portion of the OSS service (the single-AZ redundant service) was not restored to external service until 00:30 on December 19.
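To make the availability difference between the two redundancy types concrete, here is a small self-contained sketch; the bucket names and zone labels are illustrative placeholders, not the OSS API. An LRS bucket keeps all replicas in one zone and must wait for that zone to recover, while a ZRS bucket keeps serving as long as one of its three zones survives.

```python
from dataclasses import dataclass

@dataclass
class Bucket:
    name: str
    zones: frozenset  # availability zones that hold a full copy of the data

def available_during_zone_failure(bucket: Bucket, failed_zone: str) -> bool:
    # The bucket can still serve requests if at least one replica zone is healthy.
    return bool(bucket.zones - {failed_zone})

lrs_bucket = Bucket("images-lrs", frozenset({"C"}))            # single-AZ (local) redundancy
zrs_bucket = Bucket("images-zrs", frozenset({"B", "C", "D"}))  # same-city 3-AZ redundancy

print(available_during_zone_failure(lrs_bucket, "C"))  # False: must wait for zone C to recover
print(available_during_zone_failure(zrs_bucket, "C"))  # True: zones B and D keep serving
```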
A small number of Alibaba Cloud single-availability-zone network products (such as VPN, PrivateLink, and a small number of GA instances) were affected by this failure. At 11:21 on December 18, engineers began cross-availability-zone disaster-recovery failover for network products; by 12:45, failover had been completed for most network products such as SLB, and at 13:47 failover of the NAT products was completed. Apart from the small number of single-AZ products mentioned above, network products maintained business continuity throughout the failure, with NAT experiencing only minute-level service loss.
Starting at 10:17 on December 18, unavailability alarms were raised for some RDS instances in Alibaba Cloud Hong Kong Region Availability Zone C. As the range of hosts affected by the fault expanded, the number of instances with service exceptions increased, and engineers started the emergency database switchover process. By 12:30, cross-availability-zone instances of RDS MySQL, Redis, MongoDB, DTS, and other database services had completed cross-AZ switchover. For some single-AZ instances, as well as high-availability instances whose nodes were all within the single AZ, only a limited number could be effectively migrated because they relied on data backups within that AZ. A small number of RDS instances that do support cross-AZ switchover did not complete the switchover in time. On investigation, these RDS instances relied on a proxy service deployed in Availability Zone C of the Hong Kong Region; because the proxy service was unavailable, the instances could not be reached through their proxy addresses. We helped customers recover by temporarily switching to direct access through the RDS primary-instance address. As the cooling equipment in the data center recovered, most database instances returned to normal around 21:30. For single-node instances affected by the failure, and for high-availability instances whose primary and standby were both located in Availability Zone C of the Hong Kong Region, we provided temporary recovery options such as instance cloning and instance migration; however, limited by underlying service resources, some instances ran into anomalies during migration and recovery that took a long time to resolve.
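The temporary workaround described above, switching from the proxy address to the primary-instance address, can be approximated on the application side. The following is a minimal sketch assuming a MySQL-compatible RDS instance and the pymysql driver; the hostnames are hypothetical placeholders, and real endpoints would come from the RDS console.

```python
import pymysql

# Hypothetical endpoints for illustration only.
PROXY_HOST = "myinst.proxy.rds.example.com"
PRIMARY_HOST = "myinst.primary.rds.example.com"

def connect_with_fallback(user: str, password: str, database: str):
    """Try the proxy endpoint first; if it is unreachable (as during the
    incident, when the proxy service in zone C was down), fall back to the
    primary-instance address so the application can keep working."""
    for host in (PROXY_HOST, PRIMARY_HOST):
        try:
            return pymysql.connect(host=host, port=3306, user=user,
                                   password=password, database=database,
                                   connect_timeout=3)
        except pymysql.err.OperationalError:
            continue  # this endpoint is unreachable; try the next one
    raise RuntimeError("neither the proxy nor the primary endpoint is reachable")
```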
We noted that customers running their business across multiple Availability Zones at the same time were able to keep their business running during this incident. For customers whose business requires very high availability, we continue to recommend adopting an end-to-end, multi-availability-zone architecture (see the sketch below) to cope with all kinds of contingencies.
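As a simple illustration of the multi-AZ recommendation, the sketch below probes health endpoints in several zones and routes to the first healthy one. The endpoints and health-check path are assumptions for illustration; a production setup would typically rely on SLB, DNS failover, or service discovery rather than hand-rolled probing.

```python
import urllib.error
import urllib.request

# Hypothetical per-AZ service endpoints behind which replicas are deployed.
ENDPOINTS = {
    "hk-b": "https://app-hk-b.example.com/health",
    "hk-c": "https://app-hk-c.example.com/health",
    "hk-d": "https://app-hk-d.example.com/health",
}

def first_healthy_endpoint(timeout: float = 2.0) -> str:
    """Return the first endpoint whose health check answers 200, so traffic
    can be steered away from a failed availability zone."""
    for zone, url in ENDPOINTS.items():
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except (urllib.error.URLError, TimeoutError):
            continue  # this zone is unhealthy or unreachable; try the next one
    raise RuntimeError("no healthy availability zone found")
```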
Problem analysis and improvement measures

1. Long recovery time of the cooling-system failure. Cause analysis: A lack of water in the data center's cooling system caused an air lock that obstructed water circulation, which caused the four primary chillers to malfunction; the four standby chillers also failed to start because the primary and standby units share the same water-circulation system and were affected by the same air lock. After the water tray was replenished, the group-control logic of the cooling system made it impossible to start the chillers independently, so the chiller configuration had to be modified manually; only after the chillers were switched from group control to independent operation could they be started one by one, which prolonged the recovery of the cooling system. Over the whole process, locating the cause took 3 hours 34 minutes, replenishing water and bleeding air took 2 hours 57 minutes, and unlocking the group-control logic to start the four chillers took another 3 hours 32 minutes.
Improvement measures: Comprehensively review the data center's infrastructure management and control systems; broaden the coverage and improve the precision of monitoring data collection to speed up fault troubleshooting and localization. At the facility control-logic level, ensure that the system's automatic switchover logic behaves as expected, while also ensuring the correctness of manual switchover, so that an internal state deadlock cannot block fault recovery.
2. Fire sprinklers triggered during on-site handling. Cause analysis: As the data center's cooling system failed, temperatures in the data halls rose gradually until the temperature in one hall reached the critical threshold and triggered its fire sprinkler system. The power cabinets and several rows of racks were soaked and some hardware was damaged, which increased the difficulty and duration of subsequent recovery.
Improvement measures: Strengthen management of the data center service providers; review the temperature-rise contingency plans and their standardized execution steps; clarify the plans for shutting down customer workloads and forcibly cutting power to the data center under temperature-rise scenarios, keeping them as simple and effective as possible; and reinforce execution through regular drills.
3. Failure of ECS control operations for customers purchasing new instances in Hong Kong. Cause analysis: The ECS management and control system is deployed for disaster recovery across two data centers, in Availability Zones B and C. After Availability Zone C failed, Availability Zone B continued to provide service. Because a large number of Availability Zone C customers were purchasing new instances in the other Hong Kong Availability Zones, together with the traffic introduced by recovery actions attempting to bring ECS instances in Availability Zone C back up, the control-service resources in Availability Zone B became insufficient. The middleware services that newly scaled-out ECS control nodes depend on at startup were deployed in the Availability Zone C data center, so capacity could not be expanded for a long time. In addition, the custom-image data service that ECS control depends on relied on the single-AZ redundant OSS service in Availability Zone C, so newly purchased instances with custom images failed to start.
Improvement measures: Conduct a network-wide inspection and comprehensively optimize the high-availability design of multi-AZ products to remove dependencies on single-AZ OSS and single-AZ middleware. Strengthen disaster-recovery drills for Alibaba Cloud's management and control plane to further improve the high-availability disaster recovery and failover capability of cloud products.
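A dependency inspection of this kind can start from something as simple as an inventory of where each critical dependency is deployed. The sketch below, with hypothetical service and dependency names rather than Alibaba Cloud internals, flags dependencies that live in only one availability zone.

```python
# Map each service to its critical dependencies and the zones those dependencies occupy.
DEPENDENCIES = {
    "ecs-control-plane": {
        "middleware-cluster": {"C"},   # single-AZ: would block control-plane scale-out
        "custom-image-store": {"C"},   # single-AZ OSS: blocks instance start with custom images
        "api-gateway": {"B", "C"},
    },
}

def single_az_dependencies(service: str) -> list:
    """Return the dependencies of `service` deployed in only one availability zone."""
    return [dep for dep, zones in DEPENDENCIES[service].items() if len(zones) == 1]

print(single_az_dependencies("ecs-control-plane"))
# ['middleware-cluster', 'custom-image-store']
```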
4. Failure information was not released in a timely and transparent manner. Cause analysis: After the failure occurred, Alibaba Cloud used DingTalk groups, announcements, and other channels for notification, but because on-site chiller repairs progressed slowly, there was not enough effective information to share, and updates to the Status Page were not timely, which caused confusion for customers.
Improvement measures: Improve our ability to quickly assess failures and identify their impact on customers. Launch the new version of the Alibaba Cloud Service Status Page as soon as possible to speed up the release of information, so that customers can more easily understand how a failure event affects each product and service.
Finally, we would like to apologize publicly to all customers affected by this failure, and we will handle compensation matters as soon as possible. This service interruption in Availability Zone C of the Hong Kong Region had a significant impact on many customers' businesses and was the longest large-scale failure in more than a decade of Alibaba Cloud's operation. Stability is the lifeline of cloud services and is critical to our customers. We will do our utmost to learn from this incident, continue to improve the stability of our cloud services, and live up to the trust our customers place in us.
Alibaba Cloud
25 December 2022