
Best Practices of the Haoyun Changsheng Liquid-Cooled Data Center in the New Era of Liquid Cooling

2025-01-30 Update From: SLTechnology News & Howtos


Shulou(Shulou.com)11/24 Report--

Haoyun Changsheng Guangzhou No. 2 Cloud Computing Base is the first large-scale commercial liquid-cooled data center in South China. It uses cold plate liquid cooling technology to cut costs and improve efficiency for AI computing workloads: computing performance is improved by 10%, GPU chip maintenance costs are reduced by 50%, and InfiniBand (IB) cable investment is reduced by 30%.

Driven by low-carbon and digitalization goals, GPU resources will remain in high demand.

Data centers are an important infrastructure base for the national information strategy, and their development directly affects how that strategy lands. Policy, economy, society and technology are all providing new momentum for the high-quality development of the data center industry. The 14th Five-Year Plan states clearly that by 2025 the added value of the digital economy's core industries should account for 10% of GDP, while energy consumption per unit of GDP should fall by 13.5%. The plan makes clear that China's digital economy needs not only rapid development but also high-quality development.

In March 2023, OpenAI released the GPT-4 model, pushing the application of artificial intelligence to a new level. The model performed "beyond the human level" in many professional tests, was "more creative and collaborative than ever before", and "can solve problems more accurately". ChatGPT received more than 1 billion visits a month. At the same time, industries of all kinds are actively exploring how to combine artificial intelligence with their own business, such as Microsoft connecting ChatGPT to Office 365, with a dramatic boost in productivity.

This wave of AI has also hit the underlying computing infrastructure. Deep neural network (DNN) algorithms require large numbers of parallel convolution operations, and GPUs match this characteristic very well. Driven by business demand, and compounded by export restrictions on the A100, GPU cards for large-model training have become hard to find, with prices changing by the day. The price of an 8-card H100 server went from 600,000 to 1.5 million in only three months. The GPU shortage is likely to persist: OpenAI's GPT-4 was trained on roughly 10,000-25,000 A100s, while GPT-5 will probably require 30,000-50,000 H100s.

Low carbon and high density: air cooling retreats, liquid cooling advances

This series of changes in the macro environment has had a great impact on the direction of the data center industry. Can the cooling side adapt to this change? In our view, air cooling no longer matches the new business requirements well.

First of all, air cooling cannot cope well with the PUE challenge, and provinces already have clear guidance on data center PUE. Taking Guangdong Province as an example, the Guangdong Provincial Department of Industry and Information Technology issued the Overall Layout Plan for Guangdong 5G Base Stations and Data Centers (IDC) (2021-2025), which requires new data centers to have a PUE no higher than 1.3, a very challenging target for Guangdong's hot and humid climate.
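For reference, PUE (power usage effectiveness) is the ratio of total facility energy to IT equipment energy. A minimal sketch with purely hypothetical load figures shows how little overhead a 1.3 cap allows:

```python
# PUE = total facility power / IT equipment power.
# The load figures below are hypothetical, for illustration only.
it_load_kw = 1000.0          # assumed IT equipment load
cooling_kw = 250.0           # assumed cooling overhead
distribution_loss_kw = 60.0  # assumed UPS and power distribution losses

pue = (it_load_kw + cooling_kw + distribution_loss_kw) / it_load_kw
print(f"PUE = {pue:.2f}")    # 1.31 -> already over a 1.3 cap
```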

Secondly, the heat dissipation efficiency and cooling precision of air cooling are not high enough. GPU power consumption inevitably trends toward high density: a single Nvidia A100 card draws close to 400 W, with a chip heat flux density of roughly 50 W per square centimeter, and a 4U server approaches 5.5 kW. Nvidia's newer compute cards such as the H800 deliver roughly three times the computing power of the previous generation at only about twice the price, but power consumption nearly doubles as well: a single card draws close to 700 W, heat flux density reaches about 87.5 W per square centimeter, and a 4U server approaches 9 kW. Computing hardware keeps drawing more power and chip heat flux density keeps rising, so traditional air cooling struggles to keep up (the heat flux arithmetic is reproduced in a short sketch after the list below):

1. Air-cooled cooling efficiency is low, so it is not suited to high-power cabinets. A closed-aisle air-cooled containment supports a reasonable power range of about 4-6 kW per cabinet, while a single 4U H800 server approaches 9 kW, so air-cooled refrigeration falls short for such high-density equipment. With only a few servers, they can be spread across separated cabinets as a stopgap, but in large-scale computing scenarios this sparse deployment cools poorly. Some customers even remove the GPU server chassis cover to increase the heat dissipation area; this approach has not been verified by professional CFD simulation, so it is unsafe and wastes cabinet resources.

2. Air-cooled refrigeration cannot cool the heat source (the GPU) precisely enough. The chip heat flux that pure air distribution can support is limited to about 10 W per square centimeter, which cannot meet the heat dissipation requirements of the H800. When a chip runs hot for long periods, its performance declines: for the same Nvidia server model, the liquid-cooled version outperforms the air-cooled version by about 10%. At the same time, according to the "ten-degree rule", the failure rate of electronic components roughly doubles for every ten degrees above room temperature, so GPU spare-part consumption rises, which in turn increases the computing cost over the whole life cycle.
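The heat flux figures quoted before this list are consistent with dividing the card power by a heat-dissipation contact area of roughly 8 square centimeters; the area is our inference from the stated numbers, not a figure given in the article. A minimal sketch:

```python
# Heat flux = chip power / heat-dissipation contact area.
# The ~8 cm^2 contact area is inferred from the article's figures, not stated in it.
contact_area_cm2 = 8.0

for label, power_w in [("A100-class card (~400 W)", 400.0),
                       ("H800-class card (~700 W)", 700.0)]:
    flux_w_per_cm2 = power_w / contact_area_cm2
    print(f"{label}: {flux_w_per_cm2:.1f} W/cm^2")  # ~50 and ~87.5 W/cm^2
```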

In practice, one often sees a low aisle temperature but a high chip temperature. Long-term high-temperature operation shortens GPU life and lowers performance, raising both economic and time costs, so air cooling is clearly not the best fit for computing scenarios. Liquid cooling removes heat directly with a coolant of high specific heat capacity, and this efficient way of dissipating heat is gradually coming into everyone's view.

A liquid cooling solution is the optimal choice for GPU computing power.

Haoyun Changsheng Guangzhou No. 2 Cloud Computing Base is located in Panyu District, Guangzhou, Guangdong Province, at the center of the Greater Bay Area and of the intelligent automobile industry (a "double center"). The project is designed to the national CQC Class A standard and is positioned as an intelligent-manufacturing AI computing base. It is the first large-scale commercial liquid-cooled data center in South China, supporting cabinet power densities of 8-19 kW and above with a single-system PUE below 1.1, and providing a reliable digital infrastructure base for the high-quality development of intelligent manufacturing and AI supercomputing in South China.

Basic principle of cold plate liquid cooling

Cold plate liquid cooling is a non-contact liquid cooling technique in which a liquid working medium flows through channels inside a cold plate and carries heat away from the heat source by heat transfer. In a cold plate system, the chips and other heat-generating components of the purpose-built liquid-cooled server never touch the liquid directly; heat is removed through cold plates mounted on the components that need cooling, achieving precise refrigeration and keeping GPU operating temperatures lower.

The secondary side uses a mixture of 25% ethylene glycol and deionized water, which keeps heat transfer efficient while taking safety and stability into account. The inlet water temperature is in the range of 35-45 ℃ and the outlet temperature is roughly 45-55 ℃; these high supply and return temperatures let the system cool the chips with free (natural) cooling and lower the system PUE. Heat is exchanged between the primary and secondary sides through a plate heat exchanger, and circulating pumps then carry the heat from the plate exchanger to the cooling tower.
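As a rough illustration of what the secondary loop has to do, the sketch below estimates the coolant flow needed to carry a 9 kW server's heat with the roughly 10 ℃ temperature rise quoted above; the property values for the 25% ethylene glycol mixture are approximate assumptions, not figures from the article.

```python
# Secondary-loop sizing sketch: mass flow = Q / (cp * dT).
# Property values for a 25% ethylene glycol / water mix are approximate assumptions.
heat_load_w = 9000.0       # one 4U H800-class server, per the article
cp_j_per_kg_k = 3900.0     # ~3.9 kJ/(kg*K), assumed for the glycol mixture
density_kg_m3 = 1035.0     # approximate mixture density, assumed
delta_t_k = 10.0           # inlet ~35-45 C, outlet ~45-55 C

mass_flow_kg_s = heat_load_w / (cp_j_per_kg_k * delta_t_k)
volume_flow_l_min = mass_flow_kg_s / density_kg_m3 * 1000.0 * 60.0
print(f"~{mass_flow_kg_s:.2f} kg/s, ~{volume_flow_l_min:.1f} L/min per server")
```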

Seen as a whole system, it differs from traditional cooling in several ways:

1. Fewer heat-exchange stages: a traditional chiller system transfers heat 5 times, while cold plate liquid cooling transfers it 3 times, so less cooling capacity is lost along the way.

2. Precise heat dissipation: the cold plate cools the GPU chip at the point of heat generation, and the coolant's specific heat capacity is about 4 times that of air (see the comparison sketch after this list), so heat transfer efficiency is higher and conditions are friendlier to the GPU.

3. With no compressors, fans or similar components in the heat-removal path, system PUE is lower and equipment noise is reduced.
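A quick check of the "about 4 times" figure in point 2, using textbook property values rather than anything from the article: per kilogram, water carries roughly four times the heat of air per degree, and per unit volume the gap is several thousand times because water is so much denser.

```python
# Specific heat comparison behind the "~4x" claim (textbook values, approximate).
cp_water_kj = 4.18    # kJ/(kg*K)
cp_air_kj = 1.005     # kJ/(kg*K)
rho_water = 1000.0    # kg/m^3
rho_air = 1.2         # kg/m^3 at room conditions

print(f"per kg:  {cp_water_kj / cp_air_kj:.1f}x")                          # ~4.2x
print(f"per m^3: {cp_water_kj * rho_water / (cp_air_kj * rho_air):.0f}x")  # ~3500x
```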

Compared with traditional air-based heat exchange, cold plate liquid cooling is a qualitative leap in overall performance and better suits the characteristics of computing workloads. The liquid cooling system supports a single-cabinet power density of more than 19 kW, improves heat dissipation efficiency, and lowers the GPU operating temperature by more than 20 ℃.

Of course, Haoyun Changsheng believes the best solution at present is a combination of air and liquid: within the aisle, liquid cooling handles GPU heat dissipation while air cooling carries away the heat of the other components; and with mixed deployment of liquid-cooled and air-cooled cabinets, a customer's ordinary cabinets and computing cabinets can sit side by side, improving coordination and simplifying maintenance.

Liquid cooling is a hard requirement of the computing business.

In the past, the type of refrigeration did not matter much to end users: air cooling, water cooling or indirect evaporative cooling were all acceptable as long as the power demand was met. In the computing era, that way of thinking may have to change, because computing assets are becoming harder to obtain and more expensive, and how well the cooling method matches the workload directly affects how fast the business comes online and how much the investment costs.

First, liquid cooling can improve GPU performance by about 10% compared with an air-cooled environment. By design, a GPU running hot for long periods loses performance, while liquid cooling provides efficient heat dissipation and lifts GPU performance. According to the OPPO computing team at the IDCC forum, verification showed that servers running under liquid cooling are about 10% more efficient than under air cooling; for the same computing power, the training cycle under liquid cooling is 10% shorter, so the business can reach the market earlier.

Second, liquid cooling can reduce the cost of IB cable deployment by more than 30%. A single 4U H800 server draws close to 9 kW. With traditional air cooling, only one such server can be placed per cabinet, and cabinets must be deployed in alternating positions, leaving an empty cabinet between servers. With cold plate liquid cooling, two H800 servers can sit in a single cabinet, and no spacer cabinets are needed. Take a single-row micro-module of 15 cabinets as an example: 7 H800 servers under air cooling occupy 14 cabinets, with a total cable length of 49A (A is the average cable run between two adjacent cabinets). If each cabinet can hold 2 servers, only 4 cabinets are needed (as shown in the following figure), for a total cable length of 16A, saving more than 50% of IB cable length. Each IB cable costs on the order of ten thousand yuan, and the longer the cable the higher the price. Taking into account the non-linear, scenario-dependent relationship between price and length, the project saves more than 30% of the cable spend.

Comparison of cable length between air-cooled deployment and liquid-cooled deployment
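The 49A and 16A totals can be reproduced with a simple position model; the layout assumptions below (an IB switch cabinet at one end of the row, one cable run per server, cable length counted in cabinet positions crossed) are our reading of the scenario rather than details spelled out in the article.

```python
# Reproducing the 49A vs 16A cable totals with a simple position model.
# Assumptions (our reading, not stated explicitly in the article):
#   - the IB switch cabinet sits at position 0 of the row,
#   - each server needs one cable run to the switch,
#   - cable length ~ number of cabinet positions crossed, in units of A.
def total_cable_length(server_positions):
    return sum(server_positions)

# Air cooling: 7 servers, one per cabinet, in alternating cabinets (14 cabinets used).
air_cooled_positions = [1, 3, 5, 7, 9, 11, 13]
# Liquid cooling: 2 servers per cabinet, so 7 servers fit in 4 adjacent cabinets.
liquid_cooled_positions = [1, 1, 2, 2, 3, 3, 4]

print(total_cable_length(air_cooled_positions))     # 49 -> 49A
print(total_cable_length(liquid_cooled_positions))  # 16 -> 16A
```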

We believe that shorter transmission distances also raise the rate of data sharing between computing modules. Some customers explicitly require that the cable routing distance from the server to the IB switch cabinet be less than 30 meters.

Third, liquid cooling can reduce the cost of GPU maintenance by 50% and increase the return on investment. By cooling the GPU precisely and efficiently, the liquid-cooled plate can lower the GPU operating temperature by 20 ℃. According to the "ten-degree rule", the GPU failure rate can then be reduced by at least 50% relative to the air-cooled baseline, which in turn reduces purchases of GPU spares. Uncertainty in the future GPU market will also make GPU procurement harder and more expensive, so keeping the GPU failure rate low saves both investment cost and time, and business continuity is not put at risk by a shortage of GPU cards.
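The failure-rate claim follows directly from the "ten-degree rule" cited above: if the rate doubles for every 10 ℃ rise, then a 20 ℃ reduction cuts it to a quarter of the baseline, comfortably beyond the 50% floor quoted. A minimal check:

```python
# "Ten-degree rule": failure rate roughly doubles per +10 C, halves per -10 C.
def relative_failure_rate(delta_t_c):
    """Failure rate relative to baseline after a temperature change of delta_t_c degrees C."""
    return 2.0 ** (delta_t_c / 10.0)

print(relative_failure_rate(-20))  # 0.25 -> a 75% reduction, i.e. "at least 50%"
```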

To sum up, for the end customer, as technology iterates and GPU power consumption rises, liquid cooling is no longer a nice-to-have improvement but a hard requirement of intelligent computing power.
