2025-03-28 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)11/24 Report--
Artificial intelligence, cloud computing, big data, and other digital technologies are intertwining to build a new virtual space that is changing production, distribution, science, education, entertainment, social interaction, and more. Driven by technology, a new digital civilization is booming, and the rapid growth of computing power is one of the main forces behind this shift. In the little more than 70 years since the advent of the computer, performance has grown from about 5,000 operations per second to the exascale rates (on the order of 10^18 operations per second) of today's supercomputers, an increase of hundreds of trillions of times. Even so, in the face of emerging technologies such as generative artificial intelligence and the metaverse, the gap in computing power remains huge.
At Tide Information, there is a group of engineers driven by curiosity who look for every possible way to improve computing power. However large or small the progress, their pride drives them to keep exploring the unknown. Like scientists, they study a range of cross-disciplinary technologies and apply them to all kinds of engineering problems. They think divergently but can also focus, and with their enthusiasm for computing innovation they keep expanding the boundaries of digital civilization.
112Gbps High-Speed Interconnection, the "Art" of Server Design
Yang Yang, one of Tide Information's many AI server engineers, is responsible for research and development of AI server system architecture. The key task is to design and develop an open acceleration substrate with ultra-high-speed interconnect performance.
"In the past, we emphasized how to improve the computing power of a single chip. In the era of large models, however, training routinely uses thousands of cards, and a single chip simply cannot carry the load. In the new AI supercomputer form, what kind of interconnect architecture can better support large model workloads is a key research topic." Yang Yang believes that before thousands or even tens of thousands of chips can be interconnected and made to work together efficiently, the high-speed direct connection of chips inside a single server must be solved first. That is the "origin" of all the other problems.
Thanks to the team's efforts, Tide Information defined the industry's first 8-card interconnected AI system that conforms to the OAM (Open Accelerator Module) specification. Its interconnect substrate follows the open computing standard and was the first in the industry to reach a single-channel rate of 56Gbps, the highest available. The substrate is only 3.26mm thick, yet it has 22 layers and carries nearly 1,000 high-speed differential pairs.
At present, 56Gbps is still the highest chip interconnect rate under the open acceleration specification. "Next, we will push for 112Gbps single-channel high-speed interconnect," Yang Yang said. "A speed increase of this magnitude is equivalent to stepping from the 5G era to the 6G era."
The difficulty of 112Gbps high-speed interconnect is that, with the physical size almost unchanged, doubling the interconnect rate between GPUs comes at the cost of signal-to-noise ratio. The impact is large: a 112Gbps signal is more sensitive to jitter and noise, so the requirements on channel indicators become more stringent. These indicators include crosstalk, SCD (mode conversion, in which differential energy passing through the channel turns into common-mode energy; the lower the better), PN skew (the transmission difference caused by the unequal lengths of the two lines of a pair), and ILD (insertion loss deviation, the degree to which line loss is affected by impedance variation).
This not only requires higher-end materials but also tests the "art" of design. A substrate 3-5mm thick is actually a laminate: a PCB (printed circuit board) that often contains a dozen or even dozens of layers, each only about 100 microns thick, roughly the thickness of a sheet of A4 paper. To ensure signal quality, each group of lines uses a differential-pair design, in which two complementary signals of equal length and opposite phase carry the same signal to reduce noise and EMI (electromagnetic interference). This doubles the amount of wiring, which makes matters worse for a substrate whose routing density is already close to the limit. Moreover, the width and spacing of a differential pair must stay consistent along its whole length, which demands even more design skill when routing around obstacles on the substrate such as vias or small components.
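Why a differential pair tolerates noise can be shown with a small simulation. The following is a minimal sketch, not a model of any real channel: it assumes ideal voltage levels of +0.5 and -0.5 and assumes that the noise couples identically onto both lines of the pair. The function names and parameters are illustrative only.

```python
import random


def transmit_differential(bits, noise_amplitude=1.0, seed=0):
    """Send each bit as a complementary pair (+v, -v).

    The same noise value is added to both lines, which is the idealized
    case of two tightly coupled traces of equal length and spacing.
    """
    rng = random.Random(seed)
    pairs = []
    for bit in bits:
        v = 0.5 if bit else -0.5
        noise = rng.uniform(-noise_amplitude, noise_amplitude)
        pairs.append((v + noise, -v + noise))
    return pairs


def receive_differential(pairs):
    """Recover bits from the difference p - n; the shared noise cancels."""
    return [1 if (p - n) > 0 else 0 for p, n in pairs]


def receive_single_ended(pairs):
    """For comparison: decide from the positive line alone."""
    return [1 if p > 0 else 0 for p, _ in pairs]


if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0, 0, 1, 0]
    pairs = transmit_differential(bits)
    print("sent:        ", bits)
    print("differential:", receive_differential(pairs))
    print("single-ended:", receive_single_ended(pairs))
```

The receiver sees (v + noise) - (-v + noise) = 2v, so the shared noise drops out even when it is larger than the signal itself. This is also why skew and mismatched spacing matter: once the two lines no longer see the same noise, the cancellation is incomplete.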
Therefore, 112Gbps high-speed interconnect design must not only find lower-loss resin and glass fiber and smoother copper foil, but also ensure that these materials still meet reliability specifications after processing. The design and process complexity is extremely high.
In Yang Yang's view, 112Gbps high-speed interconnect technology needs both scientific divergence and engineering convergence: scientific divergence to look for what is possible, and engineering convergence to find what is feasible. The space of possible innovation includes materials, processes, methods, management, and operations, while feasibility means finding a maximum or a minimum, which is the process of searching for the optimal solution. "Just as with profit, we tend to pursue profit maximization and cost minimization. In many cases, maximization and minimization are unified, and the goal is the same."
The work of Yang Yang's team can benefit hundreds of chip innovation companies and an even larger number of users. With a standardized, high-quality open acceleration substrate, chip companies can bring products to market quickly and keep iterating, while users get a unified, open infrastructure on which they can configure different types of AI acceleration chips according to business needs, speeding up innovation and improving the user experience.
Listening for Noise Reduction, the "Romance" of Server Optimization
A server integrates more than 10,000 components, including more than 50 types of dedicated chips. It also involves more than 30 technical fields, such as materials science, thermodynamics, battery technology, fluid mechanics, and chemistry, and more than 100 transport protocols are used in a single server. In manufacturing, a server goes through more than 30 procedures, uses more than 100 processing and manufacturing techniques, and requires control of more than 200 critical process control points.
Ensuring the reliability of the whole system is a very fine-grained and complex undertaking. Every detail affects the whole, and even sound can affect the reliability of a server. Four or five years ago, a considerable number of data center users ran into almost the same problem: the faster the fans spun, the more likely the hard drives were to fluctuate in performance and, in severe cases, to go offline.
"At first, I thought vibration was the culprit, but later I found out that sound was the culprit." With her keen perception, Tide Information structural engineer Cathy Wang created a distinctly engineer's kind of "romance": listening closely in order to reduce noise.
The team ran a large number of experiments on hard disk performance failures and found that once fan noise reaches 120 decibels, it can easily push the disk's magnetic head off track and reduce read/write efficiency, which in turn leads to sector failures and even scrapped disks and server downtime. "There is an irreconcilable contradiction in structural design: as fan speed increases, the noise shifts toward higher frequencies and higher sound pressure, and sound grows with the fifth power of speed, so we see a very clear and rapid rise in fan noise. Given this conflict between fan and hard disk, how to build a hard disk sensitivity model from a system design perspective has become a difficult problem for manufacturers across the industry," Cathy Wang said.
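The fifth-power relationship mentioned in the quote can be turned into a quick estimate. The sketch below assumes that "sound" refers to radiated sound power and that it scales exactly with speed to the fifth power, which is the commonly cited idealized fan law; real fans deviate from it, and the function name is illustrative only.

```python
import math


def noise_increase_db(speed_ratio, exponent=5.0):
    """Change in sound power level (dB) when fan speed is scaled by speed_ratio.

    Assumes sound power is proportional to speed ** exponent, so the level
    change is 10 * exponent * log10(speed_ratio).
    """
    if speed_ratio <= 0:
        raise ValueError("speed_ratio must be positive")
    return 10.0 * exponent * math.log10(speed_ratio)


if __name__ == "__main__":
    for ratio in (1.1, 1.5, 2.0):
        print(f"speed x{ratio}: {noise_increase_db(ratio):+.1f} dB")
```

Under this assumption, doubling the fan speed raises the sound power level by about 15 dB, which illustrates why a modest increase in cooling demand can push noise past the level a hard disk tolerates.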
Although the root cause had been found, solving the problem was still a tortuous process. After trying dead ends such as sine waves and 1/3-octave bands, Cathy Wang's team found the most suitable noise bandwidth and simulated a variety of noise sources using mixed-frequency and swept-frequency signals, which made it possible to measure a hard disk's resonance frequency and sound pressure threshold under noise excitation from 500Hz to 10000Hz. Based on extensive study of the mechanism and on testing, the team found the mathematical relationship between hard disk performance loss and sound pressure, built the industry's first hard disk sensitivity model, and quantified how the performance of different hard drives is affected by various kinds of noise.
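The threshold measurement described above can be sketched as a simple data reduction step. This is not the team's sensitivity model, whose form the article does not give. It assumes a hypothetical test log of (frequency, sound pressure level, throughput) samples and assumes that "degraded" means throughput more than 10% below a noise-free baseline. All names and numbers are placeholders.

```python
def sound_pressure_thresholds(samples, baseline, max_loss=0.10):
    """Find, per frequency, the lowest SPL at which the drive degrades.

    samples:  iterable of (frequency_hz, spl_db, throughput) tuples
    baseline: throughput measured without noise excitation
    max_loss: fraction of baseline that may be lost before a sample
              counts as degraded (an assumed criterion)
    Frequencies at which the drive never degrades are omitted.
    """
    limit = baseline * (1.0 - max_loss)
    thresholds = {}
    for freq, spl, throughput in samples:
        if throughput < limit:
            if freq not in thresholds or spl < thresholds[freq]:
                thresholds[freq] = spl
    return thresholds


def most_sensitive_frequency(thresholds):
    """Frequency with the lowest threshold, or None if nothing degraded."""
    if not thresholds:
        return None
    return min(thresholds, key=thresholds.get)


if __name__ == "__main__":
    # Placeholder values for illustration only, not measured data.
    log = [
        (500, 100, 200.0), (500, 110, 198.0),
        (2000, 100, 195.0), (2000, 105, 150.0), (2000, 110, 90.0),
        (8000, 110, 170.0),
    ]
    found = sound_pressure_thresholds(log, baseline=200.0)
    print(found, most_sensitive_frequency(found))
```

A table like this, collected per drive model across the 500Hz to 10000Hz range, is the kind of raw input from which a relationship between sound pressure and performance loss could then be fitted.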
"Through our research, we hope to change performance optimization from experience-led to science-led. By improving the basic theory, tools, and methods, we can form standard solutions for specific problems and build new, reusable knowledge," Cathy Wang said.
The "black box" of sound inside the server was opened in this way. Having determined the noise spectrum that actually affects the hard disks in the chassis, Tide Information's engineers carried out an all-round optimization of the server system. First, starting at the source of noise and vibration, they improved the fan blade shape using CFD (computational fluid dynamics) simulation to suppress the high-frequency noise caused by vortex shedding on the blade surface. Second, they designed more than 40 opera-house-style sound-absorbing structures in the chassis to eliminate specific high-frequency noise. In addition, they adjusted the servo control algorithm in the hard disk firmware to keep the noise-induced resonant swing of the disk head within 10 nm, which not only improves read/write efficiency and doubles performance but also keeps the server running safely.
Fusion Architecture 3.0, the "Dream" of Server Architecture
In the era of large models, once high computing efficiency has been achieved on a single machine, whether performance can keep scaling close to linearly across hundreds of nodes and thousands of cards becomes a key factor in the design of computing clusters and parallel strategies. In traditional computing architectures, processor scale-out has always been a bottleneck that is hard to break through, so finding a new way forward is imperative.
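How close a cluster comes to linear scaling is usually expressed as a scaling efficiency. The sketch below uses the common definition (achieved speedup divided by ideal speedup); the function name and the throughput figures are illustrative assumptions, not numbers from the article.

```python
def scaling_efficiency(base_nodes, base_throughput, nodes, throughput):
    """Ratio of achieved speedup to ideal (linear) speedup.

    1.0 means perfectly linear scaling; lower values mean that part of
    the added hardware is lost to communication or synchronization.
    """
    if min(base_nodes, nodes) <= 0 or base_throughput <= 0:
        raise ValueError("node counts and base throughput must be positive")
    ideal_speedup = nodes / base_nodes
    achieved_speedup = throughput / base_throughput
    return achieved_speedup / ideal_speedup


if __name__ == "__main__":
    # Hypothetical example: 1 node processes 100 samples/s, 8 nodes process 720.
    print(f"{scaling_efficiency(1, 100.0, 8, 720.0):.0%}")
```

Tracking this ratio as the node count grows shows where interconnect or parallel strategy starts to limit the cluster.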
Tide Information architecture engineer Lorne Ci said: "The traditional approach is to put all the IT resources into a single server. If you need more computing power, more memory, and more IO, you have to stack servers. A large-scale data center in the usual sense may have tens of thousands or even hundreds of thousands of servers. However, simple stacking only piles up servers of various forms and specifications, which does not substantially improve the computing power of the data center. The servers' IT resources need to be pooled, and then allocated dynamically through software definition."
The research direction of Lorne Ci's team is therefore to create a new architecture that gathers similar resources from hardware devices into resource pools, lets different devices be combined freely, senses the resource needs of the business dynamically through software, and uses hardware reconfiguration to meet the needs of all kinds of applications.
Tide Information named this new architecture "Fusion Architecture" and put forward the concept as early as 2014. Its core is to achieve physical pooling and dynamic reconfiguration of resources through hardware decoupling, and business-aware, on-demand combination and configuration of resources through software definition, so that the system can scale flexibly and keep expanding at large scale, with software and hardware developing in close coordination. Tide Information divides the development of the fusion architecture into three stages: "Server as a Computer", "Rack as a Computer", and finally "Data Center as a Computer".
At present, the prototype system of Fusion Architecture 3.0 has been successfully developed. It fully decouples and pools core IT resources such as computing, storage, memory, and heterogeneous acceleration resources, and it supports asynchronous upgrading of pooled resources, fine-grained multi-host sharing of highly concurrent storage, and sub-microsecond shared access to remote memory. Through software definition, "one system, N classes of applications" can be realized.
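The idea of pooling resources and assigning them on demand through software can be illustrated with a toy allocator. This is only a conceptual sketch: the class, method, and resource names are hypothetical and do not reflect the interfaces of the Fusion Architecture prototype, which the article does not describe.

```python
class ResourcePool:
    """Toy model of decoupled resources shared by several workloads."""

    def __init__(self, capacity):
        # capacity maps a resource kind to the total units in the pool,
        # e.g. {"cpu_cores": 128, "memory_gb": 1024, "accelerators": 8}
        self.capacity = dict(capacity)
        self.allocations = {}  # workload name -> granted request

    def free(self):
        """Units of each resource kind that are currently unallocated."""
        used = {}
        for request in self.allocations.values():
            for kind, amount in request.items():
                used[kind] = used.get(kind, 0) + amount
        return {k: total - used.get(k, 0) for k, total in self.capacity.items()}

    def allocate(self, workload, request):
        """Grant the whole request or nothing; return True on success."""
        if workload in self.allocations:
            raise ValueError(f"{workload} already holds an allocation")
        free = self.free()
        if any(amount > free.get(kind, 0) for kind, amount in request.items()):
            return False
        self.allocations[workload] = dict(request)
        return True

    def release(self, workload):
        """Return a workload's resources to the pool."""
        self.allocations.pop(workload, None)


if __name__ == "__main__":
    pool = ResourcePool({"cpu_cores": 128, "memory_gb": 1024, "accelerators": 8})
    print(pool.allocate("training", {"accelerators": 8, "memory_gb": 512}))
    print(pool.allocate("inference", {"accelerators": 1}))  # pool exhausted
    pool.release("training")
    print(pool.allocate("inference", {"accelerators": 1}))
    print(pool.free())
```

The same hardware serves a training job and then an inference job without either being tied to a fixed server, which is the behavior that "one system, N classes of applications" describes.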
The core of Fusion Architecture 3.0 is to pool memory resources and computing resources. How to make remote memory calls, how to achieve low latency and fast response, and how to maintain cache coherence are the major challenges of memory pooling. Lorne Ci said, "The fusion architecture is now based on a number of open bus technologies, including PCIe and CXL, to build a large memory system and a high-speed, high-performance interconnect network, which is of great value for training large models whose parameter counts and data volumes are surging."
With the successful development of the Fusion Architecture 3.0 prototype, Tide Information has achieved an important breakthrough in this field, pooling IT resources such as computing, memory, storage, and interconnect at the cabinet level. In particular, memory decoupling delivers sub-microsecond remote memory access and builds a logical pool of memory that can be shared remotely. Multiple hosts can access the same memory pool, which greatly improves the efficiency of data exchange. The new architecture breaks with the logical architecture and application model of existing servers. It focuses on system design and can shift the data center from resource-driven to business-driven. Across scenarios such as cloud computing and artificial intelligence, it makes it possible for a data center to support many kinds of applications with one system.
In today's emerging era of digital civilization, computing has penetrated every aspect of our lives. Computing technology is ubiquitous at home, in business, and in scientific research, and has become part of daily life. However, we must recognize that this is only the starting point of digital civilization, and the importance of computing will become even more prominent in the future. Computing innovation will be the spark of digital civilization and will continue to light the way forward. Just as the pioneers of the past ventured out to open up new continents, today's countless "computing power pioneers" will continue to lead us into new realms of the digital age. These pioneers integrate science with engineering and combine "knowledge" with "action" to explore vast and imaginative unknown territory.
The road to digital civilization is full of opportunities and challenges. We need more R&D personnel and scientific and technological workers with interdisciplinary knowledge to develop unprecedented solutions, push computing innovation to new heights, and lead us to the next peak of digital civilization.