2025-01-15 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
As data centers keep growing in scale and new technologies keep arriving, the networks that carry data center services have become extremely complex. To keep up with the business, the data center network is updated and changed constantly, which makes operation and maintenance much harder. Downtime is hard to avoid as well: it adds to the workload of operation and maintenance staff and causes heavy losses for the data center. Even world-famous Internet giants regularly get this "treatment".
Internet giants keep going down, and operation and maintenance has become a hard problem
In the early morning of March 3, Aliyun suffered an outage that prevented normal use of the corporate websites and Internet company apps built on Aliyun services. A large number of programmers and operation and maintenance staff had to get out of bed and go to work. Commenting on the outage, Shen Jian, a senior architect at 58, said that the incident lasted about 3 hours and was followed by 2 hours of observation.
Starting at 03:43 on the morning of May 3, Microsoft Azure experienced a massive worldwide outage that lasted nearly 2 hours, with full recovery only at 05:30. As a result, major Microsoft services, including Microsoft 365 and DevOps, ran into usage problems.
On June 25, Amazon confirmed on its official website that its cloud computing services had suffered downtime, affecting network connectivity for some users and for multiple AWS regions. The failed node was in the AWS US East 1 region; a total of 33 services were affected, 9 of which were completely interrupted.
With frequent outages, the difficulty of operation and maintenance has climbed another level
One outage after another shows how important data center operation and maintenance is, yet outages still seem unavoidable. With technical progress ushering in the era of the Internet of Everything, the data center has become a key piece of infrastructure. Although data centers have been developing in China for only a little over a decade, they have moved from the era of the ordinary computer room, with nothing more than UPS, air conditioning and IT equipment, to a new era of full-range services such as Internet, big data, AI and cloud, facilities with tens of thousands of cabinets, and a steady stream of new technologies such as natural cooling, wind walls, underwater data centers and liquid-cooled servers. As a result, operation and maintenance management faces greater challenges, and its difficulty has climbed yet another level.
First, very large data centers change the demands on people, organization and efficiency. In the past, a data center of up to 10,000 square meters could be inspected manually in 2-4 hours. Today's facilities cover hundreds of thousands of square meters, so more operation and maintenance staff must be spread across different areas of responsibility, which raises both the difficulty and the cost of management. Second, voltage levels are higher, and so is the safety risk. Operation and maintenance staff used to work with low voltage, but power supply equipment, generators and chillers now all run on high voltage, so maintenance safety requirements are stricter. Third, concentrating scale also concentrates risk and magnifies the impact of an accident. The outages described above, for example, interrupted services and applications around the world and caused heavy losses, so the pressure on operation and maintenance management is unprecedented.
Reduce human error and improve the professional skills of operation and maintenance staff
According to survey data, 70% of data center outages are caused by human error. As data centers continue to expand, operation and maintenance staff therefore need to raise their skills and professional level so that they can deal with data center incidents:
Establish a complete skills evaluation system. Assessing the skills and capabilities of operation and maintenance staff from several angles helps them improve and encourages them to keep learning on their own initiative.
Online learning from operation and maintenance experience. Build an experience database and an online platform for sharing and exchanging experience, and provide online channels for practicing and learning operation and maintenance knowledge.
Online simulation of the operating environment. Provide a simulated environment for hands-on practice, which isolates operational risk and helps staff improve their practical skills quickly.
Online evaluation of theoretical skills. Draw on a large question bank covering IT cloud platform components, with regular assessments and randomly selected questions, to evaluate theoretical knowledge automatically and in real time.
Online evaluation of practical skills. Build a lightweight online operations and programming environment to evaluate operational skills and R&D skills automatically and in real time.
Improve efficiency through automatic evaluation. Evaluate theoretical and practical skills online, systematically and automatically, which makes evaluation more efficient and ensures that ability is reflected objectively and fairly.
To make up for the limits of manual operation and maintenance, intelligent operation and maintenance has emerged
With the arrival of the digital age, the size and capacity of data centers are growing exponentially, and operating and managing them is becoming more and more complex and difficult. Operation and maintenance has evolved from scripts, to tools, to platforms, and human effort is now close to its limit. Intelligent operation and maintenance emerged in response. More and more data center companies, such as Tencent, Huawei and JD.com, are investing more R&D effort in intelligent operation and maintenance, combining artificial intelligence with operation and maintenance. Working from existing operation and maintenance data (logs, monitoring information, application information, and so on), they use machine learning to improve efficiency and gradually replace manual operation and maintenance. I believe that data centers will become more and more intelligent in the future.
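As a rough illustration of what working from monitoring data can look like, the sketch below flags samples in a metric series that deviate sharply from the recent past. It is a simple rolling z-score baseline rather than a trained machine learning model, and it is not the method of any company named above. The function name detect_anomalies, its parameters, and the sample CPU readings are all hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate strongly from the preceding window.

    Hypothetical helper for illustration: a rolling z-score check,
    not a production anomaly detector.
    """
    history = deque(maxlen=window)  # most recent `window` samples
    anomalies = []
    for i, value in enumerate(samples):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # A perfectly flat window has sigma == 0; skip the check in that
            # case to avoid dividing by zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        # Every sample, including a flagged one, joins the history.
        history.append(value)
    return anomalies


# Made-up CPU utilisation readings (percent) with one obvious spike.
cpu_percent = [40, 42, 41, 39, 40, 41, 95, 40, 42, 41]
print(detect_anomalies(cpu_percent))  # [6]
```

The window and threshold values are arbitrary starting points. A real system would tune them per metric and would likely keep flagged samples out of the baseline.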