2025-04-08 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
The saying "no trouble, no failure" may be blunt, but it makes a lot of sense, especially in operations and maintenance. According to statistics from consulting firms, 70% of data center failures are man-made, that is, strongly tied to human activity, which shows how much of a hazard people are to the data center. Man-made failures can be divided into intentional and unintentional. Intentional means that someone goes ahead with an operation knowing it will cause a failure; such people usually hope to achieve some ulterior goal by paralyzing the data center. This kind of fault accounts for 80% of man-made failures, and the rest are unintentional.
The data center is a huge and complex system, and no operator can be proficient in every technical detail. When operators touch areas they are unfamiliar with or do not understand, their actions can easily produce unexpected results. With so much equipment and software of uneven quality, repeated operations easily trigger software problems and interrupt the business. This is not uncommon in a data center with tens of thousands of devices, where problems appear as soon as anything is touched. A stable data center should therefore not be changed lightly; just let it keep running at its best.
As we all know, during major holidays and events large data centers impose a network freeze and stop all operations and change activity, in order to reduce failures, lower the risk of human error, and lower the risk of triggering bugs. The method works: apart from some hardware failures, few other problems occur during a freeze.
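A freeze like this can also be enforced by tooling rather than by memo. The following is only a minimal sketch: the `FreezeWindow` class, the `blocking_window` function, and the dates are hypothetical and not taken from any real change-management system. It shows a scheduled change being checked against a freeze calendar before it is allowed to run.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FreezeWindow:
    """A period during which no changes may be made (hypothetical model)."""
    name: str
    start: datetime
    end: datetime

    def covers(self, moment: datetime) -> bool:
        # The window includes its start and excludes its end.
        return self.start <= moment < self.end


def blocking_window(scheduled_at, windows):
    """Return the first freeze window covering the scheduled time, or None."""
    for window in windows:
        if window.covers(scheduled_at):
            return window
    return None


# Illustrative dates only; each data center sets its own freeze calendar.
FREEZES = [
    FreezeWindow("national holiday", datetime(2025, 10, 1), datetime(2025, 10, 8)),
]

change_time = datetime(2025, 10, 3, 2, 0)
hit = blocking_window(change_time, FREEZES)
print("blocked by:", hit.name if hit else "nothing")
```

A change scheduler could call such a check before every job and refuse anything that lands inside a window.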
We all know that the tortoise lives a long life, often hundreds of years, precisely because it rarely moves and moves slowly. Data center operations staff likewise prefer calm, and moving less and moving carefully keeps failures to a minimum. Data centers in finance and banking have high reliability requirements. To avoid failures, bank data centers follow strict operating rules, and every operation must conform to a unified standard. Any command to be issued or changed must be reviewed in advance within the bank, and even verified in a simulation environment, before it is run on the live network. Data center operations in the banking industry are the most standardized, which makes those data centers the most reliable.
However, to respond quickly to business needs and improve resource utilization, operators have to make changes frequently; standing still is not an option. A data center may schedule changes every night: software upgrades, configuration optimization, equipment replacement, and so on. The changes never end, and they inevitably introduce new problems along the way. As a result, the data center never fully stabilizes and the business is often affected, which runs against the very purpose of operations and maintenance.
A data center draws on a great deal of technical knowledge, spanning dozens of disciplines. No one can master all of it, and it is hard enough to fully master even one. An operation plan drawn up with limited knowledge will therefore always have gaps, and any omission may cause problems during the operation. No one can be absolutely sure about a change; anything can go wrong. It is like surgery: even the smallest procedure carries risk, and the family must sign a consent form so that the surgeon is not held responsible if an accident occurs.
Since you can't avoid trouble, find a way to keep it from causing problems.
The first thing is to divide and conquer. This means separating high-risk from low-risk, important from unimportant, simple from complex, and frequent from infrequent. In the final analysis it comes down to two things: encapsulating complexity and isolating change. Separation at the level of the operations architecture is already common in the industry, for example separating application servers from database servers, separating the transaction database from the user database, and isolating the production environment from the test environment. A data center is made up of many small systems, which should be loosely coupled and preferably isolated from each other. That way, when one small system fails, the impact stays local and does not spread to the whole.
The second is to manage people. Reducing man-made trouble requires tighter constraints on, and management of, the people involved. Staff at different technical levels should have different operating permissions, and a novice working on the live network must be guided by an experienced engineer. Detailed rules for personnel management are needed to bind operations staff, to assess, monitor, and manage them, and to strengthen their sense of responsibility through rewards and penalties. Strict rules also have to cover working hours: data centers generally provide service 24 hours a day, so staff must get enough rest, start and finish their shifts on time, and avoid long hours and fatigue, which reduces the probability of errors.
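The idea of tying operating permissions to technical level can be sketched in a few lines. This is an illustration under assumptions, not a description of any real access-control product: the `Level` names, the operation names in `REQUIRED_LEVEL`, and the rule that a novice needs a senior engineer present are all hypothetical choices made for the example.

```python
from enum import IntEnum
from typing import Optional


class Level(IntEnum):
    NOVICE = 1
    ENGINEER = 2
    SENIOR = 3


# Hypothetical minimum level for each kind of live-network operation.
REQUIRED_LEVEL = {
    "change_config": Level.ENGINEER,
    "replace_equipment": Level.ENGINEER,
    "upgrade_software": Level.SENIOR,
}


def may_operate(operator: Level, operation: str,
                supervisor: Optional[Level] = None) -> bool:
    """Decide whether an operator may carry out a live-network operation."""
    required = REQUIRED_LEVEL[operation]
    if operator == Level.NOVICE:
        # A novice never works on the live network alone:
        # a senior engineer must be guiding the operation.
        return supervisor == Level.SENIOR
    return operator >= required


print(may_operate(Level.NOVICE, "change_config"))
print(may_operate(Level.NOVICE, "change_config", supervisor=Level.SENIOR))
print(may_operate(Level.ENGINEER, "upgrade_software"))
```

Keeping the table of required levels as data makes it easy to review and tighten without touching the check itself.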
The third is to manage the work itself. When the data center needs a change or an optimization, the operations team should discuss it as a whole and analyze the foreseeable risks, to make sure the operation will not affect the business. Each change is then a decision of the entire technical team rather than of one individual, which minimizes failures caused by technical misjudgment. A sound fallback plan is essential: roll back immediately if anything abnormal happens, and attempt the change a second time only after the cause has been analyzed. After all, operations staff are not experts on the equipment itself and do not know its internal processing and implementation in detail. For major changes, technical staff from the equipment vendor can be invited to take part and provide support, reducing the risk of operational errors. Every operation should be fully prepared, with the necessary simulation drills, advance migration of business traffic, preparation of emergency channels, and so on, so as to reduce the risk of failure.
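The "roll back immediately on any abnormality" rule can be written down as a small pattern. The sketch below is a toy under stated assumptions: `run_change` and its three callbacks are hypothetical names, and a dictionary stands in for a real device configuration.

```python
from typing import Callable


def run_change(apply: Callable[[], None],
               healthy: Callable[[], bool],
               roll_back: Callable[[], None]) -> bool:
    """Apply a change; roll it back if it raises or the health check fails.

    Returns True if the change was kept, False if it was rolled back.
    """
    try:
        apply()
        ok = healthy()
    except Exception:
        ok = False
    if not ok:
        roll_back()
    return ok


# Toy example: a dict stands in for a device configuration.
config = {"mtu": 1500}
previous = dict(config)  # snapshot taken before the change


def apply_change():
    config["mtu"] = 9000


def check_health():
    # Pretend the service only works with the old MTU.
    return config["mtu"] == 1500


def restore():
    config.clear()
    config.update(previous)


kept = run_change(apply_change, check_health, restore)
print(kept, config)
```

The important design point is that the snapshot and the rollback step are prepared before the change is applied, so backing out does not depend on anyone improvising under pressure.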
"If there is no trouble, there will be no failure" is golden advice. It sounds perfectly reasonable, but in practice it is very hard to follow. The data center is a place where data flows at high speed and business requirements change all the time. To support business deployment and growth, it is impossible to avoid changes in the data center, so "no trouble" is only an ideal state. Still, we should actively reduce the frequency of data center operations and touch things as little as possible, which can greatly reduce the probability of failure. People are the most important factor in the data center: without people there would be no data center at all, yet people also bring it growing pains, and they continue to play an important role in operations and maintenance. As data center operators, we should always keep this time-honored motto in mind.
© 2024 shulou.com SLNews company. All rights reserved.