An Example of Digital Transformation of Internet Operation and Maintenance

2026-03-20 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article shares an example of the digital transformation of Internet operation and maintenance. The editor found it very practical and is sharing it here for reference.

1. Start with digitalization

Let's start with digital transformation. Over the past few years, the whole industry has been talking about digitization. What is digitization? In my opinion, it is a transformation of the enterprise that can bring higher efficiency and a better user experience. SF Express (Shunfeng) has been carrying out such a transformation over the past few years.

Most of us have placed an order to send a parcel with SF Express. Many links lie behind that order, including local dispatch points, land transportation, transit, and air transportation. This is the physical path.

The data flow is more complex: task distribution, routing distribution, waybill generation, dispatch, and so on. What we have been doing over the past few years is digitizing all of this and moving it online, which makes it possible to optimize later route planning and delivery planning and brings large savings in labor and transportation costs.

You may have noticed that before 2017, when you sent a parcel with SF Express you were handed a paper waybill to fill in. That changed in 2017: SF ran a fusion project to move all paper waybills online, so every waybill issued today is a QR code.

This is the trend of business growth. March to May was the stage of slow rollout and trial operation. From May to September we rolled out rapidly across the whole network and replaced all paper waybills. After a little more than half a year, all SF waybills were online and electronic, and the project was completed in December. The project as a whole was very successful, and business volume kept growing. But was it really such plain sailing behind the scenes? From March to October we ran into more problems than we could put into words.

2. Start-goal

The picture above shows a tug-of-war with many people pulling, much like the many roles in a project or an enterprise: operation and maintenance, development, product, business, and promotion. To do something well, the first requirement is that everyone shares the same goal.

Business. All of us are here to serve the business and create value for the company. If a technician does not understand the business, how can he or she talk about creating value for the company? First, we must understand the business, consider problems from the business perspective, and communicate in the language of the business. Changing perspective means translating purely technical language into business terms when talking with the project team, so that everyone thinks about the problem in the same dimension.

The operation and maintenance team should abandon the idea that it is an execution team. An execution team does things, but it only does things. Our job is not just to carry out actions; we must generate the maximum value across the entire core value chain, for example by spending time on infrastructure assessment, cost, and security.

About success. Success is defined differently from different dimensions. From a project point of view, success means whether the project succeeds. When a company is relatively large, it is divided into many departments, each with its own KPIs and its own metrics to answer for, which leads to different positions and different perspectives.

So from the project's perspective, if the project succeeds, we succeed. Often we have to break down departmental walls and either stand in someone else's position or raise ourselves to a higher vantage point to consider the problem.

3. Initial stage-performance

Process. In past years SF was a relatively heavyweight organization with very cumbersome processes. A single approval meant finding N people and making N phone calls to explain many things, and turnaround was very slow. The process therefore needs some optimization, otherwise the whole project will be slowed down.

Organizational structure. In traditional industries, once the operation and maintenance organization grows very large it is divided into many departments and groups. The infrastructure side, for example, has middleware, systems, networks, storage, and so on, with a different professional group responsible for each area. This creates a problem: from the project's perspective, communicating across all these areas is very troublesome. Troubleshooting a single problem, anomaly, or fault takes a crowd of people, which is very inefficient.

Mode of thinking. What is the relationship between operation and maintenance and development? Is it cooperation, is it service, or is it passing the buck? I believe everyone runs into this question.

We encountered the three problems above at the beginning of the project. They are not purely technical; they also involve organization and process and affect ways of working that many people have followed for a long time, which is the most troublesome part. How do we fix this? The answer is actually very simple: your boss.

Many technical people are not very good at using the resources they already have. When you hit the problems above, only your boss can solve them. If you win over your boss, the boss can give you a lot of resources to push things forward; otherwise you will stay stuck. Let the business pressure work for you: escalate the matter to your boss and convince him or her to help coordinate resources.

Lightweight process. Introduce a lightweight process that removes many approval nodes, and combine it with tools so that the whole pipeline runs smoothly.

Full-stack operation and maintenance team. Break up the existing professional groups and create a full-stack operation and maintenance team that holds all operational permissions and functions and takes unified responsibility for every failure, problem, and event. Such a team considers problems as a whole, breaks down departmental and professional walls, and flattens the organization.

A change of thinking. The relationship between operation and maintenance and development should be one of cooperation, not a service-level relationship. The two have equal status and the same goal. Our professional ability must therefore be greatly strengthened, and we must cooperate with development and product without boundaries, in order to produce the greatest value in the project.

4. Outbreak period-stable

The outbreak period ran from June to September, when business volume surged from 1 million to 8 million or even 10 million. Many problems appear in such a period, including performance problems and strange, hard-to-explain problems. The early stage of the project was about rapid rollout and trial and error, so some technical risks were ignored or never considered. That left a lot of technical debt, which was exposed frequently once the business grew.

As the business pushes upward and volume keeps growing, the pressure is passed on to R&D and to operation and maintenance. If failures are frequent, the pressure at every level is very great.

Engineer culture means being professional, efficient, open, technical, and responsible, and believing that we can do it.

Elasticity is a lifesaver for us: capacity expands as the business grows. It has two sides, application architecture and infrastructure. Application architecture leans toward R&D, and infrastructure leans toward operation and maintenance.

Application Architecture:

The first point is statelessness: the nodes in the cluster hold no user information. The second is the single point of failure, which follows the bucket principle: how much water a bucket holds depends not on its longest stave but on its shortest. The third is distribution, which makes capacity expansion easier. The fourth is horizontal scaling: the entire system architecture must support it. The most troublesome part is the database. The general practice is to split large tables into small tables and large databases into small databases. There is no standard way to divide data among the small databases; it depends on your company's business, for example by program or by user ID. It is best to put the database splitting scheme in place early, because migrating the data later is very painful.
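As an illustration of dividing data by user ID, here is a minimal sketch. It is not SF's actual scheme: the database and table counts, the `waybill` naming, and the use of CRC32 as the hash are all assumptions made for the example.

```python
import zlib

NUM_DATABASES = 4        # hypothetical number of small databases
TABLES_PER_DATABASE = 8  # hypothetical number of small tables in each database


def shard_for(user_id: str) -> tuple[int, int]:
    """Map a user ID to a (database index, table index) pair."""
    # crc32 gives the same value in every process, unlike Python's
    # built-in hash() for strings, so every stateless node agrees.
    slot = zlib.crc32(user_id.encode("utf-8")) % (NUM_DATABASES * TABLES_PER_DATABASE)
    return slot // TABLES_PER_DATABASE, slot % TABLES_PER_DATABASE


def table_name(user_id: str) -> str:
    """Build a hypothetical physical table name for this user's rows."""
    db, table = shard_for(user_id)
    return f"waybill_db{db}.waybill_{table}"
```

With a fixed modulus like this, changing the number of databases later moves most users to a different slot, which is one reason late migration is painful.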

Infrastructure:

The figure on the left is a rough sketch in which client requests pass through a variety of links, such as firewalls, gateway load balancers, databases, and so on. Every link in this long chain should support horizontal and rapid expansion. Horizontal expansion depends on the choice of technical standards; rapid expansion tests the capability of the technical architecture. During a promotion, the number of servers may grow from 100 to thousands, and whether they can be delivered quickly or have to be set up by hand is what "rapid" means.

This is our internal operation and maintenance platform, called Weishi. It divides resources into layers: the lowest layer is hardware, above it is the virtualization layer, and above that is a component layer. Each professional group packages its own components as services, and the services are then chained together through orchestration for external delivery, so that requests for technical resources can be fulfilled easily.

Releases involve grayscale rollout. With many agile iterations there is a lot of trial and error and releases are very frequent, so our system has to support grayscale.

When the business has a new feature, you can first shift 5% or 10% of the traffic to try it out. What the operation and maintenance layer looks at is more direct: when 5% or 10% of traffic is shifted, does the CPU load of the servers change? If, at 20% of traffic, the QPS of the database is 20% to 30% higher than before, the problem can be found and fixed immediately. Grayscale gives the business layer room for trial and error and also gives the IT layer plenty of room to do so safely; if something goes wrong, we can quickly switch the traffic back.

On the right are some grayscale switching rules. We need to be able to switch by environment, by system, and by UIL service string or version number. The more detailed the rules, the finer the switching and the safer it is.
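The percentage-based switching described above can be sketched roughly as follows. This is only an illustration under assumptions: the `GrayRule` fields, the version labels "stable" and "gray", and the CRC32 bucketing are hypothetical and are not taken from the platform in the article.

```python
import zlib
from dataclasses import dataclass


@dataclass
class GrayRule:
    system: str           # system the rule applies to
    environment: str      # environment the rule applies to, e.g. "prod"
    percent: int          # share of traffic sent to the new version, 0-100
    enabled: bool = True  # set to False to switch all traffic back at once


def bucket_of(user_id: str) -> int:
    """Place a user in one of 100 stable buckets."""
    return zlib.crc32(user_id.encode("utf-8")) % 100


def choose_version(rule: GrayRule, system: str, environment: str, user_id: str) -> str:
    """Return "gray" for users inside the rule's percentage, else "stable"."""
    if not rule.enabled or rule.system != system or rule.environment != environment:
        return "stable"
    return "gray" if bucket_of(user_id) < rule.percent else "stable"
```

Because a user's bucket never changes, raising `percent` from 5 to 10 only adds users to the gray group, and setting `enabled` to False sends everyone back to the stable version.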

There are generally three modes of service protection: rate limiting, circuit breaking, and isolation.

Rate limiting. When traffic surges and the overall application responds slowly, traffic must be controlled by filtering out redundant or unimportant requests. This is bad for the user experience, but at least it keeps the overall system stable. Load balancers provide this function, and it can also be implemented in your own gateway.

Circuit breaking. Everyone now talks about microservices. Each module is split very finely, which inevitably leaves many things outside your control: a problem in one version of a dependency can bring your own service down. With a circuit breaker, when the dependency is down you stop sending requests and return directly, which is similar to a downgrade.

Isolation. A single large thread pool used to handle everything, so a problem with one type of request could exhaust the entire pool. Isolation splits the large pool apart: different types of requests use different thread pools, so the types do not affect each other.
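To make the first two modes concrete, here is a small sketch of a token-bucket rate limiter and a consecutive-failure circuit breaker. Both are common textbook designs chosen for illustration; the article does not say which algorithms were used, and the class names and parameters are assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Rate limiter: allows bursts up to `capacity`, refills `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: the request is filtered out


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, retries after `reset_after` seconds."""

    def __init__(self, max_failures: int, reset_after: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func: Callable[[], object], fallback: Callable[[], object]):
        if self.opened_at is not None and self.clock() - self.opened_at < self.reset_after:
            return fallback()  # open: return directly without calling the dependency
        try:
            result = func()    # closed, or half-open trial after the wait
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        self.opened_at = None
        return result
```

The `clock` parameter exists only so that the behaviour can be checked without real waiting. Isolation is not shown here; in Python it could be approximated by giving each request type its own `concurrent.futures.ThreadPoolExecutor`.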

Monitoring has two dimensions: infrastructure monitoring and business monitoring.

Infrastructure. Metrics such as server CPU, IO, and memory are the routine ones, and some probing software, including APM, is used. At a finer grain, monitor the number of calls to each method and service, and so on.

Business monitoring. Normal infrastructure metrics do not mean the business is normal, so business monitoring is indispensable. For every key core link, the service requests, response codes, and response times must each have a threshold that triggers an alarm when exceeded. Based on this monitoring data, algorithms can produce trend or predictive early warnings, such as capacity estimates. There is also instrumentation (tracking points), which outputs the whole call chain and makes problems easy to locate. Finally there is the service link: existing systems call each other, and a problem in one system may affect the surrounding business, so we need a complete panorama of the links.
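A threshold check on a core link might look like the following sketch. The `Sample` fields, the choice of error rate and 95th-percentile latency as the two indicators, and the alert wording are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    status: int        # response code of one request on the link
    latency_ms: float  # response time of that request


def check_link(name: str, samples: list[Sample],
               max_error_rate: float, max_p95_ms: float) -> list[str]:
    """Return alert messages for one core link; an empty list means healthy."""
    alerts: list[str] = []
    if not samples:
        return alerts
    n = len(samples)
    error_rate = sum(1 for s in samples if s.status >= 500) / n
    latencies = sorted(s.latency_ms for s in samples)
    # Nearest-rank 95th percentile: index ceil(0.95 * n) - 1, in integer arithmetic.
    p95 = latencies[(95 * n + 99) // 100 - 1]
    if error_rate > max_error_rate:
        alerts.append(f"{name}: error rate {error_rate:.1%} exceeds {max_error_rate:.1%}")
    if p95 > max_p95_ms:
        alerts.append(f"{name}: p95 latency {p95:.0f} ms exceeds {max_p95_ms:.0f} ms")
    return alerts
```

A real system would evaluate this over a sliding time window per link; the sketch checks a single batch.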

On the left is the microservice diagram. A monolithic application is split very finely according to business rules and distributed across different nodes. A single microservice may run on hundreds of nodes, which makes faults hard to locate. We need link tracing and a very complete logging system to handle problems well.

I have my own views on microservices. The first concerns the splitting rules: if the split is done badly it becomes a mess, and in the end there are no rules at all. The second is that microservices need the support of the organizational structure; otherwise the whole effort amounts to complicating simple things under the guise of technology.

No system can guarantee that it is 100% free of problems, so a contingency plan is needed. Build downgrade switches or off switches into the system. On the business side, it is best to also have an offline contingency plan.
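A downgrade switch can be as simple as the sketch below. The feature name `precise_eta`, the function names, and the degraded message are all hypothetical; a production switch would normally live in a shared configuration service rather than in process memory.

```python
from typing import Callable


class DegradeSwitches:
    """In-memory set of features that have been switched off."""

    def __init__(self) -> None:
        self._off: set[str] = set()

    def turn_off(self, feature: str) -> None:
        self._off.add(feature)

    def turn_on(self, feature: str) -> None:
        self._off.discard(feature)

    def is_on(self, feature: str) -> bool:
        return feature not in self._off


def estimate_delivery(switches: DegradeSwitches, waybill_id: str,
                      precise_lookup: Callable[[str], str]) -> str:
    """Use the expensive lookup unless the feature has been downgraded."""
    if not switches.is_on("precise_eta"):
        return "ETA temporarily unavailable"  # cheap degraded answer
    return precise_lookup(waybill_id)
```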

Drills verify the effectiveness of the contingency plan. They can be run in two environments: directly in production, or in a simulated environment. Whichever environment you use, the drill should feel like a real incident and put pressure on the participants. People's abilities are also exercised in the course of the drill.

When the business wants to run a promotion but it is unclear whether the servers can support it, the easiest approach is stress testing. Stress tests come in three kinds: single-interface stress tests, production traffic replay, and simulated traffic replay. A single-interface stress test cannot accurately reflect the actual situation.

In that case we need to replay production traffic: capture all the operations performed in production and use a replay tool to stress test the whole environment. The replay tool must support replay at a multiple of the original volume so that the estimated business volume can be verified. It must also support generating its own data, because existing production traffic still differs from what an actual promotion looks like.
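One way to read "replay at a multiple" is to send every recorded request several times at its original offset, which multiplies volume while keeping the shape of the traffic curve. The sketch below builds such a plan; the `RecordedRequest` fields are assumptions, and actually sending the requests is left out.

```python
from dataclasses import dataclass


@dataclass
class RecordedRequest:
    offset_s: float  # seconds since the start of the recording
    path: str        # request path captured from production


def replay_plan(recording: list[RecordedRequest], multiple: int) -> list[tuple[float, str]]:
    """Return (send offset, path) pairs replaying the recording at `multiple` times its volume."""
    if multiple < 1:
        raise ValueError("multiple must be at least 1")
    plan: list[tuple[float, str]] = []
    for req in sorted(recording, key=lambda r: r.offset_s):
        # Each recorded request is issued `multiple` times at the same offset.
        plan.extend((req.offset_s, req.path) for _ in range(multiple))
    return plan
```

An alternative design compresses the time axis instead (dividing each offset by the multiple), which raises the request rate but shortens the test.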

We had two purposes in going active-active: first, to make the system more reliable, and second, to make rational use of disaster recovery resources and avoid waste. One of the most troublesome parts of active-active is keeping most requests, or all the requests of a unit, within the same data center as far as possible.

Cross-data-center traffic causes problems, especially when one data center goes down. There is also the problem of synchronizing data such as Redis and the database to keep cluster data consistent. Through the Kafka module, traffic can be diverted to the corresponding data center according to the diversion rules.

For traffic diversion, user requests must be divertible as soon as they reach the front end. This is done by marking the city code in the HTTP request sent by the app or browser and diverting the traffic to the corresponding data center according to that mark.

This figure shows the switch. If one data center has a problem, we change the configuration on the OPS platform to switch all its traffic to the other data centers.
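City-code diversion together with the failover switch can be sketched as below. The header name `X-City-Code`, the city codes, and the data center names are hypothetical, and the article does not describe how the OPS platform stores its configuration.

```python
class DataCenterRouter:
    """Route requests to a data center by city code, with a manual failover switch."""

    def __init__(self, city_to_dc: dict[str, str], default_dc: str) -> None:
        self.city_to_dc = dict(city_to_dc)
        self.default_dc = default_dc
        self.failover: dict[str, str] = {}  # failed data center -> replacement

    def switch_away(self, failed_dc: str, target_dc: str) -> None:
        """Send all traffic of `failed_dc` to `target_dc`."""
        self.failover[failed_dc] = target_dc

    def restore(self, dc: str) -> None:
        """Undo a previous switch once the data center has recovered."""
        self.failover.pop(dc, None)

    def route(self, headers: dict[str, str]) -> str:
        city = headers.get("X-City-Code", "")  # hypothetical header carrying the mark
        dc = self.city_to_dc.get(city, self.default_dc)
        return self.failover.get(dc, dc)
```

For example, with `{"0755": "dc-south", "010": "dc-north"}` a request marked `0755` goes to `dc-south` until `switch_away("dc-south", "dc-north")` is called.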

5. Continuation-value

What is the value of operation and maintenance?

First, quality. Quality comes down to availability, number of failures, average failure duration, and user satisfaction, all of which operation and maintenance must deliver.
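As a rough illustration of how the first three indicators relate, the sketch below derives them from a list of outage durations. The function name and the choice of minutes as the unit are assumptions, and the sample numbers in the usage note are invented for the example.

```python
def quality_report(period_minutes: float, outages_minutes: list[float]) -> dict[str, float]:
    """Compute availability, failure count, and average failure duration for a period."""
    downtime = sum(outages_minutes)
    count = len(outages_minutes)
    return {
        "availability": 1 - downtime / period_minutes,
        "failures": count,
        "avg_failure_minutes": downtime / count if count else 0.0,
    }
```

For a 30-day month (43,200 minutes) with outages of 12 and 31.2 minutes, `quality_report(43200, [12, 31.2])` reports availability of about 99.9%, 2 failures, and an average failure duration of 21.6 minutes.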

Second, cost. Our aim is to eliminate waste, which has several dimensions: whether resources are fully and reasonably used, whether capacity assessment is data-driven, whether processes are combined with tools, and whether manpower is optimized. We should find ways to replace repetitive work.

Third, efficiency. Traditional operation and maintenance should transform in one of two directions. One is toward IT operations, becoming a solution provider. The other is toward operation and maintenance development, freeing people from repetitive work.

Fourth, data operation. Operation and maintenance staff know the company's business processes and data models better than anyone. They need to do a great deal of data analysis and demonstrate data operation capability in order to create greater value for the company.

Thank you for reading! That concludes this example of the digital transformation of Internet operation and maintenance. I hope the content above is of some help to you. If you think the article is good, feel free to share it so that more people can see it.
