
Talking about online changes

2025-07-01 Update From: SLTechnology News&Howtos shulou


Original article (blackboard newspaper): https://mp.weixin.qq.com/s/hGr8P9A0-9RbvW70UxQckQ

The author works for JD.com and has an in-depth understanding of stability assurance, agile development, advanced JAVA and micro-service architecture.

Why talk about this topic today? Because I recently made a mistake during a release, and I want to share it with you. If you would rather skip the incident itself, you can go straight to the discussion further down.

The process went like this. My requirement was to add an optional parameter to the POJO class used as a method parameter, and I defined the new field in the last parent class in that POJO's inheritance chain. The service is exposed for remote RPC calls, so I have to publish a JAR package describing the classes for callers to use. This JAR package contains the service interface and the input and output parameter entities.

Version 1.1-SNAPSHOT was used during joint debugging and testing. Before the release, I changed the version number back to 1.0-SNAPSHOT and published an updated 1.0-SNAPSHOT package to the private repository, without changing any business code. With snapshot packages, every publish of the same version generates a new timestamped package, and a build downloads the most recent one for the specified version. For example, 1.0-SNAPSHOT-2019101309 and 1.0-SNAPSHOT-2019080808 can exist side by side. After my release, verification passed, which suggested that the change was backward compatible: a caller still on 1.0-SNAPSHOT-2019080808 could call my service, built from 1.0-SNAPSHOT-2019101309, without problems.

A few hours later, the caller released using the 1.1-SNAPSHOT package. After the full rollout, I found that requests were not reaching my service at all, because the RPC framework was throwing a serialization exception, and the caller began to roll back. The caller rebuilt against 1.0, which pulled the latest 1.0-SNAPSHOT-2019101309, and the error persisted. In a hurry, I started rolling back as well. However, I redeployed a previously built package, that is, one built from the old 1.0-SNAPSHOT-2019080808, so the calls naturally still failed. Only then did I realize that the caller had rebuilt during its rollback. I asked the colleague who maintains the private repository to delete the latest 1.0-SNAPSHOT-2019101309, then told the caller to rebuild and redeploy, and service was restored.
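The snapshot behaviour at the heart of this incident can be illustrated with a small sketch. It assumes the simplified naming used in the examples above (version, then a sortable timestamp suffix), not the real file format of any particular repository manager, and `resolve_snapshot` is a hypothetical helper rather than a build-tool API.

```python
def resolve_snapshot(published, version):
    """Return the newest timestamped build of `version`, or None if none exist.

    `published` is a list of names such as "1.0-SNAPSHOT-2019101309".
    Assumption: the trailing digits are a fixed-width yyyymmddhh timestamp.
    """
    prefix = version + "-"
    candidates = [name for name in published if name.startswith(prefix)]
    if not candidates:
        return None
    # With fixed-width timestamps, the largest integer is the latest build.
    return max(candidates, key=lambda name: int(name[len(prefix):]))


repo = ["1.0-SNAPSHOT-2019080808", "1.0-SNAPSHOT-2019101309"]

# Any rebuild, including one done during a rollback, picks up the newest build.
latest = resolve_snapshot(repo, "1.0-SNAPSHOT")

# Once the newest build is deleted from the repository, a rebuild
# falls back to the older one.
repo.remove("1.0-SNAPSHOT-2019101309")
after_delete = resolve_snapshot(repo, "1.0-SNAPSHOT")
```

Here `latest` is the 2019101309 build and `after_delete` is the 2019080808 build, which is why rebuilding during a rollback did not actually restore the old package until the newer snapshot was removed.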

There were many operational mistakes on both sides. The package I released had not been strictly tested and had not even gone through regression verification by both sides. My handling of the online problem was also wrong: I did not actually need to roll back. I had told the downstream team which version number to use, yet they still used the test version. The downstream team also skipped grayscale release verification and rebuilt its package when rolling back after the exception, and so on. A seemingly simple release failed, which shows that we had not fully mastered the online release standards.

Readers may think that online changes are not worth discussing in depth. A launch is nothing more than using some technical means to replace the package currently running online, or to overwrite configuration and then restart the service, and nowadays it can be done with a click of the mouse. However, there are many details, and if you do not pay attention to them you will make mistakes like mine. Next, I will talk about what needs attention before, during, and after a launch.

First, before the launch. A launch may be a release of a new instance group, a release onto new machines, or a release on existing machines. For a new instance group, you need to compare its configuration carefully with that of the old group, including the log level configuration. For new machines, the network segment may be new, permission to call public network services may not have been granted, dependent system libraries may not be installed, and the IP may not be on the relevant whitelists. These are all problems I have run into in my actual work.

Once the launch conditions and environment are ready, including the machine configuration mentioned above and the launch time, we can apply for the launch. In principle, launches are not allowed on the day before a holiday (including weekends), on the day of a major promotion (such as a product launch), or during peak traffic periods. A launch application generally includes a background description, the operation target, the operation steps, a CHECKLIST, the expected result, a rollback plan, and self-test results. Every action should be written down explicitly; reject options that are vague or that have only been judged feasible in someone's head. Your change should also be known to product managers, testers, and other colleagues, not just to you, the code reviewers, and your leaders. Otherwise, when other services are affected, people can only find out whom to contact by digging through the launch records or the commit history. If you are modifying common code or a shared protocol, announce it in advance.
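The release-window rule above is easy to automate. The following is a minimal sketch under stated assumptions: the function name, the holiday and promotion calendars, and the default peak period (10:00 to 12:00) are all hypothetical, and "the day before a holiday" is read literally as "tomorrow is a Saturday, Sunday, or listed holiday".

```python
from datetime import date, datetime, time, timedelta


def release_blockers(when, holidays=frozenset(), promo_days=frozenset(),
                     peak_hours=((time(10, 0), time(12, 0)),)):
    """Return the reasons a release at `when` should be refused (empty list = allowed)."""
    reasons = []
    tomorrow = when.date() + timedelta(days=1)
    # date.weekday() is 5 for Saturday and 6 for Sunday.
    if tomorrow.weekday() >= 5 or tomorrow in holidays:
        reasons.append("day before a holiday or weekend")
    if when.date() in promo_days:
        reasons.append("major promotion day")
    if any(start <= when.time() < end for start, end in peak_hours):
        reasons.append("peak traffic period")
    return reasons


# A Tuesday afternoon outside the peak period: no blockers.
tuesday = release_blockers(datetime(2019, 10, 15, 15, 0))

# A Friday afternoon: refused, because the next day is a Saturday.
friday = release_blockers(datetime(2019, 10, 11, 15, 0))
```

Returning the list of reasons, rather than a bare yes or no, lets the approval tool tell the applicant exactly which rule the requested time violates.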

Another thing to check before going online is that the package you built actually includes the code you merged. The packages built by most platforms now carry the latest COMMIT_ID submitted to the GIT repository in their names, which makes this easy to check.
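As a rough illustration of that check, the sketch below compares the commit id embedded in a package name with the commit you expect to ship, typically the tip of the release branch after your merge. The naming scheme `<name>-<version>-<commit>.jar` is a hypothetical example; real build platforms name their artifacts differently, so the pattern would need adjusting.

```python
import re


def package_commit_id(package_name):
    """Extract the commit id from a name like 'order-service-1.0-3f2a9c1.jar'.

    Assumption: the name ends with '-<7 to 40 lowercase hex chars>.jar'.
    Returns None when no such suffix is present.
    """
    match = re.search(r"-([0-9a-f]{7,40})\.jar$", package_name)
    return match.group(1) if match else None


def package_matches_commit(package_name, expected_commit):
    """True if the package name carries `expected_commit` (full or abbreviated SHA)."""
    built_from = package_commit_id(package_name)
    if built_from is None:
        return False
    # Either id may be abbreviated, so compare on the shorter length.
    n = min(len(built_from), len(expected_commit))
    return built_from[:n] == expected_commit[:n].lower()


ok = package_matches_commit(
    "order-service-1.0-3f2a9c1.jar",
    "3f2a9c1d5e8b4a6f9c0d1e2f3a4b5c6d7e8f9a0b",
)
```

This only confirms which commit the package was built from. It does not by itself prove that your change is contained in that commit, so the expected id should be taken from the branch after the merge.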

Once the pre-launch checks are complete, you can release and deploy the service instances. Generally, instances are deployed in batches by group and by data center. What I want to emphasize here is "by group and by data center", not just "by a proportion of all instances". Of course, we usually deploy in batches of about 30%. However, if that proportion happens to cover every instance of a group or of a data center, all the services in that group or data center are being released at the same time, and there may be no service provider left. If the launch then fails, the outage will not be a brief one.
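One way to express "in proportion, but never a whole group or data center at once" is a greedy planner like the sketch below. The `Instance` record and `plan_batches` function are hypothetical, and the rule it enforces is an assumption drawn from the paragraph above: each batch holds about 30% of the fleet and always leaves at least one instance of every group and every data center untouched, unless that group or data center has only a single instance.

```python
import math
from collections import Counter, namedtuple

# Hypothetical record for illustration: one deployed service instance.
Instance = namedtuple("Instance", "name group datacenter")


def plan_batches(instances, ratio=0.3):
    """Split instances into batches of about `ratio` of the fleet.

    No batch contains every instance of a group or of a data center,
    unless that group or data center has only one instance to begin with.
    """
    batch_size = max(1, math.floor(len(instances) * ratio))
    group_total = Counter(i.group for i in instances)
    dc_total = Counter(i.datacenter for i in instances)

    batches, pending = [], list(instances)
    while pending:
        batch, deferred = [], []
        in_group, in_dc = Counter(), Counter()
        for inst in pending:
            group_ok = (group_total[inst.group] == 1
                        or in_group[inst.group] + 1 < group_total[inst.group])
            dc_ok = (dc_total[inst.datacenter] == 1
                     or in_dc[inst.datacenter] + 1 < dc_total[inst.datacenter])
            if len(batch) < batch_size and group_ok and dc_ok:
                batch.append(inst)
                in_group[inst.group] += 1
                in_dc[inst.datacenter] += 1
            else:
                # Taking this instance now would overfill the batch or
                # empty a group or data center, so it waits for a later batch.
                deferred.append(inst)
        batches.append(batch)
        pending = deferred
    return batches


fleet = [
    Instance("p1", "pay", "bj"), Instance("p2", "pay", "sh"), Instance("p3", "pay", "bj"),
    Instance("o1", "order", "bj"), Instance("o2", "order", "sh"), Instance("o3", "order", "bj"),
    Instance("o4", "order", "sh"), Instance("o5", "order", "bj"), Instance("o6", "order", "sh"),
    Instance("o7", "order", "bj"),
]
batches = plan_batches(fleet)
first_batch = [i.name for i in batches[0]]
```

A naive "first 30%" slice of this fleet would take all three `pay` instances in one go. The planner instead puts `p1`, `p2`, and `o1` in the first batch and defers `p3` to the second.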

The first step of batch deployment is to pick one instance in each data center to release and verify, which helps us find problems early and keeps them from spreading. Some features need accumulated data before they can be verified, so deployments are sometimes spread over time, with a small proportion of instances deployed every hour. Of course, batch deployment is only a means of reducing launch risk. It is not a form of testing and cannot replace testing: deploying untested packages to production is prohibited, even when the only change is the version number of the RPC service's JAR package.
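The two ideas in this step, one verification instance per data center and time-spaced batches, can be sketched as follows. Both helpers are hypothetical illustrations; instances are represented as plain `(name, datacenter)` tuples, and the hourly spacing comes from the paragraph above.

```python
from datetime import datetime, timedelta


def pick_canaries(instances):
    """Return (canaries, rest): the first instance seen in each data center, then the others."""
    canaries, rest, seen = [], [], set()
    for name, datacenter in instances:
        if datacenter in seen:
            rest.append((name, datacenter))
        else:
            seen.add(datacenter)
            canaries.append((name, datacenter))
    return canaries, rest


def hourly_schedule(batches, start):
    """Pair each batch with a start time, one hour apart, beginning at `start`."""
    return [(start + timedelta(hours=i), batch) for i, batch in enumerate(batches)]


canaries, rest = pick_canaries([("a", "bj"), ("b", "bj"), ("c", "sh"), ("d", "sh")])
schedule = hourly_schedule([canaries, rest], datetime(2019, 10, 15, 14, 0))
```

In this example the canaries are `a` and `c`, one per data center, and the remaining instances are scheduled an hour later, which leaves time to verify the first release before widening it.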

Batch deployment generally requires verifying whether existing functions are affected, whether business monitoring shows anomalies, whether the service instance started normally, whether traffic arrives normally, whether the new feature works as intended, and whether the program's resource consumption and performance are normal. I have seen the publishing platform report a successful launch when the service instance had actually crashed with an OOM. If you hastily mount such an instance into the service at that point, exception alarms will be waiting for you. I have also seen colleagues deploy to newly added machines but forget to mount the traffic, wasting time and machine resources; fortunately the original machines withstood the traffic.
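The checks listed above fall naturally into two stages: what must hold before an instance is mounted, and what must hold once traffic is flowing. The sketch below is one possible shape for such a gate. The `InstanceStatus` fields and every threshold are hypothetical placeholders, and in practice the values would come from your own monitoring system.

```python
from dataclasses import dataclass


@dataclass
class InstanceStatus:
    process_alive: bool         # the service process is still running (no OOM crash)
    health_check_ok: bool       # the health endpoint answers
    heap_used_ratio: float      # used heap divided by maximum heap
    error_rate: float           # fraction of failed requests in business monitoring
    p99_latency_ms: float       # 99th percentile response time
    requests_per_second: float  # traffic actually arriving at the instance


def startup_failures(status, max_heap_ratio=0.85):
    """Checks to pass BEFORE mounting traffic, whatever the publishing platform reported."""
    failures = []
    if not status.process_alive:
        failures.append("instance is not running")
    if not status.health_check_ok:
        failures.append("health check failed")
    if status.heap_used_ratio > max_heap_ratio:
        failures.append("memory consumption too high")
    return failures


def traffic_failures(status, max_error_rate=0.01, max_p99_ms=500.0):
    """Checks to pass AFTER mounting: traffic arrives and is served normally."""
    failures = []
    if status.requests_per_second <= 0:
        failures.append("no traffic arriving")
    if status.error_rate > max_error_rate:
        failures.append("error rate too high")
    if status.p99_latency_ms > max_p99_ms:
        failures.append("latency too high")
    return failures


crashed = InstanceStatus(False, False, 0.0, 0.0, 0.0, 0.0)
unmounted = InstanceStatus(True, True, 0.4, 0.0, 0.0, 0.0)
```

`startup_failures(crashed)` flags the instance that the platform reported as launched but that had died with an OOM, and `traffic_failures(unmounted)` flags the healthy machine that nobody remembered to mount.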

On that note, I would like to add another situation. During deployment, you may find that a machine has abnormal connectivity and is under maintenance. It may be that the port used to deliver the package is affected while the port the service listens on is not. In that case, you need to remove traffic from this machine so that, once its status returns to normal, traffic does not hit the wrong service version. Why mention this separately? To emphasize that we must make sure every instance is updated to the new version, with no instance missed.
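A simple audit at the end of a release makes missed instances visible. The sketch below assumes you can collect, for each instance, the version it reports, with `None` standing for a machine that could not be reached. The function name and the data shape are hypothetical.

```python
def audit_versions(reported, expected_version):
    """Split instances that are not confirmed on `expected_version`.

    `reported` maps instance name -> reported version, or None if unreachable.
    Returns (stale, unreachable), each a sorted list of instance names.
    """
    stale = sorted(name for name, version in reported.items()
                   if version is not None and version != expected_version)
    unreachable = sorted(name for name, version in reported.items()
                         if version is None)
    return stale, unreachable


stale, unreachable = audit_versions(
    {"a": "1.1", "b": "1.0", "c": None, "d": "1.1"},
    expected_version="1.1",
)
```

Here `b` is still on the old version and `c` could not be checked. Following the advice above, an unreachable machine like `c` should have its traffic removed until its version is confirmed, because it may come back serving the old code.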

Online changes are a frequent source of incidents. When a problem occurs, do not panic: report to your leader first, then stop the loss and restore service as quickly as possible. If an exception occurs while verifying the first instance, the quickest fix is to modify the Nginx configuration to move traffic to other healthy machines. Simply removing the traffic or stopping the instance does not take effect immediately, because the heartbeat detection of the load balancer in the user access layer may lag. If you find an anomaly after the full release and cannot stop the loss in time by following the emergency plan, the only choice is to roll back the service to avoid secondary impact.

In fact, users who run into problems rarely report them; most stay silent. So full verification and full monitoring must be in place when going online, rather than waiting for user feedback. By the time users report a problem, the scope of impact may already be very large. We also need to standardize the work process and its outputs to improve stability and quality.

OK, that is all for this post. If it was helpful, feel free to share it with your friends.

Author's blog: www.liangsonghua.me

Author introduction: Liang Songhua, senior engineer at JD.com, with a long-term focus on stability assurance, agile development, advanced JAVA, and micro-service architecture
