2025-02-28 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
Hangzhou Digital Cloud Luo Xingfeng
Hangzhou Digital Cloud Information Technology Co., Ltd., founded in 2011, is a leading provider of digital marketing software products and services in China. For many years it has provided consumer and retail brands with one-stop digital marketing solutions that integrate software products, data models and professional services. As the business grew rapidly and product functions became more complex, Digital Cloud's operations and maintenance department often spent a lot of time troubleshooting problems reported by the business departments. To resolve problems found in the running system more efficiently, Digital Cloud began to try APM. After nearly a year of use, APM has not only given Digital Cloud automatic information collection and automatic correlation based on business flows, but has also improved the working methods of the operations and technical departments and raised the capability of the whole team. Everyone has formed the habit of explaining things and problems with data.
The following is the experience of using APM shared by Luo Xingfeng, Operations Director of Hangzhou Digital Cloud Information Technology Co., Ltd., under the theme "Enhancing Business Value and Creating an Excellent User Experience: APM Application and Integration".
Luo Xingfeng: My starting point is not quite the same as that of the previous speakers. They all talked about how to implement APM; I want to talk about applying and integrating APM from the perspective of an APM user.
As a business-oriented company, helping our business sell better is the only reason we adopt a new technology. Today I got a cool T-shirt when I checked in. I really wanted to wear it for this talk, but when I tried it on I found that my shoulders were too narrow for it, so I changed back. Clothing manufacturers often run into this kind of problem, but most consumers never tell the manufacturer what was wrong with the clothes. Digital Cloud's position in the software industry is like that of a clothing manufacturer. If a button in the software always takes 30 seconds to respond, usually no end user will contact the company about such a humble "experience problem"; they will quietly give up the product. For Digital Cloud this kind of problem badly needs optimizing, but unless customer loyalty is very high, no one will take the initiative to tell us about this "small" problem, even though it may affect the business and would be worth a large bonus to find. This is where the value of APM shows.
From a product point of view, Digital Cloud provides complex enterprise applications that are sold over the Internet, so the architecture is very complex. One number gives a sense of the scale: the JVM can be as large as 30 GB, with a great deal of business data running inside it, and a single customer creates many tasks. If a customer has up to 100 concurrent tasks, that customer has a hundred jobs running in the system, and they are different every time. I have no idea which one has gone wrong, and sometimes even the customer does not know, because these tasks run automatically in the background without human intervention, and a task may run for a day or two. If the task is tied to the customer's marketing, a failure may cause losses in the hundreds of thousands. If something goes wrong with the system and we did not take the initiative to find it, can Hangzhou Digital Cloud really say it is not responsible?
As Digital Cloud's operations and maintenance team, we cannot build separate monitoring and alarms for every task, because most failures are special cases. So the biggest questions are where the fault is, how serious it is, and whether it is accidental. One of our 1,500 customers may hit a given problem once a month, but what is the reason this time? We have to reconstruct the crime scene. As with the full data mentioned earlier, the data for that click or that job at that moment must not be lost. If it is lost, we cannot see the crime scene and will not know what the situation was at the time. Maybe an external channel was down, maybe something was wrong with the data platform, or maybe the network was simply off. Digital Cloud is not as powerful or as fault-tolerant as Tencent or Taobao, so we cannot cover everything. The most important thing is to find the problem, record what happened at the time, and solve it together with the developers. That is the job of operations and maintenance.
We once ran into a performance problem. Some customers reported that a certain feature was slow, but when we tested it the feature was fast, so we assumed the slowness was sporadic. On the application side there really was no big problem. After investigation we found that the problem occurred between nginx and the app layer and was caused by unstable communication bandwidth in the Taobao data center. It was a simple problem, but it took two core programmers and three or four ordinary programmers a month to find it. Usually the easiest problem to locate is a service that is completely dead, and the hardest is instability. In such cases the situation at the time cannot be reconstructed quickly: the logs of every link have to be found and read, and they are hard to connect to one another.
Correlation means connecting all the performance data for one click, from the front end through the back end to the database, so that when something goes wrong we can see what kind of problem it is. Our method at the time was to look at timestamps and judge from the point in time whether log entries were related, which requires an engineer to read all the log data from beginning to end. Very few engineers can really do this, so we all worked on it together for a month. This is what happened at Digital Cloud in the first year. The final solution was very simple, but finding the problem was really difficult. We also ran into a database performance problem. Because the parameters of one click were not set, the records it read had to be scanned 80 times in the database, and many of those scans were useless, but the problem could not be found just by checking the database.
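The difference between correlating by time and correlating by a shared identifier can be sketched as follows. This is only an illustration under stated assumptions: the log records, the field names (`tier`, `ts_ms`, `request_id`, `ms`) and both helper functions are hypothetical and do not describe how any particular APM product stores its data.

```python
from collections import defaultdict

# Hypothetical log records from three tiers; all field names are illustrative.
LOGS = [
    {"tier": "nginx", "ts_ms": 100000, "request_id": "r1", "ms": 950},
    {"tier": "app",   "ts_ms": 100020, "request_id": "r1", "ms": 900},
    {"tier": "db",    "ts_ms": 100050, "request_id": "r1", "ms": 20},
    {"tier": "nginx", "ts_ms": 100030, "request_id": "r2", "ms": 40},
    {"tier": "app",   "ts_ms": 100040, "request_id": "r2", "ms": 30},
]

def correlate_by_time(logs, start_ms, window_ms):
    """Manual approach: take every record within a time window.

    Records from unrelated requests that overlap in time end up mixed together.
    """
    return [r for r in logs if start_ms <= r["ts_ms"] <= start_ms + window_ms]

def correlate_by_id(logs):
    """Group records that share a request ID, ordered by timestamp."""
    groups = defaultdict(list)
    for record in sorted(logs, key=lambda r: r["ts_ms"]):
        groups[record["request_id"]].append(record)
    return dict(groups)

if __name__ == "__main__":
    mixed = correlate_by_time(LOGS, 100000, 60)
    print(sorted({r["request_id"] for r in mixed}))
    for request_id, records in correlate_by_id(LOGS).items():
        print(request_id, [(r["tier"], r["ms"]) for r in records])
```

With the time window, the engineer still has to decide by hand which of the overlapping records belong to the slow click; grouping by an identifier carried through every tier removes that step.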
It is relatively simple now. Through nearly a year of cooperation with Cloud Wisdom, Digital Cloud has established a complete APM link that connects everything from the front-end app down to the underlying application components. The biggest advantage is that all the data for a single access can be obtained accurately. Each click mentioned earlier is a transaction, and what matters is getting the performance of every stage of that transaction from nginx to the app. We can first find out whether an access is slow or fast and set alarms on thresholds, which were later integrated into the management of important functions. After getting an alarm, we look at where it is slow. Originally we would have had to check this against the database logs. However, the RDS logs of the cloud service the application uses are hard to obtain, and in many cases we cannot get the database logs at all. In that case, obtaining the data through the application's interfaces is the better approach.
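A threshold alarm on one transaction, followed by a look at where the time went, might look like the sketch below. The stage names, the timings and the 2000 ms threshold are all made up for illustration, and the sketch assumes the stage times do not overlap, so that they can simply be added up.

```python
# Hypothetical per-stage timings (milliseconds) for two transactions.
SLOW = {"nginx": 15, "nginx_to_app": 2400, "app": 120, "db": 35}
FAST = {"nginx": 10, "nginx_to_app": 5, "app": 80, "db": 20}

def check_transaction(stages, threshold_ms):
    """Return None if the transaction is within the threshold.

    Otherwise return an alarm naming the slowest stage.
    Assumes stage times are exclusive, so the total is their sum.
    """
    total = sum(stages.values())
    if total <= threshold_ms:
        return None
    slowest = max(stages, key=stages.get)
    return {
        "total_ms": total,
        "slowest_stage": slowest,
        "slowest_ms": stages[slowest],
    }

if __name__ == "__main__":
    print(check_transaction(FAST, threshold_ms=2000))
    print(check_transaction(SLOW, threshold_ms=2000))
```

In the bandwidth incident described earlier, an alarm of this kind would have pointed at the hop between nginx and the app rather than at the application or the database.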
I just said that APM is a technology that helps us find problems. For a company on the application side, there are still many problems to solve before APM is really applied in the enterprise and becomes productive. One example is how to manage APM alarms. We could never build another monitoring or alarm platform on top of the APM platform, because we do not have the people: we cannot set up a team for an external system and leave two or three people watching APM data every day. But we do have our own internal monitoring system, and Cloud Wisdom feeds its alarm data into our monitoring platform. That lets us handle APM monitoring results in the same workflow as everything else and send them to the staff on monitoring duty through a unified API or alarm SMS.
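Feeding external alarms into an existing workflow usually comes down to mapping them onto the internal alarm format. The sketch below shows the idea only: the incoming fields (`app`, `metric`, `value`, `severity`), the internal fields and the severity levels are all hypothetical, not the format used by Cloud Wisdom or by Digital Cloud's platform.

```python
# Hypothetical mapping from APM severity names to internal alarm levels.
SEVERITY_MAP = {"critical": 1, "warning": 2, "info": 3}

def to_internal_alarm(apm_alarm):
    """Convert an APM alarm (hypothetical shape) to the internal alarm shape.

    Unknown or missing severities fall back to the lowest level (3).
    """
    return {
        "source": "apm",
        "level": SEVERITY_MAP.get(apm_alarm.get("severity"), 3),
        "title": f'{apm_alarm["app"]}: {apm_alarm["metric"]} = {apm_alarm["value"]}',
        "raw": apm_alarm,  # keep the original payload for later investigation
    }

if __name__ == "__main__":
    incoming = {
        "app": "campaign-web",
        "metric": "response_ms",
        "value": 3200,
        "severity": "critical",
    }
    print(to_internal_alarm(incoming))
```

Once the alarm is in the internal shape, the existing dispatch path (API or SMS) can treat it like any other alarm, so no separate team or console is needed for the external system.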
Once we have the APM data, there are two main uses. One is to present the APM performance data directly on Cloud Wisdom's platform, especially the platform-level application monitoring data and correlation data. At the time we ran into a problem: the volume of data was too large and it was poorly correlated. After repeated discussions about how to put the APM data to use, Cloud Wisdom strengthened the customization capability of its product. Our APM data gradually took the form of definable reports, which were integrated into Digital Cloud's internal operations analysis system, and we now analyze the performance data every week. To return to the T-shirt example from the beginning: if analysis shows that none of the small sizes on sale are being bought, there may be something wrong with my product design, because user demand should not be skewed to one end. Product problems are discovered from anomalies in business data. Digital Cloud works the same way: we use business analysis to find which online features are slow or keep reporting errors and timeouts, feed that data back to R&D, and keep following up on how the problem is solved.
The big difference between us and large enterprises is that operations and maintenance at a startup is responsible not only for its own affairs but also for product quality and product delivery. You have to cover more functions, even more than R&D does. Our report not only points out which address was slow this week, it also has to show which method or which domain was slow, and that has high business value.
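A weekly report of this kind is essentially a group-and-rank over the collected timings. The following is a minimal sketch; the record fields (`url`, `method`, `ms`) and the sample values are hypothetical, and a plain average is used only to keep the example short.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical one-week sample of timings collected per request.
RECORDS = [
    {"url": "/campaign/run", "method": "CampaignService.execute", "ms": 4200},
    {"url": "/campaign/run", "method": "CampaignService.execute", "ms": 3800},
    {"url": "/report/list", "method": "ReportDao.query", "ms": 300},
    {"url": "/report/list", "method": "ReportDao.query", "ms": 500},
    {"url": "/login", "method": "AuthService.login", "ms": 80},
]

def slowest(records, key, top=3):
    """Rank the values of `key` by average response time, slowest first.

    Returns a list of (value, average_ms, request_count) tuples.
    """
    buckets = defaultdict(list)
    for record in records:
        buckets[record[key]].append(record["ms"])
    rows = [(k, round(mean(v), 1), len(v)) for k, v in buckets.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]

if __name__ == "__main__":
    for name in ("url", "method"):
        print(name, slowest(RECORDS, name))
```

Ranking the same records by `url` and then by `method` is what lets the report say not only which address is slow but also which method behind it is responsible.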
APM is a very cool technology. As long as the early cooperation goes well, the final rollout is very simple: with the help of APM, a young woman who graduated a year ago and has no technical background can produce the report. This matters a great deal to startups and to vendors who apply APM. The young woman who makes our report often takes it to R&D and bangs on the table, demanding to know when an issue will be fixed. R&D does not know the problem exists online, but she does. Through APM, operations staff take on more important responsibilities in the development of the business and really do solve many problems, and the time from spotting a business fluctuation to solving the problem is short: two to three days.
Through this experience, the whole operations and maintenance team, and even the company's technical team, found that our role had begun to move toward business operations combined with operations and maintenance, which suits startups and growing companies very well. At the same time, several of our operations engineers with two to three years of experience have made rapid progress and have all become well-qualified operations architects or operations managers. The clearest sign of their growth is that they know how to choose. We are very clear about which technologies we will work on and which we cannot take on at a given stage.
Why? For example, we will not invest heavily in APM itself, because it is not our own line of business, so we hand that part to Cloud Wisdom. What we have to do ourselves is the architecture in the middle, and no other company will do that for us. So we need a way to bring every system under the same strict control, including external, self-built and internal products, and to use external resources in the same way as our own in order to develop quickly. All that remains is to do the accounting: how much it costs to do something ourselves, how much it costs to have someone else do it, and, if someone else does it, what standard it must meet to be connected. Only in this way can we arrive at a unified method. This is what operations and maintenance at a startup needs to do.
It is best to let professionals do APM, because we cannot build these things ourselves. When we did not have these × × points at the beginning, it was very difficult for us to use any technology or method to string the data together from front to back, so we spent most of our time correlating all the clues at the crime scene. A veteran police officer does not rely on a single piece of information, but on the ability to bring all the information together. What APM does is very powerful: on the one hand it automatically collects all the information, and on the other it automatically correlates it. This is not something ordinary enterprises can do. The implementation also takes a long time, but the time is well spent, because APM is now able to help us find and solve problems.
All our products are stress tested before launch, which means hitting the application with heavy load until it crashes, because during e-commerce peaks such as Singles' Day the load on the product is 100 times higher than usual. The purpose of stress testing is to make sure the product runs healthily under extreme conditions. The stress test shows which system fails first, and in which location and which functional module. Testing with APM makes it easier to focus, so that we can draw up a Singles' Day operations plan. If in the end the system cannot cope, we have to degrade service from midnight to 4 a.m. by turning down the point that fails, and APM helps us find that point.
In addition, we need fewer people, the barrier to entry is lower, and everyone's effective working time is longer. We have gone from being busy every day with less important things, such as the problem locating mentioned above, to focusing on solving problems. In the past, operations and maintenance demanded a very high technical level: the underlying systems run on the group's platform, so we needed a deep understanding of that platform. Now things are much less difficult. In the past we had to wait passively until a customer complained that a function had failed before we could find and solve the problem. Now, as long as we do performance analysis and performance alarms every week, customers do not run into many problems. Even before customers use a new feature, we can find that it has a performance bottleneck because the alarm has already fired, and then tell product, R&D and operations that the function needs to be adjusted immediately. This wins us a lot of time. Especially at times like Singles' Day, if we do not launch a feature in time we are likely to be beaten to it by competitors; merchants cannot wait and will buy a competitor's product. By then our product is late and the marketing opportunity is gone. Whether the responsibility lies with R&D or with operations, it is a failure for the company.
Work at a large Internet company like BAT is very focused: for example, testing the performance of network protocols down to the packet size that can bring a service down. The operations department of a startup also has to help the company develop better and make more money. That gives the team's employees a sense of professional achievement that goes beyond the satisfaction of solving technical problems, because the company makes money from the technical problems you solve. APM vendors have a strong sense of professional achievement too, because their technology helps customers solve problems.
It has been about a year since we first came into contact with APM. Last year we had nothing like it. At the beginning both sides had only a vague idea of what was needed. We put it into practice, then improved our working methods and the capability of the whole team. We overcame many problems along the way and formed a new habit of explaining things and problems with data. This is the greatest value of APM.
Finally, let's review today's talk. APM brings different value to different types of enterprises. For startups, what matters most is that it saves time and manpower, speeds up time to market, strengthens R&D and operations capabilities, and lets the company grow healthily. Many large companies have more complex businesses; for them APM may even become a new source of business growth, and they are likely to build their own APM. But we also see a new direction. APM can do not only IT-based application performance analysis but also business performance analysis. A product manager once came to me and asked me to find out which feature in the product customers liked most and whether users were getting on well with it, and the data for that analysis came from what APM had collected. Through our cooperation with Cloud Wisdom we understand APM better and better. We know what data APM collects, what we can do with it and what we cannot. We continue to deepen our cooperation, because APM must be implemented at the system level. How a business module is doing, how popular it is, how customers react and which region has the most clicks can all be obtained from APM, and more value can be shown in the future.
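The product manager's question (which feature is used most, and from which region) can be answered with a simple count over collected click events. This is a sketch under assumptions: the event fields (`feature`, `region`) and the sample data are hypothetical and are not the schema of any APM product.

```python
from collections import Counter

# Hypothetical click events as they might be derived from collected APM data.
EVENTS = [
    {"feature": "coupon", "region": "Hangzhou"},
    {"feature": "coupon", "region": "Hangzhou"},
    {"feature": "coupon", "region": "Beijing"},
    {"feature": "sms", "region": "Hangzhou"},
    {"feature": "sms", "region": "Beijing"},
    {"feature": "segment", "region": "Hangzhou"},
]

def usage_summary(events):
    """Count clicks per feature and per region, most frequent first."""
    by_feature = Counter(e["feature"] for e in events)
    by_region = Counter(e["region"] for e in events)
    return by_feature.most_common(), by_region.most_common()

if __name__ == "__main__":
    features, regions = usage_summary(EVENTS)
    print("features:", features)
    print("regions:", regions)
```

The same events that drive performance alarms can be counted this way, which is why no separate data collection is needed for the business view.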
The idea of placing a sensor at each link, as APM does, is relatively easy to put into practice. Many companies achieve monitoring and data collection through intrusive embedded instrumentation, but as a third-party service provider we should not intrude too far into customer systems, so we obtain data through active probing instead, and the effect is the same. We also do many other things with sensor data, such as tracking hard disk mount points. Because we run into differences such as machine models and network cards, we use some small robots to transmit the data. These points will not necessarily determine which APM product you use, but the idea will certainly improve your work. Thank you!
Q&A
Q: You have just shared many of the positive things APM has brought. I would like to know more about what you have adopted and which problems you have run into that are still unsolved, for example whether background tasks can be monitored.
Luo Xingfeng: Previously, background tasks could not be monitored. Now they can, but we have not rolled this out widely online, because that data can be collected by other means. They have indeed implemented it, though.
Q: There are many problems when a product is first released, and dependence on APM is very strong then. Has the importance of APM changed as the product has evolved?
Luo Xingfeng: We tend not to wait until the product is fully online. We run it in a gray release, and once it is very stable we deploy the back-end APM probe. The browser side is harder, because there are too many browser features that depend on how the customer writes their pages, and we may run into conflicts. But conflicts can be adjusted as long as they can be identified and found; there is nothing wrong with the way the code is written. One advantage of Digital Cloud's products is that our customers are B2B, and their browser requirements are very low. With B2C users, a lot of compatibility testing is needed. Teams often find a performance problem, ask the back-end programmers to locate it, and are told after a long time that all the data looks fine. In the end the fault turns out to be in the front-end JS, and a slight change fixes it. This is why browser monitoring matters: it must really be initiated from the customer side, and Cloud Wisdom's capability here is actually quite strong. If we had a B2C product we would have to do this, and we would push monitoring as far toward the user as possible, because the user experience of the product is still very important. The complexity of Digital Cloud's products is very high. A customer may use only 30% or 40% of the product, but some customers like this 30% and others like a different 30%, so the product must be flexibly customizable. Allowing our software to be used in different ways while balancing these features is the main problem we face now.
Cloud Wisdom is a provider of business operations and maintenance solutions. Its products Jiankongbao (www.jiankongbao.com), Toushibao (www.toushibao.com) and Yacebao (www.yacebao.com) have provided one-stop application performance monitoring, management and testing services to hundreds of thousands of users in e-commerce, mobile Internet, advertising and media, online games, education, health care, finance and securities, government, enterprise and other industries.