Source: https://www.toutiao.com/a6716402400925581836/
Professor Lin Dahua, co-founder of Shangtang Technology (SenseTime) and director of the Chinese University of Hong Kong-Shangtang Joint Laboratory, shared new explorations in computer vision research at the 2018 Global Artificial Intelligence and Robotics Summit (CCF-GAIR) in Shenzhen.
In his speech, Lin Dahua summarized, reflected on, and looked ahead at the development of computer vision over the past few years. He said that deep learning has ushered in a golden era for computer vision. The field has made great progress in recent years, but the progress has been extensive, built by piling up data and computing resources; whether this model of development is sustainable is worth pondering.
He pointed out that as computer vision approaches a ceiling in accuracy, the industry should seek progress on more levels. Shangtang's attempts cover three main aspects: first, improving the efficiency with which computing resources are used; second, reducing the labeling cost of data; and third, improving the quality of what artificial intelligence produces.
The following is the full content of Lin Dahua's speech:
It is a great honor to be here today to share the work of the Chinese University of Hong Kong-Shangtang Joint Laboratory over the past few years. Several earlier speakers gave wonderful presentations from a business point of view, and I believe everyone benefited a great deal. My talk may be a little different. I am a co-founder of Shangtang, but I am not directly involved in its business operations, so if the question you care about is when Shangtang will go public, I'm afraid I cannot answer it.
But I can tell you that Shangtang was not built in a day. Its success rests not only on the efforts of the past three and a half years, but also on eighteen years of original technology accumulated in the laboratory behind it. What this laboratory works on is not what Shangtang profits from today, but the directions it must pursue over the next three, five, or even ten years if it is to become a great technology company.
Artificial intelligence is developing rapidly, but the development is extensive.
The following picture must be familiar to all of you.
Over the past eight years, computer vision has made breakthrough progress, and the most important technical advance has been the introduction of deep learning. The field has a very high-profile competition, ImageNet. Before 2012, the recognition error rate in this competition was relatively high; after deep learning was introduced in 2012, computer vision entered a four-year golden period during which the error rate in the ImageNet competition dropped from 20% to nearly 3%, after which progress stagnated until the competition was discontinued last year.
So I would like to ask a question: deep learning has indeed driven great strides and breakthroughs in computer vision during this golden period, but does that mean the development of computer vision has reached its end? Standing where we are today and looking ahead three, five, or ten years, which directions should we study? This is what our laboratory and Shangtang have been thinking about.
The success of artificial intelligence over the past few years is not accidental, nor is it merely the result of progress in algorithms; it is the historical convergence of many factors. The first factor is data: we have huge amounts of it. The second is the development of GPUs, which brought a big jump in computing power. On top of data and computing power, progress in algorithms produced the success of artificial intelligence today and its deployment in many application scenarios. The message I want to convey is that although we see the success of artificial intelligence and great progress in algorithms, artificial intelligence is not magic; in a sense, it is a performance improvement propped up by large amounts of data and powerful computing.
Looking back at the brilliant development of artificial intelligence in recent years, we can see that, in a sense, it has been very extensive. Everyone chases accuracy and performance, and Chinese companies occupy the top three spots on every competition leaderboard. Yet although we appear on many leaderboards, most of the industry's profits are earned by the companies that set the standards. Is this model of development sustainable? That is worth our deep consideration.
In addition to accuracy, we should pursue efficiency, cost and quality.
Looking back at the development of deep learning or artificial intelligence in the past few years, I think we still have a lot of work to do and a long way to go.
Next, I would like to share a few directions I have been thinking about. First, learning efficiency: have we made full use of existing computing resources? Second, how do we solve the cost of data and labeling? Third, although we reach 99.9% accuracy on the leaderboards, can the trained models really meet the needs of daily life and industrial production? These are the problems we need to solve to help artificial intelligence develop and deploy better and faster.
First, I would like to talk in detail about the first aspect: efficiency.
As mentioned earlier, we are following an extensive development path that trades accumulated data and computing resources for high performance; it is a competition of resources rather than of efficiency. With the industry where it is today, the companies that set the standards have taken most of the profits. Facing this situation, how should we develop in the future? To answer that, we should first review current models and technical approaches and ask whether there is still room for optimization. The principle of optimization is simple: put good steel on the blade's edge, that is, spend computation where it matters most.
Let me give an example. We entered the video field two years ago. Video demands very high efficiency because the data volume is enormous: twenty-four frames per second, roughly 1,500 frames per minute, which is equivalent to a medium-sized image database. It is clearly inappropriate to handle video the way we traditionally handle single images.
In 2013 and 2014, most video analysis methods were simple and crude: take every frame, run a convolutional network on it, and finally combine the results to make a judgment. Although computing resources have grown rapidly in recent years, GPU memory is still limited; if every frame is fed into a CNN, the memory holds only about 10 to 20 frames, so a single second of video fills the GPU. Long videos cannot be analyzed at all this way, which makes it a very inefficient mode.
We know that adjacent frames of a video are highly repetitive; if you run the network on every frame, a great deal of computation is simply wasted. After realizing how inefficient this repeated computation is, we changed the sampling scheme to sparse sampling: no matter how long the video is, it is divided into equal-length segments, and only one frame is taken from each segment. In this way we still cover the whole video in time, so the analysis remains reliable and accurate. With this network we won the ActivityNet championship in 2016, and most video analysis architectures have since adopted this sparse sampling approach. A simple sketch of the sampling scheme follows below.
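To make the idea concrete, here is a minimal sketch of segment-based sparse sampling, assuming hypothetical callables extract_feature and classifier for the per-frame network; the segment count and the averaging consensus are illustrative choices, not the exact architecture described in the talk.

```python
import random

def sparse_sample(num_frames, num_segments=3):
    """Split the video into equal-length segments and pick one random frame per segment."""
    seg_len = num_frames // num_segments
    return [seg * seg_len + random.randrange(seg_len) for seg in range(num_segments)]

def classify_video(frames, extract_feature, classifier, num_segments=3):
    """Run the per-frame network only on the sampled frames, then fuse the scores."""
    indices = sparse_sample(len(frames), num_segments)
    scores = [classifier(extract_feature(frames[i])) for i in indices]
    return sum(scores) / len(scores)   # segmental consensus by averaging
```

Because only a handful of frames per video are ever pushed through the heavy network, GPU memory and computation stay bounded no matter how long the video is.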
After that, we broadened the research scope from video understanding to object detection in video. This brings new challenges. For classification and recognition we can cut the video into segments and form a general understanding of each segment, but object detection cannot be done that way: we must output the position of the object in every frame, so time cannot simply be made sparse.
The picture below shows the network with which we won first place in the video object detection track of the 2016 ImageNet competition. Its basic approach is to extract features from each frame, determine the object category, adjust the bounding box, and then link the results together across frames. Every frame has to be processed; at the time, the most powerful GPU could handle only a few frames per second, and training the network required a large number of GPUs.
We hoped to bring this technology into real scenarios and obtain a real-time object detection framework. Processing each frame the way I just described takes 140 milliseconds, which makes real time impossible. But if we sample sparsely, say one detection every 20 frames, what do we do with the frames in between?
You might want to fill them in by interpolation, but we found that this hurts accuracy badly; even at an interval of ten frames the accuracy drops significantly. In our new method we exploit the relationship between frames: a much cheaper network module transfers information from frame to frame in only 5 milliseconds, while detection accuracy is well maintained. After re-planning the computation path of video analysis in this way, the overall cost dropped greatly. There is nothing new in the components; the networks are the same networks. We simply re-planned the computation path and redesigned the whole framework, as sketched below.
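Here is a minimal sketch of that keyframe-plus-propagation loop. The callables detect (the expensive per-frame detector, around 140 ms in the talk) and propagate (the lightweight inter-frame module, around 5 ms) are hypothetical stand-ins, and the 20-frame interval simply follows the figure quoted above.

```python
def detect_video(frames, detect, propagate, key_interval=20):
    """Run the expensive detector on keyframes only; carry results between them cheaply."""
    results = []
    last = None
    for i, frame in enumerate(frames):
        if i % key_interval == 0:
            last = detect(frame)            # heavy detector, run sparsely (~140 ms per call)
        else:
            last = propagate(last, frame)   # lightweight inter-frame transfer (~5 ms per call)
        results.append(last)
    return results
```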
You can see the results below. The top row is the frame-by-frame processing, at roughly 7 frames per second, using the network from the 2016 competition; we improved it to more than 62 frames per second, and the results are more reliable and smoother because the method uses the correlation between multiple frames.
Shangtang is also working on autonomous driving, which requires automatic scene understanding and semantic segmentation while the car is moving; this, too, is a mature field. But one point has received little attention: everyone focuses on pixel-level segmentation accuracy, and that by itself is not what matters. In real autonomous driving, what we care about is how quickly the system can tell that a person has appeared in front of the car so that it can react urgently. In this scenario, the efficiency and speed of judgment are critical. The previous method takes more than 100 milliseconds to process one frame; if a person really did appear in front of the car, it would be too late to react.
Using the approach just described, we rebuilt the model to make full use of the connection between frames and reduced the per-frame processing time from 600 milliseconds to 60 milliseconds, greatly improving the response to sudden situations. The method is similar to the one above, so I will not go into the technical details.
I have just talked about improving efficiency; next, let me talk about reducing the cost of data.
There is a saying in this field that behind every bit of "intelligence" there is a corresponding amount of manual labor. Artificial intelligence owes today's prosperity to the silent dedication of thousands of data annotators. Shangtang currently has nearly 800 annotators labeling data day and night, and the labeling teams of some large companies number in the tens of thousands, which is an enormous cost.
How to reduce the cost of data labeling is something we think about every day. Since many things cannot be labeled manually, can we change our way of thinking and use the labeling information already contained in the data and scenes themselves?
The picture below shows a study we did last year, published at CVPR, which tries a completely new way of learning. In the past, the labeling cost of images was very high: each picture not only had to be labeled, the target object also had to be enclosed in a bounding box. To learn to recognize animals, for example, you would label them by hand. But that is not how we learned to recognize animals as children; the teacher did not hand us pictures with boxes drawn on them. We watched "Animal World". That led me to ask: can a model learn to recognize animals simply by watching "Animal World"? The documentary has subtitles; if we associate them with the visual scenes, can the model learn automatically? To that end, we designed a framework that establishes the relationship between vision and text, and obtained the results shown in the figure below.
The picture below shows dozens of animals that the model can accurately recognize just by watching "Animal World" and National Geographic, without any labeling or human intervention.
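As an illustration of how subtitle text can act as free supervision, here is a minimal co-occurrence sketch. The callable visual_cluster_id, which groups visually similar animals, is a hypothetical stand-in, and the actual CVPR work learns a vision-text association model rather than counting; this only conveys the weak-supervision idea.

```python
from collections import defaultdict

def name_visual_clusters(clips, visual_cluster_id):
    """clips: iterable of (frame, subtitle_words) pairs; returns a name for each visual cluster."""
    counts = defaultdict(lambda: defaultdict(int))
    for frame, words in clips:
        cluster = visual_cluster_id(frame)      # hypothetical grouping of similar-looking animals
        for word in words:
            counts[cluster][word] += 1
    # Name each visual cluster by the subtitle word that co-occurs with it most often.
    return {cluster: max(word_counts, key=word_counts.get)
            for cluster, word_counts in counts.items()}
```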
Face recognition, likewise, requires large amounts of labeled face data. Yet some data, such as our family photo albums, is unlabeled but contains a great deal of information.
Look at the picture below, which shows several scenes from the movie Titanic. In the scene at the upper left, it is hard to recognize the two people from their faces alone. In the first scene at the upper right, we can recognize that the person on the left is Rose, but the man in the suit on the right is still unclear. If we can recognize the scenes throughout the movie, we find that Jack and Rose often appear in the same scene; based on this social relationship, we can infer that the man in the black suit is probably Jack. In this way, we obtain a great deal of meaningful data without labeling any faces.
We also apply this idea to video surveillance. As a person walks from one end of a street to the other in Shenzhen, the captured face keeps changing, but as long as we can track the trajectory, we know that all the faces photographed along it belong to the same person, and that is very valuable information for training face models. This result has just been published in a CVPR paper.
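A minimal sketch of turning trajectories into training data, assuming a hypothetical tracks structure: a list of trajectories, each a list of face crops believed to belong to one person. The one-track-one-identity assumption is the illustrative point; the published work is more sophisticated.

```python
def pseudo_labels_from_tracks(tracks):
    """Every trajectory becomes one pseudo identity; all crops inside it share that label."""
    samples = []
    for identity, crops in enumerate(tracks):
        for crop in crops:
            samples.append((crop, identity))   # (face crop, pseudo identity label)
    return samples
```

The resulting (crop, label) pairs can feed a face model without anyone labeling a single face by hand.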
Finally, let's talk about quality.
The ultimate goal of artificial intelligence is to bring convenience to life and improve its quality. In recent years, however, the development of artificial intelligence seems to have fallen into a misconception: that the quality of artificial intelligence is the same thing as accuracy. I think the quality of artificial intelligence is multifaceted and multi-level, not accuracy alone.
Let me show you a few examples. Image captioning, showing a computer a picture and letting it automatically generate a description, has been a particularly popular area in recent years. The picture below shows the output of the best existing model.
As you can see, we showed three different pictures to this best model, and it produced the same sentence for all three. That sentence scores very high on the standard benchmarks, with no apparent problem. But when we place it alongside human descriptions, we find that humans do not speak this way: even for the same picture, different people describe it differently. In other words, in the pursuit of recognition accuracy, artificial intelligence has neglected other qualities, including the naturalness of the language and the distinctive characteristics of each picture.
To address this, we proposed a new method last year. It no longer treats image description as a translation problem but as a probabilistic sampling problem: it acknowledges that descriptions are diverse and that everyone says something different when looking at the same picture, and it aims to learn that sampling process. For details please refer to the paper; here I only show the results. For the same three pictures, the model generates three sentences that are more vivid and better capture the characteristics of each picture. A simple illustration of sampling-based description follows below.
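The cited paper learns the sampling process itself; as a simpler stand-in, here is a sketch of drawing a description from a caption model's word distribution instead of always decoding the single highest-scoring sentence. The callable next_word_probs and the temperature scheme are illustrative assumptions, not the method from the paper.

```python
import random

def sample_caption(image, next_word_probs, max_len=20, temperature=1.0):
    """Sample one description word by word instead of always taking the single best word."""
    words = ["<bos>"]
    for _ in range(max_len):
        probs = next_word_probs(image, words)               # hypothetical: dict of word -> probability
        vocab, p = zip(*probs.items())
        weights = [pi ** (1.0 / temperature) for pi in p]   # temperature > 1 flattens, < 1 sharpens
        word = random.choices(vocab, weights=weights, k=1)[0]
        if word == "<eos>":
            break
        words.append(word)
    return " ".join(words[1:])
```

Calling sample_caption several times on the same image yields different sentences, which is exactly the diversity that greedy, single-best decoding suppresses.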
Let us extend this a little further: since an AI model can generate a sentence, can it also generate movement? The picture below shows one of our latest studies, on letting AI generate a vivid dance, something many AI companies are working on. The simple movements shown here are generated automatically by the computer rather than scripted in a program.
Finally, let me summarize. In the past few years artificial intelligence and deep learning have advanced rapidly, reflected not only in accuracy on standard datasets but also in deployment in business scenarios. Yet looking back at this period, we find that in the headlong rush toward accuracy we have forgotten many things. Are we efficient enough? Are we overdrawing on the cost of data labeling? Can the models we train meet the quality demands of real life? From these perspectives, I think we are only getting started. Although our laboratory and many others around the world have made important progress, this work is still in its infancy and there is a long way to go. That is what I wanted to share with you; thank you!
The following are the highlights of the question and answer session:
Question: How does Shangtang allocate resources between basic R&D and product deployment?
Lin Dahua: That is a very good question. I do not think it is a simple allocation problem but a positive cycle. Our colleagues on the front line encounter many concrete deployment scenarios and discover problems in them; many of the problems I mentioned earlier were found in deployment, and they offer academia a different perspective. Front-line colleagues, under the pressure of shipping products, cannot solve these problems themselves, so the problems are handed to the laboratory for long-term technical exploration, and the results eventually feed back into the products. This keeps Shangtang's technology leading and advanced: we do not merely compete with peer companies on data and computing resources, we also compete from a position of technological leadership. That is the interaction between our basic research department and the front-line product departments.
Question: Is technical cooperation between CV vendors and traditional security vendors a trend? Is the cooperation model "AI + security" or "security + AI"?
Lin Dahua: Traditional security vendors provide integrated solutions and cameras and, in the past, were not deeply involved in AI technology. Shangtang grew out of a laboratory, starting from academia and gradually moving toward deployment. Now CV vendors and traditional security vendors are both moving toward deploying technology, so we are converging. I therefore believe that deep cooperation between traditional security vendors and the companies and laboratories that hold advanced AI technology is an important trend.
But there are risks: one side moves up from the application end while the other moves down from the technology end, and everyone wants to occupy the commanding heights of technology. This requires establishing a mechanism of trust and mutual benefit; only then can the cooperation last.
Question: In an environment where deep learning is so popular, do traditional machine learning methods still have research value?
Lin Dahua: I am often asked this question at academic conferences and public events. I do not think we should regard deep learning as a method that sweeps everything before it; in a sense, it is a new research paradigm. When we finally face real scenarios and applications, we still have to assemble a complete solution to the problem. Deep learning has very strong modeling ability, but it also has shortcomings. For example, when we face a complex problem involving interactions among many components and the modeling of many variables, traditional probabilistic learning and stochastic processes can still play a role; combined with deep learning, they can deliver performance breakthroughs.
Before I returned to Hong Kong to teach, I spent a long time studying statistical learning and probabilistic graphical models. At the time, graphical models were in a slump: they have a solid theoretical foundation, but the simple assumptions they rely on could not meet the demands of real data. They are in fact a very good tool for modeling the world in depth. With deep learning, the two can be used together: the simple assumptions on some variables, such as Gaussian distributions, can be replaced with models built from deep networks. In this way traditional models are upgraded and iterated to provide more effective solutions for specific problems and applications. So the relationship is not substitution but combination. In recent years many studies have shown this trend, arming traditional ideas and methods with deep learning and achieving good results. A small illustration of this idea follows below.
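As a toy illustration of replacing a simple Gaussian assumption with a deep network, here is a sketch in which a small, randomly initialized MLP predicts the parameters of a node's conditional distribution p(x | z). The dimensions, the two-layer network, and the absence of training are all illustrative assumptions, not a specific published model.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 4)) * 0.1, np.zeros(16)   # toy two-layer MLP weights
W2, b2 = rng.normal(size=(2, 16)) * 0.1, np.zeros(2)

def conditional_params(z):
    """Predict (mean, std) of p(x | z) with a small network instead of fixing them a priori."""
    h = np.tanh(W1 @ z + b1)
    mean, log_std = W2 @ h + b2
    return float(mean), float(np.exp(log_std))

def sample_x(z):
    """Draw x from the network-parameterized conditional distribution."""
    mean, std = conditional_params(z)
    return float(rng.normal(mean, std))
```

The graph structure and the probabilistic semantics stay exactly as in the traditional model; only the conditional distribution at this node has been made as expressive as a deep network allows.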
Question: In recent years deep learning in the image field has hit some bottlenecks, with no breakthrough expected in the short term. How do you see this from an academic point of view?
Lin Dahua: I actually spoke about this throughout my talk. I think we should broaden what we pursue: the goal of machine learning is not only accuracy on data, and many other aspects are worth exploring. For example, Shangtang used to focus on the accuracy of face recognition, but later we found many other problems, including time cost, data labeling, reliability, model compression, and so on. None of these were covered by earlier studies, yet they have become very large and promising fields. Model compression, for instance, was not a requirement before, but in practical applications we found that the original methods could not solve the problem, so we asked whether the model could be compressed. These ideas, derived from practice, have opened up new research directions in recent years. In terms of accuracy alone, we have indeed reached a very high level and there is not much room for further progress; but in concrete applications there are many new challenges, each of which is a research direction, so there is still a great deal of room for research.