In Depth | Zhang Zhengyou: Three Generations of Computer Vision | CCF-GAIR 2019


Shulou(Shulou.com)06/02 Report--

Source: https://www.toutiao.com/i6713143613334749703/

This year marks 40 years of artificial intelligence in China, and many things have happened in these 40 years. Listen to Dr. Zhang Zhengyou talk about the past, present and possible future of computer vision.

Translated by Wang Siying

AI Science and Technology Review: the fourth Global Artificial Intelligence and Robotics Summit (CCF-GAIR 2019) was held in Shenzhen from July 12 to 14, 2019. The summit was hosted by the China Computer Federation (CCF), organized by Lei Feng.com and the Chinese University of Hong Kong (Shenzhen), co-organized by the Shenzhen Institute of Artificial Intelligence and Robotics, and guided by the Shenzhen government. It is a top exchange and exhibition event for China's artificial intelligence and robotics academia, industry and investment communities, and aims to build a powerful platform for cross-border exchange and cooperation in the field of artificial intelligence in China.

On July 12, Dr. Zhang Zhengyou, Director of Tencent AI Lab & Robotics X, ACM Fellow, IEEE Fellow, and General Chair of CVPR 2017, gave a report entitled "Three Generations of Computer Vision" at the main venue, "AI Frontier", of CCF-GAIR 2019. The following is the full text of Dr. Zhang Zhengyou's conference report, which he has revised and confirmed. Our thanks to Dr. Zhang.

Hello everyone! Thank you very much for Lei Feng's invitation, which gives me this opportunity to share with you. This year marks the 40th anniversary of artificial intelligence in China, and many things have happened in these 40 years. Lei Feng asked me to talk about the past, present and possible future of computer vision. In fact, this report should have been given by my good friend, Professor Quan Long of the Hong Kong University of Science and Technology. He went abroad a year earlier than I did and is still studying computer vision at HKUST, whereas I have spent many of these years on speech processing and recognition, multimedia processing and robotics, so my research history in computer vision is not that long. However, Professor Quan Long is unable to attend, so I can only fill in and tell you some stories about computer vision.

Lei Feng came to me presumably because they heard that I started studying computer vision relatively early. I graduated from Zhejiang University in 1985 and went to France in 1986, where I participated in the development of what may be the first mobile robot in the world to navigate with stereo vision.

Image Processing

In fact, a lot of things happened in 1986. That year I attended my first international conference, ICPR (the International Conference on Pattern Recognition), held in Paris. At this conference I met Professor Wu Lide of Fudan University, who led a Chinese delegation and gave a report on the state of pattern recognition research in China. They were bidding to hold the 1988 ICPR in China.

A key figure to mention here is Professor Fu Jingsun (King-Sun Fu) of Purdue University, a founding father of the field of pattern recognition. He was the general chair of the first ICPR in 1973, founded the IAPR in 1976, and founded IEEE TPAMI in 1978, serving as its first editor-in-chief. He originally supported holding ICPR in China in 1988, but unfortunately he passed away in 1985, and the 1988 bid was unsuccessful. If ICPR had been held in China in 1988, perhaps China's development in pattern recognition and computer vision would have started even earlier. Of course, history has no ifs. It took 30 more years: in 2018, under the leadership of Academician Tan Tieniu, ICPR was held in China for the first time.

Another important event in 1986 was the return to China of my senior from France, Ma Songde, who founded the NLPR (National Laboratory of Pattern Recognition). After its establishment, NLPR attracted a large number of scholars back from abroad and invited many foreign visiting scholars, and the field of Chinese computer vision began to align with international standards. Ma Songde became an important figure in Chinese science and technology circles and later served as Vice Minister of Science and Technology. In 1997 he also founded a Sino-French joint laboratory, where half of the researchers are French, which was also a feat in China.

When it comes to computer vision, you cannot avoid an iconic figure: MIT professor David Marr. In 1979, exactly 40 years ago, he proposed the theoretical framework of visual computing. Marr's framework has three levels: what to compute (the computational theory), how to represent and compute (representation and algorithm), and the hardware implementation.

Specific to 3D reconstruction, Marr believed that going from images to 3D involves several steps. The first step is called the primal sketch, which is essentially image processing, such as edge extraction. So up to the mid-1980s, the main work of computer vision was image processing. Perhaps the most famous work is the Canny edge detector, published in 1986 by a master's student at MIT, which basically solved the problem of edge extraction. In the figure below, the original image is on the left and the detected edges are on the right.
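The core of edge detectors like Canny's is gradient estimation followed by thresholding. A minimal NumPy sketch of that first stage (illustrative only: this is a plain Sobel gradient-magnitude map, not the full Canny pipeline, which adds Gaussian smoothing, non-maximum suppression and hysteresis):

```python
import numpy as np

def sobel_edges(img, thresh=1.0):
    """Gradient-magnitude edge map via 3x3 Sobel filters.

    A simplified stand-in for the first stage of Canny's detector.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            patch = img[i - 1:i + 2, j - 1:j + 2]
            gx[i, j] = np.sum(kx * patch)
            gy[i, j] = np.sum(ky * patch)
    mag = np.hypot(gx, gy)
    return mag > thresh

# A vertical step edge: left half dark, right half bright.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
edges = sobel_edges(img)
# The detector fires only around columns 3-4, where intensity jumps.
```

Running this on a real photograph would mark object boundaries in the same way, just with noise that the smoothing and hysteresis steps of the full Canny algorithm are designed to suppress.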

At that time, there was another famous piece of work by a Chinese scientist, Shen Jun, who was at the University of Bordeaux in France. He compared different operators, and his operator outperforms the Canny detector on some images. So by the mid-1980s, when I was studying in France, image processing was considered almost done.

Stereo Vision and 3D Reconstruction

Fortunately, geometric vision was just beginning to rise. There were two representative figures: Olivier Faugeras from France, my doctoral supervisor, and Thomas Huang from the United States, whom we call Tom. They were good friends and wrote papers together. I have known Tom since 1987, and he has been very helpful to me. He has trained more than 100 PhDs, including many computer vision experts active in Chinese academia and industry; his contribution to Chinese computer vision is enormous.

I was honored to learn from Olivier Faugeras and participate in the development of the world's first mobile robot to navigate with stereo vision. My first research result was published in 1988 at the second ICCV. On the right is a photo of that meeting in Florida in the United States. At that time computer vision was not popular; there were only about 200 participants at that ICCV, and even fewer Chinese, probably just me, Quan Long, and Weng Juyang, a student of Tom's. During my PhD I did a lot of work on 3D dynamic scene analysis, which was collected into a book in 1992.

Now I'd like to give a simple example of uncertainty modeling and computation. I hope this one slide helps you understand what 3D computer vision is.

Probability and statistics are needed here; this is very important, but people doing vision today often ignore it. The two lines below represent two image planes. A white dot in the left image corresponds to a white dot in the right image. Each image point corresponds to a ray in space, and the two rays intersect at a three-dimensional point: this is 3D reconstruction. Similarly, the black dot in the left image corresponds to the black dot in the right image, and those two rays intersect at another 3D point. But image points are detected with noise. We use an ellipse to represent the uncertainty, so that a point in the image corresponds not to a ray but to a cone. The intersection of the two cones represents the uncertainty of the reconstructed 3D point. As you can see, near points are more accurate than far points. These uncertainties must be taken into account when using the reconstructed 3D points. For example, when a robot moves from one place to another and needs to estimate its motion, the uncertainty of the data must be taken into account.
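The two-ray intersection described above can be sketched numerically. With noisy image points the rays rarely meet exactly, so a common choice is the midpoint of the shortest segment between them. A minimal NumPy sketch (the camera centers and target point are made up for the example):

```python
import numpy as np

def triangulate_midpoint(c1, d1, c2, d2):
    """Closest point between two rays (midpoint method).

    c1, c2: camera centers; d1, d2: ray directions.
    """
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    # Solve for ray parameters t1, t2 minimizing |c1+t1*d1 - (c2+t2*d2)|^2.
    A = np.array([[d1 @ d1, -d1 @ d2],
                  [d1 @ d2, -d2 @ d2]])
    b = np.array([(c2 - c1) @ d1, (c2 - c1) @ d2])
    t1, t2 = np.linalg.solve(A, b)
    p1 = c1 + t1 * d1
    p2 = c2 + t2 * d2
    return 0.5 * (p1 + p2)

# Two cameras 1 unit apart looking at a point at depth 5.
c1 = np.array([0.0, 0.0, 0.0])
c2 = np.array([1.0, 0.0, 0.0])
target = np.array([0.5, 0.0, 5.0])
X = triangulate_midpoint(c1, target - c1, c2, target - c2)
# X recovers the target point up to numerical error.
```

Perturbing the ray directions slightly and re-running shows the effect from the slide: the same image noise displaces a far point much more than a near one, which is exactly the uncertainty the cones represent.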

In the early 1990s, I proposed the ICP algorithm, which aligns curves or surfaces by iteratively matching closest points. This algorithm is used in many places. What we often hear about now, SLAM, is essentially what we used to do: structure from motion, 3D reconstruction, uncertainty estimation, ICP. In fact, the theory behind SLAM was worked out in the early 1990s.
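The idea of ICP can be sketched in a few lines: match each source point to its nearest destination point, fit the best rigid transform, apply it, and repeat. A minimal point-to-point variant in NumPy (the toy 2D grid and the rotation/translation are made up for illustration; the original algorithm also handles curves, surfaces and outlier rejection):

```python
import numpy as np

def best_rigid_transform(P, Q):
    """Least-squares rotation R and translation t mapping P onto Q (SVD/Kabsch)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cq - R @ cp

def icp(src, dst, iters=10):
    """Point-to-point ICP: nearest-neighbour matching + rigid fit, iterated."""
    cur = src.copy()
    for _ in range(iters):
        # Brute-force nearest-neighbour correspondences.
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        matched = dst[d2.argmin(axis=1)]
        R, t = best_rigid_transform(cur, matched)
        cur = cur @ R.T + t
    return cur

# Toy example: recover a small rotation + translation of a 2D grid.
xs, ys = np.meshgrid(np.arange(5) * 0.2, np.arange(6) * 0.2)
dst = np.stack([xs.ravel(), ys.ravel()], axis=1)
a = 0.05                                   # 0.05 rad rotation
R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
src = dst @ R.T + np.array([0.02, 0.01])
aligned = icp(src, dst)
# aligned converges onto dst because the displacement is small.
```

The same loop, with a k-d tree for the nearest-neighbour step and robust weighting of matches, is what modern scan-registration and SLAM front ends still run.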

In 1995, I proposed a robust method for image matching and epipolar geometry estimation and put the program on the Internet as a reference. This may have been the first, or at least among the first, computer vision programs put online for others to test with real images. The algorithm became a standard method in computer vision at that time.

In 1998, I proposed a new camera calibration method, later called "Zhang's method". It is now widely used in 3D vision, robotics and autonomous driving all over the world, and it won the IEEE Helmholtz Test-of-Time Award.
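The first step of Zhang's method is to estimate a plane-to-image homography for each view of a planar pattern (such as a checkerboard); the intrinsic parameters are then extracted from several such homographies. A minimal sketch of that homography step via the Direct Linear Transform, verified on synthetic data (the points and the matrix H_true are made up for the example; the intrinsic-extraction step is not shown):

```python
import numpy as np

def homography_dlt(src, dst):
    """DLT estimate of the 3x3 homography H with dst ~ H @ src.

    src, dst: Nx2 point arrays, N >= 4, in general position.
    """
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(np.array(A))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

# Synthetic planar points and a known homography to verify the estimate.
H_true = np.array([[1.2, 0.1, 5.0],
                   [-0.05, 0.9, 2.0],
                   [0.001, 0.002, 1.0]])
src = np.array([[0, 0], [1, 0], [1, 1], [0, 1], [0.5, 0.3]], float)
proj = np.hstack([src, np.ones((len(src), 1))]) @ H_true.T
dst = proj[:, :2] / proj[:, 2:3]
H = homography_dlt(src, dst)
# H recovers H_true up to numerical error.
```

In practice the checkerboard corners play the role of `src`, their detected image positions play the role of `dst`, and the estimate is refined by nonlinear minimization of reprojection error.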

Also in 1998, Ma Songde and I summarized the increasingly mature field of geometric vision in a graduate textbook published by Science Press.

A lot happened in 1998: one event was the founding of MSRA (Microsoft Research Asia), another was the founding of Tencent. These two seemingly unrelated institutions have played an inestimable role in the development of computer vision and artificial intelligence in China. MSRA brought internationally advanced research methods and ideas to China, trained a large number of excellent Chinese scholars, and invited foreign researchers to China. Tencent promoted the development of the Internet in China; because of the Internet, Chinese researchers gained almost real-time access to the world's top research results. The combination of the two has therefore played an important role in the development of artificial intelligence in China.

An important landmark event for computer vision in China was ICCV 2005 in Beijing, with Ma Songde and Harry Shum as general chairs, which marked international recognition of China's computer vision research. At that conference I was also honored to receive my IEEE Fellow certificate from my senior, Tom Huang.

The rise of deep learning

Perhaps because the theory of geometric vision had matured, computer vision research in the late 1990s began to move into detection and recognition of objects and scenes, mainly using traditional hand-crafted features plus machine learning.

I had been doing geometric vision for a long time by then, and in 1997 I also began trying to develop the world's first system to recognize facial expressions with neural networks, using Gabor wavelets. Although facial expression recognition started more than 20 years ago, there was too little data at the time, and it was not until 2016 that we commercialized facial expression recognition at Microsoft, where everyone can use it through Microsoft's Cognitive Services.

In the era of traditional features plus machine learning, one milestone must be mentioned: the Viola-Jones detector of 2001. Using Haar features plus a cascade of classifiers, it performs face detection very quickly, running in real time on the machines of 20 years ago. This had a great impact on computer vision. The cycle ever since has been wave after wave of new datasets, followed by waves of new algorithms.
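The speed of the Viola-Jones detector comes largely from the integral image, which turns any rectangular pixel sum into a constant-time lookup. A minimal NumPy sketch of the integral image and one two-rectangle Haar-like feature (illustrative only; the full detector learns thousands of such features and chains them into a cascade):

```python
import numpy as np

def integral_image(img):
    """Summed-area table: looking up ii lets any rectangle sum cost O(1)."""
    return np.cumsum(np.cumsum(img, axis=0), axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] via four integral-image lookups."""
    def at(r, c):
        return ii[r - 1, c - 1] if r > 0 and c > 0 else 0.0
    return at(r1, c1) - at(r0, c1) - at(r1, c0) + at(r0, c0)

def haar_two_rect_vertical(ii, r0, c0, h, w):
    """Two-rectangle Haar-like feature: left half minus right half.
    Responds strongly to vertical intensity edges."""
    half = w // 2
    left = rect_sum(ii, r0, c0, r0 + h, c0 + half)
    right = rect_sum(ii, r0, c0 + half, r0 + h, c0 + 2 * half)
    return left - right

# Dark-left / bright-right image: the feature fires on the boundary.
img = np.zeros((6, 8))
img[:, 4:] = 1.0
ii = integral_image(img)
f = haar_two_rect_vertical(ii, 0, 2, 6, 4)   # window straddling the edge
# left half (cols 2:4) sums to 0, right half (cols 4:6) to 12 -> f = -12
```

Because every feature costs only a handful of lookups regardless of window size, the detector can slide windows of many scales over an image in real time; the cascade then rejects most windows after evaluating only a few features.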

In 2009, a dataset called ImageNet appeared, launched by Li Feifei's team at Stanford University. This dataset is very important, not because it is very large, but because a few years later it spawned the era of deep learning.

In 2012, two of Geoffrey Hinton's students developed AlexNet, an 8-layer neural network with 60 million parameters, which reduced the error by more than ten percentage points compared with traditional methods, from 26% to 15%, thus opening the era of deep learning in computer vision. In fact, the AlexNet architecture is not very different from the neural network Yann LeCun used for handwritten digit recognition in 1989; it is simply deeper and larger.

Geoffrey Hinton, Yoshua Bengio and Yann LeCun jointly won the 2018 Turing Award for their contributions to deep learning. They deserve this award. Remember, Geoffrey Hinton put forward backpropagation in 1986 and then toiled in obscurity for 25 years.

There is another milestone in the era of deep learning: in 2015, He Kaiming and Sun Jian of Microsoft Research Asia proposed ResNet, a 152-layer neural network whose error on the ImageNet test set dropped below human level, to under 4%.

I have also made some contributions in the field of deep learning. In 2014, I teamed up with Tu Zhuowen of UCSD to propose DSN (Deeply-Supervised Nets), which is not as influential as ResNet but has been cited nearly a thousand times. Our idea was to let the output directly supervise the intermediate layers, so that even the lowest layers approximate the target function as closely as possible, while also alleviating gradient explosion or vanishing.
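The deep-supervision idea can be shown in miniature: attach an auxiliary head to a hidden layer and give it the same target as the final output, so the hidden layer receives gradient along a short path as well as the long one. A toy NumPy sketch with a two-layer linear network on a regression task (dimensions, data, learning rate and the linear setting are all made up for illustration; the actual DSN attaches classifiers to the hidden layers of a convnet):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))              # inputs
w_true = rng.normal(size=4)
y = X @ w_true                             # regression targets

W1 = rng.normal(size=(3, 4)) * 0.1         # hidden layer
w2 = rng.normal(size=3) * 0.1              # main output head
w_aux = rng.normal(size=3) * 0.1           # companion (auxiliary) head
lam, lr, n = 0.3, 0.05, len(X)

for _ in range(2000):
    H = X @ W1.T                           # hidden activations (linear)
    err_main = H @ w2 - y                  # final-output error
    err_aux = H @ w_aux - y                # companion error on the hidden layer
    # Gradients of 0.5*mean(err_main^2) + 0.5*lam*mean(err_aux^2):
    g_w2 = H.T @ err_main / n
    g_aux = lam * H.T @ err_aux / n
    # The hidden layer receives gradient from BOTH heads -- this extra,
    # shorter path is what deep supervision adds.
    g_H = np.outer(err_main, w2) + lam * np.outer(err_aux, w_aux)
    g_W1 = g_H.T @ X / n
    W1 -= lr * g_W1
    w2 -= lr * g_w2
    w_aux -= lr * g_aux

final_mse = np.mean((X @ W1.T @ w2 - y) ** 2)
# final_mse ends up below the no-model baseline np.mean(y ** 2).
```

In a deep network the effect is stronger: layers far from the output, whose gradient signal would otherwise be attenuated by many backpropagation steps, get a direct error signal of their own.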

The CVPR 2019 that just ended can be called a grand occasion for Chinese researchers. Among the organizers there are many Chinese faces, including Zhu Songchun among the general chairs and Xi Huagang and Tu Zhuowen among the program chairs. Of the more than 5,000 submissions, 40% came from mainland China, and the first authors of both the best paper award and the best student paper award were Chinese. So China's strength in computer vision is considerable, which is something to be proud of.

Computer vision research should return to its original aspiration

Now let's review the evolution of computer vision research: from the initial image processing, to stereo vision and 3D reconstruction, to object detection and recognition; from photometric vision, geometric vision and semantic vision to deep learning everywhere. This worries me, because deep learning has many limitations.

I think the next step is to return to the original aspiration: closely combine photometric vision, geometric vision and semantic vision, inject common sense and domain knowledge, fuse multiple modalities with language, and keep evolving through learning.

At Tencent AI Lab we have also started to do a little work in this area. For example, our image-captioning project can describe the content of a photo in words. In January 2018 we launched it in the Qzone app to let visually impaired users "see" pictures.

We have also integrated computer vision, speech recognition and natural language processing technologies to develop a virtual human product, exploring multimodal human-computer interaction to empower other scenarios and assist social interaction. We have also developed a 2D virtual human for game commentary, which understands the game scene in real time and describes it.

So is today's artificial intelligence really intelligent? Imagine what you would do if someone tried to cover your eyes. I would dodge. But from the video I just played, it is obvious that the current surveillance system does not behave this way. Today's artificial intelligence is just machine learning: learning a mapping from a large amount of labeled data.

What is true intelligence? I don't think there is a final conclusion yet, and we don't understand our own intelligence well enough. But I agree with the Swiss cognitive scientist Jean Piaget: intelligence is what you use when you don't know what to do. I think this definition is very reasonable: when what you have learned, and your talent, are not enough, what you fall back on is intelligence. How do we implement an intelligent system? There may be many ways, but I think a very important one is to take the carrier into account, to build intelligence with a body, that is, a robot.

In the field of robotics, I put forward the A2G theory. A is AI: the robot must be able to see, hear, speak and think. B is Body. C is Control. A, B and C constitute the basic abilities of a robot. D is Developmental Learning. E is EQ: emotional understanding and personification. F is Flexible Manipulation. Finally, G, the goal to achieve, is Guardian Angel.

Tencent has made three robots: a Go-playing robot, a table hockey robot, and a robot dog. Let me show you the video of the robot dog: it has a perception system, can go around obstacles, crawls forward when it sees an overhanging obstacle, and squats down to look up at a person standing in front of it.

This is the end of my report. Tencent's AI mission is "Make AI Everywhere". We will certainly make good use of artificial intelligence and let it benefit mankind, because of Tech for Good. Thank you.
