
Declaring war on existing algorithms! MIT and IBM jointly release a new dataset

2025-03-29 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

For image classification in artificial intelligence, the most commonly used dataset for training and testing is ImageNet, the world's largest "practice question bank" for computer vision. Recently, a team of MIT and IBM researchers created a very different image recognition dataset, ObjectNet, which has stumped the world's best computer vision models.

Note that "best" or "strongest" here does not refer to one particular model, but to a whole class of high-performing vision models.

Computer vision models that reach 97% accuracy on the ImageNet test drop to 50%-55% accuracy on the ObjectNet dataset. The main reason the results are so dismal is that almost all vision models lack robustness in complex situations such as object rotation, background changes, and viewpoint shifts.

Andrei Barbu, a research scientist at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Center for Brains, Minds and Machines (CBMM), is the study's corresponding author and one of the project leads. "We need a dataset that typically represents what you see in real life," he said in an exclusive interview with DeepTech. "Without it, who would have the confidence to do computer vision? How can we say that computer vision is ready for its golden age and for safety-critical applications?"

Andrei Barbu also said that ObjectNet can be shared with researchers around the world: "Just contact us and we will send it to you." (website: https://objectnet.dev/)

Figure | ImageNet (source: ImageNet)

Artificial intelligence uses neural networks, composed of layers of neurons, to find patterns in large amounts of raw data. For example, after seeing hundreds of pictures of chairs, a network learns what a chair looks like.
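The idea of learning a pattern from labeled examples can be shown with a single artificial neuron (a toy sketch; the features and training setup are invented for illustration, not taken from any model in the article). The neuron sees binary feature vectors and learns that the label fires only when two particular features co-occur:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images": 4 binary features per example. The label is 1 only when
# the first two features co-occur (a stand-in for "has a seat and legs").
X = rng.integers(0, 2, (200, 4)).astype(float)
y = (X[:, 0] * X[:, 1]).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One neuron: weighted sum + sigmoid, trained by gradient descent
# on the mean cross-entropy loss.
w, b = np.zeros(4), 0.0
for _ in range(5000):
    p = sigmoid(X @ w + b)      # predicted probability of label 1
    grad = (p - y) / len(y)     # gradient of the loss w.r.t. the logits
    w -= 0.5 * (X.T @ grad)
    b -= 0.5 * grad.sum()

acc = ((sigmoid(X @ w + b) > 0.5) == (y == 1)).mean()
print(acc)
```

After training, the learned weights put large positive mass on the two co-occurring features, so the neuron classifies nearly all examples correctly.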

Stanford University holds a competition every year, inviting IT companies such as Google, Microsoft, and Baidu to test their systems on ImageNet. The annual contest draws keen attention from the big companies.

ImageNet was founded by Li Feifei, one of the world's leading computer vision experts, who said in a speech that for machines to understand the story behind a photo, they need to see enough "training images," just as babies do.

The ImageNet project was born in 2009. It downloaded nearly 1 billion images from Flickr and other social media sites, and its curated database contains nearly 15 million photos covering about 22,000 categories.

Computer vision models have learned to identify objects in photos so accurately that some now outperform humans on certain datasets.

Figure | Li Feifei, one of the creators of ImageNet (source: Wikipedia)

However, when these models are deployed in real life, their performance deteriorates significantly, creating safety risks for self-driving cars and other critical systems that rely on computer vision.

Even hundreds of photos cannot fully capture the orientations and positions an object takes in real life: a chair can be tipped over on the floor, a T-shirt may hang from a branch, clouds can be reflected in a car's bodywork. In such cases, the models become confused.

Dileep George, co-founder of the AI company Vicarious, once said: "It shows that we spent a lot of resources overfitting to ImageNet." Overfitting means matching a particular dataset so closely that the model cannot fit other data or predict future observations.
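Overfitting is easy to reproduce in a few lines. The sketch below (a toy illustration, not from the ObjectNet paper) fits a degree-9 polynomial to 10 noisy samples of a sine curve: the fit passes almost exactly through the training points, yet its error on fresh points from the same curve is far larger.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 noisy training samples of y = sin(x)
x_train = np.linspace(0, 3, 10)
y_train = np.sin(x_train) + rng.normal(0, 0.1, 10)

# A degree-9 polynomial has enough parameters to pass through all 10 points,
# noise included -- it memorizes the training set.
coeffs = np.polyfit(x_train, y_train, 9)

# Near-zero error on the training points, much larger error on unseen points.
x_test = np.linspace(0, 3, 100)
train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coeffs, x_test) - np.sin(x_test)) ** 2)
print(train_mse, test_mse)
```

The training error measures how well the model memorized; the test error measures how well it generalized, which is the gap ObjectNet is designed to expose.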

Unlike ImageNet's randomly collected photos, the photos in ObjectNet are taken against specified backgrounds and from specified angles: the researchers hired freelancers to photograph hundreds of household objects placed at random, telling them which angle to shoot from and whether to put the object in the kitchen, the bathroom, or the living room.

As a result, the objects in the dataset appear at very unusual angles: chairs overturned on beds, teapots upside down in bathrooms, T-shirts draped over chair backs in living rooms.

Figure | ImageNet (left column) usually shows objects against typical backgrounds, with few rotations and few unusual viewing angles. A typical ObjectNet object is shown against different backgrounds from multiple viewpoints. The first three columns show three properties varied for a chair: rotation, background, and viewing angle. These operations introduce a large amount of variation into the dataset. Because of inconsistent aspect ratios, the ObjectNet images are slightly cropped here. Most detectors fail to recognize most of the images in ObjectNet (source: paper)

"We created this data set to show people that object recognition is still a problem," said Boris Katz, a research scientist at CSAIL and CBMM at the Massachusetts Institute of Technology. "We need better and smarter algorithms."

Katz and his colleagues will present their results at the ongoing NeurIPS conference, the top international conference in the field of artificial intelligence and machine learning.

Figure | The ObjectNet research team. The study was funded by the National Science Foundation, the MIT Center for Brains, Minds and Machines, the MIT-IBM Watson AI Lab, the Toyota Research Institute, and the SystemsThatLearn@CSAIL initiative (source: ObjectNet)

In addition, there is an important difference between ObjectNet and traditional image datasets: it contains no training images. In other words, there is little chance that the practice problems overlap with the exam questions, making it very difficult for a machine to "cheat." Most datasets are split into a training set and a test set, but the training set usually resembles the test set, effectively giving the model a head start on the exam.

At first glance, ImageNet, with 15 million pictures, seems huge. But once the training set is removed, what remains is about the same size as ObjectNet: roughly 50,000 photos.

"If we want to know how algorithms perform in the real world, we should test them on unbiased images that they have never seen before," Andrei Barbu said.

Photo | Amazon Mechanical Turk (MTurk) is a crowdsourced online marketplace that lets computer programmers use human intelligence to perform tasks that computers cannot yet handle. Both ImageNet and ObjectNet labeled their images through this platform (source: Amazon Mechanical Turk)

The researchers say the results show that machines still struggle to understand that objects are three-dimensional and can rotate and move into new environments. "These concepts have not been built into the architecture of modern object detectors," said Dan Gutfreund, a co-author of the study and a researcher at IBM.

The models' results on ObjectNet are so poor not because there is too little data, but because the models lack robustness to rotation, background changes, viewpoint shifts, and the like. How did the researchers reach this conclusion? They trained the models on half of ObjectNet's data and tested them on the other half. Training and testing on the same dataset usually improves performance, but this time the models improved only slightly, indicating that they do not truly understand how objects exist in the real world.
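The kind of accuracy collapse the paper describes can be illustrated with a toy classifier (a hypothetical sketch, not one of the models tested in the study). A nearest-centroid classifier trained on two point clouds keeps its accuracy on fresh points from the same distribution, but falls to chance when the test points are rotated, a 2-D analogue of photographing an object from a new viewpoint:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    # Two classes: Gaussian blobs centred at (-2, 0) and (+2, 0).
    a = rng.normal([-2.0, 0.0], 0.5, (n, 2))
    b = rng.normal([+2.0, 0.0], 0.5, (n, 2))
    return np.vstack([a, b]), np.array([0] * n + [1] * n)

# "Training": store one centroid per class.
X_tr, y_tr = sample(200)
centroids = np.array([X_tr[y_tr == c].mean(axis=0) for c in (0, 1)])

def accuracy(X, y):
    # Assign each point to the nearest class centroid.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((d.argmin(axis=1) == y).mean())

acc_iid = accuracy(*sample(200))          # same distribution as training

# "New viewpoint": rotate the test points 90 degrees about the origin.
R = np.array([[0.0, -1.0], [1.0, 0.0]])   # 90-degree rotation matrix
X_te, y_te = sample(200)
acc_rot = accuracy(X_te @ R.T, y_te)
print(acc_iid, acc_rot)
```

On unrotated data the classifier is near-perfect; after rotation the class clusters are equidistant from both stored centroids, so accuracy drops to roughly a coin flip. Nothing about the classifier encodes that a rotated cluster is "the same object," which mirrors the robustness failure ObjectNet exposes.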

The researchers therefore believe that even a larger version of ObjectNet, with more viewpoints and orientations, would not necessarily teach artificial intelligence to understand objects. The goal of ObjectNet is to inspire researchers to come up with the next wave of revolutionary techniques, just as the original ImageNet challenge did. Next, they will continue to explore why humans show such good generalization and robustness in image recognition tasks, and they hope the dataset will become a standard way to evaluate the generalization ability of image recognition models.

"People put a lot of data into these object detectors, but the returns are diminishing," Katz said. "You can't photograph every angle and every possible environment of an object. We hope this new dataset will lead to robust computer vision systems that do not fail unexpectedly in the real world."

Photo | Andrei Barbu is a research scientist at the Massachusetts Institute of Technology, focusing on language, vision and robotics, as well as neuroscience. (source: MIT)

DeepTech conducted an exclusive interview with study co-author Andrei Barbu, a research scientist at CSAIL and CBMM (the following transcript has been lightly edited without changing its original meaning):

DeepTech: When did this idea come about, and what is its purpose? Can it be downloaded and used now?

Andrei Barbu: ObjectNet was proposed about four years ago. Even though models exceed 95% accuracy on many datasets such as ImageNet, their real-world performance may be much worse than you expect.

Our idea is to bring sound experimental design from other disciplines, such as physics and psychology, directly into machine learning. We need a dataset that typically represents what you see in real life. Without it, who would have the confidence to do computer vision? How can we say that computer vision is ready for its golden age and for safety-critical applications?

ObjectNet is ready to use, just contact us and we will send it to you.

DeepTech: How long did it take to collect the data? How valid is the data?

Andrei Barbu: It took us about three years to figure out what to do, and about a year to collect the data. We could now collect another version much more quickly, in a matter of months.

We collected about 100,000 pictures on Mechanical Turk and kept about half of them. Many of the photos were taken outside the United States, so some objects may look unusual: ripe oranges can be green, bananas come in different sizes, and clothes have different shapes and textures.

DeepTech: How much did it cost? What problems did you encounter while collecting the data?

Andrei Barbu: In academia, costs are complicated. Labor costs more than Mechanical Turk, and the Mechanical Turk costs alone were considerable.

There were many problems in collecting these data. The pipeline is complex because it has to run on different phones; the instructions are complex, and it took us a while to learn how to phrase the task so that workers interpreted it consistently; and data validation is complex too, with a nearly endless supply of small problems. We needed a lot of experiments to learn how to do this effectively.

DeepTech: What are the differences and connections between ObjectNet and ImageNet?

Andrei Barbu: The differences from ImageNet are: 1. The way we collect images controls for bias. We tell people how to rotate the object, what background to place it against, and from which angle to photograph it. In most datasets, the background information in an image lets the machine unconsciously "cheat": it uses its knowledge of kitchen backgrounds to predict that something is probably a pan.

2. The photos are not collected from social media, so they are not the flattering photos people want to share. We also made sure to collect images from India, the United States, and different socio-economic classes, as well as images of damaged or broken objects.

3. There is no training set.

This was not a big problem 10 years ago, but our methods have become so powerful at discovering patterns that no human can spot them, so we need these changes to keep models from simply adapting to the biases shared between a training set and a test set drawn from the same data.

DeepTech: What is the impact of not having a training set?

Andrei Barbu: Since there is no training set, all methods must generalize: they have to be trained on some other dataset and tested on ObjectNet. This makes them much less likely to exploit biases and much more likely to become robust object detectors. We want to convince everyone, at least in the machine learning field, that the group that collects the training set should be separate from the group that collects the test set.

Now that we have become a data-driven research field, we need to change how data is collected in order to advance the science.

DeepTech: 3D objects are so complex that I imagine they are hard to represent. For example, how do you represent a rotating chair?

Andrei Barbu: I don't think 3D is very complicated.

Obviously you and I have a certain understanding of the three-dimensional shape of objects, because we can imagine objects from a new point of view.

I think this is also the future of computer vision, though ObjectNet's design remains agnostic about it: it doesn't care what benchmark you built your model on. What really matters is that it gives you a more reliable tool for testing whether your model is robust enough.

DeepTech: What is your next research plan?

Andrei Barbu: We are using ObjectNet to understand human vision. There is not much research on large-scale object recognition in humans, and many gaps remain to be filled. We will show ObjectNet to thousands of people in brief sessions on Mechanical Turk to probe the various stages of how humans process images.

It will also help answer some basic questions about the relationship between human vision and object detectors, such as whether object detectors behave like humans who catch only a quick glimpse of an object. Our preliminary results show that they do not, and that these differences can be used to build better detectors.

We are also working on the next version of ObjectNet, which I think will be even harder for detectors: ObjectNet with partial occlusion, where objects are partially covered by other objects. We and many other research groups have reason to suspect that detectors are not robust at recognizing occluded objects, but a serious benchmark is needed to stimulate the next wave of progress.

https://zhuanlan.zhihu.com/p/97000888
