2019-11-25 12:54:06
This paper argues that visually relevant object relationships carry higher value for semantic understanding. When learning representations of visual relationships, we should focus on visual relevance and avoid learning information that has nothing to do with vision. Because existing datasets contain a large amount of non-visual prior information, models easily learn simple positional relationships or single fixed relationships, but fail to acquire the ability to infer and learn deeper semantic information. As a result, representations learned from existing relationship data do not significantly improve performance on semantics-related tasks. This paper clarifies what is worth learning and what needs to be learned in visual relationship learning, and verifies through experiments that the proposed visually relevant relationship data effectively improves the semantic understanding ability of learned features.
Data and project website:
Paper:
Introduction:
In computer vision research, perception tasks (such as classification, detection, and segmentation) aim to accurately represent individual objects, while cognition tasks (such as image captioning and visual question answering) aim to deeply understand the semantics of the whole scene. Moving from single objects to the whole scene, a visual relationship describes the interaction between two objects and links multiple objects together to form the scene. Relationship data can therefore serve as a bridge between object-level perception tasks and semantic cognition tasks, which gives it high research value.
Given this role in semantics, object relationship data should effectively advance the ability of computer vision methods to understand scene semantics, building a hierarchy of visual understanding from single-object perception, to relational semantic understanding, to whole-scene cognition: from micro to macro, from local to global.
However, existing relationship data contain a large amount of prior bias, so features learned from them cannot be used effectively for semantic understanding. Positional relationships such as "on" and "at" degrade relationship reasoning into object detection, while single fixed relationships such as "wear" and "has" degrade it into trivial deduction, because the subject-object combinations for these predicates are essentially fixed in the data. The prevalence of such relationships biases relationship feature learning toward single-object perception rather than real understanding of scene semantics, preventing relationship data from playing its intended role. Moreover, this prior bias cannot be screened out by conventional frequency- or rule-based methods, so these data-side problems hinder research on relational semantic understanding and push the study of visual object relationships further away from the goal of semantic understanding.
This paper proposes a visual relevance hypothesis and a visual relevance discrimination network to construct a dataset with higher semantic value. We argue that many relationship labels can be inferred without understanding the image, purely from single-object perception labels (such as bounding boxes and classes); such non-visually-relevant relationships should be avoided in relationship learning. Learning visually relevant relationships instead forces the network to obtain relational semantic information from the visual content of the image, rather than fitting prior-biased labels on top of single-object perception.
In our method, we design a visual relevance discrimination network that learns to identify non-visually-relevant relationships, i.e. those that can be inferred from label information alone, so that only visual relationships with high semantic value remain in the data. In addition, we design a joint training scheme that incorporates relationships and effectively learns the information in relationship labels. Our experiments verify the idea from two directions. In relationship representation learning, on the scene graph generation task, our visually relevant data clearly widens the performance gap between learning-based and non-learning methods, showing that non-visual relationships are prior bias in the data and can be inferred by simple methods. On the other hand, features learned on visually relevant relationships show better semantic expression and understanding ability: they also perform better on visual question answering and image captioning, which proves that visual relevance genuinely needs to be learned and is more conducive to improving semantic understanding.
Methods:
1. Visual relevance discrimination network (VD-Net)
The proposed VD-Net determines whether an object relationship is visually relevant. The network uses only the objects' position information (bounding boxes) and category information (classes), encoding these two kinds of information without looking at the image. The specific inputs are as follows:
Location encoding:
It encodes each object's center point, width and height, relative position, and size information, as sketched below.
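The original encoding formula is not reproduced here; the following is a minimal Python sketch of one plausible box encoding. The feature layout (normalized center, width/height, area ratio) and the union-box construction are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def location_encoding(box, img_w, img_h):
    """Encode one bounding box (x1, y1, x2, y2) as a location feature.

    A hypothetical 5-d encoding capturing the position and size cues
    mentioned in the text: normalized center, width/height, and area.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2.0, y1 + h / 2.0
    return np.array([
        cx / img_w, cy / img_h,    # normalized center point
        w / img_w,  h / img_h,     # normalized width and height
        (w * h) / (img_w * img_h)  # relative size (area ratio)
    ], dtype=np.float32)

def pair_location_encodings(subj_box, obj_box, img_w, img_h):
    """Return the subject, object, and union-box encodings; the union
    box exposes the relative position of the pair to the network."""
    x1 = min(subj_box[0], obj_box[0]); y1 = min(subj_box[1], obj_box[1])
    x2 = max(subj_box[2], obj_box[2]); y2 = max(subj_box[3], obj_box[3])
    union = (x1, y1, x2, y2)
    return [location_encoding(b, img_w, img_h)
            for b in (subj_box, obj_box, union)]
```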
For category information, we use the GloVe word vector of the category label as input.
The network settings are as follows:
To avoid overfitting, the network is designed to be as small as possible. It consists of four fully connected layers, whose inputs are the location encodings of the subject, the object, and their joint region, together with the category word vectors of the subject and the object (a sketch follows).
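As a concrete illustration, here is a minimal PyTorch sketch of such a label-only discriminator, assuming the 5-dimensional location encoding from the snippet above and 300-dimensional GloVe vectors. The hidden width and exact wiring are assumptions; only the overall shape (four fully connected layers over location codes plus category word vectors, no image input) follows the description.

```python
import torch
import torch.nn as nn

class VDNet(nn.Module):
    """Sketch of a visual relevance discrimination network: it predicts
    the predicate from box encodings and GloVe label vectors only (no
    image), so predicates it predicts well are non-visually-relevant.
    Layer widths and dimensions are illustrative, not the paper's values."""

    def __init__(self, loc_dim=5, glove_dim=300, n_predicates=100, hidden=256):
        super().__init__()
        # Three parallel projections for subject / object / union location codes.
        self.fc_loc_s = nn.Linear(loc_dim, hidden)
        self.fc_loc_o = nn.Linear(loc_dim, hidden)
        self.fc_loc_u = nn.Linear(loc_dim, hidden)
        # Fourth layer fuses locations with the two category word vectors.
        self.fc_out = nn.Linear(3 * hidden + 2 * glove_dim, n_predicates)
        self.relu = nn.ReLU()

    def forward(self, loc_s, loc_o, loc_u, emb_s, emb_o):
        h = torch.cat([
            self.relu(self.fc_loc_s(loc_s)),
            self.relu(self.fc_loc_o(loc_o)),
            self.relu(self.fc_loc_u(loc_u)),
            emb_s, emb_o,              # GloVe vectors of the category labels
        ], dim=-1)
        return self.fc_out(h)          # predicate logits
```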
Studying VD-Net reveals that many relationship labels in existing datasets can be predicted with high accuracy from label information alone: for 37% of the relationship labels in VG150, VD-Net reaches at least 50% accuracy.
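One way such per-predicate accuracies could be used to construct visually relevant data is to drop predicates that VD-Net predicts too well without the image. The following is a hypothetical filtering rule using the 50% figure above as a threshold; the paper's actual selection criterion may differ.

```python
def keep_visually_relevant(triplets, vdnet_acc, threshold=0.5):
    """Keep only triplets whose predicate VD-Net cannot predict well
    from labels alone; what remains is treated as visually relevant.

    triplets:  list of (subject, predicate, object) label triplets
    vdnet_acc: dict mapping predicate -> held-out VD-Net accuracy
    """
    return [(s, p, o) for (s, p, o) in triplets
            if vdnet_acc.get(p, 0.0) < threshold]
```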
2. Joint feature learning with relationship information:
The proposed method is as follows:
We use Faster-RCNN for feature extraction, taking features from the RPN part. The network jointly learns location, category, attribute, and relationship information, and the object-relationship information further expands the semantic representation ability of the features (see the sketch below).
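A minimal PyTorch sketch of what such a joint head could look like on top of Faster-RCNN region features follows. The head names, dimensions, and the pair-feature construction (concatenating subject and object features) are assumptions; the point is that one shared region feature is supervised by location, category, attribute, and relationship objectives at once.

```python
import torch
import torch.nn as nn

class JointRelationHead(nn.Module):
    """Sketch of joint feature learning: one region feature feeds four
    heads (box refinement, object class, attributes, and the predicate
    between region pairs). All sizes are illustrative assumptions."""

    def __init__(self, feat_dim=1024, n_classes=151, n_attrs=200, n_preds=100):
        super().__init__()
        self.box_head = nn.Linear(feat_dim, 4)            # location refinement
        self.cls_head = nn.Linear(feat_dim, n_classes)    # object category
        self.attr_head = nn.Linear(feat_dim, n_attrs)     # attributes
        self.rel_head = nn.Linear(2 * feat_dim, n_preds)  # predicate of a pair

    def forward(self, roi_feats, pair_idx):
        # roi_feats: (N, feat_dim) region features from a Faster-RCNN backbone
        # pair_idx:  (P, 2) long tensor of (subject, object) region indices
        pair_feats = torch.cat([roi_feats[pair_idx[:, 0]],
                                roi_feats[pair_idx[:, 1]]], dim=-1)
        # Training would sum the four task losses over the shared features.
        return (self.box_head(roi_feats), self.cls_head(roi_feats),
                self.attr_head(roi_feats), self.rel_head(pair_feats))
```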
Experiment:
1. Scene graph generation experiment:
Frequency-Baseline is a non-learning method based on dataset statistics (a minimal version of such a baseline is sketched after this paragraph). In our experiments, VrR-VG clearly widens the performance gap between non-learning methods and learnable methods, which exposes the real capability of each method on scene graph generation. The experiments also show that non-visual relationships are comparatively easy: when non-visual relationships dominate, the gap between what a network learns and what the statistics-based non-learning method infers directly is limited.
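For reference, a frequency baseline of this kind can be built in a few lines: for each (subject class, object class) pair it memorizes the most frequent predicate in the training annotations and always predicts it. This sketch is our own illustration of the statistics-based idea, not the paper's exact implementation.

```python
from collections import Counter, defaultdict

def build_frequency_baseline(train_triplets):
    """Map each (subject class, object class) pair to the predicate
    that co-occurs with it most often in the training annotations."""
    counts = defaultdict(Counter)
    for s, p, o in train_triplets:
        counts[(s, o)][p] += 1
    return {pair: c.most_common(1)[0][0] for pair, c in counts.items()}

# Usage: on biased data such a lookup table already scores well, e.g.
# baseline = build_frequency_baseline(train_triplets)
# baseline.get(("man", "shirt"))  # likely a fixed predicate like "wears"
```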
2. Visual question answering experiment:
In the visual question answering experiment, features learned with visual relevance perform better and bring a significant improvement in the metrics.
In specific case studies, features learned with visual relevance provide more semantic information: questions that cannot be answered correctly from single-object information alone show a clear improvement with our method.
3. Image captioning experiment:
On the image captioning task, learning visual relationships likewise improves performance.
Case analysis of the generated sentences shows that our method produces sentences with distinct semantic relationships; the sentences as a whole are often more vivid and contain richer interaction information.
Conclusion:
In learning and applying object relationships, we need to focus on visually relevant relationships. Existing relationship data cannot be used effectively in semantics-related tasks, and the main problem lies on the data side rather than the method side. For object relationships to find broader and deeper use in semantic understanding, we must first clarify which relationships need to be learned; only after settling what to learn can we go further on how to learn it.
Source: https://www.toutiao.com/i6763103092482245132/