A new perspective on the mutually beneficial relationship between object detection and instance segmentation | AAAI 2020

2025-07-09 Update From: SLTechnology News&Howtos (Shulou.com)

Introduction: this article analyzes and interprets the paper "RDSNet: A New Deep Architecture for Reciprocal Object Detection and Instance Segmentation", accepted at AAAI 2020. The code for the work has been open-sourced.

Paper link: https://arxiv.org/abs/1912.05070

Code link: https://github.com/wangsr126/RDSNet

Author team: Institute of Automation, Chinese Academy of Sciences, and Horizon Robotics. The first author, Wang Shaoru, is an intern at Horizon and a master's student at the Institute of Automation, Chinese Academy of Sciences.

The paper briefly surveys current object detection and instance segmentation algorithms and analyzes the strengths and weaknesses of each family of methods. On that basis, it proposes a unified framework that performs both tasks at once, with each task assisting the other so that both improve.

1. Background of the problem

Object detection and instance segmentation are two important tasks in computer vision. In recent years many strong algorithms have been proposed for both, with excellent results. However, few works analyze the relationship between the two in depth, which leads to errors such as those shown in the figure below:

The results in the figure were produced by Mask R-CNN. You can see missing instance masks ((a), (b)) and inconsistency between the bounding box and the instance mask ((c), (d)), caused by inaccurate localization of the bounding box. The algorithm proposed in the paper handles these problems well.

2. Introduction of the method

The algorithm framework is shown in the following figure:

The paper argues that object detection is an object-level task: it relies on object-level features and does not need high resolution, but it does need high-level semantic information. Instance segmentation is a pixel-level task: it produces a per-pixel output, so it needs higher resolution and more fine-grained detail.

Therefore, a two-stream network is designed, as shown in the figure. The upper object stream handles object detection and can be any anchor-based detector, such as SSD, YOLO or RetinaNet (the paper uses RetinaNet). The lower pixel stream handles segmentation at high resolution (the paper fuses multi-scale features in a way similar to Panoptic FPN to produce a high-resolution output). The operations that follow are the core of the paper; they describe how the two tasks are made to assist each other:

Objects assist instance segmentation:

Current instance segmentation algorithms fall into two categories. The first is proposal-based methods such as Mask R-CNN, which directly extend object detection algorithms. These methods suffer from the problems mentioned above: the resulting instance masks have relatively low resolution and depend heavily on the proposal's bounding box. The second is segmentation-based methods, which first predict an embedding for each pixel and then obtain the mask of each instance by clustering (pixels belonging to the same object have similar embeddings, so clustering groups them together and yields the mask of each object). These methods naturally avoid the shortcomings of proposal-based methods, but they generally cannot be trained end to end (the embeddings are usually trained with metric learning), and their performance is limited by the clustering algorithm.

On closer analysis, the difficulty with clustering stems mainly from the lack of cluster centers. In other words, if we had the center of each cluster, we could drop the clustering algorithm and train end to end. This "center" should be the embedding of each object; that is, it should come from the object level, not the pixel level! This leads to the instance mask generation algorithm based on correlation filtering proposed in the paper.

The object stream and the pixel stream extract object embeddings and pixel embeddings, respectively. (Obtaining the object embedding is simple: add one more prediction branch to the detection head, alongside the classification and regression branches.) A pixel and the object it belongs to should have similar embeddings, where similarity is measured by the inner product. That is, for each detected object, its embedding is used as a kernel, and the mask of the object is obtained by applying correlation filtering to the pixel embeddings.
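As a rough illustration of this step, the following minimal sketch computes a mask for one object by taking the inner product of its embedding with every pixel embedding. The function name `correlate_mask`, the sigmoid activation, the 0.5 threshold and the toy values are assumptions made for this example and are not taken from the paper or its released code.

```python
import math

def correlate_mask(object_embedding, pixel_embeddings, threshold=0.5):
    """Return (probabilities, binary mask) for one detected object.

    object_embedding: list of d floats, used as the correlation kernel.
    pixel_embeddings: H x W grid where each cell is a list of d floats.
    The sigmoid and the threshold are assumptions of this sketch.
    """
    probs, mask = [], []
    for row in pixel_embeddings:
        prob_row = []
        for pixel in row:
            # Inner-product similarity between the object and this pixel.
            score = sum(o * p for o, p in zip(object_embedding, pixel))
            # Squash the similarity into a foreground probability.
            prob_row.append(1.0 / (1.0 + math.exp(-score)))
        probs.append(prob_row)
        mask.append([1 if p > threshold else 0 for p in prob_row])
    return probs, mask

# Toy 2 x 3 map with 2-dimensional embeddings (made-up values).
pixels = [
    [[2.0, 0.0], [2.0, 0.1], [-2.0, 0.0]],
    [[1.5, 0.2], [-1.0, 0.0], [-2.0, 0.3]],
]
obj = [1.0, 0.0]
probs, mask = correlate_mask(obj, pixels)
print(mask)  # [[1, 1, 0], [1, 0, 0]]
```

Because the kernel comes from the detector and not from a clustering step, every operation here is differentiable up to the final threshold, which is what makes end-to-end training possible in principle.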

In addition, the paper makes full use of the bounding boxes produced by the object stream to suppress noise far from the center of the object, which to some extent overcomes the effect of the translation invariance of CNNs on the instance segmentation task.
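One simple way to picture this suppression is to zero out mask responses that fall outside an enlarged version of the detected box. The sketch below does exactly that; the helper name `crop_mask_to_box` and the `expand` factor are hypothetical, and the paper may apply the box in a different way.

```python
def crop_mask_to_box(probs, box, expand=1.5):
    """Zero out mask responses outside an expanded bounding box.

    probs: H x W grid of foreground probabilities.
    box: (x1, y1, x2, y2) in pixel coordinates.
    expand: hypothetical scale factor applied around the box center.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) / 2.0 * expand
    half_h = (y2 - y1) / 2.0 * expand
    out = []
    for y, row in enumerate(probs):
        out.append([
            # Keep the response only if the pixel lies inside the expanded box.
            p if abs(x - cx) <= half_w and abs(y - cy) <= half_h else 0.0
            for x, p in enumerate(row)
        ])
    return out

# A uniformly "noisy" 4 x 6 response map, cropped to a 2 x 2 box.
probs = [[0.9] * 6 for _ in range(4)]
cropped = crop_mask_to_box(probs, box=(1, 1, 2, 2), expand=1.0)
for row in cropped:
    print(row)
```

With this cropping, two objects of similar appearance at different positions no longer leak into each other's masks, even if their pixel embeddings are similar.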

Masks assist object detection:

Localizing the bounding box is a key part of object detection, and most existing methods obtain it by regression. However, if we revisit the definition of a bounding box, we find that it is itself defined by the mask of the object (the minimum enclosing rectangle of the object mask)! So, if we can obtain the mask of the object, why still rely on regression (provided that obtaining the mask does not itself depend on the bounding box)? Experiments show, however, that bounding boxes generated directly from the instance masks produced by the correlation filtering method above are not very accurate, and are even worse than those obtained by regression. Through visualization, the authors found that the masks of most objects yield very accurate bounding boxes, but for some objects the mask prediction is poor, which causes a large offset in the bounding box.
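The direct mask-to-box conversion discussed here, and its sensitivity to poor mask predictions, can be seen in a small sketch. The helper `mask_to_box` and the toy masks are illustrative assumptions; boxes use inclusive pixel indices.

```python
def mask_to_box(mask):
    """Return the tight box (x1, y1, x2, y2) around nonzero pixels, or None."""
    if not mask:
        return None
    # Rows and columns that contain at least one foreground pixel.
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    if not ys or not xs:
        return None
    return (xs[0], ys[0], xs[-1], ys[-1])

clean = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
print(mask_to_box(clean))  # (1, 1, 2, 2)

# A single stray foreground pixel is enough to stretch the box.
noisy = [row[:] for row in clean]
noisy[3][4] = 1
print(mask_to_box(noisy))  # (1, 1, 4, 3)
```

The second call shows why the minimum enclosing rectangle alone is fragile: one mispredicted pixel moves two of the four box edges.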

Based on this observation, the paper proposes a bounding box localization algorithm based on Bayes' rule. First, bounding box localization is defined as a classification task (whether a given coordinate along the width or height dimension is a boundary of the object). The problem then becomes estimating the posterior probability that a coordinate is a box boundary, given the object mask.
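Writing X for the discrete coordinate of one boundary (for example the left edge) and M' for the predicted mask of the object, the posterior can be expanded with Bayes' rule as below. The index t ranging over all candidate coordinates in the denominator is notation assumed for this sketch.

```latex
P(X = i \mid M') = \frac{P(X = i)\, P(M' \mid X = i)}{\sum_{t} P(X = t)\, P(M' \mid X = t)}
```

The predicted boundary is then the coordinate i with the highest posterior probability.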

Then, using Bayes' rule, the bounding box obtained by regression provides the prior probability P(X = i), while the likelihood P(M' | X = i) is obtained from the mask by taking the maximum along each column (or row), followed by a one-dimensional convolution and an activation function.
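The following sketch puts these pieces together for the left boundary only. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian form of the prior, the value of `sigma`, the hand-set edge kernel (standing in for a learned one), the `gain` factor and the sigmoid are all choices made for this example.

```python
import math

def refine_left_boundary(mask_probs, regressed_x, sigma=2.0,
                         kernel=(-1.0, 1.0, 0.0), gain=4.0):
    """Pick the most probable left-boundary column from a mask and a regressed box."""
    width = len(mask_probs[0])
    # Column-wise maximum: how strongly each x coordinate is covered by the mask.
    profile = [max(row[x] for row in mask_probs) for x in range(width)]
    # Prior P(X = i): assumed Gaussian centered on the regressed coordinate.
    prior = [math.exp(-((i - regressed_x) ** 2) / (2.0 * sigma ** 2))
             for i in range(width)]
    # Likelihood P(M' | X = i): 1-D convolution over the profile, then a sigmoid.
    likelihood = []
    for i in range(width):
        acc = 0.0
        for k, w in enumerate(kernel):
            # Replicate padding at the borders of the profile.
            j = min(max(i + k - len(kernel) // 2, 0), width - 1)
            acc += w * profile[j]
        likelihood.append(1.0 / (1.0 + math.exp(-gain * acc)))
    # Posterior via Bayes' rule, normalized over all candidate coordinates.
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    posterior = [v / total for v in joint]
    best = max(range(width), key=posterior.__getitem__)
    return best, posterior

# The object covers columns 3..6, but regression placed the left edge at x = 5.
row = [0.0] * 3 + [1.0] * 4 + [0.0] * 3
mask_probs = [row[:] for _ in range(3)]
best, posterior = refine_left_boundary(mask_probs, regressed_x=5)
print(best)  # 3
```

In this toy case the mask evidence pulls the boundary from the regressed position back to the true edge, while the prior keeps the estimate from jumping to a distant spurious response.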

The overall process is shown in the following figure:

This method combines the advantages of the regressed bounding box and the instance mask, and yields a more accurate bounding box. The results can be seen in the following figure: the bounding boxes obtained by this method clearly have a higher IoU with the ground-truth bounding boxes.
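For readers unfamiliar with the metric, IoU (intersection over union) between a predicted box and a ground-truth box can be computed as in the sketch below. The boxes are made-up numbers chosen only to show how a small shift of one edge changes the score; they are not results from the paper.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Overlap is zero when the boxes do not intersect.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

ground_truth = (10, 10, 50, 50)
regressed = (16, 10, 50, 50)   # left edge off by 6 pixels (illustrative)
refined = (11, 10, 50, 50)     # left edge off by 1 pixel (illustrative)
print(box_iou(regressed, ground_truth))  # 0.85
print(box_iou(refined, ground_truth))    # 0.975
```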

3. Experimental results

In this paper, experimental verification is carried out on the COCO dataset.

In instance segmentation, the method achieves a better speed-accuracy trade-off among single-stage algorithms: its accuracy is close to that of TensorMask at nearly 3 times the speed, and it improves on YOLACT by 2.3 mAP at similar speed.

In object detection, the method brings consistent performance gains across different backbones at very low computational cost.

It is worth noting that the paper uses RetinaNet as the detector, and extending it to instance segmentation does not add significant computation. If a more advanced object detection algorithm were adopted, accuracy and speed could improve further.

4. Some digressions

This is the end of the interpretation of the article, but the author also provides some other perspectives to understand this article:

Anchor-based or Anchor-free?

Anchor-free was arguably a buzzword in object detection in 2019. Here we follow that trend and analyze how it relates to this paper.

Looking carefully at the framework proposed in the paper, we can see that the object stream is actually anchor-based, while the pixel stream is anchor-free. The detector in the object stream can be any of many object detection algorithms, including but not limited to SSD, YOLO, RetinaNet, or even the two-stage Faster R-CNN. The pixel stream can predict not only pixel embeddings but also bounding box corners (as in CornerNet), human keypoints (as in Associative Embedding), or other pixel-level representations of object instances. The two branches are linked by correlation filtering, which to some extent solves the grouping problem in CornerNet. From this point of view, the framework proposed in the paper is a genuine combination of anchor-based and anchor-free, which may lead to more interesting work in the future.

Bbox or Mask?

As Ross mentioned in his ICCV tutorial, object detection is a very broad concept, and different object representations correspond to tasks at different levels: for example, bbox corresponds to traditional object detection, mask corresponds to instance segmentation, human keypoints correspond to pose estimation, and human surfaces correspond to dense human pose estimation. These tasks are interrelated and reflect an understanding of objects from different angles and at different levels. Existing methods either treat these problems independently, or build the high-level task directly on the low-level task (for example, Mask R-CNN and two-stage human pose estimation), but the correlation between these tasks is not limited to this. The paper focuses on the relationship between bbox and mask, but it does not take the idea to its limit. From this point of view, object detection still has a lot of room for development.

References

Kaiming He, et al. "Mask R-CNN." In Proceedings of IEEE International Conference on Computer Vision. 2017.

Wei Liu, et al. "SSD: Single shot multibox detector." In Proceedings of European Conference on Computer Vision. 2016.

Joseph Redmon and Ali Farhadi. "YOLOv3: An incremental improvement." arXiv preprint arXiv:1804.02767 (2018).

Tsung-Yi Lin, et al. "Focal loss for dense object detection." In Proceedings of IEEE International Conference on Computer Vision. 2017.

Alexander Kirillov, et al. "Panoptic feature pyramid networks." In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2019.

Xinlei Chen, et al. "TensorMask: A foundation for dense object segmentation." arXiv preprint arXiv:1903.12174 (2019).

Daniel Bolya, et al. "YOLACT: Real-time instance segmentation." In Proceedings of IEEE International Conference on Computer Vision. 2019.

Shaoqing Ren, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." In Proceedings of Advances in Neural Information Processing Systems. 2015.

Hei Law and Jia Deng. "CornerNet: Detecting objects as paired keypoints." In Proceedings of European Conference on Computer Vision. 2018.

Alejandro Newell, et al. "Associative embedding: End-to-end learning for joint detection and grouping." In Proceedings of Advances in Neural Information Processing Systems. 2017.

https://www.leiphone.com/news/201912/TTcH12nhAzBWl8I5.html
