What is the R-CNN model?

2025-04-26 Update. From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/01 report

This article explains what the R-CNN model is and how it works. The explanation is kept short and follows the order of the original paper: abstract, introduction, model design, and results.

Abstract

Object detection performance on the canonical PASCAL VOC dataset had plateaued in the years leading up to R-CNN, and the best-performing methods were mostly complex ensembles that combined earlier approaches. R-CNN, proposed by Ross Girshick and colleagues, improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012. The authors rely on two insights: first, a high-capacity CNN can be applied to bottom-up region proposals in order to localize and segment objects; second, when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.

1. Introduction

Before R-CNN, most object detection algorithms were based on SIFT and HOG features. Both are blockwise orientation histograms, a representation that can be roughly associated with the early stages of the mammalian visual pathway. Recognition in the brain, however, happens over several stages, which suggests that a recognition system should also have a multi-layer structure. Fukushima's "neocognitron" was an early attempt at such a model, but it lacked a supervised training algorithm; LeCun et al. later supplied the missing algorithm by showing that convolutional networks can be trained with stochastic gradient descent via backpropagation.

Given the success of CNNs in image classification since 2012, the authors argue that these results can carry over to object detection on PASCAL VOC. To get there, two problems need to be solved:

Localizing objects with a deep network. Detection, unlike classification, requires localizing objects within the image. Localization is commonly done with a sliding-window detector (cropping part of the image with a window and evaluating each crop), but this is a major challenge for a CNN whose high-level units have large receptive fields and strides in the input image. R-CNN instead operates on region proposals.

Training a high-capacity network with only a small amount of labeled data. The solution was already mentioned above: supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on the small target dataset (PASCAL).

In addition, the authors' system is quite efficient: the only class-specific computations are a reasonably small matrix-vector product and greedy non-maximum suppression.
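To make the "matrix-vector product" concrete, here is a minimal sketch of scoring one region's feature vector against every class. The function name `score_region`, the toy 3-dimensional feature, and the weight and bias values are all illustrative assumptions; in R-CNN the feature vector would have 4096 entries and the weights would come from trained SVMs.

```python
def score_region(feature, weights, biases):
    """Return one score per class: dot(weights[c], feature) + biases[c]."""
    return [
        sum(w_i * f_i for w_i, f_i in zip(w_row, feature)) + b
        for w_row, b in zip(weights, biases)
    ]

# Hypothetical values, chosen only so the arithmetic is easy to follow.
feature = [0.5, -1.0, 2.0]       # stands in for a 4096-D CNN feature
weights = [[1.0, 0.0, 0.5],      # class 0 weight vector
           [-1.0, 1.0, 0.0]]     # class 1 weight vector
biases = [0.0, 0.25]

scores = score_region(feature, weights, biases)
print(scores)  # [1.5, -1.25]
```

Everything before this step (region proposals and CNN features) is shared by all classes, so adding a class only adds one more row to `weights`.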

The authors also found that removing 94% of the CNN's parameters causes only a moderate drop in detection accuracy. Using a detection analysis tool, they further found that a simple bounding-box regression method significantly reduces localization errors.

2. Introduction to R-CNN Model

2.1 Model Design

The whole detection system is divided into three parts:

Generate category-independent region proposals. The authors use selective search to enable a controlled comparison with prior detection work.

Extract a fixed-length feature vector from each region proposal with a CNN. Each input region is resized to a fixed 227×227 pixels, with mean subtraction applied beforehand. A CNN with 5 convolutional layers and 2 fully connected layers then extracts a 4096-dimensional feature vector.

Classify the feature vectors with class-specific linear SVMs.
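The preprocessing in the second step can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes a grayscale image stored as a list of rows, nearest-neighbor sampling, and a single scalar mean, none of which are specified in this article. The names `warp_region` and `subtract_mean` are hypothetical.

```python
def warp_region(image, box, out_size=227):
    """Crop box = (x1, y1, x2, y2), inclusive pixel coordinates, from a 2-D
    image and stretch it to out_size x out_size, ignoring aspect ratio."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1 + 1, y2 - y1 + 1
    return [
        [image[y1 + (r * h) // out_size][x1 + (c * w) // out_size]
         for c in range(out_size)]
        for r in range(out_size)
    ]

def subtract_mean(patch, mean_value):
    """Subtract a scalar mean from every pixel (assumption: scalar mean)."""
    return [[p - mean_value for p in row] for row in patch]

# Toy 4x6 image; a 4x2 box is stretched to 4x4 so the result is easy to read.
image = [[r * 10 + c for c in range(6)] for r in range(4)]
patch = warp_region(image, (1, 0, 4, 1), out_size=4)
centered = subtract_mean(patch, 5)
```

With the default `out_size=227`, the same call produces the fixed-size input the CNN expects, whatever the shape of the original box.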

2.2 Test-time detection

At test time, the system uses selective search to extract around 2000 region proposals from the image, warps each one to 227×227, passes it through the CNN to extract features, and scores the features with the SVMs. Finally, greedy non-maximum suppression, applied to each class independently, removes boxes that overlap heavily with a higher-scoring box.
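Greedy non-maximum suppression for a single class can be sketched as below. Boxes are assumed to be `(x1, y1, x2, y2)` corner coordinates, and the default `iou_threshold=0.3` is only a placeholder value chosen for this example, not a figure taken from the article.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_threshold=0.3):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any
        # higher-scoring box that was already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box is dropped because its IoU with the first is 81/119, about 0.68; the third does not overlap anything and survives.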

Detection is efficient for two reasons: 1. all CNN parameters are shared across categories; 2. the 4096-dimensional feature vectors are low-dimensional compared with those used by other common approaches.

As a result, even with 100k classes, the class-scoring matrix multiplication for one image takes only about 10 seconds on a multi-core CPU, and because the features are low-dimensional, storing the 100k linear predictors takes only about 1.5GB.
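A quick back-of-the-envelope calculation shows where a figure of about 1.5GB can come from. The sketch assumes one 4096-dimensional weight vector per class stored as 32-bit floats, and 2000 regions per image; the float width is an assumption, not something stated in this article.

```python
FEATURE_DIM = 4096        # length of the CNN feature vector
NUM_CLASSES = 100_000     # the hypothetical 100k-class scenario
NUM_REGIONS = 2000        # region proposals per image
BYTES_PER_FLOAT32 = 4     # assumption: weights stored as 32-bit floats

# One weight vector per class.
weight_bytes = FEATURE_DIM * NUM_CLASSES * BYTES_PER_FLOAT32
weight_gib = weight_bytes / 2**30

# Scoring every region against every class is a
# (NUM_REGIONS x FEATURE_DIM) by (FEATURE_DIM x NUM_CLASSES) product.
multiply_adds = NUM_REGIONS * FEATURE_DIM * NUM_CLASSES

print(f"{weight_gib:.2f} GiB of classifier weights")   # 1.53 GiB
print(f"{multiply_adds:.2e} multiply-adds per image")  # 8.19e+11
```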

2.3 Training

Supervised pre-training: the CNN is first pre-trained on ILSVRC 2012 using image-level annotations only (i.e., no bounding box labels), with the Caffe framework. The resulting network nearly matches the ILSVRC classification performance of Krizhevsky et al.; its error rate is slightly higher because of simplifications in the training process.

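Domain-specific fine-tuning: the CNN is then fine-tuned on warped region proposals using SGD with a learning rate of 0.001. For a given class, any region proposal with an IoU of at least 0.5 with a ground-truth box is treated as a positive for that class; the rest are background. Each SGD iteration samples 32 positive windows and 96 background windows, for a mini-batch of 128.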
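The sampling rule can be sketched as follows. The sketch assumes each proposal's maximum IoU with any ground-truth box has already been computed; `sample_minibatch` is a hypothetical helper, and the small counts in the example are for illustration only (they keep a 1:3 positive-to-background ratio).

```python
import random

def sample_minibatch(max_ious, num_pos, num_bg, pos_threshold=0.5, seed=0):
    """Return (positive_indices, background_indices).

    max_ious[i] is proposal i's largest IoU with any ground-truth box.
    Proposals at or above pos_threshold are positives, the rest background.
    """
    rng = random.Random(seed)
    positives = [i for i, v in enumerate(max_ious) if v >= pos_threshold]
    background = [i for i, v in enumerate(max_ious) if v < pos_threshold]
    # Never ask for more samples than are available.
    pos = rng.sample(positives, min(num_pos, len(positives)))
    bg = rng.sample(background, min(num_bg, len(background)))
    return pos, bg

max_ious = [0.9, 0.1, 0.55, 0.0, 0.3, 0.7, 0.2, 0.45]
pos, bg = sample_minibatch(max_ious, num_pos=2, num_bg=6)
```

Fixing the number of positives per batch biases sampling toward positive windows, which are rare compared with background.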

Object category classifiers: for a given class, a region that tightly encloses an object and a region that contains none of it are easy to label, but when the IoU falls somewhere in between, it is hard to decide whether the candidate box counts as containing the object. The authors settle this with an IoU threshold of 0.3, below which a region is treated as background (a negative example). One linear SVM is then optimized per class. Because there are so many negative examples, hard negative mining is used.
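A minimal sketch of the labeling rule and of one round of hard negative selection follows. Two details are assumptions beyond what this article states: following the R-CNN paper, only ground-truth boxes are used as SVM positives and proposals at or above the threshold that are not ground truth are ignored; and "hard" negatives are taken to be those scoring above the SVM margin of -1, which is one common criterion rather than a confirmed detail. Both function names are hypothetical.

```python
def svm_label(max_iou, is_ground_truth, neg_threshold=0.3):
    """Return +1 (positive), -1 (negative), or None (ignored)."""
    if is_ground_truth:
        return 1
    if max_iou < neg_threshold:
        return -1
    return None  # grey zone: neither clearly object nor clearly background

def hard_negatives(neg_scores, margin=-1.0):
    """Indices of negatives that the current SVM scores above the margin."""
    return [i for i, s in enumerate(neg_scores) if s > margin]

labels = [svm_label(1.0, True), svm_label(0.1, False), svm_label(0.4, False)]
hard = hard_negatives([-2.0, -0.5, 0.3, -1.0])
print(labels)  # [1, -1, None]
print(hard)    # [1, 2]
```

In a mining loop, the SVM would be retrained with the selected negatives added to its training set, and the selection repeated until few new hard negatives appear.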

2.4 Results on PASCAL VOC 2010-12

The authors submitted two versions, one without bounding-box regression (R-CNN) and one with it (R-CNN BB). The results are as follows:

In short, mAP improves significantly (from 35.1% to 53.7%), and the run time is short as well.
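This article does not describe how the bounding-box regression works. As a rough illustration, the sketch below uses the center/size parameterization given in the R-CNN paper: offsets of the box center are scaled by the proposal's width and height, and size changes are expressed as log ratios. Boxes are `(center_x, center_y, width, height)`; the function names and sample numbers are hypothetical, and the regressor that would predict these targets from CNN features is not shown.

```python
import math

def regression_targets(proposal, ground_truth):
    """Targets that map a proposal box onto its ground-truth box."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    return ((gx - px) / pw,        # scale-invariant x shift of the center
            (gy - py) / ph,        # scale-invariant y shift of the center
            math.log(gw / pw),     # log-space width change
            math.log(gh / ph))     # log-space height change

def apply_deltas(proposal, deltas):
    """Invert regression_targets: move and resize a proposal."""
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    return (px + pw * dx, py + ph * dy, pw * math.exp(dw), ph * math.exp(dh))

proposal = (50.0, 50.0, 20.0, 40.0)
ground_truth = (55.0, 45.0, 40.0, 20.0)
targets = regression_targets(proposal, ground_truth)
refined = apply_deltas(proposal, targets)  # recovers ground_truth up to rounding
```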
