How to Analyze the Essence of Anchor in Object Detection

Updated: 2025-04-27. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Many newcomers to object detection are unclear about what an anchor really is. This article explains the idea in detail.

Introduction

Using face detection as an example, this article explains the essence of the anchor in object detection: it is a multi-scale sliding window. Comparing anchors with the traditional detection pipeline highlights their advantages. Links to the Retina-Face code and related papers are attached at the end of the article.

Several well-known object detection models rely on a concept that is hard to grasp at first: the anchor. In Faster R-CNN, anchors are also called reference boxes. The point of a reference box is to bring in prior knowledge. First, consider the object detection task itself: the input is an image, and the output is a set of rectangular bounding boxes, each labeled with the category of the object it contains, as shown in the figure below:

So a core question is the shape and size of these boxes, which is what the ratio and scale parameters in anchor-based papers describe. Ratio is simple: it is the aspect ratio of the box, that is, the ratio between its two sides. Scale can be understood as its side length. Why do anchors exist at all, and what role do they play? In essence, an anchor is a multi-scale sliding window. This view is rarely stated explicitly, so let's analyze it in detail.
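As a minimal sketch of how scale and ratio together fix a box shape, the snippet below assumes one common convention that the article does not spell out: ratio means width divided by height, and scale is the side length of the square with the same area. The function name `anchor_shape` is hypothetical.

```python
import math


def anchor_shape(scale, ratio):
    """Return (width, height) of a box with area scale**2 and width/height == ratio.

    Assumed convention: ratio = width / height, scale = side of the equal-area square.
    """
    w = scale * math.sqrt(ratio)
    h = scale / math.sqrt(ratio)
    return w, h


# Same scale, three aspect ratios: the area stays fixed while the shape changes.
for ratio in (0.5, 1.0, 2.0):
    w, h = anchor_shape(64, ratio)
    print(f"ratio={ratio}: w={w:.1f}, h={h:.1f}")
```

Other conventions (for example ratio as height divided by width) only swap the two return values.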

First, look at how traditional detection works, for example face, pedestrian, and vehicle detection with Haar and HOG features. These methods were mainstream before CNN-based detectors, which later overtook them. The pipeline is as follows:

1. Build an image pyramid, because the scale of the objects to be detected varies.

2. Slide a window over each level of the image pyramid to generate candidate regions (as shown in the animation below).

3. Extract features (for example HOG) from each candidate region generated above and classify it with a classifier (for example an SVM), such as face or not a face.

4. Apply non-maximum suppression (NMS) to obtain the final result.
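Steps 1 and 2 of the pipeline above can be sketched as follows. This is only an illustration under assumed values: the window size of 24, the stride of 8, the pyramid step of 1.5, and all function names are hypothetical. It enumerates window coordinates only and does not resize any real image, extract features, or run NMS.

```python
def pyramid_sizes(width, height, scale_step=1.5, min_size=24):
    """Yield (factor, w, h) for successively downscaled copies of the image."""
    factor = 1.0
    while True:
        w, h = int(width / factor), int(height / factor)
        if w < min_size or h < min_size:
            break
        yield factor, w, h
        factor *= scale_step


def sliding_windows(width, height, window=24, stride=8):
    """Yield the top-left corner of every window position on one pyramid level."""
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield x, y


def candidate_boxes(width, height, window=24, stride=8, scale_step=1.5):
    """Return candidate boxes (x1, y1, x2, y2) in original-image coordinates."""
    boxes = []
    for factor, w, h in pyramid_sizes(width, height, scale_step, window):
        for x, y in sliding_windows(w, h, window, stride):
            # A fixed-size window on a shrunken level covers a larger
            # region of the original image, hence the multiplication.
            boxes.append((x * factor, y * factor,
                          (x + window) * factor, (y + window) * factor))
    return boxes


boxes = candidate_boxes(96, 96)
print(len(boxes), "candidate windows")
```

Even for a tiny 96x96 image, the nested loops produce well over a hundred candidates, each of which would need its own feature extraction and classification.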

A CNN can naturally replace step 3 because of its strong feature extraction ability. But steps 1 and 2 are independent of the CNN and require a lot of loop traversal, so speed is limited. And to get good localization accuracy, more sliding windows with different scales and ratios are needed, which increases the running time further. Deep learning favors end-to-end models, so how can steps 1 and 2 be folded into the network? A sliding window is essentially a traversal over pixels, so we can directly assign several window rectangles with different scales and ratios to each pixel, with each rectangle centered on the pixel it belongs to. The scales and ratios can be chosen from prior knowledge, or obtained by k-means clustering as in YOLO-v2. The window rectangles with different scales and ratios assigned to each pixel are the anchors. In other words, the pixel-by-pixel traversal is replaced by a fixed assignment to every pixel, and the CNN can then make a dense, per-pixel map prediction directly. Let's visualize the anchors (only 200 are shown here):
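The assignment of several rectangles to every position can be sketched as below. One assumption goes beyond the text: in practice anchors are usually placed on the cells of a downsampled feature map rather than on every input pixel, so the sketch takes a feature-map size and a stride. The function name, the width/height ratio convention, and the example values are all hypothetical.

```python
import math


def generate_anchors(feat_w, feat_h, stride, scales, ratios):
    """Return anchors as (cx, cy, w, h), one set of shapes per feature-map cell."""
    anchors = []
    for gy in range(feat_h):
        for gx in range(feat_w):
            # Center of this cell, expressed in input-image pixels.
            cx = (gx + 0.5) * stride
            cy = (gy + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # Assumed convention: ratio = width / height.
                    w = scale * math.sqrt(ratio)
                    h = scale / math.sqrt(ratio)
                    anchors.append((cx, cy, w, h))
    return anchors


anchors = generate_anchors(feat_w=4, feat_h=4, stride=16,
                           scales=(32, 64), ratios=(0.5, 1.0, 2.0))
print(len(anchors))  # 4 * 4 cells * 2 scales * 3 ratios = 96
```

Because the anchor set depends only on the grid and the chosen shapes, it can be computed once and reused for every image of the same size, which is what removes the explicit pyramid and window loops from inference.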

As the visualization shows, 200 anchors already cover almost the whole image. A typical network uses tens of thousands of anchors; Retina-Face, for example, uses about 25,000. Now go back to step 3: the CNN is used to classify all of these anchors, for example as face or not a face.
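To see how the count reaches the tens of thousands, here is a small arithmetic sketch. The input size, strides, and anchors per cell are hypothetical values chosen for illustration and are not claimed to match the Retina-Face configuration behind the figure quoted above.

```python
import math


def count_anchors(input_size, strides, anchors_per_cell):
    """Total anchors for a square input with one feature map per stride."""
    total = 0
    for stride in strides:
        side = math.ceil(input_size / stride)  # feature-map side length
        total += side * side * anchors_per_cell
    return total


# 80*80*2 + 40*40*2 + 20*20*2 = 16800
print(count_anchors(640, strides=(8, 16, 32), anchors_per_cell=2))
```

A larger input or an extra fine-stride level raises the total quickly, since the finest feature map dominates the sum.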

Deciding the label of each anchor is relatively simple: check whether the IoU (intersection over union) between the anchor and a given ground-truth rectangle meets a condition. For example, an anchor with IoU > 0.5 is considered positive.
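The IoU rule can be sketched as follows, assuming boxes are given as corner coordinates (x1, y1, x2, y2) and a single ground-truth box. The function names and the example boxes are hypothetical; real detectors often add further rules, such as an ignore band between two thresholds.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_box, pos_thresh=0.5):
    """Label each anchor 1 (positive) if its IoU with gt_box exceeds pos_thresh, else 0."""
    return [1 if iou(a, gt_box) > pos_thresh else 0 for a in anchors]


gt = (10, 10, 50, 50)
anchors = [(12, 12, 52, 52), (20, 20, 60, 60), (100, 100, 140, 140)]
print(label_anchors(anchors, gt))  # [1, 0, 0]
```

The first anchor overlaps the ground truth with IoU of about 0.82, the second with about 0.39, and the third not at all, so only the first is labeled positive.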
