Today I will talk about how VarifocalNet ranks candidate boxes with an optimal scheme. Many people may not know much about it, so the editor has summarized the following in the hope that you can get something out of this article.
1. Introduction
One problem with current object detection methods is that a target's classification score does not reflect the quality of its localization. As a result, some boxes with accurately predicted locations receive low confidence and are suppressed during NMS. Many methods have been proposed to address this, such as predicting an additional IoU score or centerness score to evaluate localization quality and multiplying it with the classification score before NMS. However, this is not optimal and, as shown later, may even lead to worse results; and if a small sub-network is used to predict the localization score, the approach is inelegant and requires extra computation.
To overcome these shortcomings, we can ask a question: can we merge the localization quality prediction into the classification score instead of predicting it separately? That is, we predict a localization-aware, or IoU-aware, classification score, which we call the IACS.
Our contributions are as follows:
1. We show that accurately ranking the large number of candidate boxes with an appropriate score is a key factor in improving the performance of dense object detectors.
2. We propose the Varifocal Loss to train a dense object detector to regress the IACS.
3. We propose a new star-shaped feature representation of the bounding box, used to predict the IACS and to refine the box.
4. We develop a new object detector based on FCOS, called VarifocalNet or VFNet. A schematic diagram of our method is shown below.
2. Motivation
In this section, we study the performance upper bound of FCOS+ATSS and show the importance of an IoU-aware classification score as the ranking basis. To study the upper bound, before NMS we replace the predicted dense classification scores, distance offsets, and centerness scores with their ground-truth values and then evaluate on coco val2017. For the classification probability vector we have two choices: set the entry at the ground-truth category directly to 1, or set it to the IoU between the predicted box and the gt box (gt_IoU). For the centerness value, we likewise consider using either its true value or the gt_IoU value. As shown in Table 1, the original FCOS+ATSS reaches 39.2 AP. Using the ground-truth centerness value at inference (gt_ctr) only gains about 2 points, and replacing the centerness value with gt_IoU (gt_ctr_iou) only raises the AP to 43.5. This shows that multiplying the class probability by the centerness cannot bring a significant improvement.
By contrast, FCOS+ATSS with ground-truth bounding boxes achieves 56.1 AP even without centerness. However, if the class probability at the gt label position is set to 1 (gt_cls), whether centerness is used becomes important (43.1 AP vs. 58.1 AP), because centerness can, to some extent, distinguish accurate from inaccurate bounding boxes.
Most surprisingly, if the classification score at the gt label position is replaced with gt_IoU (gt_cls_iou), i.e., the IACS, then even without centerness the detector reaches 74.7 AP. These results show that for most gt objects an accurate bounding box already exists in the large pool of candidates, so the key is how to select these high-quality detections from the pool, and the IACS is the best measure for doing so.
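A minimal sketch of this oracle experiment (replace the classification score with the gt IoU at the gt class position, then rank and run NMS); the function names and the surrounding data pipeline are assumptions for illustration, not the authors' code:

```python
# Illustrative sketch of the gt_cls_iou "oracle" ranking experiment described above.
# The actual study uses the FCOS+ATSS head and COCO val2017 evaluation, not shown here.
import torch
from torchvision.ops import box_iou, nms

def oracle_iacs_scores(cand_boxes, num_classes, gt_boxes, gt_labels):
    """Replace predicted class scores with the IoU to the best-matching gt box,
    placed at that gt box's class index (all other entries are zero)."""
    ious = box_iou(cand_boxes, gt_boxes)          # (N, M) pairwise IoUs
    best_iou, best_gt = ious.max(dim=1)           # best-matching gt per candidate
    scores = cand_boxes.new_zeros((len(cand_boxes), num_classes))
    scores[torch.arange(len(cand_boxes)), gt_labels[best_gt]] = best_iou
    return scores

def rank_and_filter(cand_boxes, oracle_scores, iou_thr=0.6):
    """Rank candidates by their oracle IACS and suppress duplicates with NMS."""
    top_scores, _ = oracle_scores.max(dim=1)
    keep = nms(cand_boxes, top_scores, iou_thr)
    return cand_boxes[keep], top_scores[keep]
```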
3. VarifocalNet
Based on the above findings, we propose to learn an IoU-aware classification score (IACS) to rank the detection results. Starting from FCOS+ATSS, we remove the centerness branch and construct a new dense object detector, called VarifocalNet or VFNet. Compared to FCOS+ATSS, it has three new components: the Varifocal Loss, the star-shaped bounding box representation, and bounding box refinement.
3.1 Varifocal Loss
We design the Varifocal Loss to train the IACS; it evolves from Focal Loss. Focal Loss is defined as follows:
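In its standard binary form, with p the predicted probability of the foreground class, y ∈ {1, −1} the ground-truth label, and α, γ the usual hyperparameters:

$$
\mathrm{FL}(p, y) =
\begin{cases}
-\alpha (1 - p)^{\gamma} \log(p), & \text{if } y = 1, \\
-(1 - \alpha)\, p^{\gamma} \log(1 - p), & \text{otherwise.}
\end{cases}
$$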
Here, α balances the weights of positive and negative samples, and the modulating factor with exponent γ down-weights easy samples so that hard samples receive higher weight, preventing the large number of easy negatives from dominating the loss during training. We borrow this weighting idea from Focal Loss and use the Varifocal Loss to train the regression of the continuous IACS. Unlike Focal Loss, which treats positive and negative samples symmetrically, we treat them asymmetrically. Our Varifocal Loss is defined as:
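With p the predicted IACS and q the target score defined below:

$$
\mathrm{VFL}(p, q) =
\begin{cases}
-q \left( q \log(p) + (1 - q) \log(1 - p) \right), & q > 0, \\
-\alpha\, p^{\gamma} \log(1 - p), & q = 0.
\end{cases}
$$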
where p is the predicted IACS and q is the target IoU score: for a positive sample, q is the IoU between the predicted bounding box and the gt box, while for a negative sample q is 0 (see Figure 1 above).
As the formula shows, VFL only down-weights negative samples. Positive samples are scarce, so we want to make full use of their supervision signal. On the other hand, inspired by PISA and IoU-balanced Loss, we weight positive samples by q: a positive sample with a high gt IoU contributes more to the loss, so training focuses on those high-quality samples. To balance positive and negative samples overall, we also apply the factor α to the negative samples.
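A minimal PyTorch-style sketch of this loss, assuming per-class sigmoid outputs and a target tensor holding q (the gt IoU at the gt class position for positives, 0 elsewhere); this is an illustrative implementation, not the official one:

```python
import torch
import torch.nn.functional as F

def varifocal_loss(pred_logits, target_q, alpha=0.75, gamma=2.0):
    """Illustrative Varifocal Loss.

    pred_logits: (N, C) raw class logits; sigmoid is applied inside.
    target_q:    (N, C) target IACS, i.e. the IoU with the matched gt box at
                 the gt class position for positives and 0 everywhere else.
    """
    p = pred_logits.sigmoid()
    # Positives are weighted by their target score q; negatives by alpha * p^gamma,
    # which is exactly the asymmetric treatment described above.
    weight = torch.where(target_q > 0, target_q, alpha * p.pow(gamma))
    bce = F.binary_cross_entropy_with_logits(pred_logits, target_q, reduction="none")
    return (weight * bce).sum()
```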
3.2 Star-shaped bounding box representation
We also design an efficient star-shaped bounding box representation to predict the IACS: nine fixed sampling points (the yellow circles in Figure 1) represent the bounding box through a deformable convolution. This representation captures the geometry of the bounding box and its nearby contextual information, which is important for encoding the misalignment between the predicted box and the gt box.
Specifically, given a sampling location (x, y) on the feature map, we first regress an initial box with a 3x3 convolution. As in FCOS, this box is encoded as a 4D vector (l', t', r', b') giving the distances from the location to the four sides. Using this distance vector, we heuristically select nine sampling points: (x, y), (x−l', y), (x, y−t'), (x+r', y), (x, y+b'), (x−l', y−t'), (x+r', y−t'), (x−l', y+b') and (x+r', y+b'). These nine points are then mapped onto the feature map and used as the sampling locations of a deformable convolution to represent the bounding box. Because the points are selected by hand and require no extra prediction, the computation is very efficient.
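A sketch of how the nine star points could be turned into the offset map of a 3x3 deformable convolution; the function name and tensor layout are assumptions for illustration (common deformable-conv implementations expect per-point (dy, dx) offsets relative to the regular 3x3 grid):

```python
import torch

def star_dcn_offsets(l, t, r, b):
    """Build a (B, 18, H, W) offset map for a 3x3 deformable convolution from the
    initial distance predictions l, t, r, b, each of shape (B, 1, H, W) and assumed
    to be in feature-map (stride-normalized) units."""
    zero = torch.zeros_like(l)
    # The nine star points relative to (x, y), in 3x3 kernel order (dy, dx):
    # (-t,-l) (-t, 0) (-t,+r)
    # ( 0,-l) ( 0, 0) ( 0,+r)
    # (+b,-l) (+b, 0) (+b,+r)
    dys = [-t, -t, -t, zero, zero, zero, b, b, b]
    dxs = [-l, zero, r, -l, zero, r, -l, zero, r]
    # Subtract the regular grid position (ky, kx) in {-1, 0, 1} so the deformable
    # convolution samples exactly at the star points.
    grid = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 0), (0, 1), (1, -1), (1, 0), (1, 1)]
    offsets = []
    for (ky, kx), dy, dx in zip(grid, dys, dxs):
        offsets.append(dy - ky)
        offsets.append(dx - kx)
    return torch.cat(offsets, dim=1)
```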
3.3 Bounding box refinement
We further improve localization accuracy through a bounding box refinement step. Refinement is not commonly used in dense object detection, but with the star-shaped representation it can be adopted in a dense detector without losing computational efficiency.
We model bounding box refinement as a residual learning problem. For the initial regressed box (l', t', r', b'), we first extract its star-shaped representation and encode it. Then we learn four distance scaling factors (Δl, Δt, Δr, Δb) and scale the distance vector, so that the refined bounding box (l, t, r, b) = (Δl × l', Δt × t', Δr × r', Δb × b') is closer to the gt box.
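The residual step itself is just an element-wise scaling; a minimal sketch, with illustrative names for the inputs:

```python
def refine_box(initial_ltrb, scale_factors):
    """Scale the initial distances (l', t', r', b') by the learned factors
    (dl, dt, dr, db) predicted from the star-shaped representation."""
    l, t, r, b = initial_ltrb
    dl, dt, dr, db = scale_factors
    return (dl * l, dt * t, dr * r, db * b)  # refined (l, t, r, b)
```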
3.4 VarifocalNet
Adding the above three components to FCOS+ATSS and removing the centerness branch gives VarifocalNet. Figure 3 shows its structure. The backbone is the same as in FCOS; the differences lie in the detection head. The localization sub-network performs both bounding box regression and refinement.
3.5 Loss function and inference
The loss function is as follows:
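Restated with explicit symbols (p_{c,i} and q_{c,i} are the predicted and target IACS for class c at location i, bbox*_i is the gt box, L_bbox is a box regression loss such as GIoU, and λ_0, λ_1 are balancing weights):

$$
\mathcal{L} = \frac{1}{N_{pos}} \sum_{i} \sum_{c} \mathrm{VFL}(p_{c,i}, q_{c,i})
+ \frac{\lambda_0}{N_{pos}} \sum_{i} q_{c^*,i}\, L_{bbox}(bbox'_i,\, bbox^*_i)
+ \frac{\lambda_1}{N_{pos}} \sum_{i} q_{c^*,i}\, L_{bbox}(bbox_i,\, bbox^*_i)
$$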
where bbox'_i and bbox_i denote the initial and refined predicted bounding boxes, respectively, and the training target q_{c*,i} is used to weight the box losses so that high-quality samples contribute more.
Inference: the image goes through a single forward pass, and NMS then removes redundant boxes.
4. Experiment
Training details: the initial learning rate is 0.01 with a linear warmup strategy (warmup ratio 0.1); training uses 8 V100 GPUs with a total batch size of 16. The maximum input image size is 1333x800, and only horizontal flipping is used for data augmentation.
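For readers unfamiliar with the terminology, a sketch of what "linear warmup with ratio 0.1" usually means; the number of warmup iterations is not stated here, so it is left as an assumed parameter:

```python
def warmup_lr(iteration, base_lr=0.01, warmup_ratio=0.1, warmup_iters=500):
    """Linear warmup: ramp the learning rate from base_lr * warmup_ratio up to
    base_lr over the first warmup_iters iterations, then keep it at base_lr
    (later decay steps are not modeled here)."""
    if iteration >= warmup_iters:
        return base_lr
    progress = iteration / warmup_iters
    return base_lr * (warmup_ratio + (1.0 - warmup_ratio) * progress)
```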
4.1 Ablation experiments
4.1.1 Varifocal Loss
Table 2 shows the effect of different hyperparameters (α, γ) and of the loss weighting; the best result is obtained with γ = 2 and α = 0.75.
4.1.2 Contribution of each component
The impact of each component is shown in Table 3:
4.2 Comparison with state-of-the-art methods
4.3 Generality and benefits
We also compare with GFL and apply VFL to other detectors to see whether it brings gains:
After reading the above, do you have a better understanding of how VarifocalNet ranks candidate boxes with an optimal scheme? Thank you for your support.