How the keypoint-based CornerNet detects objects by locating corners

Updated 2025-04-15. From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/01 report.

This article explains how CornerNet, a keypoint-based detector, performs object detection by locating corners. The editor finds the approach very practical and shares it here in the hope that you take something useful away from it. Let's take a look.

CornerNet detects an object by detecting a pair of corners, and its accuracy is comparable to the SOTA detection models of its time. Borrowing ideas from human pose estimation, CornerNet opened up a new framework in object detection, and many later papers have extended corner-based detection on the basis of CornerNet.

Introduction

Most object detection algorithms rely on anchor boxes. The paper argues that anchor boxes have two disadvantages: 1) a large number of anchor boxes must be tiled over the feature map to avoid missed detections, but only a small fraction of them are used in the end, which causes an imbalance between positive and negative samples and slows down training; 2) anchor boxes introduce additional hyperparameters and special network design, which makes model training more complicated.

Based on these considerations, the paper proposes CornerNet, which formulates object detection as the detection of the top-left and bottom-right corners of a bounding box. The network structure is shown in figure 1. A convolutional network predicts one group of heatmaps for top-left corners and one for bottom-right corners, and the two groups are then combined to output the predicted boxes, which completely removes the need for anchor boxes. Experiments in the paper show that CornerNet performs on par with the mainstream algorithms of the time, establishing a new paradigm for object detection.

CornerNet Overview

In CornerNet, an object is detected as a pair of corners: the top-left and bottom-right corners of its bounding box. A convolutional network predicts two sets of heatmaps that represent the corner locations for the different object categories, one set for top-left corners and one for bottom-right corners. To match each top-left corner with its bottom-right corner, the network predicts an embedding vector for every corner, such that the distance between the embeddings of two corners belonging to the same object is small. In addition, the network predicts offsets that slightly adjust the corner locations.

The structure of CornerNet is shown in figure 4. An hourglass network serves as the backbone and is followed by two independent prediction modules, one for the top-left corners and one for the bottom-right corners. Each prediction module applies its own corner pooling and then outputs the heatmaps, embedding vectors, and offsets used for the final prediction.

Detecting Corners

Each predicted set of heatmaps has size $C \times H \times W$, where $C$ is the number of categories, excluding the background class. Each ground-truth (GT) corner corresponds to exactly one positive location, and all other locations are negative. During training, however, negative locations are not all penalized equally: the penalty is reduced for negative locations within a radius of the positive location. The main reason is that a pair of corners close to the positive locations can still produce a box with a sufficiently high IoU, as shown in figure 5.
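
To make the motivation concrete, here is a minimal sketch in plain Python. The box coordinates and the helper name `iou` are illustrative assumptions, not taken from the paper. It compares a GT box with a box whose two corners are each a few pixels off:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)


gt = (100.0, 100.0, 200.0, 200.0)       # hypothetical ground-truth box
shifted = (103.0, 102.0, 198.0, 203.0)  # both corners a few pixels off

overlap = iou(gt, shifted)
print(f"IoU with slightly shifted corners: {overlap:.3f}")
```

For this example the IoU stays above 0.9, so treating these near-miss corner locations as ordinary negatives would penalize predictions that still yield a good box.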

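The radius is set according to the size of the object so that a pair of points within the radius produces a box whose IoU with the GT box is at least $t$. Given the radius, the amount of penalty reduction follows the 2D Gaussian kernel $e^{-\frac{x^2 + y^2}{2\sigma^2}}$, where $x$ and $y$ are the distances relative to the positive location and $\sigma$ is 1/3 of the radius. Let $p_{cij}$ be the predicted score at position $(i,j)$ for category $c$, and let $y_{cij}$ be the ground-truth value after applying the Gaussian kernel. The paper designs a variant of focal loss:

$$L_{det} = \frac{-1}{N} \sum_{c=1}^{C} \sum_{i=1}^{H} \sum_{j=1}^{W} \begin{cases} (1 - p_{cij})^{\alpha} \log(p_{cij}) & \text{if } y_{cij} = 1 \\ (1 - y_{cij})^{\beta} (p_{cij})^{\alpha} \log(1 - p_{cij}) & \text{otherwise} \end{cases} \tag{1}$$

where $N$ is the number of objects in the image, and $\alpha$ and $\beta$ are hyperparameters that control the contribution of each location (the paper sets $\alpha = 2$ and $\beta = 4$).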
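
The penalty reduction and the focal loss variant described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: it handles one category map with a single corner, uses an integer radius, and the names `gaussian_target`, `corner_focal_loss`, and `make_pred` are hypothetical. The defaults `alpha=2` and `beta=4` are the values reported in the paper.

```python
import math


def gaussian_target(height, width, center, radius):
    """Ground-truth map for one corner: 1 at the corner, Gaussian decay inside the radius."""
    ci, cj = center
    sigma = radius / 3.0  # sigma is 1/3 of the radius
    target = [[0.0] * width for _ in range(height)]
    for i in range(max(0, ci - radius), min(height, ci + radius + 1)):
        for j in range(max(0, cj - radius), min(width, cj + radius + 1)):
            d2 = (i - ci) ** 2 + (j - cj) ** 2
            if d2 <= radius ** 2:
                target[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return target


def corner_focal_loss(pred, target, num_objects, alpha=2.0, beta=4.0):
    """Focal loss variant for a single category map; pred values must lie in (0, 1)."""
    total = 0.0
    for p_row, y_row in zip(pred, target):
        for p, y in zip(p_row, y_row):
            if y == 1.0:
                total += (1.0 - p) ** alpha * math.log(p)
            else:
                # (1 - y)^beta shrinks the penalty near the ground-truth corner.
                total += (1.0 - y) ** beta * p ** alpha * math.log(1.0 - p)
    return -total / num_objects


def make_pred(false_positive):
    """Hypothetical prediction: confident at the true corner plus one false positive."""
    pred = [[0.01] * 7 for _ in range(7)]
    pred[3][3] = 0.99
    i, j = false_positive
    pred[i][j] = 0.9
    return pred


target = gaussian_target(7, 7, center=(3, 3), radius=3)
loss_near = corner_focal_loss(make_pred((3, 4)), target, num_objects=1)
loss_far = corner_focal_loss(make_pred((0, 0)), target, num_objects=1)
print(loss_near, loss_far)
```

A confident false positive one pixel away from the true corner produces a much smaller loss than the same false positive far outside the radius, which is the intended effect of the penalty reduction.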

Because of the downsampling layers, a location $(x, y)$ in the original image is usually mapped to $(\lfloor \frac{x}{n} \rfloor, \lfloor \frac{y}{n} \rfloor)$ on the feature map, where $n$ is the downsampling factor. When a location in the heatmap is mapped back to the original image, some precision is lost, which can greatly affect the IoU of small objects. To solve this problem, the paper predicts offsets that slightly adjust the corner location before the heatmap location is mapped back to the original image:

$$o_k = \left( \frac{x_k}{n} - \left\lfloor \frac{x_k}{n} \right\rfloor, \frac{y_k}{n} - \left\lfloor \frac{y_k}{n} \right\rfloor \right) \tag{2}$$

$o_k$ is the offset, and $x_k$ and $y_k$ are the coordinates of corner $k$. Note that the network predicts one set of offsets for the top-left corners and another for the bottom-right corners, and the offsets are shared across categories. During training, a smooth L1 loss is applied at the positive locations to train the corner offsets:

$$L_{off} = \frac{1}{N} \sum_{k=1}^{N} \text{SmoothL1Loss}(o_k, \hat{o}_k) \tag{3}$$
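
As a small worked example of the offset idea, the following plain-Python sketch computes the offset for one corner and uses it to recover the original location. The downsampling factor `n = 4`, the corner coordinates, and the function names are illustrative assumptions.

```python
import math


def corner_offset(x, y, n):
    """Fractional part lost when (x, y) is mapped to a feature map downsampled by n."""
    return (x / n - math.floor(x / n), y / n - math.floor(y / n))


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss summed over the two coordinates of one offset."""
    loss = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        loss += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return loss


def recover_location(fx, fy, offset, n):
    """Map a feature-map location plus its offset back to image coordinates."""
    return ((fx + offset[0]) * n, (fy + offset[1]) * n)


n = 4                       # assumed downsampling factor
x, y = 103, 58              # hypothetical corner in the original image
fx, fy = x // n, y // n     # location on the feature map: (25, 14)

offset = corner_offset(x, y, n)                   # (0.75, 0.5)
without = recover_location(fx, fy, (0.0, 0.0), n) # (100.0, 56.0)
with_offset = recover_location(fx, fy, offset, n) # (103.0, 58.0)
loss = smooth_l1((0.5, 0.5), offset)              # loss for an imperfect prediction
```

Without the offset the corner lands at (100, 56), up to `n - 1` pixels away from the true location in each direction; with the offset the original coordinates are recovered.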

Grouping Corners

When an image contains multiple objects, the predicted top-left and bottom-right corners must be matched up to form complete boxes. The paper follows a strategy from human pose estimation: the network predicts a one-dimensional embedding for each corner and groups corners according to the distance between their embeddings. Let $e_{t_k}$ be the embedding of the top-left corner of object $k$ and $e_{b_k}$ the embedding of its bottom-right corner. A pull loss groups the corners of the same object, and a push loss separates the corners of different objects:

$$L_{pull} = \frac{1}{N} \sum_{k=1}^{N} \left[ (e_{t_k} - e_k)^2 + (e_{b_k} - e_k)^2 \right] \tag{4}$$

$$L_{push} = \frac{1}{N(N-1)} \sum_{k=1}^{N} \sum_{\substack{j=1 \\ j \neq k}}^{N} \max(0, \Delta - |e_k - e_j|) \tag{5}$$

$e_k$ is the average of $e_{t_k}$ and $e_{b_k}$, and $\Delta = 1$. As with the offset loss, the pull and push losses are applied only at the positive (GT corner) locations.
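
The grouping idea can be sketched in plain Python as follows. The embedding values are made up for illustration, and `pair_corners` is a deliberately simplified nearest-embedding matcher (an assumption for this sketch, not the paper's full test-time procedure).

```python
def pull_push_losses(tl_embeds, br_embeds, delta=1.0):
    """tl_embeds[k] and br_embeds[k] are the scalar embeddings of object k's two corners."""
    n = len(tl_embeds)
    centers = [(t + b) / 2.0 for t, b in zip(tl_embeds, br_embeds)]
    # Pull: both corners of an object should sit close to their shared mean.
    pull = sum((t - c) ** 2 + (b - c) ** 2
               for t, b, c in zip(tl_embeds, br_embeds, centers)) / n
    # Push: means of different objects should be at least delta apart.
    push = 0.0
    if n > 1:
        for k in range(n):
            for j in range(n):
                if j != k:
                    push += max(0.0, delta - abs(centers[k] - centers[j]))
        push /= n * (n - 1)
    return pull, push


def pair_corners(tl_embeds, br_embeds):
    """For each top-left corner, the index of the bottom-right corner with the closest embedding."""
    return [min(range(len(br_embeds)), key=lambda j: abs(t - br_embeds[j]))
            for t in tl_embeds]


# Two well-separated objects: small pull loss, zero push loss.
pull_ok, push_ok = pull_push_losses([0.1, 2.0], [0.3, 2.2])

# Two objects whose embeddings are too close together: the push loss becomes positive.
pull_bad, push_bad = pull_push_losses([0.1, 0.4], [0.3, 0.6])

# Bottom-right corners listed in a different order are still matched correctly.
matches = pair_corners([0.1, 2.0], [2.2, 0.3])
```

Only the distances between embeddings matter here; their absolute values carry no meaning.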

Corner Pooling

The location of a corner usually carries no local evidence of the object. To decide whether a pixel is a top-left corner, we need to look horizontally to the right for the topmost boundary of the object and vertically downward for its leftmost boundary. Based on this prior knowledge, the paper proposes corner pooling to localize corners.

Suppose we want to decide whether position $(i,j)$ is a top-left corner. Let $f_t$ and $f_l$ be the feature maps that are input to the top-left corner pooling layer, and let $f_{t_{ij}}$ and $f_{l_{ij}}$ be the feature vectors at position $(i,j)$ in $f_t$ and $f_l$. The feature maps have size $H \times W$, with $i$ indexing rows and $j$ indexing columns. Corner pooling first max-pools all feature vectors between $(i,j)$ and $(H,j)$ in $f_t$ into a vector $t_{ij}$, then max-pools all feature vectors between $(i,j)$ and $(i,W)$ in $f_l$ into a vector $l_{ij}$, and finally adds $t_{ij}$ and $l_{ij}$. The complete computation can be expressed as:

$$t_{ij} = \begin{cases} \max(f_{t_{ij}}, t_{(i+1)j}) & \text{if } i < H \\ f_{t_{Hj}} & \text{otherwise} \end{cases} \tag{6}$$

$$l_{ij} = \begin{cases} \max(f_{l_{ij}}, l_{i(j+1)}) & \text{if } j < W \\ f_{l_{iW}} & \text{otherwise} \end{cases} \tag{7}$$

The max in Equations 6 and 7 is element-wise.

In practice, Equations 6 and 7 can be computed efficiently over the whole feature map, as shown in figure 6, in a way that resembles dynamic programming. For top-left corner pooling, the input feature maps are scanned from right to left for the horizontal max-pooling and from bottom to top for the vertical max-pooling. Each position only needs an element-wise max with the output of the previously visited position, and finally the two pooled feature maps are added directly.
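
The scan described above can be sketched in plain Python for a single channel. This is a minimal illustration, assuming row-major lists where row 0 is the top of the image; the function name and the toy feature map are hypothetical.

```python
def top_left_corner_pool(f_t, f_l):
    """Top-left corner pooling for one channel; f_t and f_l are H x W lists of numbers."""
    h, w = len(f_t), len(f_t[0])
    t = [row[:] for row in f_t]
    l = [row[:] for row in f_l]
    # Vertical pass, bottom to top: t[i][j] becomes the max of f_t[i..h-1][j].
    for i in range(h - 2, -1, -1):
        for j in range(w):
            t[i][j] = max(f_t[i][j], t[i + 1][j])
    # Horizontal pass, right to left: l[i][j] becomes the max of f_l[i][j..w-1].
    for i in range(h):
        for j in range(w - 2, -1, -1):
            l[i][j] = max(f_l[i][j], l[i][j + 1])
    # Add the two pooled maps position by position.
    return [[t[i][j] + l[i][j] for j in range(w)] for i in range(h)]


features = [
    [0, 0, 0],
    [0, 2, 0],
    [0, 0, 3],
]
pooled = top_left_corner_pool(features, features)
# pooled == [[0, 2, 3], [2, 4, 3], [3, 3, 6]]
```

Each output position reuses the result of its lower or right neighbor, so the whole map is computed in one pass per direction instead of rescanning a full row and column for every position.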

The complete prediction module, shown in figure 7, is essentially a modified residual block in which the first $3 \times 3$ convolution module is replaced with a corner pooling module. It finally outputs the heatmaps, embedding vectors, and offsets.

Hourglass Network

CornerNet uses an hourglass network, originally introduced for human pose estimation, as its backbone. The hourglass module, shown in figure 3, first downsamples the features, then upsamples them back to the original resolution, and adds skip connections to restore the details lost during downsampling. The hourglass network used in the paper stacks two hourglass modules, with the following modifications.

Downsampling uses convolutions with stride=2 instead of max pooling layers

Feature resolution is reduced five times, with the number of channels increasing along the way (256, 384, 384, 384, 512)

Upsampling uses two residual modules followed by nearest-neighbor upsampling

Each skip connection contains 2 residual modules

Before the hourglass modules, the image resolution is reduced by a factor of 4 using a $7 \times 7$ convolution module with stride=2 and channel=128, followed by a residual module with stride=2 and channel=256

As in the original hourglass network, intermediate supervision is added during training. However, the intermediate predictions are not added back into the network, because the paper found that doing so hurts performance.

Experiments

The paper compares the results with and without corner pooling.

It compares the effect of reducing the penalty on negative locations.

It compares the effect of combining the hourglass network with corner detection.

It compares the contributions of the heatmap and offset predictions.

Finally, CornerNet is compared with other types of detection networks.

That is how the keypoint-based CornerNet performs object detection by locating corners. Some of these ideas may come up in day-to-day work, and I hope this article helps you learn more about them.
