Recently, a paper from Alibaba's DAMO Academy was accepted to CVPR 2020, a top computer vision conference. The paper proposes a general-purpose, high-performance detector for autonomous driving that, for the first time, achieves both high accuracy and high speed in 3D object detection, effectively improving the safety of autonomous driving systems.
3D object detection must output each object's category together with its position, length, width, height, and rotation angle in 3D space.
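For concreteness, one common way to parameterize such an output is a 7-parameter box plus a class label, as in the minimal sketch below (field names are illustrative, not from the paper):

```python
# A minimal sketch of the usual 7-parameter 3D box encoding described above.
# Field names are illustrative, not taken from the paper's code.
from dataclasses import dataclass

@dataclass
class Box3D:
    x: float; y: float; z: float   # box center in 3D space (meters)
    l: float; w: float; h: float   # length, width, height (meters)
    yaw: float                     # rotation around the vertical axis (radians)
    label: str                     # object category, e.g. "Car"

box = Box3D(x=12.3, y=-1.8, z=0.9, l=4.2, w=1.7, h=1.5, yaw=0.31, label="Car")
```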
Unlike ordinary 2D image recognition applications, an autonomous driving system demands both higher accuracy and higher speed: the detector must not only quickly identify objects in the surrounding environment, but also precisely localize them in three-dimensional space. However, current mainstream single-stage and two-stage detectors cannot balance detection accuracy and speed, which greatly limits the safety of autonomous driving.
This time, DAMO Academy proposes a new idea: bringing the fine-grained feature representation of two-stage detectors into a single-stage detector. Specifically, during training, an auxiliary network converts the voxel features of the single-stage detector into point-level features and applies auxiliary supervision signals. At inference time, the auxiliary network is detached and adds no computation, so the method improves detection accuracy while preserving speed.
The following is first author Chenhang He's interpretation of the paper:
1. Background
Object detection is a classic task in computer vision. Unlike image recognition, object detection must not only recognize the objects in an image and assign each a category, but also localize each object with a bounding box. Detection tasks are commonly grouped by their output: detection on RGB images that outputs object categories and 2D bounding boxes on the image plane is called 2D object detection, while detection from RGB images, RGB-D depth images, or LiDAR point clouds that outputs each object's category together with its position, length, width, height, and rotation angle in three-dimensional space is called 3D object detection.
3D object detection from point cloud data is a key component of autonomous vehicle (AV) systems. Unlike ordinary 2D object detection, which estimates 2D bounding boxes only on the image plane, an AV must estimate more informative 3D bounding boxes in the real world to accomplish higher-level tasks such as path planning and collision avoidance. This has motivated the recent emergence of 3D object detection methods that apply convolutional neural networks (CNNs) to point cloud data from high-end LiDAR sensors.
At present, there are two main architectures for point-cloud-based 3D object detection (a voxelization sketch follows below).

Single-stage detectors: the point cloud is encoded into voxel features, and a 3D CNN directly predicts object boxes. These are fast, but because the point cloud loses its structure inside the CNN, the network's perception of object structure is weaker and accuracy is slightly lower.

Two-stage detectors: PointNet first extracts point-level features, and candidate regions then pool features from the point cloud to obtain fine-grained representations. These usually achieve higher accuracy but are slower.
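As a concrete illustration of the single-stage pipeline's first step, here is a minimal voxelization sketch; the grid range and voxel size are illustrative values, not the paper's configuration:

```python
# A minimal sketch of voxelizing a LiDAR point cloud, the first step of a
# single-stage detector as described above. Ranges and sizes are illustrative.
import numpy as np

def voxelize(points, voxel_size=(0.05, 0.05, 0.1),
             pc_range=(0.0, -40.0, -3.0, 70.4, 40.0, 1.0)):
    """Map (N, 3) xyz points to integer voxel coordinates, dropping out-of-range points."""
    points = np.asarray(points, dtype=np.float32)
    lo = np.array(pc_range[:3], dtype=np.float32)
    hi = np.array(pc_range[3:], dtype=np.float32)
    mask = np.all((points >= lo) & (points < hi), axis=1)
    coords = ((points[mask] - lo) / np.array(voxel_size)).astype(np.int32)
    # Keep unique voxels; a real detector would also aggregate per-voxel features here.
    voxels, inverse = np.unique(coords, axis=0, return_inverse=True)
    return voxels, inverse

pts = np.random.rand(1000, 3) * [70.4, 80.0, 4.0] + [0.0, -40.0, -3.0]
voxels, _ = voxelize(pts)
print(voxels.shape)  # (num_nonempty_voxels, 3)
```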
2. Method
At present, industry mainly uses single-stage detectors, which can run efficiently on real-time systems. Our proposed scheme transplants the fine-grained feature description of two-stage detectors into single-stage detection: during training, an auxiliary network converts the voxel features of the single-stage detector into point-level features and applies supervision signals, so that the convolutional features also become structure-aware, improving detection accuracy. At inference time, the auxiliary network is detached and does not participate in computation, preserving the efficiency of the single-stage detector (see the sketch below). In addition, we propose an engineering improvement, Part-sensitive Warping (PSWarp), to handle the box/confidence mismatch problem in single-stage detectors.
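A minimal PyTorch-style sketch of this train-only auxiliary branch is below; module names and shapes are illustrative, not the paper's released code. The key point is that the auxiliary head runs only when `self.training` is true, so inference cost is unchanged:

```python
# A minimal sketch (not the paper's code) of a train-only auxiliary head:
# it consumes intermediate features during training but is skipped at
# inference, so deployment cost is unchanged. Shapes are illustrative.
import torch
import torch.nn as nn

class Detector(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(64, 128), nn.ReLU())
        self.det_head = nn.Linear(128, 7)   # box regression head
        self.aux_head = nn.Linear(128, 2)   # train-only: point-wise fg/bg logits

    def forward(self, x):
        feats = self.backbone(x)
        boxes = self.det_head(feats)
        if self.training:                   # auxiliary branch only in training
            return boxes, self.aux_head(feats)
        return boxes                        # inference: auxiliary head never runs

model = Detector().eval()
out = model(torch.randn(4, 64))             # only boxes, no auxiliary output
```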
Main network
The network for deployment, i.e. the inference network, consists of a backbone and a detection head. The backbone is implemented with 3D sparse convolutions and extracts voxel features with rich semantics. The detection head compresses the voxel features into a bird's-eye-view (BEV) representation and runs a 2D fully convolutional network on it to predict 3D object boxes.
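The sketch below illustrates this kind of detection head under assumed shapes: 3D voxel features are collapsed along the height axis into a BEV map, and a small 2D fully convolutional head predicts per-anchor scores and 7-parameter box offsets. Channel counts and anchor numbers are illustrative:

```python
# A rough sketch of a BEV detection head as described above; all sizes are
# illustrative, not the paper's configuration.
import torch
import torch.nn as nn

class BEVHead(nn.Module):
    def __init__(self, c3d=64, depth=8, num_anchors=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c3d * depth, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
        )
        self.cls = nn.Conv2d(128, num_anchors, 1)        # objectness per anchor
        self.reg = nn.Conv2d(128, num_anchors * 7, 1)    # (x,y,z,l,w,h,yaw) per anchor

    def forward(self, voxel_feat):                       # (B, C, D, H, W)
        b, c, d, h, w = voxel_feat.shape
        bev = voxel_feat.reshape(b, c * d, h, w)         # collapse height into channels
        f = self.conv(bev)
        return self.cls(f), self.reg(f)

head = BEVHead()
scores, boxes = head(torch.randn(1, 64, 8, 200, 176))
```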
Auxiliary network
In the training phase, we propose an auxiliary network that extracts convolutional features from intermediate layers of the backbone and converts them into point-level features (point-wise features). In the implementation, we map the non-zero entries of the convolutional feature maps back to the original point cloud space and interpolate at each raw point, which yields a point-level representation of the convolutional features. Let $\{(f_j, p_j): j = 1, \dots, M\}$ denote the convolutional features and their locations in space, and $\{p_i: i = 1, \dots, N\}$ the original point cloud. The convolutional feature at an original point $p_i$ is the inverse-distance-weighted average of its neighboring features:

$$\hat{f}_i = \frac{\sum_j w_j f_j}{\sum_j w_j}, \qquad w_j = \begin{cases} \|p_j - p_i\|^{-1}, & \|p_j - p_i\| < r \\ 0, & \text{otherwise.} \end{cases}$$
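A minimal implementation of this interpolation, assuming dense tensors for clarity (the actual backbone features are sparse), might look like:

```python
# A minimal sketch of the interpolation above: each raw point gathers nearby
# feature locations with inverse-distance weights inside radius r.
# Illustrative implementation, not the paper's released code.
import torch

def interpolate_to_points(points, feat_xyz, feat, r=1.0, eps=1e-8):
    """points: (N, 3); feat_xyz: (M, 3) feature locations; feat: (M, C)."""
    dist = torch.cdist(points, feat_xyz)            # (N, M) pairwise distances
    w = (dist < r).float() / dist.clamp(min=eps)    # inverse-distance, zero outside radius
    w = w / w.sum(dim=1, keepdim=True).clamp(min=eps)
    return w @ feat                                  # (N, C) point-wise features

pts = torch.randn(100, 3)
locs = torch.randn(50, 3)
feat = torch.randn(50, 16)
point_feat = interpolate_to_points(pts, locs, feat)  # (100, 16)
```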
Auxiliary tasks
Based on the point-level features, we propose two supervision tasks to help the convolutional features acquire good structure awareness: a foreground segmentation task and a center-point regression task.
Specifically, compared with a PointNet feature extractor (a), the convolution and downsampling operations in a convolutional network damage the point cloud structure (b), making the features insensitive to object boundaries and internal structure. We use the segmentation task to ensure that foreground convolutional features are not contaminated by background features during downsampling, enhancing boundary awareness; and we use the center-point regression task to enhance the convolutional features' perception of object internal structure (d), so that an object's size and shape can be reasonably inferred even from few points. We optimize the segmentation task with focal loss and the center regression task with smooth-L1 loss, as sketched below.
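A sketch of the two losses, assuming point-wise segmentation logits and center offsets as inputs (hyperparameters and shapes are illustrative):

```python
# A sketch of the two auxiliary losses named above: focal loss on point-wise
# foreground/background logits, and smooth-L1 on predicted offsets to the
# object center, applied to foreground points only. Values are illustrative.
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Binary focal loss; logits and target are (N,), target in {0., 1.}."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = p * target + (1 - p) * (1 - target)
    a_t = alpha * target + (1 - alpha) * (1 - target)
    return (a_t * (1 - p_t) ** gamma * ce).mean()

def aux_losses(seg_logits, seg_target, ctr_pred, ctr_target):
    seg = focal_loss(seg_logits, seg_target)
    fg = seg_target > 0                     # regress centers on foreground points only
    ctr = F.smooth_l1_loss(ctr_pred[fg], ctr_target[fg]) if fg.any() else ctr_pred.sum() * 0
    return seg + ctr
```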
3. Engineering improvement
In single-stage detection, misalignment between the feature map and the anchors is a common problem. It causes a mismatch between a predicted box's localization quality and its confidence, so that in the post-processing stage (NMS), boxes with high confidence but poor localization are kept while boxes with good localization but low confidence are discarded. In two-stage object detection, the RPN extracts proposals and features are then pooled at the corresponding positions on the feature map (RoI pooling or RoI Align), so the new features are aligned with their proposals. We propose an improvement based on PSRoIAlign, Part-sensitive Warping (PSWarp), to rescore the predicted boxes.
As shown in the figure above, we first modify the final classification layer to generate K part-sensitive feature maps, denoted $\{X^k: k = 1, 2, \dots, K\}$, each encoding information about a specific part of the object. For example, with K = 4, the maps correspond to {top-left, top-right, bottom-left, bottom-right}. At the same time, we divide each predicted bounding box into K sub-windows and take the center of each sub-window as a sampling point. This yields K sampling grids $\{S^k: k = 1, 2, \dots, K\}$, each associated with its corresponding part-sensitive feature map. As shown in the figure, a sampler uses each generated grid to sample its part-sensitive feature map, producing aligned feature maps. The final confidence map is the average of the K aligned feature maps.
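The sketch below illustrates this rescoring step with `F.grid_sample`, assuming the K part-sensitive maps and per-box sub-window centers (in normalized coordinates) are already computed; shapes are illustrative:

```python
# A rough sketch of Part-sensitive Warping as described above: K part-sensitive
# score maps are sampled at the centers of K sub-windows of each predicted box,
# and the rescored confidence is the mean of the K samples. Not the paper's
# released code; grid coordinates are normalized to [-1, 1] for F.grid_sample.
import torch
import torch.nn.functional as F

def pswarp_rescore(part_maps, sample_xy):
    """part_maps: (B, K, H, W); sample_xy: (B, N, K, 2) in [-1, 1], one sampling
    point per part per box. Returns (B, N) rescored confidences."""
    B, K, H, W = part_maps.shape
    N = sample_xy.shape[1]
    scores = []
    for k in range(K):                                  # sample map k at its own grid
        grid = sample_xy[:, :, k, :].unsqueeze(2)       # (B, N, 1, 2)
        s = F.grid_sample(part_maps[:, k:k + 1], grid, align_corners=False)
        scores.append(s.view(B, N))
    return torch.stack(scores, dim=-1).mean(dim=-1)    # average over the K parts

maps = torch.randn(1, 4, 50, 44)                       # K = 4: {tl, tr, bl, br}
pts = torch.rand(1, 10, 4, 2) * 2 - 1                  # 10 boxes, 4 sub-window centers
conf = pswarp_rescore(maps, pts)                       # (1, 10)
```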
4. Results
The figure shows PR curves on the KITTI benchmark, with our proposed method in black; solid lines are two-stage methods and dotted lines are single-stage methods. As a single-stage method, ours reaches accuracy previously attainable only by two-stage methods.
Results on the KITTI bird's-eye-view (BEV) and 3D test sets. The advantage of our method is that it maintains this accuracy while running at 25 FPS, with no extra computation at inference.
About the authors:
The first author, Chenhang He, is a research intern at DAMO Academy. The other authors are Hua Xiansheng (Senior Fellow, IEEE Fellow); Zhang Lei (senior researcher, Department of Computing, The Hong Kong Polytechnic University, IEEE Fellow); senior algorithm expert Huang Jianqiang; and DAMO Academy research intern Hui Zeng.
Original link: https://developer.aliyun.com/article/752688