2019-12-24 11:40:47
Selected from arXiv
Authors: Niall O'Mahony et al. Compiled by Machine Heart (participation: devil, Zhang Qian).
After the rise of deep learning, have traditional computer vision methods become obsolete?
Paper link: https://arxiv.org/ftp/arxiv/papers/1910/1910.13796.pdf
Deep learning has expanded the boundaries of digital image processing. However, this does not mean that traditional computer vision techniques, which developed continuously before the rise of deep learning, have been made obsolete. Recently, researchers from the Institute of Technology Tralee in Ireland published a paper analyzing the advantages and disadvantages of the two approaches.
This paper aims to promote a discussion of whether knowledge of classical computer vision techniques should be retained. It also discusses how traditional computer vision can be combined with deep learning, and surveys several recent hybrid methods that both improve computer vision performance and tackle problems ill-suited to deep learning. For example, combining traditional computer vision techniques with deep learning has become popular in emerging fields such as panoramic vision and 3D vision, where deep learning models have not yet been fully optimized.
Deep Learning vs. Traditional Computer Vision
The advantages of deep learning
The rapid development of deep learning and improvements in device capabilities (computing power, memory capacity, energy consumption, image sensor resolution, and optics) have improved the performance and cost-effectiveness of vision applications and further accelerated their spread. Compared with traditional CV techniques, deep learning helps CV engineers achieve higher accuracy in image classification, semantic segmentation, object detection, and simultaneous localization and mapping (SLAM). Because the neural networks used in deep learning are trained rather than programmed, applications using this approach require less expert analysis and fine-tuning and can exploit the huge amount of video data available in today's systems. Deep learning is also extremely flexible: CNN models and frameworks can be retrained on a custom dataset for any use case, unlike CV algorithms, which tend to be more domain-specific.
Taking the object detection problem for mobile robots as an example, the two kinds of computer vision approaches can be compared as follows.
Traditional computer vision approaches tackle object detection with mature CV techniques such as feature descriptors (SIFT, SURF, BRIEF, etc.). Before the rise of deep learning, tasks such as image classification required a feature extraction step: finding "interesting", descriptive, or informative small patches in the image. This step may involve various CV algorithms, such as edge detection, corner detection, or threshold segmentation. Once enough features have been extracted from the images, they can form the definition of each object category (a bag of words). In the deployment phase, these definitions are searched for in other images: if most of the features in one image's bag of words are found in another image, that image contains the same object (e.g. a chair or a horse).
The drawback of the traditional CV approach is that selecting the important features in each image is a necessary step, and as the number of categories grows, feature extraction becomes more and more cumbersome. Deciding which features best describe the different object categories depends on the judgment and long trial-and-error of CV engineers. In addition, each feature definition involves a large number of parameters, all of which must be tuned by the engineer.
Deep learning introduced the concept of end-to-end learning: each image in the dataset given to the machine is labeled with its object category. A deep learning model is then "trained" on this data, with the neural network discovering the underlying patterns in each image category and automatically extracting the most descriptive and salient features for each object class. DNNs are generally considered to perform much better than traditional algorithms, albeit at a cost in compute requirements and training time. With the best methods in CV now using deep learning, the workflow of CV engineers has changed dramatically: the knowledge and expertise required for hand-crafted feature extraction has been replaced by the knowledge and expertise required to iterate over deep learning architectures (see Figure 1).
Figure 1: a) Traditional computer vision workflow vs. b) deep learning workflow. (Figure source: [8])
In recent years, the development of CNNs has had a huge impact on the CV field, greatly improving object recognition. This explosion is closely tied to increases in computing power and in the amount of training data. The recent boom of deep neural network architectures in CV can be seen from the more than 3,000 citations of "ImageNet Classification with Deep Convolutional Neural Networks" noted in the paper.
A CNN uses convolution kernels (also called filters) to detect features (such as edges) in an image. A kernel is a matrix of weights trained to detect a specific feature. As the name suggests, the main idea of a CNN is to spatially convolve the kernel over a given input image and check whether the feature it detects is present. To produce a value representing how confident it is that a feature is present, the network performs a convolution operation, i.e. it computes the dot product of the kernel with the region of the input image it currently overlaps (the image region being viewed by the kernel is called its receptive field).
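As a minimal sketch of this operation, the following slides a hypothetical 3x3 edge-detection kernel over a toy NumPy image; the kernel weights and the image are illustrative, not taken from the paper:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image`; each output value is the dot product
    of the kernel with the receptive field it currently overlaps."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            receptive_field = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(receptive_field * kernel)
    return out

# A vertical-edge kernel (Sobel-like weights): it responds strongly
# where intensity changes from left to right.
edge_kernel = np.array([[1., 0., -1.],
                        [2., 0., -2.],
                        [1., 0., -1.]])

img = np.zeros((5, 6))
img[:, 3:] = 1.0              # right half bright -> one vertical edge
response = conv2d(img, edge_kernel)
```

The response is large in magnitude only where the kernel's receptive field straddles the edge, and zero over the flat regions.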
To facilitate the learning of kernel weights, a bias term is added to the convolution layer's output, which is then fed into a nonlinear activation function. Activation functions are usually nonlinear, e.g. Sigmoid, TanH, and ReLU, and the choice depends on the nature of the data and the classification task. For example, ReLU is considered closer to biological reality (a neuron in the brain is either active or not). In image recognition tasks, ReLU tends to give better results because it is more resistant to the vanishing gradient problem and yields sparser, more efficient representations.
To speed up training and reduce the network's memory consumption, a convolution layer is usually followed by a pooling layer, which removes redundancy from the input features. For example, max pooling slides a window over the input and outputs only the maximum value inside the window, efficiently discarding redundant parts of the image while keeping the important pixels. As shown in Figure 2, a deep CNN may have several pairs of convolution and pooling layers. Finally, a fully connected layer flattens the previous layer into a feature vector, and an output layer computes scores (confidences or probabilities) for the output categories. These scores are passed through a function such as Softmax, which maps them to a probability vector whose elements sum to 1.
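The pooling and score-normalization steps described above can be sketched in a few lines of NumPy; the window size and the raw scores are illustrative:

```python
import numpy as np

def max_pool(x, size=2):
    """Non-overlapping max pooling: keep only the largest activation in
    each size x size window, discarding redundant detail."""
    h, w = x.shape
    h2, w2 = h // size, w // size
    return x[:h2 * size, :w2 * size].reshape(h2, size, w2, size).max(axis=(1, 3))

def softmax(scores):
    """Map raw class scores to a probability vector that sums to 1."""
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()

feat = np.array([[1., 3., 0., 2.],
                 [4., 2., 1., 0.],
                 [0., 1., 5., 2.],
                 [2., 0., 3., 1.]])
pooled = max_pool(feat)                  # one max per 2x2 window
probs = softmax(np.array([2.0, 1.0, 0.1]))
```

Pooling here shrinks the 4x4 feature map to 2x2, and Softmax turns arbitrary scores into class probabilities whose order matches the order of the scores.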
Figure 2: CNN building blocks. (Figure source: [13])
Still, deep learning is just one tool in the CV field. The most commonly used network in CV is the CNN, and convolution itself has long been a staple of classical image processing. The advantages of deep learning are clear, and this article will not dwell on the current state-of-the-art algorithms. But deep learning is not a panacea for every problem; the following sections introduce problems and applications better suited to traditional CV algorithms.
Advantages of traditional CV technology
This section details the reasons why traditional feature-based methods can effectively improve performance in CV tasks. These traditional methods include:
Scale-Invariant Feature Transform (SIFT) [14]
Speeded-Up Robust Features (SURF) [15]
Features from Accelerated Segment Test (FAST) [16]
Hough transform [17]
Geometric hashing [18]
Feature descriptors such as SIFT and SURF are usually combined with traditional machine learning classifiers, such as support vector machines and k-nearest neighbors, to solve CV problems.
Deep learning can sometimes be overkill: traditional CV techniques can often solve a problem more efficiently and in fewer lines of code. Algorithms such as SIFT, and even simpler ones such as color thresholding and pixel counting, are not category-specific; they are general algorithms that perform the same operation on any image. In contrast, the features learned by a deep neural network are specific to the training data: if the training dataset is poorly constructed, the network will perform badly on images outside it.
Therefore, algorithms such as SIFT are usually used in applications like image stitching and 3D mesh reconstruction, which require no category-specific knowledge. These tasks could also be accomplished by training on large datasets, but that demands enormous research effort, which is impractical for a closed application. Faced with a CV application, engineers need to develop a sense of which solution to choose. For example, to classify two kinds of products on a conveyor belt, one red and one blue, a deep neural network would first need sufficient training data to be collected, yet a simple color thresholding method achieves the same effect. Some problems can be solved with simpler, faster techniques.
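A toy version of such a color-threshold classifier might look like this; the 30-level channel margin is an arbitrary illustrative choice, used only to ignore near-neutral pixels:

```python
import numpy as np

def classify_by_color(rgb_image):
    """Count red-dominant vs. blue-dominant pixels and return the
    majority label -- no training data or network required."""
    r = rgb_image[..., 0].astype(int)
    b = rgb_image[..., 2].astype(int)
    red_pixels = np.count_nonzero(r > b + 30)    # margin skips neutral pixels
    blue_pixels = np.count_nonzero(b > r + 30)
    return "red" if red_pixels > blue_pixels else "blue"

# A synthetic 4x4 "product image" that is mostly red.
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[..., 0] = 200             # strong red channel everywhere
img[0, 0] = (10, 10, 220)     # one stray blue pixel
label = classify_by_color(img)
```

A handful of lines, no labeled dataset, and trivially cheap to run on an embedded controller, which is exactly the trade-off the text describes.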
And what if a DNN performs poorly on data outside its training set? When the training dataset is limited, the network may overfit and fail to generalize. Tuning its parameters by hand is very difficult because a DNN has millions of them with complex interrelations, which is why deep learning models are criticized as black boxes. Traditional CV techniques, by contrast, are fully transparent, so one can judge whether a solution will operate effectively outside the training environment. A CV engineer who understands the problems their algorithm can handle is able, when something goes wrong, to adjust its parameters so that it processes a wide range of images effectively.
Today, traditional CV techniques are often used to solve simple problems so that the solution can be deployed on low-cost microprocessors, or to limit the scope of what a deep learning model must handle, by highlighting specific features in the data, augmenting data, or assisting dataset annotation. Later in this article we discuss how many image transformation techniques can be used in neural network training. Finally, many more challenging CV problems, in robotics, augmented reality, automatic panorama stitching, virtual reality, 3D modeling, motion estimation, video stabilization, motion capture, video processing, and scene understanding, cannot easily be solved by deep learning, but they benefit from traditional CV techniques.
The Integration of traditional CV Technology and Deep Learning
Traditional CV + deep learning = better performance
There is a clear tradeoff between traditional CV technology and deep learning methods. The classical CV algorithm is mature, transparent, and optimized for performance and energy efficiency; deep learning provides better accuracy and versatility, but consumes more computing resources.
The hybrid method combines traditional CV technology and deep learning, and has the advantages of both methods. They are especially suitable for high-performance systems that need to be implemented quickly.
Mixing machine learning metrics with deep networks has become very popular because it yields better models. A hybrid vision-processing implementation can bring performance advantages: reported results show the number of multiply-accumulate operations reduced by roughly one to three orders of magnitude relative to a pure deep learning method, with a frame rate about 10 times higher. In addition, the hybrid approach uses only half the memory bandwidth of the deep learning method and consumes far fewer CPU resources.
Make full use of edge computing
When algorithms and neural network inference run on edge devices, latency, cost, cloud storage, and processing requirements are all lower than in cloud-based implementations. Edge computing also avoids transmitting sensitive or identifiable data over the network, offering better privacy and security.
Hybrid methods combining traditional CV and deep learning make full use of the heterogeneous computing power available on edge devices. A heterogeneous compute architecture, comprising CPUs, microcontroller co-processors, digital signal processors (DSPs), field-programmable gate arrays (FPGAs), and AI accelerators, reduces energy consumption by assigning each workload to the most efficient compute engine. Test results show that object detection latency is ten times lower when deep learning inference runs on a DSP rather than on a CPU.
A variety of hybrid methods have demonstrated their advantages in edge applications; for example, data from edge-node sensors can be fused efficiently using a hybrid approach.
Problems that are not suitable for deep learning
Some difficult CV problems, in robotics, augmented reality, automatic panorama stitching, virtual reality, 3D modeling, motion estimation, video stabilization, motion capture, video processing, and scene understanding, are hard to implement in a differentiable, end-to-end fashion with deep learning and call for other, "traditional" techniques.
The following sections introduce some emerging problems in CV where deep learning faces new challenges and classical CV techniques can play a greater role.
3D vision
The memory footprint of 3D inputs is much larger than that of conventional RGB images, and the convolution kernel must convolve through a 3D input space (see Figure 3).
Figure 3: 2D CNN vs. 3D CNN. (Figure source: [47])
As a result, the computational complexity of a 3D CNN grows cubically with resolution. 3D CV is also harder than 2D image processing because the extra dimension introduces more uncertainty, such as occlusion and varying camera angles (see Figure 4).
The next section covers solutions that handle various 3D data representations, using new architectures and preprocessing steps designed to address these challenges.
Geometric deep learning (GDL) extends deep learning techniques to 3D data. 3D data can be represented in a variety of ways, broadly divided into Euclidean and non-Euclidean. Euclidean-structured 3D data has an underlying grid structure that allows global parameterization, and it shares the same coordinate system as 2D images, so existing 2D deep learning paradigms and 2D CNNs can be applied to it. Euclidean 3D data is well suited to voxel-based analysis of simple rigid objects, such as chairs and airplanes. Non-Euclidean 3D data, on the other hand, has no grid-array structure and does not allow global parameterization, so extending classical deep learning techniques to such representations is very difficult; the recently proposed PointNet [52] addresses this problem.
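As an illustration of the Euclidean, voxel-based representation mentioned above, here is a minimal NumPy sketch that converts an unordered point cloud (a non-Euclidean representation) into a binary occupancy grid a 3D CNN could consume; the grid size and min-max normalization are illustrative choices:

```python
import numpy as np

def voxelize(points, grid=32):
    """Map an (N, 3) point cloud into a dense grid x grid x grid
    occupancy volume: a cell is 1.0 if any point falls inside it."""
    mins = points.min(axis=0)
    spans = points.max(axis=0) - mins
    spans[spans == 0] = 1.0                       # avoid division by zero
    idx = ((points - mins) / spans * (grid - 1)).astype(int)
    vox = np.zeros((grid, grid, grid), dtype=np.float32)
    vox[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0    # mark occupied cells
    return vox

cloud = np.random.rand(1000, 3)    # 1000 random points in the unit cube
vox = voxelize(cloud, grid=16)
```

Note how the conversion is lossy: points landing in the same cell collapse to a single occupied voxel, which is exactly the loss of continuous shape information the next paragraph discusses.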
Continuous shape information useful for object recognition is often lost in the conversion to a voxel representation. Using a traditional CV algorithm, [53] proposed one-dimensional features applicable to voxel CNNs. This new rotation-invariant feature, based on mean curvature, improves the shape recognition performance of voxel CNNs. Applied to the state-of-the-art voxel CNN architecture OctNet, the method achieved a 1% improvement in overall accuracy on the ModelNet10 dataset.
SLAM
Visual SLAM is a subset of SLAM that uses a vision system instead of lidar to register landmarks in a scene. Visual SLAM has the advantages of photogrammetry (rich visual data, low cost, light weight, and low energy consumption) without the heavy computational workload that post-processing usually requires. It comprises the steps of environment sensing, data matching, motion estimation, position updating, and registration of new landmarks.
Modeling visual objects under different conditions (e.g. 3D rotation, scaling, lighting) and extending their representation with powerful transfer learning techniques to achieve zero-/one-shot learning is a hard problem. Feature extraction and data representation methods can effectively reduce the number of training samples a machine learning model needs.
Image-based localization is often a two-step process: place recognition followed by pose estimation. Place recognition computes a global descriptor for each image by aggregating local image descriptors (e.g. SIFT) with a bag-of-words approach. Each global descriptor is stored in a database together with the camera pose of the 3D point cloud reference image that produced it. A similar global descriptor is extracted from the query image, and the closest global descriptor in the database can be retrieved by an efficient search; the camera pose of that closest descriptor gives a rough localization of the query image. Pose estimation then computes the exact pose of the query image more precisely, using algorithms such as Perspective-n-Point (PnP) [13] and geometric verification.
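The place-recognition step can be sketched as follows. This toy builds bag-of-words global descriptors from random 128-dimensional "SIFT-like" local descriptors against a small assumed vocabulary; in practice the vocabulary would be learned from data (e.g. by k-means), and the descriptors would come from a real feature extractor:

```python
import numpy as np

def bow_descriptor(local_descs, vocab):
    """Aggregate local descriptors into a normalized bag-of-words
    histogram: each descriptor votes for its nearest vocabulary word."""
    # distances from every local descriptor to every vocabulary word
    d = np.linalg.norm(local_descs[:, None, :] - vocab[None, :, :], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()            # normalize so images are comparable

rng = np.random.default_rng(0)
vocab = rng.random((8, 128))            # toy 8-word visual vocabulary
database = [bow_descriptor(rng.random((50, 128)), vocab) for _ in range(5)]
query = bow_descriptor(rng.random((50, 128)), vocab)

# retrieve the database image whose global descriptor is closest
best = min(range(len(database)),
           key=lambda i: np.linalg.norm(query - database[i]))
```

The retrieved index would then point to the stored camera pose that roughly localizes the query, before PnP refines it.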
The success of image-based place recognition is largely due to the ability to extract image feature descriptors. Unfortunately, no algorithm for local feature extraction on lidar scans performs comparably to SIFT. A 3D scene consists of 3D points and database images. One approach associates each 3D point with a set of SIFT descriptors describing the image features from which the point was triangulated; these descriptors are then averaged into a single SIFT descriptor describing the point's appearance.
Another approach constructs multimodal features from RGB-D data rather than processing depth alone. For the depth component, the researchers use surface-normal-based coloring because it is effective and robust across many tasks. A further alternative using traditional CV techniques proposes a graph-based hierarchical descriptor, the Force Histogram Decomposition (FHD), which captures the spatial relations and shape information between the pairwise structural subparts of an object. An advantage of this learning step is its compatibility with the traditional bag-of-words framework, yielding a hybrid representation that combines structural and local features.
360-degree camera
Owing to the imaging characteristics of a spherical camera, each image captures a 360-degree panorama of the scene, eliminating any restriction on steering choices. One of the main challenges of spherical images is the severe barrel distortion caused by the ultra-wide-angle fisheye lens, which complicates traditional, human-vision-inspired lane detection and trajectory tracking methods. This usually requires additional preprocessing steps such as prior calibration and de-warping (undistortion). An alternative proposed in [60] treats navigation as a classification problem, bypassing the preprocessing step and finding the best potential path direction directly from the raw, uncalibrated spherical image.
Panorama stitching is another open problem in this area. A real-time stitching method [61] uses a group of deformable meshes and the final image, combined with a robust pixel-shader that blends the inputs. Another method [62] combines the accuracy of geometric reasoning (lines and vanishing points) with the higher-level data extraction and pattern recognition of deep learning techniques (edges and normals) to extract structured data from indoor scenes and generate layout hypotheses. In sparsely structured scenes, feature-based image registration often fails for lack of distinctive image features; in that case, direct image registration methods, such as phase-correlation-based registration, can be used. [23] studied image registration based on discriminative correlation filters (DCF) and showed that DCF-based methods outperform phase-correlation-based ones.
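Direct registration by phase correlation can be sketched in a few lines of NumPy. This is a minimal version that assumes a purely circular, integer-pixel translation between the two images (real images need windowing and sub-pixel peak refinement):

```python
import numpy as np

def phase_correlation(a, b):
    """Estimate the integer translation taking `b` to `a` from the peak
    of the normalized cross-power spectrum -- no feature detection needed."""
    F = np.fft.fft2(a) * np.conj(np.fft.fft2(b))
    F /= np.abs(F) + 1e-12                 # keep only the phase difference
    corr = np.abs(np.fft.ifft2(F))
    dy, dx = np.unravel_index(corr.argmax(), corr.shape)
    return dy, dx                          # shifts are modulo the image size

rng = np.random.default_rng(1)
img = rng.random((64, 64))
shifted = np.roll(img, (5, 12), axis=(0, 1))   # circular shift by (5, 12)
shift = phase_correlation(shifted, img)
```

Because the estimate comes from the whole spectrum rather than from sparse keypoints, it still works on textureless, sparsely structured scenes where SIFT-style matching fails.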
Dataset annotation and enhancement
There are objections to combining CV with deep learning, which boil down to one question: do we need to re-evaluate our methods, rule-based versus data-driven? From the traditional signal-processing perspective, we understand what traditional CV algorithms such as SIFT and SURF actually compute, whereas deep learning offers no such insight; all it needs is more data. This can be seen as a huge advance, but perhaps also as a retreat. The paper covers both sides of this debate, but if future methods are purely data-driven, then research should focus on smarter ways of creating datasets.
A basic problem in current research is the lack of sufficient data for advanced algorithms or models in specialized applications. In the future, pairing custom datasets with deep learning models will be the subject of many research papers, so a researcher's output will include not only algorithms or architectures but also datasets or data collection methods. Dataset annotation is the main bottleneck in the deep learning workflow, requiring a great deal of manual labeling. This is especially evident in semantic segmentation, where every pixel must be labeled accurately. [20] discusses many useful semi-automatic labeling tools, some of which exploit algorithms such as ORB features, polygon morphing, and semi-automatic region-of-interest fitting.
The easiest and most common way to overcome data scarcity and reduce overfitting in deep learning image classification models is to artificially enlarge the dataset with label-preserving image transformations. This process, called dataset augmentation, generates additional training data from existing data by cropping, scaling, or rotating it. Ideally the augmentation step requires very little computation and can run inside the deep learning training loop, so the transformed images never need to be stored on disk. Traditional algorithms used for augmentation include principal component analysis (PCA), noise addition, interpolation or extrapolation between samples in feature space, and modeling the visual context around objects based on segmentation labels.
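A minimal on-the-fly augmentation along these lines might look as follows; the parameters (4-pixel reflect padding, Gaussian noise with sigma 0.01, a 50% flip chance) are illustrative choices, not from the paper:

```python
import numpy as np

def augment(image, rng):
    """Apply cheap label-preserving transforms on the fly: random
    horizontal flip, random translation crop, and additive noise."""
    if rng.random() < 0.5:
        image = image[:, ::-1]                   # horizontal flip
    pad = 4
    padded = np.pad(image, pad, mode="reflect")  # pad so crops can shift
    y = rng.integers(0, 2 * pad + 1)
    x = rng.integers(0, 2 * pad + 1)
    h, w = image.shape
    cropped = padded[y:y + h, x:x + w]           # random translation crop
    noisy = cropped + rng.normal(0.0, 0.01, cropped.shape)
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(42)
img = rng.random((28, 28))
aug = augment(img, rng)       # same shape, a new label-preserving variant
```

Because each call produces a fresh variant of the same shape, this can sit directly in the training loop and no transformed image ever touches the disk.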