This article explains the five major technologies of computer vision. The approach taken here is simple, fast, and practical, so interested readers are encouraged to follow along.
▌ Why study computer vision?
One obvious answer is that this research field has spawned a large number of fast-growing and practical applications, such as:
Face recognition: Snapchat and Facebook use face detection algorithms to recognize faces.
Image retrieval: Google Images uses content-based queries to search for related images, and the algorithm analyzes the contents of the queried images and returns the results based on the best match.
Games and controls: a notably successful gaming and control application that uses stereo vision is Microsoft Kinect.
Surveillance: cameras that watch for suspicious behavior are deployed throughout major public places.
Biometric technology: fingerprint, iris, and face matching remain common methods in the biometrics field.
Smart cars: computer vision remains the main source of information for detecting traffic signs, traffic lights, and other visual features.
Visual recognition is a key component of computer vision, covering tasks such as image classification, localization, and detection. Recent developments in neural networks and deep learning have greatly advanced these state-of-the-art visual recognition systems. In this article, I will share five major computer vision technologies and introduce several deep learning models and applications built on them.
▌ 1. Image classification
Given a set of images each labeled with a single category, we predict the categories of a new set of test images and measure the accuracy of those predictions; this is the image classification problem. Image classification has to cope with the following challenges:
viewpoint variation, scale variation, intra-class variation, image deformation, occlusion, lighting conditions, and background clutter.
How do we write an image classification algorithm?
Computer vision researchers have proposed a data-driven method.
Instead of specifying each image category of interest directly in code, this approach provides the computer with many examples of each category and then designs a learning algorithm that looks at these examples and learns the visual appearance of each category. In other words, we first accumulate a training set of labeled images and then feed it to the computer, which processes the data.
The task can therefore be decomposed into the following steps:
The input is a training set of N images spanning K categories, each image labeled with one of the categories.
Then, the training set is used to train a classifier that learns what each category looks like.
Finally, the classifier predicts class labels for a new set of images, and its performance is evaluated by comparing the predicted labels with the true labels.
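To make this pipeline concrete, here is a minimal sketch in Python of the train / predict / evaluate loop; the random arrays, the 32×32 image size, and the 1-nearest-neighbor rule standing in for a learned classifier are illustrative assumptions, not part of the original article.

```python
# Minimal sketch of the data-driven pipeline: train on labeled images,
# predict labels for new images, and measure accuracy.
# A 1-nearest-neighbor baseline stands in for the learned classifier.
import numpy as np

def train(train_images, train_labels):
    # "Training" for nearest neighbor just memorizes the labeled examples.
    return train_images.reshape(len(train_images), -1), np.asarray(train_labels)

def predict(model, test_images):
    memory, labels = model
    flat = test_images.reshape(len(test_images), -1)
    preds = []
    for x in flat:
        # L2 distance to every stored training image; copy the closest label.
        dists = np.linalg.norm(memory - x, axis=1)
        preds.append(labels[np.argmin(dists)])
    return np.array(preds)

# Hypothetical arrays: N training and M test images of size 32x32x3, K=10 classes.
train_images = np.random.rand(500, 32, 32, 3)
train_labels = np.random.randint(0, 10, size=500)
test_images = np.random.rand(100, 32, 32, 3)
test_labels = np.random.randint(0, 10, size=100)

model = train(train_images, train_labels)
accuracy = (predict(model, test_images) == test_labels).mean()
print(f"accuracy: {accuracy:.2%}")
```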
At present, the most popular image classification architecture is the convolutional neural network (CNN): the image is fed into the network, and the network classifies the image data. A CNN starts with an input "scanner" that does not parse all of the training data at once. For example, to input an image of size 100×100, you do not need a network layer with 10,000 nodes. Instead, you create a scanning input layer of size 10×10 and scan the first 10×10 pixels of the image. The scanner then moves one pixel to the right and scans the next 10×10 pixels; this is the sliding window.
The input data is fed into convolutional layers rather than ordinary fully connected layers. Each node only deals with its nearest neighbors, and the convolutional layers tend to shrink as the network deepens. Besides convolutional layers, there are usually pooling layers. Pooling is a way of filtering out detail; the most common pooling technique is max pooling, which uses a 2×2 window and keeps only the pixel with the largest value.
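Below is a minimal PyTorch sketch of such a classification CNN: small convolutional filters slide over a 100×100 input, 2×2 max pooling filters away detail, and a final fully connected layer produces class scores. The layer widths and the ten-class output are illustrative assumptions.

```python
# A minimal sketch of a classification CNN: convolutional "scanners",
# 2x2 max pooling, and a fully connected classifier on top.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # local 3x3 "scanner"
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 2x2 max pooling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 25 * 25, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

# One hypothetical 100x100 RGB image -> class scores for 10 categories.
scores = TinyCNN()(torch.randn(1, 3, 100, 100))
print(scores.shape)  # torch.Size([1, 10])
```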
Today, most image classification techniques are trained on the ImageNet dataset, which contains about 1.2 million high-resolution training images. The test images come with no initial annotations (no segmentation or labels), and the algorithm must generate labels specifying which objects are present in each image.
Many existing computer vision algorithms have been applied to ImageNet by top computer vision teams from Oxford, INRIA, and XRCE. Typically, these computer vision systems use complex multi-stage pipelines, with early-stage algorithms hand-tuned by optimizing a few parameters.
The winner of the 2012 ImageNet competition was Alex Krizhevsky (NIPS 2012), who designed a deep convolutional neural network of the type pioneered by Yann LeCun. Besides some max pooling layers, the architecture contains seven hidden layers: the first few are convolutional, and the last two are fully connected. In each hidden layer, the activation function is a rectified linear unit, which trains faster and works better than a logistic unit. The network also uses competitive normalization to suppress hidden activity when nearby units are more active, which helps with variations in intensity.
In terms of hardware, Alex implemented a very efficient convolutional network on two Nvidia GTX 580 GPUs (over 1,000 fast cores in total). GPUs are very well suited to matrix-matrix multiplication and have very high memory bandwidth. This allowed him to complete training within a week and to quickly combine results from 10 image patches at test time. If states can be communicated fast enough, the network can be spread across many cores.
As cores get cheaper and datasets get bigger, large neural networks improve faster than older computer vision systems. Since then, many models built around convolutional neural networks have achieved excellent results, such as ZFNet (2013), GoogLeNet (2014), VGGNet (2014), ResNet (2015), and DenseNet (2016).
▌ 2. Object detection
The task of identifying objects in an image usually involves outputting bounding boxes and labels for individual objects. Detection differs from the classification/localization task in that it classifies and localizes many objects, not just a single dominant one. In object detection there are only two bounding-box categories: object and non-object. For example, in car detection, you must detect all cars in a given image, each with its bounding box.
If we used the sliding-window technique from image classification and localization, we would need to apply a convolutional neural network at a huge number of locations and scales on the image, since the CNN classifies each region as either an object or background. That requires an enormous amount of computation!
To solve this problem, neural network researchers proposed the concept of regions: find "blob-like" image regions that are likely to contain objects, so that the detector runs much faster. The first such model was the region-based convolutional neural network (R-CNN), whose algorithm works as follows:
In R-CNN, the input image is first scanned with a selective search algorithm to find possible objects, generating about 2,000 region proposals.
Then, a convolutional neural network is run on each of these region proposals.
Finally, the output of each CNN is fed into a support vector machine (SVM) to classify the region, and a linear regression is used to tighten the object's bounding box.
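As a hedged illustration of the proposal step, the snippet below generates region proposals with OpenCV's selective search implementation (available in opencv-contrib-python); the image path is hypothetical, and in a full R-CNN each proposal crop would then go through the CNN and per-class SVMs.

```python
# Sketch of step 1 above: generating ~2000 region proposals with selective
# search (requires opencv-contrib-python). Each crop would then be fed to
# the CNN and the per-class SVMs in a full R-CNN pipeline.
import cv2

image = cv2.imread("street.jpg")  # hypothetical input image
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()          # fast mode trades recall for speed
proposals = ss.process()[:2000]           # keep ~2000 (x, y, w, h) boxes

print(f"{len(proposals)} region proposals")
for (x, y, w, h) in proposals[:5]:
    crop = image[y:y + h, x:x + w]        # this crop would go to the CNN
    print(crop.shape)
```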
In essence, object detection is turned into an image classification problem. But this approach has problems: training is slow, it requires a lot of disk space, and inference is also very slow.
The first upgraded version of R-CNN is Fast R-CNN, which greatly improves detection speed through two enhancements:
Feature extraction is performed before the regions are proposed, so the convolutional neural network only needs to run once over the whole image.
Instead of creating a new model, the support vector machine is replaced with a softmax layer, extending the neural network used for prediction.
Fast R-CNN runs much faster than R-CNN because only one CNN pass is needed per image. However, the selective search algorithm still takes a long time to generate region proposals.
Faster R-CNN is now a canonical example of deep-learning-based object detection.
It replaces the slow selective search algorithm with a fast neural network: a region proposal network (RPN) is inserted to predict proposals from the features. The RPN decides where to look, which reduces the computational cost of the whole inference pipeline.
The RPN scans every location quickly and efficiently to assess whether further processing is needed in a given region. It does so by outputting k bounding-box proposals per location, each with two scores representing the probability that the location does or does not contain the target object.
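The following PyTorch sketch shows what such an RPN head can look like: a 3×3 convolution slides over the backbone feature map and, at every location, emits 2k objectness scores and 4k box deltas for k anchors. The channel counts and feature map size are assumptions for illustration.

```python
# Minimal sketch of an RPN head: a small convolution slides over the backbone
# feature map and outputs 2 scores (object / not object) and 4 box deltas
# for each of k anchors at every location.
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(512, 2 * k, kernel_size=1)  # contains-object scores
        self.box_deltas = nn.Conv2d(512, 4 * k, kernel_size=1)  # bounding-box refinements

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.objectness(x), self.box_deltas(x)

# Hypothetical backbone feature map for one image: 512 channels on a 38x50 grid.
scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))
print(scores.shape, deltas.shape)  # (1, 18, 38, 50) and (1, 36, 38, 50)
```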
Once we have region proposals, they are fed directly into what is essentially a Fast R-CNN: a pooling layer, some fully connected layers, a softmax classification layer, and a bounding-box regressor are added on top.
In short, Faster R-CNN is faster and more accurate. It is worth noting that although later models have done a lot to improve detection speed, few have managed to significantly outperform Faster R-CNN. In other words, Faster R-CNN may not be the simplest or fastest method for object detection, but it remains one of the best-performing.
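For readers who want to try Faster R-CNN directly, torchvision ships a pre-trained implementation; the short usage sketch below assumes a recent torchvision version (older versions use pretrained=True instead of weights="DEFAULT") and a hypothetical image path.

```python
# Usage sketch: run a pre-trained Faster R-CNN from torchvision and keep
# only the confident detections (boxes, labels, scores).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("street.jpg").convert("RGB"))
with torch.no_grad():
    output = model([image])[0]        # boxes, labels, scores for one image

keep = output["scores"] > 0.8         # keep confident detections only
print(output["boxes"][keep], output["labels"][keep])
```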
In recent years, mainstream object detection has shifted toward faster, more efficient detection systems. This trend is especially visible in You Only Look Once (YOLO), the Single Shot MultiBox Detector (SSD), and the region-based fully convolutional network (R-FCN), which share computation across the whole image. This sets these three algorithms apart from the three comparatively costly R-CNN techniques above.
▌ 3. Target tracking
Target tracking is the process of following one or more specific objects of interest in a particular scene. It has traditionally been applied to video analysis and interaction with the real world, where observation continues after an object is initially detected. Target tracking is now also important in self-driving, used by autonomous driving companies such as Uber and Tesla.
Depending on the observation model, target tracking algorithms fall into two categories: generative methods and discriminative methods.
Generative methods use a generative model to describe the appearance of the target and search for it by minimizing reconstruction error; principal component analysis (PCA) is a typical example.
Discriminative methods distinguish the object from the background; they are more robust and have gradually become the dominant approach to tracking (discriminative tracking is also known as tracking-by-detection, and deep learning falls into this category).
To track by detection, we detect candidates in every frame and use deep learning to identify the desired object among those candidates. Two basic network families can be used: the stacked autoencoder (SAE) and the convolutional neural network (CNN).
At present, the most popular SAE-based tracking network is the Deep Learning Tracker (DLT), which combines offline pre-training with online fine-tuning. The process is as follows:
Offline, unsupervised pre-training on a large-scale natural image dataset yields a general object representation by pre-training a stacked denoising autoencoder. A stacked denoising autoencoder adds noise to the input image and reconstructs the original, which produces a more robust feature representation.
The encoder of the pre-trained network is combined with a classifier to form the classification network, which is then fine-tuned with positive and negative samples drawn from the initial frame so that it can distinguish the current object from the background. DLT uses a particle filter as the motion model to generate candidate patches for the current frame; the classification network outputs a probability (classification confidence) for each patch, and the patch with the highest confidence is selected as the object.
For model updates, DLT uses a confidence threshold.
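A hedged sketch of this tracking loop is shown below: particle-filter-style sampling of candidate boxes around the previous location, a classifier scoring each candidate, and a model update only when confidence drops below a threshold. The score_patch function is a random stand-in for DLT's pre-trained encoder plus classifier, and all numbers are illustrative.

```python
# Sketch of a DLT-style tracking loop: sample candidate boxes around the
# previous location, score each one, pick the most confident candidate,
# and only update the model when confidence falls below a threshold.
import numpy as np

def score_patch(frame, box):
    # Placeholder confidence in [0, 1]; in DLT this is the classifier output.
    return float(np.random.rand())

def track_frame(frame, prev_box, n_particles=200, update_threshold=0.9):
    x, y, w, h = prev_box
    # Motion model: Gaussian perturbations of the previous box, a simple
    # stand-in for the particle filter.
    candidates = [(x + np.random.randn() * 5, y + np.random.randn() * 5, w, h)
                  for _ in range(n_particles)]
    scores = [score_patch(frame, c) for c in candidates]
    best = int(np.argmax(scores))
    if scores[best] < update_threshold:
        pass  # here DLT would fine-tune the network with fresh samples
    return candidates[best], scores[best]

frame = np.zeros((480, 640, 3))               # hypothetical video frame
box, confidence = track_frame(frame, prev_box=(100, 120, 60, 80))
print(box, confidence)
```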
Given the advantages of CNNs in image classification and object detection, they have become the mainstream deep models for computer vision and visual tracking. Generally speaking, a large-scale convolutional neural network can be trained both as a classifier and as a tracker. Two representative CNN-based tracking algorithms are the fully convolutional network tracker (FCNT) and the multi-domain convolutional neural network (MDNet).
FCNT makes full use of the feature maps of the VGG model, a network pre-trained on the ImageNet dataset, and builds on the following observations:
CNN feature maps can be used for localization and tracking.
For the task of distinguishing a specific object from the background, many CNN feature maps are noisy or irrelevant.
Higher layers capture semantic concepts about object categories, while lower layers encode more discriminative features that capture intra-class deformations.
Therefore, FCNT designs a feature selection network to pick the most relevant feature maps from the conv4-3 and conv5-3 layers of the VGG network. Then, to avoid overfitting to noise, it adds two extra channels (SNet and GNet) on top of the selected feature maps of these two layers: GNet captures the category information of the object, and SNet distinguishes the object from backgrounds with a similar appearance.
The two networks operate as follows: both are initialized with the bounding box given in the first frame to obtain a heat map of the object. For each new frame, a region of interest centered on the target's location in the previous frame is cropped and propagated through the networks. Finally, SNet and GNet each produce a prediction heat map, and the tracker decides which heat map to use for the tracking result according to whether distractors are present.
Unlike FCNT, MDNet uses all the sequences of a video to track object motion. The networks above use unrelated image data to reduce the training burden on tracking data, and this idea deviates somewhat from tracking: an object belonging to one class in one video may be background in another. MDNet therefore proposes the concept of "multi-domain" learning, which independently distinguishes object from background within each domain, where a domain represents a group of videos containing the same type of object.
MDNet can be divided into two parts: K domain-specific branch layers and the shared layers. Each branch contains a binary classification layer with a softmax loss to distinguish object from background in its own domain, while the shared layers are shared across all domains to ensure a common representation.
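The following PyTorch sketch mirrors that layout, with shared convolutional and fully connected layers plus K domain-specific binary branches; the layer sizes and the 107×107 patch size are illustrative assumptions rather than the exact MDNet configuration.

```python
# Minimal sketch of an MDNet-style layout: layers shared across all domains,
# plus K domain-specific binary (object vs. background) branches.
import torch
import torch.nn as nn

class MDNetSketch(nn.Module):
    def __init__(self, num_domains):
        super().__init__()
        self.shared = nn.Sequential(                       # shared representation
            nn.Conv2d(3, 32, 7, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(3), nn.Flatten(),
            nn.Linear(64 * 3 * 3, 512), nn.ReLU(),
        )
        # One binary classification branch per domain (per group of videos).
        self.branches = nn.ModuleList(
            [nn.Linear(512, 2) for _ in range(num_domains)]
        )

    def forward(self, patches, domain):
        return self.branches[domain](self.shared(patches))

# Score 8 hypothetical 107x107 candidate patches in domain 3.
logits = MDNetSketch(num_domains=10)(torch.randn(8, 3, 107, 107), domain=3)
print(logits.shape)  # torch.Size([8, 2])
```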
In recent years, deep learning researchers have tried different ways to adapt to the characteristics of the visual tracking task and have explored several directions:
applying other network models such as recurrent neural networks (RNN) and deep belief networks (DBN);
designing network structures suited to video processing and end-to-end learning, and optimizing processes, structures, and parameters;
or combining deep learning with traditional computer vision methods or with approaches from other fields such as language processing and speech recognition.
▌ 4. Semantic segmentation
Segmentation is central to computer vision: it divides the whole image into groups of pixels that are then labeled and classified. In particular, semantic segmentation tries to understand the semantic role of each pixel in the image (for example, whether it belongs to a car, a motorcycle, or some other category). Besides recognizing people, roads, cars, trees, and so on, we must also delineate the boundary of each object. Therefore, unlike classification, we need a model that makes dense, pixel-wise predictions.
As with other computer vision tasks, convolutional neural networks have achieved great success on segmentation. One of the most popular early approaches was patch classification with a sliding window, in which each pixel is classified separately using the image patch around it. This is computationally very inefficient, because shared features between overlapping patches cannot be reused.
The solution is the fully convolutional network (FCN) proposed at UC Berkeley, an end-to-end convolutional architecture for dense prediction that contains no fully connected layers.
This approach allows segmentation maps to be generated for images of any size, and it is much faster than patch classification. Almost all subsequent semantic segmentation algorithms adopt this paradigm.
However, one problem remains: convolution at the original image resolution is very expensive. To address this, FCN uses downsampling and upsampling inside the network: the downsampling layers are strided convolutions, and the upsampling layers are transposed convolutions (deconvolutions).
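A minimal PyTorch sketch of this idea, with assumed channel counts and class count: a strided convolution halves the spatial resolution, and a transposed convolution brings it back so that one score map per class is produced at the input resolution.

```python
# Minimal sketch of in-network downsampling / upsampling for dense prediction:
# a strided convolution halves the resolution, a transposed convolution
# (deconvolution) restores it for per-pixel class scores.
import torch
import torch.nn as nn

down = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)   # downsample
up = nn.ConvTranspose2d(16, 5, kernel_size=2, stride=2)       # upsample to 5 class maps

x = torch.randn(1, 3, 128, 128)    # hypothetical image
low = down(x)                      # -> (1, 16, 64, 64)
logits = up(low)                   # -> (1, 5, 128, 128), one score map per class
print(low.shape, logits.shape)
```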
Although upsampling and downsampling layers are used, FCN produces coarse segmentation maps because information is lost during pooling. SegNet is a more memory-efficient architecture than FCN that uses max-pooling indices and an encoder-decoder framework; in the SegNet decoder, shortcut/skip connections from higher-resolution feature maps are introduced to refine the coarse segmentation maps produced after downsampling and upsampling.
Current semantic segmentation research relies on fully convolutional networks, such as dilated (atrous) convolutions, DeepLab, and RefineNet.
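As a usage sketch, torchvision provides pre-trained fully convolutional segmentation models; the example below uses DeepLabv3 with a ResNet-50 backbone, assumes a recent torchvision version (weights="DEFAULT"), and uses a hypothetical image path.

```python
# Usage sketch: a pre-trained fully convolutional segmentation model returns
# one score map per class; argmax over classes gives a per-pixel label map.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.segmentation.deeplabv3_resnet50(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("street.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    out = model(image)["out"]          # shape: (1, num_classes, H, W)

label_map = out.argmax(dim=1)          # per-pixel class indices, shape (1, H, W)
print(label_map.shape, label_map.unique())
```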
▌ 5. Instance segmentation
Beyond semantic segmentation, instance segmentation separates different instances of a class, for example marking five cars with five different colors. In classification, the task is to say what an image containing a single dominant object is; instance segmentation demands more. We face multiple overlapping objects and complex scenes with different backgrounds, and we must not only classify these different objects but also determine their boundaries, differences, and relations to one another.
So far, we have seen how to use CNN features in many interesting ways to effectively locate different objects in an image with bounding boxes. Can we extend this technique further and locate the exact pixels of each object rather than just a bounding box? Facebook AI explored this instance segmentation problem with the Mask R-CNN architecture.
Just as with Fast R-CNN and Faster R-CNN, the underlying idea of Mask R-CNN is simple: since Faster R-CNN works so well for object detection, can it be extended to pixel-level segmentation?
Mask R-CNN performs pixel-level segmentation by adding a branch to Faster R-CNN that outputs a binary mask indicating whether a given pixel is part of the target object. The branch is a fully convolutional network on top of the CNN feature map: it takes the feature map as input and outputs a matrix in which positions belonging to the object are marked 1 and all other positions are marked 0; this is the binary mask.
In addition, when run on the original Faster R-CNN architecture without modification, the feature map regions selected by RoIPool are slightly misaligned with the corresponding regions of the original image. Because image segmentation is pixel-level, unlike bounding boxes, this naturally leads to inaccurate results. Mask R-CNN solves this problem by replacing RoIPool with a region-of-interest alignment (RoIAlign) method: in essence, RoIAlign uses bilinear interpolation to avoid the rounding errors that cause inaccurate detection and segmentation.
Once these masks are generated, Mask R-CNN combines them with the classifications and bounding boxes from Faster R-CNN to produce precise segmentations.
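As a usage sketch, torchvision also ships a pre-trained Mask R-CNN; the example below assumes a recent torchvision version (weights="DEFAULT") and a hypothetical image path, and shows that each detected instance comes back with a box, a label, a score, and a per-pixel mask.

```python
# Usage sketch: a pre-trained Mask R-CNN returns, per detected instance,
# a box, a label, a score, and a soft per-pixel mask (thresholded here).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("street.jpg").convert("RGB"))
with torch.no_grad():
    pred = model([image])[0]

keep = pred["scores"] > 0.8
masks = pred["masks"][keep] > 0.5      # threshold soft masks into binary masks
print(pred["boxes"][keep].shape, masks.shape)  # (N, 4) and (N, 1, H, W)
```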
At this point, I believe you have a deeper understanding of the five major technologies of computer vision. You might as well try them out in practice and continue learning.