Practical information | How to build a computer vision model using CNNs?


How do you build a computer vision model with CNNs? What datasets already exist? How are the models trained? This article answers these basic questions while walking through the most important concepts in computer vision.

Computer vision is one of the hottest areas in machine learning, with a wide range of applications and great potential. Its aim is to replicate the powerful capabilities of human vision. But how can this be achieved with algorithms?

Let's take a look at the most important datasets and methods for building computer vision models.

Existing datasets

Computer vision algorithms are not magical. They need data to work, and they will only be as good as the data you feed them. Depending on the task, there are different sources for collecting the right data:

ImageNet is one of the largest and best-known datasets. It is an off-the-shelf collection of 14 million images, manually annotated using WordNet concepts. Across the dataset, 1 million of the images also contain bounding-box annotations.

An ImageNet image with object attribute annotations (image source)

Another famous example is Microsoft's COCO (Common Objects in Context) dataset, which contains 328,000 images covering 91 easily recognizable object types, with a total of 2.5 million labeled instances.

Example of an annotated image from the COCO dataset
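As a hedged illustration (not part of the original article), this is a minimal sketch of loading COCO-style annotations with torchvision's CocoDetection wrapper. It assumes the images and annotation file have already been downloaded locally and that pycocotools is installed; the paths are placeholders.

```python
import torchvision

# Placeholder paths: point these at a local copy of the COCO images and
# the matching annotation file.
coco = torchvision.datasets.CocoDetection(
    root="coco/val2017",
    annFile="coco/annotations/instances_val2017.json",
)

image, targets = coco[0]                      # a PIL image and its list of annotations
for obj in targets:
    print(obj["category_id"], obj["bbox"])    # class id and [x, y, width, height]
```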

Although there are not that many datasets available, several are suited to different, more specific tasks:

For example, researchers have used the CelebFaces Attributes dataset, with more than 200,000 celebrity portraits; indoor scene recognition datasets, such as a "bedroom" dataset with more than 3 million images and a 15,620-image indoor scene collection; and plant image analysis datasets with 1 million plant images from 11 different species.

With these large collections of photos, models can be trained over and over, and their results progressively improved.
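To give a concrete, hedged example of how such a photo collection is typically fed to a model, here is a minimal sketch using torchvision's ImageFolder; the folder name and layout are assumptions for illustration, not something from the article.

```python
import torch
import torchvision
import torchvision.transforms as T

# Assumes a layout like photos/<class_name>/<image>.jpg, which is the
# convention ImageFolder expects; "photos" is a placeholder path.
transform = T.Compose([T.Resize((224, 224)), T.ToTensor()])
dataset = torchvision.datasets.ImageFolder("photos", transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

print(dataset.classes)   # class names inferred from the sub-folder names
```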

An overall strategy

Deep learning methods and techniques have profoundly changed computer vision, along with other areas of artificial intelligence, to the point where their use is considered standard for many tasks. In particular, convolutional neural networks (CNNs) have surpassed the state of the art achieved with traditional computer vision techniques.

These four steps outline a general method for building a computer vision model using a CNN:

1. Create a dataset of annotated images, or use an existing one. The annotations can be the image category (for a classification problem), bounding boxes plus classes (for an object detection problem), or a pixel-wise segmentation of each object of interest in the image (for an instance segmentation problem).
2. Extract, from each image, the features relevant to the task at hand. This is the key point in modeling the problem. For example, the features used to identify faces, based on facial criteria, are obviously different from those used to recognize tourist attractions or human organs.
3. Train a deep learning model on the isolated features. Training means feeding many images to the machine learning model, which learns from those features how to solve the task at hand.
4. Evaluate the model using images that were not used in the training phase. By doing so, you can test how accurately the trained model performs.

This strategy is very basic, but it serves its purpose well. This approach, called supervised machine learning, requires a dataset covering the phenomenon the model has to learn.
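The following is a minimal, hedged sketch of those four steps in PyTorch. It is not from the original article: CIFAR-10 stands in for the annotated dataset, the network is deliberately tiny, and the hyperparameters are illustrative only.

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

# 1. Annotated dataset: CIFAR-10 stands in for a labeled image collection.
transform = T.Compose([T.ToTensor()])
train_set = torchvision.datasets.CIFAR10("data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10("data", train=False, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=64)

# 2. Feature extraction: the convolutional layers learn the relevant features.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),   # CIFAR-10 images are 32x32; two 2x2 pools -> 8x8
)

# 3. Training: show the model many labeled images and minimize the loss.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
for epoch in range(2):           # a couple of epochs, just to illustrate the loop
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# 4. Evaluation: measure accuracy on images never seen during training.
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for images, labels in test_loader:
        predictions = model(images).argmax(dim=1)
        correct += (predictions == labels).sum().item()
        total += labels.size(0)
print(f"test accuracy: {correct / total:.3f}")
```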

Training an object detection model

There are many ways to tackle the challenge of object detection. A general approach is the one proposed by Paul Viola and Michael Jones in the paper "Robust Real-time Object Detection."

Paper portal: "link"

Although the method can be trained to detect a range of object classes, its original motivation was face detection. It is so fast and straightforward that it has been implemented in point-and-shoot cameras, enabling real-time face detection with very little processing power.

At its core, the method uses a set of binary classifiers trained on Haar features. These features represent edges and lines and are cheap to compute when scanning an image.

Haar features

Although very basic, in the specific case of faces these features allow capturing important elements such as the nose, the mouth, or the distance between the eyebrows. It is a supervised method that requires many positive and negative examples of the object type to be detected.
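For a hands-on feel, here is a minimal sketch of Viola-Jones-style face detection using the Haar cascade that ships with OpenCV. The image path is a placeholder and the parameter values are typical defaults, not taken from the article.

```python
import cv2

# Load the frontal-face cascade bundled with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

image = cv2.imread("photo.jpg")                 # placeholder path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# detectMultiScale scans the image at several scales and returns face boxes.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces.jpg", image)
```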

CNN-based methods

Deep learning has been a real game changer in machine learning, especially in computer vision, where deep-learning-based approaches now lead the state of the art for many common tasks.

Among the various deep learning approaches proposed for object detection, R-CNN (Regions with CNN features) is particularly easy to understand. In that paper, the authors propose a three-stage process:

1. Extract possible objects using a region proposal method.
2. Extract features from each region using a CNN.
3. Classify each region using support vector machines (SVMs).

R-CNN architecture (image source)

Although the R-CNN algorithm is agnostic about which region proposal method is used, the original work chose selective search. Step 3 is important because it reduces the number of candidates, and with it the computational cost of the method.
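As a hedged illustration of the region proposal stage, the selective search implementation bundled with opencv-contrib-python can generate candidate boxes similar to those used in the original work; the image path below is a placeholder.

```python
import cv2

# Selective search lives in opencv-contrib-python (cv2.ximgproc).
image = cv2.imread("photo.jpg")                 # placeholder path
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()                # trades proposal quality for speed
proposals = ss.process()                        # array of (x, y, w, h) boxes
print(len(proposals), "region proposals")
```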

The features extracted here are less intuitive than the Haar features mentioned earlier. To summarize, a CNN is used to extract a 4096-dimensional feature vector from each region proposal. Given the nature of the CNN, the input must always have the same dimensions. This is usually one of the weaknesses of CNNs, and different approaches address it in different ways. In R-CNN, the trained CNN architecture requires fixed inputs of 227 x 227 pixels. Since the proposed regions come in different sizes, the authors simply warp the images to match the required dimensions.

An example of an image warped to match the input dimensions required by the CNN
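To make the warping and feature extraction concrete, here is a schematic sketch rather than the original implementation: it assumes a recent torchvision, uses a pretrained AlexNet truncated before its last layer to obtain 4096-dimensional vectors, and warps each proposed box to 227 x 227 before the forward pass.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained AlexNet with the final classification layer removed, so the
# forward pass ends at the 4096-dimensional fully connected activations.
alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
alexnet.classifier = torch.nn.Sequential(*list(alexnet.classifier.children())[:-1])
alexnet.eval()

preprocess = T.Compose([
    T.Resize((227, 227)),   # warp every proposal to a fixed size
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def region_features(image: Image.Image, box):
    """Crop one proposed (x, y, w, h) region, warp it, and return a 4096-d vector."""
    x, y, w, h = box
    crop = image.crop((x, y, x + w, y + h))
    with torch.no_grad():
        return alexnet(preprocess(crop).unsqueeze(0)).squeeze(0).numpy()

# These vectors can then be fed to per-class linear classifiers,
# e.g. sklearn.svm.LinearSVC, as in the third R-CNN stage.
```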

Although it achieved good results, training ran into several obstacles, and the method was eventually surpassed by others. Some of those are reviewed in depth in the article "Object Detection in Deep Learning: An Authoritative Guide."

Source: https://www.toutiao.com/a6693688027820065292/
