This article explains how to understand fine-grained image classification with NTS-Net. The content is fairly detailed; interested readers are encouraged to follow along, and I hope it is helpful to you.
Overview
Some problems and challenges of fine-grained image classification.
Computer vision has made remarkable progress since AlexNet won the ILSVRC competition in 2012, a statement people often encounter when they begin studying this fast-growing field. The purpose of this blog is to understand the challenging problem of fine-grained visual classification (FGVC), which is described in detail below.
For the PyTorch code implementation, please refer to the following GitHub repository: https://github.com/yangze0930/NTS-Net
Along the way, you will see the challenges one may face at the start, and how the interesting architecture of this paper raised my validation accuracy from 42% at the beginning to 87% (the numbers are based on my own experiments). The dataset used is FGVC Aircraft at the variant level. Later, I also trained on the Stanford Cars dataset.
Part I: Initial attempts and errors in fine-grained visual classification
We know that a visual classification task means building a model that captures the relationship between an input image and its output class. The FGVC task, however, is not quite the same as ordinary classification, because intra-class variation is larger than inter-class variation. This is why our goal is to capture the distinguishing features of visually similar classes. Finding such features is challenging. In addition, annotating bounding boxes for the most informative regions of each sample is expensive.
When you first approach the problem, you can use the general image-classification recipe: take a standard pre-trained model and fine-tune it to find a good set of parameters for the task. As described in the paper, the dataset linked below is labeled at three levels, namely manufacturer, family, and variant. Fine-grained classification happens at the variant level.
At the beginning, I used a standard pre-trained model and tried different adjustments. I reached 66% validation accuracy at the manufacturer level and 42% at the variant level, and then 87% at the variant level the first time I tried NTS. For beginners like me, working through the code of a hard problem and drawing the right insights from it is very helpful.
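As a reference point, here is a minimal sketch of the fine-tuning baseline described above: an ImageNet-pretrained ResNet-50 whose final layer is replaced for the variant-level classes. The data path, image size, and hyper-parameters are illustrative assumptions (it assumes images arranged one folder per class), not values taken from the paper.

```python
# Minimal fine-tuning baseline (sketch): pretrained ResNet-50 with a new head.
# Paths and hyper-parameters are illustrative, not from the article.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((448, 448)),  # FGVC models are usually trained at higher resolution
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# Assumes images are arranged one folder per variant class.
train_set = datasets.ImageFolder("data/fgvc_aircraft/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=16, shuffle=True)

model = models.resnet50(pretrained=True)  # newer torchvision uses the weights= argument
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))  # variant-level head

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```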
Part II: What drives this improvement: NTS-Net
As mentioned earlier, collecting bounding-box labels for the most informative region of every sample is very expensive. This is where the paper helps: it localizes these informative regions effectively without requiring bounding-box annotations.
The model proposed in the paper, called NTS-Net, uses three "agents" that work together to reach state-of-the-art performance on the benchmark datasets (FGVC Aircraft, Stanford Cars, Caltech-UCSD Birds).
These three agents are called Navigator, Teacher, and Scrutinizer. Let's discuss their roles.
The Navigator agent navigates the model to focus on the most informative regions. For each region in the image, the Navigator predicts how informative that region is (trained with the ranking loss described below) and uses these predictions to propose the most informative regions. The question now is: how do we obtain useful variable-length "regions" from an image? There is already an answer to this, so bear with me while we go through the high-level role of each agent first.
The Teacher agent evaluates the most informative regions proposed by the Navigator and provides feedback: for each proposed region, the Teacher evaluates its probability of belonging to the ground-truth class. These confidence estimates guide the Navigator network, through a ranking-consistency loss function (called "ranking loss" in the code implementation), to propose more informative regions.
As the Teacher provides more precise supervision, the Navigator localizes more informative regions, which in turn benefits the Teacher.
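One way the Teacher's confidence feedback can be expressed is as the log-probability each proposed region assigns to the ground-truth class. The following is a sketch under that assumption, not the repository's exact code; tensor names and shapes are illustrative.

```python
# Sketch: per-region Teacher confidence = log-probability of the true class.
import torch
import torch.nn.functional as F

def teacher_confidence(part_logits, labels):
    """part_logits: (batch, top_n, num_classes); labels: (batch,) long tensor."""
    log_probs = F.log_softmax(part_logits, dim=-1)
    # Gather the log-probability of the ground-truth class for every proposed region.
    idx = labels.view(-1, 1, 1).expand(-1, part_logits.size(1), 1)
    return log_probs.gather(-1, idx).squeeze(-1)  # (batch, top_n)
```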
The Scrutinizer agent examines the regions proposed by the Navigator and performs the fine-grained classification: each proposed region is resized to the same size, the agent extracts features from it, fuses the regional features with the features of the whole image, and carries out the fine-grained classification. This fusion is the main idea for solving this hard problem.
Informative regions help represent the object better, so fusing the features of the informative regions with the whole-image features yields better performance.
The objective is therefore to localize the most informative regions of the object.
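A minimal sketch of this fusion step follows: the global image feature and the top-K region features (all produced by the same backbone) are concatenated and fed to one linear classifier. The feature dimension (2048 for ResNet-50), the number of regions, and the class count are illustrative assumptions.

```python
# Sketch of the Scrutinizer's fusion: concatenate global + region features, then classify.
import torch
import torch.nn as nn

class ScrutinizerHead(nn.Module):
    def __init__(self, feat_dim=2048, top_k=4, num_classes=100):
        super().__init__()
        self.fc = nn.Linear(feat_dim * (top_k + 1), num_classes)

    def forward(self, global_feat, part_feats):
        """global_feat: (B, feat_dim); part_feats: (B, top_k, feat_dim)"""
        fused = torch.cat([global_feat, part_feats.flatten(1)], dim=1)
        return self.fc(fused)  # (B, num_classes)
```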
Figure 1: NTS-Net model structure
Now let's return to the question raised above: how do we obtain useful variable-length "regions" from an image? Navigating to potentially informative regions can be framed as a region proposal problem, the problem addressed by the Region Proposal Network (RPN) introduced in the Faster R-CNN paper, and I will discuss its relevance here.
Section 1: Region proposals
Before discussing how region proposals are used in NTS-Net, I should briefly cover where they come from. If you already know this, feel free to skip this section.
There are several ways to generate region proposals:
i) Sliding window: you run a trained classifier over all fixed-size sliding windows of the image and then run a detector to see what the object is. We could use this approach, but the drawback is that it checks many windows containing no object at all, which is why the R-CNN algorithm was proposed.
ii) R-CNN: a segmentation algorithm is used to obtain regions that may contain objects, and the classifier is run only on those regions. The drawback is that it is slow, because the proposed regions are classified one at a time.
iii) Fast R-CNN: a segmentation algorithm is still used for region proposals, but unlike R-CNN, all proposed regions are classified simultaneously using a convolutional implementation of the sliding window.
iv) Faster R-CNN: instead of a traditional segmentation algorithm, a Region Proposal Network (RPN for short) is used. It relies on anchors (bounding boxes distributed over the whole image with different sizes, scales, and aspect ratios) and ground-truth bounding boxes to propose informative regions (see the anchor-generation sketch after this list).
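To make the anchor idea concrete, here is a sketch of dense anchor generation in the Faster R-CNN style: boxes of several scales and aspect ratios are tiled over a regular grid covering the image. The image size, stride, scales, and ratios below are illustrative, not the exact values used by NTS-Net.

```python
# Sketch: tile anchors of multiple scales and aspect ratios over the image.
import itertools
import numpy as np

def generate_anchors(image_size=448, stride=32, scales=(48, 96, 192), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    centers = np.arange(stride // 2, image_size, stride)
    for cy, cx in itertools.product(centers, centers):
        for scale, ratio in itertools.product(scales, ratios):
            h = scale * np.sqrt(ratio)
            w = scale / np.sqrt(ratio)
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)  # (num_anchors, 4) in (x1, y1, x2, y2)
```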
Section 2: What do region proposals look like in NTS-Net?
In this paper, default anchors are placed over the whole image, and the NTS model learns which of these anchors are the most informative through the custom losses in the code implementation (since we do not use annotated bounding boxes). These anchors define the coordinates of the regions proposed by Proposal_Net (the Navigator network) as defined in the code, and NMS (non-maximum suppression) is applied to remove redundancy (overlapping regions) and return the top_n proposed regions.
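The pruning step can be sketched with torchvision's non-maximum suppression: overlapping boxes are suppressed and the top_n highest-scoring survivors are kept. The IoU threshold and top_n below are illustrative values.

```python
# Sketch: prune the Navigator's scored anchors with NMS, then keep the top_n.
import torch
from torchvision.ops import nms

def select_top_regions(boxes, scores, iou_threshold=0.25, top_n=6):
    """boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,) informativeness scores."""
    keep = nms(boxes, scores, iou_threshold)  # indices, sorted by decreasing score
    return boxes[keep[:top_n]], scores[keep[:top_n]]
```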
Figure 2: Top-3 most informative regions learned by NTS on the FGVC Aircraft dataset
Figure 3: Top-3 most informative regions learned by NTS on the Stanford Cars dataset
Section 3: Description of the various losses in NTS-Net
Several custom losses are used in this paper and accumulated into a total loss, namely the raw loss, concat loss, rank loss, and part_cls loss.
Note: the loss names used in the code differ from those in the paper (the Navigator, Teacher, and Scrutinizer losses). Here I use the names from the code.
Total loss = Raw_loss + Rank_loss + Concat_loss + Part_cls_loss
Note: a ResNet-50 model is used as the feature extractor for both the original image and the proposed regions.
RAW LOSS: this is the cross-entropy loss for classifying the whole image from the ResNet features alone; the output here is the label of the image. The same whole-image features are later combined with the features of the proposed regions for the fine-grained classification.
CONCAT LOSS: in the Scrutinizer network, the original-image features and the proposed-region features are concatenated and fed into a classifier; the concat loss is the cross-entropy loss on this classifier's output, which again predicts the image label.
PART LOSS (LIST LOSS): this is used as feedback to the Navigator network; it is the cross-entropy loss between each proposed region's prediction and the ground-truth class of the image.
RANK LOSS: this loss takes the top_n RPN scores (the Navigator's scores for the proposed regions) and the corresponding per-region part losses as the feedback each proposed region receives. For every proposed region, the regions that the Teacher considers more informative should also be scored higher by the Navigator; violations of this ordering are accumulated into the rank loss, which is then optimized.
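A hedged sketch of such a pairwise ranking (hinge) loss follows. It is written in terms of a per-region confidence (higher means more informative), whereas the repository's code uses the per-region part loss (lower means more informative); the margin value and shapes are illustrative.

```python
# Sketch: pairwise hinge ranking loss enforcing Navigator/Teacher rank consistency.
import torch
import torch.nn.functional as F

def ranking_loss(scores, teacher_conf, margin=1.0):
    """scores, teacher_conf: (batch, top_n); higher teacher_conf = more informative."""
    loss = scores.new_zeros(())
    top_n = scores.size(1)
    for i in range(top_n):
        # Regions the Teacher ranks above region i.
        better = (teacher_conf > teacher_conf[:, i:i + 1]).float()
        # Penalize when their Navigator score does not exceed score_i by the margin.
        loss = loss + (F.relu(margin - scores + scores[:, i:i + 1]) * better).sum()
    return loss / scores.size(0)
```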
PART_CLS LOSS: this is the cross-entropy loss between the part features and the labels. The part features are extracted with ResNet-50 from the part_images defined in the code, which are cropped from the original image using the coordinates of the top_n proposed regions.
Note that the part loss and the part_cls loss are computed the same way, but the part_cls loss contributes to the total loss while the part loss does not; the part loss is instead used as guidance/feedback inside the rank loss.
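Putting the pieces together, here is a sketch of how the terms can be combined into the total loss from the equation above, reusing the ranking_loss sketch from the previous section. The equal weighting of the terms follows the equation in this article; tensor shapes and everything else are illustrative.

```python
# Sketch: combine raw, concat, rank, and part_cls losses into the total loss.
import torch.nn.functional as F

def total_loss(raw_logits, concat_logits, part_logits, nav_scores, labels):
    """raw_logits, concat_logits: (B, C); part_logits: (B, top_n, C);
    nav_scores: (B, top_n); labels: (B,)"""
    b, top_n, c = part_logits.shape
    raw_loss = F.cross_entropy(raw_logits, labels)        # whole-image classification
    concat_loss = F.cross_entropy(concat_logits, labels)  # fused-feature classification
    # Per-region cross-entropy (the part loss), kept un-reduced so it can act
    # as the Teacher's feedback inside the ranking loss.
    part_losses = F.cross_entropy(part_logits.reshape(-1, c),
                                  labels.repeat_interleave(top_n),
                                  reduction="none").view(b, top_n)
    rank_loss = ranking_loss(nav_scores, -part_losses)    # lower part loss = more informative
    part_cls_loss = part_losses.mean()
    return raw_loss + rank_loss + concat_loss + part_cls_loss
```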
That covers how to understand fine-grained image classification with NTS-Net. I hope the content above is of some help and lets you learn something new. If you found the article useful, feel free to share it so more people can see it.