How to use Mask-RCNN to overcome overfitting in instance segmentation

2025-01-16 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

Many newcomers are unclear about how to use Mask-RCNN to overcome overfitting in instance segmentation. To help solve this problem, the following article explains it in detail; readers who need it are welcome to learn from it, and hopefully you will get something out of it.

Overview

Training Mask-RCNN with only 1349 images, code included.

Introduction

Advances in computer vision have brought many promising applications, such as self-driving cars or medical diagnostics. In these tasks, we rely on the ability of machines to identify objects.

We often see four tasks related to object recognition: classification and localization, object detection, semantic segmentation, and instance segmentation.

In classification and localization, we are interested in assigning a class label to the objects in the image and drawing a bounding box around them. In this task, the number of objects to be detected is fixed.

Object detection differs from classification and localization because here we do not presuppose the number of objects in the image. We start with a fixed set of target categories, and our goal is to assign a class label and draw a bounding box each time one of these categories appears in the image.

In semantic segmentation, we assign a class label to each image pixel: all pixels belonging to grass are labeled "grass" and all pixels belonging to sheep are labeled "sheep". Notably, this task does not distinguish between two different sheep.

Our task is instance segmentation, which builds on object detection and semantic segmentation. As in object detection, our goal is to label and localize all instances of the target in predefined categories. However, instead of stopping at a bounding box for each detected target, we go further and identify which pixels belong to it, as in semantic segmentation. Unlike semantic segmentation, instance segmentation draws a separate mask for each target instance, whereas semantic segmentation uses the same mask for all instances of the same class.

In this article, we will train an instance segmentation model on a small Pascal VOC dataset, with only 1349 images used for training and 100 images for testing. The main challenge is to prevent the model from overfitting without using external data.

Data processing

The annotations are in COCO format, so we can use functions from pycocotools to retrieve class labels and masks. There are 20 categories in this dataset.
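
As a minimal sketch of what retrieving class labels from COCO-format annotations involves (using a hand-made toy annotation dict in plain Python rather than the real Pascal VOC file and pycocotools — category names and ids below are made up):

```python
# Minimal COCO-format parsing sketch: a toy annotation dict stands in
# for the real annotation file; ids and names here are illustrative.
coco = {
    "categories": [{"id": 1, "name": "sheep"}, {"id": 2, "name": "dog"}],
    "annotations": [
        {"image_id": 7, "category_id": 1, "bbox": [10, 20, 50, 40]},
        {"image_id": 7, "category_id": 2, "bbox": [60, 30, 30, 30]},
    ],
}

# Map category ids to names, then collect labels for one image.
id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
labels = [id_to_name[a["category_id"]]
          for a in coco["annotations"] if a["image_id"] == 7]
print(labels)  # ['sheep', 'dog']
```

With pycocotools, the same lookup is done through its COCO API against the real JSON file, which additionally decodes the polygon/RLE masks.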

Here are some visualizations of training images and their masks. Different shades within a mask represent separate masks for multiple instances of the same target category.

Images vary in size and aspect ratio, so we resize each image to 500x500 before feeding it into the model. When an image is smaller than 500, we resize it so that its longest edge is 500 pixels, and add the necessary zero padding to obtain a square image.
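
A rough sketch of this preprocessing in plain NumPy (nearest-neighbor resizing and top-left placement of the image on the canvas are simplifications; the actual implementation may differ):

```python
import numpy as np

def resize_and_pad(img, size=500):
    """Resize so the longest edge equals `size` (nearest-neighbor),
    then zero-pad the short side to get a square size x size image.
    A sketch of the preprocessing described above, not the exact code."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    # Nearest-neighbor resize via index sampling.
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[rows][:, cols]
    # Zero-pad to a square canvas (image placed at the top-left here).
    out = np.zeros((size, size) + img.shape[2:], dtype=img.dtype)
    out[:new_h, :new_w] = resized
    return out

img = np.ones((200, 400, 3), dtype=np.uint8)  # a 200x400 dummy image
padded = resize_and_pad(img)
print(padded.shape)  # (500, 500, 3)
```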

To make the model generalize well, especially on such a limited dataset, data augmentation is the key to overcoming overfitting. For each image we apply: a horizontal flip with probability 0.5, random cropping at a scale of 0.9 to 1, Gaussian blur with probability 0.5 and a random standard deviation, random contrast adjustment between 0.75 and 1.5, random brightness adjustment between 0.8 and 1.2, and a series of random affine transformations such as scaling, translation, rotation, and shearing.
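
Part of such a pipeline can be sketched in plain NumPy (only the flip, brightness, and contrast steps; a real pipeline would use an augmentation library for blur, crops, and affine transforms, and the parameter ranges below simply mirror the ones listed above):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, mask):
    """Pure-NumPy sketch of part of the augmentation described above:
    horizontal flip (p=0.5, applied to image and mask together),
    brightness in [0.8, 1.2], contrast in [0.75, 1.5]."""
    img = img.astype(np.float32)
    if rng.random() < 0.5:                                 # horizontal flip
        img, mask = img[:, ::-1], mask[:, ::-1]
    img *= rng.uniform(0.8, 1.2)                           # brightness
    mean = img.mean()
    img = (img - mean) * rng.uniform(0.75, 1.5) + mean     # contrast
    return np.clip(img, 0, 255).astype(np.uint8), mask

img = rng.integers(0, 256, (64, 64, 3), dtype=np.uint8)
mask = np.zeros((64, 64), dtype=np.uint8)
mask[10:30, 5:20] = 1
aug_img, aug_mask = augment(img, mask)
print(aug_img.shape, int(aug_mask.sum()))  # shapes and mask area preserved
```

Note that geometric transforms (flip, crop, affine) must be applied identically to the image and its masks, while photometric transforms (blur, brightness, contrast) are applied to the image only.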

Mask-RCNN

We use the Mask-RCNN implementation from matterport for training. Although the results might be better with them, we will not use MS COCO pre-trained weights, in order to show that good results can be obtained with only 1349 training images.

Mask-RCNN was proposed in the 2017 Mask-RCNN paper as an extension of Faster-RCNN by some of the same authors. Faster-RCNN is widely used for object detection; the model generates bounding boxes around the detected objects. Mask-RCNN goes further and also generates each object's mask.

I will briefly introduce the model architecture below.

First, we use a backbone model to extract relevant features from the input image. Here, we use the ResNet101 architecture as the backbone. The image is transformed from a (500, 500, 3) tensor into a (32, 32, 2048) feature map.

The previously extracted features are then fed into a region proposal network (RPN). The RPN scans regions of the feature map, called anchors, and tries to determine which regions contain a target. These anchors vary in size and aspect ratio. The RPN assigns a category to each anchor: foreground (positive anchor) or background (negative anchor). Neutral anchors are anchors that do not affect training.

Positive sample anchors (left), neutral anchors (middle), negative sample anchors (right)
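
As a toy illustration of how anchors of several scales and aspect ratios can be laid out over a feature map (the scales, ratios, and stride below are illustrative values, not the exact ones used by the model):

```python
import numpy as np

def generate_anchors(scales, ratios, feat_size, stride):
    """Sketch of RPN anchor generation: at each feature-map cell, place
    one anchor per (scale, ratio) pair, centered on the corresponding
    image location. Returns (y1, x1, y2, x2) boxes in image coordinates."""
    boxes = []
    for y in range(feat_size):
        for x in range(feat_size):
            cy, cx = y * stride, x * stride      # anchor center in the image
            for s in scales:
                for r in ratios:
                    h, w = s / np.sqrt(r), s * np.sqrt(r)
                    boxes.append([cy - h / 2, cx - w / 2,
                                  cy + h / 2, cx + w / 2])
    return np.array(boxes)

# E.g. a 32x32 feature map from a 500x500 image (stride 16),
# 3 scales x 3 aspect ratios = 9 anchors per cell.
anchors = generate_anchors(scales=[32, 64, 128],
                           ratios=[0.5, 1, 2],
                           feat_size=32, stride=16)
print(anchors.shape)  # (9216, 4) = 32*32*9 anchors
```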

The proposal layer then picks the anchors most likely to contain a target and refines the anchor boxes to fit the targets more closely. When too many anchors overlap, only the one with the highest foreground score is retained (non-maximum suppression). This gives us the regions of interest (ROI).
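
Greedy non-maximum suppression can be sketched as follows (a simplified stand-alone version, not the model's actual implementation):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop remaining boxes that overlap it above the IoU threshold, repeat.
    Boxes are (y1, x1, y2, x2)."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the best box with all remaining boxes.
        y1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        x1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        y2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        x2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, y2 - y1) * np.maximum(0, x2 - x1)
        area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
        iou = inter / (area(boxes[i:i+1])[0] + area(boxes[order[1:]]) - inter)
        order = order[1:][iou <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: box 1 is suppressed by box 0
```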

For each target region selected by the ROI classifier, the model generates a 28x28 mask. During training, the ground-truth mask is downscaled to compute the loss against the predicted mask; during inference, the predicted mask is upscaled to the size of the ROI bounding box.

Transfer learning

Transfer learning is the key to training a model faster and better, especially when data is limited. The ImageNet dataset is a huge corpus of natural images, similar to ours, so we can initialize the weights of the ResNet101 backbone with weights pre-trained on ImageNet. This improves the quality of the feature maps we obtain, and thus the whole model.

To fine-tune the model pre-trained on ImageNet, we first train only the model heads. We then train the layers from ResNet stage 4 and up for the remaining epochs. This training schedule also helps minimize overfitting: we do not need to fine-tune the first layers, because we can reuse the weights the model learned for extracting features from natural images.

Results: visualizing the detection pipeline

The mAP obtained on our test set is 0.53650. Here are some visualization results from the model output on randomly selected test images:

We can also look at the output of different steps of the algorithm. First, we have the scores of the top anchors before bounding-box refinement.

Next, we have the refined bounding boxes and the output of non-maximum suppression. These proposals are then fed into the classification network. Note that some boxes contain objects, such as logos, that do not fall within our defined target categories.

Running the classification network on the proposed regions yields the positive detections, generating class probabilities and bounding-box regressions.
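
The bounding-box regression step can be illustrated with a common delta parameterization (relative center shifts plus log-scale size factors; this parameterization is assumed here for illustration, not taken from the model's code):

```python
import numpy as np

def apply_deltas(boxes, deltas):
    """Sketch of bounding-box regression: refine (y1, x1, y2, x2) boxes
    with predicted deltas (dy, dx, dh, dw) -- center shifts are relative
    to box size, and height/width are scaled by exp(dh), exp(dw)."""
    h = boxes[:, 2] - boxes[:, 0]
    w = boxes[:, 3] - boxes[:, 1]
    cy = boxes[:, 0] + 0.5 * h
    cx = boxes[:, 1] + 0.5 * w
    # Apply the predicted refinement.
    cy += deltas[:, 0] * h
    cx += deltas[:, 1] * w
    h *= np.exp(deltas[:, 2])
    w *= np.exp(deltas[:, 3])
    return np.stack([cy - 0.5 * h, cx - 0.5 * w,
                     cy + 0.5 * h, cx + 0.5 * w], axis=1)

boxes = np.array([[0.0, 0.0, 10.0, 10.0]])
deltas = np.array([[0.1, 0.0, np.log(2.0), 0.0]])  # shift down, double height
print(apply_deltas(boxes, deltas))  # [[-4.  0. 16. 10.]]
```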

After the bounding boxes are obtained and refined, the instance segmentation model generates a mask for each detected target. The masks are soft masks (with floating-point pixel values) and are 28x28 in size during training.

Finally, the predicted masks are resized to their bounding boxes, and we can overlay them on the original image to visualize the final output.
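
This final mask-pasting step can be sketched in plain NumPy (nearest-neighbor upscaling and a 0.5 binarization threshold are assumptions for illustration):

```python
import numpy as np

def paste_mask(soft_mask, box, image_shape, threshold=0.5):
    """Sketch of the final step: upscale a 28x28 soft mask to its bounding
    box (nearest-neighbor), binarize it, and paste it into a full-image
    boolean mask. `box` is (y1, x1, y2, x2) in pixel coordinates."""
    y1, x1, y2, x2 = box
    h, w = y2 - y1, x2 - x1
    # Nearest-neighbor upscaling of the soft mask to the box size.
    rows = (np.arange(h) * soft_mask.shape[0] // h).clip(0, soft_mask.shape[0] - 1)
    cols = (np.arange(w) * soft_mask.shape[1] // w).clip(0, soft_mask.shape[1] - 1)
    resized = soft_mask[rows][:, cols] >= threshold
    full = np.zeros(image_shape, dtype=bool)
    full[y1:y2, x1:x2] = resized
    return full

soft = np.full((28, 28), 0.9)          # a dummy confident 28x28 soft mask
full = paste_mask(soft, box=(100, 150, 156, 206), image_shape=(500, 500))
print(int(full.sum()))  # 3136: every pixel of the 56x56 box is set
```

The resulting boolean mask can then be alpha-blended over the original image for visualization.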

Did you find the above content helpful? If you want to learn more about the topic or read more related articles, please follow the Internet Technology channel. Thank you for your support.
