What are the four SOTA methods on the Cityscapes semantic segmentation dataset?


What are the four SOTA methods on the Cityscapes semantic segmentation dataset? This article analyzes the question in detail and introduces the corresponding methods, hoping to help readers facing the same problem find a simple and feasible answer.

1 Introduction to the Cityscapes dataset

The Cityscapes evaluation dataset, i.e., the urban landscape dataset, was initiated and released by Mercedes-Benz in 2015 and is currently recognized as one of the most authoritative and professional image segmentation datasets in the field of machine vision. Cityscapes contains 5000 finely annotated images of driving scenes in urban environments (2975 train, 500 val, 1525 test). It has dense pixel annotations (97% coverage) for 19 categories, 8 of which have instance-level segmentation. The specific category names are shown in Table 1 below.

Table 1 Category names in the Cityscapes dataset

2 Deep High-Resolution Representation Learning for Visual Recognition (HRNet)

2.1 Motivation

Current semantic segmentation methods face three challenges. The first is that FCN-based methods lose information as the resolution goes from high to low.

Semantic segmentation requires high-resolution features. Figure 1 shows several classical FCN-based methods. What they have in common is that the network first produces a low-resolution feature map, which is then restored to high resolution by upsampling or deconvolution.

Fig. 1 Several classical FCN-based structures

These architectures look different, but their core ideas are essentially the same, and they share a common disadvantage: going from high resolution to low resolution loses information!

2.2 Model structure and core code

To solve the problem described in 2.1, the authors' team (MSRA and the Chinese Academy of Sciences) proposed a method whose core idea is "do not restore high resolution, maintain high resolution." Figure 2 below shows a basic high-resolution-maintaining network structure: feature maps of different resolutions run in parallel, each branch keeping a single resolution and different branches holding different resolutions, with paths (the slashes in the figure) added between branches to form the high-resolution network.

Figure 2 basic high-resolution network structure

The network in figure 2 consists of four stages; each blue-background region is one stage. The SOTA method uses HRNet-W48, whose structure diagram is shown in figure 3.

Fig. 3 HRNet-W48 structure diagram

HRNetV2-W48 is composed of four stages (the blue, green, red, and yellow background areas in figure 3), with a stem net at the head (the white background area in figure 3) and a segmentation head at the tail (not shown in the figure). The stem net, the four stages, and the segmentation head are introduced in order below.

(1) stem net

The stem net consists of two Bottleneck blocks with the same structure as those in ResNet. After the two Bottlenecks, the dimensions of the input image change from H × W × 3 to (H/4) × (W/4) × 256.
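As a rough illustration, here is a minimal PyTorch sketch of such a stem: two stride-2 ResNet-style Bottlenecks that quarter the spatial resolution and output 256 channels. The layer widths are illustrative assumptions, not the official HRNet configuration.

import torch.nn as nn

class Bottleneck(nn.Module):
    # ResNet-style bottleneck: 1x1 reduce -> 3x3 (optionally strided) -> 1x1 expand
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch))
        self.skip = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride, bias=False), nn.BatchNorm2d(out_ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))

# stem: H x W x 3 -> (H/4) x (W/4) x 256 via two stride-2 Bottlenecks
stem = nn.Sequential(Bottleneck(3, 16, 64, stride=2), Bottleneck(64, 64, 256, stride=2))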

(2) The four stages

The components of each stage are configured as shown in Table 2 below, taking hrnet_48 as an example.

Stages are connected by a transition_layer, and each stage is composed of repeated basic units called HighResolutionModule.

A HighResolutionModule consists of branches and the fuse_layers at the end of the branches.

Each branch consists of repeated BasicBlocks, as shown in Table 2.

Table 2 HRNet-W48 model configuration table

A: transition layers between stages: they perform the channel conversion and downsampling between stages, i.e., the straight lines and slashes connecting the regions of different background colors in figure 3. A straight line means no processing is applied, while a slash creates a new lower-resolution branch.

Figure 4 Code for building the transition layers between stages
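The code figure may not reproduce well, so here is a minimal sketch in the spirit of the HRNet implementation (function name and structure are assumptions): a 3×3 conv converts channels when a branch keeps its resolution but changes width, nothing is applied when the widths already match, and a stride-2 conv creates each new lower-resolution branch.

import torch.nn as nn

def make_transition_layer(prev_channels, next_channels):
    # prev_channels / next_channels: per-branch widths of the adjacent stages
    layers = []
    for i, out_ch in enumerate(next_channels):
        if i < len(prev_channels):
            if prev_channels[i] != out_ch:
                # same resolution, different width: 3x3 conv for channel conversion
                layers.append(nn.Sequential(
                    nn.Conv2d(prev_channels[i], out_ch, 3, 1, 1, bias=False),
                    nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)))
            else:
                layers.append(nn.Identity())  # the straight line: no processing
        else:
            # new branch (the slash): stride-2 conv from the lowest-resolution branch
            layers.append(nn.Sequential(
                nn.Conv2d(prev_channels[-1], out_ch, 3, 2, 1, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)))
    return nn.ModuleList(layers)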

B: building a stage

Each stage is made up of several repeated HighResolutionModules, so the core of building a stage is building the HighResolutionModule. That consists of two steps: building the branches and building the fuse_layers at the end of the branches.

Building the branches: the four consecutive BasicBlocks in figure 3 form one branch.

Figure 5. Branch build code within HighResolutionModule
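A minimal sketch of branch construction (BasicBlock here is the standard two-conv residual block; the helper names are illustrative):

import torch.nn as nn

class BasicBlock(nn.Module):
    # standard ResNet basic block: two 3x3 convs with a residual connection
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, 1, 1, bias=False), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, 1, 1, bias=False), nn.BatchNorm2d(ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)

def make_branch(num_blocks, channels):
    # one branch = num_blocks consecutive BasicBlocks at a fixed resolution
    return nn.Sequential(*[BasicBlock(channels) for _ in range(num_blocks)])

branches = nn.ModuleList(make_branch(4, c) for c in (48, 96, 192, 384))  # HRNet-W48 widths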

Building the fuse_layers:

The blue box in the figure below illustrates the processing performed by the fuse_layers:

Figure 6 fuse_layers processing

Figure 6 Code for building the fuse_layers in HighResolutionModule
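A minimal sketch of the fusion rule, assuming the usual HRNet scheme: to feed branch j into output i, a lower-resolution branch (j > i) is channel-converted with a 1×1 conv and upsampled, a higher-resolution branch (j < i) is downsampled with repeated stride-2 3×3 convs, and the matched inputs are summed.

import torch.nn as nn

def make_fuse_layers(channels):
    # channels[j] = width of branch j; branch j has spatial resolution /2**j.
    # Output i is then computed as the sum over j of fuse_layers[i][j](x_j).
    n = len(channels)
    fuse_layers = nn.ModuleList()
    for i in range(n):
        row = nn.ModuleList()
        for j in range(n):
            if j > i:    # lower resolution: 1x1 conv, then upsample by 2**(j-i)
                row.append(nn.Sequential(
                    nn.Conv2d(channels[j], channels[i], 1, bias=False),
                    nn.BatchNorm2d(channels[i]),
                    nn.Upsample(scale_factor=2 ** (j - i), mode='bilinear', align_corners=False)))
            elif j == i:
                row.append(nn.Identity())
            else:        # higher resolution: (i - j) stride-2 3x3 convs
                convs, in_ch = [], channels[j]
                for k in range(i - j):
                    out_ch = channels[i] if k == i - j - 1 else in_ch
                    convs += [nn.Conv2d(in_ch, out_ch, 3, 2, 1, bias=False),
                              nn.BatchNorm2d(out_ch)]
                    in_ch = out_ch
                row.append(nn.Sequential(*convs))
        fuse_layers.append(row)
    return fuse_layers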

3 Object-Contextual Representations for Semantic Segmentation (OCR)

3.1 Motivation

The second of the three challenges faced by current semantic segmentation methods is that they do not handle object context information well.

Contextual features: no pixel in an image exists in isolation. A pixel always has some relationship with the pixels around it, and large numbers of interrelated pixels make up the various objects in the image, so contextual features refer to these relationships between a pixel and its surroundings. In image semantic segmentation, when judging which category the pixel at a given position belongs to, not only the value of that pixel but also its neighboring pixels should be fully considered.

Figure 7 shows how current methods model context information. The red dot is the pixel of interest, and the surrounding green dots are sampled as its context. The green dots fall into two groups, one belonging to the car and the other to the background, but current methods make no distinction between them.

Figure 7 context information diagram

So what should be done instead? When classifying a pixel of an object, the representations of the surrounding pixels belonging to that same object should provide the help. Therefore, the pixels around the red pixel that belong to the object are taken as its context, as shown in figure 8 below:

Fig. 8 object area context information diagram

3.2 Model structure and core code

Core idea: OCR proposes a new relational context method. Starting from a coarse segmentation result, it enhances the description of each pixel's features by learning the relationship between pixels and object region features. The model structure is shown in the figure below.

Fig. 9 OCR model structure diagram

Calculation steps:

STEP1: obtain the coarse segmentation result.

The last output feature map (FM) of the backbone is passed through a set of conv operations, and a cross-entropy loss is computed on the result.
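As an illustration (the helper name and layer sizes are assumptions, not the official OCR code), such a coarse-segmentation head might look like:

import torch.nn as nn

def make_coarse_seg_head(in_channels, num_classes):
    # conv-bn-relu followed by a 1x1 conv down to num_classes channels;
    # trained with an auxiliary cross-entropy loss against the ground truth
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels, 3, padding=1, bias=False),
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, num_classes, 1))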

STEP2: obtain the object region features.

As figure 9 shows, this step multiplies the tensors of two branches:

Tensor1: the pixel representations, i.e., the last-layer FM of the backbone, with dimension b × c × h × w reshaped to b × c × hw.

Tensor2: the soft object regions, i.e., the FM after softmax, with dimension b × k × h × w reshaped to b × k × hw.

Multiplying Tensor1 and Tensor2 yields an output of dimension b × k × c, which is the object region feature representation in figure 9.

Fig. 10 Object region feature calculation code
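A minimal sketch of this step (omitting OCR's 1×1 projection convs for brevity):

import torch
import torch.nn.functional as F

def object_region_features(feats, coarse_logits):
    # feats: pixel representations, b x c x h x w
    # coarse_logits: coarse segmentation from STEP1, b x k x h x w
    b, c = feats.shape[:2]
    k = coarse_logits.shape[1]
    pixels = feats.view(b, c, -1)                             # b x c x hw
    regions = F.softmax(coarse_logits.view(b, k, -1), dim=2)  # soft object regions, b x k x hw
    return torch.matmul(regions, pixels.permute(0, 2, 1))     # b x k x c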

STEP3: obtain the pixel-region relation.

As figure 9 shows, this step multiplies the tensors of two branches:

Tensor1: the pixel representations, i.e., the last-layer FM of the backbone, with dimension b × c × h × w reshaped to b × c × hw.

Tensor2: the object region features from STEP2, with dimension b × k × c.

In the code, both tensors are first projected to key_channels, giving dimensions b × key × hw and b × key × k respectively. After multiplying the two (with a transpose), the pixel-region relation has dimension b × k × h × w.

STEP4: compute the final object contextual representation.

As figure 9 shows, this step multiplies the tensors of two branches:

Tensor1: the pixel-region relation obtained in STEP3, with dimension b × k × h × w.

Tensor2: the object region features from STEP2, with dimension b × k × c.

Multiplying the two yields the object contextual features, i.e., the red block in figure 9.

Figure 11 Related code for STEP2-STEP4
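And a matching sketch of STEP3 and STEP4 (again omitting the key/query projection convs, so the shared dimension is c rather than key_channels):

import torch
import torch.nn.functional as F

def object_contextual_representation(feats, region_feats):
    # feats: pixel representations, b x c x h x w
    # region_feats: object region features from STEP2, b x k x c
    b, c, h, w = feats.shape
    pixels = feats.view(b, c, -1).permute(0, 2, 1)            # b x hw x c
    # STEP3: pixel-region relation, softmax-normalized over the k regions
    relation = F.softmax(torch.matmul(pixels, region_feats.permute(0, 2, 1)), dim=2)  # b x hw x k
    # STEP4: per-pixel weighted sum of region features
    context = torch.matmul(relation, region_feats)            # b x hw x c
    return context.permute(0, 2, 1).view(b, c, h, w)          # b x c x h x w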

4 SegFix: Model-Agnostic Boundary Refinement for Segmentation (SegFix)

4.1 Motivation

The third challenge faced by FCN-based methods is inaccurate edge segmentation. Figure 12 below shows error maps of segmentation results: the first column shows the ground-truth segmentation, and the second/third/fourth columns show the error maps of DeepLabv3 / HRNet / Gated-SCNN, respectively. These examples are cropped from the Cityscapes val set. All three methods make many errors along thin boundaries.

Fig. 12 error diagram of model segmentation result

The work starts from an empirical observation: the label predictions for interior pixels are more reliable, so replacing the initially unreliable predictions of boundary pixels with the predictions of interior pixels should improve the model's edge segmentation. A novel model-agnostic post-processing mechanism is therefore proposed, which refines the segmentation result by replacing the label of each boundary pixel with the label of a corresponding interior pixel, thus reducing boundary errors.

4.2 Model structure and core code

Given the description in 4.1, two questions naturally arise: (1) how to identify the boundary pixels, and (2) how to associate boundary pixels with interior pixels. This is done with the help of an edge prediction branch and a direction prediction branch. Once good boundary and direction predictions are obtained, they can be used directly to refine the segmentation maps predicted by existing methods. The remaining question is how to turn the predicted directions at edge pixels into an actual refinement; this is handled by a coordinate offset branch. These three branches make up the main structure of SegFix, shown in figure 13.

Fig. 13 SegFix model structure diagram

Edge prediction branch:

Direction prediction branch:

Get the true value:

Coordinate offset branch:
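The code figures for these branches are not reproduced here, but the overall refinement can be sketched as follows (a hypothetical illustration: the 8-direction offset table, function names, and single-step offsets are assumptions, not the official SegFix implementation):

import torch

# map each discretized direction to a (dy, dx) unit offset
OFFSETS = torch.tensor([[0, 1], [-1, 1], [-1, 0], [-1, -1],
                        [0, -1], [1, -1], [1, 0], [1, 1]])

def segfix_refine(seg, boundary_mask, direction):
    # seg: h x w predicted label map; boundary_mask: h x w bool from the edge
    # branch; direction: h x w indices in [0, 8) from the direction branch
    h, w = seg.shape
    refined = seg.clone()
    ys, xs = torch.nonzero(boundary_mask, as_tuple=True)
    dy = OFFSETS[direction[ys, xs], 0]
    dx = OFFSETS[direction[ys, xs], 1]
    ny = (ys + dy).clamp(0, h - 1)
    nx = (xs + dx).clamp(0, w - 1)
    refined[ys, xs] = seg[ny, nx]  # copy the interior pixel's label onto the boundary pixel
    return refined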

5 Hierarchical Multi-Scale Attention for Semantic Segmentation

5.1 Motivation

Large objects are segmented better on lower-resolution feature maps, while small objects need fine detail to infer the segmentation result, so their predictions are better on higher-resolution feature maps. The paper also gives an example analyzing the reasons for this, as shown in the following figure.

Fig. 12 Segmentation performance of objects of different sizes at different resolutions

Therefore, this paper uses an attention mechanism to let the network learn how best to combine multi-scale inference and prediction. A very intuitive approach is to feed in images of different resolutions and let the network learn which kinds of objects should be handled at which resolution.

5.2 Model structure

Fig. 13 hierarchical multi-scale attention mechanism

Training stage:

The attention mechanism proposed here is similar to an earlier method (Attention to Scale: Scale-aware Semantic Image Segmentation; the method on the left in figure 13): a dense mask is learned for each scale, and the multi-scale predictions are combined by pixel-wise multiplication of each scale's prediction with its mask, followed by pixel-wise summation across scales.

In the hierarchical method of this paper, relative attention masks between adjacent scales are learned instead of an attention mask for every scale in a fixed scale set, and during training only adjacent scale pairs are trained. As shown in figure 13 above, given feature maps from the lower scale, the network predicts a dense relative attention between the two image scales. In the experiments, each scaled image pair is obtained by taking an input image and downsampling it by a factor of 2, so that there is a 1x input and a 0.5x input; other downsampling factors could also be chosen. Note that the network input is itself a rescaled version of the original training image, because image scale augmentation is used during training. This lets the network learn to predict relative attention for a range of image scales.

During training, the given input image is scaled by a factor r, where r = 0.5 means downsampling by a factor of 2, r = 2.0 means upsampling by a factor of 2, and r = 1 means no change. Training uses r = 0.5 and r = 1.0. For training and inference on these two scales, taking U as the bilinear upsampling operation and ∗ and + as pixel-wise multiplication and addition respectively, the equation can be formalized as follows:
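The equation image is missing here; based on the definitions just given (and the two-scale form used in the paper), a plausible reconstruction is

\hat{p} = U(\alpha_{0.5} \ast p_{0.5}) + (1 - U(\alpha_{0.5})) \ast p_{1.0}

where p_r is the prediction at scale r and \alpha_{0.5} is the attention mask predicted from the 0.5x features.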

The attention weight α in the above formula is computed as follows:

1) Take the output of the OCR module, i.e., the blue block in figure 9.

2) Pass it through several consecutive conv-bn-relu layers to obtain a single-channel map of dimension b × 1 × h × w.

3) Apply a sigmoid to this b × 1 × h × w map to obtain the attention weight α.
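A minimal sketch of such an attention head, assuming it follows the conv-bn-relu stack described above with a sigmoid at the end (layer sizes are illustrative):

import torch.nn as nn

class ScaleAttentionHead(nn.Module):
    def __init__(self, in_ch, mid_ch=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, 1, 1),  # single-channel attention logits
            nn.Sigmoid())             # alpha in (0, 1)

    def forward(self, ocr_feats):
        return self.head(ocr_feats)  # b x 1 x h x w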

Reasoning stage:

In the inference phase, the learned attention is applied hierarchically to combine the predictions of N different scales. The combination starts from the lower scales and gradually rises to the higher ones, because the lower scales carry more global context information, while the predictions of higher scales can refine the regions that need improvement.

In multi-scale inference, the combination uses the scales {2.0, 1.5, 1.0, 0.5}.
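A sketch of the pairwise combination rule applied between each pair of adjacent scales (a direct reading of the equation above, not the reference implementation):

import torch.nn.functional as F

def combine_adjacent_scales(p_low, alpha_low, p_high):
    # p_low: logits predicted at the lower scale; alpha_low: its relative
    # attention mask (b x 1 x h x w); p_high: logits at the next higher scale
    size = p_high.shape[-2:]
    up = lambda t: F.interpolate(t, size=size, mode='bilinear', align_corners=False)
    return up(alpha_low * p_low) + (1.0 - up(alpha_low)) * p_high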

This is the answer to the question of what the four SOTA methods on the Cityscapes semantic segmentation dataset are. I hope the above content has been of some help. If you still have questions, you can follow the industry information channel for more related knowledge.
