

Baidu PaddlePaddle Open Source Video Classification Model: Attention Cluster

2025-01-16 Update From: SLTechnology News&Howtos



Attention Cluster Model

Video classification has a wide range of applications in video tagging, surveillance, autonomous driving, and other fields, and it remains one of the important challenges in computer vision.

Most video classification approaches are based on CNNs or RNNs. CNNs are known to have played a significant role in image tasks: they have strong feature extraction ability and, through convolution and pooling layers, can extract features from different regions of an image. RNNs, in turn, are good at capturing time-dependent features.

Attention Cluster is designed using only CNN models, without RNNs, mainly because of the following characteristics of video:

Figure 1: Analysis of video frames

First, successive frames of a video often look very similar. As you can see in Figure 1 (top), the frames are almost identical except for the action of the shot. Therefore, for classification it may be sufficient to treat these similar features as a whole, without tracking their detailed changes over time.

Second, local features in video frames are sometimes sufficient to express the category of the video. For example, in Figure 1 (middle), the action of brushing teeth can be recognized from a few local features, such as the toothbrush and the sink. For classification, then, the key lies in finding key local features in frames rather than in finding temporal cues.

Finally, for some videos, the temporal order of frames is not necessarily important for classification. For example, in Figure 1 (bottom), even though the frame order is shuffled, it is still clear that the video belongs to the "pole vault" category.

Based on these considerations, the model does not rely on time-dependent cues but instead uses the Attention mechanism, which has the following benefits:

1. The output of Attention is essentially a weighted average, which avoids redundancy caused by some duplicate features.

2. Attention can assign higher weights to key local features, which improves classification by emphasizing these features.

3. The input to Attention is an unordered collection of arbitrary size. The lack of ordering matches the observations above, and support for arbitrarily sized inputs improves the generalization of the model.

Of course, the local features of a video have another property: the evidence for a category may consist of multiple parts. For the pole vault in Figure 1 (bottom), for example, running, jumping, and landing all contribute to the classification at the same time. If only a single Attention unit is used, only a single piece of key information can be captured; with multiple Attention units, more useful information can be extracted. This is how Attention Cluster was born. During implementation, Baidu's computer vision team also found that a simple and effective "shifting operation" applied to the different Attention units increases their diversity and thus improves accuracy.

Let's look at the structure of the entire Attention Cluster.

The whole model can be divided into three parts:

1. Local feature extraction. Features are extracted from the video by a CNN model. The extracted features are denoted X, as shown in Formula (1):

X = [x_1, x_2, \ldots, x_L]    (1), i.e., X is a collection of L local features extracted from the video.

2. Local feature integration. Global features are obtained with Attention. The output of an Attention unit is essentially a weighted average, as shown in Formula (2): v is the global feature produced by one Attention unit, and a is the weight vector, computed by two fully connected layers, as shown in Formula (3). In practice, v is produced with a shifting operation, as shown in Formula (4), where α and β are learnable scalars. The shifting operation adds an independent learnable linear transformation to the output of each Attention unit and then applies L2 normalization, so that each unit tends to learn a different component of the features; this lets Attention Cluster fit differently distributed data better and improves the representational ability of the whole network. Because a cluster of Attention units is used, the outputs of all units are combined into the global feature g, as shown in Formula (5), where N is the number of Attention units. The referenced formulas are written out below.
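Reconstructed from the description above (the notation follows the Attention Clusters paper; in particular, the exact parameterization of the two fully connected layers in Formula (3) is an assumption rather than something stated in this article):

v = a X = \sum_{i=1}^{L} a_i x_i    (2)

a = \mathrm{softmax}\big( w_2 \tanh( W_1 X^{T} + b_1 ) + b_2 \big)    (3)

\tilde{v} = \frac{\alpha\, v + \beta}{\lVert \alpha\, v + \beta \rVert_2}    (4)

g = [\, \tilde{v}_1, \tilde{v}_2, \ldots, \tilde{v}_N \,]    (5)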

3. Global feature classification. After concatenating the global features from all Attention units, the final single-label or multi-label classification is performed with conventional fully connected layers followed by Softmax or Sigmoid.
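To make the formulas concrete, here is a minimal numpy sketch of one Attention unit with the shifting operation, plus the concatenation performed by the cluster. This is an illustration only, not the official PaddlePaddle implementation; in particular, the two fully connected layers of Formula (3) are collapsed into a single score vector w for brevity.

import numpy as np

def shifting_attention(X, w, alpha, beta):
    # X: (L, D) matrix of L local features; w: (D,) score weights standing in
    # for the two fully connected layers; alpha, beta: learnable scalars.
    scores = X @ w
    a = np.exp(scores - scores.max())
    a = a / a.sum()                            # softmax -> attention weights (Formula 3)
    v = a @ X                                  # weighted average of local features (Formula 2)
    shifted = alpha * v + beta                 # independent linear transform of this unit's output
    return shifted / np.linalg.norm(shifted)   # L2 normalization (Formula 4)

# An Attention Cluster concatenates the outputs of N units (Formula 5).
L, D, N = 100, 1024, 32
X = np.random.randn(L, D)
units = [(np.random.randn(D), 1.0, 0.0) for _ in range(N)]
g = np.concatenate([shifting_attention(X, w, a, b) for (w, a, b) in units])
print(g.shape)   # (N * D,), the global feature fed to the classification layers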

Training Attention Cluster with PaddlePaddle

PaddlePaddle's open-source Attention Cluster model uses the 2nd-Youtube-8M dataset, whose frame-level features have already been extracted with an InceptionV3 model pretrained on ImageNet.

To run the sample code for this model, PaddlePaddle Fluid v1.2.0 or later is required.

Data preparation: First download the training set and test set using the official links provided by Youtube-8M, or download them with the official script. Once the download is complete, there will be 3844 training data files and 3844 validation data files (in TFRecord format). To use them for PaddlePaddle training, the downloaded TFRecord files need to be converted to pickle format; use the script dataset/youtube8m/tf2pkl.py provided by PaddlePaddle to perform the conversion.

Training Set: us.data.yt8m.org/2/frame/train/index.html

Test Set: us.data.yt8m.org/2/frame/validate/index.html

Official script: research.google.com/youtube8m/download.html
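As an illustration only (this article does not document the script's command-line arguments, so the paths below are hypothetical placeholders), the conversion step could look like:

python dataset/youtube8m/tf2pkl.py ./yt8m_tfrecord ./yt8m_pkl   # hypothetical arguments: source TFRecord dir, target pickle dir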

Model training: Once the data is ready, training can be started as follows (Method 1); a quick-start script is also provided (Method 2):

#Method 1

#Method 2
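As a rough sketch, assuming the usual layout of the PaddleCV video models repository (the flag names, config path, and training script path here are assumptions rather than commands confirmed by this article), the two methods might look like:

# Method 1: start training directly
python train.py --model_name=AttentionCluster --config=./configs/attention_cluster.txt

# Method 2: quick-start script
bash ./scripts/train/train_attention_cluster.sh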

Users can also download the published models from the PaddlePaddle GitHub repository and specify the weight storage path through the --resume option for fine-tuning.

Data preprocessing: the model reads the pre-extracted rgb and audio features of the Youtube-8M dataset. For each video, 100 frames are sampled evenly; this value is specified by the seg_num parameter in the configuration file.
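A minimal numpy sketch of this uniform sampling (illustrative only; in practice it is applied to the pre-extracted rgb and audio features):

import numpy as np

def sample_frames(features, seg_num=100):
    # features: (num_frames, feature_dim) array of frame-level features for one video.
    # Pick seg_num indices spread evenly across the whole video.
    num_frames = features.shape[0]
    idx = np.linspace(0, num_frames - 1, seg_num).astype(int)
    return features[idx]

video = np.random.randn(240, 1024)        # e.g. 240 frames of 1024-d rgb features
clip = sample_frames(video, seg_num=100)  # shape: (100, 1024)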

Model settings: The main configurable parameters of the model are cluster_nums and seg_num. cluster_nums is the number of Attention units. With cluster_nums=32 and seg_num=100, batch_size=256 can be run on a single Nvidia Tesla P40 card.

Training strategies:

Using Adam Optimizer, initial learning_rate=0.001

Weight decay is not used during training

Parameters are primarily initialized using MSRA initialization (see the sketch below)
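A minimal sketch of this setup with the PaddlePaddle Fluid API (the layer below is purely illustrative and stands in for the model's own layers; it only shows how MSRA initialization and the Adam optimizer would be wired up):

import paddle.fluid as fluid

# Illustrative input, standing in for the concatenated global features.
feature = fluid.layers.data(name='feature', shape=[4096], dtype='float32')

# A fully connected layer whose weights use MSRA initialization.
fc_out = fluid.layers.fc(
    input=feature,
    size=1024,
    act='relu',
    param_attr=fluid.ParamAttr(initializer=fluid.initializer.MSRA()))

# Adam optimizer with the stated initial learning rate; no weight decay is added.
optimizer = fluid.optimizer.Adam(learning_rate=0.001)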

Model evaluation: Evaluation can be performed in the following way (Method 1); a quick-start script is also provided (Method 2):

#Method 1

#Method 2
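As a sketch only (the test entry point and flag names are assumptions; the quick-start script is the one named just below):

# Method 1: run evaluation directly
python test.py --model_name=AttentionCluster --config=./configs/attention_cluster.txt --weights=$PATH_TO_WEIGHTS

# Method 2: quick-start script
bash ./scripts/test/test_attention_cluster.sh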

When evaluating with scripts/test/test_attention_cluster.sh, you need to modify the --weights parameter in the script to specify the weights you want to evaluate.

If the --weights parameter is not specified, the script downloads the published models for evaluation.

Model inference: Inference can be performed with the following command:
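A hypothetical invocation (the inference entry point and flag names are assumptions) might look like:

python infer.py --model_name=AttentionCluster --config=./configs/attention_cluster.txt --weights=$PATH_TO_WEIGHTS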

The model inference results are stored in AttentionCluster_infer_result in pickle format.

If the --weights parameter is not specified, the script downloads the published model for inference.

Model accuracy: With the following parameters, the metrics on the Youtube-8M dataset are:

Parameter value:

Evaluation accuracy:
