2025-02-22 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
Author & Editor | Guo Bingyang
1 Introduction
When using public datasets to build image classification models, have you ever noticed that the number of samples in each category is almost the same? This is deliberate: when class sizes differ little, the class distribution has little effect on the classifier, so benchmark results reflect the model's actual ability rather than artifacts of the data. Conversely, when the sample counts differ greatly between classes, learning is biased and the resulting classifier performs poorly. This is the class imbalance problem (Class Imbalance) discussed in this article.
Class imbalance refers to a large difference in the number of training samples across classes in a classification task. It is usually caused by the difficulty of collecting samples or by the natural rarity of some examples, and commonly appears in tasks such as disease diagnosis and fraud detection.
Class imbalance has been studied in detail in traditional machine learning, but in deep learning the related exploration went through a period of neglect before reviving as the field developed.
In the early days of the backpropagation algorithm, when research on deep learning was still immature, researchers already studied the influence of class sizes on gradient propagation. They concluded that classes with many samples dominate the weight updates in backpropagation: early in training this rapidly reduces the error rate of the majority classes, but as the iterations continue, the error rate of the minority classes increases [1].
In the following decade, limited computing resources and the difficulty of data collection prevented further exploration; only in recent years has class imbalance in deep learning been studied more deeply.
This article summarizes the current solutions, grouped into three aspects: the data level, the algorithm level, and hybrid data-and-algorithm methods, listing only representative approaches for the reader's reference.
2 Method summary
1. Data-level methods
Data-level methods mainly process the training dataset itself in order to reduce the impact of class imbalance.
Hensman et al. [2] proposed oversampling (over sampling): for classes with few samples, randomly selected images are duplicated and added back to the class until its size equals that of the largest class. Their experiments show that this method greatly improves the final classification results.
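The duplication step can be sketched in a few lines. This is a minimal illustration of random oversampling, not the code from [2]; the function name and plain-list interface are illustrative.

```python
import random
from collections import defaultdict

def oversample(samples, labels, seed=0):
    """Grow every class to the size of the largest class by duplicating
    randomly chosen samples, and return the balanced (samples, labels)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Keep the originals, then pad with random duplicates.
        padded = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(padded)
        out_y.extend([y] * target)
    return out_x, out_y
```

Note that the duplicates carry no new information; in practice they are usually combined with data augmentation so the copies are not pixel-identical.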
Lee et al. [3] proposed a two-phase (two-phase) training method. First, a threshold N is set according to the distribution of the dataset, usually the number of samples in the smallest class. Classes with more than N samples are randomly subsampled down to the threshold, and the resulting balanced subset is used to train the model in the first phase; the model parameters are saved. Finally, the first-phase model is used as a pre-trained starting point and trained on the whole dataset, which improves the final classification result to a certain extent.
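The phase-one subset construction can be sketched as follows. This is an illustrative sketch of the thresholded subsampling step only, under the assumption that the threshold defaults to the minority-class size; the names are not from [3].

```python
import random
from collections import defaultdict

def phase_one_subset(samples, labels, threshold=None, seed=0):
    """Cap every class at `threshold` samples (default: the size of the
    smallest class) to build the balanced phase-one training set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    n = threshold or min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        chosen = rng.sample(xs, n) if len(xs) > n else xs
        out_x.extend(chosen)
        out_y.extend([y] * len(chosen))
    return out_x, out_y
```

Phase two then simply continues training the saved model on the full, unbalanced dataset.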
Pouyanfar et al. [4] proposed dynamic sampling (dynamic sampling). Borrowing from the idea of oversampling, it dynamically adjusts the dataset according to intermediate training results: samples are randomly deleted from classes that are already classified well, and randomly duplicated for classes with poor results, so that the classification model keeps learning relevant information in every round.
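One way to realize "more data for weak classes" is to recompute each class's sample budget between epochs from its current score. This is a heavily simplified sketch of the idea, not the exact scheme in [4]; the inverse-score weighting and the `base` budget are assumptions for illustration.

```python
def dynamic_class_counts(class_score, base=100):
    """Allocate a per-class sample budget for the next epoch, weighted by
    (1 - score) so poorly classified classes receive more samples."""
    weights = {c: 1.0 - s for c, s in class_score.items()}
    total = sum(weights.values()) or 1.0
    return {c: max(1, round(base * w / total)) for c, w in weights.items()}
```

Between epochs, each class is then resampled (duplicated or trimmed) to its new budget before the next pass over the data.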
2. Algorithm-level methods
Algorithm-level methods improve existing deep learning algorithms, eliminating the impact of class imbalance by modifying the loss function or the learning procedure.
Wang et al. [5] proposed the mean squared false error (MSFE) loss, an improvement on the mean false error (MFE) loss. With FPE denoting the mean error over the negative samples and FNE the mean error over the positive samples, MFE is defined as FPE + FNE, and MSFE as FPE² + FNE².
Because squaring penalizes whichever of the two terms is larger, MSFE loss balances the errors on positive and negative examples well, achieving better optimization results.
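The two losses can be computed as follows. This is a hedged plain-Python sketch for binary targets in {0, 1} with per-sample squared error; the exact per-sample error term in [5] may differ by a constant factor.

```python
def mfe_msfe(y_true, y_prob):
    """Return (MFE, MSFE) for binary labels y_true in {0, 1} and
    predicted probabilities y_prob.
    FPE = mean squared error over negatives, FNE = over positives;
    MFE = FPE + FNE, MSFE = FPE**2 + FNE**2."""
    neg = [(p - t) ** 2 for t, p in zip(y_true, y_prob) if t == 0]
    pos = [(p - t) ** 2 for t, p in zip(y_true, y_prob) if t == 1]
    fpe = sum(neg) / len(neg)
    fne = sum(pos) / len(pos)
    return fpe + fne, fpe ** 2 + fne ** 2
```

Unlike plain mean squared error over all samples, both formulations average within each class first, so the majority class cannot drown out the minority class's contribution to the loss.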
Buda et al. [6] proposed output thresholding (output thresholding), which counteracts class imbalance by adjusting the decision thresholds applied to the network outputs. Based on the composition of the dataset and the output probability values, the model designer manually sets reasonable thresholds, lowering the score required for classes with few samples so that the prediction results become more reasonable.
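A common concrete variant is to divide each class's output probability by its training-set prior before taking the argmax, so rare classes need a lower raw score to win. This is one illustrative instance of output thresholding, not necessarily the exact rule used in [6].

```python
def thresholded_argmax(probs, priors):
    """Pick the class with the highest prior-corrected score.
    probs:  predicted probability per class (dict: class -> float)
    priors: training-set class frequency per class (dict: class -> float)"""
    scores = {c: probs[c] / priors[c] for c in probs}
    return max(scores, key=scores.get)
```

With `probs = {"common": 0.6, "rare": 0.4}` and `priors = {"common": 0.9, "rare": 0.1}`, the corrected scores are about 0.67 and 4.0, so the rare class is predicted despite its lower raw probability.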
3. Hybrid data-and-algorithm methods
The methods at the two levels above each achieve good improvements; if the two ideas are combined, can the results be improved further?
Huang et al. [7] proposed Large Margin Local Embedding (LMLE), which uses quintuplet sampling (quintuplet sampling) and a triple-header hinge loss to extract better sample features, then feeds those features into a modified k-NN classification model to achieve better clustering results. In addition, Dong et al. [8] combine the idea of hard-example mining with a class rectification loss, likewise improving both the data and the loss function.
Due to limited space and time, this article lists only typical solutions for each category. A more complete collection of the relevant literature on the class imbalance problem can be found in the survey [9].
3 References
[1] Anand R, Mehrotra KG, Mohan CK, Ranka S. An improved algorithm for neural network classification of imbalanced training sets. IEEE Trans Neural Netw. 1993;4(6):962-9.
[2] Hensman P, Masko D. The impact of imbalanced training data for convolutional neural networks. 2015.
[3] Lee H, Park M, Kim J. Plankton classification on imbalanced large scale database via convolutional neural networks with transfer learning. In: 2016 IEEE international conference on image processing (ICIP). 2016. P. 3713-7.
[4] Pouyanfar S, Tao Y, Mohan A, Tian H, Kaseb AS, Gauen K, Dailey R, Aghajanzadeh S, Lu Y, Chen S, Shyu M. Dynamic sampling in convolutional neural networks for imbalanced data classification. In: 2018 IEEE conference on multimedia information processing and retrieval (MIPR). 2018. P. 112-7.
[5] Wang S, Liu W, Wu J, Cao L, Meng Q, Kennedy PJ. Training deep neural networks on imbalanced data sets. In: 2016 international joint conference on neural networks (IJCNN). 2016. P. 4368-74.
[6] Buda M, Maki A, Mazurowski MA. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 2018;106:249-59.
[7] Huang C, Li Y, Loy CC, Tang X. Learning deep representation for imbalanced classification. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR). 2016. P. 5375-84.
[8] Dong Q, Gong S, Zhu X. Imbalanced deep learning by minority class incremental rectification. IEEE Trans Pattern Anal Mach Intell. 2018. p. 1-1.
[9] Johnson JM, Khoshgoftaar TM. Survey on deep learning with class imbalance. J Big Data. 2019;6:27.
Summary
The above are the main solutions to the class imbalance problem. For details, read the survey [9]; I believe you will gain more by reading the more detailed articles!
https://www.toutiao.com/a6727841366342107655/