

How to use Python to improve model performance on imbalanced data




This article explains how to use Python to improve model performance on imbalanced data. The explanation is simple, clear, and easy to follow; read on to study the techniques step by step.

Dataset

The training data has three labels, [1, 2, 3], so this is a multi-class classification problem. The training set has 17 features and 38,829 data points; the test set has 16,641 data points with 16 features and no labels. The training set is highly imbalanced: Class 1 accounts for about 95% of the data, while Class 2 and Class 3 account for about 3.0% and 0.87%, respectively.
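As a minimal sketch of inspecting this distribution (the file name train.csv and the column name label are hypothetical), the class proportions can be checked with pandas:

import pandas as pd

# Load the training data: 38,829 rows with 17 features plus a label column.
train = pd.read_csv("train.csv")

# Fraction of samples per class; roughly 0.95 / 0.030 / 0.0087 here.
print(train["label"].value_counts(normalize=True))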

Algorithm

After preliminary experiments, I decided to adopt the Random Forest (RF) algorithm because it outperformed Support Vector Machine, XGBoost, and LightGBM. RF was chosen for this project for several reasons:

Random forest is robust to overfitting;

Its parameterization remains intuitive;

There are many successful use cases of random forest on highly imbalanced datasets like this one;

I have previous experience implementing the algorithm.

To find the best parameters, I used scikit-learn's GridSearchCV to perform a grid search over specified parameter values; more details can be found on my GitHub.
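A sketch of that search follows; the parameter grid below is illustrative rather than the exact grid from the project, and X_train / y_train are assumed to be the prepared training features and labels:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# An illustrative grid; the project's actual grid may differ.
param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_macro",  # macro-averaged F1 treats all three classes equally
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_)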

To address the data imbalance problem, three techniques are used:

A. Use ensemble cross-validation (CV):

In this project, cross-validation was used to verify the robustness of the model. The entire dataset is divided into five subsets. In each round of cross-validation, four subsets are used for training and the remaining subset is used to validate the model; predictions are also made on the test data in each round. At the end of cross-validation, five sets of test prediction probabilities are obtained, and finally the probabilities are averaged for each class. The training performance of the model is stable, with consistent recall and F1 scores across the folds. This technique has also helped me achieve a very good score (top 1%) in Kaggle competitions. The following code snippet shows an implementation of ensemble cross-validation:
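A minimal sketch of this scheme, assuming X and y are the training features and labels as NumPy arrays and X_test is the prepared test data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
test_probas = []

for train_idx, val_idx in skf.split(X, y):
    model = RandomForestClassifier(n_estimators=500, random_state=42)
    model.fit(X[train_idx], y[train_idx])

    # Check recall/F1 stability on the held-out fold.
    print(classification_report(y[val_idx], model.predict(X[val_idx])))

    # Predict on the test set in every fold.
    test_probas.append(model.predict_proba(X_test))

# Average the five sets of class probabilities.
avg_proba = np.mean(test_probas, axis=0)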

B. Set class weights/importance:

Cost-sensitive learning is one way to make random forests more suitable for learning from highly imbalanced data. Random forests tend to favor the majority class, so a costly penalty for misclassifying the minority classes can be useful. Because this technique improves model performance, I assign high weights (i.e., higher misclassification costs) to the minority classes, and the class weights are then passed into the random forest algorithm. I determine the class weights from the ratio of the number of Class 1 samples to the number of samples in each other class: the ratio between Class 1 and Class 3 is approximately 110, while the ratio between Class 1 and Class 2 is approximately 26. I then slightly adjusted these values to improve model performance. The following code snippet shows the implementation of the different class weights:
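A sketch using the ratios quoted above as weights; the exact values were tuned slightly in the project, and X_train, y_train are assumed to be prepared:

from sklearn.ensemble import RandomForestClassifier

# Higher misclassification cost for the minority classes, based on the
# Class 1 : Class 2 (~26) and Class 1 : Class 3 (~110) ratios.
class_weight = {1: 1, 2: 26, 3: 110}

model = RandomForestClassifier(
    n_estimators=500,
    class_weight=class_weight,
    random_state=42,
)
model.fit(X_train, y_train)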

C. Over-predict a label rather than under-predict:

This technique is optional but was found to be very effective in improving performance on the minority classes. In short, the model is penalized most heavily when it misclassifies a Class 3 sample, and less heavily for Class 2 and Class 1. To implement this, I changed the probability threshold for each class, setting the thresholds for Class 3, Class 2, and Class 1 in increasing order (i.e., P3 = 0.25, P2 = 0.35, P1 = 0.50) so that the model is forced to over-predict the minority classes. A detailed implementation of the algorithm can be found on GitHub.
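A sketch of the thresholding rule with the values quoted above, assuming the probability columns are ordered as classes [1, 2, 3] (as scikit-learn's predict_proba returns them) and that avg_proba comes from the cross-validation sketch earlier:

import numpy as np

# Lower thresholds let the model over-predict the rarer classes.
THRESHOLDS = {3: 0.25, 2: 0.35, 1: 0.50}

def predict_with_thresholds(proba):
    preds = np.empty(len(proba), dtype=int)
    for i, p in enumerate(proba):
        if p[2] >= THRESHOLDS[3]:      # accept Class 3 first, at the lowest bar
            preds[i] = 3
        elif p[1] >= THRESHOLDS[2]:    # then Class 2
            preds[i] = 2
        else:                          # otherwise fall back to Class 1
            preds[i] = 1
    return preds

y_pred = predict_with_thresholds(avg_proba)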

Final results

The following results show how these three techniques can help improve model performance:

1. Results using ensemble cross-validation:

2. Results using ensemble cross-validation + class weights:

3. Results using ensemble cross-validation + class weights + over-predicted labels:

Thank you for reading. The above covers how to use Python to improve model performance on imbalanced data. After studying this article, I believe you have a deeper understanding of the topic; how well these techniques work in a specific situation still needs to be verified in practice. The editor will continue to publish more articles on related topics, so stay tuned!



