How to improve a classifier with Python


This article explains how to improve a classifier with Python using self-training. The method introduced here is simple, fast, and practical, and interested readers may wish to follow along.

In machine learning classification tasks, the more data available to train the algorithm, the better. In supervised learning, that data must be labeled with the target class; otherwise the algorithm cannot learn the relationship between the independent variables and the target variable. However, building a large labeled dataset for classification raises two problems:

Labeling data can be time-consuming. Suppose we have 1,000,000 dog images and want to feed them into a classification algorithm to predict whether each image contains a Boston Terrier. If we want to use all of these images for a supervised classification task, someone has to look at every image and decide whether a Boston Terrier is present.

Labeling data can be expensive. For one thing, we may have to pay someone to work through those 1 million dog photos.

So, can unlabeled data be used in a classification algorithm at all?

This is where semi-supervised learning comes into play. In a semi-supervised approach, we can train a classifier on a small amount of labeled data and then use the classifier to make predictions on the unlabeled data.

Because these predictions are likely better than random guesses, the predictions on the unlabeled data can be used as "pseudo-labels" in subsequent iterations of the classifier. While semi-supervised learning comes in many flavors, this particular technique is called self-training.

Self-training

At a conceptual level, self-training works as follows:

Step 1: split the labeled data instances into a training set and a test set. Then train a classification algorithm on the labeled training data.

Step 2: use the trained classifier to predict class labels for all unlabeled data instances. Of these predicted class labels, the ones made with the highest probability are treated as "pseudo-labels".

(A couple of variations on Step 2: a) all predicted labels can be used as "pseudo-labels" at once, regardless of probability; or b) the "pseudo-labeled" data can be weighted by the confidence of the prediction.)

Step 3: concatenate the "pseudo-labeled" data with the labeled training data. Retrain the classifier on the combined "pseudo-labeled" and labeled training data.

Step 4: use the trained classifier to predict class labels for the labeled test data instances. Evaluate classifier performance with the metric of your choice.

(Steps 1 through 4 can be repeated until no predicted class label from Step 2 meets the chosen probability threshold, or until no unlabeled data remains.)

Okay, got it? Good! Let's work through an example.

Example: using self-training to improve a classifier

To demonstrate self-training, I use Python and the surgical_deepnet dataset.

This dataset is intended for binary classification and contains data on 14.6k+ surgeries. The attributes are various measurements such as bmi and age, while the target variable, complication, records whether the patient suffered complications as a result of the operation. Clearly, accurately predicting whether a patient will have complications from surgery is valuable to both healthcare and insurance providers.

Import libraries

For this tutorial, I import numpy, pandas, and matplotlib. I also use the LogisticRegression classifier from sklearn, along with the f1_score and plot_confusion_matrix functions for model evaluation.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.metrics import plot_confusion_matrix

Load data

# load data
df = pd.read_csv('surgical_deepnet.csv')
df.info()

RangeIndex: 14635 entries, 0 to 14634
Data columns (total 25 columns):
bmi                    14635 non-null float64
Age                    14635 non-null float64
asa_status             14635 non-null int64
baseline_cancer        14635 non-null int64
baseline_charlson      14635 non-null int64
baseline_cvd           14635 non-null int64
baseline_dementia      14635 non-null int64
baseline_diabetes      14635 non-null int64
baseline_digestive     14635 non-null int64
baseline_osteoart      14635 non-null int64
baseline_psych         14635 non-null int64
baseline_pulmonary     14635 non-null int64
ahrq_ccs               14635 non-null int64
ccsComplicationRate    14635 non-null float64
ccsMort30Rate          14635 non-null float64
complication_rsi       14635 non-null float64
dow                    14635 non-null int64
gender                 14635 non-null int64
hour                   14635 non-null float64
month                  14635 non-null int64
moonphase              14635 non-null int64
mort30                 14635 non-null int64
mortality_rsi          14635 non-null float64
race                   14635 non-null int64
complication           14635 non-null int64
dtypes: float64(7), int64(18)
memory usage: 2.8 MB

The attributes in the dataset are all numeric and there are no missing values. Since my focus here is not on data cleaning, I'll move straight on to partitioning the data.

Data partition

To test the effectiveness of self-training, I need to divide the data into three parts: a training set, a test set, and an unlabeled set. I will split the data in the following proportions:

1% training

25% test

74% unlabeled

For the unlabeled set, I simply drop the target variable complication and pretend it never existed.

So in this scenario, we act as if 74% of the surgical cases carry no information about complications. I do this to simulate the fact that in real classification problems, most of the available data may have no class labels. However, if we do have class labels for a small portion of the data (1% in this case), semi-supervised learning techniques can draw conclusions from the data that was never labeled.

Next, I shuffle the data, generate indices to partition it, and create the train, test, and unlabeled splits. I then check the size of each split to make sure everything went according to plan.
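A minimal sketch of this step might look as follows. The complication column and the 1%/25%/74% proportions come from the text above; the random seed, the use of round(), and the variable names are my own assumptions, so exact split sizes could differ slightly.

# separate the features from the target variable
X = df.drop('complication', axis=1)
y = df.complication

# shuffle the row indices, then carve out 1% train, 25% test, 74% unlabeled
rng = np.random.RandomState(42)          # seed is an assumption, not from the article
indices = rng.permutation(len(df))

train_size = round(0.01 * len(df))
test_size = round(0.25 * len(df))

train_idx = indices[:train_size]
test_idx = indices[train_size:train_size + test_size]
unlabeled_idx = indices[train_size + test_size:]

X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]
X_unlabeled = X.iloc[unlabeled_idx]      # target values discarded, as described above

# confirm that the split sizes match the plan
print('X_train dimensions:', X_train.shape)
print('y_train dimensions:', y_train.shape)
print('X_test dimensions:', X_test.shape)
print('y_test dimensions:', y_test.shape)
print('X_unlabeled dimensions:', X_unlabeled.shape)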

X_train dimensions: (146, 24)
y_train dimensions: (146,)
X_test dimensions: (3659, 24)
y_test dimensions: (3659,)
X_unlabeled dimensions: (10830, 24)

Class distribution

The majority class (no complication) has more than twice as many samples as the minority class (complication). With such an imbalanced class distribution, accuracy may not be the best evaluation metric.
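A quick way to check this imbalance (a sketch, assuming the df and y_train objects created above):

# proportion of each class in the full dataset and in the small training split
print(df.complication.value_counts(normalize=True))
print(y_train.value_counts(normalize=True))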

I therefore choose the F1 score as the metric to judge the effectiveness of the classifier. The F1 score is more robust to class imbalance than accuracy, which is better suited to roughly balanced classes. The F1 score is calculated as:

F1 = 2 * (precision * recall) / (precision + recall)

where precision is the proportion of predicted positives that are actually positive, and recall is the proportion of actual positives that are correctly predicted.

Initial classifier (supervised)

To give the semi-supervised results a realistic baseline, I first train a simple Logistic regression classifier on the labeled training data and use it to predict on the test dataset.
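A minimal sketch of this baseline step, assuming the splits created above; the article does not show its hyperparameters, so scikit-learn defaults are used (with max_iter raised so the solver converges):

# train a baseline logistic regression on the 1% labeled training set
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

# evaluate with F1 on the training data and the held-out test data
print('Train f1 Score:', f1_score(y_train, lr.predict(X_train)))
print('Test f1 Score:', f1_score(y_test, lr.predict(X_test)))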

Train f1 Score: 0.5846153846153846
Test f1 Score: 0.5002908667830134

The classifier's test F1 score is about 0.5. The confusion matrix tells us that the classifier predicts surgeries without complications quite well, getting 86% of them right. However, it has a harder time correctly identifying surgeries with complications, getting only 47% of those right.
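The confusion matrix can be drawn with the plot_confusion_matrix helper imported earlier (a sketch; this helper exists in scikit-learn versions before 1.2, where ConfusionMatrixDisplay.from_estimator replaces it):

# row-normalized confusion matrix for the baseline classifier
plot_confusion_matrix(lr, X_test, y_test, normalize='true', cmap='Blues')
plt.show()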

Prediction probability

For the self-training algorithm, we need to know the probabilities of the Logistic regression classifier's predictions. Fortunately, sklearn provides the .predict_proba() method, which lets us see the predicted probability of each class. As shown below, in a binary classification problem the probabilities for each prediction sum to 1.0.
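For example (a sketch, assuming the lr classifier and the X_unlabeled split from above):

# probability of each class (no complication, complication) for every unlabeled instance
pred_probs = lr.predict_proba(X_unlabeled)
pred_probs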

array([[0.93931367, 0.06068633],
       [0.2327203 , 0.7672797 ],
       [0.93931367, 0.06068633],
       ...,
       [0.61940353, 0.38059647],
       [0.41240068, 0.58759932],
       [0.24306008, 0.75693992]])

Self-training classifier (semi-supervised)

Now that we know how to obtain prediction probabilities with sklearn, we can go ahead and code the self-training classifier. Here is a brief overview:

Step 1: first, train a Logistic regression classifier on the labeled training data.

Step 2: next, use the classifier to predict labels for all of the unlabeled data, along with the probabilities of those predictions. In this case, I only "pseudo-label" predictions made with a probability greater than 99%.

Step 3: concatenate the "pseudo-labeled" data with the labeled training data, and retrain the classifier on the combined data.

Step 4: use the trained classifier to make predictions on the labeled test data and evaluate the classifier.

Repeat steps 1 through 4 until there are no more predictions with a probability greater than 99%, or no unlabeled data remains.

The following code uses a while loop to implement these steps in Python.
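A minimal sketch of that loop, following the steps above. The 99% threshold and the printed messages follow the text; the tracking lists train_f1s, test_f1s, and pseudo_label_counts and the other variable names are my own assumptions.

# lists to track performance and pseudo-label counts at each iteration
train_f1s = []
test_f1s = []
pseudo_label_counts = []

high_prob = [1]   # dummy value so the loop starts
i = 0

# keep training until no predictions clear the 99% threshold
# or no unlabeled data remains
while len(high_prob) > 0 and len(X_unlabeled) > 0:

    # Step 1: train the classifier on the current labeled training data
    lr = LogisticRegression(max_iter=1000)
    lr.fit(X_train, y_train)

    # Step 4: evaluate on the training and test data
    train_f1 = f1_score(y_train, lr.predict(X_train))
    test_f1 = f1_score(y_test, lr.predict(X_test))
    train_f1s.append(train_f1)
    test_f1s.append(test_f1)
    print(f'Iteration {i}')
    print(f'Train f1: {train_f1}')
    print(f'Test f1: {test_f1}')

    # Step 2: predict labels and probabilities for the unlabeled data
    print('Now predicting labels for unlabeled data...')
    pred_probs = lr.predict_proba(X_unlabeled)
    preds = lr.predict(X_unlabeled)

    df_pred = pd.DataFrame({'preds': preds,
                            'prob_0': pred_probs[:, 0],
                            'prob_1': pred_probs[:, 1]},
                           index=X_unlabeled.index)

    # keep only predictions made with greater than 99% probability
    high_prob = df_pred[(df_pred.prob_0 > 0.99) | (df_pred.prob_1 > 0.99)]
    pseudo_label_counts.append(len(high_prob))
    print(f'{len(high_prob)} high-probability predictions added to training data.')

    # Step 3: concatenate the pseudo-labeled rows with the labeled training data
    X_train = pd.concat([X_train, X_unlabeled.loc[high_prob.index]])
    y_train = pd.concat([y_train, high_prob.preds])

    # remove the newly pseudo-labeled rows from the unlabeled pool
    X_unlabeled = X_unlabeled.drop(index=high_prob.index)
    print(f'{len(X_unlabeled)} unlabeled instances remaining.')

    i += 1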

Iteration 0
Train f1: 0.5846153846153846
Test f1: 0.5002908667830134
Now predicting labels for unlabeled data...
42 high-probability predictions added to training data.
10788 unlabeled instances remaining.

Iteration 1
Train f1: 0.7627118644067796
Test f1: 0.5037463976945246
Now predicting labels for unlabeled data...
30 high-probability predictions added to training data.
10758 unlabeled instances remaining.

Iteration 2
Train f1: 0.8181818181818182
Test f1: 0.505431675242996
Now predicting labels for unlabeled data...
20 high-probability predictions added to training data.
10738 unlabeled instances remaining.

Iteration 3
Train f1: 0.847457627118644
Test f1: 0.5076835515082526
Now predicting labels for unlabeled data...
21 high-probability predictions added to training data.
10717 unlabeled instances remaining.

...

Iteration 44
Train f1: 0.9481216457960644
Test f1: 0.5259179265658748
Now predicting labels for unlabeled data...
0 high-probability predictions added to training data.
10079 unlabeled instances remaining.

After 44 iterations, the self-training algorithm can no longer predict any unlabeled instances with at least 99% probability. Even though there were 10830 unlabeled instances to start with, 10079 of them remain unlabeled (and unused by the classifier) after training.

Over the 44 iterations, the test F1 score rose from 0.50 to about 0.525! Although this is only a small gain, it appears that self-training improved the classifier's performance on the test dataset. The top panel of the figure above shows that most of this improvement happens in the early iterations of the algorithm. Likewise, the bottom panel shows that most of the "pseudo-labels" added to the training data appear in the first 20-30 iterations.
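A sketch of how the figure's two panels could be reproduced, assuming the train_f1s, test_f1s, and pseudo_label_counts lists tracked in the loop sketch above:

# top panel: train and test F1 score at each iteration
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6), sharex=True)
ax1.plot(train_f1s, label='Train F1')
ax1.plot(test_f1s, label='Test F1')
ax1.set_ylabel('F1 score')
ax1.legend()

# bottom panel: number of pseudo-labels added at each iteration
ax2.bar(range(len(pseudo_label_counts)), pseudo_label_counts)
ax2.set_xlabel('Iteration')
ax2.set_ylabel('Pseudo-labels added')

plt.show()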

The final confusion matrix shows improved classification of surgeries with complications, at the cost of a slight decline on surgeries without complications. Backed by the increase in F1 score, I think this is an acceptable trade-off: it is likely more important to identify the surgical cases that lead to complications (the true positives), and it may be worth accepting a higher false-positive rate to achieve that.

A word of warning

So you might be thinking: is it risky to use so much unlabeled data for self-training? The answer is, of course, yes. Keep in mind that although we fold the "pseudo-labeled" data in with the labeled training data, some of the "pseudo-labels" are bound to be incorrect. When enough pseudo-labels are wrong, the self-training algorithm reinforces bad classification decisions, and the classifier's performance can actually get worse.

This risk can be mitigated by using a test set that the classifier never sees during training, or by requiring a probability threshold for accepting "pseudo-label" predictions.

At this point, I hope you have a deeper understanding of how to improve a classifier with Python. Why not try it out in practice?
