
How to understand the metric choice of machine learning model

This article explains how to understand the choice of evaluation metrics for machine learning models. The ideas presented here are simple, fast, and practical, so let's walk through them.

Definitions

Before discussing the pros and cons of each approach, let's review the basic terminology used in classification problems. If you are already familiar with these terms, you can skip this section.

Recall or TPR (true positive rate): the number of items correctly identified as positive out of all positives = TP / (TP + FN)

Specificity or TNR (true negative rate): the number of items correctly identified as negative out of all negatives = TN / (TN + FP)

Precision: of the items identified as positive, the number that are actually positive = TP / (TP + FP)

False positive rate or type I error: the number of items incorrectly identified as positive out of all negatives = FP / (FP + TN)

False negative rate or type II error: the number of items incorrectly identified as negative out of all positives = FN / (FN + TP)

Confusion matrix

F1 score: the harmonic mean of precision and recall. F1 = 2 * Precision * Recall / (Precision + Recall)

Accuracy: the fraction of all items that are correctly classified = (TP + TN) / (TP + TN + FP + FN)
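To make these definitions concrete, here is a minimal sketch (using numpy and scikit-learn, with made-up labels and predictions) that derives every quantity above from a confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth labels and hard predictions (1 = positive, 0 = negative)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1])

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall      = tp / (tp + fn)            # TPR / sensitivity
specificity = tn / (tn + fp)            # TNR
precision   = tp / (tp + fp)
fpr         = fp / (fp + tn)            # type I error rate
fnr         = fn / (fn + tp)            # type II error rate
f1          = 2 * precision * recall / (precision + recall)
accuracy    = (tp + tn) / (tp + tn + fp + fn)

print(recall, specificity, precision, fpr, fnr, f1, accuracy)
```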

ROC-AUC score

The probabilistic interpretation of the ROC-AUC score is this: if a positive example and a negative example are chosen at random, the AUC gives the probability that the classifier ranks the positive example higher than the negative one.

Mathematically, it is the area under the curve of sensitivity (TPR) plotted against FPR (1 - specificity). Ideally, we want both high sensitivity and high specificity, but in practice there is always a tradeoff between the two.
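To see that tradeoff in numbers, here is a small sketch, with synthetic labels and scores, that uses scikit-learn's roc_curve to list the (FPR, TPR) pairs traced out as the decision threshold moves:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical labels and predicted scores from some classifier
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

# Each point on the ROC curve is an (FPR, TPR) pair at one threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")

# The area under this curve is the ROC-AUC
print("AUC =", roc_auc_score(y_true, y_score))
```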

Some important properties of the ROC-AUC score:

The value can range from 0 to 1. However, a random classifier on balanced data scores 0.5.

The ROC-AUC score is independent of the classification threshold. The F1 score is different: with probabilistic outputs, computing F1 requires choosing a threshold (see the sketch below).
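A minimal sketch of this difference, again with made-up probabilities: the ROC-AUC is computed directly from the scores, while the F1 score changes with the threshold used to binarize them:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.2, 0.55, 0.6, 0.8, 0.3, 0.45, 0.5, 0.9])

# AUC needs no threshold: it only uses the ranking of the scores
print("ROC-AUC:", roc_auc_score(y_true, y_prob))

# F1 needs hard predictions, so its value depends on the chosen threshold
for threshold in (0.4, 0.5, 0.6):
    y_pred = (y_prob >= threshold).astype(int)
    print(f"F1 at threshold {threshold}:", f1_score(y_true, y_pred))
```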

Log loss

Log loss is an accuracy measure that incorporates the idea of probabilistic confidence. For binary classification it is given by: Log loss = -1/N * Σ [y * log(p) + (1 - y) * log(1 - p)], where y is the actual label (0 or 1) and p is the predicted probability of the positive class.

It captures the uncertainty of your predictions based on how far they are from the actual labels. Suppose, as an uninformative baseline, you predict a probability of 0.5 for every sample; the log loss then becomes -log(0.5) = 0.69.

Therefore, roughly speaking, a log loss much above 0.6 means the model is doing little better than predicting 0.5 for everything, i.e. it is a very bad model.
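The following sketch, with made-up labels and probabilities, checks that 0.69 baseline and shows how log loss rewards confident correct probabilities and punishes confident wrong ones:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])

# Predicting 0.5 everywhere gives the uninformative baseline of -log(0.5) ≈ 0.69
p_baseline = np.full(5, 0.5)
print(log_loss(y_true, p_baseline))   # ~0.693

# Confident and correct probabilities give a much lower loss ...
p_good = np.array([0.9, 0.1, 0.8, 0.95, 0.2])
print(log_loss(y_true, p_good))

# ... while confident but wrong probabilities are punished heavily
p_bad = np.array([0.1, 0.9, 0.2, 0.05, 0.8])
print(log_loss(y_true, p_bad))
```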

Case 1: comparing log loss with ROC-AUC and F1

In case 1, model 1 does a better job of predicting the absolute probabilities, while model 2 only predicts probabilities in the correct order. Let's verify this with the actual scores:

Looking at log loss, model 2 has the highest loss, because its absolute probabilities are far from the actual labels. This is completely at odds with the F1 and AUC scores, according to which model 2 classifies everything correctly.

Also note that the F1 score varies with the threshold; at the default threshold of 0.5, F1 prefers model 1 over model 2.

Inferences from the above example:

If you care about absolute probability differences, use log loss.

If you only care about the predictions for a particular class and do not want to tune a threshold, use the ROC-AUC score.

The F1 score is sensitive to the threshold, and you need to tune it before comparing models (see the sketch below).
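Since the article's original table for case 1 is not reproduced here, the sketch below uses made-up probabilities with the same character (model 1 close to the true labels, model 2 only ordered correctly) to show the pattern described above:

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score, f1_score

# Made-up stand-ins for Case 1: model 1 outputs probabilities close to the true
# labels, model 2 only gets the ordering right.
y_true   = np.array([0, 0, 0, 0, 1, 1, 1, 1])
model1_p = np.array([0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.90])
model2_p = np.array([0.30, 0.35, 0.40, 0.45, 0.46, 0.55, 0.60, 0.65])

for name, p in [("model 1", model1_p), ("model 2", model2_p)]:
    print(name,
          "log loss:", round(log_loss(y_true, p), 3),                       # model 2 is punished here
          "ROC-AUC:", roc_auc_score(y_true, p),                             # both rank perfectly -> 1.0
          "F1 @ 0.5:", round(f1_score(y_true, (p >= 0.5).astype(int)), 3))  # default threshold favors model 1

# With a tuned threshold, model 2's F1 is also perfect, which is why F1
# comparisons only make sense after the threshold has been adjusted.
print("model 2, F1 @ 0.46:", f1_score(y_true, (model2_p >= 0.46).astype(int)))
```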

Case 2: how do they handle class imbalance?

The only difference between the two models is in their predictions for observations 13 and 14. Model 1 classifies observation 13 (label 0) better, while model 2 classifies observation 14 (label 1) better.

Our goal is to see which model better captures the minority class (label 1, which has few samples). In problems such as fraud detection or spam detection, positive examples are always scarce and we want the model to predict them correctly, so we sometimes prefer the model that classifies these positive examples well.

Obviously, log loss fails here, because according to log loss the two models perform identically. This is because the log loss function is symmetric and does not distinguish between the classes.

Both the F1 score and the ROC-AUC score prefer model 2 over model 1, so we can use either of them to handle class imbalance. But we need to dig a little deeper to see how differently they treat it.

In the first example there are very few positive labels; in the second there are almost no negative labels. Let's look at how the F1 score and ROC-AUC distinguish between these two situations.

The ROC-AUC score treats a minority of negative labels exactly the same way it treats a minority of positive labels. An interesting thing to note is that the F1 score is almost the same for model 3 and model 4, because with so many positive labels F1 only cares about misclassification of the positive class.
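The sketch below, on a made-up imbalanced toy set rather than the article's table, illustrates this asymmetry: swapping the roles of the two classes leaves log loss and ROC-AUC unchanged, while the F1 score changes because it only looks at the positive class:

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score, f1_score

# Imbalanced toy data: only one positive example (made up for illustration)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
y_prob = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.75, 0.2, 0.7])
y_pred = (y_prob >= 0.5).astype(int)

# Swap the roles of the classes: the rare class becomes the negative one
y_true_sw, y_prob_sw, y_pred_sw = 1 - y_true, 1 - y_prob, 1 - y_pred

print(log_loss(y_true, y_prob), log_loss(y_true_sw, y_prob_sw))            # identical
print(roc_auc_score(y_true, y_prob), roc_auc_score(y_true_sw, y_prob_sw))  # identical
print(f1_score(y_true, y_pred), f1_score(y_true_sw, y_pred_sw))            # different
```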

The inference drawn from the above example:

If you care about the minority class, and it does not matter whether that class is the positive or the negative one, choose the ROC-AUC score.

When should you choose the F1 score over ROC-AUC?

When the positive class is small, the F1 score is more meaningful. This is the common situation in fraud detection, where positive labels are scarce. We can see this with the following example.

For example, in a dataset of 10K samples with 100 real positive cases, model (1) correctly predicts 5 of the 100 positives, while model (2) correctly predicts 90 of the 100. Clearly, model (2) does better than model (1). Let's see whether the F1 score and the ROC-AUC score both capture this difference.

F1 score of model (1) = 2 * (1) * (0.05) / 1.05 = 0.095

F1 score of model (2) = 2 * (1) * (0.9) / 1.9 = 0.947

Yes, the difference in F1 scores reflects the difference in model performance.

ROC-AUC score of model (1) = 0.5

ROC-AUC score of model (2) = 0.93

The gap in ROC-AUC (0.5 vs. 0.93) understates how much worse model (1) really is, so ROC-AUC is not a good performance indicator here. Therefore, for imbalanced datasets, be careful when choosing ROC-AUC.
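Here is a hedged reconstruction of that worked example, assuming both models make no false-positive predictions so that their precision is 1; the models' probability scores are not given in the article, so only the F1 arithmetic is reproduced:

```python
import numpy as np
from sklearn.metrics import f1_score

# 10K samples, 100 of them positive; positives are placed first for convenience
n_samples, n_pos = 10_000, 100
y_true = np.array([1] * n_pos + [0] * (n_samples - n_pos))

def predictions(n_caught):
    """Hard predictions that catch `n_caught` of the 100 positives and flag no negatives."""
    return np.array([1] * n_caught + [0] * (n_samples - n_caught))

model1_pred = predictions(5)    # recall 0.05, precision 1.0
model2_pred = predictions(90)   # recall 0.90, precision 1.0

print(f1_score(y_true, model1_pred))   # 2*1*0.05/1.05 ≈ 0.095
print(f1_score(y_true, model2_pred))   # 2*1*0.90/1.90 ≈ 0.947
# The article's ROC-AUC figures (0.5 and 0.93) come from the models' probability
# scores, which are not reproduced here.
```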

Which metric should you use for multi-class classification?

Beyond binary classification, there are three types of classification tasks:

Multi-class: a classification task with more than two classes. Example: classify a set of fruit images into one of the following categories: apples, bananas, or oranges.

Multi-label: assign each sample a set of target labels. Example: tag a blog post with one or more topics such as technology, religion, or politics. The labels are independent, and relationships between them do not matter.

Hierarchical: each category can be grouped with similar categories into meta-classes, which can in turn be grouped again until we reach the root level (the set of all the data). Examples include text classification and species classification.

In this blog, we only discuss the first category.

As you can see in the table above, there are two averaging schemes, micro-average and macro-average, and we will discuss the pros and cons of each. The most commonly used metrics for multi-class problems are the F1 score, average accuracy, and log loss. There is, as yet, no well-established multi-class version of the ROC-AUC score.

Multi-class log loss is defined as: Log loss = -1/N * Σ_i Σ_j y_ij * log(p_ij), where y_ij is 1 if sample i belongs to class j (and 0 otherwise) and p_ij is the predicted probability that sample i belongs to class j.
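A minimal sketch of the same quantity with scikit-learn, using made-up class labels and probability rows:

```python
import numpy as np
from sklearn.metrics import log_loss

# Three classes; y_true holds class indices, y_prob holds one probability row per sample
y_true = np.array([0, 2, 1, 2])
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.5, 0.2],
    [0.1, 0.3, 0.6],
])

# Same quantity as the formula above: the mean of -log(probability assigned to the true class)
manual = -np.mean(np.log(y_prob[np.arange(len(y_true)), y_true]))
print(manual, log_loss(y_true, y_prob, labels=[0, 1, 2]))   # identical
```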

In the micro-averaging method, the true positives, false positives, and false negatives from the different classes are summed up first, and the metric is then computed from these pooled counts.

In the macro-averaging method, the precision and recall are computed for each class separately and then averaged.

If there is class imbalance, use the micro-averaging method.
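The following sketch, on a made-up imbalanced three-class toy set, contrasts the two averaging schemes with scikit-learn:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score

# Imbalanced three-class toy data: class 2 is rare
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 0, 2, 0])

# Micro-averaging pools TP/FP/FN over all classes, so frequent classes dominate
print("micro F1:", f1_score(y_true, y_pred, average="micro"))
print("micro precision:", precision_score(y_true, y_pred, average="micro"))

# Macro-averaging computes the metric per class and then takes an unweighted mean,
# so the rare class 2 counts as much as the frequent class 0
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
print("macro precision:", precision_score(y_true, y_pred, average="macro"))
```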

At this point, you should have a deeper understanding of how to choose metrics for machine learning models. The best way to consolidate it is to try these metrics out in practice.
