2025-02-28 Update. From: SLTechnology News & Howtos (Shulou)
Shulou (Shulou.com) 06/01 Report
This article explains how to measure the performance of machine learning algorithms in Python. The content is analyzed and presented from a practical point of view; I hope you get something out of it.
1 Performance Measurement of Machine Learning Algorithms
Here we want to evaluate how effective an algorithm is. There are many kinds of evaluation metrics, and different scenarios call for different ones. The details are as follows.
1.1 Algorithm Evaluation Metrics
First of all, everything here concerns supervised learning, that is, labeled data. How to measure and evaluate a machine learning algorithm should be discussed separately for classification and regression problems.
For classification, we use the Pima Indians onset of diabetes dataset, with logistic regression as the algorithm. Note that despite its name, logistic regression is a classification algorithm, not a regression algorithm: a sigmoid function maps the raw output to a value between 0 and 1.
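To make the sigmoid idea concrete, here is a minimal sketch (not from the original article) of the function and why its output can be read as a probability:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative scores map near 0, large positive scores near 1,
# and 0 maps to exactly 0.5 -- which is why the output can be
# interpreted as a class-membership probability.
print(round(sigmoid(0), 2))   # 0.5
print(sigmoid(10) > 0.99)     # True
print(sigmoid(-10) < 0.01)    # True
```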
For regression, the Boston house prices dataset is used, with linear regression as the example algorithm.
1.2 Metrics for Classification
For classification problems, there are many metrics that can be used to evaluate an algorithm, each with a slightly different focus. They are discussed separately below.
Classification Accuracy
Logarithmic Loss (log loss)
Area Under ROC Curve (ROC AUC)
Confusion Matrix
Classification Report
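The last two metrics in the list, the confusion matrix and the classification report, can be produced directly with scikit-learn. Here is a small illustrative sketch with hypothetical labels (not the Pima dataset):

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical ground-truth labels and model predictions, for illustration.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1, 1, 1]

# Rows are the actual class, columns the predicted class.
print(confusion_matrix(y_true, y_pred))

# Per-class precision, recall, f1-score, and support.
print(classification_report(y_true, y_pred))
```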
1.2.1 Classification Accuracy
Classification accuracy is the number of correct predictions as a proportion of all predictions. It is the most conventional metric, but arguably also the least informative: it is only meaningful when the classes are roughly balanced, which is not the common case, and it reduces the evaluation to a single number. A brief example follows.
Accuracy is the most common and basic evaluation metric. However, for binary classification with imbalanced positive and negative classes, especially when we care more about the minority class, accuracy has essentially no reference value.
Fraud detection and cancer screening both fit this situation. For example: suppose the test set has 100 samples, 99 negative examples and only 1 positive example. If my model indiscriminately predicts every sample as negative, then its accuracy (correct predictions / total predictions) is 99%. Yet if you use this 99%-accurate model on new samples, it can never identify a positive example. Some people call this the accuracy paradox.
Accuracy is therefore not used much in such settings; it is still worth understanding.
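The 99:1 example above can be reproduced in a few lines, computing accuracy and recall on the positive class by hand:

```python
# Accuracy paradox sketch: a "model" that always predicts the majority
# class (0) on a 99:1 imbalanced test set, as in the example above.
y_true = [1] + [0] * 99   # 1 positive example, 99 negatives
y_pred = [0] * 100        # model always says "negative"

# Accuracy: correct predictions / total predictions.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the positive class: positives caught / positives present.
recall_pos = sum(t == p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(accuracy)    # 0.99 -- looks excellent
print(recall_pos)  # 0.0  -- but it never catches the positive class
```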
1.2.2 Logarithmic Loss
The logarithmic loss function is another way to evaluate prediction quality; it operates on predicted probabilities, which lie between 0 and 1. It is often used to judge the performance of logistic regression. An example follows.
# Cross Validation Classification LogLoss
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
# shuffle=True is required for random_state to take effect in recent scikit-learn
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression(max_iter=1000)  # raise the iteration cap so the solver converges
scoring = 'neg_log_loss'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("Logloss: %.3f (%.3f)" % (results.mean(), results.std()))
# Logloss: -0.493 (0.047)
I didn't understand at first why the result was negative, so I looked it up briefly. Values closer to zero are better. According to https://stackoverflow.com/questions/21443865/scikit-learn-cross-validation-negative-values-with-mean-squared-error: "Yes, this is supposed to happen. The actual MSE is simply the positive version of the number you're getting."
And from https://stackoverflow.com/questions/21050110/sklearn-gridsearchcv-with-pipeline: "Those scores are negative MSE scores, i.e. negate them and you get the MSE. The thing is that GridSearchCV, by convention, always tries to maximize its score so loss functions like MSE have to be negated."
MSE here means mean squared error. In other words, the negative sign is a scikit-learn scoring convention (scorers always maximize), not a problem with the model.
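The same negation convention can be seen with a regression example. This is a sketch on synthetic data (a stand-in for the Boston housing dataset mentioned earlier), showing that the reported score is the negated MSE:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

# Synthetic regression data: a known linear relationship plus noise.
rng = np.random.RandomState(7)
X = rng.rand(100, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kfold = KFold(n_splits=5, shuffle=True, random_state=7)
scores = cross_val_score(LinearRegression(), X, y, cv=kfold,
                         scoring='neg_mean_squared_error')

# Scorers always maximize, so losses come back negated;
# flip the sign to recover the ordinary (positive) MSE.
mse = -scores.mean()
print(scores.mean() <= 0)  # True: the reported score is negative
print(mse >= 0)            # True: the actual MSE is positive
```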
1.2.3 AUC: Area Under the ROC Curve
The ROC (receiver operating characteristic) curve plots the true positive rate against the false positive rate at different classification thresholds. In ROC space, the more the curve bulges toward the top-left corner, the better the classifier.
What does AUC mean? It is the area under the ROC curve (Area Under the Curve). As you might guess, the larger the area enclosed under the ROC curve, the better the classifier's performance. Since the whole unit square has area 1, AUC lies between 0 and 1; a random classifier scores about 0.5, so a useful model should score above that.
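AUC can be computed directly from true labels and predicted positive-class probabilities. A minimal sketch with hypothetical values (not from the article's dataset):

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted positive-class probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC equals the probability that a randomly chosen positive example
# is ranked above a randomly chosen negative example.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75 for this toy example
```

Here 3 of the 4 positive/negative pairs are ranked correctly (only 0.35 vs. 0.4 is inverted), giving 3/4 = 0.75.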