What are the methods of cross-validation in Python?

This article introduces the methods of cross-validation in Python through practical examples, with a working code snippet for each technique.
1. What is cross-validation?
Cross-validation is a statistical method used to estimate the performance of machine learning models. It evaluates how well the results of a statistical analysis generalize to an independent data set.
2. How does it solve the overfitting problem?
In cross-validation, we generate multiple train/test partitions of the training data and use these splits to tune the model. For example, in standard k-fold cross-validation, we divide the data into k subsets. Then we train the algorithm on K-1 subsets in turn, using the remaining subset as the test set. In this way, we can test our model on data that was not involved in training.
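To make the partitioning concrete, here is a minimal sketch (a toy setup added for illustration, not from the original article) that prints the train/test index splits produced by 5-fold cross-validation on ten samples:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(10, 1)  # ten toy samples
kf = KFold(n_splits=5)
# Each iteration holds out one fold for testing and trains on the other K-1.
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print("Fold", fold, "train:", train_idx, "test:", test_idx)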
In this article, I will share the seven most commonly used cross-validation techniques, their advantages and disadvantages, and a code snippet for each.
These techniques are listed below:
HoldOut cross-validation
K-Fold cross-validation
Stratified K-Fold cross-validation
Leave P Out cross-validation
Leave One Out cross-validation
Monte Carlo (Shuffle-Split)
Time series (rolling cross-validation)
1. HoldOut cross-validation
In this cross-validation technique, the whole dataset is randomly divided into a training set and a validation set. As a rule of thumb, about 70% of the dataset is used as the training set and the remaining 30% as the validation set.
Advantages:
1. Fast execution: because we split the dataset into training and validation sets only once, and the model is built only once on the training set, it executes quickly.
Disadvantages:
Not suitable for imbalanced datasets: suppose we have an imbalanced dataset with classes "0" and "1", where 80% of the data belongs to class "0" and the remaining 20% to class "1". If we do a train-test split with 80% of the data for training and 20% for testing, it may happen that all of the class "0" data ends up in the training set and all of the class "1" data in the test set. Our model will then generalize poorly to the test data because it has never seen class "1" before.
A large amount of data is withheld from training: in the case of a small dataset, part of it is reserved for testing, and it may contain important characteristics that the model misses because it never trains on that data.
Code snippet
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
Y = iris.target
print("Size of Dataset {}".format(len(X)))

logreg = LogisticRegression()
# 70/30 split, as described above; fixed seed for reproducibility.
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
logreg.fit(x_train, y_train)
predict = logreg.predict(x_test)

print("Accuracy score on training set is {}".format(accuracy_score(logreg.predict(x_train), y_train)))
print("Accuracy score on test set is {}".format(accuracy_score(predict, y_test)))
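As a hedged aside (not part of the original snippet): train_test_split accepts a stratify argument that preserves class proportions in both splits, which mitigates the imbalance disadvantage described above. A minimal sketch on the same iris data:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, Y = iris.data, iris.target

# stratify=Y keeps the class ratio of Y in both the training and test splits.
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42, stratify=Y)
print("Train class counts:", np.bincount(y_train))
print("Test class counts:", np.bincount(y_test))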
2. K-Fold cross-validation
In the K-fold cross-validation technique, the whole dataset is divided into K parts of equal size. Each partition is called a "fold"; since there are K parts, we call it K-fold. One fold is used as the validation set, and the remaining K-1 folds are used as the training set.
The technique is repeated K times, until each fold has served once as the validation set with the remaining folds as the training set.
The final accuracy of the model is computed by averaging the accuracies of the K models on their validation data.
Advantages:
The entire dataset is used both for training and for validation over the course of the K iterations.
Disadvantages:
Not suitable for imbalanced datasets: as discussed for HoldOut cross-validation, with K-Fold it is also possible that a training set contains only samples of class "0" and none of class "1", while the validation set contains the class "1" samples.
Not suitable for time series data: for time series data, the order of the samples matters, but K-fold cross-validation selects samples in random order.
Code snippet:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data
Y = iris.target

logreg = LogisticRegression()
kf = KFold(n_splits=5)  # 5 folds of equal size
score = cross_val_score(logreg, X, Y, cv=kf)

print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score: {}".format(score.mean()))
3. Stratified K-Fold cross-validation
Stratified K-Fold is an enhanced version of K-Fold cross-validation, mainly used for imbalanced datasets. Just like K-Fold, the whole dataset is divided into K folds of equal size.
In this technique, however, each fold contains the same proportion of target-variable instances as the whole dataset.
Advantages:
Very effective for imbalanced data: each fold in stratified cross-validation represents all classes in the same proportion as the entire dataset.
Disadvantages:
Not suitable for time series data: for time series data, the order of the samples matters. However, in stratified cross-validation, samples are selected in random order.
Code snippet:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data
Y = iris.target

logreg = LogisticRegression()
stratifiedkf = StratifiedKFold(n_splits=5)  # each fold preserves the class ratio
score = cross_val_score(logreg, X, Y, cv=stratifiedkf)

print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score: {}".format(score.mean()))
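To verify the claim that each fold preserves the overall class ratio, here is a small sketch (added for illustration) that counts the classes in each validation fold of the same iris setup:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

iris = load_iris()
X, Y = iris.data, iris.target

skf = StratifiedKFold(n_splits=5)
# Stratified splitting needs the labels, so Y is passed to split().
for fold, (train_idx, test_idx) in enumerate(skf.split(X, Y), start=1):
    print("Fold", fold, "validation class counts:", np.bincount(Y[test_idx]))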
4. Leave P Out cross-validation
Leave P Out cross-validation is an exhaustive cross-validation technique in which p samples are used as the validation set and the remaining n-p samples are used as the training set.
Suppose we have 100 samples in the dataset. If we use p=10, then in each iteration 10 samples are used as the validation set and the remaining 90 samples as the training set.
This process is repeated until the whole dataset has been partitioned into validation sets of p samples and training sets of n-p samples.
Advantages:
All data samples are used both for training and for validation.
Disadvantages:
Long computation time: the technique is repeated until every possible combination of p samples has served as the validation set, so the computation time is long.
Not suitable for imbalanced datasets: as with K-fold cross-validation, if the training set contains samples of only one class, the model will not generalize to the validation set.
Code snippet
from sklearn.model_selection import LeavePOut, cross_val_score
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X = iris.data
Y = iris.target

lpo = LeavePOut(p=2)
lpo.get_n_splits(X)

tree = RandomForestClassifier(n_estimators=10)  # assumed hyperparameters
score = cross_val_score(tree, X, Y, cv=lpo)  # fits the model once per split, so this is slow

print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score: {}".format(score.mean()))
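The computational cost grows combinatorially: Leave P Out enumerates every possible validation set of size p, i.e. C(n, p) splits. A quick sketch (added for illustration) showing why even p=2 on the 150 iris samples is already expensive:

from math import comb
from sklearn.datasets import load_iris
from sklearn.model_selection import LeavePOut

X = load_iris().data
lpo = LeavePOut(p=2)
# get_n_splits enumerates every size-2 validation set: C(150, 2) = 11175.
print(lpo.get_n_splits(X), comb(len(X), 2))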
5. Leave One Out cross-validation
Leave One Out cross-validation is an exhaustive cross-validation technique in which a single sample is used as the validation set and the remaining n-1 samples are used as the training set.
Suppose we have 100 samples in the dataset. Then in each iteration, one value is used as the validation set and the remaining 99 samples as the training set. The process is repeated until every sample in the dataset has been used as a validation point.
It is the same as LeavePOut cross-validation with p=1.
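As a quick sanity check (a sketch added here, not in the original article), the number of Leave One Out splits equals the number of samples, exactly as with LeavePOut and p=1:

from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, LeavePOut

X = load_iris().data
# Both splitters produce one split per sample: 150 for iris.
print(LeaveOneOut().get_n_splits(X))   # 150
print(LeavePOut(p=1).get_n_splits(X))  # 150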
Code snippet:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

iris = load_iris()
X = iris.data
Y = iris.target

loo = LeaveOneOut()
tree = RandomForestClassifier(n_estimators=10)  # assumed hyperparameters
score = cross_val_score(tree, X, Y, cv=loo)

print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score: {}".format(score.mean()))
6. Monte Carlo cross-validation (Shuffle Split)
Monte Carlo cross-validation, also known as Shuffle Split cross-validation, is a very flexible cross-validation strategy. In this technique, the dataset is randomly split into a training set and a validation set.
We decide what percentage of the dataset to use as the training set and what percentage as the validation set. If the two percentages do not add up to 100, the remaining portion of the dataset is used for neither.
Suppose we have 100 samples, of which 60% are used as the training set and 20% as the validation set; then the remaining 20% (100 - (60 + 20)) is not used.
This split is repeated "n" times, where n must be specified.
Advantages:
1. We are free to choose the sizes of the training and validation sets.
2. We can choose the number of repetitions independently; it is not tied to the number of folds.
Disadvantages:
Some samples may never be selected for either the training set or the validation set.
Not suitable for imbalanced datasets: once we have defined the sizes of the training set and the validation set, all samples are selected randomly, so the training set may lack classes that appear in the test set, and the model will not generalize to unseen data.
Code snippet:
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
logreg = LogisticRegression()
# 50% train, 30% test: the remaining 20% of samples are unused in each split.
shuffle_split = ShuffleSplit(test_size=0.3, train_size=0.5, n_splits=10)
scores = cross_val_score(logreg, iris.data, iris.target, cv=shuffle_split)

print("Cross Validation scores:\n{}".format(scores))
print("Average Cross Validation score: {}".format(scores.mean()))
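Note that the snippet above deliberately leaves 20% of the data unused in every split (train_size=0.5 plus test_size=0.3). A short sketch (added for illustration) makes the unused portion visible:

from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit

X = load_iris().data  # 150 samples
ss = ShuffleSplit(test_size=0.3, train_size=0.5, n_splits=3)
for train_idx, test_idx in ss.split(X):
    unused = len(X) - len(train_idx) - len(test_idx)
    print("train:", len(train_idx), "test:", len(test_idx), "unused:", unused)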
7. Time series cross-validation
What is time series data?
Time series data is data collected at different time points. Because data points are collected in adjacent time periods, there may be a correlation between observations. This is one of the features that distinguish time series data from cross-sectional data.
How to carry out cross-validation in the case of time series data?
In the case of time series data, we cannot randomly select samples and assign them to the training or validation set, because it makes no sense to use future values to predict past values.
Because the order of the data is essential for time-series problems, we split the data into training and validation sets according to time. This is also known as the "forward chaining" method or rolling cross-validation.
We start with a small portion of the data as the training set. Based on this set, we predict the later data points and then check the accuracy.
The predicted samples are then included as part of the next training set, and subsequent samples are predicted.
Advantages:
One of the best techniques for evaluating models on time series data.
Disadvantages:
Not suitable for other data types: in the other techniques we choose samples at random for the training or validation set, but here the order of the data is essential, so random selection would not make sense.
Code snippet:
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])

time_series = TimeSeriesSplit()
print(time_series)
for train_index, test_index in time_series.split(X):
    # Training indices always precede the test indices in time.
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
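Like the other splitters, TimeSeriesSplit can be passed as the cv argument of cross_val_score. A minimal sketch (added for illustration, with an arbitrary toy regression series) showing forward-chaining evaluation of a model:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Toy series: 30 time steps, target is a noisy linear trend (illustrative values only).
rng = np.random.RandomState(0)
X = np.arange(30).reshape(-1, 1)
y = 2.0 * np.arange(30) + rng.randn(30)

tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(LinearRegression(), X, y, cv=tscv)
print("Forward-chaining R^2 scores:", scores)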