
Problems in implementing K-fold cross-validation in Python, and the difference between KFold and StratifiedKFold


This article covers common problems when implementing K-fold cross-validation in Python and the difference between KFold and StratifiedKFold. The editor thinks it is very practical, so it is shared here for you to learn from; I hope you get something out of it after reading. Without further ado, let's take a look.

How the training set and test set are divided affects the final model and its parameters to a great extent. In general, K-fold cross-validation is used for model tuning: it finds the hyperparameter values that give the best generalization performance, and at the same time measures how well the current model algorithm performs.

When the k value is large, more data is available for model training in each iteration, which minimizes bias but prolongs the running time.

When the k value is small, the computational cost of repeatedly fitting and evaluating the model on different data blocks is reduced, and an accurate assessment of the model is still obtained by averaging its performance across folds.
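As a minimal sketch of this trade-off (the iris dataset and logistic-regression model below are illustrative assumptions, not part of the original article), you can simply vary the cv argument of scikit-learn's cross_val_score:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# illustrative setup: any estimator and dataset would do
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

for k in (3, 5, 10):
    # larger k: more training data per fold, but more fits to run
    scores = cross_val_score(model, X, y, cv=k)
    print(k, round(scores.mean(), 3))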

Implementation code

It is usually implemented with the following import:

from sklearn.model_selection import KFold, StratifiedKFold

StratifiedKFold parameter description:

class sklearn.model_selection.StratifiedKFold(n_splits=5, shuffle=False, random_state=None)

n_splits: the number of folds.
shuffle=True: whether to shuffle each class's samples before splitting them into batches. For example, in 5-fold cross-validation, shuffling makes each run produce different folds; otherwise the folds obtained each time are the same.
random_state: controls the random state, i.e. the seed used by the random number generator.
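As a quick illustration (the parameter values below are arbitrary, chosen only to show the signature), an instance might be created like this:

from sklearn.model_selection import StratifiedKFold

# arbitrary illustrative values for the three parameters
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(skf.get_n_splits())  # prints 5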

Two points to note:

1. kf.split(x) returns indices into the dataset, so you need x[train_index] to extract the actual data.

2. When shuffle=True, each run of the code yields different randomly chosen indices; with shuffle=False, the indices stay the same across runs.

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

x = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6]])
kf = KFold(n_splits=2, shuffle=True)
for train_index, test_index in kf.split(x):
    print('train_index:', train_index)
    print('train_data:', x[train_index])
    print('test_index:', test_index)
    print('--- 2-fold, the test set becomes the training set, split line ---')

One possible output (with shuffle=True the indices vary from run to run):

train_index: [1 2 3]
train_data: [[2 2] [3 3] [4 4]]
test_index: [0 4 5]
--- 2-fold, the test set becomes the training set, split line ---
train_index: [0 4 5]
train_data: [[1 1] [5 5] [6 6]]
test_index: [1 2 3]
--- 2-fold, the test set becomes the training set, split line ---
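Note 2 above can be checked with a small sketch (the toy array below is an assumption for illustration): with shuffle=False the folds are identical on every run, while shuffle=True (and no fixed random_state) gives different folds:

import numpy as np
from sklearn.model_selection import KFold

x = np.arange(12).reshape(6, 2)  # toy data for illustration

for run in range(2):
    kf = KFold(n_splits=2, shuffle=False)
    # same test indices on both runs
    print('shuffle=False, run', run, [list(te) for _, te in kf.split(x)])

for run in range(2):
    kf = KFold(n_splits=2, shuffle=True)
    # test indices usually differ between runs
    print('shuffle=True, run', run, [list(te) for _, te in kf.split(x)])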

The difference between KFold and StratifiedKFold

Stratified means stratified sampling: it ensures that the proportion of samples from each class in the training set and the test set is the same as in the original dataset.

In the following example, 6 samples correspond to 6 labels. With a 3-fold split, each iteration uses 4 samples as train and 2 samples as test.

StratifiedKFold guarantees that the class proportions match those of the original dataset, that is, a biased split such as the following will not occur:

train_index = [0 1 2 3], train_label = [1 1 1 0]
test_index = [4 5], test_label = [0 0]   <- a biased class distribution

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

x = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6]])
y = np.array([1, 1, 1, 0, 0, 0])
kf = StratifiedKFold(n_splits=3, shuffle=True)
for train_index, test_index in kf.split(x, y):
    print('train_index:', train_index)
    print('test_index:', test_index)
    print('--- 3-fold, the test set becomes the training set, split line ---')

One possible output:

train_index: [0 1 4 5]
test_index: [2 3]
--- 3-fold, the test set becomes the training set, split line ---
train_index: [0 2 3 5]
test_index: [1 4]
--- 3-fold, the test set becomes the training set, split line ---
train_index: [1 2 3 4]
test_index: [0 5]
--- 3-fold, the test set becomes the training set, split line ---
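To confirm the stratification, a quick check (not part of the original code) can print the test labels of each fold from the example above; every fold should contain one sample of each class:

# continuing from the StratifiedKFold example above: each test fold
# keeps the original class proportions (one label 1 and one label 0)
for train_index, test_index in kf.split(x, y):
    print('test labels:', y[test_index])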

random_state (random state)

Why do you need such a parameter as random_state (random state)?

1. When building a model:
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

2. When generating a dataset:
X, y = make_moons(n_samples=100, noise=0.25, random_state=3)

3. When splitting a dataset into a training set and a test set:
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)

If you do not set random_state, the model you build will be different each time, the dataset you generate will be different each time, and the training/test split will be different each time. Whether to fix the seed therefore depends on your needs.
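A small sketch (with made-up data) makes the effect concrete: fixing random_state makes train_test_split reproducible, while omitting it usually does not:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # made-up data
y = np.arange(10)

# same seed: both calls return identical splits
a, _, _, _ = train_test_split(X, y, random_state=42)
b, _, _, _ = train_test_split(X, y, random_state=42)
print((a == b).all())   # True

# no seed: the two splits usually differ
c, _, _, _ = train_test_split(X, y)
d, _, _, _ = train_test_split(X, y)
print((c == d).all())   # usually False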

The above covers the problems of implementing K-fold cross-validation in Python and the difference between KFold and StratifiedKFold. The editor believes there are knowledge points here that you may see or use in daily work, and hopes you can learn more from this article. For more details, please follow the industry information channel.
