How to use XGBoost and scikit-learn for stochastic gradient boosting in Python

This article explains in detail how to use XGBoost and scikit-learn for stochastic gradient boosting in Python. It is shared as a reference, and I hope you will come away with a solid understanding of the topic after reading it.

A simple technique for building an ensemble of decision trees involves training each tree on a subsample of the training dataset. A subset of the rows in the training data can be used to train individual trees, called bagged trees. When a subset of the columns of the training data is also used when calculating each split point, this is called a random forest. These same techniques can be applied to the decision trees in a gradient boosting model, in a technique called stochastic gradient boosting.

Stochastic gradient boosting

Gradient boosting is a greedy procedure. New decision trees are added to the model to correct the residual error of the existing model. Each decision tree is created using a greedy search procedure that selects split points which best minimize an objective function. This can result in trees that use the same attributes, and even the same split points, again and again.

Bagging is a technique for creating an ensemble of decision trees, each trained on a different random subset of the rows of the training data. The effect is better performance from the ensemble, because the randomness of the samples allows slightly different trees to be created, adding variance to the ensemble's predictions. Random forests take this a step further by also subsampling the features (columns) when selecting split points, further adding to the variation among the trees. These same techniques can be used in the construction of the decision trees in gradient boosting, in a variation called stochastic gradient boosting. It is common to use aggressive subsamples of the training data, such as 40% to 80%.
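
Before tuning each of these in turn, the short sketch below (not part of the original code listings; the 0.8 and 0.5 values are arbitrary placeholders) shows where this randomness is controlled in the scikit-learn wrapper for XGBoost:

from xgboost import XGBClassifier

# Illustrative configuration only; the tutorial below tunes each parameter with a grid search.
model = XGBClassifier(
    subsample=0.8,         # fraction of training rows sampled for each tree
    colsample_bytree=0.8,  # fraction of columns sampled once per tree
    colsample_bylevel=0.5  # fraction of columns sampled at each level of splits within a tree
)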

Overview of the tutorial

In this tutorial, we will look at the effect of different subsampling techniques in gradient boosting. We will tune three different flavors of stochastic gradient boosting supported by the XGBoost library in Python, specifically:

Subsampling of rows in the dataset when creating each tree.

Subsampling of columns in the dataset when creating each tree.

Subsampling of columns for each split in the dataset when creating each tree.

Problem description: Otto dataset

In this tutorial, we will use the Otto Group Product Classification Challenge dataset. This dataset is freely available from Kaggle (you will need to sign up for Kaggle to download it). You can download the training dataset train.csv.zip from the Data page and place the unzipped train.csv file in your working directory. The dataset describes 93 obfuscated details of more than 61,000 products grouped into 10 product categories (for example, fashion, electronics, etc.). The input attributes are counts of different events. The goal is to make predictions for new products as an array of probabilities for each of the 10 categories, and models are evaluated using multiclass logarithmic loss (also called cross entropy). The competition ended in May 2015, and the dataset remains a good challenge for XGBoost because of the nontrivial number of examples, the difficulty of the problem, and the fact that little data preparation is required (other than encoding the string class variable as integers).
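
Every listing in this tutorial begins by loading and preparing the data the same way; shown standalone below (a sketch assuming train.csv sits in the working directory, as described above), that preparation looks like this:

from pandas import read_csv
from sklearn.preprocessing import LabelEncoder

# load data (train.csv unzipped into the working directory)
data = read_csv('train.csv')
dataset = data.values
# split data into input columns X and the target column y
X = dataset[:, 0:94]
y = dataset[:, 94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)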

Tuning row subsampling in XGBoost

Row subsampling involves selecting a random sample of the training dataset without replacement. Row subsampling can be specified via the subsample parameter in the scikit-learn wrapper of the XGBoost class. The default value is 1.0, meaning no subsampling is performed. We can use the grid search capability built into scikit-learn to evaluate the effect of different subsample values from 0.1 to 1.0 on the Otto dataset:

[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]

There are 9 variations of subsample, and each model will be evaluated using 10-fold cross-validation, which means 9 × 10, or 90, models need to be trained and tested.

The complete code listing is provided below.

# XGBoost on Otto dataset, tune subsample
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y
X = dataset[:, 0:94]
y = dataset[:, 94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
subsample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
param_grid = dict(subsample=subsample)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot
pyplot.errorbar(subsample, means, yerr=stds)
pyplot.title("XGBoost subsample vs Log Loss")
pyplot.xlabel('subsample')
pyplot.ylabel('Log Loss')
pyplot.savefig('subsample.png')

Running this example will print the best configuration and the log loss for each test configuration.

Note: your results may be different due to the randomness of the algorithm or evaluation program, or due to differences in numerical accuracy. Consider running the example several times and compare the average results.

We can see that the best result obtained was 0.3, that is, training each tree with a 30% sample of the training dataset.

Best: -0.000647 using {'subsample': 0.3}
-0.001156 (0.000286) with: {'subsample': 0.1}
-0.000765 (0.000430) with: {'subsample': 0.2}
-0.000647 (0.000471) with: {'subsample': 0.3}
-0.000659 (0.000635) with: {'subsample': 0.4}
-0.000717 (0.000849) with: {'subsample': 0.5}
-0.000773 (0.000998) with: {'subsample': 0.6}
-0.000877 (0.001179) with: {'subsample': 0.7}
-0.001007 (0.001371) with: {'subsample': 0.8}
-0.001239 (0.001730) with: {'subsample': 1.0}

We can plot the mean and standard deviation of these log loss values to get a better idea of how performance varies with the subsample value.

We can see that a value of 30% does indeed give the best mean performance, but we can also see that the variance in performance grows markedly as the ratio increases. It is interesting to note that the mean performance of all subsample values is better than that of no subsampling at all (subsample = 1.0).
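
To carry that result forward, a minimal sketch (assuming the X and label_encoded_y arrays prepared in the listing above) of fitting a model with the best subsample value would be:

# a minimal sketch, assuming X and label_encoded_y from the listing above
from xgboost import XGBClassifier

model = XGBClassifier(subsample=0.3)
model.fit(X, label_encoded_y)
# predicted class probabilities, one row per sample (illustrative use of the fitted model)
probabilities = model.predict_proba(X)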

Tuning column subsampling by tree in XGBoost

We can also create a random sample of the features (or columns) to use before creating each decision tree in the boosted model. In the XGBoost wrapper for scikit-learn, this is controlled by the colsample_bytree parameter. The default value is 1.0, meaning all columns are used in each decision tree. We can evaluate values of colsample_bytree between 0.1 and 1.0:

[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]

A complete example is as follows:

# XGBoost on Otto dataset, tune colsample_bytree
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y
X = dataset[:, 0:94]
y = dataset[:, 94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
colsample_bytree = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
param_grid = dict(colsample_bytree=colsample_bytree)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot
pyplot.errorbar(colsample_bytree, means, yerr=stds)
pyplot.title("XGBoost colsample_bytree vs Log Loss")
pyplot.xlabel('colsample_bytree')
pyplot.ylabel('Log Loss')
pyplot.savefig('colsample_bytree.png')

Running this example will print the best configuration and the log loss for each test configuration.

Note: your results may be different due to the randomness of the algorithm or evaluation program, or the difference in numerical accuracy.

We can see that the best performance for the model was colsample_bytree = 1.0. This suggests that column subsampling by tree does not add value on this problem.

Best: -0.001239 using {'colsample_bytree': 1.0}
-0.298955 (0.002177) with: {'colsample_bytree': 0.1}
-0.092441 (0.000798) with: {'colsample_bytree': 0.2}
-0.029993 (0.000459) with: {'colsample_bytree': 0.3}
-0.010435 (0.000669) with: {'colsample_bytree': 0.4}
-0.004176 (0.000916) with: {'colsample_bytree': 0.5}
-0.002614 (0.001062) with: {'colsample_bytree': 0.6}
-0.001694 (0.001221) with: {'colsample_bytree': 0.7}
-0.001306 (0.001435) with: {'colsample_bytree': 0.8}
-0.001239 (0.001730) with: {'colsample_bytree': 1.0}

Plotting the results, we can see a plateau in model performance (at least at this scale) for colsample_bytree values from 0.5 to 1.0.

Tuning column subsampling by split in XGBoost

Rather than subsampling the columns once for each tree, we can subsample them at each split in the decision tree. In principle, this is the approach used in random forests. We can set the size of the sample of columns used at each split via the colsample_bylevel parameter in the XGBoost wrapper class for scikit-learn. As before, we vary the ratio from 10% up to the default of 100%.

The complete code listing is provided below.

# XGBoost on Otto dataset, tune colsample_bylevel
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y
X = dataset[:, 0:94]
y = dataset[:, 94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
colsample_bylevel = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
param_grid = dict(colsample_bylevel=colsample_bylevel)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot
pyplot.errorbar(colsample_bylevel, means, yerr=stds)
pyplot.title("XGBoost colsample_bylevel vs Log Loss")
pyplot.xlabel('colsample_bylevel')
pyplot.ylabel('Log Loss')
pyplot.savefig('colsample_bylevel.png')

Running this example will print the best configuration and the log loss for each test configuration.

Note: your results may be different due to the randomness of the algorithm or evaluation program, or due to differences in numerical accuracy. Consider running the example several times and compare the average results.

We can see that the best result was achieved by setting colsample_bylevel to 70%, resulting in an (inverted) log loss of -0.001062, which is better than the -0.001239 seen when setting the per-tree column sampling to 100%.

If the results for a given problem suggest using 100% of the columns for each tree, it is advisable not to give up on column subsampling entirely, but to try subsampling columns by split instead.

Best: -0.001062 using {'colsample_bylevel': 0.7}
-0.159455 (0.007028) with: {'colsample_bylevel': 0.1}
-0.034391 (0.003533) with: {'colsample_bylevel': 0.2}
-0.007619 (0.000451) with: {'colsample_bylevel': 0.3}
-0.002982 (0.000726) with: {'colsample_bylevel': 0.4}
-0.001410 (0.000946) with: {'colsample_bylevel': 0.5}
-0.001182 (0.001144) with: {'colsample_bylevel': 0.6}
-0.001062 (0.001221) with: {'colsample_bylevel': 0.7}
-0.001071 (0.001427) with: {'colsample_bylevel': 0.8}
-0.001239 (0.001730) with: {'colsample_bylevel': 1.0}

We can plot the performance of each colsample_bylevel variation. The results show relatively low variance and a seeming plateau in performance after a value of 0.3, at this scale.
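
As a closing illustration (not drawn from the original tutorial), a sketch of a model configured with the best values reported above, again assuming the X and label_encoded_y arrays from the earlier listings, might look like:

# a sketch only; these values simply reuse the best grid-search results reported above
from xgboost import XGBClassifier

model = XGBClassifier(subsample=0.3, colsample_bylevel=0.7)
model.fit(X, label_encoded_y)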

That covers how to use XGBoost and scikit-learn for stochastic gradient boosting in Python. I hope the content above has been of some help to you and that you have learned something new. If you found the article useful, feel free to share it so more people can see it.
