
How to Use XGBoost for Feature Importance Analysis and Feature Selection in Python

2025-01-29 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

Many newcomers are unclear about how to use XGBoost in Python for feature importance analysis and feature selection. To help with this, the following walks through the topic in detail; readers who need it can follow along, and hopefully you will get something out of it.

An advantage of ensembles of decision trees, such as gradient boosting, is that they can automatically provide estimates of feature importance from a trained predictive model.

A benefit of using gradient boosting is that, once the boosted trees have been constructed, it is relatively straightforward to retrieve an importance score for each attribute. Generally, importance provides a score that indicates how useful or valuable each feature was when building the boosted decision trees within the model. The more an attribute is used to make key decisions in the decision trees, the higher its relative importance.

This importance is calculated explicitly for each attribute in the dataset, allowing the attributes to be ranked and compared with one another. The importance for a single decision tree is computed from the amount by which each attribute's split points improve the performance measure, weighted by the number of observations each node is responsible for. The performance measure may be the purity (for example, the Gini index) used to select the split points, or another more specific error function. The feature importances are then averaged across all of the decision trees in the model.
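As an aside, XGBoost can report several different "importance types" through the underlying Booster object. The following is a minimal sketch (not part of the original article) of inspecting them directly; it assumes the Pima Indians diabetes CSV used later in this article is already in the working directory:

# A sketch, not from the original article: inspect the different importance
# types XGBoost can report. "weight" counts how often a feature is used to
# split, "gain" is the average improvement of the split criterion, and
# "cover" is the average number of observations affected by the splits.
from numpy import loadtxt
from xgboost import XGBClassifier

dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
X, y = dataset[:, 0:8], dataset[:, 8]

model = XGBClassifier()
model.fit(X, y)

booster = model.get_booster()
for importance_type in ("weight", "gain", "cover"):
    print(importance_type, booster.get_score(importance_type=importance_type))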

Manually plot feature importance

A trained XGBoost model automatically calculates feature importance for your predictive modeling problem. These importance scores are available in the feature_importances_ member variable of the trained model. For example, they can be printed directly as follows:

print(model.feature_importances_)

We can plot these scores directly on a bar chart to get a visual indication of the relative importance of each feature in the dataset. For example:

# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()

We can demonstrate this by training an XGBoost model on the Pima Indians diabetes dataset and creating a bar chart from the calculated feature importances.

Download the dataset and place it in the current working directory.

Dataset file:

https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv

Dataset details:

https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names
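Alternatively, if you prefer not to download the file by hand, a small sketch like the following (not part of the original listings, and assuming pandas is installed) loads the CSV straight from the URL above; the article's own examples use numpy's loadtxt on the local file instead:

# A convenience sketch, assuming pandas is available; the article's own
# examples load the local file with numpy's loadtxt instead.
import pandas as pd

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
dataframe = pd.read_csv(url, header=None)  # the file has no header row
dataset = dataframe.values                 # same 2D array the later examples expect
X = dataset[:, 0:8]
y = dataset[:, 8]
print(dataset.shape)  # expected (768, 9): 8 input features plus the class label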

# plot feature importance manually
from numpy import loadtxt
from xgboost import XGBClassifier
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:, 0:8]
y = dataset[:, 8]
# fit model on training data
model = XGBClassifier()
model.fit(X, y)
# feature importance
print(model.feature_importances_)
# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()

Note: your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running this example first prints the importance scores.

[0.089701 0.17109634 0.08139535 0.04651163 0.10465116 0.2026578 0.1627907 0.14119601]

We have also obtained a bar chart of relative importance.

A downside of this plot is that the features are ordered by their input index rather than by their importance. We could sort the features ourselves before plotting.
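As a minimal sketch of doing that by hand (not part of the original listings, assuming the model from the example above has already been fit):

# A sketch, not from the original article: sort the importance scores in
# descending order with numpy before plotting, so the tallest bars come first.
from numpy import argsort
from matplotlib import pyplot

importances = model.feature_importances_
order = argsort(importances)[::-1]             # indices from most to least important
labels = ["f%d" % i for i in order]            # XGBoost-style default feature names
pyplot.bar(range(len(importances)), importances[order], tick_label=labels)
pyplot.show()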

Fortunately, there is a built-in drawing function that can help us.

Use the built-in XGBoost feature importance plot

The XGBoost library provides a built-in function for plotting features ordered by their importance. The function is called plot_importance() and can be used as follows:

# plot feature importance
plot_importance(model)
pyplot.show()

For example, the following is a complete code listing that uses the built-in plot_importance () function to plot the feature importance of the Pima Indians dataset.

# plot feature importance using built-in function
from numpy import loadtxt
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:, 0:8]
y = dataset[:, 8]
# fit model on training data
model = XGBClassifier()
model.fit(X, y)
# plot feature importance
plot_importance(model)
pyplot.show()

Note: your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example will provide us with a more useful bar chart.

You can see that the features are automatically named F0 to F7 according to their index in the input array (X). Manually mapping these indices to the names in the problem description, the plot shows that F5 (body mass index) has the highest importance and F3 (skin fold thickness) the lowest.
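If you would rather see meaningful names than F0 to F7 on the plot, one option (a sketch, not part of the original article) is to fit the model on a pandas DataFrame, since recent XGBoost versions then remember the column names and plot_importance uses them as labels. The column names below are the commonly used ones for this dataset, not something taken from the CSV itself:

# A sketch, not from the original article: give the columns names so that
# plot_importance labels the bars with them instead of f0..f7.
import pandas as pd
from matplotlib import pyplot
from xgboost import XGBClassifier, plot_importance

columns = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
           "insulin", "bmi", "pedigree", "age", "class"]
data = pd.read_csv("pima-indians-diabetes.csv", header=None, names=columns)
X = data[columns[:-1]]
y = data["class"]

model = XGBClassifier()
model.fit(X, y)
plot_importance(model)
pyplot.show()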

Feature selection with XGBoost feature importance scores

Feature importance scores can be used for feature selection in scikit-learn. This is done with the SelectFromModel class, which takes a model and can transform a dataset into a subset containing only the selected features. The class can take a pre-trained model, such as one trained on the entire training dataset. It then uses a threshold to decide which features to select. This threshold is applied when you call the transform() method on the SelectFromModel instance, so the same features are consistently selected on the training dataset and the test dataset.

In the example below, we first train and evaluate an XGBoost model on the entire training dataset and the test dataset respectively. Using the feature importances calculated from the training dataset, we then wrap the model in a SelectFromModel instance. We use it to select features on the training dataset, train a model from the selected feature subset, and then evaluate that model on the test set, following the same feature selection scheme.

For example:

# select features using threshold
selection = SelectFromModel(model, threshold=thresh, prefit=True)
select_X_train = selection.transform(X_train)
# train model
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
# eval model
select_X_test = selection.transform(X_test)
y_pred = selection_model.predict(select_X_test)

For interest, we can test multiple thresholds for selecting features by importance. Specifically, the feature importance of each input variable essentially lets us test every candidate subset by importance, starting from all features down to the subset containing only the single most important feature.

The complete code listing is provided below:

# use feature importance for feature selection
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:, 0:8]
Y = dataset[:, 8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy * 100.0))

Note that if you are using XGBoost 1.0.2 (and possibly other versions), there is a bug in the XGBClassifier class that results in the error:

KeyError: 'weight'

This can be solved by using a custom XGBClassifier class, which returns None for the coef_ property. A complete example is listed below.

# use feature importance for feature selection, with fix for xgboost 1.0.2
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel

# define custom class to fix bug in xgboost 1.0.2
class MyXGBClassifier(XGBClassifier):
    @property
    def coef_(self):
        return None

# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:, 0:8]
Y = dataset[:, 8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on all training data
model = MyXGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy * 100.0))

Note: your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running this example will print the following output.

Accuracy: 77.95%
Thresh=0.071, n=8, Accuracy: 77.95%
Thresh=0.073, n=7, Accuracy: 76.38%
Thresh=0.084, n=6, Accuracy: 77.56%
Thresh=0.090, n=5, Accuracy: 76.38%
Thresh=0.128, n=4, Accuracy: 76.38%
Thresh=0.160, n=3, Accuracy: 74.80%
Thresh=0.186, n=2, Accuracy: 71.65%
Thresh=0.208, n=1, Accuracy: 63.78%

We can see that the performance of the model generally decreases as the number of selected features decreases.

On this problem there is a trade-off between the number of features and accuracy on the test set; we might decide to use a less complex model (fewer attributes, for example n=4) and accept a modest decrease in estimated accuracy, from 77.95% down to 76.38%.

This is likely a wash on such a small dataset, but it may be a more useful strategy on larger datasets, especially when cross-validation is used as the model evaluation scheme.
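As a hedged sketch of what that could look like (not part of the original article, and using scikit-learn's Pipeline with SelectFromModel's max_features to keep the top-k features rather than an explicit threshold), feature selection can be repeated inside every cross-validation fold:

# A sketch, not from the original article: score each feature-subset size with
# cross-validation instead of a single train/test split. Wrapping
# SelectFromModel and the classifier in a Pipeline keeps feature selection
# inside each fold, so no information leaks from the held-out fold.
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
X, y = dataset[:, 0:8], dataset[:, 8]

# keep the k most important features, for k = 1..8
for k in range(1, X.shape[1] + 1):
    pipeline = Pipeline([
        ("select", SelectFromModel(XGBClassifier(), threshold=-float("inf"), max_features=k)),
        ("model", XGBClassifier()),
    ])
    scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
    print("n=%d, CV Accuracy: %.2f%% (+/- %.2f%%)" % (k, scores.mean() * 100.0, scores.std() * 100.0))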

Was the above content helpful to you? If you would like to learn more about this topic or read more related articles, please follow the industry information channel. Thank you for your support.
