How to use Random Forest to calculate and evaluate feature importance in Python


This article introduces how to use random forests in Python to calculate and evaluate feature importance. Many readers have questions about this in daily work, so the editor has consulted various materials and put together a simple, practical method, hoping it helps clear up your doubts. Now, please follow the editor and study it!

Contents

1 Preface

2 Brief Introduction to Random Forest (RF)

3 Evaluating Feature Importance

4 An Example

1 Preface

Random forest is an ensemble learning algorithm based on decision trees. It is very simple, easy to implement, and computationally cheap, yet it delivers surprisingly strong performance in both classification and regression. For this reason, random forest is also known as "a method that represents the state of the art of ensemble learning".

2 Brief Introduction to Random Forest (RF)

Random forests are fairly easy to understand once you understand the decision tree algorithm. The random forest algorithm can be summarized in the following steps (a minimal code sketch follows Figure 1):

1. Use sampling with replacement (bootstrap) to draw n samples from the sample set as a training set.

2. Grow a decision tree from the bootstrap sample. At each node:

Randomly select d features without replacement.

Use these d features to split the sample set, choosing the best one (judged by, for example, the Gini index, gain ratio, or information gain).

3. Repeat steps 1 and 2 a total of k times, where k is the number of decision trees in the random forest.

4. Use the trained random forest to predict the test samples, and determine the final prediction by majority vote.

The following figure illustrates the random forest algorithm (picture from Ref. 2):

Figure 1: Schematic diagram of the random forest algorithm
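
To make the steps above concrete, here is a minimal, illustrative sketch in Python. This is not how sklearn implements random forests internally; the helper names (train_random_forest, predict_majority) and the choice of d via max_features='sqrt' are assumptions made purely for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(x, y, k=10, seed=0):
    """Steps 1-3: grow k trees, each on its own bootstrap sample."""
    rng = np.random.RandomState(seed)
    n = x.shape[0]
    trees = []
    for _ in range(k):
        idx = rng.choice(n, size=n, replace=True)  # step 1: draw n samples with replacement
        tree = DecisionTreeClassifier(
            max_features='sqrt',  # step 2: consider d random features at each split
            random_state=rng)
        tree.fit(x[idx], y[idx])
        trees.append(tree)
    return trees

def predict_majority(trees, x):
    """Step 4: predict each sample by majority vote over the k trees."""
    votes = np.stack([t.predict(x) for t in trees]).astype(int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)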

Yes, this algorithm, with randomness everywhere, achieves excellent results in both classification and regression. Doesn't it seem almost too strong to explain?

However, the focus of this article is not the algorithm itself, but the evaluation of feature importance that comes next.

3 Evaluating Feature Importance

In practice, a dataset often has hundreds or thousands of features, so rather than the final results we often care more about how to select the features with the greatest impact, in order to reduce the number of features when building a model. There are many such methods, such as principal component analysis, lasso, and so on. Here, however, we introduce feature screening with random forests.

The idea of using a random forest to evaluate feature importance is actually very simple. To put it bluntly, we look at how much each feature contributes to each tree in the forest, take the average, and finally compare the contributions across features.

So what metric measures this contribution? Usually the Gini index or the out-of-bag (OOB) error rate can be used as the evaluation metric.

Here we only introduce the Gini index method; if you want to learn about the other one, you can refer to Ref. 2. A short code sketch of the averaging idea follows.
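
As a quick sketch of that averaging (assuming a fitted sklearn RandomForestClassifier named forest, like the one trained in Section 4): each tree exposes its own impurity-based importances, and the forest-level importance is essentially their mean over all trees.

import numpy as np

# Each tree reports how much every feature reduced node impurity (Gini).
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Averaging over the trees gives (essentially) forest.feature_importances_.
mean_importance = per_tree.mean(axis=0)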

4 An Example

Fortunately, sklearn has already encapsulated everything for us; we just need to call its functions.

Let's use the wine dataset from UCI as an example and import it first.

import pandas as pd

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
df = pd.read_csv(url, header=None)
df.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
              'Alcalinity of ash', 'Magnesium', 'Total phenols',
              'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
              'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
              'Proline']

Then let's take a look at what this dataset looks like.

import numpy as np

np.unique(df['Class label'])

The output is

array([1, 2, 3], dtype=int64)

We can see that there are three classes. Next, let's look at the dataset's info:

df.info()

The output is

RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
Class label                     178 non-null int64
Alcohol                         178 non-null float64
Malic acid                      178 non-null float64
Ash                             178 non-null float64
Alcalinity of ash               178 non-null float64
Magnesium                       178 non-null int64
Total phenols                   178 non-null float64
Flavanoids                      178 non-null float64
Nonflavanoid phenols            178 non-null float64
Proanthocyanins                 178 non-null float64
Color intensity                 178 non-null float64
Hue                             178 non-null float64
OD280/OD315 of diluted wines    178 non-null float64
Proline                         178 non-null int64
dtypes: float64(11), int64(3)
memory usage: 19.5 KB

We can see that there are 13 features besides the class label, and the dataset has 178 rows.

Following the usual practice, we split the dataset into a training set and a test set.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

x, y = df.iloc[:, 1:].values, df.iloc[:, 0].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
feat_labels = df.columns[1:]

forest = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
forest.fit(x_train, y_train)

And with that, the random forest is trained and the feature importances have been computed. Let's take them out and have a look.

importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(x_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))

The output is

 1) Color intensity                0.182483
 2) Proline                        0.158610
 3) Flavanoids                     0.150948
 4) OD280/OD315 of diluted wines   0.131987
 5) Alcohol                        0.106589
 6) Hue                            0.078243
 7) Total phenols                  0.060718
 8) Alcalinity of ash              0.032033
 9) Malic acid                     0.025400
10) Proanthocyanins                0.022351
11) Magnesium                      0.022078
12) Nonflavanoid phenols           0.014645
13) Ash                            0.013916

That's right. It's so convenient.

If you want to keep only the variables with higher importance, you can do it like this:

threshold = 0.15
x_selected = x_train[:, importances > threshold]
x_selected.shape

The output is

(124, 3)
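
For reference, sklearn's SelectFromModel wraps this same thresholding idea; the following sketch reuses the forest already fitted above (the threshold 0.15 mirrors the manual version, so the result should match):

from sklearn.feature_selection import SelectFromModel

# prefit=True reuses the trained forest instead of refitting it;
# features whose importance meets the threshold are kept.
sfm = SelectFromModel(forest, threshold=0.15, prefit=True)
x_selected = sfm.transform(x_train)
x_selected.shape  # (124, 3)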

At this point, the study of how Python uses random forests to calculate and evaluate feature importance is over. I hope it has resolved your doubts; pairing theory with practice is the best way to learn, so go and try it! If you want to keep learning more related knowledge, please continue to follow the site, where the editor will keep bringing you practical articles.
