
Several practical feature selection methods in Python data mining


This article introduces several practical feature selection methods in Python data mining. Many readers have questions about this topic in daily work, so the editor has collected various materials and sorted them into simple, easy-to-use methods. I hope this helps clear up your doubts. Follow along with the editor to study!

For anyone engaged in data analysis and data mining, feature selection is an unavoidable topic and an indispensable step in the mining process. Good feature selection improves model performance and helps us understand the characteristics and underlying structure of the data, which is important for further improving models and algorithms.

What feature selection does

It reduces the number of features and the dimensionality, giving the model stronger generalization ability and less over-fitting.

It deepens our understanding of the features and their values.

Feature selection methods

1. Feature importance

When the learner is a tree model, effective features can be screened according to feature importance. In sklearn, GBDT and random forests compute feature importance the same way: each feature's importance is computed on every single tree, measuring how much that feature contributes on that tree, and the results are then averaged. On a single tree, a feature's importance is defined as the total decrease in weighted impurity across all the non-leaf nodes where it is used to split; the greater the decrease, the more important the feature.

import numpy as np
from io import StringIO
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import pydotplus

clf = DecisionTreeClassifier()
# Illustrative training data: four feature rows A1-A4 (the original article's
# data values were garbled, so these values are illustrative only)
x = [[1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
     [1, 1, 2, 2, 1, 1, 2, 1, 2, 2],
     [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
     [1, 1, 1, 2, 2, 2, 1, 1, 2, 2]]
y = [1, 1, 2, 2, 2, 3, 3, 3, 2, 2]
x = np.array(x)
x = np.transpose(x)          # shape: (n_samples, n_features)
clf.fit(x, y)
print(clf.feature_importances_)

feature_name = ['A1', 'A2', 'A3', 'A4']
target_name = ['1', '2', '3']
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data, feature_names=feature_name,
                     class_names=target_name, filled=True, rounded=True,
                     special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("Tree.pdf")
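The example above uses a single decision tree, while the discussion also covers GBDT and random forests. The following is a minimal sketch of the ensemble case; the synthetic dataset and all parameters are illustrative assumptions, not taken from the article. A random forest averages the impurity-based importances over its trees and exposes the result through the same feature_importances_ attribute:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 5 informative features out of 10 (parameters are illustrative)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Importances are averaged over all trees and normalized to sum to 1
for rank, i in enumerate(np.argsort(rf.feature_importances_)[::-1], start=1):
    print("%d. feature %d: %.4f" % (rank, i, rf.feature_importances_[i]))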

2. Coefficients of a regression model

The more important a feature is, the larger its coefficient in the model; the less relevant a feature is to the output, the closer its coefficient is to 0. On data with little noise, or where the number of samples far exceeds the number of features, if the features are relatively independent, even the simplest linear regression model can achieve very good results.

from sklearn.linear_model import LinearRegression
import numpy as np

np.random.seed(0)
size = 5000
# A dataset with 3 features
X = np.random.normal(0, 1, (size, 3))
# Y = X0 + 2*X1 + noise
Y = X[:, 0] + 2 * X[:, 1] + np.random.normal(0, 2, size)
lr = LinearRegression()
lr.fit(X, Y)

# A helper method for pretty-printing linear models
def pretty_print_linear(coefs, names=None, sort=False):
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        lst = sorted(lst, key=lambda x: -np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name)
                      for coef, name in lst)

print("Linear model:", pretty_print_linear(lr.coef_))
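One caveat worth illustrating: coefficient magnitudes are only comparable when the features share a scale. The sketch below is an illustrative assumption, not part of the original article; it rescales one feature and shows how standardizing with StandardScaler restores comparable coefficients:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

np.random.seed(0)
# Same generative model as above, except X1 is rescaled by a factor of 100
X = np.random.normal(0, 1, (5000, 3))
X[:, 1] *= 100
Y = X[:, 0] + 2 * X[:, 1] / 100 + np.random.normal(0, 2, 5000)

# Raw coefficient on X1 is ~0.02, tiny only because of X1's large scale
lr = LinearRegression().fit(X, Y)
print("raw coefficients:", lr.coef_)

# After standardization the coefficients are ~[1, 2, 0] and comparable again
X_std = StandardScaler().fit_transform(X)
lr_std = LinearRegression().fit(X_std, Y)
print("standardized coefficients:", lr_std.coef_)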

Print "Linear model:", pretty_print_linear (lr.coef_) 3. Average accuracy reduction is a direct measure of the impact of each feature on the accuracy of the model. The main idea is to disrupt the order of eigenvalues of each feature, and to measure the impact of order changes on the accuracy of the model. Obviously, for unimportant variables, disordering does not have much effect on the accuracy of the model, but for important variables, disrupting the order will reduce the accuracy of the model. This method is not provided directly in sklearn, but it is easy to implement from sklearn.cross_validation import ShuffleSplitfrom sklearn.metrics import r2_scorefrom collections import defaultdict

X = boston ["data"] Y = boston ["target"]

Rf = RandomForestRegressor () scores = defaultdict (list)
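For reference, recent scikit-learn releases ship a built-in variant of this idea, sklearn.inspection.permutation_importance (added in version 0.22); it reports the raw score drop per permuted feature rather than the relative drop computed above. Below is a minimal sketch on synthetic data; the dataset and all parameters are illustrative assumptions, not the Boston data used above:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Illustrative synthetic regression data: 3 informative features out of 8
X, y = make_regression(n_samples=1000, n_features=8, n_informative=3,
                       noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Permute each feature 10 times on held-out data and average the R^2 drop
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print("feature %d: %.4f +/- %.4f"
          % (i, result.importances_mean[i], result.importances_std[i]))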

At this point, the study of "several practical feature selection methods in Python data mining" is over. I hope it helps resolve your doubts. Pairing theory with practice is the best way to learn, so go and try these methods! If you want to continue learning more related knowledge, please keep following the website; the editor will keep working hard to bring you more practical articles!
