

The Principle of Random Forest and Its Python Code Implementation

2025-01-28 Update From: SLTechnology News&Howtos

This article introduces the principle of random forest and its Python code implementation. The content is fairly detailed; interested readers can use it as a reference and will hopefully find it helpful.

Recently, while working on Kaggle competitions, I found that the random forest algorithm performs very well on classification tasks; in most cases it does much better than SVM, logistic regression, KNN, and other algorithms. So I wanted to dig into the principle behind this algorithm.

To understand random forests, let us first briefly introduce ensemble learning methods and the decision tree algorithm.

Concepts and differences between Bagging and Boosting

Both Bagging and Boosting combine existing classification or regression algorithms in a certain way to form a more powerful classifier. More precisely, they are methods for assembling weak classifiers into a strong classifier.

First, a word about bootstrapping: it is a sampling method with replacement, so the same sample may be drawn more than once.

1. Bagging (bootstrap aggregating)

Bagging, i.e., the bagging method, proceeds as follows (a brief code sketch follows the list):

A) Training sets are drawn from the original sample set. In each round, n training samples are drawn from the original set using the bootstrap method (within a training set, some samples may be drawn several times while others may never be selected). A total of k rounds are carried out, yielding k training sets (the k training sets are independent of each other).

B) Each training set is used to train one model, so k training sets yield k models. (Note: no specific classification or regression algorithm is prescribed here; we can use different methods depending on the problem, such as decision trees, perceptrons, etc.)

C) For classification problems, the k models from the previous step vote to produce the final classification result; for regression problems, the average of the models' outputs is taken as the final result. (All models carry equal importance.)
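As a brief illustration of this procedure, here is a hedged sketch using scikit-learn's BaggingClassifier (assuming scikit-learn is installed); its default base model is a decision tree, the dataset is synthetic, and the parameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic data standing in for the original sample set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

bagging = BaggingClassifier(
    n_estimators=10,   # k rounds of sampling -> k models (default base model: a decision tree)
    bootstrap=True,    # draw each training set with replacement
)
bagging.fit(X, y)
print(bagging.predict(X[:5]))  # classification by combining the k models' votes
```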

2. Boosting

The main idea is to assemble weak classifiers into a strong classifier. Under the PAC (probably approximately correct) learning framework, weak classifiers can provably be combined into a strong classifier.

Two core questions about Boosting:

1) How is the weight or probability distribution of the training data changed in each round?

By increasing the weights of the samples that the weak classifier misclassified in the previous round and decreasing the weights of the samples it classified correctly, the next classifier pays more attention to the previously misclassified data.

2) How are the weak classifiers combined?

The weak classifiers are combined linearly through an additive model. AdaBoost, for example, uses weighted majority voting: classifiers with low error rates receive larger weights, and classifiers with higher error rates receive smaller weights.

Boosting trees, on the other hand, gradually reduce the residuals by fitting them at each step and summing the models generated at each step to obtain the final model.
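To make the two answers above concrete, here is a minimal NumPy sketch of a single AdaBoost round; the labels and weak-classifier outputs are invented toy values, and the updates follow the standard AdaBoost formulas rather than any code from this article.

```python
import numpy as np

y = np.array([1, 1, -1, -1, 1])        # true labels in {-1, +1}
pred = np.array([1, -1, -1, -1, -1])   # one weak classifier's outputs (2 of 5 wrong)

w = np.full(len(y), 1 / len(y))        # start from uniform sample weights

err = w[pred != y].sum()               # weighted error rate of this round
alpha = 0.5 * np.log((1 - err) / err)  # classifier weight: lower error -> larger alpha

# Misclassified samples (y * pred = -1) are multiplied by exp(+alpha),
# correctly classified ones by exp(-alpha); then renormalize to a distribution.
w = w * np.exp(-alpha * y * pred)
w /= w.sum()

print(round(alpha, 3), np.round(w, 3))
```

The final strong classifier is then the sign of the sum of alpha_t * h_t(x) over all rounds, which is exactly the weighted majority vote described above.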

3. The differences between Bagging and Boosting

The differences between Bagging and Boosting are:

1) Sample selection:

Bagging: the training sets are drawn from the original set with replacement, and the training sets drawn in different rounds are independent of each other.

Boosting: the training set stays the same in every round; only the weight of each sample in the training set changes, and the weights are adjusted according to the classification results of the previous round.

2) Sample weights:

Bagging: uniform sampling, with every sample carrying equal weight.

Boosting: sample weights are continually adjusted according to the error rate; the larger the error, the larger the weight.

3) Prediction functions:

Bagging: all prediction functions carry equal weight.

Boosting: each weak classifier has its own weight, and classifiers with smaller classification error receive larger weights.

4) Parallel computation:

Bagging: the prediction functions can be generated in parallel.

Boosting: the prediction functions can only be generated sequentially, because the parameters of each model depend on the results of the previous one.

4. Summary

Both methods combine several classifiers into a single classifier, but they combine them in different ways and end up with different results. Plugging different classification algorithms into these frameworks generally improves on the original single classifier to some extent, at the cost of more computation.

Combining decision trees with these frameworks yields new algorithms (a scikit-learn sketch follows the list):

1) Bagging + decision tree = random forest

2) AdaBoost + decision tree = boosting tree

3) Gradient Boosting + decision tree = GBDT
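As a hedged illustration, scikit-learn ships ready-made versions of these three combinations; the hyperparameter values below are arbitrary examples rather than tuned settings.

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)

# 1) Bagging + decision tree = random forest
rf = RandomForestClassifier(n_estimators=100)

# 2) AdaBoost + decision tree = boosting tree
#    (the default weak learner is a depth-1 decision tree, i.e. a decision stump)
ada = AdaBoostClassifier(n_estimators=100)

# 3) Gradient Boosting + decision tree = GBDT
#    (a regression tree is fit to the residuals at each step)
gbdt = GradientBoostingClassifier(n_estimators=100)
```

All three expose the same fit/predict interface, so they can be swapped into the same pipeline.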

Ensemble learning accomplishes a learning task by constructing and combining multiple learners; the combination can often achieve better generalization performance than any single learner.

Consider a simple example. In a binary classification task, suppose three classifiers are evaluated on three test samples, with √ marking a correct classification and × an error, and the ensemble result is produced by voting ("the minority obeys the majority"). In case (a), each classifier is only 66.6% accurate, yet the ensemble reaches 100%; in case (b), the three classifiers behave identically, and combining them brings no improvement; in case (c), each classifier is only 33.3% accurate, and the ensemble result becomes even worse. This simple example shows that to obtain a good ensemble, the individual learners should be "accurate and diverse": each learner should reach a certain level of accuracy (it must not be too weak), and the learners should differ from one another.
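Since the original figure is not reproduced here, the correct/incorrect pattern in the small sketch below is an assumed example consistent with case (a): each classifier is right on two of three samples, yet the majority vote is right on all three.

```python
import numpy as np

true_labels = np.array([1, 0, 1])
preds = np.array([
    [1, 0, 0],   # classifier 1: correct, correct, wrong   (66.6%)
    [0, 0, 1],   # classifier 2: wrong,   correct, correct (66.6%)
    [1, 1, 1],   # classifier 3: correct, wrong,   correct (66.6%)
])

votes = (preds.sum(axis=0) >= 2).astype(int)   # majority vote per sample
print(votes)                                    # [1 0 1]
print((votes == true_labels).mean())            # ensemble accuracy: 1.0
```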

According to how the individual learners are generated, current ensemble learning methods fall roughly into two categories: sequential methods, in which there are strong dependencies among the individual learners so they must be generated one after another, and parallel methods, in which the learners have no strong dependencies and can be generated simultaneously. The former is represented by Boosting, the latter by Bagging and random forest.

Bagging and Random Forest

To obtain an ensemble with strong generalization performance, the individual learners in the ensemble should be as independent of each other as possible. Although full independence is hard to achieve in practice, we can at least try to make the base learners as different as possible.

Here we use the bootstrap sampling method: given a dataset containing m samples, we first randomly draw one sample and put it into the sampling set, then put that sample back into the initial dataset so that it may be drawn again in later rounds. After m such random draws we obtain a sampling set of m samples; some samples from the initial training set appear several times in it, while others never appear at all.

In this way we can draw T sampling sets, each containing m training samples, train a base learner on each of them, and then combine these base learners; this is the basic flow of Bagging. When combining the predictions, Bagging usually uses simple voting for classification tasks and simple averaging for regression tasks.
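A minimal plain-Python sketch of this sampling procedure (the names are illustrative); it also shows empirically that roughly 63% of the original samples appear in a bootstrap sample while the remaining ~37% never show up.

```python
import random

def bootstrap_sample(dataset):
    """Draw len(dataset) samples with replacement."""
    return [random.choice(dataset) for _ in range(len(dataset))]

data = list(range(1000))             # stand-in for a dataset with m = 1000 samples
sample = bootstrap_sample(data)
print(len(set(sample)) / len(data))  # roughly 0.632
```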

Random forest is an extension of Bagging. On top of a Bagging ensemble built from decision trees, random forest introduces random attribute selection (i.e., random feature selection) into the training of each decision tree. When choosing the splitting attribute, a traditional decision tree picks the optimal attribute from the full attribute set of the current node (say there are d attributes); in a random forest, for each node of a base decision tree, we first randomly choose a subset of k attributes from that node's attribute set and then pick the optimal attribute from this subset for the split. The parameter k controls how much randomness is introduced: with k = d, the base decision tree is built exactly like a traditional decision tree, while with k = 1, a single attribute is chosen at random for the split.
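A small sketch of the per-node random attribute selection; the function name is illustrative, and log2(d) is used as a commonly recommended default for k, with k = d and k = 1 being the two extremes discussed above.

```python
import math
import random

def candidate_features(n_features, k=None):
    """Randomly choose k attribute indices for one tree node.
    k = n_features reproduces a traditional decision tree; k = 1 picks a
    single random attribute; log2(n_features) is a commonly used default."""
    if k is None:
        k = max(1, int(math.log2(n_features)))
    return random.sample(range(n_features), k)

print(candidate_features(16))  # e.g. [3, 11, 7, 0] -- 4 of the 16 attributes
```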

This article only discusses random forests for classification. When a random forest is used for classification, n decision trees each produce a classification, and the final class is obtained by simple voting, which improves classification accuracy.

For readers who are not very familiar with decision trees, you can first read this article:

Python Machine Learning: decision Tree ID3, C4.5

To put it simply, a random forest is an ensemble of decision trees, but with two differences:

(1) The difference in sampling: a sampling set of m samples is drawn with replacement from the dataset of m samples and used for training, which ensures that the training samples of the individual decision trees are not exactly the same.

(2) The difference in feature selection: the n candidate features of each decision tree are randomly selected from all features (n is a parameter we need to tune ourselves).

The parameters that need to be tuned in a random forest are the following (a scikit-learn mapping is sketched after the list):

(1) The number of decision trees

(2) The number of feature attributes

(3) The recursion depth (i.e., the depth of each decision tree)
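For reference, a hedged sketch of how these three knobs map onto scikit-learn's RandomForestClassifier (the values are arbitrary starting points, not recommendations):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,     # (1) the number of decision trees
    max_features="sqrt",  # (2) the number of feature attributes tried at each split
    max_depth=10,         # (3) the recursion depth of each tree
)
```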

Next, let's talk about how to implement a random forest in code.

Code implementation process (a minimal from-scratch sketch follows the list):

(1) Import the data file and convert all features to float.

(2) Divide the dataset into n folds for cross-validation.

(3) Construct data subsets (by random sampling) and select the optimal feature from a specified number of features (say m, a parameter tuned by hand).

(4) Construct the decision tree.

(5) Create the random forest (a combination of multiple decision trees).

(6) Feed in the test set and output the predictions.

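Putting the six steps together, below is a minimal from-scratch sketch of the process. The function names, the CSV layout (class label in the last column), and all default parameter values are illustrative assumptions rather than the original post's code; in particular, the Gini-based split cost here plays the role the summary below attributes to the spilt_loss function.

```python
import random
from csv import reader

def load_csv(filename):
    """Step (1): read a CSV file and convert the feature columns to float
    (the last column is assumed to be the class label)."""
    with open(filename) as f:
        rows = [row for row in reader(f) if row]
    return [[float(x) for x in row[:-1]] + [row[-1]] for row in rows]

def cross_validation_split(dataset, n_folds):
    """Step (2): split the dataset into n_folds parts for cross-validation."""
    data, folds = list(dataset), []
    fold_size = len(dataset) // n_folds
    for _ in range(n_folds):
        folds.append([data.pop(random.randrange(len(data))) for _ in range(fold_size)])
    return folds

def subsample(dataset):
    """Step (3a): bootstrap sample -- draw len(dataset) rows with replacement."""
    return [random.choice(dataset) for _ in range(len(dataset))]

def gini_index(groups, classes):
    """Split cost: weighted Gini impurity of the left/right groups."""
    total = sum(len(g) for g in groups)
    cost = 0.0
    for group in groups:
        if not group:
            continue
        labels = [row[-1] for row in group]
        score = sum((labels.count(c) / len(group)) ** 2 for c in classes)
        cost += (1.0 - score) * len(group) / total
    return cost

def get_split(dataset, n_features):
    """Step (3b): among n_features randomly chosen features, find the split
    (feature index and value) with the lowest cost."""
    classes = list(set(row[-1] for row in dataset))
    best = {"cost": float("inf")}
    for index in random.sample(range(len(dataset[0]) - 1), n_features):
        for row in dataset:
            left = [r for r in dataset if r[index] < row[index]]
            right = [r for r in dataset if r[index] >= row[index]]
            cost = gini_index((left, right), classes)
            if cost < best["cost"]:
                best = {"index": index, "value": row[index],
                        "groups": (left, right), "cost": cost}
    return best

def to_terminal(group):
    """Leaf value: the most common class label in the group."""
    labels = [row[-1] for row in group]
    return max(set(labels), key=labels.count)

def split(node, max_depth, min_size, n_features, depth):
    """Step (4): recursively grow the tree (a nested dict), stopping at the
    depth limit or when a group becomes too small to split further."""
    left, right = node.pop("groups")
    if not left or not right:
        node["left"] = node["right"] = to_terminal(left + right)
        return
    if depth >= max_depth:
        node["left"], node["right"] = to_terminal(left), to_terminal(right)
        return
    for side, group in (("left", left), ("right", right)):
        if len(group) <= min_size:
            node[side] = to_terminal(group)
        else:
            node[side] = get_split(group, n_features)
            split(node[side], max_depth, min_size, n_features, depth + 1)

def build_tree(train, max_depth, min_size, n_features):
    root = get_split(train, n_features)
    split(root, max_depth, min_size, n_features, 1)
    return root

def predict(node, row):
    """Walk the nested-dict tree until a leaf (non-dict) value is reached."""
    branch = node["left"] if row[node["index"]] < node["value"] else node["right"]
    return predict(branch, row) if isinstance(branch, dict) else branch

def random_forest(train, test, max_depth=10, min_size=1, n_trees=10, n_features=2):
    """Steps (5) and (6): build n_trees trees on bootstrap samples, then
    predict each test row by simple majority vote."""
    trees = [build_tree(subsample(train), max_depth, min_size, n_features)
             for _ in range(n_trees)]
    predictions = []
    for row in test:
        votes = [predict(tree, row) for tree in trees]
        predictions.append(max(set(votes), key=votes.count))
    return predictions
```

For example, random_forest(load_csv("train.csv"), load_csv("test.csv")) would return one voted class label per test row (the file names are placeholders), and cross_validation_split can be used to evaluate the whole pipeline fold by fold.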

A summary of the above code:

Training part: suppose we take m features from the dataset to construct a decision tree. We first traverse each of the m features and then traverse each row, selecting the optimal feature and feature value through the spilt_loss function (which computes the split cost), and split the rows into two groups (left and right) according to whether they exceed that feature value. These steps are repeated until a node can no longer be split or the recursion limit is reached (to prevent over-fitting). The result is a decision tree.

Test part: each row of the test set is evaluated against the decision tree, which is a nested dictionary with two branches at every level. Each row descends step by step according to the classification index stored in the tree until a non-dictionary value is reached; that value is our prediction.
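To make that traversal concrete, here is a tiny hand-made tree in the nested-dictionary form just described; the feature indices, thresholds, and class labels are invented for illustration.

```python
# 'index' is the feature column to test and 'value' the threshold; leaf values
# are class labels.
tree = {"index": 0, "value": 2.5,
        "left": "A",
        "right": {"index": 1, "value": 7.0, "left": "B", "right": "C"}}

def predict(node, row):
    branch = node["left"] if row[node["index"]] < node["value"] else node["right"]
    return predict(branch, row) if isinstance(branch, dict) else branch

print(predict(tree, [1.0, 9.0]))  # row[0] < 2.5 -> left branch -> "A"
print(predict(tree, [3.0, 9.0]))  # right branch, then row[1] >= 7.0 -> "C"
```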

That is all on the principle of random forest and its Python code implementation. I hope the content above is of some help to you. If you found the article useful, feel free to share it so more people can see it.
