Many people new to machine learning are unsure how to understand decision tree pruning in a visual way. This article summarizes why pruning is needed and how it works, so that by the end you can apply it yourself.
Visualization of decision trees
Pruning
If no restrictions are placed on a decision tree, it can grow into a very large tree whose leaf nodes each cover a "pure" set of training samples. Such a tree is very accurate on the training samples but noticeably worse on the test set.
The more layers and leaf nodes the tree has, the more finely it carves up the training data, and the more easily it overfits, which leads to poor predictions on test data. Pruning the decision tree solves this problem.
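As a quick illustration, here is a minimal sketch, assuming scikit-learn and its built-in iris dataset with a held-out test split; the exact accuracies depend on the random split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=66)

# With no restrictions the tree grows until every leaf is pure
clf = DecisionTreeClassifier(random_state=66)
clf.fit(X_train, y_train)

print("training accuracy:", clf.score(X_train, y_train))  # typically 1.0
print("test accuracy:    ", clf.score(X_test, y_test))    # usually lower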
There are two mainstream pruning schemes: pre-pruning and post-pruning.
Pre-pruning restricts the growth of the tree in advance to prevent overfitting. For example, we can require that a node contain a certain amount of data before it may be split, and otherwise keep it as a leaf node; or we can stop growing a node once the proportion of some class within it exceeds a threshold.
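A minimal pre-pruning sketch, again assuming scikit-learn and the iris dataset: min_samples_split implements the "only split nodes with enough samples" rule described above (scikit-learn has no built-in class-proportion stopping rule), and get_n_leaves (scikit-learn 0.21+) shows how much earlier the tree stops growing.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Fully grown tree vs. a pre-pruned tree that stops splitting small nodes
full = DecisionTreeClassifier(random_state=66).fit(iris.data, iris.target)
pre_pruned = DecisionTreeClassifier(
    min_samples_split=20,  # split a node only if it holds at least 20 samples
    random_state=66).fit(iris.data, iris.target)

print("leaves, no restriction:", full.get_n_leaves())
print("leaves, pre-pruned:    ", pre_pruned.get_n_leaves())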
Let's focus on post-pruning, because this is the method used by CART.
CART pruning algorithm flow
The CART tree adopts post-pruning: a complete decision tree is first grown from the training set, and the non-leaf nodes are then examined from the bottom up. If replacing the subtree rooted at a node with a leaf node improves generalization performance, that subtree is replaced with a leaf.
Li Hang describes the steps of the CART pruning algorithm in his book Statistical Learning Methods.
It may look complicated, but the core idea is to prune the complete tree T0 from the bottom upward toward the root node. This process produces a sequence of subtrees {T0, T1, ..., Tn}, and the best subtree is then selected from this sequence by cross-validation on a validation set.
How do we measure which subtree is best? Start with the decision tree's loss function:
The loss function of the subtree Tt rooted at node t, before pruning, is:
Cα(Tt) = C(Tt) + α|Tt|
Here C(Tt) is the prediction error on the training data (measured by the Gini index for classification trees and by mean squared error for regression trees), and |Tt| is the number of leaf nodes of Tt. The only free quantity in the formula is the regularization parameter α: the larger its value, the stronger the pruning. As α increases gradually from 0 to ∞, the optimal subtree shrinks from the complete tree down to a single-node tree. For a fixed α there is a subtree that minimizes the loss Cα(T); we call it the optimal subtree and denote it Tα. For each internal node t, equating the loss of keeping the subtree Tt with the loss of collapsing it into the single leaf t gives the critical value g(t) = (C(t) − C(Tt)) / (|Tt| − 1); CART repeatedly prunes the node with the smallest g(t), which is how the subtree sequence above is generated.
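scikit-learn (version 0.22 and later) exposes this procedure as minimal cost-complexity pruning: cost_complexity_pruning_path returns the effective α values of the subtree sequence, and the ccp_alpha parameter fits the pruned tree for a chosen α. A minimal sketch, using a simple held-out split in place of full cross-validation:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=66)

# Effective alphas of the subtree sequence {T0, T1, ..., Tn}
path = DecisionTreeClassifier(random_state=66).cost_complexity_pruning_path(
    X_train, y_train)

# Fit one pruned tree per alpha and keep the one that does best on held-out data
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=66, ccp_alpha=alpha)
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score

print("best alpha:", best_alpha, "test accuracy:", best_score)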
Comparison of two pruning strategies
A post-pruned decision tree usually retains more branches than a pre-pruned one.
The underfitting risk of a post-pruned tree is very small, and its generalization performance is often better than that of a pre-pruned tree.
The training time cost of post-pruning is much higher than that of an unpruned or pre-pruned tree. In practice, post-pruning is the approach worth mastering.
CART decision tree pruning (parameter explanations)
sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)
max_depth: limits the maximum depth of the tree
Each additional layer of a decision tree roughly doubles the amount of training data it needs, so limiting the depth is an effective way to limit overfitting. It is especially useful for high-dimensional, small-sample data; starting with max_depth=3 is recommended.
min_samples_leaf: the minimum number of training samples each child node must contain after a split
After a node is split, each of its children must contain at least min_samples_leaf training samples. The parameter accepts two kinds of values: (1) an integer; (2) a float, interpreted as a fraction of the total sample size, which is convenient when leaf sizes vary greatly. If a candidate split would leave a child with fewer samples than this, the split either does not happen or is made in a way that leaves every child with at least min_samples_leaf samples.
This parameter guarantees a minimum leaf size and, in regression problems, helps avoid low-variance, overfitted leaf nodes. Used together with max_depth it makes regression trees smoother; starting with 5 is recommended, and for classification problems with few classes, 1 is usually the best choice.
min_samples_split: the minimum number of training samples a node must contain to be split
A node is allowed to branch only if it contains at least this many samples; otherwise it is not split.
max_features: the maximum number of features considered when splitting
Features beyond this limit are ignored when searching for a split. However, without knowing how important each feature is to the tree, forcing this parameter may prevent the model from learning enough.
min_impurity_decrease: the minimum impurity decrease (information gain) required for a split
The information gain is the difference between the impurity (information entropy or Gini index) of the parent node and that of its child nodes. The larger the information gain, the more the split contributes to the model; conversely, a very small information gain means the split contributes little. Since every split costs computation, splits whose gain falls below this threshold are abandoned.
These are the parameters commonly used in pruning.
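Putting the parameters above together in one call, here is an illustrative sketch (the values are examples, not recommendations); the example in the next section then focuses on min_samples_leaf alone.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

clf = DecisionTreeClassifier(
    max_depth=3,                 # cap the depth of the tree
    min_samples_split=10,        # a node needs at least 10 samples to be split
    min_samples_leaf=5,          # every leaf must keep at least 5 samples
    max_features=3,              # consider at most 3 features per split
    min_impurity_decrease=0.01,  # ignore splits that barely reduce impurity
    random_state=66)
clf.fit(iris.data, iris.target)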
Example
If no restrictions are set on the decision tree, it grows until the Gini index of every leaf node equals 0, that is, until every leaf is pure. The code below instead sets min_samples_leaf=15 to prune the tree:
from sklearn.datasets import load_iris
from sklearn import tree
from IPython.display import Image
import pydotplus

iris = load_iris()
clf = tree.DecisionTreeClassifier(random_state=66, min_samples_leaf=15)
clf = clf.fit(iris.data, iris.target)

# Export the fitted tree to Graphviz dot format and render it as an image
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
Setting min_samples_leaf=15 restricts the minimum number of samples in a leaf node: any branch that would leave a leaf with fewer than 15 samples is pruned away together with its sibling node.
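To verify this without rendering the image, the snippet below (assuming the clf fitted above) inspects scikit-learn's underlying tree structure, where children_left == -1 marks a leaf node:
import numpy as np

t = clf.tree_
is_leaf = (t.children_left == -1) & (t.children_right == -1)
print("number of leaves:  ", int(np.sum(is_leaf)))
print("smallest leaf size:", int(t.n_node_samples[is_leaf].min()))  # expected >= 15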
After reading the above, have you mastered how to understand decision tree pruning in a visual way? Thank you for reading!