

Everything you need to know about decision trees


Author: Marco Peixeiro

Compiled by: ronghuaiyang

An introduction to decision trees, random forests, bagging, boosting, and the principles behind them.

Tree-based methods can be used for regression or classification. They involve dividing the predictor space into a number of simple regions. Since the set of splitting rules can be summarized in a tree, these are called decision tree methods.

A single decision tree usually does not perform as well as linear regression, logistic regression, LDA, and so on. However, by giving up some interpretability and aggregating many trees, prediction accuracy can be greatly improved.

In this article, we will cover everything about decision trees, bagging, random forests, and boosting. It will be a long read, but it will be worth it!

Decision tree basics

Regression tree

Before we discuss the theory, we need some basic terminology.

The tree is drawn upside down. The final regions are called leaves, the points in the tree where the predictor space is split are internal nodes, and the segments that connect the nodes are branches.

Schematic diagram of regression tree

Create a regression tree:

The predictor space is divided into J distinct, non-overlapping regions. For every sample that falls in a region, the prediction is the mean response value of the training samples in that region.

Each split is chosen by minimizing the RSS (residual sum of squares). To do this, a top-down greedy approach known as recursive binary splitting is used.

Why from the top down?

Because before the first split, all samples belong to a single region.

Why is it greedy?

Because the best split is made at that particular step, rather than looking ahead and choosing a split that would lead to a better tree in a later step.

Mathematically, we define this pair of half-planes as:
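In the standard recursive binary splitting formulation, for a splitting variable X_j and cutpoint s, these are:

R_1(j, s) = \{ X \mid X_j < s \} \quad \text{and} \quad R_2(j, s) = \{ X \mid X_j \geq s \}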

We require j and s to minimize:
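In the same standard formulation, the quantity being minimized is the combined RSS of the two regions:

\sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 \; + \; \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2

where \hat{y}_{R_1} and \hat{y}_{R_2} are the mean responses of the training samples in R_1(j, s) and R_2(j, s).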

However, this may lead to overfitting. Pruning the tree yields a smaller subtree, which we can select using cross-validation.
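As a rough illustration of this idea, here is a minimal sketch of cost-complexity pruning selected by cross-validation with scikit-learn; the synthetic dataset is a placeholder, not data from the original article.

from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Toy data standing in for a real regression problem
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# Candidate pruning strengths along the cost-complexity pruning path
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

best_alpha, best_score = None, -float("inf")
for alpha in path.ccp_alphas:
    # Larger alpha -> more pruning -> smaller subtree
    tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0)
    score = cross_val_score(tree, X, y, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

print("alpha chosen by cross-validation:", best_alpha)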

Schematic diagram of an unpruned tree

Classification tree

A classification tree is very similar to a regression tree. However, we cannot use the mean of the responses, so we instead predict the most commonly occurring class in a region. Of course, RSS cannot be used as the criterion either. Instead, each split is made to minimize the classification error rate.

The classification error rate is the proportion of training samples that do not belong to the most common category in the region.

Classification error rate formula
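In the usual notation, with \hat{p}_{mk} denoting the proportion of training samples in region m that belong to class k, the classification error rate is:

E = 1 - \max_k (\hat{p}_{mk})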

Unfortunately, this measure is not sensitive enough for growing trees. In practice, two other measures are used.

Gini index:

Gini index
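With the same notation, the Gini index is usually written as:

G = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})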

This is a measure of the total variance across the K classes. As you can see, if the class proportions are close to 0 or 1, the Gini index is small, so it is a good indicator of node purity.

A similar principle applies to another measure called cross-entropy:

Cross entropy
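Again with the same notation, cross-entropy is usually written as:

D = - \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}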

Now that we know how the basic decision tree works, let's see how to improve its performance!

Bagging, Random Forest and Boosting

Bagging

We have previously seen that the bootstrap can be used to estimate the standard deviation of a quantity. Decision trees suffer from high variance. Therefore, bootstrap aggregation, or bagging, can be used to reduce the variance and improve the performance of decision trees.

Bagging involves repeatedly sampling, with replacement, from the training data. This generates B different bootstrapped training sets. We then train a tree on each bootstrapped training set, obtain each tree's prediction, and average the predictions.

Mathematically, the averaged prediction is:
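In the usual bagging notation, with \hat{f}^{*b}(x) denoting the prediction of the tree trained on the b-th bootstrapped training set:

\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)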

Applied to decision trees, this means we can grow a large number of trees with high variance and low bias. Averaging their predictions then reduces the variance and improves the performance of the ensemble.
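As an illustration, here is a minimal bagging sketch with scikit-learn; the synthetic data and the choice of B = 100 trees are placeholders rather than values from the article.

from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# B = 100 bootstrapped trees; the default base estimator is a decision tree,
# and the B predictions are averaged automatically
bagged_trees = BaggingRegressor(n_estimators=100, bootstrap=True, random_state=0)
single_tree = DecisionTreeRegressor(random_state=0)

print("single tree R^2: ", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees R^2:", cross_val_score(bagged_trees, X, y, cv=5).mean())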

Random forest

Random forests provide an improvement over bagged trees by decorrelating the trees.

Just like bagging, multiple decision trees are built. However, at each split, a random sample of m predictors is chosen from the full set of p predictors, and the split is allowed to use only one of those m predictors.

In other words, the algorithm is not allowed to consider most of the available predictors at each split!

Why?

Suppose there is one very strong predictor in the dataset, along with other moderately strong predictors. Then, in the collection of bagged trees, most or all of the trees will use this strong predictor for the top split. As a result, all the bagged trees look very similar, and averaging their predictions does not reduce the variance much, because the predictions are highly correlated.

Random forests overcome this problem by forcing each split to consider only a subset of the predictors, effectively decorrelating the trees.

Of course, if m equals p, this is just bagging. Typically, taking m to be the square root of p gives the best results, as shown below.

Classification error as a function of the number of trees. Each line corresponds to the number of predictors available at each split.
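A minimal sketch of this bagging-versus-random-forest comparison with scikit-learn follows; the synthetic dataset and the 100 trees are placeholder choices.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

# max_features=None: all p predictors are considered at each split (plain bagging)
bagging = RandomForestClassifier(n_estimators=100, max_features=None, random_state=0)

# max_features="sqrt": only sqrt(p) predictors are considered at each split (random forest)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

print("bagging accuracy:      ", cross_val_score(bagging, X, y, cv=5).mean())
print("random forest accuracy:", cross_val_score(forest, X, y, cv=5).mean())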

Boosting

Boosting works in a similar way to bagging, except that the trees are grown sequentially: each tree is grown using information from previously grown trees.

This means the algorithm learns slowly. Each tree is fit to the residuals of the current model rather than to the target variable. Each tree is kept small, and it slowly improves the model in the areas where it does not yet perform well.

Boosting has three tuning parameters:

1. Number of trees (B): unlike bagging and random forests, boosting can overfit if B is too large. Use cross-validation to select the right number of trees.

2. Shrinkage parameter (α): a small positive number that controls the learning rate of boosting. It is typically set to 0.01 or 0.001.

3. Number of splits per tree (d): it controls the complexity of the boosted ensemble. In general, a single split (d = 1) works well. It is also called the interaction depth.
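A minimal sketch of these three parameters, using scikit-learn's gradient boosting; the synthetic dataset and the candidate values of B are placeholders, not values from the article.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

# learning_rate plays the role of the shrinkage parameter,
# max_depth=1 grows stumps (interaction depth d = 1),
# and cross-validation is used to pick the number of trees B.
for n_trees in (100, 500, 1000):
    boosted = GradientBoostingClassifier(
        n_estimators=n_trees,
        learning_rate=0.01,
        max_depth=1,
        random_state=0,
    )
    score = cross_val_score(boosted, X, y, cv=5).mean()
    print("B =", n_trees, "-> accuracy =", round(score, 3))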

Classification error as a function of the number of trees. Each line represents a different interaction depth.

As you can see above, an interaction depth of 1 seems to provide the best results.

Now you know everything about decision trees, bagging, random forests, and boosting. Decision trees can be used for both regression and classification, and they perform very well when combined with bagging and boosting.

In a future article, I'll show you how to use Python to implement a decision tree on a real-world problem!

Original English text:

https://towardsdatascience.com/everything-you-need-to-know-about-decision-trees-8fcd68ecaa71

https://www.toutiao.com/a6727985346639823363/
