Python Example Analysis of the Decision Tree Algorithm


This article shares an example analysis of the decision tree algorithm in Python. The editor finds it quite practical and shares it here in the hope that you will get something out of it after reading.

1. Overview

The previous Python learning tutorial introduced some basic concepts of decision trees, including basic knowledge of trees and of information entropy. This time we will use an example to show how a decision tree works and what role information entropy plays in it.

One thing to say first: there are three classical algorithms in the development of decision trees, namely ID3, C4.5 and CART, each later algorithm improving on some shortcoming of the previous one. In this Python learning tutorial we will first talk about the ID3 algorithm, and discuss its shortcomings and improvements afterwards.

2. An example

As we all know, whether or not to stay in bed in the morning is a very profound question that depends on many variables. (I get up every morning wondering whether I can find some reason not to go to work today.) So let's take a look at Xiao Ming's stay-in-bed habits.

Here we randomly sampled Xiao Ming's stay-in-bed records for 14 days of the year. Now we can build a decision tree based on this data.

Looking at the data, three attributes affect the final result: the season, whether the time is past 8 o'clock, and the wind condition.

To build a decision tree, we first need to find its root, and as mentioned in the previous section, we need to calculate the information entropy of each attribute.

Before calculating the information entropy of each attribute, we first need the information entropy of staying in bed with no attribute taken into account, computed from the historical data. From the data we know that out of the 14 days, Xiao Ming stayed in bed on 8 days and got up on 6. Then

P(stay in bed) = 8 / 14

P(not stay in bed) = 6 / 14

and the information entropy is

H(stay in bed) = -(P(stay in bed) * log2(P(stay in bed)) + P(not stay in bed) * log2(P(not stay in bed))) ≈ 0.89

Next, we can calculate the information entropy of each attribute. There are three attributes (the season, whether it is after 8 o'clock in the morning, and the wind condition), and for each of them we need the probability of staying in bed and of not staying in bed under every value, and then the corresponding entropy.

That is a bit of a mouthful, so let's start with the wind-condition attribute and calculate its information entropy. The wind condition takes three values, and we calculate the entropy for each of the three cases separately.

When the wind condition is breeze, there is a 4/5 probability of staying in bed and a 1/5 probability of not staying in bed. Its entropy is

entropy(breeze) = -(P(stay in bed | breeze) * log2(P(stay in bed | breeze)) + P(not stay in bed | breeze) * log2(P(not stay in bed | breeze))) ≈ 0.722

When there is no wind, the same calculation gives entropy(no wind) = 0.811.

When the wind condition is gale, the entropy is entropy(gale) = 0.918.

Finally, the entropy for the wind-condition attribute as a whole is the entropy of each value weighted by the proportion of days on which that value occurs (breeze on 5 days, no wind on 4 days, gale on 3 days):

H(wind condition) = (5/12) * entropy(breeze) + (4/12) * entropy(no wind) + (3/12) * entropy(gale) ≈ 0.801

Remember, at the start, with no attribute chosen, the entropy of staying in bed was H(stay in bed) = 0.89. Now, after splitting on the wind-condition attribute, the entropy has dropped to 0.801. This shows that the information has become clearer and the classification is getting better.
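As a quick check, here is a minimal Python sketch of this weighted-entropy calculation. It assumes log base 2, takes the breeze probabilities (4/5 and 1/5) from the text above, and simply reuses the quoted entropy values for the other two wind conditions:

from math import log2

def entropy(probs):
    # Shannon entropy (base 2) of a discrete probability distribution
    return -sum(p * log2(p) for p in probs if p > 0)

# Breeze: P(stay in bed) = 4/5, P(not stay in bed) = 1/5 (from the text above)
e_breeze = entropy([4/5, 1/5])               # ~0.722
e_no_wind, e_gale = 0.811, 0.918             # values quoted above

# Weight each value's entropy by how often that value occurs (5, 4 and 3 days)
days = {"breeze": 5, "no wind": 4, "gale": 3}
total = sum(days.values())
h_wind = (days["breeze"] * e_breeze + days["no wind"] * e_no_wind
          + days["gale"] * e_gale) / total
print(round(e_breeze, 3), round(h_wind, 3))  # 0.722 0.801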

In the same way, we can calculate the information entropy of the other two attributes:

H(season) = 0.56

H(whether it is past 8 o'clock) = 0.748

From these entropies we can calculate the information gain of each attribute. Yes, that is a new term, information gain, but it is not hard to understand: the information gain is the information entropy of the previous step (here, the entropy of the original stay-in-bed data) minus the information entropy after splitting on the chosen attribute, that is,

information gain g(season) = H(stay in bed) - H(season) = 0.89 - 0.56 = 0.33

In this way, we can calculate the information gain of each attribute and then select the one with the largest information gain as the root node. In this case, the attribute with the greatest information gain is clearly the season.
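In code, picking the root by information gain is just an argmax over the gains. The sketch below plugs in the entropy values quoted above (0.89 for the raw stay-in-bed data and the three conditional entropies); the dictionary keys are merely illustrative labels:

# Entropy values quoted earlier in the article
base_entropy = 0.89
conditional_entropy = {
    "season": 0.56,
    "past 8 o'clock": 0.748,
    "wind condition": 0.801,
}

# Information gain = entropy before the split minus entropy after the split
gains = {attr: round(base_entropy - h, 3) for attr, h in conditional_entropy.items()}
root = max(gains, key=gains.get)

print(gains)  # season: 0.33, past 8 o'clock: 0.142, wind condition: 0.089
print(root)   # 'season' has the largest gain, so it becomes the root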

What happens once the root node is selected? Treat each child node as the root of a new subtree, choose among the remaining attributes, and repeat the steps above.

When every branch has been handled in this way, a complete tree has been built. In this example the season attribute sits at the root, and the branches below it are split on the remaining attributes.
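Putting the whole procedure together, here is a compact, self-contained sketch of the ID3 recursion. It is not the code used for the example above (none was shown); the data is assumed to be a list of dicts with a "label" key, and the attribute names and sample rows are placeholders:

from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, attr):
    # entropy before the split minus the weighted entropy after splitting on attr
    base = entropy([r["label"] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy([r["label"] for r in subset])
    return base - remainder

def build_id3(rows, attributes):
    labels = [r["label"] for r in rows]
    if len(set(labels)) == 1:            # pure node: stop splitting
        return labels[0]
    if not attributes:                   # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, a))
    tree = {best: {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = build_id3(subset, [a for a in attributes if a != best])
    return tree

# Example usage with two made-up days:
rows = [
    {"season": "winter", "past 8": "yes", "wind": "breeze", "label": "stay in bed"},
    {"season": "summer", "past 8": "no",  "wind": "gale",   "label": "get up"},
]
print(build_id3(rows, ["season", "past 8", "wind"]))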

3. Over-fitting and pruning

When building a decision tree, our expectation is to build the shortest decision tree. Why do we need to build the shortest decision tree? This is because we have to avoid over-fitting.

What is over-fitting? Picture a simple classification problem: on the left, a normal classification boundary; on the right, an over-fitted one.

In the real world, our data is usually not perfect; a data set may contain erroneous or odd samples, such as an outlier sitting inside the other class's region. Under normal circumstances we allow a certain amount of error and pursue generality, that is, a model that fits most situations. Over-fitting, by contrast, pursues correctness on the training samples so aggressively that generality is lost.

Pruning, that is, reducing the height of the tree, is how we deal with over-fitting. Think about it: in the over-fitted case the decision tree classifies every training sample exactly according to its attributes, but being this exact leads to the situation described above and the tree loses its generality.

Pruning comes in two forms, pre-pruning and post-pruning. Both are quite easy to understand: one works top-down, the other bottom-up. Let's look at them in turn.

Pre-pruning

You can think of pre-pruning as a top-down method. While building the tree we set a limit, for example a maximum height, and once the tree under construction reaches that limit we stop growing it. That is the basic idea of pre-pruning.
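If you build trees with scikit-learn rather than by hand, pre-pruning corresponds to the growth-limiting parameters of DecisionTreeClassifier, such as max_depth and min_samples_leaf. A minimal sketch with made-up toy data (the real stay-in-bed table is not reproduced here):

from sklearn.tree import DecisionTreeClassifier

# Toy encoded features (season, past 8 o'clock, wind) and labels, purely illustrative
X = [[0, 1, 2], [1, 0, 1], [2, 1, 0], [0, 0, 1], [3, 1, 2], [1, 1, 0]]
y = [1, 1, 0, 0, 1, 0]   # 1 = stay in bed, 0 = get up

# Pre-pruning: stop growing early by capping the height and the leaf size
pre_pruned = DecisionTreeClassifier(
    criterion="entropy",   # information-entropy criterion; "gini" is the default
    max_depth=2,           # stop once the tree reaches this height
    min_samples_leaf=2,    # never create a leaf smaller than this
).fit(X, y)
print(pre_pruned.get_depth(), pre_pruned.get_n_leaves())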

Post-pruning

Post-pruning is a bottom-up method. It first lets the decision tree be built completely, and once construction is finished it starts from the bottom and decides which branches should be cut off.
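In scikit-learn the usual post-pruning mechanism is minimal cost-complexity pruning: the tree is grown in full, candidate pruning strengths are computed from it, and a ccp_alpha value is then chosen (typically by cross-validation). A small sketch on the same kind of toy data as above:

from sklearn.tree import DecisionTreeClassifier

X = [[0, 1, 2], [1, 0, 1], [2, 1, 0], [0, 0, 1], [3, 1, 2], [1, 1, 0]]
y = [1, 1, 0, 0, 1, 0]

# Grow the full tree first, then ask it for its pruning path
full_tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)

# Larger ccp_alpha values prune more aggressively (fewer leaves)
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(criterion="entropy", ccp_alpha=alpha).fit(X, y)
    print(round(float(alpha), 4), pruned.get_n_leaves())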

Note the biggest difference between the two: pre-pruning stops early, while post-pruning lets the tree be built completely first. In terms of performance, pre-pruning is therefore faster, while post-pruning tends to be more accurate.

4. Deficiencies and improvements of the ID3 decision tree algorithm

The shortcoming of the ID3 decision tree

Although building a decision tree with the ID3 algorithm is relatively simple, the algorithm has a problem: the trees ID3 builds favor attributes with many values. Why does this happen? Take the example above again and suppose we add a date attribute. There are 365 days in a year, and if we really used the date as the basis for splitting, whether Xiao Ming stays in bed on each day would be completely determined, because each day's branch contains only a tiny sample and is obvious at a glance. The information gain would therefore be very large, but we would run straight into the over-fitting described above. Could such a tree generalize to other situations? Obviously not!
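You can see this bias numerically: an attribute that takes a different value on every day (like the date) makes every branch pure, so the conditional entropy drops to zero and the information gain equals the full entropy of the label column, the largest gain any attribute could possibly achieve. A small sketch, using the 8/6 label split quoted above:

from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

labels = ["stay in bed"] * 8 + ["get up"] * 6   # the 8 / 6 split from the example
rows = [{"date": day, "label": lab} for day, lab in enumerate(labels)]

# Splitting on "date" puts each sample in its own branch, and a one-sample
# branch is always pure, so the weighted entropy after the split is 0
remainder = sum((1 / len(rows)) * entropy([r["label"]]) for r in rows)
gain_on_date = entropy(labels) - remainder
print(gain_on_date == entropy(labels))  # True: the date attribute gets the maximal gain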

C4.5 decision tree

To solve this problem with the ID3 decision tree, another algorithm, C4.5, was proposed for constructing decision trees.

The C4.5 decision tree introduces a new concept: instead of selecting the splitting attribute by information gain, it selects by the gain ratio, which divides the information gain by a quantity IV(a), the intrinsic value of attribute a. The more possible values the attribute has (for example, 365 possible values for a date attribute), the larger IV(a) becomes.

The larger IV(a) is, the smaller the gain ratio, which creates a new problem: the C4.5 decision tree, in contrast to ID3, favors attributes with fewer possible values. That is troublesome too, so is there a fairer, more objective decision tree algorithm? Yes!
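The text above does not reproduce the formula, but the standard C4.5 definition is gain ratio = information gain / IV(a), where IV(a) is the entropy of the attribute's own value distribution. A small sketch using that standard definition; the day counts are the ones used in the wind example:

from math import log2

def intrinsic_value(value_counts):
    # IV(a): entropy of how the samples are spread over attribute a's values
    total = sum(value_counts)
    return -sum((c / total) * log2(c / total) for c in value_counts if c)

def gain_ratio(info_gain, value_counts):
    return info_gain / intrinsic_value(value_counts)

# Wind condition spreads 12 days over 3 values (5, 4, 3 days); a date-like
# attribute spreads them over 12 values, so its IV is much larger and its
# gain ratio is penalized accordingly.
print(round(intrinsic_value([5, 4, 3]), 3))   # ~1.555
print(round(intrinsic_value([1] * 12), 3))    # ~3.585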

CART decision tree

As mentioned above, the ID3 decision tree selects attributes by information gain and C4.5 by gain ratio, but each has its own defect, so CART was eventually proposed. The decision tree algorithm used in sklearn today is CART. The CART decision tree uses yet another criterion for attribute selection: the Gini coefficient.

In the Gini formula, the sum runs over the possible values and the terms are accumulated. Its behavior is similar to that of information entropy: when the distribution is more uniform, that is, when the information is more blurred, the Gini coefficient is larger, and vice versa. However, CART only ever evaluates binary splits. For example, the season attribute in the example above has four possible values (spring, summer, autumn and winter), so spring would be evaluated as the dichotomy (spring) versus (summer, autumn, winter). The remaining steps are similar to ID3. Splitting in this way avoids the defect of the ID3 decision tree.
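As a reference, here is a sketch of the Gini impurity and of evaluating one CART-style binary split, assuming the usual definition Gini(D) = 1 - sum of squared class proportions; the season dichotomy mentioned above is used as the example split, with placeholder rows standing in for the real table:

from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_of_binary_split(rows, attr, left_values):
    # CART-style dichotomy: rows whose attr value is in left_values vs. the rest
    left = [r for r in rows if r[attr] in left_values]
    right = [r for r in rows if r[attr] not in left_values]
    if not left or not right:
        return gini([r["label"] for r in rows])
    n = len(rows)
    return ((len(left) / n) * gini([r["label"] for r in left])
            + (len(right) / n) * gini([r["label"] for r in right]))

# The (spring) vs (summer, autumn, winter) dichotomy from the text; CART tries
# each such dichotomy and keeps the one with the smallest weighted Gini impurity.
rows = [
    {"season": "spring", "label": "stay in bed"},
    {"season": "summer", "label": "get up"},
    {"season": "autumn", "label": "stay in bed"},
    {"season": "winter", "label": "stay in bed"},
]
print(gini_of_binary_split(rows, "season", {"spring"}))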

So if we use the CART algorithm to train a decision tree, the result will be a little different from the ID3 tree trained above.
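For completeness, here is a minimal scikit-learn sketch (scikit-learn's tree implementation is CART-based, as noted above). The rows are made-up stand-ins for the original table, and OrdinalEncoder is used only because the features need to be numeric:

from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up rows standing in for the stay-in-bed table: season, past 8 o'clock, wind
samples = [
    ["winter", "yes", "breeze"],
    ["winter", "no",  "no wind"],
    ["summer", "yes", "gale"],
    ["spring", "no",  "breeze"],
    ["autumn", "yes", "no wind"],
    ["summer", "no",  "gale"],
]
labels = ["stay in bed", "stay in bed", "get up", "stay in bed", "stay in bed", "get up"]

X = OrdinalEncoder().fit_transform(samples)
cart = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, labels)  # Gini, CART-style
print(export_text(cart, feature_names=["season", "past 8 o'clock", "wind"]))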

The above is the Python example analysis of the decision tree algorithm. The editor believes it covers some knowledge points that we may see or use in our daily work, and hopes you can learn more from this article.
