
What is the difference among the ID3, C4.5, and CART decision trees?


This article is about the differences among the ID3, C4.5, and CART decision trees. The editor thinks it is very practical, so it is shared with you as a reference; follow along and have a look.

A decision tree contains a root node, several internal nodes, and several leaf nodes. Each leaf node corresponds to a decision result, and every other node corresponds to an attribute test; the samples contained in a node are divided among its child nodes according to the outcome of that test. The root node contains the complete set of samples, and the path from the root node to each leaf node corresponds to a sequence of decision tests. The purpose of decision tree learning is to produce a decision tree with strong generalization ability, that is, one that can deal with unseen examples.

ID3 decision tree

Information entropy is the most commonly used index for measuring the purity of a sample set. Assume the proportion of class-$k$ samples in the sample set $D$ is $p_k$ ($k = 1, 2, \ldots, |\mathcal{Y}|$). The information entropy is then calculated as follows:

$$\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k$$

The smaller the value of $\mathrm{Ent}(D)$, the higher the purity of the sample set $D$.
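For example, a binary sample set with half positive and half negative samples has $\mathrm{Ent}(D) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1$, the maximum impurity, while a set containing samples of only one class has $\mathrm{Ent}(D) = 0$, the maximum purity.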

With information entropy in hand, when we choose to divide the sample set $D$ by some attribute $a$, we can compute the "information gain" brought by dividing $D$ with attribute $a$:

$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} \, \mathrm{Ent}(D^v)$$

where attribute $a$ has $V$ possible values and $D^v$ is the subset of samples in $D$ that take the $v$-th value on $a$.

Generally speaking, the greater the information gain, the greater the improvement in purity obtained by dividing the sample set $D$ with attribute $a$. We therefore compute the gain of every attribute of the samples and choose the largest one as a node of the decision tree. Put another way, attributes with large information gain tend to sit closer to the root node, because we give priority to the attributes with the strongest discriminating power, that is, the largest information gain. Once an attribute has been used as the basis for a split, it no longer takes part in the selection below. As just noted, the root node represents all the samples; after splitting on an attribute, the samples are divided according to the corresponding attribute values, and each resulting subset uses the remaining attributes to compute information gain and select the next split node. This is how the ID3 decision tree is established.
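To make the procedure concrete, here is a minimal Python sketch of ID3's split selection (this code is not from the original article, and the data and attribute names are hypothetical): compute the information gain of every candidate attribute and split on the largest.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Ent(D) = -sum_k p_k * log2(p_k) over the class proportions in D."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain(D, a) = Ent(D) - sum_v (|D^v| / |D|) * Ent(D^v)."""
    total = len(labels)
    # Partition the labels by the value each row takes on attribute `attr`.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    weighted = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - weighted

# Toy data (hypothetical): each row maps attribute name -> value.
rows = [
    {"texture": "clear",  "color": "green"},
    {"texture": "clear",  "color": "black"},
    {"texture": "blurry", "color": "green"},
    {"texture": "blurry", "color": "black"},
]
labels = ["good", "good", "bad", "bad"]

# ID3 splits on the attribute with the largest information gain.
best = max(["texture", "color"], key=lambda a: information_gain(rows, labels, a))
print(best)  # "texture": it separates the two classes perfectly here
```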

C4.5 decision tree

The C4.5 decision tree was proposed to address a weakness of the ID3 decision tree. When an attribute has a large number of possible values, there may be only one or very few samples under each value. The information gain of such a split is then very high, because the purity of each subset is very high, so the ID3 decision tree considers the attribute very suitable for division. The problem with dividing on a many-valued attribute, however, is that the resulting tree generalizes poorly and cannot effectively predict new samples.

Instead of using information gain directly as the main basis for dividing samples, the C4.5 decision tree puts forward another concept, the gain ratio:

$$\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}, \qquad \mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$

Here $\mathrm{IV}(a)$, the intrinsic value of attribute $a$, grows with the number of values $a$ can take, which penalizes many-valued attributes.

However, the gain ratio in turn prefers attributes with a small number of possible values. The C4.5 decision tree therefore uses a heuristic: it first finds, among the candidate partition attributes, those whose information gain is above average, and then selects from these the attribute with the highest gain ratio.
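This heuristic is easy to express in code. The sketch below is again hypothetical, not the article's own implementation, and it reuses the `information_gain` helper from the ID3 sketch above:

```python
from math import log2

def intrinsic_value(rows, attr):
    """IV(a) = -sum_v (|D^v| / |D|) * log2(|D^v| / |D|)."""
    total = len(rows)
    counts = {}
    for row in rows:
        counts[row[attr]] = counts.get(row[attr], 0) + 1
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain_ratio(rows, labels, attr):
    """Gain_ratio(D, a) = Gain(D, a) / IV(a)."""
    iv = intrinsic_value(rows, attr)
    # A single-valued attribute has IV(a) = 0; it cannot split D, so score it 0.
    return information_gain(rows, labels, attr) / iv if iv > 0 else 0.0

def c45_choose(rows, labels, attrs):
    """Among attributes with above-average gain, pick the highest gain ratio."""
    gains = {a: information_gain(rows, labels, a) for a in attrs}
    average = sum(gains.values()) / len(gains)
    candidates = [a for a in attrs if gains[a] >= average]
    return max(candidates, key=lambda a: gain_ratio(rows, labels, a))
```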

CART decision tree

The full name of the CART decision tree is Classification and Regression Tree; it can be applied to both classification and regression.

For classification, CART divides attributes using the Gini index. The Gini value of a sample set $D$ is

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{|\mathcal{Y}|} p_k^2$$

which reflects the probability that two samples drawn at random from $D$ belong to different classes, so a smaller Gini value means higher purity. The Gini index of an attribute $a$ is then

$$\mathrm{Gini\_index}(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} \, \mathrm{Gini}(D^v)$$

Therefore, among the candidate attributes, the attribute with the lowest Gini index is selected as the optimal partition attribute.
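A matching Python sketch of CART's split selection for classification (hypothetical helper names, not the article's own code):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2: the chance two random samples differ in class."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_index(rows, labels, attr):
    """Gini_index(D, a) = sum_v (|D^v| / |D|) * Gini(D^v)."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    return sum(len(p) / total * gini(p) for p in partitions.values())

# CART splits on the attribute that minimizes the Gini index, e.g.:
# best = min(attrs, key=lambda a: gini_index(rows, labels, a))
```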

Thank you for reading! This is the end of the article on the differences among the ID3, C4.5, and CART decision trees. I hope the content above is helpful and lets you learn a little more. If you think the article is good, share it so more people can see it!
