How to Implement the CART Decision Tree Algorithm with Python

This article explains how to implement the CART decision tree algorithm with Python. The content is straightforward and easy to follow; read on and work through it step by step.

1. A brief introduction to the CART decision tree algorithm

CART (Classification And Regression Trees) is a tree-construction algorithm that can be used for both classification and regression tasks. Compared with ID3 and C4.5, which only handle discrete features and classification tasks, the CART algorithm is far more widely applicable: it works with both discrete and continuous features, and it can handle both classification and regression.
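As an aside that is not part of the original article: the scikit-learn library ships CART-style tree estimators, so this classification/regression flexibility can be tried directly (a minimal sketch, assuming scikit-learn is installed):

# Illustrative use of scikit-learn's CART-based trees (assumes scikit-learn is installed).
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
clf = DecisionTreeClassifier(criterion='gini')  # classification tree split on the Gini index
clf.fit(X, [0, 0, 1, 1])
print(clf.predict([[1, 1]]))    # predicted class label

reg = DecisionTreeRegressor()   # the same tree family used for regression
reg.fit(X, [0.0, 0.5, 1.0, 1.5])
print(reg.predict([[1, 1]]))    # predicted continuous value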

This article only discusses the construction of a basic CART classification decision tree; regression trees and pruning are not covered.

First of all, we have to make the following points clear:

1. The CART algorithm is commonly used for binary splitting: the decision tree generated by CART is a binary tree, while the decision trees generated by the ID3 and C4.5 algorithms are multi-way trees. In terms of operational efficiency, a binary tree model is more efficient than a multi-way tree.

2. The CART algorithm uses the Gini index to select the optimal feature.

2. The Gini coefficient

The Gini coefficient measures the impurity of the model: the smaller the Gini coefficient, the lower the impurity. Note that this runs in the opposite direction of C4.5's information gain ratio, where a larger value is better.

In a classification problem, suppose there are K classes and the probability that a sample point belongs to class k is p_k. The Gini coefficient of the probability distribution is then defined as:

Gini(p) = sum_{k=1..K} p_k * (1 - p_k) = 1 - sum_{k=1..K} p_k^2

If CART is applied to a two-class problem (CART itself is not restricted to two classes), and p is the probability that a sample belongs to the first class, the Gini coefficient of the probability distribution simplifies to:

Gini(p) = 2 * p * (1 - p)

Suppose feature A splits dataset D into two parts, D1 and D2. The Gini coefficient of dataset D under this split on feature A is then:

Gini(D, A) = (|D1| / |D|) * Gini(D1) + (|D2| / |D|) * Gini(D2)
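To make these formulas concrete, here is a small self-contained sketch (illustrative code added for this discussion, with assumed helper names gini and gini_of_split; it is not part of the original article):

# Gini(p) = 1 - sum_k p_k^2 for a list of class labels
def gini(labels):
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Gini(D, A) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2) for a two-way split
def gini_of_split(left_labels, right_labels):
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + (len(right_labels) / n) * gini(right_labels)

print(gini(['yes', 'yes', 'no', 'no']))             # 0.5: a perfectly mixed node
print(gini_of_split(['yes', 'yes'], ['no', 'no']))  # 0.0: a perfectly separating split

A node containing two classes in equal proportion has the maximum binary Gini impurity of 0.5, while a split that separates the classes perfectly has a weighted Gini of 0.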

3. The CART decision tree generation algorithm

Input: training dataset D and the conditions for stopping the computation

Output: CART decision tree

According to the training data set, starting from the root node, recursively perform the following operations on each node to build a binary decision tree:

(1) Calculate the Gini index of each available feature for the dataset, as defined above.

(2) Choose the feature corresponding to the minimum Gini index as the optimal feature, and the corresponding split point as the optimal split point (if several features or split points share the minimum value, any one of them may be chosen).

(3) According to the optimal feature and the optimal split point, generate two child nodes from the current node, and allocate the data in the training dataset to the two child nodes according to the feature values.

(4) Call steps (1), (2) and (3) recursively on the two child nodes until the stopping condition is met.

(5) The CART tree is generated.

Stopping conditions of the algorithm: the number of samples in a node is less than a predetermined threshold, or the Gini index of the sample set is less than a predetermined threshold (the samples essentially belong to the same class; the index is 0 if they all belong to exactly the same class), or the feature set is empty.

Note: the optimal split point is what divides the current samples into two parts (because we want to construct a binary tree). In the discrete case, the optimal split point is one value of the current optimal feature; in the continuous case, it can be a specific numeric value. In practice, we need to traverse all candidate split points to find the optimal one. A rough sketch of how the steps fit together follows.
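The outline below is a hypothetical sketch of the recursion described in steps (1) through (5); it assumes choose_best_feature(dataset) returns a (feature index, split value) pair and split_dataset(dataset, index, value) returns the "== value" and "!= value" parts, as in the implementation shown in the next section. It is not the author's original code.

# Hypothetical sketch of the recursive CART construction described above.
def create_cart_tree(dataset, min_samples=2):
    classes = [example[-1] for example in dataset]
    majority = max(set(classes), key=classes.count)
    # stopping conditions: a single class, too few samples, or at most one feature column left
    if classes.count(classes[0]) == len(classes) or len(dataset) < min_samples or len(dataset[0]) <= 2:
        return majority
    index, value = choose_best_feature(dataset)          # steps (1) and (2)
    left, right = split_dataset(dataset, index, value)   # step (3)
    if not left or not right:   # the split no longer separates the samples
        return majority
    return {'feature': index, 'value': value,
            'left': create_cart_tree(left, min_samples),    # step (4): recurse
            'right': create_cart_tree(right, min_samples)}

The returned dictionary plays the role of a tree node: leaves are class labels, and internal nodes record the splitting feature index and split value.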

4. Python implementation of the CART algorithm

For a binary classification problem, the functions calcGini and choose_best_feature can be simplified as follows:

# calculate the probability p that a sample belongs to the first class (the class of the first sample)
def calcProbabilityEnt(dataset):
    numEntries = len(dataset)
    if numEntries == 0:
        # an empty subset is treated as pure
        return 0.0
    count = 0
    label = dataset[0][len(dataset[0]) - 1]
    for example in dataset:
        if example[-1] == label:
            count += 1
    probabilityEnt = float(count) / numEntries
    return probabilityEnt

def choose_best_feature(dataset):
    # total number of features (the last column is the class label)
    numFeatures = len(dataset[0]) - 1
    # when there is only one feature, it is the best feature
    if numFeatures == 1:
        return 0
    # initialize the optimal Gini coefficient and the optimal feature index
    bestGini = 1
    index_of_best_feature = -1
    for i in range(numFeatures):
        # deduplicate: the unique values taken by this feature
        uniqueVals = set(example[i] for example in dataset)
        # Gini coefficient for each value of the feature
        Gini = {}
        for value in uniqueVals:
            sub_dataset1, sub_dataset2 = split_dataset(dataset, i, value)
            prob1 = len(sub_dataset1) / float(len(dataset))
            prob2 = len(sub_dataset2) / float(len(dataset))
            probabilityEnt1 = calcProbabilityEnt(sub_dataset1)
            probabilityEnt2 = calcProbabilityEnt(sub_dataset2)
            # weighted sum of the binary Gini impurities 2p(1-p) of the two parts
            Gini[value] = prob1 * 2 * probabilityEnt1 * (1 - probabilityEnt1) \
                        + prob2 * 2 * probabilityEnt2 * (1 - probabilityEnt2)
            if Gini[value] < bestGini:
                bestGini = Gini[value]
                index_of_best_feature = i
                best_split_point = value
    return index_of_best_feature, best_split_point

5. Running result
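The simplified functions above call a split_dataset helper that is not reproduced in this excerpt. As an illustration (an assumed helper and a toy dataset, not the author's original code or output), a minimal version and a sample run could look like this:

# Assumed helper: partition `dataset` into the rows whose feature at `axis`
# equals `value` and the rows whose feature differs from it.
def split_dataset(dataset, axis, value):
    sub_dataset1 = [example for example in dataset if example[axis] == value]
    sub_dataset2 = [example for example in dataset if example[axis] != value]
    return sub_dataset1, sub_dataset2

# Toy dataset: the last column of each row is the class label
dataset = [['sunny', 'hot', 'no'],
           ['sunny', 'mild', 'no'],
           ['rainy', 'mild', 'yes'],
           ['rainy', 'cool', 'yes']]
print(choose_best_feature(dataset))
# -> (0, 'sunny') or (0, 'rainy'): feature 0 separates the classes perfectly,
#    and the reported split value depends on set iteration order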

Thank you for reading. That concludes "How to Implement the CART Decision Tree Algorithm with Python". After studying this article you should have a deeper understanding of the topic, although the details still need to be verified in practice. The editor will continue to publish more articles on related topics; you are welcome to follow!
