Introduction
CART, like C4.5, is a decision tree algorithm; ID3 is another common one. The three differ in the criterion used to split on features:
ID3: splits on features based on information gain
C4.5: splits on features based on the information gain ratio
CART: splits on features based on the Gini index
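For reference, the Gini index has the standard definition below. This formula is my addition, following the usual CART formulation, and is not text from the original article:

```latex
\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2,
\qquad
\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)
```

Here p_k is the proportion of samples in class k, and D1 and D2 are the two subsets produced by a binary test on feature A; CART chooses the split that minimizes Gini(D, A).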
Basic idea
CART assumes the decision tree is binary: each internal node tests whether a feature takes a given value, with the left branch taken when the answer is "yes" and the right branch when it is "no". Such a tree recursively bisects each feature, partitioning the input space (the feature space) into a finite number of cells and defining a predicted probability distribution on each cell, i.e., the conditional probability distribution of the output given the input.
The CART algorithm consists of the following two steps:
Decision tree generation: grow a tree from the training data set, making it as large as possible.
Decision tree pruning: prune the generated tree against a validation data set and select the optimal subtree, using minimization of the loss function as the pruning criterion (see the cost function below).
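The loss function typically used for CART pruning is the cost-complexity criterion; the statement below is my addition, following the standard formulation rather than this article's text:

```latex
C_\alpha(T) = C(T) + \alpha \lvert T \rvert
```

Here C(T) is the prediction error of tree T on the data (for example, the total Gini impurity over the leaves), |T| is the number of leaf nodes, and the parameter α ≥ 0 trades goodness of fit against tree size; pruning selects the subtree that minimizes C_α(T).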
Generating a CART decision tree is a process of recursively constructing a binary tree. CART can be used for both classification and regression; this article discusses only the classification case. For a classification tree, CART selects features according to the Gini index minimization criterion to generate the binary tree. The generation algorithm is as follows:
Input: training data set D and the conditions for stopping the computation.
Output: CART decision tree.
Starting from the root node, recursively perform the following operations on each node to build the binary decision tree:
Let the training data set at the node be D, and compute the Gini index of the data set for each existing feature. For each feature A and each of its possible values a, split D into two parts D1 and D2 according to whether a sample answers "yes" or "no" to the test A = a, and compute the Gini index of the split.
Among all possible features A and all their possible split points a, select the feature with the smallest Gini index and its corresponding split point as the optimal feature and optimal split point. Using them, generate two child nodes from the current node and distribute the training data to the two child nodes according to the feature test.
Recursively apply the two steps above to the two child nodes until the stopping condition is met.
Generate the CART decision tree.
The algorithm stops when the number of samples at a node falls below a predetermined threshold, when the Gini index of the sample set falls below a predetermined threshold (i.e., the samples essentially all belong to one class), or when no features remain. A sketch of the core split-selection step is given below.
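The following is a minimal sketch of the split-selection steps above, assuming categorical features tested with equality. It is my own illustration, not the author's posted code, and all names are hypothetical:

```python
import numpy as np

def gini(labels):
    """Gini index of a label array: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature, value, gini) of the binary split with the smallest
    weighted Gini index, using 'feature == value' as the yes/no test."""
    best = (None, None, np.inf)
    n = len(y)
    for j in range(X.shape[1]):                  # each feature A
        for v in np.unique(X[:, j]):             # each possible value a
            mask = X[:, j] == v                  # "yes" branch: samples with A = a
            d1, d2 = y[mask], y[~mask]
            if len(d1) == 0 or len(d2) == 0:
                continue
            # weighted Gini index of the split: |D1|/|D|*Gini(D1) + |D2|/|D|*Gini(D2)
            g = len(d1) / n * gini(d1) + len(d2) / n * gini(d2)
            if g < best[2]:
                best = (j, v, g)
    return best
```

Recursing on the two subsets induced by the chosen test, until one of the stopping conditions holds, yields the full binary tree.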
Code
The code has been implemented on GitHub (calling sklearn) and is also posted here.
The test data set is the MNIST data set; the download is train.csv.
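Since the repository code is not reproduced here, below is a hedged sketch of what calling sklearn on this data likely looks like. The column layout (a leading "label" column followed by pixel columns, as in the Kaggle digit-recognizer version of MNIST) is an assumption, not verified against the author's code:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# train.csv: assumed layout is a 'label' column followed by pixel columns
data = pd.read_csv("train.csv")
X = data.drop(columns="label").values
y = data["label"].values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# criterion="gini" makes sklearn's tree split on the Gini index, as CART does
clf = DecisionTreeClassifier(criterion="gini")
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```

With criterion="gini", sklearn's DecisionTreeClassifier grows a binary tree by Gini index minimization, matching the generation algorithm described above.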
Running result