This article explains how to implement the decision tree algorithm in Python and what its advantages and disadvantages are, organized into simple, easy-to-follow steps.
1. Overview of algorithms
A decision tree is a decision-analysis method that, given the known probabilities of various outcomes, computes the probability that the expected net present value is greater than or equal to zero, evaluates project risk, and judges feasibility.
A classification algorithm uses a training sample set to obtain a classification function, i.e., a classification model (classifier), which assigns the samples in a data set to classes. The classification model learns the latent relationship between the attribute set and the class label in the training samples, and then predicts which class a new sample belongs to.
The decision tree algorithm is a graphical method that applies probability analysis intuitively. It is a very commonly used classification method and belongs to supervised learning.
The decision tree is a tree structure in which each internal node represents a test on an attribute, each branch represents a test output, and each leaf node represents a category.
Decision tree learning is instance-based inductive learning that uses a top-down recursive approach. Its basic idea is to take information entropy as the measure and build a tree along which entropy falls fastest, until the entropy at each leaf node is zero, at which point the instances in every leaf node belong to the same class.
The biggest advantage of decision tree learning is that it is self-learning: during learning the user does not need much background knowledge, only well-labeled training examples.
2. Types of algorithms
ID3 algorithm
In the ID3 algorithm, features are evaluated and selected according to information gain from information theory; at each step the candidate feature with the maximum information gain is chosen as the decision node.
Information gain is biased toward attributes with many distinct values: the more values an attribute takes, the more likely it is to be chosen as the split attribute.
ID3 also cannot handle continuously valued data.
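To make the selection criterion concrete, the following is a minimal sketch of computing information gain for a candidate attribute. The helper names and the toy data are illustrative only, not code from this article.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    # gain obtained by splitting `labels` according to the parallel attribute column `values`
    n = len(labels)
    remainder = sum(
        (count / n) * entropy([l for v, l in zip(values, labels) if v == val])
        for val, count in Counter(values).items()
    )
    return entropy(labels) - remainder

# toy example: a "windy" attribute against a play / no-play label
windy = ['yes', 'yes', 'no', 'no', 'no', 'yes']
play  = ['no',  'no',  'yes', 'yes', 'yes', 'no']
print(information_gain(windy, play))  # 1.0: this attribute separates the classes perfectly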
C4.5 algorithm
The C4.5 algorithm uses the information gain ratio instead of information gain for feature selection, which overcomes the bias of information gain toward attributes with many values.
The specific steps of C4.5 algorithm are similar to those of ID3.
C4.5 can discretize continuous attributes and deal with incomplete data.
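As a rough, self-contained illustration of the gain ratio idea (again with invented helper names and toy data, not code from this article), one possible computation looks like this:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    # C4.5 criterion: information gain divided by the split information of the attribute
    n = len(labels)
    groups = Counter(values)
    remainder = sum((c / n) * entropy([l for v, l in zip(values, labels) if v == val])
                    for val, c in groups.items())
    gain = entropy(labels) - remainder
    split_info = -sum((c / n) * math.log2(c / n) for c in groups.values())
    return gain / split_info if split_info > 0 else 0.0

play = ['no', 'no', 'yes', 'yes', 'yes', 'no']
print(gain_ratio(['y', 'y', 'n', 'n', 'n', 'y'], play))  # 1.0
print(gain_ratio(['a', 'b', 'c', 'd', 'e', 'f'], play))  # ~0.39: a many-valued attribute is penalized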
C5.0 algorithm
The C5.0 algorithm is a commercial, improved version of C4.5 proposed by Quinlan, designed to analyze data sets containing large amounts of data.
Compared with the C4.5 algorithm, C5.0 has the following advantages:
The decision tree is constructed several times faster than with C4.5, and the resulting tree is smaller, with fewer leaf nodes.
Boosting is used to combine multiple decision trees for classification, which greatly improves accuracy.
It provides options that the user can set case by case, such as whether to take sample weights or the cost of misclassification into account.
CART algorithm
The generation of a CART decision tree is a process of recursively constructing a binary decision tree.
CART uses the Gini index minimization criterion to select features and generates a binary tree.
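As a minimal sketch of the impurity measure CART minimizes (with illustrative names and data, not code from this article):

from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_of_split(left_labels, right_labels):
    # weighted Gini impurity of a binary (CART-style) split
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + (len(right_labels) / n) * gini(right_labels)

labels = ['yes', 'yes', 'yes', 'no', 'no', 'no']
print(gini(labels))                           # 0.5 for a perfectly mixed node
print(gini_of_split(labels[:3], labels[3:]))  # 0.0 for a perfect binary split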
3. Algorithm example
In machine learning, a decision tree is a prediction model that represents a mapping between object attributes and object values.
The purpose of the decision tree is to fit a model that can predict the output value for a given input.
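As a quick illustration of fitting and using such a model in practice, here is a short example based on scikit-learn; scikit-learn is not used anywhere else in this article and is assumed to be installed, so treat this as an optional sketch rather than part of the implementation below.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# fit a decision tree on the iris data set and check its accuracy on held-out data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # typically around 0.97 on this split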
4. Implementation steps of the algorithm
The selection of attributes is a crucial step in building a decision tree. Each selected attribute becomes a node of the decision tree, and by recursively selecting the optimal attribute the whole tree can be constructed.
1. Calculate the entropy H(xi) of each attribute xi in the data set S.
2. Select the attribute in S with the lowest conditional entropy (equivalently, the maximum information gain) and generate the corresponding node of the decision tree.
3. Repeat the above steps with the remaining attributes to generate the remaining nodes of the decision tree.
5. Algorithm related concepts
Entropy
In 1948, Shannon introduced the concept of "information entropy". Entropy is the average amount of information contained in each received message; it is a measure of uncertainty rather than certainty, because the more random the source, the greater its entropy. Formally, entropy is the expected value of the negative logarithm of the probability: H(X) = -Σ p(xi) * log2 p(xi).
Information gain
Information gain measures how well an attribute distinguishes the data samples: the greater an attribute's information gain, the better it is suited to act as a decision node such as the root. The decision tree therefore chooses the split that maximizes information gain. Concretely, Gain(S, A) = H(S) - Σ (|Sv| / |S|) * H(Sv), where the sum runs over the values v of attribute A and Sv is the subset of S on which A takes value v.
7. Algorithm implementation code

import math
from collections import Counter, defaultdict

import numpy as np


# create synthetic data
def create_data():
    X1 = np.random.rand(50, 1) * 100
    X2 = np.random.rand(50, 1) * 100
    X3 = np.random.rand(50, 1) * 100

    def f(x):
        return 2 if x > 70 else 1 if x > 40 else 0

    y = X1 + X2 + X3
    Y = y > 150
    Y = Y + 0
    X1 = list(map(f, X1))
    X2 = list(map(f, X2))
    X3 = list(map(f, X3))
    x = np.c_[X1, X2, X3, Y]
    return x, ['courseA', 'courseB', 'courseC']


# information entropy function
def calculate_info_entropy(dataset):
    dataset = np.array(dataset)  # accept lists of rows as well as arrays
    n = len(dataset)
    # use Counter to count the occurrences of each label
    labels = Counter(dataset[:, -1])
    entropy = 0.0
    # apply the information entropy formula
    for k, v in labels.items():
        prob = v / n
        entropy -= prob * math.log(prob, 2)
    return entropy


# implement the split function
def split_dataset(dataset, idx):
    # idx is the feature index
    splitData = defaultdict(list)
    for data in dataset:
        # the value at idx is removed because it is no longer needed after the split
        splitData[data[idx]].append(np.delete(data, idx))
    return list(splitData.values()), list(splitData.keys())


# implement the feature selection function
def choose_feature_to_split(dataset):
    n = len(dataset[0]) - 1
    m = len(dataset)
    # information entropy of the whole data set
    entropy = calculate_info_entropy(dataset)
    bestGain = 0.0
    feature = -1
    for i in range(n):
        # split according to feature i
        split_data, _ = split_dataset(dataset, i)
        new_entropy = 0.0
        # calculate the conditional information entropy
        for data in split_data:
            prob = len(data) / m
            new_entropy += prob * calculate_info_entropy(data)
        # get the information gain
        gain = entropy - new_entropy
        if gain > bestGain:
            bestGain = gain
            feature = i
    return feature


# decision tree creation function
def create_decision_tree(dataset, feature_names):
    dataset = np.array(dataset)
    counter = Counter(dataset[:, -1])
    # if only one class is left in the data set, return it directly
    if len(counter) == 1:
        return dataset[0, -1]
    # if all features have already been used for splitting, also return directly
    if len(dataset[0]) == 1:
        return counter.most_common(1)[0][0]
    # find the best feature to split on
    fidx = choose_feature_to_split(dataset)
    fname = feature_names[fidx]
    node = {fname: {}}
    feature_names.remove(fname)
    # recursive call: build a subtree for each split value
    split_data, vals = split_dataset(dataset, fidx)
    for data, val in zip(split_data, vals):
        node[fname][val] = create_decision_tree(data, feature_names[:])
    return node


# decision tree prediction function
def classify(node, feature_names, data):
    # get the feature tested by the current node
    key = list(node.keys())[0]
    node = node[key]
    idx = feature_names.index(key)
    pred = None
    for key in node:
        # find the matching branch
        if data[idx] == key:
            # if there is still a subtree below, recurse; otherwise take the result
            if isinstance(node[key], dict):
                pred = classify(node[key], feature_names, data)
            else:
                pred = node[key]
    # if there is no matching branch, take any leaf and return it
    if pred is None:
        for key in node:
            if not isinstance(node[key], dict):
                pred = node[key]
                break
    return pred
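Assuming the functions above are defined in the same script, a minimal sketch of how they might be exercised (not part of the original article) is:

# build a tree from the synthetic data and classify one sample
dataset, feature_names = create_data()
tree = create_decision_tree(dataset, feature_names[:])  # pass a copy; the function mutates the list
print(tree)

sample = dataset[0]
# the last column of each row is the true label
print(classify(tree, feature_names, sample[:-1]), 'expected:', sample[-1])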
8. Advantages and disadvantages of the algorithm
Advantages: effective on small data sets.
Disadvantages:
Handles continuous variables poorly.
When there are many classes, the error rate increases faster.
Cannot handle large amounts of data.
9. Algorithm optimization
The decision tree algorithm is a classic algorithm. It relies mainly on the entropy and information gain of the data as the basis for classification, and its classification performance is generally good. However, decision trees are usually trained on data sets with a small amount of data; when the data used to train the classifier becomes large enough, problems appear, such as trees that grow too deep and fit poorly. Therefore, how to construct decision trees efficiently and accurately has become a research focus in the field of pattern recognition. Possible directions include:
Train the decision tree iteratively through incremental training.
Train multiple decision trees by combining the Bagging and Boosting techniques (see the sketch after this list).
For data sets with small fluctuations and low variance, explore a relatively stable splitting criterion as a solution.
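As a hedged illustration of the Bagging and Boosting suggestion above, the sketch below uses scikit-learn (which this article's own code does not use and which is assumed to be installed); it simply compares bagged and boosted trees on the iris data:

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Bagging: train many trees on bootstrap samples and vote
bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=50, random_state=0)
# Boosting: train trees sequentially, reweighting misclassified samples
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=50, random_state=0)

for name, model in [('bagging', bagging), ('boosting', boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())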
This concludes the study of how to implement the Python decision tree algorithm and what its advantages and disadvantages are. Combining theory with practice helps learning, so give the code above a try.