This article mainly introduces how to implement decision tree modeling with Python. Many people have doubts about this topic in daily work, so the editor has consulted various materials and put together a simple, easy-to-follow walkthrough. Hopefully it will help answer your questions about decision tree modeling with Python; please follow along and study it with the editor.
A decision tree is a simple machine learning method. After training, the decision tree looks like a series of if-then statements arranged in the form of a tree. Once we have a decision tree, we only have to walk down its paths, answering each question as we go, to arrive at an answer. Tracing back up from the final leaf node then gives the reasoning behind the classification result.
Decision tree:
class decisionnode:
    def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
        self.col = col          # index of the column tested at this node
        self.value = value      # value the column must match for the result to be true
        self.results = results  # dict of outcome counts (None except at leaf nodes)
        self.tb = tb            # child node followed when the test is true
        self.fb = fb            # child node followed when the test is false
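To make the fields concrete, a tiny tree can be assembled by hand; the column index, split value, and leaf counts below are made up purely for illustration:
# Hypothetical hand-built tree: test column 2 of an observation against 'yes'.
# Each leaf stores a dict of outcome counts in results; the inner node stores none.
leaf_true = decisionnode(results={'Premium': 3})
leaf_false = decisionnode(results={'None': 5})
root = decisionnode(col=2, value='yes', tb=leaf_true, fb=leaf_false)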
The following uses the classification and regression tree (CART) algorithm. To construct the decision tree, the algorithm first creates a root node, then evaluates all the observed variables in the table and selects the most appropriate one to split the data on. To choose that variable, we need a way to measure how mixed the outcomes in a data set are. There are several measures of this disorder to choose from:
Gini impurity: the expected error rate if one of the results from a set were randomly applied to one of the items in the set.
Here is the Python implementation:
def uniquecounts(rows):
    # Count how many times each result (the last column) appears in the rows
    results = {}
    for row in rows:
        # The result is the last column
        r = row[len(row)-1]
        if r not in results: results[r] = 0
        results[r] += 1
    return results

def giniimpurity(rows):
    # Probability that a randomly placed item ends up in the wrong category
    total = len(rows)
    counts = uniquecounts(rows)
    imp = 0
    for k1 in counts:
        p1 = float(counts[k1]) / total
        for k2 in counts:
            if k1 == k2: continue
            p2 = float(counts[k2]) / total
            imp += p1 * p2
    return imp  # equivalently: 1 minus the sum of p**2 over all classes
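As a quick check, both helpers can be run on a tiny made-up dataset in the row format used throughout (feature columns first, class label last); the rows below are purely illustrative:
# Hypothetical rows: two feature columns followed by a class label.
toy_rows = [
    ['sunny', 85, 'no'],
    ['sunny', 80, 'no'],
    ['rainy', 70, 'yes'],
    ['rainy', 65, 'yes'],
    ['cloudy', 75, 'yes'],
]
print(uniquecounts(toy_rows))   # {'no': 2, 'yes': 3}
print(giniimpurity(toy_rows))   # 0.48 for this 2-vs-3 split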
Entropy, a second measure of disorder, is implemented for a set of rows as follows:
def entropy(rows):
    # Entropy is the sum of -p(x)*log2(p(x)) over all the different results
    from math import log
    log2 = lambda x: log(x) / log(2)
    results = uniquecounts(rows)
    # Now calculate the entropy
    ent = 0.0
    for r in results.keys():
        p = float(results[r]) / len(rows)
        ent = ent - p * log2(p)
    return ent
The main difference between entropy and Gini impurity is that entropy reaches its peak more slowly; as a result, entropy penalizes mixed sets more heavily.
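The difference is easy to see numerically. The two class distributions below are made up for illustration; each row holds only its class label, which is all that the scoring functions look at:
# Compare the two measures on a perfectly mixed set and a mostly pure set.
mixed = [['a']] * 5 + [['b']] * 5    # 50/50 split of two classes
skewed = [['a']] * 9 + [['b']] * 1   # 90/10 split
print(giniimpurity(mixed), entropy(mixed))     # 0.5  1.0
print(giniimpurity(skewed), entropy(skewed))   # 0.18 ~0.47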
Our algorithm first calculates the entropy of the whole group. It then tries splitting the group on each possible value of every attribute and calculates the entropy of the two new groups, from which it derives the information gain: the difference between the current entropy and the weighted-average entropy of the two new groups. The algorithm computes the information gain for every attribute and selects the attribute with the highest gain. By computing the best split attribute for each newly created node, the branching and tree-building process continues; when splitting a node no longer yields an information gain greater than 0, that branch stops splitting. The buildtree function below implements this process:
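buildtree relies on a helper, divideset(rows, col, value), to perform each split. A minimal sketch of that helper, assuming numeric columns are split with a >= test and all other columns by an equality test (the same convention classify uses later):
def divideset(rows, col, value):
    # Build a function that decides whether a row goes into the first group
    if isinstance(value, int) or isinstance(value, float):
        split_function = lambda row: row[col] >= value
    else:
        split_function = lambda row: row[col] == value
    # Divide the rows into two sets and return them
    set1 = [row for row in rows if split_function(row)]
    set2 = [row for row in rows if not split_function(row)]
    return (set1, set2)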
def buildtree(rows, scoref=entropy):
    if len(rows) == 0: return decisionnode()
    current_score = scoref(rows)
    # Set up some variables to track the best criteria
    best_gain = 0.0
    best_criteria = None
    best_sets = None
    column_count = len(rows[0]) - 1
    for col in range(0, column_count):
        # Generate the list of different values in this column
        column_values = {}
        for row in rows:
            column_values[row[col]] = 1
        # Now try dividing the rows up for each value in this column
        for value in column_values.keys():
            (set1, set2) = divideset(rows, col, value)
            # Information gain
            p = float(len(set1)) / len(rows)
            gain = current_score - p * scoref(set1) - (1 - p) * scoref(set2)
            if gain > best_gain and len(set1) > 0 and len(set2) > 0:
                best_gain = gain
                best_criteria = (col, value)
                best_sets = (set1, set2)
    # Create the sub branches
    if best_gain > 0:
        trueBranch = buildtree(best_sets[0], scoref)
        falseBranch = buildtree(best_sets[1], scoref)
        return decisionnode(col=best_criteria[0], value=best_criteria[1],
                            tb=trueBranch, fb=falseBranch)
    else:
        return decisionnode(results=uniquecounts(rows))
The function takes a list of data rows as its argument. It loops over every column in the dataset, tries every possible value in each column, and divides the dataset into two new subsets. By multiplying the entropy of each subset by the fraction of items from the original set that it contains, the function computes the weighted-average entropy of each pair of newly created subsets and keeps track of the pair with the lowest value, which corresponds to the highest information gain. If the best weighted-average entropy found is not lower than the entropy of the current set, splitting ends and the counts of the possible outcomes are stored in the node. Otherwise the algorithm calls buildtree on each of the new subsets and attaches the results to the tree, appending the call results to the True branch and the False branch of the node respectively; in this way the whole tree is built.
We can print it out:
def printtree(tree, indent=''):
    # Is this a leaf node?
    if tree.results != None:
        print(str(tree.results))
    else:
        # Print the criteria
        print(str(tree.col) + ':' + str(tree.value) + '? ')
        # Print the branches
        print(indent + 'T->')
        printtree(tree.tb, indent + '  ')
        print(indent + 'F->')
        printtree(tree.fb, indent + '  ')
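Putting the pieces together on a small made-up dataset (the rows and column meanings are purely illustrative):
# Hypothetical training data: referrer, country, read FAQ, pages viewed, outcome.
my_data = [
    ['slashdot', 'USA', 'yes', 18, 'None'],
    ['google', 'France', 'yes', 23, 'Premium'],
    ['digg', 'USA', 'yes', 24, 'Basic'],
    ['kiwitobes', 'France', 'yes', 23, 'Basic'],
    ['google', 'UK', 'no', 21, 'Premium'],
    ['(direct)', 'New Zealand', 'no', 12, 'None'],
]
tree = buildtree(my_data)
printtree(tree)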
Now it is time to use the decision tree. The classify function accepts a new observation as an argument and classifies it according to the decision tree:
def classify(observation, tree):
    if tree.results != None:
        return tree.results
    else:
        v = observation[tree.col]
        branch = None
        if isinstance(v, int) or isinstance(v, float):
            if v >= tree.value: branch = tree.tb
            else: branch = tree.fb
        else:
            if v == tree.value: branch = tree.tb
            else: branch = tree.fb
        return classify(observation, branch)
This function traverses the tree in the same way as printtree. After each call it checks whether it has reached the end of a branch, that is, a leaf node. If not, it examines the observation to see whether the value in the relevant column matches the reference value: if it matches, classify is called again on the True branch; otherwise it is called on the False branch.
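For example, classifying a new, hypothetical observation against the tree built above (the observation uses the same column order as the training rows, without the outcome column):
print(classify(['(direct)', 'USA', 'yes', 5], tree))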
There is a problem with the above method of training the decision tree:
Over-fitting: the tree may become too specific to the training data, so that its entropy is lower than it would be on real data. Pruning examines groups of nodes that share a parent and checks whether merging them would increase the entropy by less than a specified threshold. If so, those leaf nodes are merged into a single node that contains all of the possible result values. This helps avoid over-fitting, so the predictions made by the decision tree are no more confident than what the data actually supports:
def prune(tree, mingain):
    # If the branches are not leaf nodes, prune them recursively
    if tree.tb.results == None:
        prune(tree.tb, mingain)
    if tree.fb.results == None:
        prune(tree.fb, mingain)
    # If both branches are now leaf nodes, determine whether they should be merged
    if tree.tb.results != None and tree.fb.results != None:
        # Construct the merged dataset
        tb, fb = [], []
        for v, c in tree.tb.results.items():
            tb += [[v]] * c
        for v, c in tree.fb.results.items():
            fb += [[v]] * c
        # Check the reduction in entropy
        delta = entropy(tb + fb) - (entropy(tb) + entropy(fb)) / 2
        if delta < mingain:
            # Merge the branches into a single leaf node
            tree.tb, tree.fb = None, None
            tree.results = uniquecounts(tb + fb)
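A further issue is observations with missing data. For those, a variant of classify, mdclassify, follows both branches whenever the value needed for a decision is absent and weights the results from each branch; a minimal sketch, where weighting each branch by the number of training items that reached it is an assumption:
def mdclassify(observation, tree):
    if tree.results != None:
        return tree.results
    else:
        v = observation[tree.col]
        if v == None:
            # Value is missing: follow both branches and weight each branch's
            # results by the fraction of items that ended up on that branch.
            tr = mdclassify(observation, tree.tb)
            fr = mdclassify(observation, tree.fb)
            tcount = sum(tr.values())
            fcount = sum(fr.values())
            tw = float(tcount) / (tcount + fcount)
            fw = float(fcount) / (tcount + fcount)
            result = {}
            for k, c in tr.items(): result[k] = c * tw
            for k, c in fr.items(): result[k] = result.get(k, 0) + c * fw
            return result
        else:
            # Value present: proceed exactly as classify does
            if isinstance(v, int) or isinstance(v, float):
                if v >= tree.value: branch = tree.tb
                else: branch = tree.fb
            else:
                if v == tree.value: branch = tree.tb
                else: branch = tree.fb
            return mdclassify(observation, branch)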
The only difference between mdclassify and classify is at the end: if the value needed for a decision is missing, the result values for each branch are calculated, and each branch's results are multiplied by its respective weight before being combined.
For problems with numerical outcomes, we can use variance as the scoring function in place of entropy or Gini impurity. A low variance means the numbers are close to each other, while a high variance means they are widely scattered. Node split conditions are then chosen so that larger numbers end up on one side of the tree and smaller numbers on the other.
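A minimal sketch of such a scoring function, which computes the variance of the last (numeric) column of the rows and can be passed to buildtree as the scoref argument:
def variance(rows):
    # Variance of the outcome values stored in the last column
    if len(rows) == 0: return 0
    data = [float(row[len(row)-1]) for row in rows]
    mean = sum(data) / len(data)
    return sum([(d - mean) ** 2 for d in data]) / len(data)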
This concludes the study of how to implement decision tree modeling with Python. We hope it has helped resolve your doubts; pairing theory with practice is the best way to learn, so go and try it out! If you want to keep learning more related knowledge, please continue to follow the site, where the editor will keep working hard to bring you more practical articles.