How to Implement Decision Tree Modeling in Python


This article introduces how to implement decision tree modeling in Python. Many people have doubts about this topic in daily work, so the editor has consulted a variety of materials and put together a simple, easy-to-follow method. I hope it helps answer your questions about decision tree modeling in Python. Now, please follow along and study!

A decision tree is a simple machine learning method. After training, the decision tree looks like a series of if-then statements arranged in the form of a tree. Once we have a decision tree, we simply follow a path down the tree, answering each question in turn, and we eventually arrive at an answer. Tracing the path back up from the final leaf node gives the reasoning behind that classification result.

A decision tree node is represented as follows:

class decisionnode:
    def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
        self.col = col          # index of the column whose condition is tested at this node
        self.value = value      # the value the column must match for the result to be true
        self.results = results  # dictionary of results for this branch (leaf nodes only)
        self.tb = tb            # child node in the subtree followed when the result is true
        self.fb = fb            # child node in the subtree followed when the result is false

The following uses the CART (classification and regression tree) algorithm. To construct the decision tree, the algorithm first creates a root node, then evaluates all the observed variables in the table and selects the one that best splits the data. To choose that variable, we need a way to measure how mixed together the outcomes in a dataset are. There are several measures to choose from:

Gini impurity: the expected error rate if one of the results from a set is randomly applied to one of the items in the set.

Here is the Python implementation:


def uniquecounts(rows):
    results = {}
    for row in rows:
        # The result is the last column
        r = row[len(row)-1]
        if r not in results: results[r] = 0
        results[r] += 1
    return results

def giniimpurity(rows):
    total = len(rows)
    counts = uniquecounts(rows)
    imp = 0
    for k1 in counts:
        p1 = float(counts[k1])/total
        for k2 in counts:
            if k1 == k2: continue
            p2 = float(counts[k2])/total
            imp += p1*p2
    # equivalently: accumulate p1*p1 over all classes and return 1-imp
    return imp
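As a quick sanity check, here is a hypothetical example (the rows and labels below are made up purely for illustration), assuming each row is a list whose last element is the class label:

rows = [
    ['sunny', 'yes'],
    ['sunny', 'yes'],
    ['rainy', 'no'],
    ['rainy', 'yes'],
]
print(uniquecounts(rows))   # {'yes': 3, 'no': 1}
print(giniimpurity(rows))   # 2 * (3/4) * (1/4) = 0.375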

Entropy, another measure of how mixed a set is, can be implemented as follows:


def entropy(rows):
    from math import log
    log2 = lambda x: log(x)/log(2)
    results = uniquecounts(rows)
    # Now calculate the entropy
    ent = 0.0
    for r in results.keys():
        p = float(results[r])/len(rows)
        ent = ent - p*log2(p)
    return ent

The main difference between entropy and Gini impurity is that entropy peaks more slowly; as a result, it penalizes mixed sets somewhat more heavily.
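To see the difference, compare the two measures on an evenly mixed and an unevenly mixed two-class set (a hypothetical illustration using the functions above):

even = [['a']] * 5 + [['b']] * 5     # 50/50 mix of two classes
skewed = [['a']] * 8 + [['b']] * 2   # 80/20 mix

print(giniimpurity(even), entropy(even))      # 0.5  1.0
print(giniimpurity(skewed), entropy(skewed))  # 0.32  ~0.72: entropy stays relatively higher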

Our algorithm first calculates the entropy of the whole group, then tries splitting the group on every possible value of each attribute and calculates the entropy of the two new groups. The information gain is the difference between the current entropy and the weighted average entropy of the two new groups. The algorithm computes the information gain for each attribute and selects the attribute with the largest gain. By calculating the optimal split attribute at each new node, the branch-splitting process and the construction of the tree continue. When splitting a node no longer yields an information gain greater than 0, the splitting of that branch stops:


def buildtree(rows, scoref=entropy):
    if len(rows) == 0: return decisionnode()
    current_score = scoref(rows)

    # Set up some variables to track the best criteria
    best_gain = 0.0
    best_criteria = None
    best_sets = None

    column_count = len(rows[0]) - 1
    for col in range(0, column_count):
        # Generate the list of different values in this column
        column_values = {}
        for row in rows:
            column_values[row[col]] = 1
        # Now try dividing the rows up for each value in this column
        for value in column_values.keys():
            (set1, set2) = divideset(rows, col, value)

            # Information gain
            p = float(len(set1))/len(rows)
            gain = current_score - p*scoref(set1) - (1-p)*scoref(set2)
            if gain > best_gain and len(set1) > 0 and len(set2) > 0:
                best_gain = gain
                best_criteria = (col, value)
                best_sets = (set1, set2)

    # Create the sub branches
    if best_gain > 0:
        trueBranch = buildtree(best_sets[0], scoref)
        falseBranch = buildtree(best_sets[1], scoref)
        return decisionnode(col=best_criteria[0], value=best_criteria[1],
                            tb=trueBranch, fb=falseBranch)
    else:
        return decisionnode(results=uniquecounts(rows))

The function accepts a list of data rows as an argument. It loops over every column in the dataset, finds each possible value of that column, and splits the dataset into two new subsets. By multiplying the entropy of each subset by the fraction of items from the original dataset it contains, the function computes the weighted average entropy of each candidate pair of subsets and keeps track of the pair with the lowest value. If the best weighted average entropy is not lower than that of the current set, splitting stops and the counts of the possible results are stored in a leaf node. Otherwise, the algorithm calls buildtree on each newly generated subset and attaches the results to the tree: the call results are appended to the True branch and the False branch of the node respectively, and eventually the whole tree is constructed.
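Note that buildtree relies on a divideset helper that is not shown in this article. A minimal sketch, assuming numeric columns are split with a >= test and all other columns with an equality test (mirroring the logic used by classify below), might look like this:

def divideset(rows, column, value):
    # Choose the test used to split the rows on the given column
    if isinstance(value, int) or isinstance(value, float):
        split_function = lambda row: row[column] >= value
    else:
        split_function = lambda row: row[column] == value
    # Divide the rows into two sets and return them
    set1 = [row for row in rows if split_function(row)]
    set2 = [row for row in rows if not split_function(row)]
    return (set1, set2)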

We can print it out:


def printtree(tree, indent=''):
    # Is this a leaf node?
    if tree.results is not None:
        print(str(tree.results))
    else:
        # Print the criteria
        print(str(tree.col)+':'+str(tree.value)+'? ')
        # Print the branches
        print(indent+'T->', end=' ')
        printtree(tree.tb, indent+'  ')
        print(indent+'F->', end=' ')
        printtree(tree.fb, indent+'  ')

Now it's time to use the decision tree. The classify function accepts a new observation as a parameter and classifies it according to the tree:


def classify(observation, tree):
    if tree.results is not None:
        return tree.results
    else:
        v = observation[tree.col]
        branch = None
        if isinstance(v, int) or isinstance(v, float):
            if v >= tree.value: branch = tree.tb
            else: branch = tree.fb
        else:
            if v == tree.value: branch = tree.tb
            else: branch = tree.fb
        return classify(observation, branch)

This function traverses the tree in much the same way as printtree. After each call it checks whether it has reached the end of a branch (a leaf node). If not, it evaluates the observation to see whether the value in the relevant column matches the node's reference value. If it matches, classify is called again on the True branch; otherwise it is called on the False branch.
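Putting the pieces together on a small hypothetical dataset (the columns and values below are invented purely for illustration):

# Hypothetical rows: [referrer, country, pages_viewed, subscription]
my_data = [
    ['slashdot', 'USA', 18, 'None'],
    ['google', 'France', 23, 'Premium'],
    ['digg', 'USA', 24, 'Basic'],
    ['google', 'UK', 21, 'Premium'],
    ['(direct)', 'New Zealand', 12, 'None'],
    ['slashdot', 'UK', 21, 'None'],
    ['google', 'USA', 24, 'Premium'],
    ['kiwitobes', 'France', 23, 'Basic'],
]

tree = buildtree(my_data)                      # entropy is the default scoring function
printtree(tree)                                # inspect the learned splits
print(classify(['(direct)', 'USA', 5], tree))  # prints the result counts of the leaf reached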

There is a problem with the above method of training the decision tree:

Over-fitting: the tree may become too specific to the training data, and its entropy may be lower than it would be on real data. Pruning examines groups of nodes that share a parent and determines whether merging them would increase the entropy by less than a specified threshold. If so, those leaf nodes are merged into a single node that contains all the possible result values. This helps avoid over-fitting and keeps the decision tree from making predictions that are more specific than the data actually warrants:


def prune(tree, mingain):
    # If a branch is not a leaf node, prune it recursively
    if tree.tb.results is None:
        prune(tree.tb, mingain)
    if tree.fb.results is None:
        prune(tree.fb, mingain)

    # If both branches are now leaf nodes, determine whether they should be merged
    if tree.tb.results is not None and tree.fb.results is not None:
        # Construct the merged dataset
        tb, fb = [], []
        for v, c in tree.tb.results.items():
            tb += [[v]] * c
        for v, c in tree.fb.results.items():
            fb += [[v]] * c

        # Check the reduction in entropy
        delta = entropy(tb+fb) - (entropy(tb)+entropy(fb))/2
        if delta < mingain:
            # Merge the branches into a single leaf holding all result counts
            tree.tb, tree.fb = None, None
            tree.results = uniquecounts(tb+fb)

A variant of classify, mdclassify, handles observations with missing data: when the value needed at a node is unknown, it follows both branches and weights each branch's results by its share of the rows:

def mdclassify(observation, tree):
    if tree.results is not None:
        return tree.results
    v = observation[tree.col]
    if v is None:
        # Missing value: follow both branches and weight the results
        tr = mdclassify(observation, tree.tb)
        fr = mdclassify(observation, tree.fb)
        tcount = sum(tr.values())
        fcount = sum(fr.values())
        tw = float(tcount)/(tcount+fcount)
        fw = float(fcount)/(tcount+fcount)
        result = {}
        for k, c in tr.items(): result[k] = result.get(k, 0) + c*tw
        for k, c in fr.items(): result[k] = result.get(k, 0) + c*fw
        return result
    if isinstance(v, int) or isinstance(v, float):
        if v >= tree.value: branch = tree.tb
        else: branch = tree.fb
    else:
        if v == tree.value: branch = tree.tb
        else: branch = tree.fb
    return mdclassify(observation, branch)

The only difference between mdclassify and classify is how missing data is handled: when the value needed at a node is missing, the result values for both branches are calculated, each is multiplied by its respective weight, and the weighted results are combined.
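Continuing the hypothetical dataset from the earlier example, an observation whose country and pages-viewed values are unknown can still be classified:

# None marks the missing columns; mdclassify follows both branches at those nodes
print(mdclassify(['google', None, None], tree))   # weighted result counts rather than a single leaf's counts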

For problems with numerical outcomes, we can use variance as the evaluation function in place of entropy or Gini impurity. A low variance means the values are close to each other, while a high variance means they are widely scattered. Node split conditions are then chosen so that the larger numbers end up on one side of the tree and the smaller numbers on the other.
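A minimal sketch of such a variance-based scoring function, assuming the outcome in the last column is numeric; since buildtree takes its measure through the scoref parameter, it can be dropped in directly:

def variance(rows):
    # Variance of the numeric values in the last column of the rows
    if len(rows) == 0: return 0
    data = [float(row[len(row)-1]) for row in rows]
    mean = sum(data)/len(data)
    return sum([(d-mean)**2 for d in data])/len(data)

# e.g. tree = buildtree(numeric_rows, scoref=variance)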

At this point, the study of "how to implement decision tree modeling in Python" is over. I hope it helps resolve your doubts. Combining theory with practice is the best way to learn, so go and try it yourself! If you want to keep learning more related knowledge, please continue to follow the site; the editor will keep working hard to bring you more practical articles!
