
How to simulate a decision tree in Python

2025-01-29 Update From: SLTechnology News&Howtos



This article shows how to simulate a decision tree in Python. The content is concise and easy to follow, and I hope the detailed walkthrough below gives you something to take away.

1. The program simulates the three feature-selection criteria of a decision tree (empirical entropy, conditional entropy and information gain).

import pandas as pd
from math import log

# textbook sample data (the loan-application table)
def createData():
    datasets = [
        ['youth', 'no', 'no', 'general', 'no'],
        ['youth', 'no', 'no', 'good', 'no'],
        ['youth', 'yes', 'no', 'good', 'yes'],
        ['youth', 'yes', 'yes', 'general', 'yes'],
        ['youth', 'no', 'no', 'general', 'no'],
        ['middle-aged', 'no', 'no', 'general', 'no'],
        ['middle-aged', 'no', 'no', 'good', 'no'],
        ['middle-aged', 'yes', 'yes', 'good', 'yes'],
        ['middle-aged', 'no', 'yes', 'very good', 'yes'],
        ['middle-aged', 'no', 'yes', 'very good', 'yes'],
        ['old age', 'no', 'yes', 'very good', 'yes'],
        ['old age', 'no', 'yes', 'good', 'yes'],
        ['old age', 'yes', 'no', 'good', 'yes'],
        ['old age', 'yes', 'no', 'very good', 'yes'],
        ['old age', 'no', 'no', 'general', 'no'],
    ]
    labels = ['age', 'has a job', 'owns a house', 'credit situation', 'category']
    return datasets, labels

# empirical entropy H(D)
def calc_ent(datasets):
    data_length = len(datasets)
    label_count = {}                      # number of samples in each class
    for i in range(data_length):
        label = datasets[i][-1]           # the class label is the last column
        if label not in label_count:      # first time we meet a new class
            label_count[label] = 0
        label_count[label] += 1
    # H(D) = -sum(p * log2(p)); each p is the count of one class over |D|
    ent = -sum((p / data_length) * log(p / data_length, 2)
               for p in label_count.values())
    return ent

# empirical conditional entropy H(D|A); axis selects the feature A
def cond_ent(datasets, axis=0):
    data_length = len(datasets)
    feature_sets = {}
    for i in range(data_length):
        feature = datasets[i][axis]
        if feature not in feature_sets:
            feature_sets[feature] = []
        feature_sets[feature].append(datasets[i])   # group the rows by the value of feature A
    # H(D|A) = sum(|Di|/|D| * H(Di)); each p is the subset Di of rows sharing one value of A
    cond_ent = sum((len(p) / data_length) * calc_ent(p)
                   for p in feature_sets.values())
    return cond_ent

# information gain g(D, A) = H(D) - H(D|A)
def info_gain(ent, cond_ent):
    return ent - cond_ent

# compute the information gain of every feature and return the best one
def info_gain_train(datasets, labels):
    count = len(datasets[0]) - 1
    ent = calc_ent(datasets)
    print("current empirical entropy: {:.3f}\n".format(ent))
    best_feature = []
    for c in range(count):
        c_cond_ent = cond_ent(datasets, axis=c)
        c_info_gain = info_gain(ent, c_cond_ent)
        best_feature.append((c, c_info_gain))
        print("conditional empirical entropy of feature {} is: {:.3f}".format(c + 1, c_cond_ent))
        print("information gain of the current feature ({}) is: {:.3f}".format(labels[c], c_info_gain))
    # return the (index, gain) tuple with the largest information gain
    best = max(best_feature, key=lambda x: x[-1])
    return best

dataset, label = createData()
trainData = pd.DataFrame(dataset, columns=label)
best_feature = info_gain_train(dataset, label)
print("\nfeature {} ({}) has the maximum information gain: {:.3f}".format(
    best_feature[0] + 1, label[best_feature[0]], best_feature[1]))
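As a quick sanity check on these formulas, the numbers the program prints can be reproduced by hand. The following is a minimal sketch (not part of the original program), assuming the 15-row textbook table above: 9 rows are labelled 'yes' and 6 'no', and the feature 'owns a house' splits into 6 'yes' rows that are all class 'yes' plus 9 'no' rows containing 3 'yes' and 6 'no'. The values should match the Result block below.

from math import log

# empirical entropy H(D) for 9 'yes' and 6 'no' out of 15 rows
H_D = -(9/15) * log(9/15, 2) - (6/15) * log(6/15, 2)          # ~0.971

# conditional entropy H(D | owns a house): the 'yes' subset is pure (entropy 0),
# the 'no' subset has 3 'yes' and 6 'no' rows
H_no_subset = -(3/9) * log(3/9, 2) - (6/9) * log(6/9, 2)
H_D_house = (6/15) * 0 + (9/15) * H_no_subset                  # ~0.551

# information gain g(D, owns a house) = H(D) - H(D | owns a house) ~ 0.420
print(round(H_D, 3), round(H_D_house, 3), round(H_D - H_D_house, 3))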

Result

current empirical entropy: 0.971

conditional empirical entropy of feature 1 is: 0.888
information gain of the current feature (age) is: 0.083
conditional empirical entropy of feature 2 is: 0.647
information gain of the current feature (has a job) is: 0.324
conditional empirical entropy of feature 3 is: 0.551
information gain of the current feature (owns a house) is: 0.420
conditional empirical entropy of feature 4 is: 0.608
information gain of the current feature (credit situation) is: 0.363

feature 3 (owns a house) has the maximum information gain: 0.420

2. The program implements the ID3 algorithm.

import numpy as np
import pandas as pd
from math import log

def create_data():
    datasets = [
        ['youth', 'no', 'no', 'general', 'no'],
        ['youth', 'no', 'no', 'good', 'no'],
        ['youth', 'yes', 'no', 'good', 'yes'],
        ['youth', 'yes', 'yes', 'general', 'yes'],
        ['youth', 'no', 'no', 'general', 'no'],
        ['middle-aged', 'no', 'no', 'general', 'no'],
        ['middle-aged', 'no', 'no', 'good', 'no'],
        ['middle-aged', 'yes', 'yes', 'good', 'yes'],
        ['middle-aged', 'no', 'yes', 'very good', 'yes'],
        ['middle-aged', 'no', 'yes', 'very good', 'yes'],
        ['old age', 'no', 'yes', 'very good', 'yes'],
        ['old age', 'no', 'yes', 'good', 'yes'],
        ['old age', 'yes', 'no', 'good', 'yes'],
        ['old age', 'yes', 'no', 'very good', 'yes'],
        ['old age', 'no', 'no', 'general', 'no'],
    ]
    # return the dataset and the name of each column
    labels = ['age', 'has a job', 'owns a house', 'credit situation', 'category']
    return datasets, labels

# tree node (used for both internal nodes and leaves)
class Node:
    def __init__(self, root=True, label=None, feature_name=None, feature=None):
        self.root = root                  # whether this node is a single-node (leaf) tree
        self.label = label                # class label stored at this node
        self.feature_name = feature_name  # name of the splitting feature
        self.feature = feature            # index of the splitting feature in the feature list
        self.tree = {}                    # children, keyed by feature value
        self.result = {
            'label:': self.label,
            'feature': self.feature,
            'tree': self.tree
        }

    def __repr__(self):
        # information of the current node
        return '{}'.format(self.result)

    def add_node(self, val, node):
        self.tree[val] = node

    def predict(self, features):
        if self.root is True:
            return self.label
        return self.tree[features[self.feature]].predict(features)

class DTree:
    def __init__(self, epsilon=0.1):
        # threshold: if the best information gain is below epsilon, it is treated as negligible
        self.epsilon = epsilon
        self._tree = {}

    # empirical entropy H(D)
    @staticmethod
    def calc_ent(datasets):
        data_length = len(datasets)
        label_count = {}
        for i in range(data_length):
            label = datasets[i][-1]
            if label not in label_count:
                label_count[label] = 0
            label_count[label] += 1
        ent = -sum([(p / data_length) * log(p / data_length, 2)
                    for p in label_count.values()])
        return ent

    # empirical conditional entropy H(D|A)
    def cond_ent(self, datasets, axis=0):
        data_length = len(datasets)
        feature_sets = {}
        for i in range(data_length):
            feature = datasets[i][axis]
            if feature not in feature_sets:
                feature_sets[feature] = []
            feature_sets[feature].append(datasets[i])
        cond_ent = sum([(len(p) / data_length) * self.calc_ent(p)
                        for p in feature_sets.values()])
        return cond_ent

    # information gain
    @staticmethod
    def info_gain(ent, cond_ent):
        return ent - cond_ent

    def info_gain_train(self, datasets):
        # find the feature with the largest information gain on the current dataset
        count = len(datasets[0]) - 1
        ent = self.calc_ent(datasets)
        best_feature = []
        for c in range(count):
            c_info_gain = self.info_gain(ent, self.cond_ent(datasets, axis=c))
            best_feature.append((c, c_info_gain))
        best_ = max(best_feature, key=lambda x: x[-1])
        return best_

    def train(self, train_data):
        """
        input: dataset D (DataFrame), feature set A, threshold eta
        output: decision tree T
        """
        _, y_train, features = (
            train_data.iloc[:, :-1],   # all columns except the last
            train_data.iloc[:, -1],    # only the label column
            train_data.columns[:-1],   # the feature names
        )
        # The steps below follow the ID3 algorithm (D is the training set, A the feature set).
        # 1. If every instance in D belongs to the same class Ck, T is a single-node tree
        #    labelled Ck; return T.
        if len(y_train.value_counts()) == 1:
            return Node(root=True, label=y_train.iloc[0])
        # 2. If A is empty, T is a single-node tree labelled with the class that has the most
        #    instances in D; return T.
        if len(features) == 0:
            return Node(
                root=True,
                label=y_train.value_counts().sort_values(ascending=False).index[0])
        # 3. Otherwise compute the information gain of every feature and pick the largest.
        #    max_feature is the index of that feature, max_info_gain its information gain.
        max_feature, max_info_gain = self.info_gain_train(np.array(train_data))
        max_feature_name = features[max_feature]
        # 4. If the maximum information gain is below the threshold epsilon, ignore it:
        #    T is a single-node tree labelled with the majority class in D; return T.
        if max_info_gain < self.epsilon:
            return Node(
                root=True,
                label=y_train.value_counts().sort_values(ascending=False).index[0])
        # 5. Split D into subsets Di according to every possible value ai of the chosen
        #    feature Ag, and build a child node for each subset.
        node_tree = Node(root=False, feature_name=max_feature_name, feature=max_feature)
        # the distinct values of the chosen feature
        feature_list = train_data[max_feature_name].value_counts().index
        for f in feature_list:
            # drop the chosen feature and keep only the rows whose value equals f
            sub_train_df = train_data.loc[train_data[max_feature_name] == f].drop(
                [max_feature_name], axis=1)
            # 6. Recursively train on the remaining data to build the subtree, then attach it
            #    to the current node under the value f. The tree is therefore assembled
            #    bottom-up: the recursion reaches the leaves first and links each subtree to
            #    its parent as it unwinds.
            sub_tree = self.train(sub_train_df)
            node_tree.add_node(f, sub_tree)
        return node_tree

    def fit(self, train_data):
        self._tree = self.train(train_data)
        return self._tree

    def predict(self, X_test):
        return self._tree.predict(X_test)

datasets, labels = create_data()
data_df = pd.DataFrame(datasets, columns=labels)
dt = DTree()
tree = dt.fit(data_df)
print(tree)
print("Predicting the data [old age, no, no, general]; the result is:")
print(dt.predict(['old age', 'no', 'no', 'general']))

Result

{'label:': None, 'feature': 2, 'tree': {'no': {'label:': None, 'feature': 1, 'tree': {'no': {'label:': 'no', 'feature': None, 'tree': {}}, 'yes': {'label:': 'yes', 'feature': None, 'tree': {}}}}, 'yes': {'label:': 'yes', 'feature': None, 'tree': {}}}}
Predicting the data [old age, no, no, general]; the result is:
no

3. Simulation with sklearn.

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
import graphviz

def create_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['label'] = iris.target
    df.columns = [
        'sepal length', 'sepal width', 'petal length', 'petal width', 'label'
    ]
    # keep the first 100 samples (two classes) and only the first two features
    data = np.array(df.iloc[:100, [0, 1, -1]])
    return data[:, :2], data[:, -1]

X, y = create_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

# To run the statements below, Graphviz must be installed first. Either download the exe or
# zip from the official site (in that case you configure the environment variable yourself),
# or, if installation from PyCharm fails, use the command line:
#   pip install graphviz -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
# which installs it under Anaconda, so no environment variable is needed.
# Afterwards run `dot -v` on the command line to check the installation, then run `dot -c`,
# and finally run this program.
tree_pic = export_graphviz(clf, out_file="mytree.pdf")  # write the graphviz description of the tree
with open('mytree.pdf') as f:
    dot_graph = f.read()                # read the dot statements back
graph = graphviz.Source(dot_graph)      # build a graphviz Source object from them
graph.view()                            # render the tree

Result

0.9666666666666667

The initially generated mytree.pdf file:

A file drawn by dot (automatically named Source.gv.pdf):
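As an aside, if installing the Graphviz binaries is inconvenient, recent scikit-learn releases (0.21 and later) can draw the tree directly with matplotlib via sklearn.tree.plot_tree. The following is a minimal sketch under the same iris setup as section 3 (the file name mytree_plt.png is just an example):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

# same setup as section 3: first 100 iris samples, first two features (two classes)
iris = load_iris()
X, y = iris.data[:100, :2], iris.target[:100]
clf = DecisionTreeClassifier().fit(X, y)

plt.figure(figsize=(8, 6))
plot_tree(clf, feature_names=['sepal length', 'sepal width'],
          class_names=['setosa', 'versicolor'], filled=True)
plt.savefig('mytree_plt.png')  # or plt.show() to display it interactively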

The above is how to simulate a decision tree in Python. Did you pick up the knowledge or skills you were after? If you would like to learn more or broaden your knowledge, you are welcome to follow the industry information channel.

