How to build a regression tree from scratch with Python code 07/06 Update SLTechnology News&Howtos

How to build a regression tree from scratch with Python code

2025-07-06 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/01 Report--

This article shows you how to use Python code to build a regression tree from scratch, the content is concise and easy to understand, can definitely brighten your eyes, through the detailed introduction of this article, I hope you can get something.

Introduction

Flowcharts are used to illustrate the decision-making process through visual media. Design requires a complete understanding of the whole system, so it also requires human professional knowledge. The question is: "in terms of the complexity of the process, can you automatically create a flowchart to make it faster, cheaper, and more scalable?" The answer is the decision tree!

The decision tree can automatically infer the rules that can best express the internal work of the decision. When trained on a marked dataset, the decision tree learns the rule tree (that is, the flowchart) and follows it to determine the output of any given input. Their simplicity and high interpretation make them an important asset in the ML toolkit.

A regression tree-a decision tree with continuous output-is described and code snippets for learning and prediction are implemented. Use the Boston dataset to create use case scenarios and learn the rules that define house prices. You can find a link to the complete code in the references.

A flowchart for processing COVID-19. [1]

Learning rules

Looking for rules, similar to the tree flow chart, is the best explanation of the relationship between the function of the house and its price. Each rule becomes a node in the tree and divides the house into disjoint sets, such as a house with two rooms, a house with three rooms, and a house with more than three rooms. Rules can also be based on a variety of functions, such as a house with two rooms near the Charles River. Therefore, the space of all possible trees is huge and needs to be simplified to solve learning problems in a computational way.

As the first simplification, consider only the binary rule: the rule that divides the house into two parts, such as "is the house less than three rooms?" . As the second, the combination of features is omitted because the number of combinations may be large, and only rules based on one feature are considered. Under these simplifications, the rule is a "less than relationship" with two parts: characteristics (such as the number of rooms) and partition thresholds (such as three).

Based on this rule definition, we build a rule tree by recursively searching for rules that preferably divide the data into two parts.

In other words, first divide the data into two splits as much as possible, and then consider each split separately. The segmentation continues until a predetermined condition, such as maximum depth, is met. Due to simplification and greedy rule search, the constructed tree is only an approximation of the best tree. Below, you can find the Python code that implements this learning.

A recursive split process implemented with Python.

Implement the split process as a function and call it using the training data (X-ray training recently received train). This function finds the best rules to divide the training data into two parts and divides them according to the found rules. It invokes itself continuously by using left-right split as training data until it reaches the maximum depth specified in advance or the training data is too small to divide. When the stop condition is met, it will stop the division and predict the house price based on the average price of the training data in the current split.

In the split function, the exception rule is defined as a dictionary of keys with left,right,feature and threshold. The optimal partition rule is returned by another function that scans possible rules in detail by traversing each feature and threshold in the training set. The threshold for determining a feature depends on the value taken by the feature throughout the data set. This is the code:

The function of finding the best rule, which separates the training data on hand.

This feature tracks the best rules by measuring the quality of segmentation recommended by the rules. Quality is measured by a "lower the better" metric called "sum of squares of residuals" (RSS) (see the notebook in Resources for more details on RSS). Finally, the best rule is returned as a dictionary.

Rules of interpretation

The learning algorithm automatically selects features and thresholds to create rules that best explain the relationship between house features and their prices. Here is the visualization of the rule tree learned from the Boston dataset with a maximum depth of 3. It can be observed that the extracted rules overlap with human intuition. In addition, house prices can be predicted as easily as following a flow chart.

A rule tree with a maximum depth of 3 learned from the Boston dataset.

Now describe a process that automatically uses the above flowchart for prediction. Given a house with the characteristics of the data set, the question is raised in the node and propagated according to the answer until the prediction (that is, the leaf node) is obtained.

For example, houses located in the following locations: (I) the lower identity percentage is 5.3, (ii) the average room per house is 10.2, and (iii) the per capita crime rate is 0.01, the first question is "yes", the second question is "no" and the third is YES. Therefore, the price is expected to be 45.80K. The visualization of the path you follow can be found below.

The example in the tree rule predicts the path.

It is very simple to code the prediction process using the dictionary returned by the split function. Traverse the rule dictionary by comparing the eigenvalues and thresholds specified by the rule. Move left or right based on the answer until you encounter a rule with a prediction key (that is, a leaf node). The following code snippet is used for prediction.

The ability to use learned trees to predict house prices.

A regression tree is a fast and intuitive structure used as a regression model. For Boston datasets, they can reach an R ²score of about 0.9 when the maximum depth is adjusted appropriately. But they can be vulnerable to small changes in the dataset, which makes them unreliably used as a single predictive variable. Random forest and gradient enhanced tree are proposed to solve the problem of high sensitivity, which can produce the same results as the depth model.

The above is how to use Python code to build a regression tree from scratch. Have you learned any knowledge or skills? If you want to learn more skills or enrich your knowledge reserve, you are welcome to follow the industry information channel.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.