This article mainly explains the basic principles of XGBoost. The explanation is simple, quick, and practical; readers who are interested are encouraged to follow along.
One. XGBoost and GBDT
XGBoost is an ensemble learning algorithm. Of the three commonly used ensemble methods (bagging, boosting, stacking), it belongs to the boosting family. It is an additive model; the base model is usually a tree model, but other model types such as logistic regression can also be used.
XGBoost belongs to the family of gradient boosted decision tree (GBDT) models. The basic idea of GBDT is to let each new base model (a CART classification-and-regression tree in GBDT) fit the residual error of the model built so far, so that the deviation of the additive model keeps decreasing.
Compared with classic GBDT, XGBoost makes several improvements, leading to a clear gain in both accuracy and performance (a frequent interview topic).
First, GBDT expands the objective function in a first-order Taylor series, while XGBoost expands it to second order. The second-order expansion retains more information about the objective function, which helps improve accuracy.
Second, GBDT derives a new fitting target for each new base model (the negative gradient of the current additive model), whereas XGBoost derives a new objective function for the new base model (the second-order Taylor expansion of the overall objective with respect to the new base model).
Third, XGBoost adds a penalty on the number of leaves and an L2 regularization term on the leaf weights, which helps the model achieve lower variance.
Fourth, XGBoost adds a strategy for handling missing feature values automatically. It tries assigning the samples with missing values to the left subtree and to the right subtree, compares the objective function under the two schemes, and routes them to the better side, so missing features need no imputation or preprocessing (a short sketch follows after this list).
In addition, XGBoost supports approximate split finding with candidate quantiles, feature-level parallelism, and other optimizations that improve performance.
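As a quick illustration of the missing-value handling above, here is a minimal sketch using the xgboost Python package; the dataset and parameter values are made up for demonstration:

```python
# Minimal sketch: xgboost accepts NaN as a native missing-value marker.
# During training, samples with missing values are tried in the left and
# right child of each split, and the better direction is learned.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.2] = np.nan           # inject ~20% missing values
y = (np.nan_to_num(X[:, 0]) > 0).astype(int)    # toy labels

dtrain = xgb.DMatrix(X, label=y)                # NaN is the default "missing"
booster = xgb.train({"objective": "binary:logistic", "max_depth": 3},
                    dtrain, num_boost_round=10)
print(booster.predict(dtrain)[:5])              # predicted probabilities
```

No imputation step is needed; the learned default directions are stored in the model.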
Two. An overview of the XGBoost principle
The following gives a general introduction to the principle of XGBoost from three aspects: the hypothesis space, the objective function, and the optimization algorithm.
1. Hypothesis space
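In the standard formulation (Chen & Guestrin, 2016), the hypothesis space is the set of additive tree ensembles:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F},$$

where $K$ is the number of base models and $\mathcal{F}$ is the space of regression trees; each $f_k$ maps a sample to the weight of the leaf it falls into.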
2. Objective function
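In the same standard notation, the objective is the training loss plus a regularization term for each tree:

$$\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2,$$

where $l$ is a differentiable loss function, $T$ is the number of leaves of a tree, $w_j$ are its leaf weights, and $\gamma, \lambda$ control the regularization (the leaf-count penalty and the L2 term on leaf weights mentioned in Section One).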
3. Optimization algorithm
Basic idea: a greedy method that learns one tree at a time, each tree fitting the deviation of the model built so far.
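Concretely, at step $t$ the first $t-1$ trees are frozen and a single new tree $f_t$ is added:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i),$$

so the intractable joint search over all trees is replaced by a sequence of much simpler single-tree problems.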
Three. What does the t-th tree learn?
To finish building the XGBoost model, we need to settle the following questions.
1. How to boost? Given the additive model formed by the first t-1 trees, how do we determine the learning objective of the t-th tree?
2. How to build a tree? Once the learning objective of the t-th tree is known, how do we learn it? Specifically: should a node be split? Which feature should be chosen for the split? Which split point should be chosen? And how are the values of the leaf nodes determined?
We first consider how to boost, which along the way also settles how the leaf values are determined; the standard derivation below covers both.
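In standard XGBoost notation (this is the usual derivation from the XGBoost paper, restated here because it is what the two questions above refer to), the objective at step $t$ is approximated by a second-order Taylor expansion around the current predictions:

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}\left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t), \qquad g_i = \partial_{\hat{y}^{(t-1)}_i} l\left(y_i, \hat{y}^{(t-1)}_i\right), \quad h_i = \partial^2_{\hat{y}^{(t-1)}_i} l\left(y_i, \hat{y}^{(t-1)}_i\right).$$

Grouping samples by the leaf $j$ they fall into, with $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, the optimal weight of leaf $j$ and the resulting structure score are:

$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \tilde{\mathcal{L}}^{(t)} = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T.$$

So the learning goal of the t-th tree is to minimize this structure score, and the value of each leaf node follows directly from the first formula.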
Four. How is the t-th tree generated?
XGBoost uses binary trees. At the start, all samples sit on a single leaf node; leaf nodes are then repeatedly split in two, gradually growing the tree.
XGBoost uses a level-wise growth strategy: at each step, all leaf nodes at the same depth are split.
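For reference, the xgboost library exposes this choice as a parameter; a minimal sketch (only the parameter names and values come from the library):

```python
import xgboost as xgb

# Level-wise growth ("depthwise") is the strategy described above; the
# histogram-based tree method also offers a leaf-wise alternative.
params = {
    "tree_method": "hist",
    "grow_policy": "depthwise",   # split all leaves at the same depth
    # "grow_policy": "lossguide", # alternative: split the highest-gain leaf first
}
# booster = xgb.train(params, dtrain)   # dtrain: a hypothetical xgb.DMatrix
```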
Splitting leaf nodes to grow the tree raises several basic questions: should a node be split at all? Which feature should it split on? At which value of that feature? And what value should the new leaves take?
The leaf-value question was solved above, so let's focus on the remaining ones.
1. Should the node be split?
Depending on the tree's pruning strategy, there are two ways to handle this question. With a pre-pruning strategy, a node is split only if some split makes the objective function decrease.
With a post-pruning strategy, nodes are split unconditionally; after the tree is fully grown, each branch is checked from top to bottom and removed if it does not make a positive contribution to reducing the objective function.
XGBoost adopts the pre-pruning strategy: a node is split only if the gain after splitting is greater than 0.
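The gain of a candidate split follows from the structure score in Section Three. In standard notation, with $G_L, H_L$ and $G_R, H_R$ the gradient sums over the left and right children:

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma.$$

The $-\gamma$ term is what makes "greater than 0" a pre-pruning test: a split is kept only if it improves the loss by more than the per-leaf penalty $\gamma$.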
2. Which feature should be selected for splitting?
XGBoost selects the splitting feature using feature-level parallelism: with multiple threads, each feature is tried as the splitting feature, the optimal split point of each feature is found, the gain produced by each split is computed, and the feature with the largest gain is selected as the splitting feature.
3. Which split point should be chosen?
XGBoost has two methods for choosing a feature's split point: the global scanning method and the candidate quantile method.
In the global scanning method, the values of the feature across all samples are sorted from small to large, and every possible split position is tried to find the split point with the largest gain. Its computational complexity is proportional to the number of distinct values the feature takes among the samples on the leaf node.
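Here is a minimal sketch of the global scanning method for a single feature, using the gain formula above; it is an illustration, not xgboost's actual implementation:

```python
import numpy as np

def best_split_exact(x, g, h, lam=1.0, gamma=0.0):
    """Scan all split positions of one feature.
    x: feature values; g, h: per-sample 1st/2nd-order gradients."""
    order = np.argsort(x)
    xs, gs, hs = x[order], g[order], h[order]
    G, H = gs.sum(), hs.sum()                 # totals over the node
    GL = HL = 0.0
    best_gain, best_thresh = 0.0, None
    for i in range(len(xs) - 1):
        GL += gs[i]; HL += hs[i]              # move sample i to the left side
        if xs[i] == xs[i + 1]:                # only split between distinct values
            continue
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thresh = gain, (xs[i] + xs[i + 1]) / 2
    return best_gain, best_thresh
```

The scan touches each distinct value once after an O(n log n) sort, which is why the cost scales with the number of distinct feature values on the node.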
The candidate quantile method is an approximate algorithm: only a constant number of candidate split positions (for example 256) are selected, and the best split is then chosen from among those candidates.
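Below is a minimal sketch of candidate selection with plain quantiles (xgboost actually uses a weighted quantile sketch, so this only approximates the idea), followed by the real library parameters that enable approximate split finding:

```python
import numpy as np
import xgboost as xgb

def candidate_splits(x, n_candidates=256):
    """Pick a constant number of candidate split positions from quantiles."""
    qs = np.linspace(0, 1, n_candidates + 2)[1:-1]   # interior quantiles
    return np.unique(np.quantile(x, qs))             # deduplicate ties

# In the library, histogram-based split finding and the number of
# candidate bins are chosen via parameters:
params = {"tree_method": "hist", "max_bin": 256}
```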
At this point, you should have a deeper understanding of the basic principles of XGBoost. You might as well try them out in practice.