
What's the difference between GBDT and XGBoost?


This article shares the differences between GBDT and XGBoost. The editor finds it very practical, so it is shared here for everyone to learn from; I hope you get something out of it after reading.

1. Traditional GBDT uses CART as the base classifier, while XGBoost also supports linear classifiers; in that case XGBoost is equivalent to logistic regression (for classification) or linear regression (for regression) with L1 and L2 regularization terms.
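As a small illustration of this point, here is a minimal sketch (assuming the Python xgboost package and a scikit-learn toy dataset; all parameter values are placeholders) that trains the same data once with tree base learners and once with the linear booster:

```python
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Default: CART-style trees as base learners (booster="gbtree").
tree_params = {"booster": "gbtree", "objective": "binary:logistic", "max_depth": 3}
tree_model = xgb.train(tree_params, dtrain, num_boost_round=50)

# Linear base learners: a boosted L1/L2-regularized linear model,
# roughly comparable to regularized logistic regression on this task.
linear_params = {"booster": "gblinear", "objective": "binary:logistic",
                 "alpha": 0.1, "lambda": 1.0}  # L1 and L2 regularization
linear_model = xgb.train(linear_params, dtrain, num_boost_round=50)
```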

2. Traditional GBDT uses only first-derivative information in the optimization, while XGBoost takes a second-order Taylor expansion of the cost function and uses both first and second derivatives. Incidentally, the xgboost tool supports custom cost functions, as long as the function has first and second derivatives.
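To make "first and second derivatives" concrete, here is a hedged sketch of a custom objective in the Python xgboost API: the function returns the gradient and the Hessian of the loss with respect to the raw prediction (squared error is used purely as an example, and the data is random toy data):

```python
import numpy as np
import xgboost as xgb

def squared_error_obj(preds, dtrain):
    """Custom objective: return the first (grad) and second (hess) derivatives
    of 0.5 * (pred - label)^2 with respect to the raw prediction."""
    labels = dtrain.get_label()
    grad = preds - labels          # first-order derivative
    hess = np.ones_like(preds)     # second-order derivative (constant for squared error)
    return grad, hess

# Random toy data purely for illustration.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
dtrain = xgb.DMatrix(X, label=y)

model = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                  num_boost_round=20, obj=squared_error_obj)
```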

3. XGBoost adds a regularization term to the cost function to control model complexity. The regularization term contains the number of leaf nodes of the tree and the squared L2 norm of the score output on each leaf node. From the bias-variance tradeoff perspective, the regularization term reduces the variance of the model, makes the learned model simpler and prevents overfitting; this is another respect in which XGBoost improves on traditional GBDT. (More on this point in the supplement below.)
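For reference, the regularized objective as written in the XGBoost paper is shown below, where T is the number of leaves of a tree and w_j is the score on leaf j; in the xgboost library, γ and λ correspond to the gamma and lambda (reg_lambda) parameters:

```latex
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2},
\qquad
\mathrm{Obj} = \sum_{i} l\!\left(y_i, \hat{y}_i\right) + \sum_{k} \Omega(f_k)
```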

4. Shrinkage, equivalent to a learning rate (eta in xgboost). After each iteration, XGBoost multiplies the leaf-node weights by this coefficient, mainly to weaken the influence of each individual tree and leave more room for later trees to learn. In practice, eta is usually set a little smaller and the number of iterations a little larger. (Note: traditional GBDT implementations also have a learning rate.)
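A minimal sketch of the "smaller eta, more rounds" advice, on random toy data and with placeholder values; early stopping on a validation set picks the actual number of rounds:

```python
import numpy as np
import xgboost as xgb

# Toy data purely for illustration.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 10)), rng.integers(0, 2, size=500)
dtrain = xgb.DMatrix(X[:400], label=y[:400])
dvalid = xgb.DMatrix(X[400:], label=y[400:])

# A smaller eta (learning rate) shrinks each tree's contribution;
# compensate with more boosting rounds and let early stopping pick the count.
params = {"objective": "binary:logistic", "eta": 0.05, "max_depth": 6}
model = xgb.train(params, dtrain, num_boost_round=1000,
                  evals=[(dvalid, "valid")], early_stopping_rounds=50)
```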

5. Column subsampling, i.e. feature sampling. XGBoost borrows this idea from random forests: it supports column sampling, which not only reduces overfitting but also reduces computation. This is another way XGBoost differs from traditional GBDT.
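In the Python package this is just a parameter setting; the values below are illustrative, not recommendations:

```python
import numpy as np
import xgboost as xgb

# Toy data purely for illustration.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 20)), rng.integers(0, 2, size=300)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "colsample_bytree": 0.8,   # sample 80% of the features for each tree
    "subsample": 0.8,          # row subsampling, shown for comparison
    "max_depth": 6,
}
model = xgb.train(params, dtrain, num_boost_round=100)
```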

6. Handling of missing values. For samples with missing feature values, XGBoost can automatically learn the splitting direction for them.
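A small sketch, assuming the Python xgboost package: np.nan entries are treated as missing by default, and each split learns a default direction for them:

```python
import numpy as np
import xgboost as xgb

# Features with missing values: XGBoost treats np.nan as "missing" by default
# and learns a default split direction for missing values at each node.
X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 0.5],
              [4.0, 1.0]])
y = np.array([0, 1, 0, 1])

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)  # `missing` defaults to np.nan
model = xgb.train({"objective": "binary:logistic", "max_depth": 2},
                  dtrain, num_boost_round=5)
```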

7. The xgboost tool supports parallelism. Isn't boosting a sequential procedure? How can it run in parallel? Note that XGBoost's parallelism is not at the granularity of trees; XGBoost still finishes one iteration before starting the next (the cost function of iteration t contains the predictions of the previous iteration). XGBoost's parallelism is at the granularity of features.

We know that one of the most time-consuming steps in learning a decision tree is sorting the feature values (in order to determine the best split point). XGBoost sorts the data in advance, before training, and saves the result in a block structure that is reused in later iterations, greatly reducing the amount of computation. This block structure also makes parallelism possible: when splitting a node, the gain of each feature must be computed and the feature with the largest gain is chosen for the split, so the gain computations for the different features can be carried out by multiple threads.
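The following simplified sketch shows the per-feature scan over a pre-sorted column; the gain formula follows the XGBoost paper, while the block storage and multithreading of the real (C++) implementation are omitted. Because each feature column is scanned independently, the loop over features can run in parallel:

```python
import numpy as np

def best_split_gain(feature_values, grad, hess, reg_lambda=1.0, gamma=0.0):
    """Scan one pre-sorted feature column and return the best split gain.
    Gain formula follows the XGBoost paper; this is a simplified sketch."""
    order = np.argsort(feature_values)   # in XGBoost this sort is done once and stored in a block
    g, h = grad[order], hess[order]
    G, H = g.sum(), h.sum()
    GL = HL = 0.0
    best = -np.inf
    for i in range(len(g) - 1):          # candidate split after position i
        GL += g[i]; HL += h[i]
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + reg_lambda)
                      + GR**2 / (HR + reg_lambda)
                      - G**2 / (H + reg_lambda)) - gamma
        best = max(best, gain)
    return best

# Each feature column is scanned independently, so the loop over features
# can be parallelized (feature-granularity parallelism).
```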

Parallelizable approximate histogram algorithm. When a tree node is split, we need to compute the gain of every candidate split point of every feature, i.e. the greedy method enumerates all possible split points. When the data cannot be loaded into memory at once, or in the distributed setting, the greedy algorithm becomes very inefficient, so XGBoost also proposes a parallelizable approximate histogram algorithm to efficiently generate candidate split points.
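The real algorithm uses a weighted quantile sketch (with the Hessians as weights); the toy sketch below only illustrates the idea of replacing exhaustive enumeration with a handful of quantile-based candidate split points:

```python
import numpy as np

def candidate_splits(feature_values, num_bins=32):
    """Approximate split finding: instead of enumerating every distinct value,
    propose candidate split points at (roughly) equally spaced quantiles.
    The real XGBoost algorithm uses a weighted quantile sketch."""
    quantiles = np.linspace(0, 100, num_bins + 1)[1:-1]
    return np.unique(np.percentile(feature_values, quantiles))

x = np.random.lognormal(size=10_000)
print(candidate_splits(x, num_bins=16))   # ~15 candidate thresholds instead of ~10,000
```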

Supplement

When tuning xgboost/gbdt, why does a small tree depth already achieve high accuracy? With xgboost/gbdt, setting the maximum tree depth to 6 already gives very accurate results, but with DecisionTree/RandomForest the tree depth has to go to 15 or more. I can understand that RandomForest needs the same depth as DecisionTree, because it combines decision trees by bagging, which is like running DecisionTree many times. But xgboost/gbdt achieves high prediction accuracy with a depth of only 6 just by gradient boosting, which surprises me so much that I suspect some black magic. How does xgboost/gbdt do it? Are its nodes different from those of an ordinary DecisionTree?

This is a very good question: the asker has clearly studied both algorithms in detail, and the question touches on the nature of the two algorithms. It is actually not a simple question; I will try to answer it with my limited machine learning knowledge.

In a word, quoting Zhou Zhihua's textbook (Machine Learning, Zhou Zhihua): Boosting mainly focuses on reducing bias, so Boosting can build a strong ensemble from learners with weak generalization performance; Bagging mainly focuses on reducing variance, so it is more effective with unpruned decision trees, neural networks and other such learners.

Random forest and GBDT both belong to ensemble learning. Under ensemble learning there are two important strategies: Bagging and Boosting.

The Bagging algorithm works like this: each classifier randomly draws samples from the original data, the classifiers are trained on these sampled sets, and the classifiers are then combined, usually by a simple majority vote. Its representative algorithm is random forest. Boosting trains a series of classifiers iteratively, where the sample distribution used by each classifier depends on the results of the previous round of learning. Its representative algorithms are AdaBoost and GBDT.
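To mirror the question above, here is a small sketch (scikit-learn, synthetic data, illustrative settings) that compares a bagged ensemble of deep trees with a boosted ensemble of shallow trees:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Bagging-style: deep, low-bias / high-variance trees, averaged to cut variance.
rf = RandomForestClassifier(n_estimators=200, max_depth=None, random_state=0)

# Boosting-style: shallow, high-bias / low-variance trees, added up to cut bias.
gbdt = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=0)

print("RF  :", cross_val_score(rf, X, y, cv=5).mean())
print("GBDT:", cross_val_score(gbdt, X, y, cv=5).mean())
```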

In fact, for any machine learning algorithm, the generalization error can be decomposed into two parts, bias and variance (the derivation uses the identity Var(X) = E[X²] − (E[X])²). Bias refers to the gap between the algorithm's expected prediction and the true value, and reflects the fitting ability of the model itself; variance measures how much the learned model changes when a training set of the same size changes, and describes the impact of data perturbation. This sounds a bit convoluted, but you have certainly run into it in the context of overfitting.
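Written out, the standard decomposition of the expected generalization error (notation assumed here: f(x;D) is the model trained on dataset D, \bar{f}(x) its expectation over datasets, and σ² the noise) is:

```latex
\mathbb{E}_{D}\big[(f(x;D) - y)^{2}\big]
  = \underbrace{\big(\bar{f}(x) - y\big)^{2}}_{\text{bias}^{2}}
  + \underbrace{\mathbb{E}_{D}\big[(f(x;D) - \bar{f}(x))^{2}\big]}_{\text{variance}}
  + \underbrace{\sigma^{2}}_{\text{noise}},
\qquad \bar{f}(x) = \mathbb{E}_{D}\big[f(x;D)\big]
```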

The more complex the model, the better it fits the training data and the smaller the training bias; but if you then switch to a different training set, the model may change a great deal, that is, the variance of the model is large. So when the model is too complex, it leads to overfitting.

These are the differences between GBDT and XGBoost. The editor believes some of these points may come up in everyday work and hopes this article helps you learn more.
