XGBoost and LightGBM are both ensemble models built from classification and regression trees (CART). Let's first discuss how the ensemble is formed.
The ensemble method is gradient boosting.
For example, suppose we want to fit the data shown below:
The first model we build, shown as the broken line in the figure above, does not fit well, so we build a new model to refine the result. How do we build the second model? We take the difference between the actual target value and the first model's prediction (the residual) as the target value of the second model, as shown in the figure below, and fit another model to it:
We then keep creating new models in the same way; the process looks like this:
Finally, these models are combined, steadily improving prediction accuracy.
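A minimal sketch of this residual-fitting loop, using sklearn's DecisionTreeRegressor as an assumed base learner (the article does not specify an implementation; the data and hyperparameters are placeholders):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy data to fit (assumed for illustration)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.randn(200)

n_rounds, learning_rate = 50, 0.1
prediction = np.zeros_like(y)   # F_0(x) = 0
models = []

for _ in range(n_rounds):
    residual = y - prediction                     # target for the next model
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residual)                         # fit the residual
    prediction += learning_rate * tree.predict(X) # add the new model's output
    models.append(tree)

# final prediction for new inputs: sum of all trees' (scaled) outputs
def ensemble_predict(X_new):
    return sum(learning_rate * m.predict(X_new) for m in models)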
The procedure is as follows: define a loss function, state the minimization objective, and let each base learner fit the partial derivative (negative gradient) of the loss with respect to the current prediction.
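The original equations are shown only as images; in the standard gradient-boosting formulation they read:

Loss function: $L(y, F(x))$
Minimization objective: $F^{*} = \arg\min_{F} \; \mathbb{E}_{x,y}\big[L(y, F(x))\big]$
Additive model, one base learner per round: $F_{m}(x) = F_{m-1}(x) + \gamma_{m} h_{m}(x)$
Target (pseudo-residual) fitted by each base learner $h_m$: $r_{im} = -\left[\dfrac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}$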
The base learner of both algorithms is the classification and regression tree (CART): the tree splits (classifies) the data on its features, and the leaves output regression values.
The main question when building a decision tree is how to choose the right feature and the right split point to partition the dataset, and what criterion to judge a split by.
One option is brute force: traverse every value of every feature. Note that, to split the dataset quickly, the data is first sorted by the feature under consideration, so that when cutting the dataset we do not have to compare every sample against the split threshold again.
The split criterion is how much the variance of the target values falls after the split (for continuous target values).
Suppose the last column of the data in the following code is the target value:
import numpy as np

def err_cnt(dataSet):
    '''Split criterion of the regression tree
    input:  dataSet(list): training data
    output: m*s^2(float): total variance
    '''
    data = np.mat(dataSet)
    return np.var(data[:, -1]) * np.shape(data)[0]
With this split criterion, each tree can be built.
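As a sketch of the sorted, single-pass split search described above (assuming a single feature column and a continuous target; the helper name is hypothetical, and running sums avoid re-scanning the data for every threshold):

import numpy as np

def best_split_sorted(feature, target):
    '''Find the split value minimizing the total variance of the two sides.
    Sorting once lets every threshold be evaluated with running sums.'''
    order = np.argsort(feature)
    x, y = feature[order], target[order]
    n = len(y)
    csum, csum_sq = np.cumsum(y), np.cumsum(y ** 2)
    best_err, best_value = np.inf, None
    for i in range(1, n):                      # split between positions i-1 and i
        if x[i] == x[i - 1]:
            continue
        left_n, right_n = i, n - i
        left_sum, right_sum = csum[i - 1], csum[-1] - csum[i - 1]
        left_sq, right_sq = csum_sq[i - 1], csum_sq[-1] - csum_sq[i - 1]
        # m * s^2 = sum(y^2) - (sum(y))^2 / m, the same quantity as err_cnt above
        err = (left_sq - left_sum ** 2 / left_n) + (right_sq - right_sum ** 2 / right_n)
        if err < best_err:
            best_err, best_value = err, x[i]
    return best_value, best_err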
However, traversing every feature and every value of every feature is inefficient. Here LightGBM improves on XGBoost: it builds a histogram of each feature's value distribution and then splits the dataset at the histogram's bin edges.
An important parameter here is max_bin: how many bins are used to build the histogram of each feature. The more bins, the finer the candidate splits, so before tuning this parameter you should understand the distribution of each feature and whether it really needs that many bins. Histograms also have a useful property: the histogram of a sibling node can be obtained by subtracting the child's histogram from the parent's.
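A rough numpy sketch of the histogram idea (a hypothetical helper, not LightGBM's actual implementation): bucket one feature into at most max_bin bins, accumulate gradient sums per bin, and reuse the parent histogram via subtraction:

import numpy as np

def build_histogram(feature, gradients, max_bin=255):
    '''Bucket one feature into at most max_bin bins and sum gradients per bin.
    Candidate split points are then the bin edges, not every raw value.'''
    edges = np.histogram_bin_edges(feature, bins=max_bin)
    bin_ids = np.clip(np.digitize(feature, edges[1:-1]), 0, max_bin - 1)
    hist = np.zeros(max_bin)
    np.add.at(hist, bin_ids, gradients)   # gradient sum per bin
    return edges, hist

# histogram trick: the sibling's histogram is the parent's minus the child's
# hist_sibling = hist_parent - hist_child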
After building each tree we have to consider overfitting. For example, if every leaf node holds only one sample, we can fit the training set perfectly, but generalization may be very poor because the tree is grown very deep. Two parameters control this:
min_data_in_leaf
This is the minimum number of samples a leaf node must contain. Look at the distribution of target values in the training set: some targets may lie very far from the rest (for example, in an insurance-related data competition a few target values may be extremely large, perhaps corresponding to a major accident). If you really want to fit those points, you would have to set this parameter to 1.
min_sum_hessian_in_leaf
This parameter is the minimum sum of the hessians (sample weights) on a leaf. Setting it very small means a leaf can hold very little data, which again risks overfitting; setting it larger forces more data onto each leaf. Choose it according to the distribution of the target values in the dataset.
n_estimators specifies how many trees to build.
num_leaves specifies how many leaves each tree may have. Note that LightGBM and XGBoost grow trees differently (the trees are not complete binary trees, so num_leaves can be less than 2^max_depth).
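A hedged sketch of how these parameters might be set with LightGBM's native training API (the dataset and the values are placeholders, not recommendations):

import numpy as np
import lightgbm as lgb

# tiny synthetic dataset just to make the example runnable
X_train = np.random.rand(500, 5)
y_train = X_train[:, 0] + 0.1 * np.random.randn(500)

params = {
    "objective": "regression",
    "num_leaves": 31,                # leaves per tree; can be < 2**max_depth
    "max_depth": -1,                 # -1 means no explicit depth limit
    "max_bin": 255,                  # number of histogram bins per feature
    "min_data_in_leaf": 20,          # minimum samples per leaf
    "min_sum_hessian_in_leaf": 1e-3, # minimum sum of hessians (weights) per leaf
    "learning_rate": 0.1,
}
train_set = lgb.Dataset(X_train, label=y_train)
booster = lgb.train(params, train_set, num_boost_round=100)  # num_boost_round ~ n_estimators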
XGBoost grows a tree level-wise, as follows:
LightGBM grows a tree leaf-wise, as follows:
The node with the larger residual is grown first: in the figure above, after the second step the leaf on the next level with the larger residual is expanded, instead of filling in the children of the right subtree.
When building a new model, LightGBM does not treat every sample equally: it samples the data, preferentially keeping samples with larger gradients (residuals) as input to the new model.
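This is the idea behind LightGBM's GOSS (Gradient-based One-Side Sampling); a simplified numpy sketch (the rates are illustrative, not LightGBM's internals):

import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1):
    '''Keep all large-gradient samples, subsample the rest, and up-weight
    the subsampled part so the gradient sum stays roughly unbiased.'''
    n = len(gradients)
    top_n = int(top_rate * n)
    rest_n = int(other_rate * n)
    order = np.argsort(-np.abs(gradients))     # sort by |gradient|, descending
    top_idx = order[:top_n]                    # always keep these
    rest_idx = np.random.choice(order[top_n:], size=rest_n, replace=False)
    weights = np.ones(n)
    weights[rest_idx] *= (1.0 - top_rate) / other_rate  # compensate the subsampling
    used_idx = np.concatenate([top_idx, rest_idx])
    return used_idx, weights[used_idx]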
There is a good blog post on parameter tuning here:
https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
Finally, here is how to build a classification and regression tree from scratch:
import numpy as np
import pickle


class node:
    '''Node of the regression tree'''
    def __init__(self, fea=-1, value=None, results=None, right=None, left=None):
        self.fea = fea          # column index of the feature used to split the dataset
        self.value = value      # split value
        self.results = results  # leaf node value
        self.right = right      # right subtree
        self.left = left        # left subtree


def load_data(data_file):
    '''Import training data
    input:  data_file(string): file that holds the training data
    output: data(list): training data
    '''
    data = []
    f = open(data_file)
    for line in f.readlines():
        sample = []
        lines = line.strip().split("\t")
        for x in lines:
            sample.append(float(x))  # convert to float
        data.append(sample)
    f.close()
    return data


def split_tree(data, fea, value):
    '''Split dataset data into left and right subtrees by value of feature fea
    input:  data(list): training samples
            fea(float): index of the feature to split on
            value(float): split value
    output: (set_1, set_2)(tuple): right and left subtree samples
    '''
    set_1 = []  # right subtree samples
    set_2 = []  # left subtree samples
    for x in data:
        if x[fea] >= value:
            set_1.append(x)
        else:
            set_2.append(x)
    return (set_1, set_2)


def leaf(dataSet):
    '''Compute the leaf node value
    input:  dataSet(list): training samples
    output: np.mean(data[:, -1])(float): mean of the target values
    '''
    data = np.mat(dataSet)
    return np.mean(data[:, -1])


def err_cnt(dataSet):
    '''Split criterion of the regression tree
    input:  dataSet(list): training data
    output: m*s^2(float): total variance
    '''
    data = np.mat(dataSet)
    return np.var(data[:, -1]) * np.shape(data)[0]


def build_tree(data, min_sample, min_err):
    '''Build the tree
    input:  data(list): training samples
            min_sample(int): minimum number of samples in a leaf node
            min_err(float): minimum error
    output: node: root node of the tree
    '''
    # build the decision tree and return its root node
    if len(data) <= min_sample:
        return node(results=leaf(data))

    # 1. initialize
    best_err = err_cnt(data)
    bestCriteria = None  # best split: (feature index, split value)
    bestSets = None      # the two datasets produced by the best split

    # 2. search for the best split
    feature_num = len(data[0]) - 1
    for fea in range(0, feature_num):
        feature_values = {}
        for sample in data:
            feature_values[sample[fea]] = 1
        for value in feature_values.keys():
            # 2.1 try splitting the dataset at this value
            (set_1, set_2) = split_tree(data, fea, value)
            if len(set_1) < 2 or len(set_2) < 2:
                continue
            # 2.2 compute the error after the split
            now_err = err_cnt(set_1) + err_cnt(set_2)
            # 2.3 keep the split with the smallest error
            if now_err < best_err and len(set_1) > 0 and len(set_2) > 0:
                best_err = now_err
                bestCriteria = (fea, value)
                bestSets = (set_1, set_2)

    # 3. decide whether the splitting is over
    if best_err > min_err:
        right = build_tree(bestSets[0], min_sample, min_err)
        left = build_tree(bestSets[1], min_sample, min_err)
        return node(fea=bestCriteria[0], value=bestCriteria[1],
                    right=right, left=left)
    else:
        return node(results=leaf(data))  # return the current value as the final leaf value


def predict(sample, tree):
    '''Predict for a single sample
    input:  sample(list): one sample
            tree: trained CART regression tree model
    output: results(float): predicted value
    '''
    # 1. the tree is only a root (leaf)
    if tree.results is not None:
        return tree.results
    else:
        # 2. there are left and right subtrees
        val_sample = sample[tree.fea]  # value of the sample at feature fea
        branch = None
        # 2.1 go to the right subtree
        if val_sample >= tree.value:
            branch = tree.right
        # 2.2 go to the left subtree
        else:
            branch = tree.left
        return predict(sample, branch)


def cal_error(data, tree):
    '''Evaluate the CART regression tree model
    input:  data(list): samples
            tree: trained CART regression tree model
    output: err/m(float): mean squared error
    '''
    m = len(data)         # number of samples
    n = len(data[0]) - 1  # number of features per sample
    err = 0.0
    for i in range(m):
        tmp = []
        for j in range(n):
            tmp.append(data[i][j])
        pre = predict(tmp, tree)  # predicted value for this sample
        # accumulate the squared residual
        err += (data[i][-1] - pre) * (data[i][-1] - pre)
    return err / m


def save_model(regression_tree, result_file):
    '''Save the trained CART regression tree model to disk
    input:  regression_tree: regression tree model
            result_file(string): file name
    '''
    with open(result_file, 'wb') as f:
        pickle.dump(regression_tree, f)


if __name__ == "__main__":
    # 1. import the training data
    print("----------- 1、load data -------------")
    data = load_data("sine.txt")
    # 2. build the CART tree
    print("----------- 2、build CART ------------")
    regression_tree = build_tree(data, 30, 0.3)
    # 3. evaluate the CART tree
    print("----------- 3、cal err -------------")
    err = cal_error(data, regression_tree)
    print("\t--------- err : ", err)
    # 4. save the final CART model
    print("----------- 4、save result -----------")
    save_model(regression_tree, "regression_tree")
The data format is as follows:
The feature columns come first and are separated by tabs; the last column is the target value.
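For completeness, a small sketch (assumed, not from the original) that writes such a tab-separated sine.txt for the script above to consume:

import numpy as np

# one feature column and one target column, tab-separated per line
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x) + 0.1 * np.random.randn(200)
with open("sine.txt", "w") as f:
    for xi, yi in zip(x, y):
        f.write("%f\t%f\n" % (xi, yi))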