How to use CatBoost for fast gradient boosting

2025-01-18 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/01

This article explains how to use CatBoost for fast gradient boosting. The editor finds it very practical and is sharing it here; hopefully you will take something away from it. Let's take a look.

We will take a closer look at a gradient boosting library called CatBoost.

In gradient boosting, predictions are made by an ensemble of weak learners. Unlike a random forest, which builds each decision tree independently on its own sample of the data, gradient boosting builds trees one after another. Trees already in the model are not changed. The results of the previous trees are used to improve the next one.
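To make the "one tree after another" idea concrete, here is a minimal sketch in plain Python. It assumes squared-error loss and uses one-split stumps as the weak learners; the function names (fit_stump, boost, predict) and the toy data are made up for illustration, and this is not how CatBoost itself is implemented.

```python
# Minimal gradient boosting for squared error with one-split "stumps".
# Illustration only: names and data are hypothetical, not CatBoost internals.

def fit_stump(xs, residuals):
    """Pick the threshold on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Fit stumps sequentially; each one is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)  # earlier stumps are never modified
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for x, p in zip(xs, predictions)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
model = boost(xs, ys)

mean_y = sum(ys) / len(ys)
baseline_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)
boosted_mse = sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(f"baseline MSE: {baseline_mse:.4f}, boosted MSE: {boosted_mse:.4f}")
```

Each new stump only sees what the earlier ones got wrong, which is the sense in which previous results improve the next tree.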

CatBoost is a depth-wise gradient boosting library developed by Yandex. It uses oblivious decision trees to grow balanced trees. The same feature is used to make the left and right splits at each level of the tree.

(CatBoost official link: https://github.com/catboost)

Compared with classic trees, oblivious trees are more efficient to implement on CPU and are simple to fit.
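One way to see why oblivious trees are cheap to evaluate: since every node on a level shares the same split, a tree of depth d is just d (feature, threshold) pairs plus 2^d leaf values, and a prediction is a table lookup. The sketch below is an illustration under that assumption; the function name, bit layout, and numbers are hypothetical and do not reflect CatBoost's internal model format.

```python
# An oblivious tree stored as one (feature_index, threshold) pair per level.
# Illustration only: layout and values are hypothetical.

def oblivious_tree_predict(splits, leaf_values, row):
    """Each level contributes one bit; the bits together index the leaf."""
    index = 0
    for level, (feature, threshold) in enumerate(splits):
        if row[feature] > threshold:
            index |= 1 << level
    return leaf_values[index]


# Depth-2 tree: level 0 tests feature 0, level 1 tests feature 1.
splits = [(0, 0.5), (1, 10.0)]
leaf_values = [0.1, 0.2, 0.3, 0.4]  # 2 ** depth leaves

print(oblivious_tree_predict(splits, leaf_values, [0.7, 5.0]))   # leaf index 1
print(oblivious_tree_predict(splits, leaf_values, [0.2, 12.0]))  # leaf index 2
```

There is no branching on which subtree we are in, only d comparisons and one lookup.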

Handling categorical features

The common ways to handle categorical features in machine learning are one-hot encoding and label encoding. CatBoost allows you to use categorical features without preprocessing them.

When using CatBoost, we should not use one-hot encoding, because it affects training speed and the quality of predictions. Instead, we simply specify the categorical features with the cat_features parameter.

Advantages of using CatBoost

Here are some reasons to consider using CatBoost:

CatBoost allows training on multiple GPUs.

It gives good results with default parameters, which reduces the time needed for parameter tuning.

It offers improved accuracy thanks to reduced overfitting.

It makes fast predictions using CatBoost's model applier.

Trained CatBoost models can be exported to Core ML for on-device inference (iOS).

It can handle missing values internally.

It can be used for regression and classification problems.

Training parameters

Let's take a look at the common parameters in CatBoost:

loss_function (alias objective): the metric used for training, such as root mean squared error for regression and log loss for classification.

eval_metric: the metric used for detecting overfitting.

iterations: the maximum number of trees to be built; the default is 1000. Aliases are num_boost_round, n_estimators, and num_trees.

learning_rate (alias eta): the learning rate, which determines how fast or slow the model learns. The default is usually 0.03.

random_seed (alias random_state): the random seed used for training.

l2_leaf_reg (alias reg_lambda): the coefficient of the L2 regularization term of the cost function. The default is 3.0.

bootstrap_type: the sampling method used to determine the weights of objects, e.g. Bayesian, Bernoulli, MVS, and Poisson.

depth: the depth of the tree.

grow_policy: determines how the greedy search algorithm is applied. It can be SymmetricTree, Depthwise, or Lossguide; SymmetricTree is the default. With SymmetricTree, the tree is built level by level until the specified depth is reached, and at each step the leaves of the previous level are split with the same condition. With Depthwise, the tree is also built level by level until the specified depth is reached, but at each step all non-terminal leaves of the last level are split, each with the condition that gives the best loss improvement. With Lossguide, the tree is built leaf by leaf until the specified number of leaves is reached; at each step the non-terminal leaf with the best loss improvement is split.

min_data_in_leaf (alias min_child_samples): the minimum number of training samples in a leaf. This parameter is used only with the Lossguide and Depthwise growing policies.

max_leaves (alias num_leaves): the maximum number of leaves in the tree. This parameter is used only with the Lossguide policy.

ignored_features: the features that should be ignored during training.

nan_mode: the method for handling missing values. The options are Forbidden, Min, and Max; the default is Min. With Forbidden, missing values cause an error. With Min, missing values are treated as the minimum value of the feature. With Max, missing values are treated as the maximum value of the feature.

leaf_estimation_method: the method used to calculate the values in leaves. Classification uses 10 Newton iterations. Regression problems with quantile or MAE loss use one Exact iteration. Multiclass classification uses one Newton iteration.

leaf_estimation_backtracking: the type of backtracking used during gradient descent. The default is AnyImprovement, which decreases the descent step until the loss function value is smaller than it was at the last iteration. Armijo decreases the descent step until the Armijo condition is met.

boosting_type: the boosting scheme. It can be Plain for the classic gradient boosting scheme, or Ordered, which offers better quality on smaller datasets.

score_function: the score type used to select the next split during tree construction. Cosine is the default. The other available options are L2, NewtonL2, and NewtonCosine.

early_stopping_rounds: when set, sets the overfitting detector type to Iter and stops training once the metric has failed to improve for the given number of iterations after its best value.

classes_count: the number of classes for multiclass classification problems.

task_type: whether to use CPU or GPU. CPU is the default.

devices: the IDs of the GPU devices used for training.

cat_features: an array with the categorical columns.

text_features: used to declare text columns in classification problems.

Regression example

CatBoost follows the scikit-learn conventions in its implementation. Let's see how it can be used for regression.

As usual, the first step is to import the regressor and instantiate it.

When fitting the model, CatBoost also lets you visualize the training process by setting plot=True.

It also allows you to perform cross-validation and visualize the process.

Similarly, you can perform a grid search and visualize it.

That is how to use CatBoost for fast gradient boosting. These are points you are likely to see or use in day-to-day work, and I hope this article helps you learn more.
