
How to Use CatBoost in AI


This article introduces how to use CatBoost in AI. The material is detailed, easy to follow, and has practical reference value. I hope you gain something from reading it. Let's take a look.

When dealing with large amounts of data, it often becomes necessary to compress the feature space into vectors. One example is text embeddings, which are an integral part of almost every NLP pipeline. Unfortunately, it is not always possible to use neural networks to work with this kind of data, for example because of slow fitting or inference speed.

Here is an interesting but little-known way to use such data: gradient boosting.

Data and materials

A recent Kaggle competition ended with a small dataset containing textual data. I decided to use this data in the experiment because the competition showed that the dataset was labeled well, and I didn't run into any unpleasant surprises.

Columns:

id - unique ID for the excerpt

url_legal - URL of the source

license - license of the source material

excerpt - text to predict reading ease for

target - reading ease score

standard_error - measure of the spread of scores among multiple raters for each excerpt

Since the target of the dataset is a numerical variable, the natural formulation is a regression problem. However, I decided to turn it into a classification problem instead. The main reason is that the library I will use does not support text and embedding features for regression problems. I hope the developers remove this limitation in the future. In any case, regression and classification are closely related here, and for this analysis it makes no difference which one we solve.

Let's calculate the number of bins using Sturges' rule:

import numpy as np
import pandas as pd

num_bins = int(np.floor(1 + np.log2(len(train))))
train['target_q'], bin_edges = pd.qcut(train['target'], q=num_bins,
                                       labels=False, retbins=True, precision=0)

But first, I clean up the data.

train['license'] = train['license'].fillna('nan')
train['license'] = train['license'].astype('category').cat.codes

With a small self-written function, I cleaned the text and lemmatized the words. The function is not sophisticated, but it is enough for my experiment.

import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def clean_text(text):
    table = text.maketrans(dict.fromkeys(string.punctuation))
    words = word_tokenize(text.lower().strip().translate(table))
    words = [word for word in words if word not in stopwords.words('english')]
    lemmed = [WordNetLemmatizer().lemmatize(word) for word in words]
    return " ".join(lemmed)

I saved the cleaned text as a new feature.

train['clean_excerpt'] = train['excerpt'].apply(clean_text)

In addition to the text itself, I can extract individual words from the URL and turn them into another text feature.

import re

def getWordsFromURL(url):
    return re.compile(r"[\:/?=\-&.]+", re.UNICODE).split(url)

train['url_legal'] = train['url_legal'].fillna("nan") \
                                       .apply(getWordsFromURL) \
                                       .apply(lambda x: " ".join(x))

I also created several new features from the text: various simple statistics. There is a lot of room for creativity here, but this is enough for our purposes. The main goal of these features is to be useful for the baseline model.

from statistics import mean
from nltk.tokenize import sent_tokenize

def get_sentence_lengths(text):
    tokened = sent_tokenize(text)
    lengths = []
    for idx, i in enumerate(tokened):
        splited = list(i.split(" "))
        lengths.append(len(splited))
    return (max(lengths), min(lengths), round(mean(lengths), 3))

def create_features(df):
    df_f = pd.DataFrame(index=df.index)
    df_f['text_len'] = df['excerpt'].apply(len)
    df_f['text_clean_len'] = df['clean_excerpt'].apply(len)
    df_f['text_len_div'] = df_f['text_clean_len'] / df_f['text_len']
    df_f['text_word_count'] = df['clean_excerpt'].apply(lambda x: len(x.split(' ')))
    df_f[['max_len_sent', 'min_len_sent', 'avg_len_sent']] = \
        df.apply(lambda x: get_sentence_lengths(x['excerpt']),
                 axis=1, result_type='expand')
    return df_f

train = pd.concat([train, create_features(train)], axis=1, copy=False, sort=False)

basic_f_columns = ['text_len', 'text_clean_len', 'text_len_div', 'text_word_count',
                   'max_len_sent', 'min_len_sent', 'avg_len_sent']

When data is scarce, it is difficult to test hypotheses, and the results are usually unstable. Therefore, in order to have more confidence in the results, I prefer to use OOF (Out-of-Fold) predictions in this case.
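As a rough sketch of the idea (the author's full implementation, get_oof_classifier, appears in the Baseline section below), OOF prediction means that every row is predicted by a model that did not see that row during training. The make_model factory here is a hypothetical stand-in for any scikit-learn-style classifier:

import numpy as np
from sklearn.model_selection import StratifiedKFold

def oof_proba(make_model, X, y, n_folds=5, seed=0):
    # X is a DataFrame, y is a numpy array of class labels.
    # For each row, keep the class probabilities predicted by the model
    # trained on the folds that do not contain that row.
    oof = np.zeros((len(X), len(np.unique(y))))
    kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for tr_idx, te_idx in kf.split(X, y):
        model = make_model()  # fresh model per fold
        model.fit(X.iloc[tr_idx], y[tr_idx])
        oof[te_idx] = model.predict_proba(X.iloc[te_idx])
    return oof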

Baseline

I chose CatBoost as the free library for the model. CatBoost is a high-performance, open-source library for gradient boosting on decision trees. Starting with version 0.19.1, it supports out-of-the-box text classification on GPU. Its main advantage is that CatBoost can include categorical and text features in your data without additional preprocessing.

In "Unconventional Sentiment Analysis: BERT vs. CatBoost", I expanded on how CatBoost handles text and compared it with BERT.

This library has a trump card: it knows how to work with embeddings. Unfortunately, at the moment there is nothing about this in the documentation, and very few people know about this advantage of CatBoost.

!pip install catboost

When working with CatBoost, I recommend using Pool. It is a convenient wrapper that combines features, labels, and additional metadata such as categorical and text features.
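To illustrate, here is a minimal sketch (not the author's pipeline, which follows below) of a Pool that combines categorical, text, and embedding columns. It assumes a recent CatBoost version that accepts embedding_features and a hypothetical DataFrame df whose 'emb' column holds fixed-length float vectors:

from catboost import CatBoostClassifier, Pool

# 'license' is passed as a categorical column, 'clean_excerpt' as raw text
# (tokenized by CatBoost itself), and 'emb' as a column of dense vectors.
train_pool = Pool(
    data=df[['license', 'clean_excerpt', 'emb']],
    label=df['target_q'],
    cat_features=['license'],
    text_features=['clean_excerpt'],
    embedding_features=['emb'],
)

model = CatBoostClassifier(iterations=1000)
model.fit(train_pool, verbose=100)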

To compare experiments, I created a baseline model that uses only numerical and categorical features.

I wrote a function to initialize and train the model. By the way, I didn't choose the best parameters.

from catboost import CatBoostClassifier, Pool

def fit_model_classifier(train_pool, test_pool, **kwargs):
    model = CatBoostClassifier(
        task_type='GPU',
        iterations=5000,
        eval_metric='AUC',
        od_type='Iter',
        od_wait=500,
        l2_leaf_reg=10,
        bootstrap_type='Bernoulli',
        subsample=0.7,
        **kwargs
    )
    return model.fit(
        train_pool,
        eval_set=test_pool,
        verbose=100,
        plot=False,
        use_best_model=True
    )

For the implementation of OOF, I wrote a small and simple function.

from sklearn.model_selection import StratifiedKFold

def get_oof_classifier(n_folds, x_train, y, embedding_features, cat_features,
                       text_features, tpo, seeds, num_bins,
                       emb=None, tolist=True):
    ntrain = x_train.shape[0]
    oof_train = np.zeros((len(seeds), ntrain, num_bins))
    models = {}
    for iseed, seed in enumerate(seeds):
        kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
        for i, (tr_i, t_i) in enumerate(kf.split(x_train, y)):
            if emb and len(emb) > 0:
                x_tr = pd.concat([x_train.iloc[tr_i, :],
                                  get_embeddings(x_train.iloc[tr_i, :], emb, tolist)],
                                 axis=1, copy=False, sort=False)
                x_te = pd.concat([x_train.iloc[t_i, :],
                                  get_embeddings(x_train.iloc[t_i, :], emb, tolist)],
                                 axis=1, copy=False, sort=False)
                columns = [x for x in x_tr if (x not in ['excerpt'])]
                if not embedding_features:
                    for c in emb:
                        columns.remove(c)
            else:
                x_tr = x_train.iloc[tr_i, :]
                x_te = x_train.iloc[t_i, :]
                columns = [x for x in x_tr if (x not in ['excerpt'])]
            x_tr = x_tr[columns]
            x_te = x_te[columns]
            y_tr = y[tr_i]
            y_te = y[t_i]
            train_pool = Pool(data=x_tr, label=y_tr,
                              cat_features=cat_features,
                              embedding_features=embedding_features,
                              text_features=text_features)
            valid_pool = Pool(data=x_te, label=y_te,
                              cat_features=cat_features,
                              embedding_features=embedding_features,
                              text_features=text_features)
            model = fit_model_classifier(train_pool, valid_pool,
                                         random_seed=seed,
                                         text_processing=tpo)
            oof_train[iseed, t_i, :] = model.predict_proba(valid_pool)
            models[(seed, i)] = model
    oof_train = oof_train.mean(axis=0)
    return oof_train, models

I will describe the get_embeddings function below; it is not used to build the baseline model.

I trained the baseline model with the following parameters:

columns = ['license', 'url_legal'] + basic_f_columns

oof_train_cb, models_cb = get_oof_classifier(
    n_folds=5,
    x_train=train[columns],
    y=train['target_q'].values,
    embedding_features=None,
    cat_features=['license'],
    text_features=['url_legal'],
    tpo=tpo,
    seeds=[0, 42, 888],
    num_bins=num_bins
)

Quality of the trained model:

from sklearn.metrics import roc_auc_score

roc_auc_score(train['target_q'], oof_train_cb, multi_class="ovo")

AUC: 0.684407

Now I have a benchmark for model quality. Judging by the numbers, this model is very weak, and I would not put it into production.

Embeddings

An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make machine learning easier on large inputs, such as sparse vectors representing words. Ideally, an embedding captures some of the input's semantics by placing semantically similar inputs close to each other in the embedding space.
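As a small illustration (not part of the author's pipeline), semantically similar sentences should land close to each other under a pre-trained sentence encoder such as the mpnet model used later in this article; the example sentences below are made up:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/paraphrase-mpnet-base-v2')
vecs = model.encode(["The cat sat on the mat.",
                     "A cat was sitting on a rug.",
                     "Quarterly revenue grew by twelve percent."])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vecs[0], vecs[1]))  # paraphrases: expected to be relatively high
print(cosine(vecs[0], vecs[2]))  # unrelated topics: expected to be lower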

There are many ways to obtain such vectors, and I don't cover them in this article because that is not the purpose of this study. It is enough for me to get embeddings in any way; the main thing is that they preserve the necessary information. In most cases, I use the currently popular approach: a pre-trained Transformer.

import torch
from sentence_transformers import SentenceTransformer

STRANSFORMERS = {
    'sentence-transformers/paraphrase-mpnet-base-v2': ('mpnet', 768),
    'sentence-transformers/bert-base-wikipedia-sections-mean-tokens': ('wikipedia', 768)
}

def get_encode(df, encoder, name):
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = SentenceTransformer(encoder, cache_folder=f'./hf_{name}/')
    model.to(device)
    model.eval()
    return np.array(model.encode(df['excerpt']))

def get_embeddings(df, emb=None, tolist=True):
    ret = pd.DataFrame(index=df.index)
    for e, s in STRANSFORMERS.items():
        if emb and s[0] not in emb:
            continue
        ret[s[0]] = list(get_encode(df, e, s[0]))
        if tolist:
            ret = pd.concat([ret,
                             pd.DataFrame(ret[s[0]].tolist(),
                                          columns=[f'{s[0]}_{x}' for x in range(s[1])],
                                          index=ret.index)],
                            axis=1, copy=False, sort=False)
    return ret

Now I have everything to start testing different versions of the model.

Model

I have several options for fitting the model:

Text features

Embedding features

Embeddings unpacked into a separate list of numerical features

I trained various combinations of these options, which allows me to draw conclusions about how useful embeddings are, or whether they are just over-engineering.
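For orientation, here is a sketch of how these three configurations could map onto the get_oof_classifier parameters defined above. This is my reading of that code, not something spelled out explicitly, so treat the exact flags as an assumption:

common = dict(n_folds=5, y=train['target_q'].values, cat_features=['license'],
              text_features=['clean_excerpt', 'url_legal'], tpo=tpo,
              seeds=[0, 42, 888], num_bins=num_bins)
cols = ['license', 'url_legal', 'clean_excerpt', 'excerpt']

# 1) Text features only: no embeddings are computed at all.
oof_text, _ = get_oof_classifier(x_train=train[cols],
                                 embedding_features=None, emb=None, **common)

# 2) Embeddings passed to CatBoost as native embedding features (whole vectors).
oof_emb, _ = get_oof_classifier(x_train=train[cols],
                                embedding_features=['mpnet', 'wikipedia'],
                                emb=['mpnet', 'wikipedia'], tolist=False, **common)

# 3) Embeddings unpacked into separate plain numerical columns.
oof_num, _ = get_oof_classifier(x_train=train[cols],
                                embedding_features=None,
                                emb=['mpnet', 'wikipedia'], tolist=True, **common)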

As an example, here is the code that uses all three options at once:

columns = ['license', 'url_legal', 'clean_excerpt', 'excerpt']

oof_train_cb, models_cb = get_oof_classifier(
    n_folds=FOLDS,
    x_train=train[columns],
    y=train['target_q'].values,
    embedding_features=['mpnet', 'wikipedia'],
    cat_features=['license'],
    text_features=['clean_excerpt', 'url_legal'],
    tpo=tpo,
    seeds=[0, 42, 888],
    num_bins=num_bins,
    emb=['mpnet', 'wikipedia'],
    tolist=True
)

This concludes "How to Use CatBoost in AI". Thank you for reading!
