This article introduces how to implement adversarial validation with Kaggle data. Adversarial validation is a simple way to check whether your training and test sets come from the same distribution, and the walkthrough below takes you from data preparation through diagnosing and fixing the features that give the split away. Follow along below!
Learning an adversarial validation model
First, import some libraries:
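The original import block was not preserved here; a minimal sketch of what this walkthrough needs, assuming pandas, CatBoost, scikit-learn, and matplotlib are installed:

    import pandas as pd
    from catboost import CatBoostClassifier, Pool
    from sklearn.metrics import roc_auc_score, roc_curve
    import matplotlib.pyplot as plt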
Data preparation
For this tutorial, we will use Kaggle's IEEE-CIS fraud detection dataset. First, suppose you have loaded the training and test data into pandas DataFrames and named them df_train and df_test, respectively. Then we do some basic cleanup by filling in the missing values.
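The cleanup code was not preserved either; a minimal sketch, assuming df_train and df_test are already loaded, is to fill numeric gaps with a sentinel value and categorical gaps with a placeholder string:

    # Fill missing values: a sentinel for numeric columns and a
    # placeholder string for categorical (object) columns.
    for df in (df_train, df_test):
        for col in df.columns:
            if df[col].dtype == 'object':
                df[col] = df[col].fillna('<UNK>')
            else:
                df[col] = df[col].fillna(-999)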
For adversarial validation, we want to learn a model that predicts which rows belong to the training set and which belong to the test set. We therefore create a new target column in which test samples are labeled 1 and training samples are labeled 0, like this:
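A sketch of that labeling step (the column name dataset_label is my choice for illustration, not necessarily the original's):

    # Mark each row with its origin: 0 = training set, 1 = test set.
    df_train['dataset_label'] = 0
    df_test['dataset_label'] = 1
    target = 'dataset_label'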
This is the target we train the model to predict. At the moment, the training and test datasets are separate and each one has only a single target value. If we trained a model on this training set alone, it would simply learn that everything is 0. Instead, we want to shuffle the training and test data together and then create new datasets for fitting and evaluating the adversarial validation model. I defined a function for combining, shuffling, and re-splitting:
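A sketch of such a function; the holdout size and the feature list below are illustrative assumptions, not values from the original article:

    def create_adversarial_data(df_train, df_test, cols, n_holdout=50000):
        """Concatenate train and test, then split off a random
        holdout set for evaluating the adversarial model."""
        df_master = pd.concat([df_train[cols], df_test[cols]],
                              axis=0, ignore_index=True)
        holdout = df_master.sample(n_holdout, replace=False)
        rest = df_master.drop(holdout.index)
        return rest, holdout

    # Illustrative feature list; TransactionDT is included deliberately.
    features = ['TransactionAmt', 'TransactionDT', 'ProductCD',
                'card4', 'id_31']
    adversarial_train, adversarial_test = create_adversarial_data(
        df_train, df_test, features + [target])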
The new datasets, adversarial_train and adversarial_test, contain a mix of the original training and test sets, and the target indicates which original dataset each row came from. Note: I have added TransactionDT to the feature list.
For modeling, I will use CatBoost. I prepare the data by putting the DataFrames into CatBoost Pool objects.
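A sketch of that step, continuing from the feature list above; the cat_features argument tells CatBoost which columns to treat as categorical:

    # Columns with object dtype are treated as categorical by CatBoost.
    cat_cols = [c for c in features
                if adversarial_train[c].dtype == 'object']

    train_pool = Pool(adversarial_train[features],
                      label=adversarial_train[target],
                      cat_features=cat_cols)
    holdout_pool = Pool(adversarial_test[features],
                        label=adversarial_test[target],
                        cat_features=cat_cols)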
Modeling
This part is simple: we just instantiate a CatBoost classifier and fit it to our data:
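A minimal sketch; the hyperparameters below are plausible defaults, not the article's original settings:

    params = {
        'iterations': 100,
        'eval_metric': 'AUC',
        'od_type': 'Iter',   # early stopping on the eval set
        'od_wait': 50,
    }
    model = CatBoostClassifier(**params)
    model.fit(train_pool, eval_set=holdout_pool, verbose=False)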
Let's move on and plot the ROC curve on the holdout dataset:
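One way to draw it with scikit-learn and matplotlib (a sketch, not the original plotting code):

    # Score the holdout set and plot the ROC curve.
    preds = model.predict_proba(holdout_pool)[:, 1]
    fpr, tpr, _ = roc_curve(adversarial_test[target], preds)
    auc = roc_auc_score(adversarial_test[target], preds)

    plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
    plt.plot([0, 1], [0, 1], linestyle='--')  # chance diagonal
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.legend()
    plt.show()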
This is a perfect model, which means there is a clear way to tell whether any given record is in the training set or the test set. This violates the assumption that our training and test sets are identically distributed.
Diagnose the problem and iterate
To understand how the model does this, let's look at the most important features:
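CatBoost exposes this directly; a minimal sketch:

    # Print features sorted by their importance to the adversarial model.
    importances = model.get_feature_importance()
    for feat, imp in sorted(zip(features, importances),
                            key=lambda pair: -pair[1]):
        print(f'{feat}: {imp:.2f}')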
TransactionDT is by far the most important feature. That makes perfect sense given that the original training and test sets come from different time periods (the test set comes after the training set). The model has simply learned that if TransactionDT is larger than the last training-set value, the row belongs to the test set.
I included TransactionDT only to illustrate this point; it is generally not recommended to use a raw date as a model feature. The good news is that the technique exposed the problem in such a dramatic way. This kind of analysis can clearly help you catch that mistake.
Let's eliminate TransactionDT and run the analysis again.
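Re-running the pipeline without TransactionDT, a sketch:

    # Drop TransactionDT and repeat the whole fit/evaluate cycle.
    features = [f for f in features if f != 'TransactionDT']
    adversarial_train, adversarial_test = create_adversarial_data(
        df_train, df_test, features + [target])

    cat_cols = [c for c in features
                if adversarial_train[c].dtype == 'object']
    train_pool = Pool(adversarial_train[features],
                      label=adversarial_train[target],
                      cat_features=cat_cols)
    holdout_pool = Pool(adversarial_test[features],
                        label=adversarial_test[target],
                        cat_features=cat_cols)

    model = CatBoostClassifier(**params)
    model.fit(train_pool, eval_set=holdout_pool, verbose=False)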
The ROC curve now looks like this:
It is still a fairly strong model with AUC > 0.91, but much weaker than before. Let's look at the feature importances of this model:
Now id_31 is the most important feature. Let's look at some of its values to see what it is.
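A quick way to inspect it:

    # Show the most frequent raw values of id_31.
    print(df_train['id_31'].value_counts().head(10))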
This column contains the software version number. Obviously, this is conceptually similar to including the original date, because the first appearance of a particular software version corresponds to its release date.
Let's solve this problem by deleting all characters that are not letters in the column:
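One way to do the cleanup, using a regular expression that keeps letters only (a sketch; e.g. 'chrome 63.0' becomes 'chrome'):

    # Strip every non-letter character so version numbers collapse
    # into the bare software name.
    for df in (df_train, df_test):
        df['id_31'] = (df['id_31'].astype(str)
                       .str.replace(r'[^a-zA-Z]', '', regex=True))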
Now, the values of our column are as follows:
Let's train a new adversarial validation model using this cleaned column, re-running the same pipeline as before:
The ROC figure now looks like this:
Performance has dropped from 0.917 AUC to 0.906. That means the model now has a harder time distinguishing our training set from our test set, although it is still fairly strong.
That concludes our study of how to implement adversarial validation with Kaggle data. I hope it has answered your questions; pairing theory with practice is the best way to learn, so go and try it!