
How to use Python to analyze credit card anti-fraud


This article mainly introduces how to use Python to analyze credit card anti-fraud. Many people have questions about this topic in daily work, so the editor has sorted out a simple, easy-to-follow workflow: exploring the dataset, visualizing the features, and comparing three classification models. I hope it helps answer those doubts; please follow along and study!

1. Data sources and project overview

The dataset contains transactions made with credit cards by European cardholders in September 2013. It covers transactions from a two-day period, during which 492 of the 284,807 transactions were fraudulent.

The dataset is highly imbalanced: the positive class (fraud) accounts for only 0.172% of all transactions.

It contains only numeric input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and further background information about the data are not available. Features V1, V2, ..., V28 are the principal components obtained with PCA; the only features not transformed with PCA are 'Time' and 'Amount'.

Time contains the number of seconds elapsed between each transaction and the first transaction in the dataset.

'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning (a small sketch of that idea follows the field descriptions below).

"Class" is a response variable with a value of 1 in the case of fraud, otherwise 0.

2. Prepare and initially view the dataset

# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
plt.style.use('ggplot')
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix
from sklearn.manifold import TSNE

# Load and inspect the data
crecreditcard_data = pd.read_csv('./creditcard.csv')
crecreditcard_data.shape, crecreditcard_data.info()

RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time      284807 non-null float64
V1-V28    284807 non-null float64 (each)
Amount    284807 non-null float64
Class     284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
((284807, 31), None)

# Summary statistics and a first look at the rows
crecreditcard_data.describe()
crecreditcard_data.head()

# Look at the ratio of fraud to non-fraud
count_classes = pd.value_counts(crecreditcard_data['Class'], sort=True).sort_index()
count_classes          # or use count_classes[0] and count_classes[1] to see each count

0    284315
1       492
Name: Class, dtype: int64

count_classes.plot(kind='bar')
plt.show()

0 represents normal and 1 represents fraud. The two classes are severely imbalanced; they are not even in the same order of magnitude.
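The models in section 6 are trained on the data as-is. As a minimal sketch of one common mitigation for this kind of imbalance (not applied in this article), scikit-learn classifiers can reweight the classes inversely to their frequency:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# class_weight='balanced' up-weights the 492 fraud cases so they are not
# drowned out by the 284315 normal transactions during training
lr_balanced = LogisticRegression(class_weight='balanced', max_iter=1000)
rf_balanced = RandomForestClassifier(class_weight='balanced', n_estimators=100)
# These would then be fit exactly like the models in section 6, e.g. lr_balanced.fit(x_train, y_train)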

3. The relationship between fraud and the time distribution

# Descriptive statistics of 'Time' for each class, and the distribution over time
print('Normal')
print(crecreditcard_data.Time[crecreditcard_data.Class == 0].describe())
print('-' * 25)
print('Fraud')
print(crecreditcard_data.Time[crecreditcard_data.Class == 1].describe())

Normal
count    284315.000000
mean      94838.202258
std       47484.015786
min           0.000000
50%       84711.000000
75%      139333.000000
max      172792.000000
Name: Time, dtype: float64
-------------------------
Fraud
count       492.000000
mean      80746.806911
std       47835.365138
min         406.000000
25%       51241.500000
50%       75568.500000
75%      128483.000000
max      170348.000000
Name: Time, dtype: float64

f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12, 6))
bins = 50
ax1.hist(crecreditcard_data.Time[crecreditcard_data.Class == 1], bins=bins)
ax1.set_title('Fraud', fontsize=22)
ax1.set_ylabel('Number of transactions', fontsize=15)
ax2.hist(crecreditcard_data.Time[crecreditcard_data.Class == 0], bins=bins)
ax2.set_title('Normal', fontsize=22)
plt.xlabel('Time (seconds)', fontsize=15)
plt.xticks(fontsize=15)
plt.ylabel('Number of transactions', fontsize=15)
plt.show()

Fraudulent transactions show no clear relationship with time and no obvious periodicity.

Normal transactions show an obvious periodicity, with a roughly bimodal pattern.
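Because 'Time' is only the number of seconds since the first transaction over roughly two days, a quick way to probe this apparent daily pattern is to fold the timestamps into an approximate hour of day. This is a sketch added here, not part of the original code; it assumes the crecreditcard_data frame and matplotlib from section 2, and since the absolute clock time of the first transaction is unknown, the "hour" is only relative.

# Fold elapsed seconds into an approximate hour-of-day value (0-23)
hour = (crecreditcard_data['Time'] // 3600) % 24

# density=True normalizes each histogram so the tiny fraud class stays visible
plt.hist(hour[crecreditcard_data.Class == 0], bins=24, alpha=0.6, density=True, label='Normal')
plt.hist(hour[crecreditcard_data.Class == 1], bins=24, alpha=0.6, density=True, label='Fraud')
plt.xlabel('Approximate hour of day')
plt.ylabel('Relative frequency')
plt.legend()
plt.show()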

4. The relationship between fraud and the transaction amount

# Descriptive statistics of 'Amount' for each class, and the distribution over amount
print('Fraud')
print(crecreditcard_data.Amount[crecreditcard_data.Class == 1].describe())
print('-' * 25)
print('Normal')
print(crecreditcard_data.Amount[crecreditcard_data.Class == 0].describe())

Fraud
count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64
-------------------------
Normal
count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64

f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12, 6))
bins = 30
ax1.hist(crecreditcard_data.Amount[crecreditcard_data.Class == 1], bins=bins)
ax1.set_title('Fraud', fontsize=22)
ax1.set_ylabel('Number of transactions', fontsize=15)
ax2.hist(crecreditcard_data.Amount[crecreditcard_data.Class == 0], bins=bins)
ax2.set_title('Normal', fontsize=22)
plt.xlabel('Amount ($)', fontsize=15)
plt.xticks(fontsize=15)
plt.ylabel('Number of transactions', fontsize=15)
plt.yscale('log')
plt.show()

Transaction amounts are generally low, so on its own this column offers limited discriminative value for the analysis.
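If 'Amount' were kept as a model input, its scale is very different from the PCA components. Below is a small sketch (an addition, not in the original code) of standardizing it with the StandardScaler that was imported, but otherwise unused, in section 2:

from sklearn.preprocessing import StandardScaler

# Scale 'Amount' to zero mean and unit variance so it is comparable to V1-V28
crecreditcard_data['Amount_scaled'] = StandardScaler().fit_transform(
    crecreditcard_data['Amount'].values.reshape(-1, 1)).ravel()
print(crecreditcard_data['Amount_scaled'].describe())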

5. Check the relationship between the independent variables (V1-V28) and the dependent variable

To see whether each variable is related to the normal or fraud label, and to show it intuitively, we plot the class-conditional distribution of each feature with distplot, one by one, as follows:

v_features = [x for x in crecreditcard_data.columns if x not in ['Time', 'Amount', 'Class']]
plt.figure(figsize=(12, 28 * 4))
gs = gridspec.GridSpec(28, 1)
import warnings
warnings.filterwarnings('ignore')
for i, cn in enumerate(crecreditcard_data[v_features]):
    ax = plt.subplot(gs[i])
    sns.distplot(crecreditcard_data[cn][crecreditcard_data.Class == 1], bins=50, color='red')
    sns.distplot(crecreditcard_data[cn][crecreditcard_data.Class == 0], bins=50, color='green')
    ax.set_xlabel('')
    ax.set_title('histogram of feature: ' + str(cn))
plt.savefig('relationship_between_variables_and_class.png', transparent=False, bbox_inches='tight')
plt.show()

Red indicates fraud and green indicates normal.

The larger the overlap between the two distributions, the less that variable distinguishes fraud from normal, for example V15.

The smaller the overlap between the two distributions, the stronger the variable's influence on the dependent variable, for example V14.
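To put a number on how much the two class-conditional distributions overlap, one simple option (an addition, not in the original analysis) is the two-sample Kolmogorov-Smirnov statistic per feature; larger values mean less overlap and therefore better separation:

from scipy.stats import ks_2samp

fraud = crecreditcard_data[crecreditcard_data.Class == 1]
normal = crecreditcard_data[crecreditcard_data.Class == 0]

# KS statistic for each PCA component: higher = the two distributions differ more
ks_scores = {}
for col in [c for c in crecreditcard_data.columns if c.startswith('V')]:
    ks_scores[col] = ks_2samp(fraud[col], normal[col]).statistic

# Show the five most discriminative features (V14 is expected to rank highly)
for col, score in sorted(ks_scores.items(), key=lambda kv: -kv[1])[:5]:
    print(col, round(score, 3))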

Finally, let's look at each single variable on its own and at its correlation with Class. For a more intuitive display, we can draw the graphs directly, as follows:

# Histogram matrix of every variable
crecreditcard_data.hist(figsize=(15, 15), bins=50)
plt.show()
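For the correlation mentioned above, a short additional sketch (not in the original code) draws a heatmap of the full correlation matrix; the 'Class' row and column show each variable's linear correlation with the fraud label:

# Correlation matrix heatmap; look at the 'Class' row/column for the label correlations
corr = crecreditcard_data.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.show()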

6. Modeling and analysis with three methods

In this part, three methods are used to model and analyze the data: logistic regression, random forest, and support vector machine (SVM).

Prepare the data:

# First split the data into a fraud group and a normal group,
# then build the training and test sets proportionally
# Grouping
Fraud = crecreditcard_data[crecreditcard_data.Class == 1]
Normal = crecreditcard_data[crecreditcard_data.Class == 0]

# Training feature set: 70% of each group
x_train = Fraud.sample(frac=0.7)
x_train = pd.concat([x_train, Normal.sample(frac=0.7)], axis=0)

# Test feature set: everything not in the training set
x_test = crecreditcard_data.loc[~crecreditcard_data.index.isin(x_train.index)]

# Label sets
y_train = x_train.Class
y_test = x_test.Class

# Drop the label and time columns from the feature sets
x_train = x_train.drop(['Class', 'Time'], axis=1)
x_test = x_test.drop(['Class', 'Time'], axis=1)

# Check the shapes
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)

(199364, 29) (199364,) (85443, 29) (85443,)
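The split above samples 70% of each class by hand. As an alternative sketch using the train_test_split imported in section 2 (shown only for comparison; the rest of the article keeps the manual split), a stratified split achieves the same effect and preserves the fraud ratio automatically:

from sklearn.model_selection import train_test_split

X = crecreditcard_data.drop(['Class', 'Time'], axis=1)
y = crecreditcard_data['Class']

# stratify=y keeps the ~0.172% fraud ratio identical in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
print(X_tr.shape, X_te.shape)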

6.1 Logistic regression

from sklearn import metrics
import scipy.optimize as op
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score  # sklearn.cross_validation in older versions
from sklearn.metrics import (precision_recall_curve, auc, roc_auc_score,
                             roc_curve, recall_score, classification_report)

lrmodel = LogisticRegression(penalty='l2')
lrmodel.fit(x_train, y_train)

# View the model
print('lrmodel')
print(lrmodel)

lrmodel
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
                   penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
                   verbose=0, warm_start=False)

# View the confusion matrix
ypred_lr = lrmodel.predict(x_test)
print('confusion_matrix')
print(metrics.confusion_matrix(y_test, ypred_lr))

confusion_matrix
[[85284    11]
 [   56    92]]

# View the classification report
print('classification_report')
print(metrics.classification_report(y_test, ypred_lr))

classification_report
             precision    recall  f1-score   support
          0       1.00      1.00      1.00     85295
          1       0.89      0.62      0.73       148
avg / total       1.00      1.00      1.00     85443

# View the prediction accuracy and the area under the ROC curve
print('Accuracy:%f' % (metrics.accuracy_score(y_test, ypred_lr)))
print('Area under the curve:%f' % (metrics.roc_auc_score(y_test, ypred_lr)))

Accuracy:0.999216
Area under the curve:0.810746
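The metrics imports above include roc_curve and auc, although the article never calls them. As an optional sketch, the logistic regression's ROC curve can be drawn from its decision scores rather than from the hard 0/1 predictions:

# ROC curve from continuous decision scores instead of hard predictions
scores_lr = lrmodel.decision_function(x_test)
fpr, tpr, _ = roc_curve(y_test, scores_lr)
plt.plot(fpr, tpr, label='Logistic regression (AUC = %.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')  # chance line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()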

6.2 Random forest

from sklearn.ensemble import RandomForestClassifier

rfmodel = RandomForestClassifier()
rfmodel.fit(x_train, y_train)

# View the model
print('rfmodel')
print(rfmodel)

rfmodel
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

# View the confusion matrix
ypred_rf = rfmodel.predict(x_test)
print('confusion_matrix')
print(metrics.confusion_matrix(y_test, ypred_rf))

confusion_matrix
[[85291     4]
 [   34   114]]

# View the classification report
print('classification_report')
print(metrics.classification_report(y_test, ypred_rf))

classification_report
             precision    recall  f1-score   support
          0       1.00      1.00      1.00     85295
          1       0.97      0.77      0.86       148
avg / total       1.00      1.00      1.00     85443

# View the prediction accuracy and the area under the ROC curve
print('Accuracy:%f' % (metrics.accuracy_score(y_test, ypred_rf)))
print('Area under the curve:%f' % (metrics.roc_auc_score(y_test, ypred_rf)))

Accuracy:0.999625
Area under the curve:0.902009
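The fitted random forest also exposes impurity-based feature importances; a brief optional sketch (not in the original) shows which PCA components the model relies on most:

# Rank features by the random forest's impurity-based importances
importances = pd.Series(rfmodel.feature_importances_, index=x_train.columns)
print(importances.sort_values(ascending=False).head(10))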

6.3 Support vector machine (SVM)

# SVM classifier
from sklearn.svm import SVC

# Note: the raw, unscaled features are used here; SVMs are usually sensitive to feature scale
svcmodel = SVC(kernel='sigmoid')
svcmodel.fit(x_train, y_train)

# View the model
print('svcmodel')
print(svcmodel)

svcmodel
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='sigmoid',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

# View the confusion matrix
ypred_svc = svcmodel.predict(x_test)
print('confusion_matrix')
print(metrics.confusion_matrix(y_test, ypred_svc))

confusion_matrix
[[85197    98]
 [  142     6]]

# View the classification report
print('classification_report')
print(metrics.classification_report(y_test, ypred_svc))

classification_report
             precision    recall  f1-score   support
          0       1.00      1.00      1.00     85295
          1       0.06      0.04      0.05       148
avg / total       1.00      1.00      1.00     85443

# View the prediction accuracy and the area under the ROC curve
print('Accuracy:%f' % (metrics.accuracy_score(y_test, ypred_svc)))
print('Area under the curve:%f' % (metrics.roc_auc_score(y_test, ypred_svc)))

Accuracy:0.997191
Area under the curve:0.519696

This concludes the study of "how to use Python to analyze credit card anti-fraud". I hope it has answered your doubts; pairing theory with practice is the best way to learn, so go and try it yourself. For more related knowledge, please keep following the site, where the editor will continue to bring you practical articles.
