Updated 2025-02-21, from SLTechnology News&Howtos (shulou)
This article explains the principle behind the KS statistic as it is used in risk control modeling with Python.
I. Business background
In financial risk control, the KS statistic is commonly used to measure a model's discrimination, that is, its ability to separate good customers from bad ones. It is also one of the metrics risk modelers care about most. The sections below analyze KS from several angles: the concept of discrimination, how KS is calculated, what it means for the business, its geometric interpretation, and the mathematical idea behind it.
II. An intuitive view of discrimination
In data exploration, to get a rough idea of whether an independent variable x discriminates the dependent variable y, we often split the sample into positives and negatives and compare the variable's distribution in each group. How do we then judge whether the variable is useful? Intuitively, the smaller the overlap between the two distributions, the larger the difference between positive and negative samples, and the better the variable separates them, as shown in Figure 1.
As an analogy, picture the variable as a pair of hands pulling the two distributions apart. The stronger the hands, the farther apart the two probability distributions end up, and the more discriminating the variable is.
Figure 1: Comparison of the distributions of positive and negative samples
    import numpy as np
    import matplotlib.pyplot as plt

    mu = 100    # mean of distribution
    sigma = 15  # standard deviation of distribution
    x = mu + sigma * np.random.randn(20000)
    num_bins = 80
    fig, ax = plt.subplots()
    # the histograms of the data; the second sample is shifted left by 20
    n, bins, patches = ax.hist(x, num_bins, density=1)
    n1, bins1, patches1 = ax.hist(x - 20, num_bins, density=1)
    # add a 'best fit' line (normal density) for each sample
    y = (1 / (np.sqrt(2 * np.pi) * sigma)) * np.exp(-0.5 * (1 / sigma * (bins - mu)) ** 2)
    y1 = (1 / (np.sqrt(2 * np.pi) * sigma)) * np.exp(-0.5 * (1 / sigma * (bins1 - (mu - 20))) ** 2)
    ax.plot(bins, y, '--', label='bads')
    ax.plot(bins1, y1, '-', label='goods')
    ax.set_xlabel('Variable')
    ax.set_ylabel('Probability density')
    ax.set_title('Distribution of bads and goods')
    fig.tight_layout()
    plt.grid(True, linestyle=':', color='r', alpha=0.7)
    plt.legend()
    plt.show()

III. Definition of the KS statistic
The KS (Kolmogorov-Smirnov) statistic was proposed by two Soviet mathematicians, A. N. Kolmogorov and N. V. Smirnov.
In risk control, KS is usually used to evaluate a model's discrimination: the greater the discrimination, the stronger the model's ability to rank risk.
KS is based on the empirical cumulative distribution function (ECDF).
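To make the ECDF-based definition concrete, here is a minimal sketch in plain Python. It computes the largest vertical gap between the ECDF of the bad customers' scores and the ECDF of the good customers' scores. The function names `ecdf` and `ks_statistic` and the score values are hypothetical and chosen only for illustration; they do not come from the article.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F(t) giving the share of `sample` that is <= t."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda t: bisect_right(ordered, t) / n


def ks_statistic(bad_scores, good_scores):
    """Largest vertical gap between the two empirical CDFs."""
    f_bad, f_good = ecdf(bad_scores), ecdf(good_scores)
    # Both ECDFs are step functions that only change at observed values,
    # so it is enough to check the gap at those values.
    points = sorted(set(bad_scores) | set(good_scores))
    return max(abs(f_bad(t) - f_good(t)) for t in points)


# Hypothetical model scores (higher score = lower risk), for illustration only.
bads = [0.10, 0.20, 0.25, 0.40, 0.55]
goods = [0.30, 0.50, 0.60, 0.70, 0.80, 0.90]

ks = ks_statistic(bads, goods)
print(f"KS = {ks:.4f}")
```

Two identical samples give a KS of 0, and two samples that do not overlap at all give a KS of 1, which matches the overlap intuition from Section II.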
IV. KS calculation process and business analysis
A common way to calculate KS:
Step1: bin the variable. You can use equal-frequency bins, equal-width bins, or custom bin edges.
Step2: count the good samples (goods) and bad samples (bads) in each bin.
Step3: for each bin, calculate the cumulative number of good customers as a share of all good customers (cum_good_rate) and the cumulative number of bad customers as a share of all bad customers (cum_bad_rate).
Step4: for each bin, calculate the absolute difference between the cumulative bad rate and the cumulative good rate. These values form the KS curve, that is: KS_i = |cum_bad_rate_i - cum_good_rate_i|
Step5: take the maximum of these absolute differences to get the final KS value of the variable: KS = max(KS_i).
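Steps 1 to 5 above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the article's own code: the helper name `ks_table`, the choice of ten equal-frequency bins, and the synthetic scores are all hypothetical, and labels are assumed to be 1 for bad and 0 for good.

```python
import random


def ks_table(scores, labels, n_bins=10):
    """Build the per-bin KS table; labels use 1 = bad and 0 = good."""
    pairs = sorted(zip(scores, labels))  # Step 1: order by score, lowest first
    n = len(pairs)
    total_bad = sum(labels)
    total_good = n - total_bad
    rows, cum_bad, cum_good = [], 0, 0
    for i in range(n_bins):
        # Equal-frequency bin: each slice holds roughly n / n_bins samples.
        chunk = pairs[i * n // n_bins:(i + 1) * n // n_bins]
        bads = sum(label for _, label in chunk)  # Step 2: counts per bin
        goods = len(chunk) - bads
        cum_bad += bads  # Step 3: cumulative counts, turned into shares below
        cum_good += goods
        cum_bad_rate = cum_bad / total_bad
        cum_good_rate = cum_good / total_good
        rows.append({
            "bin": i + 1,
            "goods": goods,
            "bads": bads,
            "cum_good_rate": cum_good_rate,
            "cum_bad_rate": cum_bad_rate,
            "ks": abs(cum_bad_rate - cum_good_rate),  # Step 4: gap per bin
        })
    return rows


# Synthetic scores for illustration only: bads tend to score lower than goods.
random.seed(0)
scores = ([random.gauss(0.6, 0.15) for _ in range(900)]
          + [random.gauss(0.4, 0.15) for _ in range(100)])
labels = [0] * 900 + [1] * 100

table = ks_table(scores, labels)
ks_value = max(row["ks"] for row in table)  # Step 5: maximum gap
for row in table:
    print(row["bin"], row["goods"], row["bads"],
          f'{row["cum_good_rate"]:.3f}', f'{row["cum_bad_rate"]:.3f}',
          f'{row["ks"]:.3f}')
print(f"KS = {ks_value:.3f}")
```

Because the gap is only evaluated at bin boundaries, a binned KS can be slightly lower than the value obtained from the full empirical distributions; more bins bring the two closer.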
To make this concrete, the process is illustrated with actual data:
Table 1: KS calculation process
Calculation logic for the indicators in the table above:
The table above tells us the following:
1. The higher the model score, the lower the overdue rate, so bad_rate is higher in the low score bands than in the high ones, and the cum_bad_rate curve rises faster than the cum_good_rate curve. The cum_bad_rate curve therefore lies above the cum_good_rate curve.
2. The number of samples in each bin is roughly the same, which indicates equal-frequency binning.
3. If the cutoff is set at 0.65, cum_bad_rate is 82.75%, meaning 82.75% of bad customers would be rejected. At the same time cum_good_rate is 29.69%, meaning 29.69% of good customers would be rejected as well.
4. Judging by the trend in bad_rate, the model ranks risk very well. For an A card (application scorecard), the ranking requirement is higher, because customers need to be priced according to their risk level.
5. The model's KS reaches 53.1%, which indicates strong discrimination. This is the separation achieved at the ideal cutoff. In practice the pass rate has to be weighed against the bad-debt rate under preset constraints, so the cutoff is generally not placed at that ideal point. KS is therefore an upper bound on the discrimination actually achieved.
6. For an A card, KS rarely reaches 52%. If the data in the table above came from an A card, you should check further whether the model is overfitted.
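The arithmetic behind items 3 and 5 can be checked in a few lines. The two rates are the figures quoted in item 3 for the 0.65 cutoff; the variable names are hypothetical. The gap between them rounds to the 53.1% quoted in item 5, which suggests (though the article does not say so explicitly) that this cutoff is where the KS curve peaks.

```python
# Figures quoted in item 3 for a cutoff of 0.65.
cum_bad_rate_at_cutoff = 0.8275   # share of all bad customers rejected
cum_good_rate_at_cutoff = 0.2969  # share of all good customers rejected

# Gap between the two cumulative rates at this cutoff.
gap = cum_bad_rate_at_cutoff - cum_good_rate_at_cutoff
# Share of good customers that would still be approved.
good_pass_rate = 1 - cum_good_rate_at_cutoff

print(f"bads rejected:  {cum_bad_rate_at_cutoff:.2%}")
print(f"goods rejected: {cum_good_rate_at_cutoff:.2%}")
print(f"goods approved: {good_pass_rate:.2%}")
print(f"gap at cutoff:  {gap:.2%}")
```

Moving the cutoff away from this point trades rejected bads against rejected goods, and the gap can only shrink, which is the sense in which KS is an upper bound.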
Note also that KS is evaluated on the sample of loans actually granted, which is always biased relative to the full applicant population. For a system with essentially no risk controls ("running naked"), the bias is small; conversely, the better the risk control system performs, the larger the bias. KS is therefore not just a number: many factors lie behind it, and it has to be interpreted together with the business context.
When KS is disappointing, the following checks can help:
1. Check whether the input variables are already used by the policy rules. Reusing variables the policy already uses prevents the model from catching the bad customers it should have caught, which lowers its measured performance.
2. Test whether the customer populations in the training and validation samples differ markedly, for example in time distribution, the distribution of certain features, or hits on special features.
3. Develop new features targeted at the scenario. For a tax scenario, focus on tax indicators when deriving features; to identify long-term risk, use variables with strong financial attributes; for fraud risk, use short-term negative variables.
4. Build separate models for customer segments, while watching stability and between-segment differences to prevent overfitting.
5. Analyze bad customers case by case and try to generalize from individual cases to common patterns.
Plotting the data in Table 1 gives the KS curve. The plot uses the last three columns of the table: cum_good_rate, cum_bad_rate, and KS. The code and the resulting figure are as follows:
    import matplotlib.pyplot as plt
    import numpy as np

    # Values from Table 1. This copy of the article preserves only part of each
    # array (the source text ends with a garbled tail "0.92, 0.90, 0.97"), so
    # complete both arrays from the table, one value per bin, before running.
    cum_good_rate = np.array([0.00, 0.05, 0.12, 0.20, 0.30])  # remaining values missing
    cum_bad_rate = np.array([0.26, 0.45, 0.59])  # remaining values missing
    # x is not defined in the surviving text; evenly spaced bin positions are assumed.
    x = np.linspace(0, 1, len(cum_good_rate))

    plt.plot(x, cum_good_rate, label='cum_good_rate')
    plt.plot(x, cum_bad_rate, label='cum_bad_rate')
    plt.plot(x, cum_bad_rate - cum_good_rate, label='KS')
    plt.title('KS Curve', fontsize=16)
    plt.grid(True, linestyle=':', color='r', alpha=0.7)
    plt.axhline(y=0.53, c='r', ls='--', lw=3)  # horizontal reference line, parallel to the x-axis
    plt.axvline(x=0.43, c='r', ls='-', lw=3)   # vertical reference line, parallel to the y-axis
    plt.legend()
    plt.show()
Figure 2-KS curve
So far we have covered the basic KS calculation process, evaluation criteria, business guidance, and optimization ideas. Several questions remain:
1. Why is KS so often used to evaluate model performance in risk control, rather than accuracy, recall, and similar metrics?
2. The maximum KS is only a summary figure. How does model performance differ when the maximum occurs at different cutoffs?
3. In general, the larger the KS the better, so why is a KS above 75% usually considered unreliable?
V. Why KS is chosen in risk control
In risk control modeling, sample labels are usually divided into four classes: G = Good (good customers, labeled 0), B = Bad (bad customers, labeled 1), I = Indeterminate (has not yet entered the performance period), and X = Exclusion (abnormal samples that are excluded).
Note that the boundary between Good and Bad is often fuzzy and continuous and depends on the actual business requirements. Two examples illustrate this:
Example 1: fuzziness
Consider a 12-period credit product. If the performance period is set to the first six periods, S6D15 (any of the first six periods overdue by 15 days) gives label 1, otherwise 0. If the performance period is later shortened to three periods, a sample that repaid normally for the first three periods and was more than 15 days overdue only in periods 4 to 6 changes from label 1 to label 0. Different business requirements thus produce different labels, so the definition is not absolute. The definition of good and bad samples must follow from actual business needs and be settled after thoroughly understanding and analyzing the business, not decided off the top of one's head.
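The effect of the performance period on the label can be sketched as a small function. This is an illustration under assumptions: the function name `label_sample`, the repayment history, and the choice to treat "overdue by 15 days" as 15 days or more are all hypothetical.

```python
def label_sample(days_overdue_by_period, performance_periods, overdue_days=15):
    """Return 1 (bad) if any period inside the performance window is overdue
    by at least `overdue_days` days, otherwise 0 (good).

    Assumption: "overdue by 15 days" is read as 15 days or more.
    """
    window = days_overdue_by_period[:performance_periods]
    return int(any(days >= overdue_days for days in window))


# Hypothetical 12-period loan: on time in periods 1-3, 20 days late in periods 4-6.
history = [0, 0, 0, 20, 20, 20, 0, 0, 0, 0, 0, 0]

label_6 = label_sample(history, performance_periods=6)  # window covers periods 1-6
label_3 = label_sample(history, performance_periods=3)  # window covers periods 1-3
print(label_6, label_3)
```

The same repayment history is labeled bad under a six-period window and good under a three-period window, which is the fuzziness the example describes.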
Example 2: continuity
Define a first-period delinquency of more than 30 days as 1, otherwise 0. Yet there is no hard, uncrossable gap between a user who is 29 days overdue and one who is 31 days overdue: a user who is 29 days overdue may well deteriorate to 31 days overdue.
Because the definition of delinquency severity is itself somewhat subjective, it is hard to say how much real difference a few days overdue make. So even though we draw a hard boundary between 1 and 0 to turn the task into a classification problem, from a business perspective it remains a continuous one.
In risk control, then, y is not black and white (discrete); it may be more reasonable to describe it with a probability distribution (continuous).
So why choose KS? KS measures the difference between positive and negative samples from a probabilistic perspective. Because the boundary between positive and negative samples is fuzzy and continuous, KS is itself a continuous curve. Taking its maximum at the end serves mainly to extract one salient feature of the curve so that models can be compared easily.
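The idea of a curve over all cutoffs, summarized by its maximum, can be sketched as follows. The comparison with accuracy at a single cutoff is added here only for illustration and is not part of the article's argument; the scores, cutoffs, and function names are hypothetical.

```python
def cum_rate(scores, cutoff):
    """Share of `scores` at or below `cutoff`, i.e. customers rejected there."""
    return sum(s <= cutoff for s in scores) / len(scores)


def accuracy(bad_scores, good_scores, cutoff):
    """Share of customers handled correctly: bads rejected plus goods approved."""
    bads_rejected = sum(s <= cutoff for s in bad_scores)
    goods_approved = sum(s > cutoff for s in good_scores)
    return (bads_rejected + goods_approved) / (len(bad_scores) + len(good_scores))


# Hypothetical scores (higher = lower risk): 2 bads and 8 goods.
bad_scores = [0.2, 0.4]
good_scores = [0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 0.9, 0.95]
cutoffs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# KS curve: gap between the cumulative bad and good rates at every cutoff.
curve = [cum_rate(bad_scores, c) - cum_rate(good_scores, c) for c in cutoffs]
ks = max(curve)
best_cutoff = cutoffs[curve.index(ks)]

# Rejecting nobody (cutoff 0.0) still scores well on accuracy in an
# imbalanced sample, although it separates nothing.
accuracy_reject_nobody = accuracy(bad_scores, good_scores, 0.0)

print(f"KS = {ks:.3f} at cutoff {best_cutoff}")
print(f"accuracy when rejecting nobody = {accuracy_reject_nobody:.2f}")
```

In this toy sample, approving everyone is 80% accurate while its gap on the KS curve is zero. Accuracy is tied to one cutoff and to the class ratio, whereas the KS curve compares the two groups at every cutoff and reports the best separation.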