2025-04-05 Update From: SLTechnology News & Howtos
Shulou(Shulou.com)06/02 Report--
This article introduces how to implement the RFM user analysis model in Python. Many people run into trouble with this in real projects, so let the editor walk you through how to handle these situations. I hope you read it carefully and come away with something!
Take a look: the source data looks like this:
After learning the steps below, just hit Enter and the data will look like this:
Impressed? OK, enough talk, let's get started!
RFM is a classic user segmentation and value analysis model. It is known for its bluntness, so blunt that the fields it needs are written right into its name. Say it with me: "R! F! M!":
R, Recency: how many days since each customer's last purchase, i.e. the number of days elapsed since their most recent order.
F, Frequency: how many times each customer has purchased.
M, Monetary: the average purchase amount per customer. It can also be the cumulative purchase amount.
These three dimensions are the essence of the RFM model. They help us divide messy customer data into eight standard categories, and then, based on each category's characteristics such as its share of users and its contribution to revenue, carry out refined operations that match people, goods, and channels.
Using Python to build the RFM model, the overall approach breaks into five steps. In a phrase, "five steps in hand, you have the model": data overview, data cleaning, dimension scoring, score calculation, and customer tiering.
01 Data overview
Our source data is the order table, which records the fields related to the user's transaction:
One detail to note: each row of the order table represents a single order placed by a user. What does that mean? If a user places four orders in one day, the order table records four rows, whereas in the actual business scenario those multiple purchases in a single day should be treated as one consumption behavior.
For example, I bought a pizza voucher at 10:00 today at Pizza Hut Tmall, ordered another beverage coupon at 11:00, and bought two more ice cream coupons at 18:00. Although I have placed three orders in this day, I will eventually consume these coupons at one time, which should only be counted as a complete consumption behavior. This logic will guide the later calculation of F value.
We find that in the order status column, besides successful transactions, there are also transactions closed due to user refunds. Are there any other statuses? Let's take a look:
There are only these two statuses. Refund orders add little value to our model and need to be removed in the upcoming cleaning.
Then observe the type and missing of the data:
The order table has 28,833 rows in total, with no missing values. Nice! As for types, the payment date is in datetime format; the actual amount, postage, and purchase quantity are numeric; everything else is string type.
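The overview step can be sketched as follows. This is a minimal sketch on hypothetical sample data; the column names (`buyer`, `pay_date`, `amount`, `status`) and status strings are assumptions, not the original field names:

```python
import pandas as pd

# Tiny hypothetical stand-in for the real order table
orders = pd.DataFrame({
    'buyer': ['Alice', 'Alice', 'Bob', 'Cara'],
    'pay_date': pd.to_datetime(['2019-06-01', '2019-06-10',
                                '2019-06-05', '2019-06-20']),
    'amount': [100.0, 50.0, 200.0, 80.0],
    'status': ['success', 'success', 'refund_closed', 'success'],
})

# Overview: shape, missing values, and the distinct order statuses
print(orders.shape)               # rows x columns
print(orders.isnull().sum())      # per-column missing-value counts
print(orders['status'].unique())  # only two statuses in this data
```

`orders.info()` gives the same dtype and missing-value overview in one call.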
02 Data cleaning
Eliminate refund
During the observation phase, we made it clear that the first goal of cleaning is to eliminate the refund data:
Key field extraction
After excluding refunds, the order table still has many fields, while the RFM model only needs three key ones: buyer nickname, payment time, and actual payment amount. So extract them:
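The two cleaning moves above, dropping refunds and keeping only the three key fields, can be sketched like this (hypothetical column names and status strings, as before):

```python
import pandas as pd

# Hypothetical order table (field names are assumptions)
orders = pd.DataFrame({
    'buyer': ['Alice', 'Bob', 'Cara'],
    'pay_date': pd.to_datetime(['2019-06-01', '2019-06-05', '2019-06-20']),
    'amount': [100.0, 200.0, 80.0],
    'postage': [0.0, 10.0, 0.0],
    'status': ['success', 'refund_closed', 'success'],
})

# 1) Eliminate refund orders: keep only successful transactions
clean = orders[orders['status'] == 'success']

# 2) Extract the three fields the RFM model needs
clean = clean[['buyer', 'pay_date', 'amount']].reset_index(drop=True)
print(clean)
```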
Key field construction
The basic cleaning above is done. The key to this step lies in constructing the three fields the model needs: R (days since the last purchase), F (number of purchases), and M (average or cumulative purchase amount).
First is the R value: how many days since each user's last purchase. If the user placed only one order, subtract its payment date from the current date; if the user placed multiple orders, filter out the user's last payment time and subtract that from today.
A reminder about how time comparison works: in datetime format, the closer a date is to today, the larger it is. For example, September 9, 2019 is greater than September 1, 2019:
Therefore, to get the last payment time for all users, simply group by buyer's nickname and select the maximum value of the payment date:
To get the final R value, subtract each user's last payment time from "today". These orders run up to July 1, so here we treat "2019-07-01" as today:
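A minimal sketch of the R-value computation, on hypothetical data:

```python
import pandas as pd

# Hypothetical cleaned orders (names are assumptions)
clean = pd.DataFrame({
    'buyer': ['Alice', 'Alice', 'Bob'],
    'pay_date': pd.to_datetime(['2019-06-01', '2019-06-28', '2019-06-15']),
})

# Last payment date per buyer: group by nickname, take the max
r = clean.groupby('buyer')['pay_date'].max().reset_index()

# Treat 2019-07-01 as "today", as in the article, and count the days
today = pd.to_datetime('2019-07-01')
r['R'] = (today - r['pay_date']).dt.days
print(r)
```

Alice's last order was June 28, so her R is 3; Bob's last order was June 15, so his R is 16.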
Then let's get the F value, that is, the cumulative purchase frequency of each user.
In the data overview phase we settled on the idea of treating a single user's multiple orders in one day as a whole, so we introduce a day-accurate date label, group by buyer nickname and date label to merge each user's same-day orders, and then count the number of purchases:
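The F-value step can be sketched as below. Counting distinct purchase dates per buyer (`nunique` on a day-accurate label) implements "same-day orders count once"; the data and column names are hypothetical:

```python
import pandas as pd

clean = pd.DataFrame({
    'buyer': ['Alice', 'Alice', 'Alice', 'Bob'],
    'pay_date': pd.to_datetime(['2019-06-01 10:00', '2019-06-01 18:00',
                                '2019-06-28 09:00', '2019-06-15 12:00']),
})

# Day-accurate date label: multiple orders on the same day collapse to one
clean['date'] = clean['pay_date'].dt.date

# F = number of distinct purchase days per buyer
f = clean.groupby('buyer')['date'].nunique().reset_index(name='F')
print(f)
```

Alice placed two orders on June 1 and one on June 28, so her F is 2, not 3.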
In the previous step, the purchase frequency of each user is calculated. Here, we only need to get the total amount of each user, and then divide the total amount by the purchase frequency to get the average amount paid by the user:
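The M value follows directly: total amount per buyer divided by the purchase frequency. A sketch with hypothetical data (the F values are hard-coded here to stand in for the previous step's output):

```python
import pandas as pd

clean = pd.DataFrame({
    'buyer': ['Alice', 'Alice', 'Bob'],
    'amount': [100.0, 50.0, 200.0],
})
# F values from the previous step (hard-coded for this sketch)
f = pd.DataFrame({'buyer': ['Alice', 'Bob'], 'F': [2, 1]})

# Total amount per buyer, then M = total / purchase frequency
total = clean.groupby('buyer')['amount'].sum().reset_index(name='total')
m = total.merge(f, on='buyer')
m['M'] = m['total'] / m['F']
print(m)
```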
Finally, like ten thousand swords returning to their master, the three indicators are merged:
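The merge itself is a pair of joins on the buyer nickname. A sketch, with the R/F/M tables hard-coded to stand in for the earlier steps:

```python
import pandas as pd

# R, F, M computed in the previous steps (hard-coded for this sketch)
r = pd.DataFrame({'buyer': ['Alice', 'Bob'], 'R': [3, 16]})
f = pd.DataFrame({'buyer': ['Alice', 'Bob'], 'F': [2, 1]})
m = pd.DataFrame({'buyer': ['Alice', 'Bob'], 'M': [75.0, 200.0]})

# Merge the three indicators into one table keyed by buyer nickname
rfm = r.merge(f, on='buyer').merge(m, on='buyer')
print(rfm)
```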
At this point, we have finished computing the model's core indicators, which can be regarded as tidying the house before inviting guests.
03 Dimension scoring
The core of this step is determining the scores. We rate each customer's R/F/M values according to set criteria, and the criteria reflect our preferences: the more we like a behavior, the higher the score it gets:
Take the R value as an example. R represents how many days it has been since the user last placed an order; the higher the value, the greater the risk of churn. Since we do not want to lose users, the larger the R, the lower the score.
The F value represents the user's purchase frequency, and the M value is the user's average payment amount. For both, bigger is better: the larger the value, the higher the score.
In the RFM model, scoring generally uses a 5-point scale, and there are two common approaches: one is to score by data quantiles, the other is to divide scores based on an understanding of the data and the business. Here, I hope readers deepen their understanding of the data and set scores themselves, so this walkthrough uses the second approach: the score for each range of values is worked out in advance.
According to the experience of the industry, the R value is set to a span of 30 days, and the interval is left closed and right open:
The F value is linked to the purchase frequency. For each additional purchase, the score will be increased by one point:
We can first do simple interval statistics on the M value and then bin it. Here we divide it into bands of 50 yuan:
In this step, we have established a scoring framework, and each indicator of each user has a corresponding score.
04 Score calculation
The scoring logic is settled, but implementing it looks a bit troublesome. Next, let's invite Master Pandas onto the stage and watch him dispatch this grouping logic in a couple of moves. Take the R value as a sample:
True mastery shows itself in hard times: just one line of code implements the 5-level scoring. Let's review pandas's cut function:
The first parameter is passed in the data column to be split.
The bins parameter gives the interval edges for grouping. We determined above that the R value is grouped in intervals of 30 days, so we pass [0, 30, 60, 90, 120, 1000000]. The last value is set very large to give the grouping fault tolerance and accommodate extreme values.
right indicates whether the right endpoint is closed, i.e. whether the right-hand value is included. If set to False, the interval is [0, 30): data on the left edge is included but not the right. If set to True, it is (0, 30]: the right edge is included but not the left.
The labels array echoes the bins. What does that mean? bins has 6 edges, which form 5 groups, and labels names each group: since a smaller R deserves a higher score, [0, 30) is labeled 5, [30, 60) is labeled 4, and so on.
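Putting the three parameters together, the R-score line can be sketched on hypothetical data like this:

```python
import pandas as pd

rfm = pd.DataFrame({'R': [5, 45, 75, 100, 130, 400]})

# bins has 6 edges -> 5 groups; the huge last edge gives fault tolerance.
# right=False makes intervals left-closed, right-open: [0,30), [30,60)...
# labels run 5 down to 1: the more recent the purchase, the higher the score.
rfm['R-SCORE'] = pd.cut(rfm['R'], bins=[0, 30, 60, 90, 120, 1000000],
                        right=False, labels=[5, 4, 3, 2, 1])
print(rfm)
```

An R of 5 falls in [0, 30) and scores 5; 130 and 400 both fall in [120, 1000000) and score 1.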
Then, the F and M values are very easy, just split them according to the values we set:
The first round of scoring has been completed, let's move on to the second round of scoring.
Don't be alarmed, dear reader: yes, there is more than one round. The RFM model isn't that casual.
Now R-SCORE, F-SCORE, and M-SCORE are each a number between 1 and 5. If we combined the three directly, e.g. 111, 112, 113..., we would get 125 possible results (5 × 5 × 5), and over-classification is essentially as useless as no classification. Therefore, we simplify by judging whether each customer's R, F, and M scores are greater than their respective averages.
Because comparing each customer against the average yields only two outcomes for each of R, F, and M (0 means below average, 1 means above), the overall combination gives just 8 groups, which is far more manageable. Let's determine whether each of the user's scores is greater than the mean:
In Python, the comparison returns True or False, corresponding to the values 1 and 0. Simply multiplying the Boolean result by 1 turns True into 1 and False into 0, which is easier to read:
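The comparison-and-multiply trick in one small sketch (scores are hypothetical):

```python
import pandas as pd

rfm = pd.DataFrame({'R-SCORE': [5, 2, 4],
                    'F-SCORE': [1, 3, 5],
                    'M-SCORE': [2, 2, 5]})

for col in ['R-SCORE', 'F-SCORE', 'M-SCORE']:
    # Boolean comparison * 1: True -> 1, False -> 0
    rfm[col + ' > mean'] = (rfm[col] > rfm[col].mean()) * 1
print(rfm)
```

With these scores the R mean is about 3.67, so the R flags come out [1, 0, 1]; the F and M means are both 3, giving [0, 0, 1] for each.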
05 Customer tiering
Review the previous steps: after cleaning, we determined the scoring logic, computed each user's R, F, and M scores (SCORE), and compared each score with its average to get three columns of above-average flags. At this point all the data the model needs is ready; all that's left is customer tiering.
RFM's classic layering divides users into eight categories according to whether each R/F/M indicator is higher than the average. We summarize it, as shown in the following table:
Due to the traditional naming, some of the labels are somewhat convoluted; for instance, most categories are prefixed with "important". What is the difference between "potential" and "deepening"? Between "recall" and "rescue"?
In the spirit of clarity first, we made some improvements to the original names. The emphasis: "potential" targets consumption (average payment amount), "deepening" targets purchase frequency, and "important recall" customers closely resemble important value customers except that they haven't repurchased recently, so a churn warning is in order, and so on. This is just one idea thrown out to spark better ones; in short, everything is in the service of being easier to understand.
For each customer type we also give a brief interpretation. Important value customers, for example, purchased recently, purchase frequently over their whole customer lifecycle, and pay a high average amount each time. The other categories follow the same logic, and reading the interpretations helps cement understanding. Next, let's implement this classification in Python.
First, introduce an auxiliary "crowd value" column by concatenating the three flags for whether R/F/M exceed their means.
The crowd value is numeric, so leading zeros are automatically dropped: for example, 1 represents "001", the high-spend win-back customers, and 10 corresponds to "010", the general customers.
In order to get the final population tag, define a judgment function to return the corresponding classification label by judging the value of the population value:
Finally, the label classification function is applied to the crowd numerical column:
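The crowd value and labeling steps can be sketched as below. The flag data is hypothetical, and the English tier names are my loose renderings of the article's improved naming scheme, not official terms; adjust the mapping to your own labels:

```python
import pandas as pd

rfm = pd.DataFrame({'R>mean': [1, 0, 1, 0],
                    'F>mean': [1, 1, 0, 0],
                    'M>mean': [1, 1, 0, 1]})

# Concatenate the three 0/1 flags into a numeric crowd value;
# as a number, leading zeros vanish ("011" -> 11, "001" -> 1)
rfm['crowd'] = rfm['R>mean'] * 100 + rfm['F>mean'] * 10 + rfm['M>mean']

# One possible mapping of the 8 combinations to tier labels
def tier(value):
    mapping = {111: 'important value', 110: 'consumption potential',
               101: 'frequency deepening', 100: 'new customer',
               11: 'important recall', 10: 'general maintenance',
               1: 'high-spend win-back', 0: 'churned'}
    return mapping[value]

# Apply the label function to the crowd value column
rfm['tier'] = rfm['crowd'].apply(tier)
print(rfm)
```

Building the crowd value arithmetically (×100, ×10, ×1) rather than by string concatenation keeps the column numeric, matching the article's note about dropped leading zeros.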
With customer classification complete, the RFM modeling is done, and every customer now has their own RFM tag.
Analysis of RFM model results
In fact, to the last step, we have completed the whole modeling process, but all the model results eventually have to serve the business, so finally, we do some expansion and exploratory analysis based on the existing model results.
Check the proportion of various types of users:
Explore the contribution proportion of different types of customers' consumption amount:
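Both explorations are one-liners in pandas. A sketch with hypothetical tiers and amounts (`total` stands for each customer's cumulative payment, an assumed column name):

```python
import pandas as pd

rfm = pd.DataFrame({
    'tier': ['churned', 'churned', 'important value', 'high-spend win-back'],
    'total': [30.0, 40.0, 500.0, 300.0],  # cumulative payment per customer
})

# Share of customers per tier
user_share = rfm['tier'].value_counts(normalize=True)

# Share of total payment amount contributed by each tier
amount_share = rfm.groupby('tier')['total'].sum() / rfm['total'].sum()

print(user_share)
print(amount_share.sort_values(ascending=False))
```

Comparing the two series side by side surfaces exactly the kind of gap discussed below, e.g. a tier holding half the customers but a small slice of the revenue.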
Visualization of the result (the visualization code is left for everyone to try):
From the above results, we can quickly draw some inferences:
Customer churn is grim: high-spend win-back customers and churned customers together account for more than 50%. Formulating a targeted recall strategy is extremely urgent.
Important value customers make up only 2.97%, and three of the categories each account for even less than 2%. Our model's scoring may not be scientific enough; we can further adjust the score ranges to optimize.
...
Combined with the amount of money for analysis:
High-spend win-back customers account for 28.87% of users, and their share of the amount rises to 38.11%. These customers are the mainstay of consumption; why they churned should be mined further in combination with order and purchase-behavior data.
Close behind in amount contribution are frequency-deepening customers, characterized by recent purchases, low purchase frequency, and high spend; they differ from high-spend win-back customers only in purchase recency. How to keep these customers from drifting into the high-spend win-back group is the main question we need to think about.
Churned customers make up 26.28% of users but contribute only 12.66% of the amount. How many of them are one-off bargain hunters and how many are genuine target users? How should this guide adjustments to our customer-acquisition strategy?
This concludes "How Python implements the RFM user analysis model". Thank you for reading. If you want to learn more about the industry, follow this site; the editor will keep publishing practical, high-quality articles for you!