In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-02-25 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >
Share
Shulou(Shulou.com)06/01 Report--
This article mainly explains "the classic group example analysis in the database". The content of the explanation in the article is simple and clear, and it is easy to learn and understand. Please follow the editor's train of thought. Let's study and learn the classic group example analysis in the database.
Theoretical review
Simultaneous group analysis is a classical thinking of hin in data analysis. The core is to divide users into different groups according to the occurrence time of initial behavior, and then analyze how the behavior of similar groups changes with time. This is typically done through a retention table like this:
Each line represents the retention of new customers in the month in the next few months.
Through horizontal comparison, we can have a preliminary understanding of customer retention and life cycle. Based on the vertical observation, we can find the differences in the retention of customers in different periods, so as to deduce whether the customers introduced in this period are accurate.
This table looks simple and clear, and there are some mature tools that can be implemented, but it takes a lot of brains to implement it in Python based on order data.
Data overview
First, import the order data and take a look at what the source data looks like:
The key fields that will be used in subsequent analysis are customer nickname, payment time, order status and payment amount.
Then check the amount of data and missing:
The order has a total of 42713 lines, all of which are complete (excluding missing values) except for the time of payment. The overall format is regular, the payment time is in datetime format, and the purchase amount and quantity are numerical.
Data cleaning
The point of cleaning is to figure out why so much payment time is missing. Let's first filter out the lines where the payment time is null and find out:
It seems that the lack of payment time data, the order status is mainly the failure of the transaction. A preliminary inference is made here that the reason for the lack of payment time is that there is no actual transaction.
Sure enough, the order without payment time is the failure status of the transaction, and the complete data is the success of the transaction. Next, you only need to filter out the successful orders. 40339 lines of data is the main battlefield of group analysis for the same period.
Train of thought analysis
Let the original retention table brush with a sense of existence:
It takes a lot of hair to think directly about how to generate this table at one time. A more reasonable way is to disassemble the table with the idea of building blocks. Each row of this table represents a contemporaneous group, and their essential logic is the same.
First, calculate the number of new customers in that month and record customer nicknames.
Then take this part of the customers to match with the customers purchased every month, and count how many customers have repurchased (retained).
As long as we calculate the monthly new customers and corresponding retention, and put these data together, we will get the coveted group retention table for the same period.
One-month realization
Following the train of thought of the previous step, the problem becomes simple, realizing the computing logic of one month, and can be applied in other months. Add a column to the group retention table for the same period to mark the year and month in which the data belongs.
We take the data of October 2019 as a model to realize single-row simultaneous group analysis.
Obviously, there were 7336 customers and 8096 orders in October 2019.
Next, we want to calculate the number of new customers per month, which needs to be matched with the previous month traversal to verify. The customers before October 2019 are the data of September 2019:
Match with historical data, verify and filter out the number of new customers added in October 2019:
Then, match the monthly customer nicknames after October, calculate the monthly retention, and add the new customers in the first month to the list:
7083 new customers were added in October 2019, 539 were retained the following month (November), and then decreased, while the number of retained repo customers increased slightly from the previous month in February 2020.
The final version code # introduces the time label month_lst = order ['time tag'] .unique () final = pd.DataFrame () for i in range (len (month_lst)-1): # to construct a list as long as the month to facilitate the subsequent uniform format of count = [0] * len (month_lst) # to filter out orders for the current month Group by customer nickname target_month = order.loc [order ['time tag'] = = month_lst [I],:] target_users = target_month.groupby ('customer nickname') ['payment amount'] .sum (). Reset_index () # if it is the first month If I = = 0:new_target_users = target_month.groupby ('customer nickname') ['payment amount'] .sum () .reset_index () else:# if not Find the previous historical order history = order.loc [order ['time tag'] .isin (month_lst [: I]),:] # filter out the new customers new_target_users = target_ users. Loc. [target _ users ['customer nickname'] .isin (history ['customer nickname']) = = False,:] # the number of new customers in the current month is placed in the first value count [0] = len (new_target_users) # in months Loop traversal Calculate for jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjct in zip (range (I + 1jjcelen (month_lst)), range (1jjcelen (month_lst)): # order for the next month next_month = order.loc [order ['time tag'] = = month_ LST [j] :] next_users = next_month.groupby ('customer nickname') ['payment amount'] .sum (). Reset_index () # calculate the number of customers remaining in the month isin = new_target_users ['customer nickname'] .isin (next_users ['customer nickname']) .sum () count [ct] = isin# format transpose result = pd.DataFrame ({month_lst [I]: count}) .T # merge final = pd.concat ([final] Result]) final.columns = ['add this month','+ January','+ February','+ March','+ April','+ May']
However, the real data is in the form of retention rate, and you can do a little more processing:
Finally, it's done! The group analysis table that we want to achieve at the same time is realized. A brief glance shows that:
Horizontal observation, the next month loss is serious, the best performance of the next month retention is only 12%, and then a steady decline, stable at about 6%.
In vertical comparison, the number of new customers in 2019 was the least, only 2042, but the population was relatively accurate and the retention rate was better than that in other months.
Thank you for your reading. the above is the content of "classic simultaneous group example analysis in the database". After the study of this article, I believe you have a deeper understanding of the classic simultaneous group example analysis in the database. The specific use of the situation also needs to be verified by practice. Here is, the editor will push for you more related knowledge points of the article, welcome to follow!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.