How to build data with Pandas 04/19 Update SLTechnology News&Howtos

How to build data with Pandas

2025-04-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains in detail how to build data with Pandas, with each step laid out clearly. I hope it helps answer your questions.

Build data

The data used in this case was simulated by the editor and consists of two data sets: order data and fruit information data. The two data sets are merged later.

import pandas as pd
import numpy as np
import random
from datetime import *
import time

import plotly.express as px
import plotly.graph_objects as go
import plotly as py
# draw subplots
from plotly.subplots import make_subplots

1. Time field
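The code for this step is not reproduced in the text. As a minimal sketch, assuming the order times are random dates between 2019 and 2021 (the period analysed later), the time field could be built as follows. The number of orders and the seed are assumptions.

```python
import numpy as np
import pandas as pd

# Assumption: 1,000 orders spread over 2019-01-01 .. 2021-12-31.
rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable
n_orders = 1000

start = pd.Timestamp("2019-01-01")
end = pd.Timestamp("2021-12-31")
n_days = (end - start).days + 1  # number of calendar days in the window

# Draw a random day offset for every order and add it to the start date.
offsets = rng.integers(0, n_days, size=n_orders)
time_range = start + pd.to_timedelta(offsets, unit="D")
```

The result is a DatetimeIndex, so the .dt accessors used later work once it becomes a DataFrame column.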

2. Fruits and users
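The lists for this step are also missing from the text. The sketch below is one way to draw a random fruit and a random customer for every order. The fruit names are hypothetical (the article only shows that there are eight prices), and only xiaoming, Mike and Michk are user names that appear later in the article.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
n_orders = 1000  # assumption: must equal the length of the time field

# Hypothetical fruit names, eight entries to match the eight prices used later.
fruits = ["apple", "pear", "grape", "banana", "cherry", "orange", "peach", "mango"]
# xiaoming, Mike and Michk come from the article; the other names are made up.
names = ["xiaoming", "Mike", "Michk", "Jimmy", "Tom"]

# random.choices samples with replacement: one pick per order.
fruit_list = random.choices(fruits, k=n_orders)
name_list = random.choices(names, k=n_orders)
```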

3. Generate order data

order = pd.DataFrame({
    "time": time_range,  # order time
    "fruit": fruit_list,  # fruit name
    "name": name_list,  # customer name
    # purchase quantity
    "kilogram": np.random.choice(list(range(50, 100)), size=len(time_range), replace=True)
})
order

4. Generate information data of fruits.

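information = pd.DataFrame({
    "fruit": fruits,
    "price": [3.8, 8.9, 12.8, 6.8, 15.8, 4.9, 5.8, 7],
    # Note: as published, this list has seven regions against eight prices.
    # All columns must have the same length, so the missing region has to be added before this runs.
    "region": ["South China", "North China", "Northwest", "Central China", "North China", "South China", "North China"]
})
information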

5. Data merging

The order information and fruit information are directly combined into a complete DataFrame, and this df is the data to be processed next.
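The merge itself is not shown. A minimal sketch, assuming the two tables share the "fruit" column, uses pd.merge. The small tables here are illustrative stand-ins, not the article's data.

```python
import pandas as pd

# Tiny stand-ins for the order table and the fruit information table.
order = pd.DataFrame({
    "time": pd.to_datetime(["2019-01-05", "2020-03-10", "2021-07-21"]),
    "fruit": ["apple", "pear", "apple"],
    "name": ["xiaoming", "Mike", "Michk"],
    "kilogram": [60, 75, 90],
})
fruit_info = pd.DataFrame({
    "fruit": ["apple", "pear"],
    "price": [3.8, 8.9],
    "region": ["South China", "North China"],
})

# A left join on "fruit" keeps every order row and attaches price and region.
df = pd.merge(order, fruit_info, on="fruit", how="left")
```

A left join is chosen here so that an order is never dropped if its fruit is missing from the information table; the price would simply be NaN.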

6. Generate a new field: order amount
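The code for this field is not shown. Assuming the order amount is quantity multiplied by unit price, it could be derived like this (the sample rows are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "pear"],
    "kilogram": [60, 75],
    "price": [3.8, 8.9],
})

# Order amount = purchased kilograms * price per kilogram, rounded to 2 decimals.
df["amount"] = (df["kilogram"] * df["price"]).round(2)
```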

Here you can learn:

How to generate time-related data

How to generate random data from a list (iterable objects)

How to create a Pandas DataFrame yourself, including generating new fields

How to merge data in Pandas

Analysis dimension 1: monthly sales trend from 2019 to 2021

1. First extract the year and month:

df["year"] = df["time"].dt.year
df["month"] = df["time"].dt.month
# extract year and month together
df["year_month"] = df["time"].dt.strftime('%Y%m')
df

2. View the field type:
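The call used here is not shown. One common way to inspect column types is the dtypes attribute, sketched below on a two-row stand-in table:

```python
import pandas as pd

df = pd.DataFrame({"time": pd.to_datetime(["2019-01-05", "2020-03-10"])})
df["year"] = df["time"].dt.year
df["month"] = df["time"].dt.month
df["year_month"] = df["time"].dt.strftime("%Y%m")

# One dtype per column: "time" is a datetime, "year" and "month" are integers,
# and "year_month" holds strings because strftime returns text.
print(df.dtypes)
```

Because year_month is text, it sorts correctly as long as the month is zero-padded, which '%Y%m' guarantees.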

3. Statistics and display by year and month:

df1 = df.groupby(["year_month"])["kilogram"].sum().reset_index()

fig = px.bar(df1, x="year_month", y="kilogram", color="kilogram")
fig.update_layout(xaxis_tickangle=45)  # tilt angle
fig.show()

Sales trend from 2019 to 2021

df2 = df.groupby(["year_month"])["amount"].sum().reset_index()
df2["amount"] = df2["amount"].apply(lambda x: round(x, 2))

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=df2["year_month"],
    y=df2["amount"],
    mode='lines+markers',  # drawing mode
    name='lines'))  # trace name
fig.update_layout(xaxis_tickangle=45)  # tilt angle
fig.show()

Annual sales volume, sales amount and average sales amount
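No code accompanies this heading. As a sketch, assuming the annual figures are the yearly totals of kilogram and amount plus the yearly mean amount, a named aggregation would produce them. The output column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2019, 2019, 2020, 2021, 2021],
    "kilogram": [60, 80, 70, 90, 50],
    "amount": [228.0, 304.0, 623.0, 1152.0, 640.0],
})

# One row per year: total quantity, total amount and mean amount per order.
annual = (
    df.groupby("year")
      .agg(total_kilogram=("kilogram", "sum"),
           total_amount=("amount", "sum"),
           mean_amount=("amount", "mean"))
      .reset_index()
)
```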

Analysis dimension 2: proportion of annual sales by fruit

df4 = df.groupby(["year", "fruit"]).agg({"kilogram": "sum", "amount": "sum"}).reset_index()
df4["year"] = df4["year"].astype(str)
df4["amount"] = df4["amount"].apply(lambda x: round(x, 2))

from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(
    rows=1,
    cols=3,
    subplot_titles=["2019", "2020", "2021"],
    specs=[[{"type": "domain"},  # specify the subplot type
            {"type": "domain"},
            {"type": "domain"}]])

years = df4["year"].unique().tolist()

for i, year in enumerate(years):
    name = df4[df4["year"] == year].fruit
    value = df4[df4["year"] == year].kilogram
    fig.add_traces(go.Pie(labels=name, values=value), rows=1, cols=i+1)

fig.update_traces(
    textposition='inside',  # 'inside', 'outside', 'auto', 'none'
    textinfo='percent+label',
    insidetextorientation='radial',  # horizontal, radial, tangential
    hole=.3,
    hoverinfo="label+percent+name")

fig.show()

years = df4["year"].unique().tolist()

for _, year in enumerate(years):
    df5 = df4[df4["year"] == year]
    fig = go.Figure(go.Treemap(
        labels=df5["fruit"].tolist(),
        parents=df5["year"].tolist(),
        values=df5["amount"].tolist(),
        textinfo="label+value+percent root"))
    fig.show()

Changes in monthly sales of goods
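The bar chart that follows needs one row per month and fruit, but the step that builds that table is not shown. A plausible sketch, assuming the amount is summed by year_month and fruit, is given here. The table name is hypothetical and the rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "year_month": ["201901", "201901", "201901", "201902"],
    "fruit": ["apple", "apple", "pear", "apple"],
    "amount": [100.0, 50.0, 80.0, 30.0],
})

# Total amount for every (month, fruit) pair.
monthly_by_fruit = (
    df.groupby(["year_month", "fruit"])["amount"]
      .sum()
      .reset_index()
)
```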

fig = px.bar(df5, x="year_month", y="amount", color="fruit")
fig.update_layout(xaxis_tickangle=45)  # tilt angle
fig.show()

The changes shown in the line chart:

Analysis dimension 3: sales volume in different regions

Average annual sales in different regions

df7 = df.groupby(["year", "region"])["amount"].mean().reset_index()

Analysis dimension 4:

df8 = df.groupby(["name"]).agg({"time": "count", "amount": "sum"}).reset_index().rename(columns={"time": "order_number"})
df8.style.background_gradient(cmap="Spectral_r")

Users' fruit preferences

Compute the number of orders and the order amount of each fruit for each user:

df9 = df.groupby(["name", "fruit"]).agg({"time": "count", "amount": "sum"}).reset_index().rename(columns={"time": "number"})
df10 = df9.sort_values(["name", "number", "amount"], ascending=[True, False, False])
df10.style.bar(subset=["number", "amount"], color="#a97fcf")

px.bar(df10,
       x="fruit",
       y="amount",
       # color="number",
       facet_col="name")

User layering-RFM model

The RFM model is an important tool for measuring customer value and a customer's ability to generate profit.

The model captures three indicators of a user's behavior: how recently the user last transacted, the overall transaction frequency and the total transaction amount. Together these three indicators describe the customer's value, and they also divide customers into 8 value types. The three indicators are:

Recency (R) is the number of days from the customer's last purchase to the present. This indicator depends on the point in time of the analysis, so it changes over time. In theory, the more recently a customer has bought, the more likely they are to buy again.

Frequency (F) is the number of times a customer makes purchases. Customers who buy most often tend to be more loyal. Increasing the number of customer purchases means taking a larger share of the market.

Monetary value (M) is the total amount that the customer spent on the purchase.

The three indicators can be computed with a few Pandas operations. First, F and M: the number of orders and the total amount per customer.
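The code is not shown in the text. A sketch of F and M, assuming one row per order with name, time and amount columns, could be a grouped count and sum. The sample rows and the column names F and M are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["xiaoming", "xiaoming", "Mike", "Mike", "Mike"],
    "time": pd.to_datetime(["2021-11-20", "2021-12-15", "2021-10-01",
                            "2021-11-11", "2021-12-30"]),
    "amount": [100.0, 50.0, 80.0, 20.0, 30.0],
})

# F = number of orders per customer, M = total amount per customer.
fm = (
    df.groupby("name")
      .agg(F=("time", "count"), M=("amount", "sum"))
      .reset_index()
)
```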

How to solve the R index?

1. First compute the difference between each order time and the current time

2. Sort each user's records in ascending order of the difference R, so that the first record is the most recent purchase. For example, for the user xiaoming the latest purchase is on December 15, which is 25 days before the current time.

3. Deduplicate by user, keeping only the first record, to get the R indicator of each user:

4. Merge the data to obtain all three indicators.
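The four steps above can be sketched as follows. The analysis date is fixed at 2022-01-09 as an assumption, so that xiaoming's last purchase on December 15 is 25 days back as in the example. The sample rows and intermediate names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["xiaoming", "xiaoming", "Mike", "Mike"],
    "time": pd.to_datetime(["2021-11-20", "2021-12-15", "2021-10-01", "2021-12-30"]),
    "amount": [100.0, 50.0, 80.0, 30.0],
})

# Assumption: a fixed analysis date; in practice this could be pd.Timestamp.now().
now = pd.Timestamp("2022-01-09")

# 1. Days between each order and the analysis date.
df["R"] = (now - df["time"]).dt.days

# 2. Sort so that each user's most recent order (smallest R) comes first.
df_sorted = df.sort_values(["name", "R"], ascending=[True, True])

# 3. Deduplicate by user, keeping the first (most recent) record.
r = df_sorted.drop_duplicates(subset="name", keep="first")[["name", "R"]]

# 4. Merge R with F (order count) and M (total amount).
fm = df.groupby("name").agg(F=("time", "count"), M=("amount", "sum")).reset_index()
rfm = r.merge(fm, on="name")
```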

When the amount of data is large enough and there are enough users, the RFM model alone can be used to divide users into eight types.
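The article does not show how the eight types are formed. One common convention, used here as an assumption, is to split each indicator at its mean into a favorable and an unfavorable half, which gives 2 x 2 x 2 = 8 combinations. The segment codes and sample values are hypothetical.

```python
import pandas as pd

rfm = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "R": [5, 40, 12, 60],
    "F": [12, 3, 9, 2],
    "M": [900.0, 150.0, 300.0, 80.0],
})

# Assumption: the mean is the cut-off. A lower R is better; higher F and M are better.
r_good = (rfm["R"] < rfm["R"].mean()).astype(int).astype(str)
f_good = (rfm["F"] > rfm["F"].mean()).astype(int).astype(str)
m_good = (rfm["M"] > rfm["M"].mean()).astype(int).astype(str)

# Three-character code, one digit per indicator, e.g. "111" is favorable on all three.
rfm["segment"] = r_good + f_good + m_good
```

Medians or business-defined thresholds could replace the means without changing the structure of the code.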

Analysis of user repurchase cycle

The repurchase cycle is the interval between two consecutive purchases. For the user xiaoming, for example, the first two repurchase cycles are 4 days and 22 days.

The following is the process of solving each user's repurchase cycle:

1. Sort each user's purchase times in ascending order

2. Shift the time by one row:

3. Compute the difference after merging:

The null values appear because there is no earlier record before each user's first purchase. These rows are deleted directly.

4. Take out the number of days directly:
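The steps so far can be sketched in a few lines. The sample dates are chosen so that xiaoming's cycles are 4 and 22 days, as in the example above. The names df16 and day follow the bar chart code in the next step; the other names are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["xiaoming", "Mike", "xiaoming", "Mike", "xiaoming"],
    "time": pd.to_datetime(["2019-01-05", "2019-02-01", "2019-01-09",
                            "2019-02-11", "2019-01-31"]),
})

# 1. Sort each user's purchases in ascending time order.
df16 = df.sort_values(["name", "time"]).reset_index(drop=True)

# 2. Shift the time down by one row within each user.
df16["previous_time"] = df16.groupby("name")["time"].shift(1)

# 3. Difference between a purchase and the previous one.
df16["interval"] = df16["time"] - df16["previous_time"]

# Each user's first purchase has no predecessor, so drop those rows.
df16 = df16.dropna(subset=["interval"]).reset_index(drop=True)

# 4. Keep only the number of days.
df16["day"] = df16["interval"].dt.days
```

Shifting inside groupby avoids subtracting one user's last purchase from the next user's first purchase.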

5. Comparison of repurchase cycle

px.bar(df16,
       x="day",
       y="name",
       orientation="h",
       color="day",
       color_continuous_scale="spectral")  # purples

The narrower a rectangle in the chart above, the shorter the interval; the total length of each user's bar is the sum of that user's repurchase cycles. View the total and the average repurchase cycle for each user:
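The totals and averages can be obtained with a grouped aggregation. This is a sketch on illustrative values, assuming a table with one repurchase interval per row in a day column.

```python
import pandas as pd

df16 = pd.DataFrame({
    "name": ["Mike", "Mike", "xiaoming", "xiaoming"],
    "day": [10, 30, 4, 22],
})

# Total and mean repurchase cycle per user.
cycle = df16.groupby("name")["day"].agg(["sum", "mean"]).reset_index()
```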

Conclusion: the overall repurchase cycle of Michk and Mike is relatively long, so they are loyal users in the long run. Their average repurchase cycle is relatively low, which indicates that they repurchase actively within short intervals.

The violin plot below also shows that the repurchase cycles of Michk and Mike are the most concentrated.

That concludes this introduction to building data with Pandas. To master the points covered here, you still need to practice and apply them yourself.
