
How to understand Python dataset and Visualization

2025-03-28 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article introduces Python datasets and visualization through a worked example: exploring a sales dataset with pandas and plotting it with seaborn. If you have had doubts about how to approach this kind of analysis, the walkthrough below should help.

The scenario:

Your task is to improve the performance of the sales team. In our hypothetical case, potential customers have fairly spontaneous demand. When demand arises, your sales team enters an order lead into the system. A sales representative then tries to schedule a meeting, which takes place around the time the order lead is recorded, sometimes before and sometimes after. Sales representatives have a budget that combines meetings with meals; they spend it and hand their invoices to the accounting team for processing. After the prospect decides whether to accept the offer, the sales representative tracks whether the order lead converted into a sale.

For analysis, you can access the following three data sources:

Order leads (including all order leads and conversion information)

Sales team (including the company and responsible sales representatives)

Invoices (including invoice and participant information)

Import and install:

You need the standard libraries. In addition, install seaborn in your Notebook using the following command:

!pip install seaborn

Download data:

You can download and merge the data as described last week, or you can download the file here and load it into Notebook.

The first two rows of the sales_team data table

The first two rows of the order_leads data table

The first two rows of the invoices data table

Begin to explore:

Overall conversion rate over time:

Change of conversion rate with time

Things seem to go downhill in early 2017. A check with the chief sales officer revealed that a competitor entered the market around that time. Good to know, but there is nothing we can do about it right now.

_ = order_leads.set_index(pd.DatetimeIndex(order_leads.Date)).groupby(pd.Grouper(freq='D'))['Converted'].mean()
ax = _.rolling(60).mean().plot(figsize=(20, 7), title='Conversion Rate Over Time')
vals = ax.get_yticks()
ax.set_yticklabels(['{:,.0f}%'.format(x * 100) for x in vals])
sns.despine()


We use an underscore _ as a temporary variable name. I usually do this for throwaway variables that are not used again later.

We used pd.DatetimeIndex on order_leads.Date and set the result as the index, which enabled us to group the data by day with pd.Grouper(freq='D'). Alternatively, you can change the frequency to 'W', 'M', 'Q' or 'Y' (weekly, monthly, quarterly, or yearly).

We calculate the average of the daily "Converted" values, which gives the conversion rate of that day's orders.

We use .rolling(60) and .mean() to get a 60-day rolling average.

Then we format the yticklabels so that they show a percent sign.
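The steps above can be reproduced end to end on synthetic data (the table and column names mirror the article's order_leads, but the values here are made up):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the article's order_leads table (values are made up)
rng = np.random.default_rng(0)
order_leads = pd.DataFrame({
    'Date': pd.date_range('2017-01-01', periods=300, freq='D').repeat(5),
    'Converted': rng.integers(0, 2, size=1500),  # 0/1 conversion flag
})

# Daily conversion rate: the mean of the 0/1 flag per calendar day
daily = (order_leads
         .set_index(pd.DatetimeIndex(order_leads['Date']))
         .groupby(pd.Grouper(freq='D'))['Converted']
         .mean())

# A 60-day rolling mean smooths out the day-to-day noise;
# calling .plot() on this series yields the chart discussed above
smoothed = daily.rolling(60).mean()
```

Note that the first 59 entries of the rolling mean are NaN, because a full 60-day window is not yet available there.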

Conversion rate of each sales representative:

There seems to be a big difference among the sales representatives. Let's do more research on this.

In terms of the functions used, there is not much new here. But notice how we use sns.distplot to draw the data onto the axes.

If we review the sales_team data, we will remember that not all sales representatives have the same number of customers, which will definitely have an impact! Let's check.

Conversion rate by number of assigned accounts

We can see that the conversion rate appears inversely proportional to the number of accounts assigned to a sales representative. The lower conversion rates make sense: after all, the more accounts a representative covers, the less time he or she can spend on each one.

Here we first create a helper function that draws a vertical line on each subplot and annotates it with the mean and standard deviation of the data. Then we set some seaborn plotting defaults, such as a larger font_scale and the whitegrid style.
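The article does not show that helper, but a minimal sketch of such an annotate-the-mean function (all names here are hypothetical) could look like this:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend, so no display is needed
import matplotlib.pyplot as plt

def annotate_mean_std(ax, data):
    """Hypothetical helper: draw a vertical line at the mean of `data`
    and label it with the mean and standard deviation."""
    mu, sigma = np.mean(data), np.std(data)
    ax.axvline(mu, color='k', linestyle='--')
    ax.annotate(f'mean: {mu:.2f}\nstd: {sigma:.2f}',
                xy=(mu, 0.9), xycoords=('data', 'axes fraction'))
    return mu, sigma

# Demo on synthetic "conversion rates"
rates = np.random.default_rng(1).normal(0.5, 0.1, 500)
fig, ax = plt.subplots()
ax.hist(rates, bins=20)
mu, sigma = annotate_mean_std(ax, rates)
```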

The impact of meals:

Meal data

It seems that the date and time of each meal are recorded, so let's take a quick look at the distribution of meal times:

invoices['Date of Meal'] = pd.to_datetime(invoices['Date of Meal'])
invoices['Date of Meal'].dt.time.value_counts().sort_index()

Out:
07:00:00    5536
08:00:00    5613
09:00:00    5473
12:00:00    5614
13:00:00    5412
14:00:00    5633
20:00:00    5528
21:00:00    5534
22:00:00    5647

Binning the meal times:

invoices['Type of Meal'] = pd.cut(invoices['Date of Meal'].dt.hour, bins=[0, 10, 15, 24], labels=['breakfast', 'lunch', 'dinner'])

Notice how we use pd.cut here to assign categories to numeric data, which makes sense because, after all, it does not really matter whether breakfast starts at 8:00 or 9:00.
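As a self-contained illustration of the same pd.cut call (the hours below are invented):

```python
import pandas as pd

# bins=[0, 10, 15, 24] produces the intervals (0, 10], (10, 15], (15, 24],
# labeled breakfast, lunch and dinner respectively
hours = pd.Series([7, 9, 12, 14, 20, 22])
meal_type = pd.cut(hours, bins=[0, 10, 15, 24],
                   labels=['breakfast', 'lunch', 'dinner'])
```

An hour of exactly 10 would still count as breakfast, because pd.cut's intervals are right-inclusive by default.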

Also notice the use of .dt.hour, which only works because we converted invoices['Date of Meal'] to datetime earlier. .dt is a so-called accessor; there are three of them: .cat, .str and .dt. If your column has the right data type, you can use these accessors and their methods to operate on it directly (computationally efficient and concise). Unfortunately, invoices['Participants'] is a string, so we must first convert it to legitimate JSON before we can extract the number of participants.
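A quick sketch of all three accessors on toy Series (the data is invented for illustration):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2018-01-05 07:00', '2018-06-20 20:30']))
names = pd.Series(['Alice', 'Bob'])
meals = pd.Series(['lunch', 'dinner'], dtype='category')

hours = dates.dt.hour      # .dt works on datetime-typed columns
upper = names.str.upper()  # .str works on string columns
codes = meals.cat.codes    # .cat works on categorical columns
```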

def replace(x):
    # make the string valid JSON: commas between entries, double quotes
    return (x.replace("\n", ",")
             .replace("' '", "','")
             .replace("'", '"'))

invoices['Participants'] = invoices['Participants'].apply(lambda x: replace(x))
invoices['Number Participants'] = invoices['Participants'].apply(lambda x: len(json.loads(x)))
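To make the quote juggling concrete, here is the same kind of cleanup on a single made-up cell (the names are invented):

```python
import json

# A raw Participants cell, roughly as stored: a bracketed list of
# single-quoted names separated by whitespace, which is not valid JSON
raw = "['Anna Price' 'Tom Lee']"

# Insert commas between entries, then switch to JSON's double quotes
cleaned = raw.replace("' '", "','").replace("'", '"')
participants = json.loads(cleaned)
n = len(participants)
```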

Now let's merge the data. To do so, we first left-join all invoices onto order_leads on the company ID. Merging like this, however, attaches every meal of a company to every one of its orders, including meals long before or long after the order. To mitigate this, we compute the time difference between meal and order and only consider meals within five days of the order.

Some orders are still matched to multiple meals. This can happen when there are two orders and two meals close together in time; both meals would then be assigned to both order leads. To remove these duplicates, we keep only the meal closest to each order.
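This merge-then-keep-closest logic can be sketched on a toy version of the two tables (column names follow the article; the rows are invented):

```python
import pandas as pd

orders = pd.DataFrame({
    'Company Id': ['A', 'A'],
    'Order Id': [1, 2],
    'Date': pd.to_datetime(['2018-03-10', '2018-03-20']),
})
meals = pd.DataFrame({
    'Company Id': ['A', 'A', 'A'],
    'Date of Meal': pd.to_datetime(['2018-03-08', '2018-03-19', '2018-01-01']),
})

# 1) left-join every meal of the company onto every order
merged = orders.merge(meals, on='Company Id', how='left')

# 2) keep only meals within five days of the order
gap = (merged['Date of Meal'] - merged['Date']).dt.days.abs()
merged = merged.assign(gap=gap)[gap <= 5]

# 3) per order, keep only the meal closest in time
closest = (merged.sort_values('gap')
                 .drop_duplicates(subset='Order Id', keep='first'))
```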

Combine parts of data

I have created a bar-plot helper function that already contains some styling. Plotting through this function makes visual inspection quicker. We will use it in a second.

The impact of the type of meal:

orders_with_meals['Type of Meal'].fillna('no meal', inplace=True)
_ = orders_with_meals.groupby('Type of Meal').agg({'Converted': np.mean})
plot_bars(_, x_col='Type of Meal', y_col='Converted')

Wow! The conversion rates of orders with and without an associated meal are very different. It also seems that the conversion rate for lunch is slightly lower than for dinner or breakfast.

The influence of timing (that is, meal before or after the order):

_ = orders_with_meals.groupby(['Days of meal before order']).agg({'Converted': np.mean})
plot_bars(data=_, x_col='Days of meal before order', y_col='Converted')

A negative number of days before the order means the meal took place after the order lead came in. We can see that a meal occurring before the order lead arrives seems to have a positive effect on the conversion rate: advance knowledge of the coming order apparently gives our sales representatives an edge.

Combine all:

Now we will use a heat map to visualize multiple dimensions of the data at once. To do this, we first create a helper function.

Then we do some final data wrangling: we consider the price of the meal in relation to the order value, and we bucket the meal timing into "before order", "around order" and "after order" rather than using the raw day counts from minus four to plus four, which would be cumbersome to interpret.

Running the following code snippet will produce a multidimensional heat map.

draw_heatmap(data=data, outer_row='Timing of Meal', outer_col='Type of Meal', inner_row='Meal Price / Order Value', inner_col='Number Participants', values='Converted')

A heat map can visualize four dimensions in one picture.

The heat map is certainly pretty, though a little hard to read at first. So let's walk through it. The chart summarizes the impact of four different dimensions:

Timing of meal: after order, around order, before order (outer row)

Type of meal: breakfast, dinner, lunch (outer column)

Meal price relative to order value: five bins from cheapest to most expensive (inner row)

Number of participants: 1, 2, 3, 4, 5 (inner column)
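The draw_heatmap helper is not reproduced here, but a simplified two-dimensional version of the idea, with the conversion rate per cell computed via pivot_table (which sns.heatmap could then render), looks like this on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
data = pd.DataFrame({
    'Timing of Meal': rng.choice(['meal before order', 'meal after order'], 300),
    'Type of Meal': rng.choice(['breakfast', 'lunch', 'dinner'], 300),
    'Converted': rng.integers(0, 2, size=300),
})

# Mean conversion rate per (timing, meal-type) cell;
# seaborn's sns.heatmap(pivot, annot=True) would render it as a heat map
pivot = data.pivot_table(index='Timing of Meal',
                         columns='Type of Meal',
                         values='Converted',
                         aggfunc='mean')
```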

The colors toward the bottom of the chart are clearly darker (that is, conversion is higher), which indicates:

When eating before ordering, the conversion rate will be higher.

When there is only one participant, the dinner conversion rate seems to be higher.

Meals that are expensive relative to the order value seem to have a positive effect on the conversion rate.

Results:

Assign no more than nine accounts per sales representative (beyond that, the conversion rate drops quickly).

Make sure every order lead is accompanied by a meeting or meal (this roughly doubles the conversion rate). When only one person attends, dinner is the most effective.

Your sales representatives should spend about 8% to 10% of the order value on the meal.

Time is critical, and ideally, your sales representative should know as soon as possible that a deal is coming.

Click here to view the code: GitHub Repo / Jupyter Notebook

A note on the heat maps:

To fix possible formatting errors, first uninstall matplotlib (this must be done from the terminal), then downgrade it to version 3.1.0 by running the following command:

!pip install matplotlib==3.1.0

At this point, the study of "how to understand Python datasets and visualization" is complete. Hopefully it has cleared up your doubts; combining theory with practice is the best way to learn, so go and try it yourself!
