How not to write pandas code

Updated 2025-04-04. Source: SLTechnology News&Howtos (shulou), Development

Shulou (Shulou.com), 06/02 report

This article looks at how not to write pandas code: common slow patterns and the faster pandas idioms that should replace them. The methods shown are simple, fast, and practical.

Set up

    from platform import python_version

    import numpy as np
    import pandas as pd

    np.random.seed(42)  # set the seed to make examples repeatable

Sample data set

The sample dataset contains booking information for each city. The values are random; the dataset's sole purpose is to illustrate the examples.

The dataset has three columns:

id is a unique identifier

city is the city in which the booking was made

booked_perc is the percentage booked at a specific time

The dataset has 10,000 rows, which is enough to make the speed differences obvious. Note that when code is written the idiomatic pandas way, pandas can compute statistics on DataFrames with millions (or even billions) of rows.

    size = 10000
    cities = ["paris", "barcelona", "berlin", "newyork"]

    # Random city and random booked percentage for every row
    df = pd.DataFrame(
        {"city": np.random.choice(cities, size=size), "booked_perc": np.random.rand(size)}
    )
    df["id"] = df.index.map(str) + "-" + df.city  # e.g. "<row number>-<city>"
    df = df[["id", "city", "booked_perc"]]  # reorder the columns
    df.head()
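To check the dataset against the description above outside a notebook, a minimal sketch could rebuild the frame and inspect its properties. The variable names n_rows, n_cols, ids_unique, and perc_in_range are introduced here only for illustration and are not part of the original article.

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # same seed as in the setup

size = 10_000
cities = ["paris", "barcelona", "berlin", "newyork"]
df = pd.DataFrame(
    {
        "city": np.random.choice(cities, size=size),
        "booked_perc": np.random.rand(size),
    }
)
df["id"] = df.index.map(str) + "-" + df.city
df = df[["id", "city", "booked_perc"]]

# Properties described in the text: 10,000 rows, three columns,
# unique ids, and booked_perc values between 0 and 1.
n_rows, n_cols = df.shape
ids_unique = df["id"].is_unique
perc_in_range = bool(df["booked_perc"].between(0, 1).all())
print(n_rows, n_cols, ids_unique, perc_in_range)
```

The ids are unique because each one starts with the row's index label, which is itself unique.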

1. How not to sum data

Coming from the Java world, the author carried the "multi-line for loop" over to Python.

Let's calculate the sum of the booked_perc column. Adding up percentages makes no sense in itself, but it works as an exercise.

    %%timeit
    suma = 0
    for _, row in df.iterrows():
        suma += row.booked_perc

    766 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A more Pythonic way to sum the column is:

    %timeit sum(booked_perc for booked_perc in df.booked_perc)

    989 µs ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    %timeit df.booked_perc.sum()

    92 µs ± 2.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

As expected, the first example is the slowest: it takes almost a second to sum 10,000 items. The second example is surprisingly fast.

The right way is to sum the data with pandas (or to apply any other column operation the same way), which is the third example, and the fastest.
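The three approaches can also be compared in a plain Python script, without IPython's timeit magic. The following is a minimal sketch using the standard-library timeit module; the function names sum_iterrows, sum_builtin, and sum_pandas are made up for this example, and the absolute timings will differ from the figures quoted above depending on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({"booked_perc": np.random.rand(10_000)})


def sum_iterrows(frame):
    """Row-by-row loop: iterrows builds a Series object for every row."""
    total = 0.0
    for _, row in frame.iterrows():
        total += row.booked_perc
    return total


def sum_builtin(frame):
    """Python's built-in sum over the column values."""
    return sum(frame.booked_perc)


def sum_pandas(frame):
    """Vectorized column sum provided by pandas."""
    return frame.booked_perc.sum()


functions = (sum_iterrows, sum_builtin, sum_pandas)

# All three must agree on the result (up to floating-point rounding).
results = {f.__name__: f(df) for f in functions}

# Average wall-clock time per call over a few repetitions.
repeats = 3
timings = {
    f.__name__: timeit.timeit(lambda f=f: f(df), number=repeats) / repeats
    for f in functions
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

The loop is slow mainly because iterrows creates a new Series for each of the 10,000 rows, while the pandas method works on the whole column at once.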

2. How not to filter data

Although the author was already familiar with numpy before starting with pandas, the author still used for loops to filter data. The difference in performance shows up again when calculating the sum.

    %%timeit
    suma = 0
    for _, row in df.iterrows():
        if row.booked_perc
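A vectorized alternative to filtering row by row can be sketched with a boolean mask. The condition in the loop above is not fully shown, so the threshold of 0.5 and the "less than or equal" comparison used here are assumptions chosen purely for illustration, as are the names THRESHOLD, loop_total, and mask_total.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({"booked_perc": np.random.rand(10_000)})

THRESHOLD = 0.5  # hypothetical cut-off, not taken from the article

# Loop version: test each row individually and accumulate the matches.
loop_total = 0.0
for _, row in df.iterrows():
    if row.booked_perc <= THRESHOLD:
        loop_total += row.booked_perc

# Vectorized version: build a boolean mask for the whole column,
# select the matching rows, then sum them in one call.
mask = df.booked_perc <= THRESHOLD
mask_total = df.loc[mask, "booked_perc"].sum()

print(np.isclose(loop_total, mask_total))
```

Both versions select the same rows; the mask version expresses the filter as a single column-wide comparison instead of 10,000 separate ones.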
