In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-04-04 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >
Share
Shulou(Shulou.com)06/02 Report--
This article focuses on "how to avoid writing pandas code", interested friends may wish to take a look. The method introduced in this paper is simple, fast and practical. Let's let the editor take you to learn how to avoid writing pandas code.
Set up
From platform importpython_versionimport numpy as np import pandas as pdnp.random.seed (42) # set the seed tomake examples repeatable
Sample data set
The sample dataset contains booking information for each city, is random, and the sole purpose is to display the sample.
The dataset has three columns:
Id represents a unique identity
City indicates the reserved city information
Booked perc represents the percentage booked at a specific time
There are ten thousand items in the dataset, which makes the speed improvement even more obvious. Note that if the code is written in the correct pandas way, pandas can use DataFrames to calculate millions (or even billions) of rows of statistics.
Size= 10000cities = ["paris", "barcelona", "berlin", "newyork"] df = pd.DataFrame ({"city": np.random.choice (cities,sizesize=size), "booked_perc": np.random.rand (size)}) df ["id"] = df.index.map (str) + "-" + df.city dfdf = df [["id", "city", "booked_perc"] df.head ()
1. How to avoid summation of data
Inspired by the Java world, the "multi-line for loop" is applied to Python.
It doesn't make sense to calculate the sum of the booked perc columns and add up the percentages, but anyway, let's try it and practice it.
% timeitsuma = 0 for _, row in df.iterrows (): suma + = row.booked_perc766ms ±20.9ms per loop (mean ±std. Dev. Of 7 runs, 1 loop each)
A more Python-style way to sum the columns is as follows:
% timeitsum (booked_perc forbooked_perc in df.booked_perc) 989 μ s ±18.5 μ s per loop (mean ±std. Dev. Of 7 runs, 1000 loops each)% timeitdf.booked_perc.sum () 92 μ s ±2.21 μ s per loop (mean ±std. Dev. Of 7 runs, 10000 loops each)
As expected, the first example is the slowest-it takes almost a second to sum 10,000 items. The speed of the second example is surprising.
The right way to do this is to use pandas to sum the data (or use any other operation on the column), which is the third example-and the fastest!
two。 How to avoid filtering data
Although before using pandas, the author was already familiar with numpy and used for loops to filter data. Differences in performance can still be observed when summing.
% timeitsuma = 0 for _, row in df.iterrows (): if row.booked_perc
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.