This article mainly explains what the advantages of pandas are. The methods introduced here are simple, fast, and practical, so let's walk through them with a worked example.
Here is an example using an hourly electricity-demand dataset:
>>> import pandas as pd

# Import the dataset
>>> df = pd.read_csv('demand_profile.csv')
>>> df.head()
     date_time  energy_kwh
0  1-1-13 0:00       0.586
1  1-1-13 1:00       0.580
2  1-1-13 2:00       0.572
3  1-1-13 3:00       0.596
4  1-1-13 4:00       0.592
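One preparatory step is worth pointing out: pd.read_csv() reads date_time in as plain strings, so the .hour attribute used later would not work on it directly. A minimal sketch of the conversion, assuming the sample dates above are in day-month-year order:

# Convert the string column to a real datetime dtype so that `.hour` works later.
# The format string is an assumption based on the sample rows shown above.
df['date_time'] = pd.to_datetime(df['date_time'], format='%d-%m-%y %H:%M')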
Based on the above data, we now want to add a new feature, but this feature is generated from time conditions and varies with the hour of day, as follows: during peak hours (17:00 to 24:00) electricity costs 28 cents per kWh, during shoulder hours (7:00 to 17:00) it costs 20 cents per kWh, and during off-peak hours (0:00 to 7:00) it costs 12 cents per kWh.
So, if you do not yet know how to speed this up, the natural first idea is probably to write a function that encodes the time-condition logic and then apply it to every row:
def apply_tariff(kwh, hour):
    """Calculate the electricity bill for a given hour."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh

# Don't do this!
>>> @timeit(repeat=3, number=100)
... def apply_tariff_loop(df):
...     """Calculate the energy cost with a for loop and append it to a list."""
...     energy_cost_list = []
...     for i in range(len(df)):
...         # Get the power consumption and the hour of day
...         energy_used = df.iloc[i]['energy_kwh']
...         hour = df.iloc[i]['date_time'].hour
...         energy_cost = apply_tariff(energy_used, hour)
...         energy_cost_list.append(energy_cost)
...     df['cost_cents'] = energy_cost_list
...
>>> apply_tariff_loop(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_loop` ran in average of 3.152 seconds.
To someone who writes plain Python, this design may look natural. However, this loop severely hurts performance, for several reasons:

First, it needs to initialize a list in which the outputs will be recorded.

Second, it uses an opaque loop over range(0, len(df)), and after calling apply_tariff() it has to append the result to a list that is then used to build the new DataFrame column. It also uses chained indexing with df.iloc[i]['date_time'], which often leads to unexpected results.

The biggest problem with this approach, though, is the time cost. For the 8,760 rows of data, the loop took about 3 seconds.
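A side note on the timing harness: the @timeit(repeat=3, number=100) decorator used in these benchmarks is a helper that is never defined in this article. Purely as an assumption about its shape, a rough sketch of such a decorator could look like this:

import functools
from timeit import repeat as timeit_repeat

def timeit(repeat=3, number=100):
    """Hypothetical stand-in for the benchmarking decorator used in this article."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Run `number` calls per trial, `repeat` trials, and report the best average.
            trials = timeit_repeat(lambda: func(*args, **kwargs),
                                   repeat=repeat, number=number)
            best = min(trials) / number
            print(f'Best of {repeat} trials with {number} function calls per trial:')
            print(f'Function `{func.__name__}` ran in average of {best:.3f} seconds.')
        return wrapper
    return decorator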
Next, let's look at some options for speeding this up.
Looping with iterrows
The first improvement is to use pandas' built-in iteration methods, iterrows and itertuples, which are more efficient. Both are generator methods that yield one row at a time, similar to the way yield is used in scrapy.

.itertuples() yields a namedtuple for each row, with the row's index value as the first element of the tuple. A namedtuple is a data structure from Python's collections module that behaves like a tuple but whose fields can also be accessed by attribute lookup.

.iterrows() yields an (index, Series) pair for each row of the DataFrame.
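The benchmarks below only time .iterrows(), but a quick sketch (reusing the df loaded above) makes the difference between the two iterators concrete:

# .itertuples() yields a namedtuple per row; columns become attributes.
for row in df.itertuples():
    print(row.Index, row.energy_kwh)   # the index value is the first field, named `Index`
    break

# .iterrows() yields (index, Series) pairs; columns are looked up by label.
for index, row in df.iterrows():
    print(index, row['energy_kwh'])
    break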
Using .iterrows() in this example, let's see how it performs:
>>> @timeit(repeat=3, number=100)
... def apply_tariff_iterrows(df):
...     energy_cost_list = []
...     for index, row in df.iterrows():
...         # Get the power consumption and the hour of day
...         energy_used = row['energy_kwh']
...         hour = row['date_time'].hour
...         # Append the cost to the list
...         energy_cost = apply_tariff(energy_used, hour)
...         energy_cost_list.append(energy_cost)
...     df['cost_cents'] = energy_cost_list
...
>>> apply_tariff_iterrows(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_iterrows` ran in average of 0.713 seconds.
This syntax is more explicit, and there is less clutter in the row value references, so it is more readable.

Time cost: nearly 5 times faster!

However, there is still room for improvement; ideally, we could use something faster that is built into pandas.
Apply method of pandas
We can further improve this operation by using the .apply method instead of .iterrows. pandas' .apply method accepts a callable and applies it along an axis of the DataFrame (over all rows or all columns). In the code below, a lambda function passes the two relevant columns into apply_tariff():
>>> @timeit(repeat=3, number=100)
... def apply_tariff_withapply(df):
...     df['cost_cents'] = df.apply(
...         lambda row: apply_tariff(
...             kwh=row['energy_kwh'],
...             hour=row['date_time'].hour),
...         axis=1)
...
>>> apply_tariff_withapply(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_withapply` ran in average of 0.272 seconds.
The syntax advantages of apply are obvious: fewer lines and more readable code. In this case it also took less than half the time of the iterrows version.

However, this is still not "blazing fast". One reason is that .apply() internally tries to loop over Cython iterators, but the lambda passed here is not something that can be handled in Cython, so it ends up being called in Python, which is not that fast.

If we used .apply() on 10 years of hourly data, it would take around 15 minutes of processing time. If this calculation were only a small part of a larger computation, it really ought to be faster. This is where vectorized operations come in handy.
Vectorized operations: selecting data with .isin()
What is a vectorized operation?

If there were no conditions to worry about, we could apply the rate to the whole power-consumption column in a single line of code, something like df['energy_kwh'] * 28. That is an example of a vectorized operation, and it is the fastest way to do things in pandas.
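As a tiny illustration (toy values borrowed from the sample rows above), the whole column is scaled in one call, with the loop happening in optimized compiled code rather than in Python:

import pandas as pd

energy = pd.Series([0.586, 0.580, 0.572])  # toy values taken from the sample data
cost_cents = energy * 28                   # one vectorized multiply, no explicit Python loop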
But how do we apply conditional computation as a vectorized operation in pandas?

One trick is to select and group the DataFrame according to the conditions, and then apply a vectorized operation to each selected group.

In the code below, we will see how to select rows with pandas' .isin() method and then add the new feature with vectorized operations. Before doing that, it is convenient to make the date_time column the index of the DataFrame:
# Set the date_time column as the index
df.set_index('date_time', inplace=True)

@timeit(repeat=3, number=100)
def apply_tariff_isin(df):
    # Define Boolean arrays for each hour range
    peak_hours = df.index.hour.isin(range(17, 24))
    shoulder_hours = df.index.hour.isin(range(7, 17))
    off_peak_hours = df.index.hour.isin(range(0, 7))

    # Apply the rates defined in apply_tariff() above
    df.loc[peak_hours, 'cost_cents'] = df.loc[peak_hours, 'energy_kwh'] * 28
    df.loc[shoulder_hours, 'cost_cents'] = df.loc[shoulder_hours, 'energy_kwh'] * 20
    df.loc[off_peak_hours, 'cost_cents'] = df.loc[off_peak_hours, 'energy_kwh'] * 12
Let's see how it turns out.
>>> apply_tariff_isin(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_isin` ran in average of 0.010 seconds.
Note that the above .isin () method returns an array of Boolean values, as follows:
[False,..., True, True, True]
Each Boolean value indicates whether the corresponding datetime in the DataFrame index falls within the specified hour range. Passing these Boolean arrays to the DataFrame's .loc gives back a slice of the DataFrame matching those hours, and multiplying that slice by the appropriate rate is a fast vectorized operation.

This approach completely replaces the original custom function apply_tariff(); it greatly reduces the amount of code, and the speed takes off at the same time.

The running time is about 315 times faster than the plain Python for loop, 71 times faster than iterrows, and 27 times faster than apply!
Could it be faster?
Exciting, isn't it? Let's keep going.

In apply_tariff_isin above we still do some manual work: we have to call df.loc and df.index.hour.isin three times. If we had a more fine-grained set of time slots, you might argue that this solution does not scale. Fortunately, in this case we can use pandas' pd.cut() function to do the cutting automatically:
@timeit(repeat=3, number=100)
def apply_tariff_cut(df):
    cents_per_kwh = pd.cut(x=df.index.hour,
                           bins=[0, 7, 17, 24],
                           include_lowest=True,
                           labels=[12, 20, 28]).astype(int)
    df['cost_cents'] = cents_per_kwh * df['energy_kwh']
In the code above, pd.cut() assigns each hour to a group based on the bins list and attaches the corresponding label, which here is the rate in cents.

The include_lowest parameter controls whether the first interval should include its left edge, so that hour 0 is not dropped.
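To see how the bins and include_lowest behave, here is a small sketch on a handful of hand-picked hours:

import pandas as pd

hours = pd.Series([0, 6, 7, 8, 17, 18, 23])
rates = pd.cut(x=hours, bins=[0, 7, 17, 24],
               include_lowest=True, labels=[12, 20, 28]).astype(int)
# Intervals are right-closed: [0, 7], (7, 17], (17, 24]. Hour 0 only falls into
# the first bin because include_lowest=True.
print(list(rates))   # [12, 12, 12, 20, 20, 28, 28]

Note that with right-closed intervals, hour 7 gets the off-peak rate and hour 17 gets the shoulder rate, which differs slightly at the boundaries from the .isin() version above; worth keeping in mind if the exact cut-offs matter.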
This is a fully vectorized method, and it is the fastest so far:
>>> apply_tariff_cut(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_cut` ran in average of 0.003 seconds.
So far, pandas-based processing has basically reached its limit: the complete 10-year hourly dataset is processed in well under a second.

However, there is one last option: using NumPy directly, which can be faster still!
Continue to accelerate with NumPy
One thing you should not forget when working with pandas is that its Series and DataFrames are built on top of the NumPy library. Moreover, pandas interoperates seamlessly with NumPy arrays and operations.
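A quick sketch of what that connection looks like in practice (assuming the indexed df from the previous section): the NumPy array behind a column or an index is always one call away:

hour_array = df.index.hour.values           # a plain NumPy integer array of hours
energy_array = df['energy_kwh'].to_numpy()  # the column's values as a NumPy ndarray
print(type(energy_array))                   # <class 'numpy.ndarray'>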
Let's take this one step further with NumPy's digitize() function. It is similar to pandas' cut() above in that the data gets binned, but this time the result is an array of indexes indicating which bin each hour belongs to. These indexes are then applied to a price array:
import numpy as np

@timeit(repeat=3, number=100)
def apply_tariff_digitize(df):
    prices = np.array([12, 20, 28])
    bins = np.digitize(df.index.hour.values, bins=[7, 17, 24])
    df['cost_cents'] = prices[bins] * df['energy_kwh'].values
Like the cut function, this syntax is very concise and easy to read.
>>> apply_tariff_digitize(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_digitize` ran in average of 0.002 seconds.
0.002 seconds! There is still an improvement, but at this point the gains are marginal.
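To make the index mapping concrete, here is a small sketch of what np.digitize() returns for a few hand-picked hours:

import numpy as np

prices = np.array([12, 20, 28])
sample_hours = np.array([3, 10, 20])
bin_idx = np.digitize(sample_hours, bins=[7, 17, 24])
print(bin_idx)           # [0 1 2] -- the bin index of each hour
print(prices[bin_idx])   # [12 20 28] -- fancy indexing maps each bin to its rate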
At this point, I believe you have a deeper understanding of what the advantages of pandas are; you might as well try these techniques out in practice.