Today I'd like to talk about how to use the Pandas library in Python correctly to speed up your projects. Many people are not very familiar with this topic, so I have summarized the following in the hope that you can take something away from this article.
How to use the Pandas library correctly to improve a project's running speed
If you work with big data, you will find many pleasant surprises in Python's Pandas library. Pandas plays an increasingly important role in data science and analytics, especially for users who move to Python from Excel and VBA.

So what is Pandas to data scientists, data analysts, and data engineers? The Pandas documentation introduces it as:

"Fast, flexible, and expressive data structures designed to make working with relational and labeled data both easy and intuitive."

Fast, flexible, simple, and intuitive: those are all desirable qualities. When you build a complex data model, you shouldn't have to spend a large share of development time waiting for data to finish processing. That leaves more time for understanding the data.
But some people say that Pandas is slow.
The first time I used Pandas, someone commented that Pandas is a great tool for parsing data but too slow for statistical modeling. And sure enough, the first time I used it, it really was painfully slow.

However, Pandas is built on the NumPy array structure, and many of its operations are implemented in extension modules shipped with NumPy or Pandas, written in Cython, compiled to C, and executed as C code. So shouldn't Pandas be fast, too?

In fact, Pandas is fast, if you use it the right way.

When working with Pandas, writing loops in pure Python is not the most efficient choice. Like NumPy, Pandas is designed for vectorized operations, which process an entire column or dataset in a single pass. Handling each cell or row individually with a loop should be a last resort.

To be clear, this tutorial is not a guide to over-optimizing Pandas code; used correctly, Pandas is already fast. Besides, there is a big difference between optimized code and clear code.

Rather, this is a guide to making full use of the powerful, easy-to-use features built into Pandas, together with some practical time-saving tricks. In this tutorial you will learn:
The advantages of using datetime-typed time series data
More efficient ways to perform batch calculations
How to save time with HDFStore
This tutorial uses power-consumption time series data to demonstrate these topics. After loading the data, we will work step by step toward more efficient ways of reaching the same result. For Pandas users there is more than one way to preprocess data, but that does not mean every approach scales to larger, more complex datasets.
[Tools] Python 3, Pandas 0.23.1
Task:
This example uses time series data of energy consumption to calculate the total cost of energy for one year. Because electricity prices differ by time of day, the consumption in each period must be multiplied by the price for that period.

Two columns are read from the CSV file: the date/time and the electricity consumed (in kWh).

Each row holds one hour of consumption data, so a full year produces 8,760 (365 × 24) rows. The hourly value in each row is stamped with the start of the hour, so the 0:00 row on 1/1/13 holds the consumption for the first hour of January 1st.
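If you want to follow along without the original file, here is a hypothetical snippet that writes an equivalent CSV with synthetic consumption values; the file name and the random numbers are illustrative assumptions, not the article's real data. The date format matches the one parsed later in this article.

import numpy as np
import pandas as pd

# One timestamp per hour for 2013, stored as strings like the article's CSV
hours = pd.date_range('2013-01-01', periods=8760, freq='H')
demo = pd.DataFrame({
    'date_time': hours.strftime('%d/%m/%y %H:%M'),
    'energy_kwh': np.random.uniform(0.2, 1.2, size=8760).round(3),  # made-up values
})
demo.to_csv('demo_energy.csv', index=False)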
Save time with the Datetime class
First, read the CSV file using one of Pandas's I/O functions:
>>> import pandas as pd
>>> pd.__version__
'0.23.1'
>>> df = pd.read_csv('file path')
>>> df.head()
     date_time  energy_kwh
0  1/1/13 0:00       0.586
1  1/1/13 1:00       0.580
2  1/1/13 2:00       0.572
3  1/1/13 3:00       0.596
4  1/1/13 4:00       0.592
The result looks good, but there is a small problem. Pandas and NumPy have the concept of data types, dtypes. If no argument is specified, the date_time column is given the generic object dtype:
>>> df.dtypes
date_time      object
energy_kwh    float64
dtype: object

>>> type(df.iat[0, 0])
str
The object dtype is a container not just for str but for any column that cannot fit neatly into a single data type. Treating dates as str values is inefficient to process and wasteful of memory.

To work with time series data, the date_time column should be formatted as an array of datetime objects, which Pandas calls Timestamp. Formatting it with Pandas is fairly simple:
>>> df['date_time'] = pd.to_datetime(df['date_time'])
>>> df['date_time'].dtype
datetime64[ns]
At this point, the new df has essentially the same content as the CSV file: two columns plus an index.
>>> df.head()
            date_time  energy_kwh
0 2013-01-01 00:00:00       0.586
1 2013-01-01 01:00:00       0.580
2 2013-01-01 02:00:00       0.572
3 2013-01-01 03:00:00       0.596
4 2013-01-01 04:00:00       0.592
The code above is simple and easy to understand, but how fast is it? We can check with a timing decorator, here called @timeit. This decorator mimics timeit.repeat() from the Python standard library, but it also returns the function's result and prints the average run time over repeated trials; Python's timeit.repeat() returns only the timings, not the function's result.

Place @timeit above a function, and the function's run time will be printed each time the function runs. A sketch of such a decorator is shown below.
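The @timeit decorator is not part of the standard library. Here is a minimal sketch of what it could look like, matching the usage and output format seen in this article; this is an assumed implementation, not the author's exact code.

import functools
import gc
import itertools
from timeit import default_timer as _timer

def timeit(repeat=3, number=10):
    """Time a function like timeit.repeat(), but also return the
    function's result and print the best average run time."""
    def wrap(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            trials = []
            gc.collect()
            for _ in itertools.repeat(None, repeat):
                start = _timer()
                for _ in itertools.repeat(None, number):
                    result = func(*args, **kwargs)
                # Average time per call in this trial
                trials.append((_timer() - start) / number)
            best = min(trials)
            print('Best of {} trials with {} function calls per trial:'
                  .format(repeat, number))
            print('Function `{}` ran in average of {:.3f} seconds.'
                  .format(func.__name__, best))
            return result
        return inner
    return wrap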
>>> @timeit(repeat=3, number=10)
... def convert(df, column_name):
...     return pd.to_datetime(df[column_name])

>>> # Read in again so that we have `object` dtype to start
>>> df['date_time'] = convert(df, 'date_time')
Best of 3 trials with 10 function calls per trial:
Function `convert` ran in average of 1.610 seconds.
How did it turn out? It took 1.6 seconds to process 8,760 rows of data, which sounds fine. But with a larger dataset, say minute-level readings used to compute a year's total cost, the volume grows 60-fold, meaning roughly 90 seconds of waiting. That starts to become unbearable.

In fact, the author's work requires analyzing hourly electricity data from 330 sites over the past 10 years. With the method above, converting the time column took 88 minutes.

Is there a faster way? In general, Pandas can convert your data much faster than this. In this case, passing the CSV file's specific time format to Pandas's to_datetime via the format parameter improves processing efficiency enormously.
>>> @timeit(repeat=3, number=100)
... def convert_with_format(df, column_name):
...     return pd.to_datetime(df[column_name],
...                           format='%d/%m/%y %H:%M')

Best of 3 trials with 100 function calls per trial:
Function `convert_with_format` ran in average of 0.032 seconds.
What's the new result? 0.032 seconds, a 50-fold speedup! So the conversion for those 330 sites now saves about 86 minutes.

One detail worth noting: the times in the CSV are not in ISO 8601 format (YYYY-MM-DD HH:MM). Without a format string, Pandas uses the dateutil package to parse every date string individually. Conversely, if the raw times were already in ISO 8601 format, Pandas could parse them very quickly.
[Note] Pandas's read_csv() method also provides parameters for parsing times; see its parse_dates, infer_datetime_format, and date_parser parameters.
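For example, here is a hedged sketch of doing the conversion at read time instead of afterwards (the file path is a placeholder, as above):

# Parse the date column while reading, instead of converting afterwards
df = pd.read_csv('file path',
                 parse_dates=['date_time'],
                 infer_datetime_format=True)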
Simple looping over the data
The dates and times are now formatted, so we are ready to start calculating electricity costs. Because the price differs by time period, each period's consumption needs to be mapped to the corresponding price. In this example, the tariff schedule is:

Usage type    Cents per kWh    Time period
Peak          28               17:00 to 24:00
Shoulder      20               7:00 to 17:00
Off-Peak      12               0:00 to 7:00

If the price were a flat 28 cents per kilowatt-hour at every hour of the day, most people know that the cost could be computed in one line of code:
>>> df['cost_cents'] = df['energy_kwh'] * 28
This line of code creates a new column containing the cost for each hour:
            date_time  energy_kwh  cost_cents
0 2013-01-01 00:00:00       0.586      16.408
1 2013-01-01 01:00:00       0.580      16.240
2 2013-01-01 02:00:00       0.572      16.016
3 2013-01-01 03:00:00       0.596      16.688
4 2013-01-01 04:00:00       0.592      16.576
However, our cost calculation depends on the time of day, because each period has its own rate. This is where many people do the computation the non-Pandas way: with a loop.

In this article we start with the most basic solution and then move step by step toward a Pythonic solution that takes full advantage of Pandas's performance.

But what counts as Pythonic when using Pandas? The irony is that programmers used to other, less expressive languages such as C++ or Java are accustomed to writing loops for everything.

If you are not familiar with Pandas, you would probably keep looping as before. Let's continue using the @timeit decorator to see how efficient this approach is.
First, create a function that returns the rate for a given hour:

def apply_tariff(kwh, hour):
    """Calculate the cost of electricity for a given hour."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError('Invalid hour: {}'.format(hour))
    return rate * kwh

Then the loop:

>>> # NOTE: don't try this approach at home!
>>> @timeit(repeat=3, number=100)
... def apply_tariff_loop(df):
...     """Calculate the costs with a loop and put the result in the df."""
...     energy_cost_list = []
...     for i in range(len(df)):
...         # Get the electricity used and the hour of day
...         energy_used = df.iloc[i]['energy_kwh']
...         hour = df.iloc[i]['date_time'].hour
...         energy_cost = apply_tariff(energy_used, hour)
...         energy_cost_list.append(energy_cost)
...     df['cost_cents'] = energy_cost_list

>>> apply_tariff_loop(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_loop` ran in average of 3.152 seconds.
For Python users who have never used Pandas, this loop looks perfectly normal: for each x, given condition y, compute z.

But the loop is clumsy. The example above can be regarded as a "negative example" of Pandas usage, for several reasons.
First, it needs to initialize a list to store the output.
Second, it loops over the opaque object range(0, len(df)); then, after applying apply_tariff(), it has to append each result to a list just to build the new DataFrame column.

Finally, it uses chained indexing, df.iloc[i]['date_time'], which can easily introduce bugs.

But the biggest problem with this loop is the time cost of the calculation: for 8,760 rows of data it took 3 seconds to finish. Next, let's look at some iteration approaches built on Pandas's own data structures.
Looping with .itertuples() and .iterrows()
Is there any other way?
Pandas actually makes the for i in range(len(df)) idiom unnecessary with its DataFrame.itertuples() and DataFrame.iterrows() methods. Both are generator methods that yield one row at a time.

.itertuples() yields a namedtuple for each row, with the row's index value as the first element of the tuple. A namedtuple is a data structure from Python's collections module that behaves like a Python tuple but has fields accessible as attributes.

.iterrows() yields an (index, Series) pair for each row of the DataFrame.

.itertuples() tends to run faster than .iterrows(). This example uses .iterrows(), since many readers may not have used namedtuple before.
>>> @timeit(repeat=3, number=100)
... def apply_tariff_iterrows(df):
...     energy_cost_list = []
...     for index, row in df.iterrows():
...         # Get the electricity used and the hour of day
...         energy_used = row['energy_kwh']
...         hour = row['date_time'].hour
...         # Append the cost to the list
...         energy_cost = apply_tariff(energy_used, hour)
...         energy_cost_list.append(energy_cost)
...     df['cost_cents'] = energy_cost_list

>>> apply_tariff_iterrows(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_iterrows` ran in average of 0.713 seconds.
That's real progress. The syntax is clearer, there are fewer references to the row index i, and the whole thing is more readable. In terms of time, it is almost five times faster!
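For comparison, here is a sketch of the same loop written with .itertuples(); this variant is assumed, not timed in the original benchmark. Each column becomes an attribute on the namedtuple row:

@timeit(repeat=3, number=100)
def apply_tariff_itertuples(df):
    energy_cost_list = []
    for row in df.itertuples():
        # Column values are exposed as namedtuple attributes
        energy_used = row.energy_kwh
        hour = row.date_time.hour
        energy_cost_list.append(apply_tariff(energy_used, hour))
    df['cost_cents'] = energy_cost_list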
However, there is still plenty of room for improvement. We are still using some form of for loop, which means a Python-level function call on every iteration, work that could instead happen inside Pandas's faster built-in machinery.
Pandas's .apply()
You can use the .apply() method instead of .iterrows() to improve efficiency. Pandas's .apply() takes a callable and applies it along an axis of the DataFrame, that is, across all rows or all columns. In this example, a lambda passes the two columns into apply_tariff():
>>> @timeit(repeat=3, number=100)
... def apply_tariff_withapply(df):
...     df['cost_cents'] = df.apply(
...         lambda row: apply_tariff(
...             kwh=row['energy_kwh'],
...             hour=row['date_time'].hour),
...         axis=1)

>>> apply_tariff_withapply(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_withapply` ran in average of 0.272 seconds.
The syntactic advantages of .apply() are obvious: fewer lines of code and better readability. In terms of speed, it cuts the .iterrows() time by more than half.

But that is still not fast enough. One reason is that .apply() tries internally to run the loop over Cython iterators, but the lambda here passes in things that cannot be handled in Cython, so the calls still run in Python and .apply() remains relatively slow.

If we used .apply() on the 10 years of data from 330 sites, it would take about 15 minutes. If this calculation were only a small part of a larger model, far more improvement would be needed. The vectorized operations below deliver it.
Filter data with .isin()
As we saw earlier, if there were only a single price, you could multiply the entire consumption column by it: df['energy_kwh'] * 28. That multiplication is a vectorized operation, and vectorized operations are the fastest way to work in Pandas.

But how can conditional logic be applied as vectorized operations in Pandas? One approach is to select and slice the DataFrame according to the conditions and then apply a vectorized operation to each slice.

The following example shows how to filter rows with Pandas's .isin() method and then compute the appropriate cost with a vectorized operation. Before doing so, setting the date_time column as the DataFrame's index makes the vectorized work easier:
df.set_index('date_time', inplace=True)

@timeit(repeat=3, number=100)
def apply_tariff_isin(df):
    # Define Boolean arrays for each hour range
    peak_hours = df.index.hour.isin(range(17, 24))
    shoulder_hours = df.index.hour.isin(range(7, 17))
    off_peak_hours = df.index.hour.isin(range(0, 7))

    # Apply the tariff for each period
    df.loc[peak_hours, 'cost_cents'] = df.loc[peak_hours, 'energy_kwh'] * 28
    df.loc[shoulder_hours, 'cost_cents'] = df.loc[shoulder_hours, 'energy_kwh'] * 20
    df.loc[off_peak_hours, 'cost_cents'] = df.loc[off_peak_hours, 'energy_kwh'] * 12
The implementation results are as follows:
>>> apply_tariff_isin(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_isin` ran in average of 0.010 seconds.
To understand this code, note that the .isin() method returns an array of Boolean values, like this:

[False, ..., True, True, True]

These Boolean values mark which tariff period each entry of the DataFrame's datetime index falls into. Passing such a Boolean array to the DataFrame's .loc indexer returns a slice of the DataFrame containing only the rows in that period. Finally, each slice is multiplied by the corresponding rate.
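As a standalone illustration of this mechanism, here is a tiny example on made-up values (the timestamps and numbers are hypothetical):

>>> s = pd.Series([5, 10, 15], index=pd.to_datetime(
...     ['2013-01-01 03:00', '2013-01-01 12:00', '2013-01-01 20:00']))
>>> mask = s.index.hour.isin(range(17, 24))
>>> mask
array([False, False,  True])
>>> s.loc[mask]
2013-01-01 20:00:00    15
dtype: int64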
How does this compare with the loop-based approaches?

First, the apply_tariff() function is no longer needed, because all the conditional logic is applied through row selection. This greatly reduces the number of lines of code.

In terms of speed, it is 315 times faster than the plain loop, 71 times faster than .iterrows(), and 27 times faster than .apply(). Now you can tear through big datasets.
Is there any room for improvement?
In apply_tariff_isin() we still have to call df.loc and df.index.hour.isin() manually three times. If, say, every one of the 24 hours had its own rate, we would need 24 separate .isin() calls, so this scheme does not scale well. Fortunately, Pandas's pd.cut() can handle it all:
@timeit(repeat=3, number=100)
def apply_tariff_cut(df):
    cents_per_kwh = pd.cut(x=df.index.hour,
                           bins=[0, 7, 17, 24],
                           include_lowest=True,
                           labels=[12, 20, 28]).astype(int)
    df['cost_cents'] = cents_per_kwh * df['energy_kwh']
pd.cut() assigns each hour to an interval defined by bins and attaches the corresponding label, here the rate in cents.

[Note] The include_lowest parameter sets whether the first interval's lower edge is included, i.e. whether data at hour 0 falls into the first bin.
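To see what pd.cut() produces here, a small hypothetical check on three sample hours:

>>> pd.cut(x=pd.Series([3, 10, 20]), bins=[0, 7, 17, 24],
...        include_lowest=True, labels=[12, 20, 28])
0    12
1    20
2    28
dtype: category
Categories (3, int64): [12 < 20 < 28]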
This is a fully vectorized operation, and its execution speed has taken off:
>>> apply_tariff_cut(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_cut` ran in average of 0.003 seconds.
At this point, the processing time for the 330 sites has dropped from 88 minutes to under a second. But there is one final option: using the NumPy library directly to manipulate the NumPy arrays that underlie the DataFrame, then folding the results back into the DataFrame.
And NumPy!
Don't forget that Pandas's Series and DataFrame are built on the NumPy library. That gives you extra flexibility, because Pandas operations and NumPy arrays interoperate seamlessly.
The next example demonstrates NumPy's digitize() function. Like Pandas's cut(), it bins the data: here the hours of the DataFrame's datetime index are sorted into three bins, one per tariff period. The resulting bin numbers are then used to index into an array of prices, which is multiplied by the consumption array:
import numpy as np

@timeit(repeat=3, number=100)
def apply_tariff_digitize(df):
    prices = np.array([12, 20, 28])
    bins = np.digitize(df.index.hour.values, bins=[7, 17, 24])
    df['cost_cents'] = prices[bins] * df['energy_kwh'].values
Like cut(), the syntax is easy to read. But how about the speed?
>>> apply_tariff_digitize(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_digitize` ran in average of 0.002 seconds.
Execution speed improves again, but the gain is no longer meaningful; you might as well spend your energy on other things.

Pandas offers a range of options for batch-processing data, all demonstrated above. Ranked from fastest to slowest, the methods are:
1. Use vectorized operations: Pandas methods and functions with no for loops.

2. Use the .apply() method.

3. Use .itertuples(): iterate over DataFrame rows as namedtuples from Python's collections module.

4. Use .iterrows(): iterate over DataFrame rows as (index, pd.Series) pairs. Although a Pandas Series is a flexible data structure, creating a Series for every row and then accessing its fields is a lot of overhead.

5. Loop over the elements one by one, processing each cell or row with df.loc or df.iloc.
[note] the above order is not my suggestion, but the advice given by the core Pandas developers.
To wrap up, here is a summary of the run times of the functions benchmarked in this article:

Function                  Average run time (seconds)
apply_tariff_loop         3.152
apply_tariff_iterrows     0.713
apply_tariff_withapply    0.272
apply_tariff_isin         0.010
apply_tariff_cut          0.003
apply_tariff_digitize     0.002
Using HDFStore to store preprocessed data
Now that you've seen fast approaches to data processing in Pandas, let's explore how to avoid reprocessing data altogether, using Pandas's built-in HDFStore.

When building complex data models, it is common to preprocess your data. For example, suppose you have 10 years of minute-frequency data, but your model only needs data at 20-minute (or some other lower) frequency. You don't want to redo that preprocessing every time you test the model.
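As a hypothetical sketch of such a downsampling step (the frequencies and values here are made up for illustration):

import pandas as pd

# Hypothetical minute-frequency data: one day of made-up readings
minute_index = pd.date_range('2013-01-01', periods=60 * 24, freq='T')
df_minutes = pd.DataFrame({'energy_kwh': 0.01}, index=minute_index)

# Downsample once to 20-minute totals, then reuse the result
df_20min = df_minutes.resample('20T').sum()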
One solution is to store the preprocessed data in a processed-data table so it can be reloaded whenever needed. But how do you store it in the right format? If you saved the preprocessed data as CSV, the datetime type would be lost, and you would have to convert the column all over again on every read.

Pandas has a built-in solution that uses HDF5, a high-performance storage format designed for storing arrays. Pandas's HDFStore keeps a DataFrame in an HDF5 file that can be read and written efficiently while preserving the dtypes and other metadata of every DataFrame column. It is a dictionary-like class, so you can read and write to it much like a Python dict.
Here is how to write the preprocessed power consumption DataFrame to the HDF5 file:
# Create a storage object with the filename `processed_data.h5`
data_store = pd.HDFStore('processed_data.h5')

# Put the DataFrame into the store, under the key 'preprocessed_df'
data_store['preprocessed_df'] = df
data_store.close()
Once the data is stored on disk, the preprocessed data can be loaded whenever needed, with no repeated processing. Here is how to read the data back from the HDF5 file while preserving the dtypes created during preprocessing:
# Access the data store
data_store = pd.HDFStore('processed_data.h5')

# Retrieve the DataFrame stored under the key 'preprocessed_df'
preprocessed_df = data_store['preprocessed_df']
data_store.close()
A single data store can hold multiple tables, each under its own key.
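For example, a sketch storing two tables under different keys (the second key and its DataFrame are hypothetical, taken from the downsampling sketch above):

data_store = pd.HDFStore('processed_data.h5')
data_store['preprocessed_df'] = df    # the hourly data from this article
data_store['df_20min'] = df_20min     # a hypothetical downsampled table
print(data_store.keys())              # lists '/preprocessed_df' and '/df_20min'
data_store.close()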
[Note] Pandas's HDFStore requires PyTables >= 3.0.0, so after installing Pandas you may need to upgrade PyTables:
pip install --upgrade tables
Summary
If you don't think your Pandas project is fast, flexible, simple, and intuitive, it's time to rethink the way you use the library.
This tutorial has shown quite concretely that using Pandas correctly can greatly improve both run time and code readability. Here are some rules of thumb for working with Pandas:
① Prefer vectorized operations, and avoid constructs like for x in df. If your code still needs many for loops, consider working with native Python data structures instead, since Pandas otherwise adds a lot of overhead.

② If a vectorized operation is impossible because of the complexity of the algorithm, try the .apply() method.

③ If you must loop over the array, use .itertuples() or .iterrows() to improve both syntax and speed.

④ Pandas gives you many options; there are usually several ways to get from A to B. Compare how different methods perform and choose the one that best fits your project.

⑤ Once your data-processing script works, save the preprocessed intermediate output in an HDFStore to avoid reprocessing the data.

⑥ In Pandas projects, reaching for NumPy can increase speed and simplify syntax.
After reading the above, do you have a better understanding of how to use the Pandas library in Python correctly to improve a project's running speed? I hope you found something useful here, and thank you for reading.