Common Mistakes Pandas Beginners Make

This article walks through the most common mistakes Pandas beginners make. The points are practical, so they are shared here for reference.
Reading large files with Pandas
The first mistake concerns how Pandas is actually used for certain tasks. Specifically, the datasets we work with in practice are often very large, and reading them with Pandas' read_csv will be your biggest mistake.
Why? Because it is too slow! In this test we load the TPS October dataset, which has 1M rows and about 300 features and takes up 2.2GB on disk.
import pandas as pd

%%time
tps_october = pd.read_csv("data/train.csv")
------------------------------------------------------------
Wall time: 21.8 s
read_csv takes about 22 seconds. You might say that 22 seconds is not much, but in a project many experiments are run at different stages: we create separate scripts for cleaning, feature engineering, model selection, and other tasks, and waiting 20 seconds for the data to load every single time adds up. On top of that, the dataset may grow larger and take even longer. So what is a faster solution?
The solution is to drop Pandas at this stage and use an alternative designed for fast IO. My favorite is datatable, but you can also choose Dask, Vaex, cuDF, and so on. Here is how long it takes to load the same dataset with datatable:
import datatable as dt  # pip install datatable

%%time
tps_dt_october = dt.fread("data/train.csv").to_pandas()
------------------------------------------------------------
Wall time: 2 s
Only two seconds, roughly a tenfold difference.
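If you prefer to stay closer to the Pandas API, a Dask version looks roughly like the sketch below (assuming dask[dataframe] is installed and the same data/train.csv path). Dask builds a lazy task graph over the file and only materializes a regular pandas DataFrame when compute() is called.

import dask.dataframe as dd  # pip install "dask[dataframe]"

# Lazy read: Dask splits the CSV into partitions and reads them in parallel
tps_dask = dd.read_csv("data/train.csv")
# Materialize into an ordinary pandas DataFrame only when we actually need it
tps_october = tps_dask.compute()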
Not vectorizing
One of the most important rules of functional programming is never to use loops. It seems that adhering to this "no loop" rule when using Pandas is the best way to speed up calculations.
Functional programming replaces loops with recursion. Although recursion can present problems (which we won't consider here), vectorization is the best option for scientific computation!
Vectorization is at the heart of Pandas and NumPy: mathematical operations are applied to entire arrays rather than to individual scalars. Pandas already ships with an extensive set of vectorized functions, so we don't need to reinvent the wheel; we just need to express our calculations in terms of them.
The basic Python arithmetic operators (+, -, *, /, **) work in a vectorized way on Pandas columns, and any other mathematical function available in Pandas or NumPy is vectorized as well.
To verify the speed increase, we will use the following big_function, which takes three columns as input and performs some meaningless arithmetic as a test:
import numpy as np

def big_function(col1, col2, col3):
    return np.log(col1 ** 10 / col2 ** 9 + np.sqrt(col3 ** 3))
First, we use this function with Pandas' fastest iterator, apply:
%time tps_october['f1000'] = tps_october.apply(
    lambda row: big_function(row['f0'], row['f1'], row['f2']), axis=1
)
-------------------------------------------------
Wall time: 20.1 s
The operation takes about 20 seconds. Now let's do the same thing in a vectorized way on the underlying NumPy arrays:
%time tps_october['f1001'] = big_function(tps_october['f0'].values,
                                          tps_october['f1'].values,
                                          tps_october['f2'].values)
------------------------------------------------------------------
Wall time: 82 ms
It took 82 milliseconds, about 250 times faster.
In practice we cannot completely abandon loops, since not every data operation is a mathematical one. But whenever you find yourself reaching for a loop-like function (such as apply, applymap, or itertuples), it is a good habit to pause and check whether what you want to do can be vectorized.
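As a small sketch of what that means in practice (the column names f0 and f1 are just placeholders from the dataset above), a conditional that looks like it needs apply can often be rewritten with np.where:

import numpy as np

# Loop-style version: calls the lambda once per row
tps_october["flag"] = tps_october.apply(
    lambda row: 1 if row["f0"] > row["f1"] else 0, axis=1
)

# Vectorized version: a single comparison over the whole columns
tps_october["flag"] = np.where(tps_october["f0"] > tps_october["f1"], 1, 0)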
Data types, dtypes!
We can cut memory usage considerably just by specifying appropriate data types.
The worst and most memory-hungry data type in Pandas is object, which also limits some of Pandas' functionality. Beyond that, we have floating-point and integer types. In Pandas' naming scheme, the number after a numeric type name (int8, int16, int32, int64, float16, float32, float64) indicates how many bits of memory each value of that type occupies. The idea, then, is to convert every column in the dataset to the smallest subtype that can still hold its values; all we have to do is check each column's range against the limits of each subtype.
In general, floating-point columns are converted to float16/32 and columns containing both positive and negative integers are converted to int8/16/32, depending on their range. You can also use uint8 for booleans and for columns that contain only positive integers to reduce memory consumption even further.
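Before reaching for a full conversion function, a quick way to see the effect on a single column is pd.to_numeric with its downcast argument; the sketch below uses placeholder column names from the dataset above.

# Per-column memory usage in bytes (deep=True also counts the strings in object columns)
tps_october.memory_usage(deep=True)

# Downcast one column at a time: pandas picks the smallest subtype that fits the values
tps_october["f0"] = pd.to_numeric(tps_october["f0"], downcast="float")    # float32 at smallest
tps_october["f1"] = pd.to_numeric(tps_october["f1"], downcast="integer")  # smallest signed int, if the values are integers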
The following function should look familiar, as it is widely used on Kaggle; it converts floating-point and integer columns to their smallest suitable subtype:
def reduce_memory_usage(df, verbose=True):
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df

Let's use it on the TPS October data and see how much we can reduce:

>>> reduce_memory_usage(tps_october)
Mem. usage decreased to 509.26 Mb (76.9% reduction)
We compressed the dataset from 2.2GB to 510MB. Unfortunately, this reduction is lost when we save the DataFrame back to CSV, because CSV stores everything as text; it is preserved, however, if we save it with pickle.
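A minimal sketch of that round trip (the file name is just an example): pickle, feather and parquet all store the dtypes, so the reduced types survive a save and reload, whereas to_csv would turn everything back into text.

tps_october.to_pickle("data/train_reduced.pkl")   # dtypes are stored in the file
tps_october = pd.read_pickle("data/train_reduced.pkl")
tps_october.dtypes                                # still the downcast int8/float16/... types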
Why reduce the memory footprint? Memory consumption matters a great deal when such datasets are fed into large machine learning models. Once you have hit a few OutOfMemory errors, you start picking up tricks like this to keep the machine happy (after all, Kaggle only gives you 16GB of RAM, so these tricks were born of necessity).
Not using styling
One of Pandas' nicest features is its ability to apply different styles when displaying a DataFrame, rendering it as an HTML table with some CSS inside Jupyter.
Pandas lets you style a DataFrame via its style attribute.
tps_october.sample(20, axis=1).describe().T.style.bar(
    subset=["mean"], color="#205ff2"
).background_gradient(subset=["std"], cmap="Reds").background_gradient(
    subset=["50%"], cmap="coolwarm"
)
We randomly select 20 columns, compute summary statistics for them, transpose the result, and color the mean, standard deviation, and median columns according to their magnitude. Styling like this makes it easier to spot patterns in the raw numbers without reaching for an additional visualization library.
In fact, there is nothing wrong with not styling a DataFrame. But it really is a nice feature, right?
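A couple of other Styler helpers worth knowing, shown here as a rough sketch on the same summary table as above:

summary = tps_october.sample(20, axis=1).describe().T

# Round the displayed numbers and highlight the largest value in the "max" column
summary.style.format("{:.3f}").highlight_max(subset=["max"], color="lightcoral")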
Saving files in CSV format
Just as reading CSV files is slow, so is saving data back to them. Here is how long it takes to save TPS October data to CSV:
%%time
tps_october.to_csv("data/copy.csv")
------------------------------------------
Wall time: 2min 43s
It took almost three minutes. To save time you can save them as parquet, feather or even pickle.
%%time
tps_october.to_feather("data/copy.feather")
------------------------------------------
Wall time: 1.05 s
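Parquet works the same way, as a quick sketch (it needs pyarrow or fastparquet installed; the path is just an example):

tps_october.to_parquet("data/copy.parquet")        # compressed, columnar, keeps dtypes
tps_october = pd.read_parquet("data/copy.parquet")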
Not reading the documentation
In fact, the most serious mistake, for me, is not reading the Pandas documentation. Normally nobody reads documentation, right? Sometimes we would rather spend hours searching the internet than read the docs.
But when it comes to Pandas, that is a big mistake, because like sklearn it has an excellent user guide covering everything from the basics to how to contribute code, and even how to set up beautiful themes (which is perhaps exactly why so few people read it).
Every mistake I have mentioned today is covered in the documentation. The section on scaling to large datasets even explicitly tells you to use other packages (such as Dask) for reading large files instead of Pandas. If I had the time to read the user guide from beginning to end, I could probably list 50 beginner mistakes, so check the documentation instead.