How to use Python to process datasets

2025-07-13 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

This article explains how to use Python to process datasets efficiently. Many people run into slow code when they loop over a dataset row by row in practice, so the sections below walk through faster ways to handle these situations.

Pandas is a gift to the data science community. Ask any data scientist how they like to use Python to process their data sets, and they will no doubt talk about Pandas.

Pandas is the epitome of a great programming library: simple, intuitive, and versatile.

However, even routine data science work can require thousands or even millions of calculations, and doing those efficiently with Pandas is still a challenge. You can't just load the data, write a Python for loop, and expect it to finish in a reasonable amount of time.

Pandas is designed for vectorized operations on entire rows or columns at once; iterating through each cell, row, or column is not what the library was built for. When using Pandas, you should therefore think in terms of matrix operations, which are highly parallelizable.

This guide shows how to use Pandas the way it was designed to be used: with matrix operations. Along the way, I'll show you some practical time-saving tips and tricks that will make your Pandas code run much faster than those scary Python for loops!
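
To make the contrast concrete, here is a minimal sketch of the same arithmetic written cell by cell and as a single column operation. The tiny DataFrame and its values are stand-ins chosen for illustration, not part of the original article.

```python
import pandas as pd

# Small stand-in frame; the values are illustrative only.
df = pd.DataFrame({"petal_length": [1.4, 4.7, 6.0]})

# Cell by cell: one Python-level operation per row.
doubled_loop = [df["petal_length"].iloc[i] * 2 for i in range(len(df))]

# Vectorized: a single operation applied to the whole column at once.
doubled_vec = df["petal_length"] * 2

print(doubled_vec.tolist())
```

Both forms produce the same values; the vectorized form simply hands the whole column to Pandas in one call.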

Set up

In this tutorial, we will use the classic Iris dataset. We start by using seaborn to load the dataset and print out the first five rows.

Now let's establish a baseline by measuring the speed of a plain Python for loop. We will loop through each row, perform a calculation on it, and measure the time taken by the whole operation. This gives us a benchmark against which to judge how much each optimization speeds things up.

In the above code, we created a basic function that uses an if-else statement to select the class of the flower based on its petal length. We then wrote a for loop that applies this function to each row of the DataFrame, and measured the total elapsed time of the loop.

On my i7-8700k computer, the loop takes an average of 0.01345 seconds over five runs.
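
The loading step might look like the sketch below. It assumes seaborn is installed and can fetch its bundled "iris" dataset; the helper name load_iris and the small hand-typed fallback sample are additions for illustration, included only so the snippet still runs offline.

```python
import pandas as pd

def load_iris():
    """Load Iris through seaborn; fall back to a few hand-typed rows (hypothetical helper)."""
    try:
        import seaborn as sns
        return sns.load_dataset("iris")  # fetched and cached on first use
    except Exception:
        # Offline fallback: a handful of rows in the same column layout.
        return pd.DataFrame(
            {
                "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0, 7.0, 6.3],
                "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6, 3.2, 3.3],
                "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4, 4.7, 6.0],
                "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2, 1.4, 2.5],
                "species": ["setosa"] * 5 + ["versicolor", "virginica"],
            }
        )

data = load_iris()
print(data.head())  # first five rows
```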
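
A minimal sketch of such a baseline follows. The thresholds 2 and 5 in compute_class are assumed for illustration, since the text only says the class depends on petal length, and the short petal_length column is a stand-in for the Iris data so the snippet runs on its own.

```python
import time
import pandas as pd

# Stand-in for the Iris frame; only the petal_length column is needed here.
data = pd.DataFrame({"petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1]})

def compute_class(petal_length):
    """Map a petal length to a class number (thresholds 2 and 5 are assumed)."""
    if petal_length <= 2:
        return 1
    elif petal_length <= 5:
        return 2
    else:
        return 3

start = time.perf_counter()
class_list = []
for i in range(len(data)):
    petal_length = data.iloc[i]["petal_length"]  # positional row lookup on every pass
    class_list.append(compute_class(petal_length))
elapsed = time.perf_counter() - start

print(class_list)
print(f"for loop: {elapsed:.6f} s")
```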

Use .iterrows() to implement the loop

The simplest yet most valuable speedup we can apply right away is Pandas's built-in .iterrows() function.

When we wrote the for loop in the previous section, we used the range() function. However, when iterating over a large range of values in Python, generators tend to be much faster, because they produce each value on demand instead of building the whole sequence up front. (In Python 3, range() is itself lazy, so much of the baseline's cost likely comes from the positional row lookup performed on every iteration.) This article (https://towardsdatascience.com/5-advancedfeaturesof-python-and-how-use-them-73bffa373c84) explains in more detail how generators work and why they speed things up.

The .iterrows() function in Pandas is implemented as a generator that yields one row of data per iteration. More precisely, .iterrows() yields an (index, Series) tuple for each row in the DataFrame. This is much like using enumerate() in plain Python, but it runs much faster than the range()-based loop.

Next we modify the code to use .iterrows() instead of the regular for loop. On the same machine I used in the last section, the average running time was 0.005892 seconds, a 2.28x speedup!
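
As a small illustration of what "generating" values means, the sketch below defines a generator and consumes it step by step. The function name count_up_to is hypothetical and used only for this example.

```python
def count_up_to(n):
    """Yield 0, 1, ..., n-1 one at a time instead of building a list."""
    i = 0
    while i < n:
        yield i
        i += 1

gen = count_up_to(3)
first = next(gen)   # the body runs only as far as the first yield
rest = list(gen)    # consuming the generator produces the remaining values
print(first, rest)
```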
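
To see the (index, Series) pairs directly, here is a short sketch on a two-row stand-in frame (the values are illustrative):

```python
import pandas as pd

data = pd.DataFrame(
    {"petal_length": [1.4, 4.7], "species": ["setosa", "versicolor"]}
)

rows = data.iterrows()       # nothing is produced until we ask for a row
index, row = next(rows)      # first (index, Series) pair
print(index, type(row).__name__, row["petal_length"])
```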
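
A sketch of the modified loop might look like this. As before, the thresholds in compute_class are assumed and the short petal_length column stands in for the Iris data.

```python
import time
import pandas as pd

# Stand-in for the Iris frame; only the petal_length column is needed here.
data = pd.DataFrame({"petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1]})

def compute_class(petal_length):
    """Map a petal length to a class number (thresholds 2 and 5 are assumed)."""
    if petal_length <= 2:
        return 1
    elif petal_length <= 5:
        return 2
    else:
        return 3

start = time.perf_counter()
class_list = []
for index, row in data.iterrows():   # each step yields an (index, Series) pair
    class_list.append(compute_class(row["petal_length"]))
elapsed = time.perf_counter() - start

print(class_list)
print(f".iterrows(): {elapsed:.6f} s")
```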

Use .apply() to completely discard the loop

The .iterrows() function gives a substantial speedup, but we can do much better. Always keep in mind that when you use a library designed for vector operations, there is probably a way to accomplish the task efficiently without any for loops at all.

The Pandas function that provides this functionality is .apply(). The .apply() function takes another function as its input and applies it along an axis of the DataFrame (rows, columns, and so on). When that function needs arguments passed in, a lambda usually makes it easy to package everything together.

In the following code, we have completely replaced the for loop with .apply() and a lambda function that encapsulates the calculation we want. On my machine, the average running time of this code is 0.0020897 seconds, 6.44 times faster than the original for loop.

.apply() is much faster because it internally tries to loop over Cython iterators. If your function happens to be well suited to Cython optimization, .apply() will give you an additional speedup. As a bonus, using built-in functions produces cleaner, more readable code.
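
A sketch of that replacement, under the same assumptions as before (assumed thresholds in compute_class, stand-in petal_length values):

```python
import time
import pandas as pd

# Stand-in for the Iris frame; only the petal_length column is needed here.
data = pd.DataFrame({"petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1]})

def compute_class(petal_length):
    """Map a petal length to a class number (thresholds 2 and 5 are assumed)."""
    if petal_length <= 2:
        return 1
    elif petal_length <= 5:
        return 2
    else:
        return 3

start = time.perf_counter()
# axis=1 passes each row to the lambda as a Series.
class_list = data.apply(lambda row: compute_class(row["petal_length"]), axis=1)
elapsed = time.perf_counter() - start

print(class_list.tolist())
print(f".apply(): {elapsed:.6f} s")
```

The result is a Series aligned with the DataFrame's index, so it can be assigned straight back as a new column.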

Finally, use cut

As I mentioned earlier, if you are using a library designed for vectorized operations, you should always look for a way to do the calculation without any for loops.

Likewise, many libraries designed this way, Pandas included, have convenient built-in functions that perform exactly the calculation you are looking for, only faster.

The pd.cut() function of Pandas takes a set of bins, which define the range covered by each if-else branch, and a set of labels, which define the value to return for each range. It then performs exactly the same operation that we wrote by hand in the compute_class() function.

Take a look at the following code to see how pd.cut() works. Once again, the code is cleaner and more readable. Finally, the pd.cut() version runs in an average of 0.001423 seconds, roughly 9.4 times faster than the original for loop!
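
A sketch of the cut-based version follows. The bin edges 0, 2, 5, and 100 are assumed for illustration (100 is just an upper bound above any petal length), and the petal_length values are stand-ins for the Iris column.

```python
import time
import pandas as pd

# Stand-in for the Iris frame; only the petal_length column is needed here.
data = pd.DataFrame({"petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1]})

start = time.perf_counter()
# Bins are right-inclusive by default: [0, 2] -> 1, (2, 5] -> 2, (5, 100] -> 3.
class_list = pd.cut(
    x=data["petal_length"],
    bins=[0, 2, 5, 100],
    include_lowest=True,
    labels=[1, 2, 3],
).astype(int)
elapsed = time.perf_counter() - start

print(class_list.tolist())
print(f"pd.cut(): {elapsed:.6f} s")
```

Because the bins are right-inclusive, a value exactly on an edge (such as 2 or 5) falls into the lower class, so a hand-written if-else chain needs the same <= comparisons to match.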
