How to use pandas to analyze large datasets

2025-01-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article mainly introduces how to use pandas to analyze large datasets. It is quite detailed and has practical reference value; interested readers should find it worth reading.

1. Brief introduction

Although pandas is a very popular data analysis tool, many users complain that pandas feels "slow" and its memory overhead is "large" when they use it to process large datasets.

In particular, many students are discouraged when they try to handle large datasets on an ordinary laptop. In fact, as long as you master a few pandas techniques, a machine with a modest configuration is perfectly capable of analyzing large datasets.

Figure 1

Taking a real dataset and an ordinary laptop with 16 GB of RAM as an example, this article demonstrates a series of strategies for analyzing large datasets with pandas quickly and economically.

2. Pandas strategies for speed and memory savings

The dataset we use comes from the "TalkingData AdTracking Fraud Detection Challenge" competition on Kaggle (https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection). We use its training set, a 7.01 GB csv file.

Let's explore the balance between memory overhead and computation time step by step. First, we use pandas's read_csv() to read train.csv without any optimization:

import pandas as pd

raw = pd.read_csv('train.csv')

# view the DataFrame's memory usage
raw.memory_usage(deep=True)

Figure 2

We can see that reading the whole dataset took nearly three minutes, and the temporary variables created along the way nearly exhausted our 16 GB of RAM.

In this state, further analysis is impossible, because even a small operation can exhaust memory through the many temporary variables created in the intermediate steps and bring the machine down. So the first thing we need to do is reduce the memory occupied by the DataFrame:

(1) Specify data types to save memory

By default, pandas does not optimize memory cost for you when it infers the data type of each field while reading the dataset. For example, let's use the nrows parameter to read only the first 1000 rows of the dataset and see what type each field is assigned:

raw = pd.read_csv('train.csv', nrows=1000)
raw.info()

Figure 3

No wonder the dataset takes so much memory to read: all integer columns are stored as int64. In fact, the value range of each integer field in the original dataset does not require such high precision, so we use the dtype parameter to reduce the numerical precision of some fields:

raw = pd.read_csv('train.csv', nrows=1000,
                  dtype={
                      'ip': 'int32',
                      'app': 'int16',
                      'device': 'int16',
                      'os': 'int16',
                      'channel': 'int16',
                      'is_attributed': 'int8'
                  })
raw.info()

Figure 4

You can see that after adjusting the data precision, the memory footprint of the first 1000 rows is reduced by nearly 54.6%, a considerable improvement. Applying the same method, we read the whole dataset and view its info():
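A minimal sketch of that full read, simply dropping nrows from the call above and keeping the same column types (the remaining columns, such as click_time, are left for pandas to infer):

raw = pd.read_csv('train.csv',
                  dtype={
                      'ip': 'int32',
                      'app': 'int16',
                      'device': 'int16',
                      'os': 'int16',
                      'channel': 'int16',
                      'is_attributed': 'int8'
                  })
raw.info()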

Figure 5

We can see that with the data precision optimized, the dataset's memory footprint is reduced considerably, which makes further analysis, such as a grouped count, much easier:

# group count by app and os
(raw
 .groupby(['app', 'os'])
 .agg({'ip': 'count'}))

Figure 6

But if the dataset's data types cannot be optimized, is there still a way to complete the computation and analysis without running out of memory?

(2) Read only the required columns

If our analysis does not need all the columns of the original dataset, there is no need to read them all; we can use the usecols parameter to specify the field names to read in:

raw = pd.read_csv('train.csv', usecols=['ip', 'app', 'os'])
raw.info()

Figure 7

As you can see, even without optimizing the data precision, the DataFrame read in is only 4.1 GB. Combined with the data precision optimization, the effect is even better:
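A minimal sketch of combining the two optimizations, assuming the same three columns and the integer types used above:

raw = pd.read_csv('train.csv',
                  usecols=['ip', 'app', 'os'],
                  dtype={'ip': 'int32', 'app': 'int16', 'os': 'int16'})
raw.info()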

Figure 8

In some cases, even after optimizing the data precision and filtering the columns to read, the data volume may still be too large. We can then process the data chunk by chunk:

(3) Read and analyze the data in chunks

Using the chunksize parameter, we can create a chunked read IO stream over the dataset that reads at most chunksize rows at a time. This lets us split the task over the entire dataset into many small tasks and then aggregate the results:

from tqdm.notebook import tqdm

# reduce data precision, keep only the needed columns,
# and read in chunks of 10 million rows
raw = pd.read_csv('train.csv',
                  dtype={'ip': 'int32', 'app': 'int16', 'os': 'int16'},
                  usecols=['ip', 'app', 'os'],
                  chunksize=10000000)

# iterate over the chunks, group and count within each chunk,
# then concatenate the partial results and aggregate them
result = \
    (pd
     .concat([chunk
              .groupby(['app', 'os'], as_index=False)
              .agg({'ip': 'count'}) for chunk in tqdm(raw)])
     .groupby(['app', 'os'])
     .agg({'ip': 'sum'}))

result

Figure 9

As you can see, with the chunked-read strategy we keep the memory load low from start to finish while still completing the required analysis. In the same spirit, if you find the chunked processing above a bit cumbersome, here is an even bigger trick:

(4) Use dask instead of pandas for data analysis

Many readers have probably heard of dask. Its idea is quite similar to the chunked processing above, but it is more concise, schedules system resources more intelligently, and scales easily from a single machine to a cluster.

Figure 10

It is recommended to install the dask components with conda install dask. After installation, we only need to replace import pandas as pd with import dask.dataframe as dd; most of the common pandas API methods are available in compatible form, so the code converts almost seamlessly:
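A minimal sketch of the dask version of the read, mirroring the pandas call above (same assumed columns and integer types):

import dask.dataframe as dd

# dask builds a task graph instead of loading the file eagerly,
# so this call returns almost immediately
raw = dd.read_csv('train.csv',
                  dtype={'ip': 'int32', 'app': 'int16', 'os': 'int16'},
                  usecols=['ip', 'app', 'os'])
raw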

Figure 11

You can see that the whole read step took only 313 ms. Of course, the data is not actually read into memory at this point; dask uses lazy loading, which is exactly what lets it handle datasets that are larger than memory.

Next, we just write code as if we were manipulating a pandas data object and append .compute() at the end; dask then carries out the actual computation based on the task graph it has built:

# group count by app and os
(raw
 .groupby(['app', 'os'])
 .agg({'ip': 'count'})
 .compute())  # trigger the computation graph

And dask schedules system resources very intelligently, easily keeping all CPU cores busy:

Figure 12
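One optional way to watch this happen on a single machine, assuming dask's default local scheduler, is to wrap the computation in dask's ProgressBar diagnostic:

from dask.diagnostics import ProgressBar

# show a progress bar while the local scheduler runs
# the task graph across the available CPU cores
with ProgressBar():
    result = (raw
              .groupby(['app', 'os'])
              .agg({'ip': 'count'})
              .compute())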

You can learn more about dask on the official website (https://docs.dask.org/en/latest/).

Figure 13

That is all of the article "How to use pandas to analyze large datasets". Thank you for reading! We hope the content helps you; for more related knowledge, welcome to follow the industry information channel!
