This article introduces "DataFrame memory optimization methods for Python data analysis". In real projects many people run into exactly this kind of memory trouble, so let me walk you through how to handle these situations. I hope you read it carefully and get something out of it!
To set the scene first: pandas has no problem handling a DataFrame of a few hundred megabytes, but once the data grows to several gigabytes or more it eats up a great deal of memory, which is especially painful on machines with little RAM, so it becomes necessary to compress the data.
1. Checking how much memory the data occupies in pandas
First, let's see how much memory the data occupies (user_log is the name of the DataFrame):
# Method 1: use the info() command
user_log.info()

# Method 2: use memory_usage() or sys.getsizeof(user_log)
import time
import sys
print("all_data occupies memory: {:.2f} GB".format(user_log.memory_usage().sum() / (1024 ** 3)))
print("all_data occupies memory (approx.): {:.2f} GB".format(sys.getsizeof(user_log) / (1024 ** 3)))
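One caveat worth knowing: for object (string) columns, memory_usage() only counts pointer overhead by default; passing deep=True makes pandas inspect the actual string contents and report a truer (usually larger) figure. A minimal sketch, assuming the same user_log DataFrame:

# deep=True walks the Python objects inside object columns,
# so the number reflects what the strings actually cost
print("deep memory usage: {:.2f} GB".format(user_log.memory_usage(deep=True).sum() / (1024 ** 3)))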
I have a data file called user_log whose original size on disk is 1.91 GB; after pandas reads it in, it occupies 2.9 GB of memory.
Original data size: 1.91 GB
Memory consumption after pandas reads it: 2.9 GB
2. Compress the data
Downcast numeric columns ('int16', 'int32', 'int64', 'float16', 'float32', 'float64')
Convert string columns to the category type
When a string column has more unique values than half the total number of rows, it is better to keep it as the object type (see the sketch after this list)
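The reduce_mem_usage function shown later only handles numeric columns, so here is a minimal sketch of the string-to-category rule above; df stands in for your own DataFrame (e.g. user_log):

# Convert a string column to category only when its unique values
# are fewer than half the row count; otherwise object stays cheaper
for col in df.select_dtypes(include="object").columns:
    if df[col].nunique() < 0.5 * len(df):
        df[col] = df[col].astype("category")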
Here we mainly downcast the numeric data. What does downcasting mean? Think of it as a drawer: you have a big drawer but only a key to store in it, so a lot of space goes to waste; if you put the key in a small drawer instead, you save a lot of room. In the same way, the int32 type takes up much more space than int8, but when int8 is already enough for our data, storing it in a larger type wastes a lot of memory. What we need to do is convert the data to smaller types to save memory.
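To make the drawer analogy concrete, here is a quick check on a hypothetical Series of one million small integers (the default integer dtype is int64 on most platforms, so the exact numbers may vary):

import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 100, size=1_000_000))  # values fit easily in int8
print(s.memory_usage(deep=True))                   # int64: ~8,000,128 bytes
print(s.astype(np.int8).memory_usage(deep=True))   # int8:  ~1,000,128 bytes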
The numeric-compression code below comes from a project in the Tianchi competition. After looking around, it seems everyone's memory-reduction code has settled into essentially this same fixed function.
import time
import numpy as np
import pandas as pd

def reduce_mem_usage(df):
    starttime = time.time()
    numerics = ["int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            # skip columns that are entirely NaN
            if pd.isnull(c_min) or pd.isnull(c_max):
                continue
            if str(col_type)[:3] == "int":
                # downcast to the smallest integer type that holds the column's range
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # downcast to the smallest float type that holds the column's range
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print("-- Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction), time spent: {:2.2f} min".format(
        end_mem, 100 * (start_mem - end_mem) / start_mem, (time.time() - starttime) / 60))
    return df
Read the data in with compression applied, storing the result in user_log2
# Compress while reading: pass the freshly read csv straight into the function
user_log2 = reduce_mem_usage(pd.read_csv(r"/Users/liucong/MainFiles/ML/tianchi/tianmiao/user_log_format1.csv"))
Read successful: memory usage is now 890.48 MB, a 69.6% reduction; the effect is remarkable.
Looking at the compressed dataset's information: the dtypes have changed and the memory figure has become smaller.
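To verify this yourself, assuming the user_log2 DataFrame from the previous step:

# dtypes should now show int8/int16/float32 etc. where possible,
# and the reported memory usage should be much smaller
user_log2.info(memory_usage="deep")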
This is the end of the content on "DataFrame memory optimization methods for Python data analysis". Thank you for reading!