Shulou
2025-03-31 Update From: SLTechnology News&Howtos shulou
Shulou(Shulou.com)06/01 Report--
In this article, the editor shares some techniques for handling big data with Pandas. Most readers may not be familiar with them, so the article is offered as a reference; I hope you learn a lot from it. Let's get started!
Reading and writing large text data
Sometimes we receive very large text files. Reading one into memory in full is slow, and the file may not fit into memory at all; even when it does, there may not be enough memory left for further computation. If the processing we need is not very complex, we can use the chunksize or iterator parameter of read_csv to read the file in parts, and then write each processed part to the output file step by step with to_csv(mode='a').
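A minimal sketch of this pattern (file names and the filtering step are made up for illustration; a real input would be an existing multi-gigabyte CSV rather than one generated on the spot):

```python
import pandas as pd

# Build a small sample file so the sketch runs end to end.
pd.DataFrame({"value": range(10)}).to_csv("big_input.csv", index=False)

# Read the file in chunks instead of loading it all at once.
first = True
for chunk in pd.read_csv("big_input.csv", chunksize=3):
    result = chunk[chunk["value"] % 2 == 0]  # stand-in for real processing
    # Append each processed chunk; write the header only once.
    result.to_csv("output.csv", mode="w" if first else "a",
                  header=first, index=False)
    first = False
```

Each iteration only holds one chunk in memory, so peak memory usage is bounded by chunksize rather than by the size of the whole file.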
Choosing between to_csv and to_excel
When writing out results we face a choice of output format. The most common are .csv, .xls, and .xlsx, the latter two being the Excel 2003 and Excel 2007 formats. My experience ranks them csv > xls > xlsx: for large files, writing csv is much faster than writing Excel. xls supports only about 65,000 rows; xlsx supports more, but with Chinese content it sometimes loses data in strange ways. So for small tables xls is an option, while for large ones csv is recommended; xlsx still has a row limit, and writing a large amount of data to it will make you think Python has hung.
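As a rough illustration (file names hypothetical): writing csv needs no extra dependency, while to_excel requires an Excel engine such as openpyxl to be installed, which is one more reason csv is the easier default for large outputs.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [90, 85]})

# csv: fast to write, no practical row limit, no extra dependency.
df.to_csv("report.csv", index=False)

# xlsx: needs an Excel engine (e.g. openpyxl) and is slower on big tables.
# df.to_excel("report.xlsx", index=False)
```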
Parsing date columns at read time
Previously, I used to convert date columns with the to_datetime function after reading the data. When the dataset is large, this wastes time. In fact, you can parse columns as dates directly while reading, via the parse_dates parameter of read_csv. It accepts several forms: True parses the index as dates, while passing a list of column names parses each listed column as dates.
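A minimal sketch of parsing at read time (the column names are made up):

```python
import io
import pandas as pd

csv_data = io.StringIO("login_time,uid\n2024-01-01 08:00,1\n2024-01-02 09:00,2\n")

# parse_dates converts the column during the read, saving a separate pass.
df = pd.read_csv(csv_data, parse_dates=["login_time"])
```

After this, df["login_time"] already has a datetime64 dtype, with no follow-up to_datetime call needed.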
A few more words about the to_datetime function: the data we receive often contains oddly formatted values in date columns, and by default to_datetime raises an error on them. If such values can simply be ignored, just set the errors parameter in the function, e.g. errors='ignore' to leave unparseable values unchanged (or errors='coerce' to turn them into NaT).
In addition, as its name suggests, to_datetime returns timestamps. Sometimes we only need the date part, and we can transform the column with datetime_col = datetime_col.apply(lambda x: x.date()); the map function works the same way: datetime_col = datetime_col.map(lambda x: x.date()).
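A small sketch combining both points. Note that errors='coerce' (turning bad values into NaT) is used here instead of the article's errors='ignore', since 'ignore' is deprecated in recent pandas versions; the .dt.date accessor is equivalent to the apply/map lambda above but also handles NaT:

```python
import pandas as pd

s = pd.Series(["2024-01-05 10:30:00", "not a date", "2024-02-29 08:00:00"])

# Unparseable strings become NaT instead of raising an error.
parsed = pd.to_datetime(s, errors="coerce")

# Keep only the date part of each timestamp (NaT stays NaT).
dates = parsed.dt.date
```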
Converting numeric codes to text
Mentioning the map method reminds me of another trick. The data we receive is often numerically encoded: for example, a gender column may use 0 for male and 1 for female. We could of course translate it by indexing.
In fact, there is a simpler way: pass a dict to the map method of the column to be modified, and it has the same effect.
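A minimal sketch of the dict-based mapping (sample data invented, using the article's 0/1 coding):

```python
import pandas as pd

df = pd.DataFrame({"uid": [101, 102, 103, 104], "gender": [0, 1, 1, 0]})

# map with a dict replaces each code by its label in one vectorized pass.
df["gender"] = df["gender"].map({0: "male", 1: "female"})
```

Codes missing from the dict would become NaN, which is also a convenient way to spot unexpected values.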
Computing the time difference between a user's two adjacent logins with the shift function
A previous project required computing the time difference between each pair of adjacent login records for every user. The requirement sounds simple, but with a large amount of data it is not a trivial task. Broken down, it takes two steps: group the login data by user, then compute the interval between consecutive logins for each user. The data format is simple, as follows.
If the amount of data is small, we can take the unique uids first and then compute the login intervals one user at a time, like this.
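Since the original snippet was not preserved in this copy of the article, here is a hedged reconstruction of the per-user loop, with a made-up sample log:

```python
import pandas as pd

# Toy login log; real data would have many more users and records.
df = pd.DataFrame({
    "uid": [1, 2, 1, 2, 1],
    "login_time": pd.to_datetime([
        "2024-01-01 09:30", "2024-01-01 10:00",
        "2024-01-01 08:00", "2024-01-03 10:00",
        "2024-01-02 08:00",
    ]),
})

# One filter-sort-diff pass per user: clear, but slow when there are many uids.
intervals = []
for uid in df["uid"].unique():
    times = df.loc[df["uid"] == uid, "login_time"].sort_values()
    intervals.append(times.diff().dropna())
result = pd.concat(intervals)
```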
Although the calculation logic of this method is clear and easy to understand, the drawback is just as obvious: the amount of work is huge, amounting to one full pass over the data for every user.
So why is pandas's shift function suitable for this calculation? Let's take a look at what the shift function does.
It simply shifts the values down by one position, which is exactly what we need. Let's rewrite the code above using the shift function.
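A quick demonstration of shift on a toy series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# shift(1) moves every value down one row; the first row becomes NaN.
shifted = s.shift(1)
```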
The code above takes full advantage of pandas's vectorized computation and avoids the uid loop, the most time-consuming part of the calculation. If the data were sorted by uid, shift(1) alone would give every previous login time; but in real login data adjacent rows often belong to different users, so we also shift the uid column into uid0 and keep only the rows where uid matches uid0.
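A reconstruction of the vectorized version, using the same made-up sample log as before (not the author's original code):

```python
import pandas as pd

# Toy login log; real data would have many more users and records.
df = pd.DataFrame({
    "uid": [1, 2, 1, 2, 1],
    "login_time": pd.to_datetime([
        "2024-01-01 09:30", "2024-01-01 10:00",
        "2024-01-01 08:00", "2024-01-03 10:00",
        "2024-01-02 08:00",
    ]),
})

# Sort so each user's logins are adjacent, then shift everything down one row.
df = df.sort_values(["uid", "login_time"]).reset_index(drop=True)
df["uid0"] = df["uid"].shift(1)
df["prev_time"] = df["login_time"].shift(1)

# A row whose shifted uid0 differs from uid pairs two different users; skip it.
same_user = df["uid"] == df["uid0"]
df.loc[same_user, "interval"] = (
    df.loc[same_user, "login_time"] - df.loc[same_user, "prev_time"]
)
```

In modern pandas the same result can be obtained in one line with df.groupby("uid")["login_time"].diff(), but the shift-and-compare trick shown here makes the vectorization explicit.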
That is all the content of this article on how to handle big data with Pandas. Thank you for reading! I hope what was shared here helps you.
© 2024 shulou.com SLNews company. All rights reserved.