
Tips for handling big data with pandas


References:

https://yq.aliyun.com/articles/530060?spm=a2c4e.11153940.blogcont181452.16.413f2ef21NKngz#

http://www.datayuan.cn/article/6737.htm

https://yq.aliyun.com/articles/210393?spm=a2c4e.11153940.blogcont381482.21.77131127S0t3io


Reading and writing large text files

Sometimes we get very large text files. Reading them into memory in one go is slow, may not fit at all, or fits but leaves no room for further computation. If the processing is not too complex, we can use the chunksize or iterator parameter of read_csv to read the file in parts, process each part, and then append the results to the output file step by step with to_csv(mode='a').
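A minimal sketch of this pattern; the file names, chunk size, and the per-chunk filter are made up for illustration:

import pandas as pd

reader = pd.read_csv("big_input.csv", chunksize=100_000)  # read in 100k-row chunks

for i, chunk in enumerate(reader):
    processed = chunk[chunk["amount"] > 0]   # placeholder processing step
    processed.to_csv(
        "processed.csv",
        mode="a",            # append each processed chunk to the same file
        header=(i == 0),     # write the header only once
        index=False,
    )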

Choosing between to_csv and to_excel

Writing results out means choosing an output format. The most common are .csv, .xls, and .xlsx; .xls is the Excel 2003 format and .xlsx the Excel 2007 format. In my experience the speed ranking is csv > xls > xlsx, and for large files writing csv is much faster than writing Excel. xls only supports a little over 60,000 records; xlsx supports more, but when the content is in Chinese it often suffers strange losses. So for small outputs xls is fine, but for large outputs go with csv; xlsx also has a row limit, and a truly large dataset will make you think Python has hung.
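As a quick illustration (the DataFrame is made up, and to_excel needs an Excel writer such as openpyxl installed):

import pandas as pd

df = pd.DataFrame({"name": ["alpha", "beta"], "score": [90, 85]})

df.to_csv("result.csv", index=False)      # fast, no practical row limit
df.to_excel("result.xlsx", index=False)   # noticeably slower for large frames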

Parsing date columns at read time

I used to convert date columns with the to_datetime function after reading the data; with a large dataset that wastes a lot of time. In fact, you can parse the dates while reading by using the parse_dates parameter of read_csv: passing True parses the index as dates, and passing a list of column names parses each of those columns as dates.
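A sketch of both usages; the file and column names are hypothetical:

import pandas as pd

# Parse the index as dates:
df = pd.read_csv("logins.csv", index_col=0, parse_dates=True)

# Or parse specific columns as dates by passing a list of column names:
df = pd.read_csv("logins.csv", parse_dates=["login_time", "signup_time"])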

A few more words on to_datetime: the data we get often contains strangely formatted dates, and by default to_datetime raises an error on them. If those values can simply be ignored, just set the errors parameter to 'ignore'.
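A small sketch of the errors parameter (the sample strings are made up):

import pandas as pd

s = pd.Series(["2018-03-12", "not a date", "2018/03/13"])

# errors='ignore' returns the input unchanged instead of raising when parsing
# fails; errors='coerce' would turn the bad values into NaT instead.
parsed = pd.to_datetime(s, errors="ignore")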

Also, as its name suggests, to_datetime returns timestamps. Sometimes we only need the date part, and we can convert the column with datetime_col = datetime_col.apply(lambda x: x.date()), or equivalently with map: datetime_col = datetime_col.map(lambda x: x.date()).
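For example (made-up data; the vectorized .dt.date accessor shown last is an extra alternative not mentioned above):

import pandas as pd

datetime_col = pd.to_datetime(
    pd.Series(["2018-03-12 08:30:00", "2018-03-13 21:05:00"])
)

date_only = datetime_col.apply(lambda x: x.date())   # apply version
date_only = datetime_col.map(lambda x: x.date())     # map version, same result
date_only = datetime_col.dt.date                     # accessor, no Python loop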

Converting numeric codes into text

Mentioning the map method reminds me of another trick. The data we get is often numerically encoded: for example, a gender column where 0 means male and 1 means female. Of course, we could translate it with indexing.

In fact, there is a simpler way: call map on the column with a dict, and it has the same effect.
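A minimal sketch of the dict-based mapping (sample data made up):

import pandas as pd

df = pd.DataFrame({"gender": [0, 1, 1, 0]})

# The dict keys are the codes, the values the labels.
df["gender"] = df["gender"].map({0: "male", 1: "female"})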

Computing the time between a user's adjacent logins with the shift function

A past project needed the time difference between every pair of adjacent login records for each user. The requirement sounds simple, but with a large amount of data it is not a trivial task. Broken down, it takes two steps: first group the login data by user, then compute the interval between consecutive logins for each user. The data format is simple: one row per login, with a user id (uid) and a login time.

If the amount of data is small, you can first take the unique uids and then compute the login intervals for one user at a time, like this.
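A sketch of that approach, with made-up data since the original snippet is not reproduced here:

import pandas as pd

df = pd.DataFrame({
    "uid": [1, 1, 2, 2, 2],
    "login_time": pd.to_datetime([
        "2018-03-01 09:00", "2018-03-02 10:00",
        "2018-03-01 08:00", "2018-03-03 08:30", "2018-03-05 07:45",
    ]),
})

# Naive version: loop over each user and diff that user's sorted login times.
pieces = []
for uid in df["uid"].unique():
    user = df[df["uid"] == uid].sort_values("login_time").copy()
    user["gap"] = user["login_time"].diff()
    pieces.append(user)
result = pd.concat(pieces)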

The calculation logic of this method is clear and easy to follow, but the drawback is just as obvious: the amount of work is huge, since you loop once for every user.

So why is pandas's shift function a good fit for this calculation? Let's first look at what shift does.
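For example, on a tiny made-up series:

import pandas as pd

s = pd.Series([1, 2, 3, 4])
print(s.shift(1))
# 0    NaN
# 1    1.0
# 2    2.0
# 3    3.0
# dtype: float64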

It simply shifts the values down by one position, which is exactly what we need. Let's rewrite the code above with shift.
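A sketch of the shift-based version, reusing the made-up frame from above; the idea behind uid0 is explained just below:

df = df.sort_values(["uid", "login_time"]).reset_index(drop=True)
df["last_login"] = df["login_time"].shift(1)   # previous row's login time
df["uid0"] = df["uid"].shift(1)                # previous row's uid
df["gap"] = df["login_time"] - df["last_login"]
result = df[df["uid"] == df["uid0"]]           # keep rows within the same user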

The code above takes full advantage of pandas's vectorized computation and avoids the per-uid loop, which was the most time-consuming part. If the data is sorted by uid, shift(1) directly gives the previous login time; but real login data mixes many different uids, so we also shift uid into a new column uid0 and keep only the records where uid matches uid0.

Python data preprocessing: using Dask and Numba to parallelize and accelerate computation


Abstract: this article lays out a parallel data processing solution for Python, using Dask and Numba to speed up computation. It compares the running speed of several different approaches on a case study, which makes the gains easy to see and to reuse.

If you are comfortable using Pandas to transform data, create features, and clean data, you can easily parallelize and speed up your work with Dask and Numba. Purely in terms of speed, Dask beats plain Python, Numba beats Dask, and Numba+Dask is basically unbeatable. Split the numerical computation into Numba sub-functions and replace the plain Pandas apply with Dask's map_partitions plus apply. For 1 million rows of data, creating new features with plain Pandas and mixed numerical computation is many times slower than with the Numba+Dask approach.

Python:60.9x | Dask:8.4x | Numba:5.8x | Numba+Dask:1x


As a data science master's graduate from the University of San Francisco, I deal with data all the time. Using the apply function is one of the many tricks I use to create new features or clean data. I am just a data scientist, not a computer science expert, but I am a programmer who likes to tinker and make code run faster. Here I share my experience with parallelized applies.

Most Python enthusiasts know about the global interpreter lock (GIL) in Python, which keeps your code from using all of the CPU cores on your machine. To make matters worse, our main data processing packages, such as Pandas, rarely implement parallel code.

Apply function vs Multiprocessing.map

The Tidyverse has done some wonderful things for data processing, and plyr is one of my favorite packages: it lets R users easily parallelize operations on their data. Hadley Wickham said:

"plyr is a set of tools for dealing with a set of problems: you need to break down a large data structure into uniform blocks, then apply a function to each block, and finally put all the results together."

I wish Python had a package like plyr. It does not exist yet, but I can put together a simple solution from parallelization packages.

Dask


I spent some time with Spark before, so when I started using Dask it was relatively easy to grasp the key ideas. Dask is designed to run tasks in parallel on a multi-core CPU, and it borrows many of Pandas's syntax conventions.

Let's start with the example used in this article. For a recent data challenge, I needed to take an external data source containing many geocoded points and match them against a large number of city blocks to be analyzed. I compute Euclidean distances and use a simple maximum heuristic to assign each point to a block.


The original apply:
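A sketch of the row-wise pandas version; the nearest_block helper, the sample points, and the column names are assumptions, since the original snippet is not reproduced here:

import numpy as np
import pandas as pd

blocks = np.array([[37.77, -122.42], [37.80, -122.27]])   # made-up block centers

def nearest_block(lat, lon):
    # index of the block with the smallest Euclidean distance to the point
    d = np.sqrt((blocks[:, 0] - lat) ** 2 + (blocks[:, 1] - lon) ** 2)
    return d.argmin()

df = pd.DataFrame({"lat": [37.76, 37.79], "lon": [-122.45, -122.28]})
df["block"] = df.apply(lambda row: nearest_block(row["lat"], row["lon"]), axis=1)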

Dask apply:
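The Dask version is roughly as follows (a sketch, reusing df and nearest_block from the snippet above; meta tells Dask the output type):

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=12)          # split the frame into 12 partitions

df["block"] = ddf.map_partitions(
    lambda part: part.apply(
        lambda row: nearest_block(row["lat"], row["lon"]), axis=1
    ),
    meta=("block", "int64"),
).compute()                                       # run the partitions in parallel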

The two look very similar. The core of the Dask version is the map_partitions call, with a compute() at the end; npartitions also has to be set up front. Partitioning works by splitting the Pandas data frame into chunks. My machine has 6 cores and 12 threads, so I simply tell Dask to use 12 partitions and it takes care of the rest.

Next, map_partitions applies the lambda to each partition. Because most of this data processing code runs independently on each partition, you don't have to worry much about the order of the operations. Finally, compute() tells Dask to do the remaining work and hand me the final result: it makes Dask run the apply on every partition, in parallel.

Because I generate a new column (feature) by iterating over rows, and Dask's apply only works on columns, I don't call Dask apply directly; the map_partitions pattern in the sketch above is the Dask program I actually use.

Numba, Numpy and Broadcasting

Since I classify the data with some simple linear operations (basically the Pythagorean theorem), I figured it would run faster written as plain NumPy-style code, along the lines of the skeleton shown below.


Broadcasting describes how NumPy handles arithmetic between arrays of different shapes. Suppose I have an array and I want to transform it by iterating over it and changing each cell one by one.

Instead, I can skip the for loop and operate on the entire array at once: NumPy with broadcasting performs the element-wise product for me.
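For example, with a made-up array:

import numpy as np

arr = np.array([1.0, 2.0, 3.0])

squared_loop = np.array([x * x for x in arr])   # cell-by-cell loop
squared_vec = arr * arr                         # element-wise, no Python loop

# True broadcasting: a (3, 1) column against a (2,) row gives a (3, 2) result.
col = np.array([[1.0], [2.0], [3.0]])
row = np.array([10.0, 20.0])
grid = col * row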

Broadcasting can do much more, so now take a look at the skeleton of the code:
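A sketch of what such a skeleton can look like, not the author's original function: a Numba-compiled routine that broadcasts the point coordinates against the block coordinates, reusing the made-up df and blocks from the earlier snippets:

import numpy as np
from numba import jit

@jit(nopython=True)
def nearest_block_numba(lats, lons, block_lats, block_lons):
    n = lats.shape[0]
    # Broadcast (n, 1) point coordinates against (m,) block coordinates.
    d = np.sqrt(
        (lats.reshape((n, 1)) - block_lats) ** 2
        + (lons.reshape((n, 1)) - block_lons) ** 2
    )
    out = np.empty(n, dtype=np.int64)
    for i in range(n):
        out[i] = np.argmin(d[i])   # nearest block for each point
    return out

df["block"] = nearest_block_numba(
    df["lat"].to_numpy(), df["lon"].to_numpy(),
    blocks[:, 0].copy(), blocks[:, 1].copy(),
)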

Essentially, the function transforms whole arrays. The upside is that it runs very fast, fast enough to be compared with Dask's parallel processing, and Numba can JIT-compile any function as long as it sticks to basic NumPy and plain Python. The downside is that it only supports NumPy and simple Python syntax: I had to move all the numerical computation out of my functions into sub-functions, but the resulting speedup is dramatic.

Using them together

You can simply combine the Numba function with Dask through map_partitions(), and when the parallelism and the broadcasting work together you will see a significant speedup on large datasets.
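A sketch of the combination, reusing the made-up df, blocks, and nearest_block_numba from the snippets above:

import dask.dataframe as dd

def add_block_column(part):
    # Run the jitted, broadcast-based function once per partition.
    part = part.copy()
    part["block"] = nearest_block_numba(
        part["lat"].to_numpy(), part["lon"].to_numpy(),
        blocks[:, 0].copy(), blocks[:, 1].copy(),
    )
    return part

ddf = dd.from_pandas(df, npartitions=12)
result = ddf.map_partitions(add_block_column).compute()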

[Figures: the first chart compares the running time of each method; the second plots running time against the number of rows, with the x-axis on a logarithmic scale.]

The first figure shows that linear computation without broadcasting performs poorly, and that parallel processing with Dask clearly improves speed. It is also obvious that the combination of Dask and Numba outperforms the other methods.

The second figure is a little more involved, with the x-axis showing the number of rows on a logarithmic scale. It shows that for datasets as small as 1k to 10k rows, Numba alone performs better than Numba+Dask, while Numba+Dask performs very well on large datasets.

Optimization

To let Numba JIT-compile the code, I rewrote the functions to make better use of broadcasting. Rerunning them showed that, on average, the JIT-compiled versions ran about 24% faster for the same code.


It is safe to say there are further optimizations that would make this even faster, but I have not found them yet. Dask is a very friendly tool, and the best result in this article, using Dask together with Numba, was about a 60x speedup.
