
How to write Python code to make data processing 4 times faster


This article explains how to write Python code that makes data processing about 4 times faster. If you have ever wondered how to speed up a simple data processing script, the method below is easy to understand and easy to apply. Follow along with the walkthrough.

General Python data processing method

For example:

Suppose we have a folder full of images and we want to use Python to create a thumbnail for each one.

Here is a brief script that does it:

It uses Python's built-in glob module to get a list of all the JPEG files in the folder,

then uses the Pillow image processing library to save a 128-pixel thumbnail of each image:
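The script itself did not survive in this copy of the article, so here is a minimal sketch of what it might look like, assuming the images sit in the current folder and Pillow is installed (the helper name make_image_thumbnail and the "_thumbnail" naming are illustrative):

import glob
import os
from PIL import Image

def make_image_thumbnail(filename):
    # Name the thumbnail "<original name>_thumbnail.<ext>" (illustrative convention)
    base_filename, file_extension = os.path.splitext(filename)
    thumbnail_filename = f"{base_filename}_thumbnail{file_extension}"

    # Create a thumbnail no larger than 128x128 pixels and save it
    image = Image.open(filename)
    image.thumbnail(size=(128, 128))
    image.save(thumbnail_filename, "JPEG")
    return thumbnail_filename

# Get a list of all JPEG files in the folder and process them one at a time
for image_file in glob.glob("*.jpg"):
    thumbnail_file = make_image_thumbnail(image_file)
    print(f"A thumbnail for {image_file} was saved as {thumbnail_file}")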

This script follows a simple pattern

that you will see again and again in data processing scripts:

Get a list of the files (or other data) you want to process.

Write a helper function that processes a single item from that list.

Use a for loop to call the helper function on each item, one at a time.

Let's test this script on a folder containing 1,000 JPEG images

and see how long it takes to run:

The program took 8.9 seconds to run, but how hard was the computer actually working?

Let's run the program again

and watch the Activity Monitor while it runs:

75% of the computer's processing power is idle! What's going on?

The reason is that my computer has four CPU cores, but Python is only using one of them.

So the program pegs a single core while the other three have nothing to do.

So I need a way to divide the workload into four separate parts that I can process in parallel.

Fortunately, there is an easy way for us to do this in Python!

Creating multiple processes

Here is one way to process the data in parallel:

Divide the list of JPEG files into 4 smaller chunks.

Start 4 separate instances of the Python interpreter.

Let each Python instance process one of the 4 chunks of data.

Merge the results from the 4 parts to get the final list of results.

With 4 copies of the program running on 4 separate CPU cores,

the work should get done roughly 4 times faster than on a single core.

Right?

Best of all, Python has already done the most troublesome part of this work for us.

All we have to do is tell it which function we want to run and how many instances to use, and it takes care of the rest.

We only need to change three lines of code in the whole process.

First

We need to import the concurrent.futures library

This library is built into Python:
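That change is just one line at the top of the script:

import concurrent.futures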

Next, we need to tell Python to start four additional Python instances.

We do this by having Python create a Process Pool:
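A minimal sketch of that change, using the pool as a context manager so it shuts down cleanly when we are done:

with concurrent.futures.ProcessPoolExecutor() as executor:
    ...  # the processing code goes inside this block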

By default,

it creates one Python process for each CPU core on your computer,

so if you have four cores, it will start four Python processes.

The last step:

Tell the Process Pool to use its four processes to run our helper function on the list of data.

To do that, we take the existing for loop

and replace it with a new call to executor.map():
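Roughly, the change looks like this (variable and helper names follow the thumbnail sketch above):

# Before: process each file one at a time in a plain loop
# for image_file in glob.glob("*.jpg"):
#     thumbnail_file = make_image_thumbnail(image_file)

# After: let the process pool spread the calls across the CPU cores
image_files = glob.glob("*.jpg")
thumbnail_files = executor.map(make_image_thumbnail, image_files)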

When you call the executor.map() function, you pass in the helper function and the list of data to be processed.

It does all the troublesome work for us,

including splitting the list into sublists, sending the sublists to each child process, running the child processes, merging the results, and so on.

Well done!

It also returns the result of each function call for us.

The executor.map() function returns the results in the same order as the input data,

so I used Python's zip() function as a shortcut to pair each original file name with its matching result.

Here is the program code after these three steps:
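The listing did not make it into this copy of the article; a minimal reconstruction along the lines described above (the helper name and thumbnail naming are illustrative) is:

import concurrent.futures   # change 1: import the library
import glob
import os
from PIL import Image

def make_image_thumbnail(filename):
    # Save the thumbnail as "<original name>_thumbnail.<ext>"
    base_filename, file_extension = os.path.splitext(filename)
    thumbnail_filename = f"{base_filename}_thumbnail{file_extension}"
    image = Image.open(filename)
    image.thumbnail(size=(128, 128))
    image.save(thumbnail_filename, "JPEG")
    return thumbnail_filename

if __name__ == "__main__":
    # change 2: create a process pool (one worker per CPU core by default)
    with concurrent.futures.ProcessPoolExecutor() as executor:
        image_files = glob.glob("*.jpg")
        # change 3: executor.map() runs the helper on every file across the pool;
        # results come back in input order, so zip() pairs them with the file names
        zipped = zip(image_files, executor.map(make_image_thumbnail, image_files))
        for image_file, thumbnail_file in zipped:
            print(f"A thumbnail for {image_file} was saved as {thumbnail_file}")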

Let's run this script

and see if it finishes the data processing faster:

The script finished processing the data in 2.2 seconds! That's four times faster than the original version!

The data gets processed faster

because we are using four CPU cores instead of one.

But

if you look closely, you will see that the "user" time is almost 9 seconds.

How can the program finish in 2.2 seconds and yet somehow still show 9 seconds of running time?

That seems impossible, right?

This is

because "user" time is the sum of the time spent on all CPU cores.

The total CPU time needed to finish the work is the same, about 9 seconds,

but we split it across 4 cores (roughly 2.2 seconds each, which adds back up to about 9 seconds of CPU time), so the actual data processing took only 2.2 seconds of real time!

Note:

Starting extra Python processes and shipping data to the child processes takes time, so this approach does not always bring a large speedup.

Does this always speed up my data processing scripts?

If you have a list of data

in which each item can be processed independently, using a Process Pool is a good way to speed things up.

Here are some examples of work that suits parallel processing:

Pulling statistics out of a series of separate web server logs.

Parsing data from a stack of XML, CSV and JSON files.

Preprocessing large numbers of images to build a machine learning data set.

But keep in mind that Process Pools are not a cure-all.

Using a Process Pool requires passing data back and forth between separate Python processes.

If the data you are working with cannot be passed between processes efficiently, this method will not work.

In short, the data you process must be of a type that Python knows how to serialize and send to another process.
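As a small illustrative example of that constraint (the file names are made up): plain values such as path strings can be pickled and shipped to a worker process, while something like an already-open file handle cannot, so it has to be opened inside the worker.

import concurrent.futures

def count_lines(path):
    # Open the file inside the worker process: a path string pickles fine,
    # but an open file object could not be sent across to a child process.
    with open(path) as f:
        return sum(1 for _ in f)

if __name__ == "__main__":
    paths = ["a.log", "b.log", "c.log"]  # hypothetical file names
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for path, lines in zip(paths, executor.map(count_lines, paths)):
            print(path, lines)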

Meanwhile,

the data will not be processed in any particular order.

If you need the result of one item before you can process the next, this method won't work either.

What about the GIL problem?

You probably know that Python has something called the Global Interpreter Lock, or GIL.

It means that even if your program is multithreaded, only one thread can execute Python code at a time.

The GIL ensures that only one Python thread runs at any given moment.

In other words:

multithreaded Python code never truly runs in parallel, so it cannot make full use of a multicore CPU.

But a Process Pool solves this problem!

Because we are running separate Python instances, each instance has its own GIL.

That way we get Python code that really does run in parallel!
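A tiny sketch of what that means in practice: the CPU-bound function below would be serialized behind the GIL in a thread pool, but a process pool runs the four calls genuinely in parallel (the function and the numbers are illustrative):

import concurrent.futures

def busy_sum(n):
    # Pure-Python, CPU-bound work
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # Each call runs in its own process with its own GIL
        results = list(executor.map(busy_sum, [10_000_000] * 4))
        print(results)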

That wraps up our look at how to write Python code that makes data processing 4 times faster. I hope it has cleared up your questions. Theory sticks best when paired with practice, so go and try it yourself!
