
How to write Python code to improve the speed of data processing scripts

2025-01-16 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article introduces how to write Python code to improve the speed of data processing scripts. It walks through the idea with a practical example; the approach is simple, fast, and useful. I hope this article helps you solve the problem.

The typical Python approach to data processing

Suppose we have a folder full of images and want to use Python to create a thumbnail for each one.

Here is a short script that uses Python's built-in glob module to get a list of all the JPEG images in the folder, then uses the Pillow image processing library to save a 128-pixel thumbnail of each one:

import glob
import os
from PIL import Image

def make_image_thumbnail(filename):
    # The thumbnail will be named "<original name>_thumbnail.jpg"
    base_filename, file_extension = os.path.splitext(filename)
    thumbnail_filename = f"{base_filename}_thumbnail{file_extension}"

    # Create and save the thumbnail
    image = Image.open(filename)
    image.thumbnail(size=(128, 128))
    image.save(thumbnail_filename, "JPEG")

    return thumbnail_filename

# Loop through all JPEG images in the folder and create a thumbnail for each one
for image_file in glob.glob("*.jpg"):
    thumbnail_file = make_image_thumbnail(image_file)
    print(f"A thumbnail for {image_file} was saved as {thumbnail_file}")

This script follows a simple pattern, which you will often see in data processing scripts:

First get a list of the files (or other data) you want to process

Write a helper function that can process one item of that data

Use a for loop to call the helper function on each item, one at a time.

Let's test this script on a folder of 1,000 JPEG images to see how long it takes to run:

$ time python3 thumbnails_1.py
A thumbnail for 1430028941_4db9dedd10.jpg was saved as 1430028941_4db9dedd10_thumbnail.jpg
[... about 1000 more lines of output ...]
real    0m8.956s
user    0m7.086s
sys     0m0.743s

It took 8.9 seconds to run the program. But how hard was the computer really working?

Let's run the program again and look at the Activity Monitor while it is running:

75% of the computer's processing resources are idle! What's going on?

The reason is that my computer has four CPUs, but Python is only using one of them. So the program is maxing out a single CPU while the other three sit idle.
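If you are not sure how many CPUs your own machine has, Python can report it directly; here is a one-line check (note that os.cpu_count() counts logical cores, which may include hyperthreads):

import os

print(f"This machine reports {os.cpu_count()} logical CPUs")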

So I need a way to divide the workload into four separate parts that I can process in parallel. Fortunately, there is an easy way for us to do this in Python!

Trying multiple processes

Here is a way for us to process data in parallel:

Divide the list of JPEG files into 4 smaller chunks.

Run 4 separate instances of the Python interpreter.

Let each Python instance process one of the 4 chunks of data.

Merge the results from the 4 parts to get the final list of results.

Four copies of Python running on four separate CPUs should be able to do roughly four times as much work as a single CPU, right?

Best of all, Python does the most troublesome part of the work for us. All we have to do is tell it which function to run and how many instances to use, and it takes care of the rest.

In total, we only need to change three lines of code.

First, we need to import the concurrent.futures library, which is built into Python:

import concurrent.futures

Next, we need to tell Python to start four additional Python instances. We do this by having Python create a Process Pool:

with concurrent.futures.ProcessPoolExecutor() as executor:

By default, it creates one Python process for each CPU on your computer, so if you have four CPUs, it starts four Python processes.
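If you want something other than the per-CPU default, ProcessPoolExecutor also accepts a max_workers argument. Here is a minimal, self-contained sketch; the double() helper and the worker count of 4 are just for illustration:

import concurrent.futures

def double(x):
    # Any top-level (picklable) function works as the mapped function
    return x * 2

if __name__ == "__main__":
    # Ask for exactly 4 worker processes instead of the per-CPU default
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        print(list(executor.map(double, range(10))))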

The final step is to have the created Process Pool use these four processes to execute our helper function on the data list.

To complete this step, we need to change the existing for loop:

for image_file in glob.glob("*.jpg"):
    thumbnail_file = make_image_thumbnail(image_file)

Replace it with a new call to executor.map():

image_files = glob.glob("*.jpg")
for image_file, thumbnail_file in zip(image_files, executor.map(make_image_thumbnail, image_files)):

When calling executor.map(), you pass in the helper function and the list of data to process.

This function does all the troublesome work for us: dividing the list into sublists, sending the sublists to each child process, running the child processes, merging the results, and so on. Nice!

It also returns the result of each function call to us.

The executor.map() function returns results in the same order as the input data, so I used Python's zip() function as a shortcut to pair each original filename with its matching result.
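To see that executor.map() really does hand results back in input order, here is a tiny self-contained sketch; the square() helper and the sample numbers are just for illustration:

import concurrent.futures

def square(n):
    return n * n

if __name__ == "__main__":
    numbers = [5, 3, 8, 1]
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # Results come back in the same order as the inputs,
        # so zip() pairs each input with its own result
        for n, result in zip(numbers, executor.map(square, numbers)):
            print(f"{n} squared is {result}")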

Here is the program code after these three steps:

import glob
import os
from PIL import Image
import concurrent.futures

def make_image_thumbnail(filename):
    # The thumbnail will be named "<original name>_thumbnail.jpg"
    base_filename, file_extension = os.path.splitext(filename)
    thumbnail_filename = f"{base_filename}_thumbnail{file_extension}"

    # Create and save the thumbnail
    image = Image.open(filename)
    image.thumbnail(size=(128, 128))
    image.save(thumbnail_filename, "JPEG")

    return thumbnail_filename

# Create a Process Pool; by default it makes one worker process per CPU on the machine
with concurrent.futures.ProcessPoolExecutor() as executor:
    # Get the list of files to process
    image_files = glob.glob("*.jpg")

    # Process the list of files, but split the work across the Process Pool to use all CPUs!
    for image_file, thumbnail_file in zip(image_files, executor.map(make_image_thumbnail, image_files)):
        print(f"A thumbnail for {image_file} was saved as {thumbnail_file}")

Let's run this script to see if it completes data processing faster:

$ time python3 thumbnails_2.py
A thumbnail for 1430028941_4db9dedd10.jpg was saved as 1430028941_4db9dedd10_thumbnail.jpg
[... about 1000 more lines of output ...]
real    0m2.274s
user    0m8.959s
sys     0m0.951s

The script finished processing the data in 2.2 seconds, about four times faster than the original version! We can process the data faster because we are using four CPUs instead of one.

But if you look closely, you will notice that the "user" time is almost 9 seconds. If the program finished in 2.2 seconds, how can the user time still be 9 seconds? That seems impossible, right?

This is because the "user" time is the sum of all the CPU time, and the total CPU time for us to finish the work is the same, all 9 seconds, but we use 4 CPU to do it, and the actual data processing time is only 2.2 seconds!

Note: spawning extra Python processes and shuttling data to the child processes takes time, so this approach does not always guarantee a large speedup.
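One way to reduce that per-item overhead (an optional tweak, not used in the thumbnail script above) is the chunksize argument of executor.map(), which ships work to the child processes in batches instead of one item at a time:

import concurrent.futures

def work(item):
    return item * item

if __name__ == "__main__":
    data = range(100_000)
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # Each inter-process message now carries 1,000 items,
        # which cuts down communication overhead for very small tasks
        results = list(executor.map(work, data, chunksize=1_000))
    print(results[:5])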

Does this always speed up my data processing scripts?

If you have a list of data and each item can be processed independently, a Process Pool is a good way to speed things up. Here are some examples of tasks that suit parallel processing:

Gathering statistics from a series of separate web server log files.

Parsing data from a stack of XML, CSV, and JSON files.

Preprocessing large numbers of images to build a machine learning dataset.

But keep in mind that a Process Pool is not a cure-all. Using a Process Pool requires passing data back and forth between separate Python processes. This method will not work if the data you are dealing with cannot be efficiently passed between processes. In short, your data must be of a type that Python knows how to serialize (pickle).
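A quick way to check whether your data can make the trip between processes is to try pickling it yourself. This small sketch shows that plain Python data works, while something like a lambda (or an open file handle or database connection) does not:

import pickle

# Plain Python data (strings, numbers, lists, dicts, tuples) pickles fine
pickle.dumps(["photo1.jpg", "photo2.jpg"])
print("A list of filenames is picklable")

# A lambda cannot be pickled, so it could not be sent to a worker process
try:
    pickle.dumps(lambda x: x)
except Exception as err:
    print(f"Not picklable: {err}")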

It also won't help if your data has to be processed in a particular order: if each step needs the result of the previous one, this method won't work either.

What about the GIL problem?

You probably know that Python has something called the Global Interpreter Lock (GIL). It ensures that only one thread executes Python bytecode at any one time, even if your program is multithreaded. In other words, multithreaded Python code doesn't truly run in parallel, so it can't make full use of a multicore CPU.

But a Process Pool sidesteps this problem! Because we are running separate Python instances, each instance has its own GIL. That way we get Python code that really does run in parallel!
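To see the difference in practice, here is a rough sketch that runs the same pure-Python, CPU-bound function first with threads (which share one GIL) and then with processes (one GIL each); the count_down() helper and the job sizes are just for illustration:

import concurrent.futures
import time

def count_down(n):
    # Pure-Python CPU-bound work that never releases the GIL
    while n > 0:
        n -= 1

if __name__ == "__main__":
    jobs = [20_000_000] * 4

    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor() as executor:
        list(executor.map(count_down, jobs))
    print(f"Threads:   {time.perf_counter() - start:.2f}s (serialized by the GIL)")

    start = time.perf_counter()
    with concurrent.futures.ProcessPoolExecutor() as executor:
        list(executor.map(count_down, jobs))
    print(f"Processes: {time.perf_counter() - start:.2f}s (runs truly in parallel)")

On a multicore machine the process-based version should finish noticeably faster, even though both runs execute exactly the same Python code.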

Don't be afraid of parallel processing!

With the concurrent.futures library, Python lets you make a small change to your script and immediately put all the CPUs in your computer to work.

That concludes this introduction to "how to write Python code to improve the speed of data processing scripts". Thank you for reading. If you want to learn more about the industry, you can follow the industry information channel, where the editor posts new material every day.
