How can you use Python to quickly parse and organize tens of thousands of data files? Many newcomers aren't very clear on this, so this article walks through the topic in detail. Anyone with this need is welcome to follow along, and I hope you gain something from it.
Every day, people around the world use Python to do all sorts of different things, and file handling is among the most common tasks they need to solve. With Python, you can easily generate polished reports for others, and you can parse and organize tens of thousands of data files in just a few lines of code.
When we write file-related code, we usually focus on a couple of things: is my code fast enough? Does it do the job with the least effort? In this article, I will share a few programming suggestions on this topic. I will recommend an underrated Python standard library module, demonstrate the best way to read large files, and finally share some thoughts on function design.
Next, let's move on to the first module recommendation.
Note: because file systems differ greatly across operating systems, this article was written mainly on macOS/Linux, and some of the code may not work on Windows.
Recommendation 1: use the pathlib module
If you need to do file processing in Python, the os and os.path siblings in the standard library are two modules you cannot avoid. They contain many utility functions for path manipulation, file reading and writing, and checking file status.
Let me demonstrate how they are used with an example. There is a directory full of data files, but their extensions are inconsistent: some end in .txt and some in .csv. We need to rename every file ending in .txt so that it ends in .csv.
We can write a function like this:
```python
import os
import os.path


def unify_ext_with_os_path(path):
    """Unify the suffix of .txt files in the directory to .csv"""
    for filename in os.listdir(path):
        basename, ext = os.path.splitext(filename)
        if ext == '.txt':
            abs_filepath = os.path.join(path, filename)
            os.rename(abs_filepath, os.path.join(path, f'{basename}.csv'))
```
Let's see which functions related to file processing are used in the above code:
os.listdir(path): lists all files (including folders) in the path directory
os.path.splitext(filename): splits a file name into its base name and extension
os.path.join(path, filename): joins the parts into the absolute path of the file to operate on
os.rename(...): renames a file
Although these functions can fulfill the requirement, to be honest, even after many years of writing Python code I still find them hard to remember, and the resulting code is not very pleasant to read.
Rewriting the code with the pathlib module
To make file processing easier, Python 3.4 introduced a new standard library module: pathlib. It has an object-oriented design and encapsulates many file-operation functions. If we use it to rewrite the code above, the result is very different.
Code after using the pathlib module:
```python
from pathlib import Path


def unify_ext_with_pathlib(path):
    for fpath in Path(path).glob('*.txt'):
        fpath.rename(fpath.with_suffix('.csv'))
```
Compared with the old code, the new function needs only two lines to do the job. And those two lines mainly do the following:
First, Path(path) converts the string path into a Path object
.glob('*.txt') pattern-matches everything under the path and returns the results as a generator; each result is still a Path object, so we can keep chaining
.with_suffix('.csv') directly produces the full path of the file with the new extension
.rename(target) completes the rename
Compared with os and os.path, the pathlib version is clearly more concise and more uniform: all file-related operations are done in one place.
Other uses
In addition, the pathlib module offers many other interesting uses. For example, the / operator can join file paths:
```python
# Old way: use the os.path module
>>> import os.path
>>> os.path.join('/tmp', 'foo.txt')
'/tmp/foo.txt'

# ✨ New way: use the / operator
>>> from pathlib import Path
>>> Path('/tmp') / 'foo.txt'
PosixPath('/tmp/foo.txt')
```
Or use .read_text() to quickly read a file's contents:
```python
# Standard practice: open the file with open(...)
>>> with open('foo.txt') as file:
...     print(file.read())
...
foo

# Using pathlib makes this simpler
>>> from pathlib import Path
>>> print(Path('foo.txt').read_text())
foo
```
Beyond what this article covers, the pathlib module provides many other useful methods, and I strongly recommend reading the official documentation for more.
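As a taste, here is a short sampler of a few of those methods. Everything below is from the standard library's pathlib; the paths themselves are made up for illustration:

```python
from pathlib import Path

p = Path('/tmp/foo.txt')

print(p.parent)    # PosixPath('/tmp')
print(p.name)      # 'foo.txt'
print(p.suffix)    # '.txt'
print(p.stem)      # 'foo'

# Query the file system
print(p.exists())
print(p.is_file())

# Create a directory tree, without erroring if it already exists
Path('/tmp/data/raw').mkdir(parents=True, exist_ok=True)

# Iterate over a directory's entries
for child in Path('/tmp').iterdir():
    print(child)
```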
If none of the above is enough to tempt you, I'll give you one more reason to use pathlib: PEP-519 defines a new object protocol specifically for "file paths". This means that starting from Python 3.6, where the PEP took effect, pathlib's Path objects work with most standard library functions that previously accepted only string paths:
```python
>>> p = Path('/tmp')
# You can pass the Path object p directly
>>> os.path.join(p, 'foo.txt')
'/tmp/foo.txt'
```
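The protocol itself is simple: any object with an `__fspath__` method counts as a path. Below is a minimal, hypothetical sketch of a class implementing it; the class name and directory layout are made up for illustration:

```python
import os


class ProjectPath:
    """A hypothetical path-like object for files under a fixed project root."""

    def __init__(self, relative):
        self.relative = relative

    def __fspath__(self):
        # PEP-519: return the string representation of the path
        return os.path.join('/srv/myproject', self.relative)


# Functions that accept path-like objects (open, os.path.*, ...)
# call os.fspath() on the argument behind the scenes
print(os.fspath(ProjectPath('data/input.csv')))
# OUTPUT: /srv/myproject/data/input.csv
```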
So, don't hesitate to use the pathlib module.
Hint: if you are using an earlier version of Python, you can try installing the pathlib2 module.
Recommendation 2: learn how to stream-read large files
Almost everyone knows the "standard practice" for reading files in Python: first get a file object with the with open(file_name) context manager, then iterate over it with a for loop to fetch the file's contents line by line.
Here is a simple example function that uses this "standard practice":
```python
def count_nine(fname):
    """Count how many digit '9' characters the file contains"""
    count = 0
    with open(fname) as file:
        for line in file:
            count += line.count('9')
    return count
```
If we have a file small_file.txt, this function makes it easy to count the 9s.
```
# FILE: small_file.txt
feiowe9322nasd9233rl
aoeijfiowejf8322kaf9a
```

```python
# OUTPUT: 3
print(count_nine('small_file.txt'))
```
Why did this way of reading files become the standard? Because it has two benefits:
The with context manager automatically closes the open file descriptor
While iterating over the file object, contents are returned line by line and never take up too much memory
Shortcomings of standard practice
But this standard practice is not without shortcomings. If the file contains no newline characters at all, the second benefit no longer holds. When execution reaches for line in file, line becomes one enormous string object that consumes a considerable amount of memory.
Let's do an experiment: take a large 5GB file, big_file.txt, filled with the same kind of random strings as small_file.txt. It just stores its content slightly differently, with all the text on a single line:
```
# FILE: big_file.txt
df2if283rkwefh...
```
If we use the earlier count_nine function to count the 9s in this large file, then on my laptop the process takes a full 65 seconds and eats up 2GB of the machine's memory while running.
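If you want to reproduce the experiment, here is a rough sketch for generating a similar single-line file. The file name matches the one above, but the alphabet and default size are arbitrary choices of mine; a real 5GB run would need size=5 * 1024 ** 3, plenty of disk space, and patience:

```python
import random
import string


def make_big_file(fname, size=1024 * 1024):
    """Write roughly `size` characters of random text onto one line of `fname`."""
    chars = string.ascii_lowercase + string.digits
    with open(fname, 'w') as fp:
        written = 0
        while written < size:
            block = ''.join(random.choices(chars, k=1024))
            fp.write(block)
            written += len(block)


# 1 MB by default; pass a larger size to approximate the experiment
make_big_file('big_file.txt')
```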
Read in blocks using the read method
To solve this problem, we need to set the "standard practice" aside for a while and use the lower-level file.read() method. Unlike iterating over the file object directly, each call to file.read(chunk_size) reads chunk_size worth of content starting from the current position, without waiting for any newline character to appear.
So, using the file.read() method, our function can be rewritten like this:
```python
def count_nine_v2(fname):
    """Count digit '9' characters, reading the file 8 KB at a time"""
    count = 0
    block_size = 1024 * 8
    with open(fname) as fp:
        while True:
            chunk = fp.read(block_size)
            # When there is no more content in the file,
            # the read call returns an empty string ''
            if not chunk:
                break
            count += chunk.count('9')
    return count
```
In the new function, we use a while loop to read the file at most 8 KB at a time. This avoids concatenating one huge string and greatly reduces memory usage.
Decoupling code with generators
If we were talking about some other programming language rather than Python, the code above could be called very good. But if you analyze the count_nine_v2 function carefully, you will notice that inside the loop body there are two separate concerns: data generation (the read call and the chunk check) and data consumption. And these two independent concerns are coupled together.
To improve reusability, we can define a new generator function, chunked_file_reader, that takes over all the logic related to "data generation", so that the main loop in count_nine_v3 is responsible only for counting.
```python
def chunked_file_reader(fp, block_size=1024 * 8):
    """Generator function: read the file contents in blocks"""
    while True:
        chunk = fp.read(block_size)
        # When there is no more content in the file,
        # the read call returns an empty string ''
        if not chunk:
            break
        yield chunk


def count_nine_v3(fname):
    count = 0
    with open(fname) as fp:
        for chunk in chunked_file_reader(fp):
            count += chunk.count('9')
    return count
```
At this point, the code seems to leave no room for optimization, but that is not so. iter(iterable) is a built-in function for constructing iterators, and it has a lesser-known usage: when called as iter(callable, sentinel), it returns a special object that, when iterated, keeps producing the results of calling the callable object, stopping only when a result equals sentinel.
```python
from functools import partial


def chunked_file_reader(file, block_size=1024 * 8):
    """Generator function: read the file contents in blocks, using iter"""
    # First, partial(file.read, block_size) builds a new function that
    # takes no arguments. The loop keeps yielding the results of
    # file.read(block_size) calls until a call returns '' (the sentinel),
    # at which point iteration stops.
    for chunk in iter(partial(file.read, block_size), ''):
        yield chunk
```
In the end, just two lines of code give us a reusable chunked file reader. So how does this function perform?
Compared with the initial version's 2GB of memory and 65 seconds, the generator version finishes the calculation with only 7MB of memory in 12 seconds: roughly five times faster, with a memory footprint of less than 1% of the original.
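Your numbers will vary by machine and disk. If you want to measure for yourself, here is a minimal sketch using only the standard library's time and tracemalloc modules; note that tracemalloc tracks Python-level allocations, which is enough to compare the versions:

```python
import time
import tracemalloc

tracemalloc.start()
start = time.perf_counter()

result = count_nine_v3('big_file.txt')

elapsed = time.perf_counter() - start
# get_traced_memory() returns (current, peak) in bytes
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f'result={result}, time={elapsed:.1f}s, peak memory={peak / 1024 / 1024:.1f}MB')
```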
Recommendation 3: design functions that accept file objects
Having counted the "9"s in a file, let's change the requirement. Now I want to count how many English vowels (aeiou) appear in each file. With small adjustments to the previous code, we can quickly write a new function, count_vowels.
```python
def count_vowels(filename):
    """Count the number of vowels (aeiou) in a file"""
    VOWELS_LETTERS = {'a', 'e', 'i', 'o', 'u'}
    count = 0
    with open(filename, 'r') as fp:
        for line in fp:
            for char in line:
                if char.lower() in VOWELS_LETTERS:
                    count += 1
    return count


# OUTPUT: 16
print(count_vowels('small_file.txt'))
```
Compared with the earlier "count the 9s" function, this one is a bit more complex. To guarantee the program's correctness, I want to write some unit tests for it. But when I set out to write the tests, I found it very troublesome. The main problems were:
The function takes a file path as its argument, so we must pass in an actual file
To prepare test cases, I must either provide several fixture files or write temporary files
Whether a file can be opened and read correctly becomes a boundary condition we must also test
If you find it difficult to write unit tests for a function, that usually means you should improve its design. How should the function above be improved? The answer: make the function depend on a "file object" instead of a file path.
The modified function code is as follows:
```python
def count_vowels_v2(fp):
    """Count the number of vowels (aeiou) in a file"""
    VOWELS_LETTERS = {'a', 'e', 'i', 'o', 'u'}
    count = 0
    for line in fp:
        for char in line:
            if char.lower() in VOWELS_LETTERS:
                count += 1
    return count


# After the change, the responsibility of opening the file
# shifts to the function's caller
with open('small_file.txt') as fp:
    print(count_vowels_v2(fp))
The main gain from this change is that it improves the function's applicability. Because Python uses "duck typing", although the function expects a file object, we can actually pass in any "file-like object" that implements the file protocol.
Python has many file-like objects. For example, the StringIO class in the io module is one: a special memory-backed object whose interface is almost identical to a file object's.
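A quick interactive sketch of how StringIO mimics a file, all from the standard library:

```python
>>> from io import StringIO
>>> fake_file = StringIO('hello\nworld\n')
>>> fake_file.readline()
'hello\n'
>>> fake_file.read()
'world\n'
>>> fake_file.seek(0)   # rewind to the beginning, just like a real file
0
>>> list(fake_file)
['hello\n', 'world\n']
```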
With StringIO, we can easily write unit tests for the function:
```python
# NOTE: the following tests must be executed with pytest
import pytest
from io import StringIO


@pytest.mark.parametrize(
    "content,vowels_count", [
        # Use the parameterized-testing tool provided by pytest to
        # define the list of test parameters: (file content, expected result)
        ('', 0),
        ('Hello World!', 3),
        ('HELLO WORLD!', 3),
        ('你好,世界', 0),
    ]
)
def test_count_vowels_v2(content, vowels_count):
    # Use StringIO to construct a file-like object "file"
    file = StringIO(content)
    assert count_vowels_v2(file) == vowels_count
```
Running the test using pytest shows that the function can pass all the use cases:
```
❯ pytest vowels_counter.py
============ test session starts ============
collected 4 items

vowels_counter.py ....                [100%]

========= 4 passed in 0.06 seconds =========
```
Making unit tests easier to write is not the only benefit of changing the function's dependency. Besides StringIO, the PIPE object that the subprocess module uses to hold a command's standard output is also a file-like object. That means we can pipe a command's output straight into count_vowels_v2 to count its vowels:
```python
import subprocess

# Count the vowels in all the file (and directory) names under /tmp
p = subprocess.Popen(['ls', '/tmp'], stdout=subprocess.PIPE, encoding='utf-8')

# p.stdout is a streaming file-like object that can be passed in directly
# OUTPUT: 42
print(count_vowels_v2(p.stdout))
```
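As an aside, on Python 3.7+ you might prefer subprocess.run. Its captured stdout is a plain string rather than a file object, but wrapping it in StringIO restores a file-like interface. A small sketch under that assumption:

```python
import subprocess
from io import StringIO

# capture_output=True and text=True require Python 3.7+
result = subprocess.run(['ls', '/tmp'], capture_output=True, text=True)
print(count_vowels_v2(StringIO(result.stdout)))
```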
As mentioned before, the biggest advantage of switching the function's parameter to a "file object" is that it improves the function's applicability and composability. By depending on the more abstract "file-like object" rather than a file path, it opens up many more ways to use the function: StringIO, PIPE, and any other object satisfying the protocol can all be its clients.
However, this modification is not without drawbacks; it also brings some inconvenience to the caller: if the caller just wants to use a file path, it must now handle opening the file itself.
How to write a function compatible with both
Is there a way to keep the flexibility of "accepting file objects" while staying convenient for callers who pass a file path? The answer: yes, and the standard library contains such an example.
Open the xml.etree.ElementTree module in the standard library and look at the ElementTree.parse method inside. You will find that this method can be called with a file object and also accepts a string file path. The way it achieves this is easy to understand:
```python
def parse(self, source, parser=None):
    """*source* is a file name or file object, *parser* is an optional parser"""
    close_source = False
    # Check whether source has a "read" attribute to decide whether it is
    # a "file-like object". If not, call open() on it and take on the
    # responsibility of closing it at the end of the function.
    if not hasattr(source, "read"):
        source = open(source, "rb")
        close_source = True
```
Using this flexible detection based on "duck typing", the count_vowels_v2 function can be adapted in the same way to become more convenient to call, as shown in the sketch below.
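Here is a minimal sketch of what that adaptation might look like, following the same hasattr pattern as ElementTree.parse. The v3 name and the generator-expression counting are my own choices, not from the original article:

```python
def count_vowels_v3(fp_or_path):
    """Count the vowels (aeiou) in a file.

    Accepts either a file-like object or a string file path,
    mirroring the hasattr-based check used by ElementTree.parse.
    """
    VOWELS_LETTERS = {'a', 'e', 'i', 'o', 'u'}

    # Duck typing: anything with a "read" attribute is treated as a
    # file-like object; otherwise we open it (and close it afterwards)
    if hasattr(fp_or_path, 'read'):
        fp, close_fp = fp_or_path, False
    else:
        fp, close_fp = open(fp_or_path), True

    try:
        return sum(
            1
            for line in fp
            for char in line
            if char.lower() in VOWELS_LETTERS
        )
    finally:
        if close_fp:
            fp.close()


# Both call styles now work:
# count_vowels_v3('small_file.txt')
# count_vowels_v3(StringIO('hello'))
```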
File handling is something we touch constantly in everyday work. Using more convenient modules, saving memory with generators, and writing functions with a wider range of applications all help us write more efficient code.
Let's make a final summary:
The pathlib module simplifies file and directory operations and makes code more intuitive
PEP-519 defines a standard protocol for representing "file paths", which the Path object implements
Defining generator functions to read large files in blocks can save memory
iter(callable, sentinel) can simplify code in certain specific scenarios
Code that is hard to write tests for is usually code that needs improvement
Making functions depend on "file-like objects" improves their applicability and composability