Suppose you have a 5 or 6 GB file and want to read its contents, do some processing, and save the result to another file. What would you use? First, a few examples of common mistakes: some people reach for multiprocessing, yet the efficiency stays very low, and they conclude that Python itself handles large files badly. But low efficiency only costs time; it never raises an error, which means the problem lies in the program, not the language.
So why does processing large files in Python turn out to be so inefficient?
If you need to process a large file efficiently, pay attention to two points:
I. The efficiency of reading large files
Faced with 1,000,000 (100w) rows of data, after testing various file-reading methods, the conclusion is:
with open(filename, "rb") as f:
    for fLine in f:
        pass
This method is the fastest: a full traversal of 1,000,000 lines takes about 2.7 seconds, which basically meets the efficiency requirements of medium and large file processing. If "rb" is changed to "r", it becomes roughly 6 times slower. Note that with "rb", fLine is of type bytes; Python still splits on line breaks on its own, so the content can still be processed line by line.
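As a concrete illustration, here is a minimal sketch of the read-process-write task from the introduction, built on the fast "rb" pattern above. The file names and the process_line step are hypothetical, purely for illustration:

def process_line(line):
    # line is bytes because the file was opened in "rb" mode;
    # decode only when text-level processing is actually needed
    return line.decode("ascii").upper().encode("ascii")

with open("input.dat", "rb") as fin, open("output.dat", "wb") as fout:
    for fLine in fin:
        # iterating the file object streams it line by line,
        # so the 5-6 GB input never sits in memory all at once
        fout.write(process_line(fLine))

Because each line is written out as soon as it is processed, memory use stays flat regardless of file size.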
II. The efficiency of text processing
Here the example is an ASCII fixed-width file. Since it is not a delimited file, the plan was to split each record into fields by slicing. But processing just 200,000 (20w) records made the running time jump to 12 seconds. At first I suspected bytes.decode was adding the time, so the decode step was removed from the bytes handling entirely, yet efficiency was still very poor.
Finally, the simplest possible loop was timed on its own: even a bare pass over the data takes about 7.5 seconds per 1,000,000 iterations.
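For reference, a minimal sketch of the fixed-width slicing approach described above. The field offsets (0:10, 10:18, 18:24) and the file name are hypothetical:

with open("fixed_width.dat", "rb") as f:
    for line in f:
        # slice the bytes directly; no decode step is required
        name = line[0:10].rstrip()
        date = line[10:18]
        amount = line[18:24].strip()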
So, as for Python tricks for handling large files, let's look at three tips collected from around the web: lists, file attributes, and dictionaries.
1. List processing
Try to choose sets and dictionaries as data types, and never choose lists: lookups in a list are extremely slow. Likewise, when a set or dictionary is already in use, do not convert it to a list to operate on it. For example:
values_count = 0
# don't use this one: scanning the values is a linear search
if value in d.values():
    values_count += 1
# try to use this one: a key lookup is a constant-time hash probe
if key in d:
    values_count += 1
The latter is much faster than the former, because a key lookup is O(1) while a scan over values() is O(n).
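To see the gap yourself, here is a small micro-benchmark sketch; the dictionary size and the probed values are arbitrary choices for illustration:

import timeit

d = {i: i * 2 for i in range(1000000)}

# key membership: O(1) hash lookup
t_keys = timeit.timeit(lambda: 999999 in d, number=1000)
# value membership: O(n) scan over a million entries
t_vals = timeit.timeit(lambda: 1999998 in d.values(), number=1000)

print(t_keys, t_vals)  # the values() scan is dramatically slower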
2. For file attributes
If you encounter a file whose records have identical attributes but must not be deduplicated, so that a set or dictionary cannot be used directly, you can add an attribute: remap the original data by appending a count column, so that every record becomes unique. The data can then be processed with a dictionary or set:
def fun(x):
    return '(' + str(x) + ', 1)'

list(map(fun, [1, 2, 3]))
Here map attaches an extra column to each item, so that multiple identical values can be told apart.
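A minimal sketch of the full idea, using an incrementing counter rather than the fixed ", 1)" tag above; the input list is made up for illustration:

records = ["a", "b", "a", "a", "b"]

counts = {}
uniq = []
for r in records:
    # give the n-th occurrence of each value the counter n
    counts[r] = counts.get(r, 0) + 1
    uniq.append((r, counts[r]))

# ('a', 1), ('b', 1), ('a', 2), ('a', 3), ('b', 2):
# every tuple is now unique, so a set or dict can hold them all
assert len(set(uniq)) == len(records)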
3. For dictionaries
Prefer iteritems() over items(); iteritems() returns an iterator (note that this is Python 2 advice, see the remark after the example):
>>> d = {'a': 1, 'b': 2}
>>> for i in d.items():
...     print i
('a', 1)
('b', 2)
>>> for k, v in d.iteritems():
...     print k, v
a 1
b 2
The dictionary's items() method returns the key-value pairs as a full list of tuples, so the entire list is built in memory as soon as it is called. iteritems() returns a generator over the key-value pairs instead, so each pair is produced only when the loop asks for it, which saves memory on large dictionaries.
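Note that iteritems() exists only in Python 2; it was removed in Python 3, where items() itself returns a lazy view object. A minimal sketch of the Python 3 equivalent:

d = {'a': 1, 'b': 2}

# in Python 3, items() returns a lightweight view, not a list,
# so iterating it never materialises the whole pair list
for k, v in d.items():
    print(k, v)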
Finally, apart from the following five Python techniques, do you have any other tricks for processing large files efficiently?
1. File reading and writing techniques, which will later be used for parameterizing test data and for writing test reports.
2. Data processing techniques, which can be used when handling test data in test scripts.
3. Data statistics and analysis techniques, which will be used when analyzing test results.
4. Chart display techniques, which will be used in the test reports of future test frameworks.
5. Automatic program-triggering techniques, which can be used for the automatic execution of test scripts.