How to use Java to read large files efficiently

This article explains how to read large files efficiently in Java. The content is straightforward and easy to follow.

Memory read

In the first version, A Fan reads all of the data into memory at once. The code is as follows:

Stopwatch stopwatch = Stopwatch.createStarted();
// in-memory read: load all lines at once
List<String> lines = FileUtils.readLines(new File("temp/test.txt"), Charset.defaultCharset());
for (String line : lines) {
    // pass
}
stopwatch.stop();
System.out.println("read all lines spend " + stopwatch.elapsed(TimeUnit.SECONDS) + " s");
// compute memory usage
logMemory();

The logMemory method is as follows:

MemoryMXBean memoryMXBean = ManagementFactory.getMemoryMXBean();
// heap memory usage
MemoryUsage memoryUsage = memoryMXBean.getHeapMemoryUsage();
// initial total memory
long totalMemorySize = memoryUsage.getInit();
// memory currently used
long usedMemorySize = memoryUsage.getUsed();
System.out.println("Total Memory: " + totalMemorySize / (1024 * 1024) + " Mb");
System.out.println("Used Memory: " + usedMemorySize / (1024 * 1024) + " Mb");

The program above uses the open-source Apache Commons IO library; FileUtils#readLines reads the entire contents of the file into memory.

The program holds up fine in simple tests, but as soon as it runs against the real data file, an OOM occurs.

The main reason for the OOM is that the data file is too large. Suppose the test file test.txt above contains 2 million lines and is 740 MB in size.

After the program above reads the file into memory, the memory usage on my machine is as follows:

As you can see, a file of just over 700 MB occupies as much as 1.5 GB of memory once read in. Because the JVM in my earlier run was given only 1 GB of heap, the program ran into an OOM.

Of course, the easiest fix is to add memory and set the JVM heap to 2 GB or more. But machine memory is always limited, and if the files get even larger, there is still no way to load them entirely into memory.
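
For reference, the heap ceiling can be raised with the standard -Xmx JVM option when launching the program (the jar name below is just a placeholder, not something from the original article):

# placeholder invocation; replace app.jar with your actual program
java -Xmx2g -jar app.jar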

But when you think about it, do you really need to load all the data into memory at once?

Obviously, no!

In the scenario above, we load all the data into memory, yet we end up processing it one line at a time anyway.

So let's change the read mode to read line by line.

Read line by line

There are many ways to read line by line. Here, A Fan mainly introduces three:

BufferedReader

Apache Commons IO

Java 8 Stream

BufferedReader

We can use BufferedReader#readLine to read data line by line.

try (BufferedReader fileBufferReader = new BufferedReader(new FileReader("temp/test.txt"))) {
    String fileLineContent;
    while ((fileLineContent = fileBufferReader.readLine()) != null) {
        // process the line
    }
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}

Apache Commons IO

Commons IO provides FileUtils#lineIterator, which reads a file line by line. It is used like this:

Stopwatch stopwatch = Stopwatch.createStarted();
LineIterator fileContents = FileUtils.lineIterator(new File("temp/test.txt"), StandardCharsets.UTF_8.name());
while (fileContents.hasNext()) {
    fileContents.nextLine();
    // pass
}
logMemory();
fileContents.close();
stopwatch.stop();
System.out.println("read all lines spend " + stopwatch.elapsed(TimeUnit.SECONDS) + " s");

This method returns an iterator from which we can fetch one line of data at a time.

In fact, if you look at the code, you will find that FileUtils#lineIterator uses BufferedReader under the hood; interested readers can check the source code for themselves.

Java 8 Stream

Java 8 added a lines method to the Files class; it returns a Stream, which lets us process the data line by line.

Stopwatch stopwatch = Stopwatch.createStarted();
// lines(Path path, Charset cs)
try (Stream<String> inputStream = Files.lines(Paths.get("temp/test.txt"), StandardCharsets.UTF_8)) {
    inputStream
            .filter(str -> str.length() > 5) // filter data
            .forEach(o -> {
                // pass: do sample logic
            });
} catch (IOException e) {
    e.printStackTrace();
}
logMemory();
stopwatch.stop();
System.out.println("read all lines spend " + stopwatch.elapsed(TimeUnit.SECONDS) + " s");

One advantage of this approach is that we can easily chain Stream operations, such as filters, onto the read.

Note: we use try-with-resources here to make sure the stream is closed safely once reading finishes.

Concurrent read

Reading line by line solves the OOM problem. However, when there is a lot of data, processing it line by line on a single thread takes a long time.

In the approaches above, only one thread processes the data, so we can add a few more threads to increase parallelism.

Building on the above, and simply as a starting point for discussion, A Fan introduces the two parallel-processing approaches he uses most often.

Line-by-line batch packing

The first approach reads the data line by line into memory and, once a certain amount has accumulated, hands the batch to a thread pool for asynchronous processing.

@SneakyThrows
public static void readInApacheIOWithThreadPool() {
    // create a thread pool with at most 10 threads and a queue capacity of 100
    ThreadPoolExecutor threadPoolExecutor =
            new ThreadPoolExecutor(10, 10, 60L, TimeUnit.SECONDS, new LinkedBlockingDeque<>(100));
    // read data line by line using Apache Commons IO
    LineIterator fileContents = FileUtils.lineIterator(new File("temp/test.txt"), StandardCharsets.UTF_8.name());
    List<String> lines = Lists.newArrayList();
    while (fileContents.hasNext()) {
        String nextLine = fileContents.nextLine();
        lines.add(nextLine);
        // once 100,000 lines are buffered, hand them to the thread pool
        if (lines.size() == 100000) {
            // split into two batches of 50,000 lines for asynchronous processing
            List<List<String>> partition = Lists.partition(lines, 50000);
            List<Future<?>> futureList = Lists.newArrayList();
            for (List<String> strings : partition) {
                Future<?> future = threadPoolExecutor.submit(() -> {
                    processTask(strings);
                });
                futureList.add(future);
            }
            // wait for both tasks to finish before reading more data; otherwise too many
            // queued tasks and too much buffered data would cause OOM
            for (Future<?> future : futureList) {
                future.get();
            }
            // clear the buffer
            lines.clear();
        }
    }
    // process any lines that are left over
    if (!lines.isEmpty()) {
        processTask(lines);
    }
    threadPoolExecutor.shutdown();
}

private static void processTask(List<String> strings) {
    for (String line : strings) {
        // simulate business logic
        try {
            TimeUnit.MILLISECONDS.sleep(10L);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

In the method above, once 100,000 lines have accumulated in memory, they are split into two tasks and handed to the asynchronous threads, with each task processing 50,000 lines.

The main thread then uses Future#get() to wait for the asynchronous tasks to finish before it continues reading data.

The point of waiting is to keep too many tasks from piling up in the thread pool, which would otherwise lead to OOM all over again.

Split the large file into small files

The second approach splits the large file into several small files and then uses multiple asynchronous threads to process each of them line by line.

public static void splitFileAndRead() throws Exception {
    // first split the large file into small files
    List<File> fileList = splitLargeFile("temp/test.txt");
    // create a thread pool with at most 10 threads and a queue capacity of 100
    ThreadPoolExecutor threadPoolExecutor =
            new ThreadPoolExecutor(10, 10, 60L, TimeUnit.SECONDS, new LinkedBlockingDeque<>(100));
    List<Future<?>> futureList = Lists.newArrayList();
    for (File file : fileList) {
        Future<?> future = threadPoolExecutor.submit(() -> {
            try (Stream<String> inputStream = Files.lines(file.toPath(), StandardCharsets.UTF_8)) {
                inputStream.forEach(o -> {
                    // simulate business logic
                    try {
                        TimeUnit.MILLISECONDS.sleep(10L);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                });
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        futureList.add(future);
    }
    for (Future<?> future : futureList) {
        // wait for all tasks to finish
        future.get();
    }
    threadPoolExecutor.shutdown();
}

private static List<File> splitLargeFile(String largeFileName) throws IOException {
    LineIterator fileContents = FileUtils.lineIterator(new File(largeFileName), StandardCharsets.UTF_8.name());
    List<String> lines = Lists.newArrayList();
    // file serial number
    int num = 1;
    List<File> files = Lists.newArrayList();
    while (fileContents.hasNext()) {
        String nextLine = fileContents.nextLine();
        lines.add(nextLine);
        // 100,000 lines of data per small file
        if (lines.size() == 100000) {
            createSmallFile(lines, num, files);
            num++;
        }
    }
    fileContents.close();
    // handle any lines that are left over
    if (!lines.isEmpty()) {
        createSmallFile(lines, num, files);
    }
    return files;
}
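
The splitLargeFile method above calls a createSmallFile helper that the article does not show. A minimal sketch of what it might look like, assuming it writes the buffered lines to a numbered file under temp/ and clears the buffer afterwards, is:

// Hypothetical helper (not shown in the original article): writes the buffered
// lines to a numbered chunk file, records it, and clears the buffer so the
// caller can keep reading.
private static void createSmallFile(List<String> lines, int num, List<File> files) throws IOException {
    // assumed naming scheme for the chunk files
    File smallFile = new File("temp/test_part_" + num + ".txt");
    FileUtils.writeLines(smallFile, StandardCharsets.UTF_8.name(), lines);
    files.add(smallFile);
    lines.clear();
}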

In the method above, the large file is first divided into small files holding 100,000 lines each, and the small files are then handed to the thread pool for asynchronous processing.

Since each asynchronous thread reads its own small file line by line, there is no need to guard against OOM the way the previous method did.

Above, we used Java code to split the large file into small ones. A Fan also has a simpler option: the following command splits a large file into small files directly:

# split the file into chunks of 100,000 lines each
split -l 100000 test.txt

The Java code afterwards only needs to read the small files directly.
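
As a rough sketch of that step (assuming split was run with its defaults, so the chunks are named xaa, xab, ... in the current directory), the small files could be picked up like this:

// Minimal sketch: read every chunk produced by `split` run with default options,
// whose output names start with "x". The directory and name prefix are assumptions.
File[] chunks = new File(".").listFiles((dir, name) -> name.startsWith("x"));
if (chunks != null) {
    for (File chunk : chunks) {
        try (Stream<String> lines = Files.lines(chunk.toPath(), StandardCharsets.UTF_8)) {
            lines.forEach(line -> {
                // process each line
            });
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}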

To sum up: when we read data from a file that is not very large, we can read it into memory at once and then process it quickly.

If the file is too large to load into memory at once, we need to read it line by line and then process the data. Since single-threaded processing has its limits, we can also use multiple threads to speed things up.

Thank you for reading. That covers how to use Java to read large files efficiently; which approach fits best should be verified in practice.
