
How to Generate and Optimize Files with Tens of Millions of Records in Java

2025-03-31 Update From: SLTechnology News&Howtos

This article explains how to generate data files at the tens-of-millions-of-records scale in Java and how to optimize the process. The explanation is kept simple and clear; follow the editor's reasoning step by step.

Overview of the problem raised on site:

Data volume: each XML file must contain 1,000,000+ records

Memory budget: preferably no more than 512 MB

Symptom: memory overflow after processing roughly 700,000 records

First, let's take a look at the structure of the XML file the program needs to generate. (The sample XML lost its markup during extraction; only the raw field values survive: 1 12 03 004 5 0006 1000000 103507 19507 1 20110303 20110419 45000. The file consists of a header, repeated record blocks, and a tail, as described below.)

Second, how to generate the XML file from a large data set

1. When the data volume is small (fewer than 10,000 records)

The easiest approach is to use an open-source framework such as XStream to serialize a JavaBean directly into XML (a small sketch follows below).

Advantages: the API is simple to use and the code is easy to maintain

Disadvantages: consumes too much memory when the data volume is large
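
For the small-data case, a minimal sketch of this approach (the demo class, bean, and field names below are made up for illustration) could look like this:

import com.thoughtworks.xstream.XStream;
import java.util.ArrayList;
import java.util.List;

public class XStreamDemo {
    // Hypothetical record bean; the field names are illustrative only.
    public static class Record {
        public String id;
        public String amount;
        public Record(String id, String amount) { this.id = id; this.amount = amount; }
    }

    public static void main(String[] args) {
        List<Record> records = new ArrayList<>();
        records.add(new Record("0001", "45000"));

        XStream xstream = new XStream();
        xstream.alias("record", Record.class);  // emit <record> instead of the full class name
        // toXML builds the whole document in memory, which is why this approach
        // only suits small data volumes (fewer than ~10,000 records).
        String xml = xstream.toXML(records);
        System.out.println(xml);
    }
}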

2. When the data volume is large (the method used in this program)

A home-grown XML framework that can generate arbitrarily large XML files with very little memory. It works in three parts.

Part 1: generate the file header

For example: xxx.toXML(Object obj, String fileName)

Part 2: generate the file blocks by appending 3,000 (configurable) records to the file each time.

For example: xxx.appendXML(Object object); // object can be an ArrayList or a single JavaBean

Part 3: generate the tail of the XML file

For example: xxx.finishXML()

Calling sequence in the program: after calling xxx.toXML(Object obj, String fileName) to write the file header, read data from the database into an ArrayList, append it to the XML file with xxx.appendXML(Object object), and finally call xxx.finishXML() to close out the file.

Notes on the framework: the example above is file header + file blocks + file tail. If that does not match your actual file layout, adapt the idea: the key point is to isolate the repeated file block and write it to the XML file by appending (see the sketch below).
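
The author's company framework itself is not published (see the note below), but a minimal sketch of the same header + blocks + tail idea, with a hypothetical class name, tag names, and record bean, might look like this:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

// Minimal sketch of the header + appended blocks + tail idea.
// Class name, tag names, and the Record bean are hypothetical.
public class StreamingXmlWriter {
    private BufferedWriter writer;

    // Part 1: open the file and write the header.
    // (obj could carry header fields; it is ignored in this sketch.)
    public void toXML(Object obj, String fileName) throws IOException {
        writer = new BufferedWriter(new FileWriter(fileName));
        writer.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<root>\n");
    }

    // Part 2: append one block (e.g. 3,000 records) and flush it,
    // so only the current block is ever held in memory.
    public void appendXML(List<Record> block) throws IOException {
        for (Record r : block) {
            writer.write("  <record><id>" + r.id + "</id><amount>" + r.amount + "</amount></record>\n");
        }
        writer.flush();
    }

    // Part 3: write the tail and close the file.
    public void finishXML() throws IOException {
        writer.write("</root>\n");
        writer.close();
    }

    public static class Record {
        public String id;
        public String amount;
    }
}

Usage follows the calling sequence above: toXML(...) once per file, appendXML(...) once per 3,000-row query, finishXML() at the end.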

With this idea you can write a similar framework for large data (ten million records and above) yourself. If you need help you can contact me directly; because the original is company code, I don't dare to release it.

Third, how to test performance and optimize

1. Manual troubleshooting

From the log at the time of the crash, the error was reported inside the XML-generation framework, so the first suspicion was that some resource in the framework was not being released. I reviewed the file-generation framework as a whole, wrote a simple program that generated 2,000,000 records and fed them through the XML framework to produce a file, and watched the corresponding Java process in Task Manager (Windows XP) throughout: memory stayed at roughly 20 MB. That ruled out the framework; suspicion shifted to the part of the program that queries the database and calls the framework.

I then tested the key parts of the main program, optimized the string handling, and manually released the memory of some objects (for example, calling ArrayList.clear() or setting objects to null). With 512 MB allocated, the program still overflowed at about 600,000 records. Since everything that could be released explicitly had already been released, I stopped staring at the code and decided to use JProfiler for memory analysis.
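
For reference, "allocating 512 MB" here corresponds to starting the JVM with the -Xmx512m heap flag, and the manual releases mentioned above amount to code along these lines (rows and MainProgram are illustrative names):

// java -Xmx512m MainProgram   (cap the heap at 512 MB)

rows.clear();   // drop the references the list holds once the block has been written
rows = null;    // allow the list object itself to be garbage collected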

2. Manual troubleshooting did not solve it, so turn to the memory analysis tool JProfiler

I loaded 3,000,000 records into the database and ran the program under JProfiler, repeatedly clicking JProfiler's Run GC button to force garbage collection as it ran. After about 500,000 records, the instance counts of java.lang.String[] and oracle.jdbc.driver.Binder[] kept climbing in step with each other, passing 2,000,000 instances each. Since the java.lang.String[] arrays were being held by other objects, the conclusion was that the problem lay with oracle.jdbc.driver.Binder[]: references from those objects kept the String[] arrays from being garbage collected.

3. Inspect the references to these objects in JProfiler

JProfiler showed that the oracle.jdbc.driver.Binder instances were held by oracle.jdbc.driver.T4CPreparedStatement, and T4CPreparedStatement is exactly Oracle's JDBC implementation of OraclePreparedStatement; so the Binder objects could not be released because of how the database access was handled. Further targeted tests of the JDBC query code pointed the finger at the batch processing and transaction handling: after each file is generated successfully, the program moves the processed rows into a corresponding history table as a backup, and the work on that other table uses batching plus a transaction. Batching is used mainly for execution speed; the transaction is used mainly so that the whole move either succeeds or fails as a unit.

4. The program queries 3,000 rows from the database at a time for processing

So the next check was whether the number of oracle.jdbc.driver.Binder objects corresponds to the number of rows queried. By printing the query count with System.out and watching the Binder count in JProfiler while forcing GC, the numbers matched, which proves that the Java side has a problem in its database batch processing.

5. Extract the batch-processing code and analyze it with JProfiler; the problem is finally located

The cause is as follows: while a file of 1,000,000 records is being generated, the database rows cannot be backed up to the history table (and the original rows deleted) until the whole file has been generated, because only then can the transaction commit, i.e. execute commit(). The 1,000,000 rows are written to the file in batches of 3,000, but each batch only calls PreparedStatement.addBatch() to add rows to the batch; PreparedStatement.executeBatch() is not executed per batch and is only called once, uniformly, before commit(). As a result the PreparedStatement caches the bind data for all 1,000,000 rows, which causes the memory overflow.

The wrong method is as follows:

try {
    conn.setAutoCommit(false);
    pst = conn.prepareStatement(insertSql);
    pstDel = conn.prepareStatement(delSql);
    pstUpdate = conn.prepareStatement(sql);
    // totalSize = 1,000,000 rows / 3,000 per batch
    for (int i = 1; i <= totalSize; i++) {
        // ... query 3,000 rows and append them to the XML file ...
        // each row is only added to the batch with pst.addBatch();
        // executeBatch() is never called inside the loop
    }
    pst.executeBatch();   // executed only once, right before commit(),
                          // so the statement has cached bind data for all 1,000,000 rows
    conn.commit();
    // ...
}
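
The article stops at the faulty version. A sketch of one possible fix, under the assumption that it is acceptable to execute each 3,000-row batch immediately with executeBatch() while still deferring commit() until the whole file is finished, might look like this (the method name and the Object[] row representation are illustrative):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical fix: flush each 3,000-row batch with executeBatch() so the
// statement does not cache bind data, while still committing only once per file.
static void backupRows(Connection conn, String insertSql, List<Object[]> rows) throws SQLException {
    conn.setAutoCommit(false);
    try (PreparedStatement pst = conn.prepareStatement(insertSql)) {
        final int batchSize = 3000;                 // same block size used when writing the file
        int inBatch = 0;
        for (Object[] row : rows) {
            for (int col = 0; col < row.length; col++) {
                pst.setObject(col + 1, row[col]);   // bind values as in the real program
            }
            pst.addBatch();
            if (++inBatch % batchSize == 0) {
                pst.executeBatch();                 // flush the batch; cached bind data is released
                pst.clearBatch();
            }
        }
        pst.executeBatch();                         // flush the final partial batch
        conn.commit();                              // success and failure still stand or fall together
    } catch (SQLException e) {
        conn.rollback();
        throw e;
    }
}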
