
How senior Java developers tackle big-data problems


As we all know, when Java processes a very large amount of data, loading it all into memory inevitably leads to memory overflow, yet in some jobs we have no choice but to handle massive data sets. The common techniques for such processing are decomposition, compression, parallelism, temporary files, and so on.


For example, suppose we want to export data from a database (whichever one) into a file, usually Excel or a text format such as CSV. With Excel, using the POI or JXL interfaces you often have no way to control when memory is written out to disk, which is unpleasant, and the objects these APIs build in memory can be many times larger than the original data, so you end up having to split the workbook. Fortunately, POI eventually recognized the problem: since version 3.8.4 it offers a row cache through the SXSSFWorkbook interface, which lets you set how many rows are kept in memory. Unfortunately, once you exceed that number, every row you append pushes the oldest in-memory row out to disk (if you set the window to 2,000 rows, writing row 2,001 flushes row 1 into a temporary file). So memory is no longer consumed, but the disk is flushed very frequently, which is not what we want; we would prefer to flush in larger units, say 1 MB at a time, but unfortunately no such API exists yet, which is painful. In my own tests, writing several small Excel files is more efficient than writing one large file through the flush-to-disk API currently provided, and with a few more concurrent users the disk IO may not keep up. Since IO resources are very limited, splitting into smaller files is the better policy.

When we write CSV, that is, a plain text file, we can usually stay in control ourselves. The CSV libraries available are not very controllable, but CSV is just text: anything written in a format a CSV reader recognizes is fine. How do we write it? Let's talk about it.
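As a rough sketch of the row-window behaviour described above, assuming a recent Apache POI version with the streaming SXSSFWorkbook (the window size of 2,000, the sheet name, and the output path are only illustrative):

```java
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import java.io.FileOutputStream;

public class StreamingExcelExport {
    public static void main(String[] args) throws Exception {
        // Keep at most 2,000 rows in memory; older rows are flushed to temporary files.
        try (SXSSFWorkbook wb = new SXSSFWorkbook(2000);
             FileOutputStream out = new FileOutputStream("export.xlsx")) {
            Sheet sheet = wb.createSheet("data");
            for (int r = 0; r < 1_000_000; r++) {
                Row row = sheet.createRow(r);
                row.createCell(0).setCellValue(r);
                row.createCell(1).setCellValue("value-" + r);
            }
            wb.write(out);
            wb.dispose(); // delete the temp files backing the flushed rows
        }
    }
}
```

Note that the flush point is decided row by row by the library, not in 1 MB units, which is exactly the limitation discussed above.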

At the data-processing level, say reading from the database and generating a local file: in our own code, for convenience, we do not need to handle the data 1 MB at a time; leave that splitting to the underlying driver, and as far as the program is concerned the writing can be continuous. Suppose we want to export a 10-million-row table to a file. You could page through it, wrapping the query in three layers for Oracle's ROWNUM or using LIMIT for MySQL, but every page is a fresh query, and the further you page the slower it gets. What we really want is to get a handle and swim down it, buffering a batch of rows (say 10,000) and writing the file once per batch (the file-writing details are basic); just be careful with the buffered data and, when writing with an OutputStream, flush and empty the buffer after each batch. Next question: if you execute a SQL statement with no WHERE clause, will memory blow up? Yes, and that is worth thinking about. Looking through the API, we find we can influence this. For example, PreparedStatement statement = connection.prepareStatement(sql) gives the default precompiled statement, but we can also create it as PreparedStatement statement = connection.prepareStatement(sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY) to set up a forward-only, read-only cursor, so the driver does not pull the whole result set straight into local memory, and then control how many rows each cursor pass fetches by calling statement.setFetchSize(...).

In fact, I have used this. With Oracle it makes no difference whether you set it or not, because Oracle's JDBC driver does not cache the result set into Java memory by default, and with MySQL the setting alone is simply not effective. All of the above is to say that the standard API provided by Java may not actually do anything; very often it depends on the vendor's implementation, and the advice circulating online that this setting works is pure copy-and-paste. With Oracle you do not need to worry: it does not cache into memory by itself, so Java memory will not be a problem. With MySQL you must first use driver version 5 or above, then add the parameter useCursorFetch=true to the connection string; the cursor size can also be set there, e.g. defaultFetchSize=1000, for example:

jdbc:mysql://xxx.xxx.xxx.xxx:3306/abc?zeroDateTimeBehavior=convertToNull&useCursorFetch=true&defaultFetchSize=1000
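Putting the pieces together, here is a minimal sketch of the streaming export described above. The table name, column names, host, credentials, file path, and the 10,000-row flush batch are made up for illustration; the connection URL carries the useCursorFetch/defaultFetchSize parameters just shown:

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.*;

public class StreamingCsvExport {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://127.0.0.1:3306/abc"
                + "?zeroDateTimeBehavior=convertToNull&useCursorFetch=true&defaultFetchSize=1000";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, name FROM big_table",               // no WHERE clause on purpose
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
             BufferedWriter out = new BufferedWriter(new FileWriter("big_table.csv"))) {
            ps.setFetchSize(1000);                    // rows pulled per cursor round-trip
            try (ResultSet rs = ps.executeQuery()) {
                int rowsInBatch = 0;
                while (rs.next()) {
                    out.write(rs.getLong("id") + "," + rs.getString("name"));
                    out.newLine();
                    if (++rowsInBatch == 10_000) {    // flush every 10k rows, as suggested above
                        out.flush();
                        rowsInBatch = 0;
                    }
                }
                out.flush();
            }
        }
    }
}
```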

Last time I was tangled up with this problem for quite a while (fetching from MySQL kept inflating the program's memory, and running two exports in parallel crashed the system outright). I read a lot of source code before finding that this was the key, then confirmed it against the MySQL documentation and tested it: with more than 5 million rows, memory no longer balloons and GC stays normal. That problem is finally settled.


Let's move on to data splitting and merging. When there are many data files we want to merge them; when a file is too large we want to split it; and both operations run into similar problems. Fortunately this part is under our control: as long as the data in the file can ultimately be organized, do not split or merge by logical row count, because counting rows forces you to interpret the data itself, which is unnecessary; what you need is plain binary processing. In this binary processing, be careful not to read the file the way most code usually does, that is, pulling the whole file in with a single read: with a huge file, memory will simply blow up, needless to say. Instead, read a controllable range at a time; the read method has overloads that take an offset and a length, which you can compute yourself inside the loop. Writing a large file is the same as above: once you have accumulated a certain amount, flush it to disk through the write stream. This kind of chunked processing is also where modern NIO techniques come in, for example when many clients request the same large file at the same time, such as video downloads. Normally, if a Java container handles this directly, two things happen:
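A minimal sketch of the offset/length-based chunked copy described above (the chunk size, part size, file names, and the use of RandomAccessFile are my own illustrative choices, not the author's code):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class BinaryFileSplitter {
    /** Copies [offset, offset+length) of the source file into one part file, 8 KB at a time. */
    static void copyRange(String src, String dst, long offset, long length) throws IOException {
        byte[] buffer = new byte[8 * 1024];
        try (RandomAccessFile in = new RandomAccessFile(src, "r");
             RandomAccessFile out = new RandomAccessFile(dst, "rw")) {
            in.seek(offset);
            long remaining = length;
            while (remaining > 0) {
                int toRead = (int) Math.min(buffer.length, remaining);
                int read = in.read(buffer, 0, toRead);   // read(buf, offset, length) overload
                if (read < 0) break;                     // end of file
                out.write(buffer, 0, read);
                remaining -= read;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        long partSize = 64L * 1024 * 1024;               // split into 64 MB parts
        long total = new java.io.File("big.dat").length();
        for (long off = 0, i = 0; off < total; off += partSize, i++) {
            copyRange("big.dat", "big.dat.part" + i, off, Math.min(partSize, total - off));
        }
    }
}
```

Merging is the same loop in reverse: copy each part file, chunk by chunk, onto the end of the target file, without ever interpreting rows.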

One is memory overflow: every request has to load a file's worth of memory or more, because Java object wrapping adds a lot of extra overhead (raw binary buffers generate less), and the data goes through several memory copies on its way through the input and output streams. Of course, if you have middleware such as nginx in front, you can hand the file off through its send_file mode. But if you insist on handling it in the application's own memory, the heap has to be enormous, and no matter how large the Java heap is it still has to be garbage-collected; if the heap really is that large, GC will kill you. You could also manage direct (off-heap) memory yourself, allocating and releasing it explicitly, but then the remaining physical memory has to be large enough. How large? Hard to say; it depends on the file sizes and the access frequency.

The second is that, even if memory were large enough and unlimited, the limit becomes threads. In the traditional IO model one request occupies one thread: the thread is handed out from the thread pool by the main thread and then does all the work, passing through the Context wrapper, filters, interceptors, every layer of business code and logic, database access, file access, result rendering, and for that whole time the thread is tied up. These resources are very limited, and a large-file operation is IO-intensive, so a lot of CPU time sits idle; the most direct remedy is to add more threads, assuming memory is large enough to grow the thread pool, but in general a process's thread pool is limited and should not be made too big. To improve performance within limited system resources we got new IO technology, NIO, and newer versions also have AIO. NIO can only be considered partially asynchronous IO: the connection handling no longer needs many threads, since a separate thread services the selector (which replaces the traditional socket) and connections with no data ready are not assigned a thread at all, but the actual read and write calls still block in the middle (that is, during the real read or write, even though you no longer wait for the response in between), so it is not true asynchronous IO. AIO works through callback registration, which of course needs OS support, and threads are allocated only when the callbacks fire. It is not yet very mature and its performance is about the same as NIO, but as the technology develops AIO is bound to surpass NIO. Currently node.js, driven by Google's V8 engine, follows a similar model. This technology is not the focus of this article.
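Regarding the send_file point in the first case above, the closest Java-level analogue is FileChannel.transferTo, which lets the OS move the bytes without pulling the whole file through the heap. A minimal sketch, assuming the file name, host, and port are purely illustrative:

```java
import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySend {
    public static void main(String[] args) throws Exception {
        try (SocketChannel socket = SocketChannel.open(new InetSocketAddress("127.0.0.1", 9000));
             FileChannel file = new FileInputStream("big.dat").getChannel()) {
            long position = 0, size = file.size();
            while (position < size) {
                // transferTo may send less than requested; loop until the whole file is gone.
                position += file.transferTo(position, size - position, socket);
            }
        }
    }
}
```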

Combining the two points above is exactly the problem of serving large files in parallel. The most basic approach is to shrink each transfer unit, for example to 8 KB (in my tests this is a good size for network transmission; local file reads do not need to be that small). Going further, you can add a degree of caching: files requested by multiple clients can be cached in memory or in a distributed cache, and you do not need to cache the whole file, only the pieces used in the last few seconds, possibly combined with some hot-spot algorithm. This is similar to Xunlei (Thunder) downloads with resumable, out-of-order chunks (though Thunder's network protocol is not quite the same): the chunks do not have to arrive in order as long as they can be merged in the end, and on the server side the logic is reversed: whichever connection happens to need this chunk gets it. With NIO you can support a very large number of connections and high concurrency. I ran a local socket test with NIO: 100 clients requesting a single-threaded server at the same time. With a normal web application, the first file would not even finish sending, and the second request either waits, times out, or is simply refused a connection; after switching to NIO, all 100 requests connect, and the server needs only one thread to handle the data, feeding each connection a slice at a time, reading a portion and passing it on. The overall throughput of a long transfer does not improve, but the response behaviour and the memory spent become quantitatively controlled. That is the charm of the technique; there may not be much of an algorithm in it, but you have to understand it.
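A minimal sketch of that single-threaded selector setup (the port, file name, and 8 KB slice size are illustrative; real code would keep the file channel open across events and handle partial writes and errors more carefully):

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.Iterator;

public class OneThreadFileServer {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9000));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer slice = ByteBuffer.allocate(8 * 1024);    // 8 KB per write, per connection
        while (true) {
            selector.select();
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    // Attach a per-connection position into the file being served.
                    client.register(selector, SelectionKey.OP_WRITE, new long[]{0});
                } else if (key.isWritable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    long[] pos = (long[]) key.attachment();
                    try (FileChannel file = FileChannel.open(java.nio.file.Paths.get("big.dat"))) {
                        slice.clear();
                        int read = file.read(slice, pos[0]);
                        if (read <= 0) { client.close(); continue; }
                        slice.flip();
                        pos[0] += client.write(slice);        // may write less than 'read'
                    }
                }
            }
        }
    }
}
```

The one thread only ever moves small slices, so adding the 101st connection costs another selector key and a position counter, not another thread or another file-sized buffer.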

There is a lot of similar data processing, and sometimes efficiency becomes the main concern; for example, splitting and merging HBase files without affecting online business is a harder problem. Many of these problems are worth studying, because different scenarios call for different solutions, yet the ideas are broadly the same: understand the ideas and methods, understand memory and architecture, and understand the scenario you are actually facing, and a small change in detail can have an amazing effect.

Author: wind and fire data

Link: https://juejin.im/post/5b556c846fb9a04f9963a8b5

Source: Nuggets

The copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.
