What are the methods of file IO operation?


This article introduces the basics of file IO operations. Many people run into trouble with them in real-world cases, so let the editor walk you through how to handle these situations. I hope you read it carefully and come away with something useful!

01

/ background /

The past middleware performance challenge and the ongoing first PolarDB data performance contest both involve file manipulation; designing a sound architecture and squeezing out the machine's full read/write performance are the keys to a good ranking. While competing, I received feedback from several readers of my official account, mostly along these lines: "I'm very interested in the competition but don't know how to get started", or "I can produce results, but my performance is more than 10x worse than the front-runners". To help more readers take part in similar competitions in the future, I will briefly lay out some best practices for file IO operations, without going into overall system architecture. I hope that after reading this article you can happily participate in similar performance challenges.

02

/ knowledge points overview /

This article focuses on Java-related file operations. Understanding them requires a few prerequisites, such as PageCache, mmap (memory mapping), DirectByteBuffer (out-of-heap memory), sequential read/write, and random read/write. You don't need to understand them fully yet, but you should at least know what they are, because this article revolves around these knowledge points.

03

/ first look at FileChannel and MMAP /

First of all, the most important decision in a file-IO competition is choosing a good way to read and write files. How many kinds of file IO are there in Java? The native read/write approaches fall into three types: ordinary IO, FileChannel (file channel), and MMAP (memory mapping). Telling them apart is easy. FileWriter and FileReader live in the java.io package and belong to ordinary IO; FileChannel lives in the java.nio package and is a kind of NIO, though note that NIO does not necessarily mean non-blocking: FileChannel is in fact blocking. The more special one is MMAP, a way of reading and writing files obtained by calling FileChannel's map method, known as memory mapping.
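For reference, ordinary IO is the familiar stream style; a minimal sketch (the file name is illustrative):

// ordinary IO: java.io streams, no control over write granularity
// (assumes import java.io.FileWriter; IOException propagates to the caller)
try (FileWriter writer = new FileWriter("db.data")) {
    writer.write("hello");
}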

How to use FileChannel:

FileChannel fileChannel = new RandomAccessFile(new File("db.data"), "rw").getChannel();

How to get MMAP:

MappedByteBuffer mappedByteBuffer = fileChannel.map(FileChannel.MapMode.READ_WRITE, 0, fileChannel.size());

MappedByteBuffer is the class through which Java operates on an MMAP region.

The traditional byte-stream-oriented IO has long fallen out of favor, so we focus on the difference between FileChannel and MMAP.

04

/ FileChannel read and write /

// write
byte[] data = new byte[4096];
long position = 1024L;
// specify position: write 4kb of data at that position
fileChannel.write(ByteBuffer.wrap(data), position);
// write 4kb of data starting from the current file pointer
fileChannel.write(ByteBuffer.wrap(data));

// read
ByteBuffer buffer = ByteBuffer.allocate(4096);
long position = 1024L;
// specify position: read 4kb of data from that position
fileChannel.read(buffer, position);
// read 4kb of data starting from the current file pointer
fileChannel.read(buffer);

FileChannel deals with ByteBuffer most of the time. You can think of ByteBuffer as a wrapper class around byte[] that provides a rich API for manipulating bytes; readers who don't know it should familiarize themselves with its API. It is worth mentioning that both the write and read methods are thread-safe: FileChannel controls concurrency internally with a private final Object positionLock = new Object(); lock.
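For those unfamiliar with ByteBuffer, a minimal sketch of its position/limit bookkeeping, the part that trips up most newcomers (values are illustrative):

ByteBuffer buf = ByteBuffer.allocate(8);   // heap buffer, capacity 8 bytes
buf.putInt(42);                            // position moves to 4
buf.putInt(7);                             // position moves to 8 (buffer full)
buf.flip();                                // limit = 8, position = 0: switch from writing to reading
int a = buf.getInt();                      // reads 42
int b = buf.getInt();                      // reads 7
buf.clear();                               // position = 0, limit = capacity: ready to fill again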

Why is FileChannel faster than ordinary IO? That statement is not quite rigorous, because you have to use it correctly: FileChannel only reaches its real performance when you write in integer multiples of 4kb. This is because FileChannel works through memory buffers like ByteBuffer, which lets us control the size of each disk write precisely, something ordinary IO cannot do. Does 4kb guarantee speed? That isn't rigorous either; it mainly depends on your machine's disk structure and is also affected by the operating system, the file system, and the CPU. For example, on the disk used in the middleware performance challenge, you had to write at least 64kb at a time to reach peak IOPS.

The PolarDB setup, however, is completely different, one could even say brutally tough. Since the contest is still in progress I won't go into specifics, but with the habit of "benchmark everything" we can measure it ourselves.

There is another side to how FileChannel achieves its efficiency. Before introducing it, a question: does FileChannel write data from the ByteBuffer directly to disk? Think for a few seconds... The answer is NO. Between the data in the ByteBuffer and the data on disk sits a layer called PageCache, a cache between user memory and the disk. We all know that disk IO and memory IO differ in speed by several orders of magnitude. We can treat fileChannel.write writing into PageCache as the write being complete, but in reality it is the operating system that finishes the final flush from PageCache to disk for us. With this concept in mind, you should understand why FileChannel provides a force() method to tell the operating system to flush to disk promptly.
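A minimal sketch of the write-then-flush sequence just described (file name and sizes are illustrative; assumes java.io/java.nio imports and IOException handling by the caller):

FileChannel fileChannel = new RandomAccessFile(new File("db.data"), "rw").getChannel();
fileChannel.write(ByteBuffer.wrap(new byte[4096])); // "complete" once it reaches PageCache
fileChannel.force(true); // ask the OS to flush PageCache to disk; true also flushes file metadata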

Similarly, when we use FileChannel for reads, the data also travels through three stages: disk -> PageCache -> user memory. Everyday users can ignore PageCache, but as a contestant you cannot ignore it while tuning. I won't say more about reads here; we will return to PageCache in a later section. For now, just take away the concept.

05

/ MMAP read and write /

// write
byte[] data = new byte[4];
int position = 8;
// write 4b of data at the current mmap pointer position
mappedByteBuffer.put(data);
// specify position: write 4b of data at that position
MappedByteBuffer subBuffer = mappedByteBuffer.slice();
subBuffer.position(position);
subBuffer.put(data);

// read
byte[] data = new byte[4];
int position = 8;
// read 4b of data from the current mmap pointer position
mappedByteBuffer.get(data);
// specify position: read 4b of data from that position
MappedByteBuffer subBuffer = mappedByteBuffer.slice();
subBuffer.position(position);
subBuffer.get(data);

If FileChannel is already this strong, what is left for MappedByteBuffer to do? Please allow me to say a little about how MappedByteBuffer is used.

After we execute fileChannel.map(FileChannel.MapMode.READ_WRITE, 0, (long) (1.5 * 1024 * 1024 * 1024)); and observe the disk, we immediately see a 1.5G file, but at this point its contents are all 0 (byte 0). This fits MMAP's description as a memory-mapped file: whatever we do to the MappedByteBuffer in memory is eventually mapped into the file.

Mmap maps the file into virtual memory in user space, eliminating the copy from the kernel buffer to user space. A location in the file has a corresponding address in virtual memory, so you can manipulate the file as if it were memory. This is equivalent to putting the whole file into memory, yet before the data is actually touched there is no physical memory consumption and no disk read/write. Only when the data is actually used does the virtual memory management system (VMS) load the corresponding blocks from disk into physical memory through the page-fault mechanism. This way of reading and writing files avoids copying data from the kernel cache to user space and is very efficient.

Having read this rather official description, you may be a little curious about MMAP: if such powerful black magic exists, is there any point to FileChannel at all? Many articles online also claim that MMAP outperforms FileChannel by an order of magnitude on large files! From my experience in the contests, however, MMAP is not a silver bullet for file IO; it only performs slightly better than FileChannel in scenarios that write a small amount of data at a time. And now for the discouraging part: at least in Java, using MappedByteBuffer is a troublesome and painful business, mainly in three ways:

When using MMAP you must specify the size of the memory mapping, and a single map is limited to about 1.5G. Mapping repeatedly leads to virtual-memory reclamation and reallocation problems, which is quite unfriendly when the file size is not known in advance.

MMAP uses virtual memory which, like PageCache, is controlled by the operating system. Although you can trigger flushes manually with force(), the timing is hard to control, which becomes a headache in low-memory scenarios.

The MMAP reclamation problem: when a MappedByteBuffer is no longer needed, you can manually release the virtual memory it occupies, but... in a very weird way.

public static void clean(MappedByteBuffer mappedByteBuffer) {
    ByteBuffer buffer = mappedByteBuffer;
    if (buffer == null || !buffer.isDirect() || buffer.capacity() == 0)
        return;
    invoke(invoke(viewed(buffer), "cleaner"), "clean");
}

private static Object invoke(final Object target, final String methodName, final Class<?>... args) {
    return AccessController.doPrivileged(new PrivilegedAction<Object>() {
        public Object run() {
            try {
                Method method = method(target, methodName, args);
                method.setAccessible(true);
                return method.invoke(target);
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }
    });
}

private static Method method(Object target, String methodName, Class<?>[] args) throws NoSuchMethodException {
    try {
        return target.getClass().getMethod(methodName, args);
    } catch (NoSuchMethodException e) {
        return target.getClass().getDeclaredMethod(methodName, args);
    }
}

private static ByteBuffer viewed(ByteBuffer buffer) {
    String methodName = "viewedBuffer";
    Method[] methods = buffer.getClass().getMethods();
    for (int i = 0; i < methods.length; i++) {
        if (methods[i].getName().equals("attachment")) {
            methodName = "attachment";
            break;
        }
    }
    ByteBuffer viewedBuffer = (ByteBuffer) invoke(buffer, methodName);
    if (viewedBuffer == null)
        return buffer;
    else
        return viewed(viewedBuffer);
}

Yes, you read that correctly: all that code exists just to reclaim a MappedByteBuffer.

Therefore, I suggest you complete your first submission with FileChannel, and only switch to an MMAP implementation where you must write a small amount of data at a time (for example, a few bytes); FileChannel can cover every other scenario (as long as you understand how to use it properly). As for why MMAP beats FileChannel when writing small amounts of data at a time, I haven't found a theoretical basis yet; if you have any clues, please leave a message. By theoretical analysis, FileChannel also writes to memory, but it has one extra copy between the kernel buffer and user space that MMAP avoids, so MMAP performs better in those extreme scenarios. As to whether the virtual memory allocated by MMAP is truly PageCache, I think it can be approximately understood as PageCache.

06

/ Sequential reads are faster than random reads; sequential writes are faster than random writes /

Whether on a mechanical hard disk or an SSD, this conclusion holds, although the reasons behind it differ. Today we won't discuss the mechanical disk, that ancient storage medium, and will focus on SSDs to see why random reads and writes on them are slower than sequential ones. Even though every SSD and file system differs in composition, today's analysis still has reference value.

First of all: what is a sequential read, what is a random read, what is a sequential write, and what is a random write? Perhaps we had no such doubts when we first encountered file IO operations, but as I wrote this I began to doubt my own understanding. I don't know whether you have been through a similar stage; in any case, I did doubt it for a while. So, let's look at two pieces of code:

Write mode 1: 64 threads, with an atomic variable recording the position of the write pointer, writing concurrently:

ExecutorService executor = Executors.newFixedThreadPool(64);
AtomicLong wrotePosition = new AtomicLong(0);
for (int i = 0; i < 1024; i++) {
    executor.execute(() -> {
        fileChannel.write(ByteBuffer.wrap(new byte[4 * 1024]), wrotePosition.getAndAdd(4 * 1024));
    });
}

Write mode 2: lock the write to ensure synchronization.

ExecutorService executor = Executors.newFixedThreadPool(64);
AtomicLong wrotePosition = new AtomicLong(0);
for (int i = 0; i < 1024; i++) {
    executor.execute(() -> {
        write(new byte[4 * 1024]);
    });
}

public synchronized void write(byte[] data) {
    fileChannel.write(ByteBuffer.wrap(data), wrotePosition.getAndAdd(4 * 1024));
}

The answer: only the second way counts as sequential writing, and the same applies to sequential reading. For file operations, locking is not a terrible thing; what is terrible is not daring to synchronize write/read! Some may ask: doesn't FileChannel already use a positionLock internally to keep writes thread-safe? Why synchronize on top of it, and why is that faster? My plain-language answer: multi-threaded concurrent writes without synchronization leave holes in the file, because execution may proceed in this order:

Timing 1: thread1 writes position [0, 4096)

Timing 2: thread3 writes position [8192, 12288)

Timing 3: thread2 writes position [4096, 8192)

So it is not fully "written in sequence". But don't worry that locking degrades performance: a later section introduces an optimization, file partitioning, to reduce lock conflicts during multi-threaded reads and writes.

Now let's analyze the principle: why is sequential reading faster than random reading, and sequential writing faster than random writing? Both comparisons actually come down to the same thing at work: PageCache which, as mentioned earlier, is the cache layer between the application buffer (user memory) and the disk file (disk).

Taking sequential reads as an example, when the user calls fileChannel.read(4kb), two things actually happen:

The operating system loads 16kb from disk into PageCache; this is called read-ahead (pre-reading).

The operating system then copies 4kb from PageCache into user memory.

Finally, we access that 4kb in user memory. So why are sequential reads fast? Because when the user goes on to access the rest of that 16kb of disk content, it is served directly from PageCache. Just think: to access 16kb of disk content, which is faster, 4 disk IOs, or 1 disk IO plus 4 memory IOs? The answer is obvious; all of this is the optimization PageCache brings.

Think deeper: will PageCache allocation be affected when memory is tight? How is the read-ahead size determined, is it a fixed 16kb? Can we monitor PageCache hits? In which scenarios does PageCache fail, and if it fails, what remedies do we have?

Here are some brief self-questions and answers; the logic behind them still needs the reader's scrutiny:

When memory is tight, PageCache read-ahead is affected; I have observed this in actual measurement, though I have found no literature to support it.

PageCache is dynamically adjusted and can be tuned through Linux system parameters; by default it occupies 20% of total memory.

A tool in https://github.com/brendangregg/perf-tools on GitHub (cachestat) can monitor PageCache.

This is an interesting optimization point: if caching through PageCache is uncontrollable, why not do your own read-ahead? See the sketch after this list.

The principle of sequential writing is consistent with that of sequential reading; it too is shaped by PageCache, and I leave it to the reader to think through.
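Here is the sketch promised above: do-it-yourself read-ahead, assuming fixed-size records that are mostly read in order (the class name, 64kb chunk size, and single-chunk cache are all illustrative; assumes java.io/java.nio imports). One large disk read serves many small requests from our own buffer instead of relying on PageCache:

// Hypothetical read-ahead wrapper; assumes dst.length <= CHUNK.
class ReadAheadReader {
    private static final int CHUNK = 64 * 1024;               // one disk IO loads 64kb
    private final FileChannel channel;
    private final ByteBuffer cache = ByteBuffer.allocateDirect(CHUNK);
    private long cachedStart = -1;                            // file offset of cached chunk; -1 = empty

    ReadAheadReader(FileChannel channel) { this.channel = channel; }

    void read(byte[] dst, long position) throws IOException {
        boolean miss = cachedStart < 0
                || position < cachedStart
                || position + dst.length > cachedStart + CHUNK;
        if (miss) {                                           // refill: one large read, our own "pre-read"
            cache.clear();
            channel.read(cache, position);                    // sketch: ignores short reads near EOF
            cachedStart = position;
        }
        cache.position((int) (position - cachedStart));
        cache.get(dst);                                       // subsequent small reads are memory IO only
    }
}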

07

/ Direct memory VS in-heap memory /

The earlier FileChannel sample code already used in-heap memory: ByteBuffer.allocate(4 * 1024). ByteBuffer also provides a way to allocate out-of-heap memory: ByteBuffer.allocateDirect(4 * 1024). This raises a series of questions: when should we use in-heap memory, and when direct memory?

I won't spend much ink on elaborate theory; here are some best practices for in-heap and out-of-heap memory:

When you need to allocate large chunks of memory: in-heap memory is limited, and only out-of-heap memory can accommodate very large allocations.

Out-of-heap memory suits objects with medium or long life cycles. Objects with short life cycles are reclaimed at YGC anyway, while keeping large, long-lived objects out of the heap avoids the impact they would have on the application during FGC.

For direct file copies or I/O operations, using out-of-heap memory directly reduces the cost of copying from user memory to system memory.

Meanwhile, combining object pools with out-of-heap memory enables the reuse of direct memory even for short-lived objects that involve I/O operations (Netty uses this approach). In competition, try to avoid frequent new byte[]: creating and reclaiming memory regions carries real overhead, and using ThreadLocal together with pooling will often bring you a pleasant surprise (see the sketch after this list).

Creating out-of-heap memory is more expensive than creating in-heap memory, so once out-of-heap memory is allocated, reuse it as much as possible.
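The sketch promised in the list above: per-thread reuse of a single direct buffer instead of allocating on every IO (buffer size and method shape are illustrative; assumes java.io/java.nio imports and data.length <= 4 * 1024):

// Each thread keeps one direct buffer and reuses it for every write it performs.
private static final ThreadLocal<ByteBuffer> LOCAL_BUFFER =
        ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(4 * 1024));

void write(FileChannel channel, byte[] data, long position) throws IOException {
    ByteBuffer buf = LOCAL_BUFFER.get();
    buf.clear();          // reuse: reset position/limit instead of allocating a new buffer
    buf.put(data);
    buf.flip();           // switch to read mode so the channel can drain it
    channel.write(buf, position);
}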

08

/ dark magic: Unsafe /

public class UnsafeUtil {
    public static final Unsafe UNSAFE;
    static {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            UNSAFE = (Unsafe) field.get(null);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

With the dark magic of Unsafe we can achieve many unimaginable things, so let me give a brief introduction here.

Allocating direct memory and copying memory:

ByteBuffer buffer = ByteBuffer.allocateDirect(4 * 1024 * 1024);
long address = ((DirectBuffer) buffer).address();
byte[] data = new byte[4 * 1024 * 1024];
UNSAFE.copyMemory(data, 16, null, address, 4 * 1024 * 1024);

The copyMemory method copies between memory regions, both in-heap and out-of-heap. Parameters 1 and 2 describe the source, parameters 3 and 4 describe the destination, and parameter 5 is the size of the copy. For a byte array in the heap, pass the array object as the base plus the fixed ARRAY_BYTE_BASE_OFFSET constant of 16 as the offset; for out-of-heap memory, pass null plus the address of the direct memory, which can be obtained via ((DirectBuffer) buffer).address(). Why copy through UNSAFE instead of directly? Because it's fast, of course, young man! In addition: a MappedByteBuffer can also be copied through UNSAFE, achieving the effect of writing/reading the disk.
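A sketch of that last point: copying into a MappedByteBuffer through UNSAFE looks the same, because a MappedByteBuffer is itself a DirectBuffer (offsets and sizes are illustrative; assumes the UNSAFE instance and fileChannel from above, plus sun.nio.ch.DirectBuffer):

MappedByteBuffer mmap = fileChannel.map(FileChannel.MapMode.READ_WRITE, 0, 4 * 1024);
long mmapAddress = ((DirectBuffer) mmap).address();   // a MappedByteBuffer is a DirectBuffer too
byte[] src = new byte[4 * 1024];
// heap array -> mapped region; Unsafe.ARRAY_BYTE_BASE_OFFSET is the 16 used above
UNSAFE.copyMemory(src, Unsafe.ARRAY_BYTE_BASE_OFFSET, null, mmapAddress, 4 * 1024);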

As for Unsafe's other black-magic tricks, you can look into them yourself; I won't repeat them here.

09

/ file partitioning /

As mentioned earlier, sequential reads and writes require us to lock write and read, and I have stressed repeatedly that locking is not scary: file IO operations do not depend on multithreading that much. But sequential reading and writing behind a single lock certainly cannot saturate disk IO, and we can't let today's powerful CPUs sit idle, right? With file partitioning we can kill two birds with one stone: preserve sequential reads and writes while reducing lock conflicts.

So here comes the question again: how many partitions are appropriate? More files mean fewer lock conflicts, but too many files mean heavy fragmentation and too little data per file, so the cache is less likely to hit. How to strike the trade-off? There is no theoretical answer: benchmark everything~ A sketch of the idea follows.
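The sketch promised above: a partitioned writer where each partition has its own channel, write pointer, and lock, so threads writing to different partitions never contend, while each file still sees sequential writes (partition count, file naming, and key routing are illustrative; assumes java.io/java.nio/java.util.concurrent.atomic imports):

// Hypothetical partitioned writer: lock granularity drops from one file to N files.
class PartitionedWriter {
    private static final int PARTITIONS = 16;
    private final FileChannel[] channels = new FileChannel[PARTITIONS];
    private final AtomicLong[] positions = new AtomicLong[PARTITIONS];

    PartitionedWriter(File dir) throws IOException {
        for (int i = 0; i < PARTITIONS; i++) {
            channels[i] = new RandomAccessFile(new File(dir, "part-" + i + ".data"), "rw").getChannel();
            positions[i] = new AtomicLong(0);
        }
    }

    void write(int key, byte[] data) throws IOException {
        int p = key & (PARTITIONS - 1);        // route by key; same key always hits the same partition
        synchronized (channels[p]) {           // lock only this partition, not the whole store
            channels[p].write(ByteBuffer.wrap(data), positions[p].getAndAdd(data.length));
        }
    }
}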

10

/ Direct IO /

Finally, let's talk about an IO method never mentioned before: Direct IO. What, Java has such a thing? Blogger, did you lie to me? Didn't you tell me earlier there were only three kinds of IO?! Don't scold me: strictly speaking, Java does not support it natively, but it can be achieved by calling native methods through JNA/JNI. Direct IO bypasses PageCache, and since we said earlier that PageCache is a good thing, why would we skip it? Look closely, though, and there are scenarios where Direct IO works well; yes, it is the one we have barely discussed: random reads. IO methods such as fileChannel.read() trigger PageCache read-ahead, but here we don't actually want the operating system to do so much for us: unless we get extremely lucky, a random read will not hit PageCache, and the odds are easy to imagine. Although Direct IO has been bluntly dismissed by Linus, it still has value in random-read scenarios, saving the copy from the Block IO layer (roughly understood as the disk) into PageCache.

Back to the question: how does Java use Direct IO, and are there any restrictions? As mentioned, Java does not support it natively, but some kind soul has wrapped the JNA library and implemented Direct IO for Java.

int bufferSize = 20 * 1024 * 1024;
DirectRandomAccessFile directFile = new DirectRandomAccessFile(new File("dio.data"), "rw", bufferSize);
for (int i = 0; i < bufferSize / 4096; i++) {
    byte[] buffer = new byte[4 * 1024];
    directFile.read(buffer);
    directFile.readFully(buffer);
}
directFile.close();

This is the end of "What are the methods of file IO operation?". Thank you for reading. If you want to learn more about the field, you can follow the site, where the editor will keep producing high-quality practical articles for you!
