The consistency Model of HDFS of Hadoop

2025-01-17 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

Parts of HDFS deviate from POSIX semantics for the sake of performance (yes, you read that correctly: POSIX is not only applicable to Linux/Unix; Hadoop's file-stream design borrows from POSIX), so the file system may behave differently from what you expect. Be careful.

After you create a file, it is visible in the namespace:

Path p = new Path("p");
fs.create(p);
assertThat(fs.exists(p), is(true));

However, any data written to the file is not guaranteed to be visible, even if you flush it; the reported length of the file may still be zero:

Path p = new Path("p");
OutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
assertThat(fs.getFileStatus(p).getLen(), is(0L));

This is because, in HDFS, a file's contents become visible only after a full block of data has been written out (that is, persisted to the datanodes), so the contents of the block currently being written are always invisible to readers.

Hadoop provides a way to force buffered contents to be flushed to the datanodes: the sync() method of FSDataOutputStream. After sync() returns, Hadoop guarantees that all data written so far has reached every datanode in the write pipeline and is visible to all readers (in later Hadoop versions, sync() was deprecated in favor of hflush() for visibility and hsync() for durability):

Path p = new Path("p");
FSDataOutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
out.sync();
assertThat(fs.getFileStatus(p).getLen(), is((long) "content".length()));

This method behaves like the fsync system call in POSIX, which flushes all buffered data for a given file descriptor to disk. For example, when writing a local file through the Java API, we are guaranteed to see what has been written after calling flush() and syncing the file descriptor:

FileOutputStream out = new FileOutputStream(localFile);
out.write("content".getBytes("UTF-8"));
out.flush(); // flush to operating system
out.getFD().sync(); // sync to disk (getFD() returns the file descriptor corresponding to the stream)
assertThat(localFile.length(), is((long) "content".length()));

Closing a stream in HDFS implicitly calls the sync() method:

Path p = new Path("p");
OutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.close();
assertThat(fs.getFileStatus(p).getLen(), is((long) "content".length()));

Because of the limitations of HDFS's consistency model, we may lose up to a block of data if we never call the sync() method. That is usually unacceptable, so we should use sync() to ensure data has been persisted. But calling sync() too frequently is also bad, because each call incurs significant overhead. A reasonable strategy is to write a certain amount of data and then call sync(); how much depends on your application, and the interval should be as large as your tolerance for data loss allows without hurting application performance.
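The batching strategy above can be sketched with the local-file analogue already used in this article (FileOutputStream plus getFD().sync(), standing in for the HDFS sync()/hflush() call, so the example runs without a cluster). The 64 KB threshold is a hypothetical value chosen for illustration; tune it for your own application:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class BatchedSync {
    // Hypothetical threshold: sync after every 64 KB written (tune per application).
    static final int SYNC_INTERVAL = 64 * 1024;

    public static void main(String[] args) throws IOException {
        File localFile = File.createTempFile("batched-sync", ".dat");
        localFile.deleteOnExit();
        byte[] record = "content\n".getBytes("UTF-8");
        long sinceLastSync = 0;
        try (FileOutputStream out = new FileOutputStream(localFile)) {
            for (int i = 0; i < 10_000; i++) {
                out.write(record);
                sinceLastSync += record.length;
                if (sinceLastSync >= SYNC_INTERVAL) {
                    out.flush();        // flush to the operating system
                    out.getFD().sync(); // force to disk, analogous to HDFS sync()
                    sinceLastSync = 0;
                }
            }
            out.flush();
            out.getFD().sync();         // final sync so no trailing data is at risk
        }
        System.out.println(localFile.length());
    }
}
```

With an 8-byte record written 10,000 times, only a handful of sync calls are made instead of one per write, while bounding the amount of unsynced data to at most SYNC_INTERVAL bytes.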
