HDFS (Hadoop Distributed File System) is one of the most important components of the Hadoop ecosystem. It sits at the bottom of the stack and is responsible for data storage.

What problems does HDFS solve?
1. When a dataset outgrows the storage capacity of a single machine, the data has to be partitioned across several machines.
2. It manages the storage spread across a network of machines as a single file system.

Features
1. Streaming data access: write once, read many times.
2. Storage of very large files: gigabytes, terabytes, even petabytes.
3. Runs on clusters of commodity hardware (ordinary servers), so node failure rates are relatively high.
4. Not suited to low-latency data access.
5. Not suited to storing a large number of small files: the total number of files the file system can hold is limited by the namenode's memory.
6. No support for multiple writers or arbitrary file modification: a file in HDFS has a single writer, and writes always append to the end of the file.

HDFS blocks
1. The default block size is 64 MB, but a file smaller than one block does not occupy a whole block's worth of space.
2. HDFS blocks are much larger than disk blocks in order to minimize seek (addressing) overhead.
3. A file can be larger than any single disk in the cluster.
4. Blocks simplify storage management: metadata and data are managed separately, and blocks are easy to back up, which provides fault tolerance and improves availability.
5. hadoop fsck / -files -blocks shows which blocks make up each file.

Namenode and datanode
1. The namenode is the manager, the datanodes are the workers.
2. The namenode manages the file system namespace and maintains the files and directories in the file system tree.
3. Its metadata is persisted as a namespace image file and an edit log file.
4. The datanode locations of each block are not stored persistently; this information is rebuilt from datanode reports when the system starts.
5. Datanodes store the file blocks and periodically send the namenode a list of the blocks they hold.

Namenode fault tolerance
1. The first mechanism is to back up the files that make up the persistent state of the file system metadata: write that state to local disk and, at the same time, to a remotely mounted network file system (NFS).
2. The second is to run a secondary namenode, which periodically merges the edit log into the namespace image and keeps a copy of the merged image.
3. The third is to copy the namenode metadata from NFS to a new namenode and bring it up as the new primary namenode.

Namenode high availability
1. The namenodes share the edit log through highly available shared storage, with ZooKeeper coordinating failover.
2. Datanodes send block reports to both namenodes.

File permissions in HDFS
There are three kinds of permission: read (r), write (w), and execute (x).
1. Accessing the children of a directory requires execute permission on the directory; listing a directory's contents requires read permission.
2. Writing to a file, or creating or deleting a file or directory, requires write permission.
3. The permission model is intended for a cooperative community of users; it cannot protect resources in a hostile environment.
4. The super user is the identity of the namenode process; permission checks are not performed for the super user. A short sketch of checking and setting permissions through the Java API follows.
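As a concrete illustration of the permission model, here is a minimal sketch (not part of the original notes) that inspects and changes permissions through the FileSystem Java API. The path /user/demo/report.txt is invented for the example, and setPermission() only succeeds if the caller is the file's owner or the super user.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class PermissionDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml etc. from the classpath
        FileSystem fs = FileSystem.get(conf);          // DistributedFileSystem when fs.defaultFS points to HDFS
        Path file = new Path("/user/demo/report.txt"); // hypothetical path, for illustration only

        // Inspect the current r/w/x permissions and ownership.
        FileStatus status = fs.getFileStatus(file);
        System.out.println(status.getPermission() + " " + status.getOwner() + ":" + status.getGroup());

        // Grant the owner read+write, the group read, and others nothing (mode 640).
        fs.setPermission(file, new FsPermission(FsAction.READ_WRITE, FsAction.READ, FsAction.NONE));

        fs.close();
    }
}
```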
HTTP access to HDFS
1. There are two ways to access HDFS over HTTP: direct access, in which the HDFS daemons themselves serve client requests; and access through one or more proxies, in which the client sends HTTP requests to a proxy and the proxy talks to the namenode and datanodes over RPC.
2. In the direct case, a web server embedded in the namenode (port 50070) serves the namespace; directory listings are returned in XML or JSON, while file data is streamed from the datanodes' web servers (port 50075).
3. WebHDFS exposes file system operations (reads and writes) over HTTP and supports Kerberos authentication; it must be enabled by setting dfs.webhdfs.enabled to true.
4. Going through a proxy server makes it possible to do load balancing and to apply stricter firewall and bandwidth-limiting policies.
5. The HttpFS proxy supports both reads and writes and offers the same HTTP REST API as WebHDFS.

The Java API
1. A Configuration object encapsulates the client or server configuration, which is read from resources on the classpath such as conf/core-site.xml.
2. public static LocalFileSystem getLocal(Configuration conf) throws IOException returns an instance of the local file system.
3. FSDataInputStream supports random access, so the file can be read from any position in the stream. It implements the Seekable interface: seek(long pos) moves to an absolute position, and getPos() returns the current position. It also implements PositionedReadable, for reading part of a file at a given offset: int read(long position, byte[] buffer, int offset, int length) reads up to length bytes starting at position into the buffer at offset and returns the number of bytes actually read; readFully() reads exactly length bytes into the buffer. These positioned reads preserve the stream's current offset.
4. FSDataOutputStream: create() creates any missing parent directories for the file being written; an overload of create() takes a Progressable callback so the application can be notified of write progress; append() adds data to the end of an existing file; getPos() returns the current position in the file.
5. FileStatus[] globStatus(Path pathPattern, PathFilter filter) returns an array of FileStatus objects for all files matching the path pattern, for example fs.globStatus(new Path("/2007/*/*"), new RegexExcludeFilter("^.*/2007/12/31$")). There is also listStatus(Path path, PathFilter filter).

Anatomy of a file read
1. The client opens the file it wants to read by calling open() on the FileSystem object (DistributedFileSystem).
2. DistributedFileSystem asks the namenode for the mapping from blocks to datanodes; the datanodes are sorted by their distance from the client.
3. DistributedFileSystem returns an FSDataInputStream, which wraps a DFSInputStream that manages the I/O to the namenode and datanodes.
4. The client calls read() on the stream; when the end of a block is reached, the connection to that datanode is closed and the best datanode for the next block is located.
5. When the client has finished reading, it calls close() on the FSDataInputStream.
6. When an error is encountered, the client reads the block from the next closest datanode and remembers the failed datanode.
A minimal read sketch using seek() follows this list.
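To tie the Java API and the read path together, here is a small sketch (an assumption-laden example, not from the original notes) that streams a file to stdout twice, using seek() to return to the start. It assumes the file URI is passed on the command line and the Hadoop client libraries are on the classpath.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SeekableCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];                                   // e.g. hdfs://namenode:8020/user/demo/sample.txt (placeholder)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);  // resolves to DistributedFileSystem for hdfs:// URIs
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));                        // step 1 of the read path: open the file
            IOUtils.copyBytes(in, System.out, 4096, false);     // stream the whole file to stdout
            in.seek(0);                                         // Seekable: jump back to an absolute position
            System.out.println("offset after seek: " + in.getPos());
            IOUtils.copyBytes(in, System.out, 4096, false);     // read it again from the beginning
        } finally {
            IOUtils.closeStream(in);                            // close the stream when done
        }
    }
}
```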
Anatomy of a file write
1. The client creates the file it wants to write by calling create() on the FileSystem object (DistributedFileSystem).
2. DistributedFileSystem issues an RPC call to the namenode to create a new file in the file system's namespace; at this point the file has no blocks.
3. The namenode performs various checks: the file must not already exist, and the client must have permission to create it.
4. DistributedFileSystem returns an FSDataOutputStream to the client, which wraps a DFSOutputStream that handles communication with the datanodes and the namenode.
5. DFSOutputStream splits the data into packets (64 KB) and writes them to a data queue.
6. The list of datanodes forms a pipeline, and the data is written to each datanode in turn.
7. DFSOutputStream also maintains an acknowledgement queue; a packet is removed from it only after acknowledgements have been received from every datanode in the pipeline.
8. When the client has finished writing data, it calls close() on the stream.

Coherency model
1. A newly created file is immediately visible in the namespace, but data written to it is not guaranteed to be visible right away.
2. FSDataOutputStream's sync() method (hflush()/hsync() in later APIs) forces the written data to be visible; close() performs an implicit sync. A write sketch using this appears after the serialization notes below.

Data integrity in Hadoop
1. io.bytes.per.checksum specifies how many bytes of data each checksum covers; the default is 512 bytes, and a CRC-32 checksum is 4 bytes long.
2. Datanodes are responsible for storing the checksums; when the client reads data, it is verified against the stored checksum.
3. Passing false to setVerifyChecksum() on the FileSystem object before calling open() disables verification for that read.
4. The -ignoreCrc option of -get (or -copyToLocal) disables checksum verification.
5. Hadoop's LocalFileSystem performs client-side checksumming. Checksums are disabled when reading through a RawLocalFileSystem instance, for example by setting fs.file.impl to org.apache.hadoop.fs.RawLocalFileSystem.

Compression
1. Benefits: it reduces the disk space a file needs and speeds up data transfer across the network.
2. For data processed by MapReduce, the compression format should support splitting.
3. To compress the output of a MapReduce job, set mapred.output.compress to true and mapred.output.compression.codec to the class name of the codec in the job configuration, or call FileOutputFormat.setCompressOutput(job, true) and FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class). A configuration sketch appears at the end of this article.
4. mapred.output.compression.type defaults to RECORD, i.e. each record is compressed individually; BLOCK compresses groups of records.
5. To compress map output, set mapred.compress.map.output to true and mapred.map.output.compression.codec to the codec class.

Serialization
1. Serialization converts structured objects into a byte stream for transmission over the network or for permanent storage on disk; deserialization is the reverse process of turning a byte stream back into a structured object.
2. Hadoop's serialization format is Writable.
3. The Writable interface defines two methods: void write(DataOutput out) throws IOException; and void readFields(DataInput in) throws IOException;
4. Creating an IntWritable: IntWritable writable = new IntWritable(); writable.set(5); or, equivalently, new IntWritable(5).
5. The WritableComparable interface and comparators: IntWritable implements WritableComparable, which extends both Writable and java.lang.Comparable. Comparison of types matters because MapReduce has a key-based sorting phase. WritableComparator provides a default implementation of the raw compare() method, which deserializes the objects to be compared from the stream and then compares the deserialized objects.
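As a sketch of the write()/readFields() pair just described, here is a minimal round trip using IntWritable and plain Java streams in place of a real network transfer or disk file:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
    public static void main(String[] args) throws Exception {
        IntWritable writable = new IntWritable();
        writable.set(163);                                      // any value; 163 is just an example

        // Serialize: write() emits the binary form to a DataOutput (4 bytes for an int).
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        writable.write(new DataOutputStream(bytesOut));
        byte[] bytes = bytesOut.toByteArray();
        System.out.println("serialized length: " + bytes.length);

        // Deserialize: readFields() repopulates a Writable from a DataInput.
        IntWritable restored = new IntWritable();
        restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        System.out.println("restored value: " + restored.get());
    }
}
```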
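Looping back to the write path and the coherency model above, here is a sketch of copying a local file into HDFS with create(), a Progressable callback, and an explicit flush. It assumes a Hadoop 2.x client, where hflush() takes the place of the older sync(); the command-line paths are placeholders.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];                              // local file to upload
        String dst = args[1];                                   // e.g. hdfs://namenode:8020/user/demo/copy.txt (placeholder)
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        // create() builds any missing parent directories; the Progressable callback
        // is invoked periodically as data is written to the datanode pipeline.
        FSDataOutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print(".");
            }
        });

        IOUtils.copyBytes(in, out, 4096, false);
        out.hflush();                                           // force written data to be visible to new readers
        out.close();                                            // close() flushes remaining packets and finalizes the file
        in.close();
    }
}
```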
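Finally, a sketch of the compression settings listed earlier, expressed with the newer MapReduce API. The job is a bare pass-through (default identity mapper and reducer) purely to show where the calls go; the old mapred.* property names from the notes are used, with their modern equivalents mentioned in comments.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "compressed output");   // mapper/reducer left at the identity defaults
        job.setJarByClass(CompressedOutputJob.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input and output paths are placeholders
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Compress the job output with gzip (equivalent to mapred.output.compress /
        // mapred.output.compression.codec in the old property names).
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        // Also compress intermediate map output; the modern property names are
        // mapreduce.map.output.compress and mapreduce.map.output.compress.codec.
        job.getConfiguration().setBoolean("mapred.compress.map.output", true);
        job.getConfiguration().setClass("mapred.map.output.compression.codec",
                GzipCodec.class, CompressionCodec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```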