Example Analysis of HDFS Short-Circuit Reading
This article walks through HDFS short-circuit reading with concrete examples. The analysis is fairly detailed and should be a useful reference; interested readers are encouraged to read to the end.
Background
One of the important ideas of Hadoop is to move computation, not data: we prefer to move the computation to the node where the data resides whenever possible. As a result, in HDFS the client and the data are often on the same node, and a local read occurs when the client reads a block. For example, in the HBase scenario, a RegionServer writing data usually stores three replicas in HDFS and always writes one replica to its local node. When the RegionServer reads the data back, it likewise prefers the replica on the same node.
The evolution of short-circuit reading
1. Network reading
Initially, HDFS handled local reads the same way as remote reads: over the network. The client connected to the DataNode through a TCP socket and transferred data via the DataTransferProtocol.
This approach is simple, but there are obvious problems:
The DataNode must reserve one thread and one TCP socket for every client that is reading a block. This incurs TCP protocol overhead in the kernel as well as DataTransferProtocol overhead, leaving considerable room for optimization.
2. HDFS-2246: unsafe short-circuit reading
The key idea of short-circuit reading is that, because the client and the data block are on the same node, the DataNode does not need to be in the read path: the client can read the data directly from the local disk, which greatly improves read performance. In the short-circuit read implemented in HDFS-2246, the DataNode opened the permissions of all its block paths to the client, and the client read the data directly through the local disk path.
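Conceptually, an HDFS-2246-style read reduces to the client opening the replica file by its filesystem path. The C sketch below illustrates the idea; the block path is hypothetical (a real client resolves it via a getBlockLocalPathInfo RPC to the DataNode):

```c
/* Minimal sketch of an HDFS-2246-style short-circuit read: the client
 * bypasses the DataNode and opens the block file by path. The path below
 * is hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    /* Hypothetical local path of a finalized block replica. */
    const char *block_path = "/data/dfs/dn/current/finalized/blk_1073741825";

    int fd = open(block_path, O_RDONLY); /* needs read access to the data dir */
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* consume block bytes, e.g. hand them to the application */
        fwrite(buf, 1, (size_t)n, stdout);
    }
    close(fd);
    return 0;
}
```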
But this approach introduces a lot of problems:
(1) The system administrator must change the permissions of the DataNode data directories so that clients can open the relevant files. Users allowed to use short-circuit reading must be explicitly whitelisted; no other users may use it. Typically, these users must also be placed in a special Unix group.
(2) These permission changes introduce a security hole: a user with access to the block files on a DataNode can read every block under that path at will, not just the blocks they are entitled to access. This effectively turns such users into superusers, which may be acceptable for a few users, such as the HBase user, but on the whole it brings serious security risks.
3. HDFS-347: secure short-circuit reading
The main problem with HDFS-2246 is that it opens the whole DataNode data directory to the client, when what the client actually needs is just the relevant block files. Unix has a mechanism for exactly this, called file descriptor passing, and HDFS-347 uses it to implement secure short-circuit reading. Instead of handing a directory path to the client, the DataNode opens the block file and the metadata file itself and passes their file descriptors to the client over a domain socket.
Secure short-circuit reading solves the security problems of HDFS-2246 in two ways:
(1) The file descriptors are passed read-only, so the client cannot modify the files behind them.
(2) The client has no access to the block directory itself, so it cannot read any other block files it is not entitled to.
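To make the mechanism concrete, here is a minimal C sketch of the sending side of file descriptor passing with SCM_RIGHTS, the POSIX facility that HDFS-347 builds on. It assumes sock is an already connected Unix domain socket, and block_fd and meta_fd stand in for the block and metadata files the DataNode opened read-only:

```c
/* Sender side of fd passing over a Unix domain socket via SCM_RIGHTS. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int send_fds(int sock, int block_fd, int meta_fd) {
    int fds[2] = { block_fd, meta_fd };
    char data = 'F';                       /* at least one byte of real payload */
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };

    char cbuf[CMSG_SPACE(sizeof(fds))];
    memset(cbuf, 0, sizeof(cbuf));

    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;          /* ancillary data carries open fds */
    cmsg->cmsg_len = CMSG_LEN(sizeof(fds));
    memcpy(CMSG_DATA(cmsg), fds, sizeof(fds));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}
```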
HDFS secure short-circuit reading
1. Short-circuit read shared memory
Now that we have seen how HDFS short-circuit reading evolved, let's look at how HDFS implements secure short-circuit reading. The DataNode passes the file descriptors of the short-circuit replica to the DFSClient, and the DFSClient caches them. Since the state of a replica can change at any time, the DFSClient and the DataNode need to keep the replica state synchronized in real time. Because the DFSClient and the DataNode run on the same machine, both can map the same file into memory through the POSIX mmap interface, and the mapped data stays synchronized in real time. Shared memory can therefore hold the state of all short-circuit read replicas, allowing the DFSClient and the DataNode to synchronize replica information through it.
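The following minimal C sketch (hypothetical file path, error handling trimmed) shows the property that the short-circuit design relies on: two processes that map the same file with MAP_SHARED see each other's writes immediately, with no extra copying:

```c
/* Two processes mapping the same file with MAP_SHARED share its contents. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEG_SIZE 8192   /* matches the 8192-byte segment size mentioned below */

int main(void) {
    int fd = open("/tmp/shm_demo", O_CREAT | O_RDWR, 0600); /* hypothetical path */
    ftruncate(fd, SEG_SIZE);

    uint8_t *seg = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the fd is closed */

    seg[0] = 1; /* visible to any other process that mapped the same file */

    munmap(seg, SEG_SIZE);
    return 0;
}
```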
The shared memory contains many slots, each holding the information of one short-circuit read replica, so the shared memory holds the binary information of all slots. Raw binary slot data in the mapped region is awkward to manage directly, so a Slot class is defined to operate on a slot within the mapped data.
Each slot is 64 bytes. In the slot data format, the first 4 bytes are the slot flag bits, bytes 5 to 8 are the anchor count, and the remaining bytes are reserved for future use, such as statistics.
Two flag bits:
(1) VALID_FLAG: indicates whether the slot is valid.
DFSClient sets this flag when it allocates a new slot in shared memory. DataNode clears the flag when the replica associated with the slot is no longer valid. DFSClient also clears the flag itself when it believes that DataNode is no longer using the slot for communication.
(2) ANCHORABLE_FLAG: indicates whether the replica corresponding to the slot has been cached.
DataNode sets this flag when it caches the slot's replica through the POSIX mlock interface. Once the flag is set, DFSClient no longer needs to verify checksums when short-circuit reading the replica, because a cached replica has already been verified, and such a replica also supports zero-copy reads. When DFSClient reads such a replica, it increments the slot's anchor count by 1, and DataNode may evict the replica from the cache only when the slot's anchor count is 0.
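The C struct below sketches the slot layout as described above. The flag bit positions are illustrative, and the real Hadoop code manipulates these fields atomically through its Slot class rather than a raw struct:

```c
/* Illustrative 64-byte slot: 4 bytes of flags, 4 bytes of anchor count,
 * the rest reserved. Mirrors the article's description, not the exact
 * field layout in the Hadoop source. */
#include <stdint.h>

#define SLOT_SIZE        64
#define VALID_FLAG       (1u << 0)   /* flag bit positions are illustrative */
#define ANCHORABLE_FLAG  (1u << 1)

struct slot {
    uint32_t flags;         /* bytes 1-4: VALID / ANCHORABLE flag bits  */
    uint32_t anchor_count;  /* bytes 5-8: readers currently anchoring   */
    uint8_t  reserved[SLOT_SIZE - 8];  /* reserved, e.g. for statistics */
};

/* DFSClient side: anchor a cached replica before a zero-copy read. */
static int try_anchor(struct slot *s) {
    if (!(s->flags & VALID_FLAG) || !(s->flags & ANCHORABLE_FLAG))
        return 0;              /* not valid, or not mlock-cached          */
    s->anchor_count++;         /* the real code uses atomic operations    */
    return 1;
}

/* DataNode side: a replica may leave the cache only when unanchored. */
static int can_uncache(const struct slot *s) {
    return s->anchor_count == 0;
}
```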
A shared memory segment is at most 8192 bytes, so when a DFSClient performs a large number of short-circuit reads, there may be multiple shared memory segments between the DFSClient and the DataNode. In HDFS, the DfsClientShm class abstracts a shared memory segment on the DFSClient side and the DfsClientShmManager class manages all segments on that side, while on the DataNode side the RegisteredShm class abstracts a segment and the ShortCircuitRegistry class manages all of the DataNode's segments.
In secure short-circuit reading, DFSClient and DataNode synchronize shared memory slot information through a domain socket.
DFSClient requests a shared memory segment to hold the state of its short-circuit read replicas. DataNode creates the shared memory file, maps it into DataNode memory, creates a RegisteredShm to manage the segment, and then returns the file descriptor of the shared memory file to DFSClient through the domain socket.
DFSClient opens the shared memory file from the received file descriptor, maps it into DFSClient memory, and creates a DfsClientShm object to manage the segment.
DFSClient then requests the file descriptors of the block file and metadata file from DataNode through the domain socket, while the two sides synchronize slot state in shared memory: DFSClient allocates a slot for the data block in the shared memory managed by DfsClientShm and synchronizes this with DataNode over the domain socket; DataNode creates the corresponding slot in the shared memory managed by RegisteredShm, opens the block file and metadata file, and sends their file descriptors to DFSClient through the domain socket.
[Figure: the shared memory mechanism]
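The receiving half of the handshake can be sketched in C as follows. This is a simplified illustration, assuming sock is the connected domain socket and seg_size is the segment size; the real DFSClient performs the equivalent steps through Hadoop's JNI DomainSocket wrapper:

```c
/* Receive a shared memory fd from ancillary data and map the segment. */
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>

void *recv_and_map_shm(int sock, size_t seg_size) {
    char data;
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    char cbuf[CMSG_SPACE(sizeof(int))];

    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    if (recvmsg(sock, &msg, 0) < 0) return NULL;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_type != SCM_RIGHTS) return NULL;

    int shm_fd;
    memcpy(&shm_fd, CMSG_DATA(cmsg), sizeof(shm_fd));

    /* Map the segment; both sides now see slot updates in real time. */
    void *seg = mmap(NULL, seg_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, shm_fd, 0);
    return seg == MAP_FAILED ? NULL : seg;
}
```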
2. Short-circuit reading process
When the client performs a short-circuit read of a block replica, DFSClient and DataNode interact as follows:
[Figure: simplified short-circuit read flow]
(1) DFSClient asks DataNode to create shared memory through the requestShortCircuitShm() interface; DataNode creates the shared memory file and returns its file descriptor to DFSClient.
(2) DFSClient allocates a slot in shared memory through the allocShmSlot() interface and requests the file descriptors of the replica to be read from DataNode through the requestShortCircuitFds() interface; DataNode opens the replica files and returns the file descriptors of the block file and metadata file to DFSClient.
(3) After DFSClient finishes reading the replica, it asynchronously asks DataNode to release the file descriptors and the corresponding slot through the releaseShortCircuitFds() interface.
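Once the descriptors have arrived, the data path of the read itself is plain local I/O against the received block file descriptor. A minimal sketch, with checksum verification against the metadata descriptor omitted:

```c
#include <unistd.h>

/* Read `len` bytes of block data starting at `block_offset`, bypassing
 * the DataNode entirely. */
ssize_t short_circuit_read(int block_fd, void *buf, size_t len,
                           off_t block_offset) {
    return pread(block_fd, buf, len, block_offset);
}
```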
Xiaomi's optimizations to HDFS secure short-circuit reading
1. Slow slot release
In several stress scenarios, we found that multiple short-circuit readers in an HBase RegionServer often blocked on domain socket reads and writes, and DataNode dumps showed a large number of ShortCircuitShm segments used for short-circuit reads. We then reproduced the situation with YCSB and found that when there are many short-circuit read requests, BlockReaderLocal allocation QPS is very high, and allocating a BlockReaderLocal depends on synchronously allocating a ShortCircuitShm slot for the block being read.
[Figure: YCSB GET QPS]
[Figure: YCSB GET latency]
By counting slot allocations and releases, we found that slot allocation QPS could reach about 3000 while slot release QPS only reached about 1000. After roughly one hour of the YCSB test, the DataNode went into full GC. Clearly, slots accumulating in the DataNode because they are not released in time were the main cause of the GC.
In the short-circuit read implementation at the time, every slot release created a new domain socket connection, and for each newly established connection the DataNode initialized a new DataXceiver to handle the request. Profiling the DataNode showed that the SlotReleaser thread spent a great deal of time establishing and tearing down these connections.
We therefore made the SlotReleaser reuse its domain socket connection. With the connection reused, slot release QPS matched allocation QPS on the same test set, eliminating the backlog of expired slots in the DataNode. At the same time, YCSB GET QPS increased by about 20% thanks to the reduction in DataNode young GC.
2. Inefficient shared memory allocation
While profiling the HBase short-circuit read path we found another problem: every so often, a batch of reads would see a latency of about 200ms, and these delays occurred almost simultaneously. At first we suspected minor GC in the HBase RegionServer, but comparing the RegionServer GC logs showed the times did not quite match; some of them coincided with DataNode minor GC instead. Since the actual read of a block replica does not go through the DataNode at all, if the DataNode was the cause, the problem could only arise in the interaction between the RegionServer and the DataNode when a short-circuit read is set up. By adding trace logging to the short-circuit read path, we found that the delays came from DataNode minor GC blocking ShortCircuitShm allocation requests: a blocked ShortCircuitShm allocation in turn blocks many slot allocations, slot allocation delay causes BlockReaderLocal delay, and that shows up as short-circuit read latency. This is why a batch of reads always reported similar delays at the same time. To solve the problem, we pre-allocate ShortCircuitShm segments, which reduces the effect of DataNode minor GC on short-circuit reads and makes latency much smoother.
3. Short-circuit reads of blocks under construction are prohibited
Short-circuit reads are prohibited for blocks under construction, that is, the last block of a file still being written cannot be read via short circuit. Because of HBase's read-write pattern this does not affect HBase much, but it has a large impact on HDFS-based streaming services. The optimization work we are doing mainly ensures that a short-circuit read only happens after a flush operation, checks the validity of the block during the read, handles read exceptions, and falls back to a remote read if the short-circuit read does fail.
Appendix
(1) File descriptor
Formally, a file descriptor is a non-negative integer. In practice it is an index into the per-process table of open files that the kernel maintains for each process. When a program opens an existing file or creates a new one, the kernel returns a file descriptor to the process.
Both Unix and Windows systems allow file descriptors to be passed between processes: one process opens a file and passes the descriptor to another process, which can then access the file. File descriptor passing is very useful for security, because it removes the requirement that the second process have sufficient permissions to open the file itself, and a descriptor can be passed read-only, so this approach also protects files from defective programs or malicious clients. On Unix systems, file descriptors can only be passed through Unix domain sockets.
(2) Unix domain socket
A Unix domain socket is a communication endpoint for exchanging data between processes running on the same host. The valid Unix domain socket types are SOCK_STREAM (stream-oriented) and SOCK_DGRAM (datagram-oriented, preserving message boundaries); as in most Unix implementations, Unix domain datagram sockets are always reliable and do not reorder datagrams. Unix domain sockets are a standard component of POSIX operating systems.
The Unix domain socket API resembles the network socket API, but it does not use any underlying network protocol; all communication takes place entirely within the operating system kernel. Unix domain sockets use the file system as their address namespace: a process references a Unix domain socket as a file system inode, so two processes can communicate by opening the same socket. Besides ordinary data, processes can also send file descriptors across a Unix domain socket connection using the sendmsg() and recvmsg() system calls, and the received descriptor grants the receiving process the same access to the file that the sending process had.
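A minimal C client shows the addressing model: the socket is named by a filesystem path rather than an IP address and port (the path here is hypothetical; HDFS configures its own via dfs.domain.socket.path):

```c
/* Minimal Unix domain socket client. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void) {
    int sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    struct sockaddr_un addr;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, "/var/run/demo.sock", sizeof(addr.sun_path) - 1);

    if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");  /* fails unless a server is listening on the path */
        return 1;
    }

    write(sock, "ping", 4); /* all traffic stays inside the kernel */
    close(sock);
    return 0;
}
```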
(3) Shared memory
Shared memory is a method of inter-process communication, i.e. of exchanging data between programs running at the same time: one process creates a region of RAM that other processes can also access. Because the processes access the shared region just as they access their own memory, it is a very fast form of communication, but it scales poorly; for example, the communicating processes must run on the same machine, and it must be avoided when the processes sharing memory run on different CPUs whose underlying architecture is not cache-coherent.
POSIX provides a standardized shared memory API through the shm_open function in sys/mman.h (the older System V API uses shmget, shmat, shmdt, and shmctl). Shared memory created by shm_open is persistent: it remains in the system until a process explicitly removes it. One drawback is that if a process crashes before cleaning up its shared memory, the memory remains until the system shuts down. POSIX also provides the mmap API for mapping files into memory; a mapping can be shared between processes, allowing the contents of a file to be used as shared memory.
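A minimal C example of this API follows; the object name is arbitrary, and on older glibc the program must be linked with -lrt:

```c
/* Minimal POSIX shared memory example: create, map, write, clean up. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const char *name = "/demo_shm";            /* hypothetical object name */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }

    ftruncate(fd, 4096);                       /* size the new object */
    char *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);

    strcpy(mem, "hello");                      /* visible to other openers */

    munmap(mem, 4096);
    shm_unlink(name);                          /* explicit cleanup, as noted above */
    return 0;
}
```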