Apache Arrow official documentation: IPC (Interprocess Communication)


2025-07-15 Update. From: SLTechnology News&Howtos (shulou)

Shulou(Shulou.com)06/03 Report--

Encapsulated message format

Data components in the stream and file formats are represented as encapsulated messages consisting of:

A length prefix indicating the size of the metadata
The message metadata as a Flatbuffer
Padding bytes to an 8-byte boundary
The message body

Schematically, we have:

<metadata_size: int32>
<metadata_flatbuffer: bytes>
<padding>
<message body>

The metadata_size includes the size of the flatbuffer plus padding. The Message flatbuffer includes a version number, the particular message (as a flatbuffer union), and the size of the message body:

table Message {
  version: org.apache.arrow.flatbuf.MetadataVersion;
  header: MessageHeader;
  bodyLength: long;
}
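As a rough illustration of this framing, the following Python sketch packs and unpacks one encapsulated message using only the standard library. The metadata is treated as opaque bytes rather than a real Message flatbuffer. The function names, the little-endian prefix, and the rule that the 4-byte prefix plus the metadata is padded to a multiple of 8 are assumptions made for the example, not an Arrow API.

```python
import struct

# int32 length prefix; little-endian byte order is an assumption of this sketch
PREFIX = struct.Struct("<i")


def encapsulate(metadata: bytes, body: bytes = b"") -> bytes:
    """Frame opaque metadata bytes and a body as one encapsulated message."""
    # Pad so that prefix + metadata ends on an 8-byte boundary (assumed rule).
    padding = -(PREFIX.size + len(metadata)) % 8
    # metadata_size counts the metadata plus its padding, as described above.
    metadata_size = len(metadata) + padding
    return PREFIX.pack(metadata_size) + metadata + b"\x00" * padding + body


def split_message(buf: bytes, body_length: int):
    """Split one framed message; body_length stands in for Message.bodyLength."""
    (metadata_size,) = PREFIX.unpack_from(buf, 0)
    meta_start = PREFIX.size
    body_start = meta_start + metadata_size
    metadata = buf[meta_start:body_start]  # still includes the padding bytes
    body = buf[body_start:body_start + body_length]
    return metadata_size, metadata, body
```

In a real reader the body length would come from the bodyLength field of the decoded Message flatbuffer; here it is passed in by hand to keep the sketch self-contained.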

Currently, four types of messages are supported:

Schema
RecordBatch
DictionaryBatch
Tensor

Streaming format

We provide a streaming format for record batches. It is presented as a sequence of encapsulated messages, each of which follows the format above. The schema comes first in the stream, and it is the same for all of the record batches that follow. If any field in the schema is dictionary-encoded, one or more DictionaryBatch messages will follow the schema.

When a stream reader implementation is reading a stream, after each message it can read the next four bytes to learn how large the following message metadata is. Once the message flatbuffer has been read, the message body can be read.

The stream writer can signal end-of-stream (EOS) either by writing a zero length as an int32 or by simply closing the stream interface.
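The reading procedure above can be sketched as a small loop. This is a minimal illustration under stated assumptions: the length prefix is read as a little-endian int32, and because decoding the Message flatbuffer is out of scope here, the caller supplies a hypothetical body_length_of function. The toy_message and toy_body_length helpers are invented purely so the example runs; they are not the Arrow metadata encoding.

```python
import io
import struct


def read_stream(stream, body_length_of):
    """Yield (metadata, body) pairs until an EOS marker or the end of input."""
    while True:
        prefix = stream.read(4)
        if len(prefix) < 4:
            return  # the writer closed the stream: treat as end of stream
        (metadata_size,) = struct.unpack("<i", prefix)
        if metadata_size == 0:
            return  # explicit EOS: a zero length written as an int32
        metadata = stream.read(metadata_size)
        body = stream.read(body_length_of(metadata))
        yield metadata, body


def toy_message(body: bytes) -> bytes:
    """Hypothetical stand-in metadata: the body length as an int64, zero-padded."""
    metadata = struct.pack("<q", len(body)).ljust(12, b"\x00")
    return struct.pack("<i", len(metadata)) + metadata + body


def toy_body_length(metadata: bytes) -> int:
    """Recover the body length from the toy metadata above."""
    return struct.unpack_from("<q", metadata)[0]
```

For example, list(read_stream(io.BytesIO(data), toy_body_length)) returns every message that precedes the EOS marker, or every message in the input if the writer simply closed the stream.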

File format

We define a "file format" that supports random access and is very similar to the streaming format. The file starts and ends with the magic string ARROW1 (plus padding). What follows in the file is identical to the stream format. At the end of the file, we write a footer containing the offset and size of each data block in the file, so that random access is possible. See format/File.fbs for the precise details of the file footer.

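Schematically, we have:

<magic number "ARROW1">
<empty padding bytes [to 8 byte boundary]>
<STREAMING FORMAT>
<FOOTER>
<FOOTER SIZE: int32>
<magic number "ARROW1">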
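To make the role of the magic string and the footer concrete, here is a hedged Python sketch that checks both ARROW1 markers and locates the footer bytes. It assumes that a little-endian int32 footer size sits immediately before the trailing magic string, which matches the published Arrow file layout but is not spelled out in the prose of this section. The function name is hypothetical, and the footer is left as opaque bytes rather than decoded with File.fbs.

```python
import struct

MAGIC = b"ARROW1"


def locate_footer(data: bytes):
    """Return (footer_start, footer_size) for a complete file held in memory."""
    if not (data.startswith(MAGIC) and data.endswith(MAGIC)):
        raise ValueError("missing ARROW1 magic string")
    # Assumed layout at the end of the file: <footer><int32 size><magic>.
    size_pos = len(data) - len(MAGIC) - 4
    if size_pos < 0:
        raise ValueError("file too short to contain a footer size")
    (footer_size,) = struct.unpack_from("<i", data, size_pos)
    footer_start = size_pos - footer_size
    # The footer cannot overlap the leading magic string and its padding.
    if footer_size < 0 or footer_start < 8:
        raise ValueError("footer size is inconsistent with the file length")
    return footer_start, footer_size
```

A reader would then decode data[footer_start:footer_start + footer_size] as the footer flatbuffer to obtain the block locations.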

RecordBatch body structure

The RecordBatch metadata contains a depth-first (pre-order) flattened set of field metadata and physical memory buffers (some comments from Message.fbs have been shortened or removed):

table RecordBatch {
  length: long;
  nodes: [FieldNode];
  buffers: [Buffer];
}

struct FieldNode {
  length: long;
  null_count: long;
}

struct Buffer {
  /// The shared memory page id where this buffer is located. Currently this is
  /// not used
  page: int;

  /// The relative offset into the shared memory page where the bytes for this
  /// buffer starts
  offset: long;

  /// The absolute length (in bytes) of the memory buffer. The memory is found
  /// from offset (inclusive) to offset + length (non-inclusive).
  length: long;
}

In the context of a file, the page is not used, and the Buffer offsets use the start of the message body as their frame of reference. So, while in a general IPC setting these offsets may be anywhere in one or more shared memory regions, in the file format the offsets start from 0.
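The depth-first (pre-order) flattening can be illustrated with a short Python sketch. The Column class is a hypothetical in-memory representation invented for this example, and padding each buffer to a multiple of 8 bytes inside the body is an assumption of the sketch. The output mirrors the FieldNode and Buffer entries: one (length, null_count) pair per field and one (offset, length) pair per buffer, with offsets counted from the start of the message body.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    """Hypothetical column: its own buffers plus any nested child columns."""
    length: int
    null_count: int
    buffers: list
    children: list = field(default_factory=list)


def flatten(columns):
    """Return (nodes, buffers, body) in depth-first, pre-order traversal order."""
    nodes, buffers, body = [], [], bytearray()

    def visit(col):
        # Pre-order: the parent's metadata and buffers come before its children.
        nodes.append((col.length, col.null_count))
        for buf in col.buffers:
            buffers.append((len(body), len(buf)))  # offset is relative to body start
            body.extend(buf)
            body.extend(b"\x00" * (-len(buf) % 8))  # assumed 8-byte buffer padding
        for child in col.children:
            visit(child)

    for col in columns:
        visit(col)
    return nodes, buffers, bytes(body)
```

With a struct-like parent that has two children followed by a second top-level column, the nodes come out in the order parent, first child, second child, second column, and the first buffer offset is 0.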

The location of each record batch, together with the sizes of its metadata block and its body of buffers, is stored in the file footer:

struct Block {
  offset: long;
  metaDataLength: int;
  bodyLength: long;
}
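A Block entry is what makes random access possible: given one, a reader can slice a single record batch out of the file without scanning earlier messages. The sketch below is a minimal Python illustration in which the Block class mirrors the three fields of the struct and read_block is a hypothetical helper. It assumes that metaDataLength is counted from the block offset and covers everything before the body.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """Mirror of the footer's Block struct (field names follow the schema)."""
    offset: int
    metaDataLength: int
    bodyLength: int


def read_block(data: bytes, block: Block):
    """Slice the metadata region and the body of one record batch out of a file."""
    meta_end = block.offset + block.metaDataLength
    metadata = data[block.offset:meta_end]  # size prefix, flatbuffer, and padding
    body = data[meta_end:meta_end + block.bodyLength]
    return metadata, body
```

The same slicing works on a memory-mapped file, which is one reason the footer records absolute offsets rather than message indices.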

Some notes about this:

The Block offset indicates the starting byte of the record batch.
The metadata length includes the flatbuffer size, the record batch metadata flatbuffer, and any padding bytes.

Dictionary batches

Dictionary batches have not yet been implemented, although the metadata provides for them. For the time being, the DICTIONARY segments of the file layout do not appear in any of the file implementations.

Tensor (multidimensional array) message format

The Tensor message type provides a way to write a multidimensional array of fixed-size values (such as a NumPy ndarray) using Arrow's shared memory tools. Arrow implementations are generally not required to implement this data format, although we provide a reference implementation in C++.

When writing a standalone encapsulated Tensor message, we use the format described above, but additionally align the starting offset (if writing to a shared memory region) to a multiple of 8:

<PADDING>
<metadata size: int32>
<metadata>
<tensor body>
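The alignment step amounts to rounding the write position up to the next multiple of 8 before the message is written. A minimal Python sketch follows, in which a bytearray stands in for the shared memory region and append_aligned is a hypothetical helper name.

```python
def append_aligned(region: bytearray, message: bytes) -> int:
    """Append message at the next offset that is a multiple of 8; return that offset."""
    padding = -len(region) % 8  # bytes needed to reach the next 8-byte boundary
    region.extend(b"\x00" * padding)
    start = len(region)
    region.extend(message)
    return start
```

The returned offset is what a reader would need in order to find the message later, so a writer would typically record it alongside whatever identifies the tensor.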
