Apache Arrow official documentation: Metadata

2025-04-03 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

Metadata: logical types, schemas, headers

This document describes the Arrow metadata specification, which enables systems to communicate the following:

● Logical array types (implemented using the physical memory layouts specified in Layout.md)

● Schemas for table-like collections of Arrow data structures

● "Data headers" that indicate the physical locations of memory buffers, which is sufficient to reconstruct Arrow data structures without copying memory

Canonical implementation

We are using Flatbuffers to read and write Arrow metadata with low overhead. See Message.fbs.

Schemas

The Schema type describes a table-like structure consisting of any number of Arrow arrays, each of which can be interpreted as a column in the table. The schema itself does not describe the physical structure of any particular dataset.

A schema consists of a sequence of fields, which are metadata describing the columns. The Flatbuffers IDL for a field is:

table Field {
  // Name is not required, i.e. in a List
  name: string;
  nullable: bool;
  type: Type;
  // present only if the field is dictionary encoded
  // will point to a dictionary provided by a DictionaryBatch message
  dictionary: long;
  // children apply only to Nested data types like Struct, List and Union
  children: [Field];
  /// layout of buffers produced for this type (as derived from the Type)
  /// does not include children
  /// each recordbatch will return instances of those Buffers.
  layout: [VectorLayout];
  // User-defined metadata
  custom_metadata: [KeyValue];
}

Type is the logical type of the field. Nested types, such as List, Struct, and Union, have a sequence of child fields.
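As a rough illustration of how a field and its child fields nest, the sketch below mirrors a few attributes of the Field table in a Python dataclass. The class `Field`, the helper `count_fields`, and the example column `scores` are hypothetical names chosen for this example and are not part of any Arrow library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """Hypothetical stand-in for a subset of the Flatbuffers Field table."""
    name: str
    type: str                       # logical type name, e.g. "Int64" or "List"
    nullable: bool = True
    children: List["Field"] = field(default_factory=list)


def count_fields(f: Field) -> int:
    """Count a field plus all of its descendants."""
    return 1 + sum(count_fields(child) for child in f.children)


# A List column whose single child field describes the list items.
scores = Field("scores", "List", children=[Field("item", "Int64")])
print(count_fields(scores))  # 2
```

Only nested types carry a non-empty `children` list; a primitive field such as Int64 is a leaf.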

A JSON representation of the schema is also provided.

Field:

{
  "name": "name_of_the_field",
  "nullable": false,
  "type": /* Type */,
  "children": [ /* Field */ ],
  "typeLayout": {
    "vectors": [ /* VectorLayout */ ]
  }
}

VectorLayout:

{
  "type": "DATA | OFFSET | VALIDITY | TYPE",
  "typeBitWidth": /* int */
}

Type:

{
  "name": "null | struct | list | int | floatingpoint | binary | fixedsizebinary | bool | decimal | date | time | timestamp | interval"
  // fields as defined in the Flatbuffer depending on the type name
}

Union:

{
  "name": "union",
  "mode": "Sparse | Dense",
  "typeIds": [ /* integer */ ]
}

The typeIds field in Union lists the codes used to denote each type, which may differ from the index of the corresponding child array. This means the union type ids do not have to be enumerated from 0.
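To illustrate why the type ids are kept separate from child positions, here is a small sketch. The typeIds values 5 and 7 and the child names are invented for this example.

```python
# Hypothetical union metadata in the JSON form described in this document.
union_type = {"name": "union", "mode": "Sparse", "typeIds": [5, 7]}
child_names = ["as_int", "as_text"]  # hypothetical child fields, in order

# Map each type id to the index of the child array it denotes.
child_index_by_type_id = {
    type_id: index for index, type_id in enumerate(union_type["typeIds"])
}


def child_for(type_id: int) -> str:
    """Return the name of the child that a type id refers to."""
    return child_names[child_index_by_type_id[type_id]]


print(child_for(7))  # as_text
```

Here type id 7 refers to the child at index 1, so a reader needs the mapping rather than using the id as a position.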

Int:

{"name": "int", "bitWidth": /* integer */, "isSigned": /* boolean */}

FloatingPoint:

{"name": "floatingpoint", "precision": "HALF | SINGLE | DOUBLE"}

Decimal:

{"name": "decimal", "precision": /* integer */, "scale": /* integer */}

Timestamp:

{"name": "timestamp", "unit": "SECOND | MILLISECOND | MICROSECOND | NANOSECOND"}

Date:

{"name": "date", "unit": "DAY | MILLISECOND"}

Time:

{"name": "time", "unit": "SECOND | MILLISECOND | MICROSECOND | NANOSECOND", "bitWidth": /* integer: 32 or 64 */}

Interval:

{"name": "interval", "unit": "YEAR_MONTH | DAY_TIME"}

Schema:

{"fields": [/* Field */]}

Record data headers

A record batch is a collection of top-level, named, equal-length Arrow arrays (or vectors). If one of the arrays contains nested data, its child arrays do not need to be the same length as the top-level arrays.

A record batch can be thought of as a realization of a particular schema. The metadata that describes a particular record batch is called a "data header". This is the Flatbuffers IDL for a record batch data header:

table RecordBatch {
  length: long;
  nodes: [FieldNode];
  buffers: [Buffer];
}

The RecordBatch metadata provides for record batches with lengths exceeding 2^31 - 1, but Arrow implementations are not required to support lengths beyond this size.
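As a small illustration of this size limit, the following sketch checks whether a length still fits in a signed 32-bit integer. The function name `fits_in_int32` is hypothetical.

```python
INT32_MAX = 2**31 - 1  # largest signed 32-bit value


def fits_in_int32(length: int) -> bool:
    """Return True if a length can be held by a 32-bit-only implementation."""
    return 0 <= length <= INT32_MAX


print(fits_in_int32(1_000_000))  # True
print(fits_in_int32(2**31))      # False
```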

The nodes and buffers fields are produced by a depth-first traversal (flattening) of the schema, which may contain nested types, for a given in-memory dataset.

Buffers

A buffer is metadata that describes a contiguous memory region relative to some virtual address space. This may include:

● Shared memory, such as a memory-mapped file

● An RPC message received in memory

● Data in a file

The key form of the Buffer type is:

struct Buffer {
  offset: long;
  length: long;
}

In the context of a record batch, each field has some number of buffers associated with it, which are derived from its physical memory layout.

Each logical type (separate from its children, if it is a nested type) has a deterministic number of buffers associated with it. These are specified in the logical types section.
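The offset and length pair can be read as a window into a larger region of memory. The sketch below shows that idea with Python's built-in `memoryview`, which slices without copying. The class `Buffer`, the helper `view`, and the byte contents are hypothetical, and the sketch ignores any padding or alignment a real implementation may apply.

```python
from typing import NamedTuple


class Buffer(NamedTuple):
    """Mirror of the Buffer struct: a region inside a larger address space."""
    offset: int
    length: int


def view(region: memoryview, buf: Buffer) -> memoryview:
    """Return a zero-copy view of the bytes that `buf` describes."""
    return region[buf.offset:buf.offset + buf.length]


# Hypothetical message body holding two back-to-back buffers.
body = memoryview(bytes([0b00000111]) + bytes([10, 20, 30]))
validity = view(body, Buffer(offset=0, length=1))
values = view(body, Buffer(offset=1, length=3))
print(list(values))  # [10, 20, 30]
```

Because both views share the memory of `body`, reconstructing the two buffers does not copy any bytes.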

Field metadata

The FieldNode value type contains metadata about each level in a nested type hierarchy.

struct FieldNode {
  /// The number of value slots in the Arrow array at this level of a nested
  /// tree
  length: long;
  /// The number of observed nulls.
  null_count: long;
}

The FieldNode metadata provides for fields with lengths exceeding 2^31 - 1, but Arrow implementations are not required to support such large arrays.

Flattening of nested data

Nested types are flattened in the record batch in depth-first order. When each field in the nested type tree is visited, its metadata is appended to the top-level fields array and the buffers associated with that field (but not its children) are appended to the buffers array.

For example, consider the schema:

col1: Struct<a: Int32, b: List<Int64>, c: Float64>
col2: Utf8

The flattened version is:

FieldNode 0: Struct name='col1'
FieldNode 1: Int32 name='a'
FieldNode 2: List name='b'
FieldNode 3: Int64 name='item'  # arbitrary
FieldNode 4: Float64 name='c'
FieldNode 5: Utf8 name='col2'
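The depth-first flattening can be sketched in a few lines of Python. This is a minimal sketch, not the implementation used by any Arrow library: the class `Field`, the function `flatten`, and the label strings are hypothetical, and the per-type buffer lists are taken from the descriptions in this document (a Struct contributes only a validity bitmap, a List adds offsets, Utf8 adds offsets and data, and fixed-width types contribute a validity bitmap and values).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Field:
    """Hypothetical schema node: a name, a logical type name, and children."""
    name: str
    type: str
    children: List["Field"] = field(default_factory=list)


# Buffers each type contributes for itself (children excluded).
OWN_BUFFERS = {
    "Struct": ["validity bitmap"],
    "List": ["validity bitmap", "list offsets"],
    "Utf8": ["validity bitmap", "offsets", "data"],
}
PRIMITIVE_BUFFERS = ["validity bitmap", "values"]


def flatten(fields: List[Field]) -> Tuple[List[str], List[str]]:
    """Depth-first flattening into field node labels and buffer labels."""
    nodes: List[str] = []
    buffers: List[str] = []

    def visit(f: Field) -> None:
        index = len(nodes)
        nodes.append(f"{f.type} name='{f.name}'")
        for kind in OWN_BUFFERS.get(f.type, PRIMITIVE_BUFFERS):
            buffers.append(f"field {index} {kind}")
        for child in f.children:  # children are visited after their parent
            visit(child)

    for f in fields:
        visit(f)
    return nodes, buffers


schema = [
    Field("col1", "Struct", [
        Field("a", "Int32"),
        Field("b", "List", [Field("item", "Int64")]),
        Field("c", "Float64"),
    ]),
    Field("col2", "Utf8"),
]
nodes, buffers = flatten(schema)
for i, node in enumerate(nodes):
    print(f"FieldNode {i}: {node}")
```

For this schema the sketch yields 6 field nodes and 12 buffers, with each parent's own buffers placed before those of its children.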

For the buffers produced, we would have the following (each type is described in more detail below):

buffer 0: field 0 validity bitmap
buffer 1: field 1 validity bitmap
buffer 2: field 1 values
buffer 3: field 2 validity bitmap
buffer 4: field 2 list offsets
buffer 5: field 3 validity bitmap
buffer 6: field 3 values
buffer 7: field 4 validity bitmap
buffer 8: field 4 values
buffer 9: field 5 validity bitmap
buffer 10: field 5 offsets
buffer 11: field 5 data

Logical types

A logical type consists of a type name and metadata, along with an explicit mapping to a physical memory representation. Logical types fall into different categories:

● Types represented as fixed-width primitive arrays (for example, C-style integers and floating-point numbers)

● Types whose memory layout is equivalent to that of a physical nested type (for example, strings use the List representation but are not logically nested types)

Integers

In the first version of Arrow, we provide the standard 8-bit through 64-bit C integer types, both signed and unsigned:

● Signed types: Int8, Int16, Int32, Int64

● Unsigned types: UInt8, UInt16, UInt32, UInt64

The IDL looks like:

table Int {
  bitWidth: int;
  is_signed: bool;
}

Integer endianness is currently set globally at the schema level. If a schema is set to little-endian, then all integer types that appear in it must be little-endian. Integers that are part of other data representations, such as list offsets and union types, must have the same endianness as the entire record batch.
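To make the effect of the byte order concrete, the sketch below encodes one integer with Python's standard `struct` module for a given bitWidth and is_signed pair. The table `FORMAT_CHARS` and the function `encode_int` are hypothetical helpers written for this example.

```python
import struct

# struct format characters for each (bitWidth, is_signed) combination.
FORMAT_CHARS = {
    (8, True): "b", (8, False): "B",
    (16, True): "h", (16, False): "H",
    (32, True): "i", (32, False): "I",
    (64, True): "q", (64, False): "Q",
}


def encode_int(value: int, bit_width: int, is_signed: bool,
               little_endian: bool = True) -> bytes:
    """Encode one integer using the byte order chosen for the schema."""
    prefix = "<" if little_endian else ">"
    return struct.pack(prefix + FORMAT_CHARS[(bit_width, is_signed)], value)


print(encode_int(1, 32, True).hex())                       # 01000000
print(encode_int(1, 32, True, little_endian=False).hex())  # 00000001
```

The same Int32 value produces different bytes under the two byte orders, which is why a single setting has to apply to every integer in a record batch.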

Floating point numbers

We provide three types of floating-point numbers as fixed bit-width primitive arrays:

● Half precision, 16-bit width

● Single precision, 32-bit width

● Double precision, 64-bit width
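The three widths can be checked against Python's standard `struct` module, whose `e`, `f`, and `d` format characters correspond to IEEE 754 half, single, and double precision. The names `PRECISION_FORMATS` and `bit_width` are hypothetical helpers for this sketch.

```python
import struct

# struct format characters for the three precisions.
PRECISION_FORMATS = {"HALF": "e", "SINGLE": "f", "DOUBLE": "d"}


def bit_width(precision: str) -> int:
    """Return the width in bits of one value of the given precision."""
    return 8 * struct.calcsize("<" + PRECISION_FORMATS[precision])


for name in PRECISION_FORMATS:
    print(name, bit_width(name))
```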

The IDL looks like:

enum Precision:int {HALF, SINGLE, DOUBLE}

table FloatingPoint {
  precision: Precision;
}

Boolean

The Boolean logical type is represented as a 1-bit wide primitive physical type. The bits are numbered using least-significant bit (LSB) ordering.

Like other fixed bit-width primitive types, Boolean data appears as 2 buffers in the data header (one bitmap for the validity vector and one for the values).
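LSB ordering means that slot i lives in bit (i mod 8) of byte (i div 8), counting from the least significant bit. The following sketch packs and reads such a bitmap. The function names `pack_bits` and `get_bit` are hypothetical.

```python
from typing import Sequence


def pack_bits(flags: Sequence[bool]) -> bytes:
    """Pack booleans so that slot i is stored in bit (i % 8) of byte i // 8."""
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def get_bit(bitmap: bytes, i: int) -> bool:
    """Read slot i back from an LSB-ordered bitmap."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


values = pack_bits([True, False, True, True])
print(f"{values[0]:08b}")  # 00001101
```

Slots 0, 2, and 3 are set, so the single byte has the value 1 + 4 + 8 = 13.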

List

The List logical type is the logical, identically named counterpart to the List physical type.

In data header form, the List field node contains 2 buffers:

● Validity bitmap

● List offsets

The buffers associated with a List's child field are handled recursively according to the child logical type (for example, two List types with different child types are handled differently).

Utf8 and Binary

We specify two logical types for variable-length bytes:

● Utf8 data is Unicode values with UTF-8 encoding

● Binary is any other variable-length bytes

These types both have the same memory layout as the nested type List<UInt8>, with the constraint that the inner bytes cannot contain null values. From a logical type perspective, they are primitive, not nested, types.
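As a sketch of what this layout looks like in practice, the code below builds a validity bitmap, an offsets buffer, and a byte data buffer for a short list of strings. Several details are assumptions drawn from the physical layout specification rather than from this document: a set validity bit marks a non-null slot, there is one more offset than there are slots, and offsets are 32-bit integers (written little-endian here). The function name `encode_utf8_array` is hypothetical.

```python
import struct
from typing import List, Optional, Tuple


def encode_utf8_array(strings: List[Optional[str]]) -> Tuple[bytes, bytes, bytes]:
    """Build validity, offsets, and data buffers for a list of str or None."""
    validity = bytearray((len(strings) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, s in enumerate(strings):
        if s is not None:
            validity[i // 8] |= 1 << (i % 8)  # LSB-ordered validity bit
            data += s.encode("utf-8")
        offsets.append(len(data))  # a null slot repeats the previous offset
    # Assumption: 32-bit little-endian offsets.
    offsets_buf = struct.pack(f"<{len(offsets)}i", *offsets)
    return bytes(validity), offsets_buf, bytes(data)


validity, offsets, data = encode_utf8_array(["ab", None, "cde"])
print(validity, struct.unpack("<4i", offsets), data)
```

Slot i spans the bytes from offsets[i] up to offsets[i + 1], so the null slot in the middle occupies no bytes in the data buffer.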

In data header form, List<UInt8> would appear as 2 field nodes (List and UInt8) and 4 buffers (2 for each node, as described above). These types instead have a simplified representation: a single field node (of Utf8 or Binary logical type, with no children) and 3 buffers:

● Validity bitmap

● List offsets

● Byte data

Decimal

TBD

Timestamp

All timestamps are stored as 64-bit integers, in one of four units: second, millisecond, microsecond, or nanosecond.
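The sketch below converts a Python datetime into an integer count in each of the four units. It assumes that timestamps count from the UNIX epoch, which this section does not state explicitly. The names `UNITS_PER_SECOND`, `EPOCH`, and `to_timestamp` are hypothetical.

```python
from datetime import datetime, timezone

UNITS_PER_SECOND = {
    "SECOND": 1,
    "MILLISECOND": 1_000,
    "MICROSECOND": 1_000_000,
    "NANOSECOND": 1_000_000_000,
}
# Assumption: timestamps are counted from the UNIX epoch in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_timestamp(dt: datetime, unit: str) -> int:
    """Convert an aware datetime to an integer count of `unit` since the epoch."""
    delta = dt - EPOCH
    micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    return micros * UNITS_PER_SECOND[unit] // 1_000_000


one_day_later = datetime(1970, 1, 2, tzinfo=timezone.utc)
print(to_timestamp(one_day_later, "SECOND"))       # 86400
print(to_timestamp(one_day_later, "MILLISECOND"))  # 86400000
```

The arithmetic uses integers throughout to avoid floating-point rounding at microsecond and nanosecond resolution.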

Date

Two different date types are supported:

● Days since the UNIX epoch, as a 32-bit integer

● Milliseconds since the UNIX epoch, as a 64-bit integer

Time

Time supports the same units: second, millisecond, microsecond, and nanosecond. Time is represented as the smallest integer that accommodates the specified unit: 32 bits for seconds and milliseconds, 64 bits for the others.
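The choice of width follows from simple arithmetic, which the sketch below spells out. It assumes that a time value is a time of day (less than 24 hours) held in a signed integer; neither detail is stated in this section. The names `TIME_BIT_WIDTH`, `max_time_of_day`, and `fits` are hypothetical.

```python
UNITS_PER_SECOND = {
    "SECOND": 1,
    "MILLISECOND": 1_000,
    "MICROSECOND": 1_000_000,
    "NANOSECOND": 1_000_000_000,
}
# Widths as given in the text: 32 bits for the two coarser units.
TIME_BIT_WIDTH = {
    "SECOND": 32,
    "MILLISECOND": 32,
    "MICROSECOND": 64,
    "NANOSECOND": 64,
}


def max_time_of_day(unit: str) -> int:
    """Largest value within one day, assuming time means time of day."""
    return 86_400 * UNITS_PER_SECOND[unit] - 1


def fits(unit: str, bit_width: int) -> bool:
    """True if every time of day in `unit` fits a signed integer of this width."""
    return max_time_of_day(unit) < 2 ** (bit_width - 1)


for unit, width in TIME_BIT_WIDTH.items():
    print(unit, width, fits(unit, width), fits(unit, 32))
```

A full day is 86,400,000 milliseconds, which fits in 32 bits, while a full day in microseconds already exceeds the signed 32-bit range.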

Dictionary encoding
