2025-04-03 Update From: SLTechnology News&Howtos
Metadata: logical types, schemas, headers
This document describes the Arrow metadata specification, which enables systems to communicate:
● Logical array types (implemented using the physical memory layouts specified in Layout.md)
● Schemas for table-like collections of Arrow data structures
● "Data headers" indicating the physical locations of memory buffers, sufficient to reconstruct Arrow data structures without copying memory
Canonical implementation
We are using Flatbuffers to read and write Arrow metadata with low overhead. See Message.fbs.
Schemas
The Schema type describes a table-like structure consisting of any number of Arrow arrays, each of which can be interpreted as a column in the table. A schema by itself does not describe the physical structure of any particular dataset.
A schema consists of a sequence of fields, which are metadata describing the columns. The Flatbuffers IDL for a field is:
table Field {
  // Name is not required, e.g. in a List
  name: string;
  nullable: bool;
  type: Type;
  // present only if the field is dictionary encoded
  // will point to a dictionary provided by a DictionaryBatch message
  dictionary: long;
  // children apply only to Nested data types like Struct, List and Union
  children: [Field];
  /// layout of buffers produced for this type (as derived from the Type)
  /// does not include children
  /// each recordbatch will return instances of those Buffers.
  layout: [VectorLayout];
  // User-defined metadata
  custom_metadata: [KeyValue];
}
Type is the logical type of the field. Nested types, such as List, Struct, and Union, have a sequence of child fields.
A JSON representation of the schema is also provided.
Field:
{
  "name": "name_of_the_field",
  "nullable": false,
  "type": /* Type */,
  "children": [ /* Field */ ],
  "typeLayout": {
    "vectors": [ /* VectorLayout */ ]
  }
}
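As an illustration of the Field JSON form, here is a minimal Python sketch. The `Field` dataclass, its `to_json` method, and the example field names (`id`, `tags`, `item`) are hypothetical and not part of the specification; the sketch leaves out `typeLayout`, and the `type` objects and the enclosing `{"fields": [...]}` schema object use the JSON forms defined elsewhere in this document.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Field:
    """Hypothetical in-memory mirror of the Field metadata."""
    name: str
    type: Dict[str, Any]  # a JSON Type object, e.g. {"name": "int", ...}
    nullable: bool = True
    children: List["Field"] = field(default_factory=list)

    def to_json(self) -> Dict[str, Any]:
        # typeLayout is intentionally omitted from this sketch.
        return {
            "name": self.name,
            "nullable": self.nullable,
            "type": self.type,
            "children": [child.to_json() for child in self.children],
        }


int32 = {"name": "int", "bitWidth": 32, "isSigned": True}
schema = {
    "fields": [
        Field("id", int32, nullable=False).to_json(),
        # A nested type carries its child fields in "children".
        Field("tags", {"name": "list"},
              children=[Field("item", {"name": "binary"})]).to_json(),
    ]
}
schema_text = json.dumps(schema, indent=2)
```

Primitive fields end up with an empty `children` list, while the list field nests one child field.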
VectorLayout:
{
  "type": "DATA | OFFSET | VALIDITY | TYPE",
  "typeBitWidth": /* int */
}
Type:
{
  "name": "null | struct | list | int | floatingpoint | binary | fixedsizebinary | bool | decimal | date | time | timestamp | interval"
  // fields as defined in the Flatbuffer depending on the type name
}
Union:
{
  "name": "union",
  "mode": "Sparse | Dense",
  "typeIds": [ /* integer */ ]
}
The typeIds field of a Union lists the codes used to denote each type, which may differ from the index of the corresponding child array. This means union type ids do not have to be enumerated from 0.
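A small Python sketch of that indirection follows. It assumes that the i-th entry of `typeIds` is the code for the i-th child array; the helper name `child_index_for_type_id` and the ids 5 and 7 are made up for illustration.

```python
from typing import List


def child_index_for_type_id(type_ids: List[int], type_id: int) -> int:
    """Return the position of the child array that a union type id denotes."""
    return type_ids.index(type_id)


union_type = {"name": "union", "mode": "Sparse", "typeIds": [5, 7]}

# Type id 7 denotes the second child (index 1), even though ids do not start at 0.
second = child_index_for_type_id(union_type["typeIds"], 7)
```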
Int:
{"name": "int", "bitWidth": /* integer */, "isSigned": /* boolean */}
FloatingPoint:
{"name": "floatingpoint", "precision": "HALF | SINGLE | DOUBLE"}
Decimal:
{"name": "decimal", "precision": /* integer */, "scale": /* integer */}
Timestamp:
{"name": "timestamp", "unit": "SECOND | MILLISECOND | MICROSECOND | NANOSECOND"}
Date:
{"name": "date", "unit": "DAY | MILLISECOND"}
Time:
{"name": "time", "unit": "SECOND | MILLISECOND | MICROSECOND | NANOSECOND", "bitWidth": /* integer: 32 or 64 */}
Interval:
{"name": "interval", "unit": "YEAR_MONTH | DAY_TIME"}
Schema:
{"fields": [ /* Field */ ]}
Record data headers
A record batch is a collection of top-level named, equal-length Arrow arrays (or vectors). If one of the arrays contains nested data, its child arrays are not required to be the same length as the top-level arrays.
A record batch can be thought of as a realization of a particular schema. The metadata describing a particular record batch is called a "data header". Here is the Flatbuffers IDL for a record batch data header:
table RecordBatch {
  length: long;
  nodes: [FieldNode];
  buffers: [Buffer];
}
The RecordBatch metadata provides for record batches with lengths exceeding 2^31 - 1, but Arrow implementations are not required to support lengths beyond this size.
The nodes and buffers fields are produced by a depth-first traversal (flattening) of a schema, possibly containing nested types, for a given in-memory dataset.
Buffers
A buffer is metadata describing a contiguous memory region relative to some virtual address space. This may include:
● Shared memory, such as a memory-mapped file
● An RPC message received in memory
● Data in a file
The key form of the Buffer type is:
struct Buffer {
  offset: long;
  length: long;
}
In the context of a record batch, each field has some number of buffers associated with it, which are derived from its physical memory layout.
Each logical type (separate from its children, if it is a nested type) has a deterministic number of buffers associated with it. These are specified in the logical types section.
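To make the offset/length idea concrete, here is a minimal Python sketch that resolves a Buffer against an in-memory region without copying. The `Buffer` dataclass mirrors the struct above; the `view` helper, the bounds check, and the sample bytes are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buffer:
    offset: int  # byte offset from the start of the address space
    length: int  # size of the region in bytes


def view(region: memoryview, buf: Buffer) -> memoryview:
    """Return a zero-copy view of the bytes that `buf` describes."""
    end = buf.offset + buf.length
    if buf.offset < 0 or buf.length < 0 or end > len(region):
        raise ValueError("buffer lies outside the region")
    return region[buf.offset:end]


body = bytearray(b"\x0f\x00\x00\x00hello")
values = view(memoryview(body), Buffer(offset=4, length=5))

# The view observes this write because no bytes were copied.
body[4] = ord("j")
```

The same slicing works on an `mmap` object wrapped in a `memoryview`, which is one way a memory-mapped file could be handled.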
Field metadata
The FieldNode values contain metadata about each level in a nested type hierarchy.
struct FieldNode {
  /// The number of value slots in the Arrow array at this level of a nested
  /// tree
  length: long;
  /// The number of observed nulls.
  null_count: long;
}
The FieldNode metadata provides for fields with lengths exceeding 2^31 - 1, but Arrow implementations are not required to support such large arrays.
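As a rough sketch of what a FieldNode records, the following Python derives `length` and `null_count` from a list of values in which `None` stands for a null slot. The helper names and the `MAX_INT32_LENGTH` constant are hypothetical; an implementation that does not support large arrays could use a check like `fits_in_int32` to reject them.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

MAX_INT32_LENGTH = 2**31 - 1  # lengths beyond this are optional to support


@dataclass(frozen=True)
class FieldNode:
    length: int      # number of value slots at this level
    null_count: int  # number of observed nulls


def field_node_for(values: Sequence[Optional[object]]) -> FieldNode:
    """Build the FieldNode for one level, treating None as a null slot."""
    return FieldNode(length=len(values),
                     null_count=sum(v is None for v in values))


def fits_in_int32(node: FieldNode) -> bool:
    """True if an implementation limited to 2^31 - 1 slots can handle the node."""
    return node.length <= MAX_INT32_LENGTH


node = field_node_for([1, None, 3, 4])
```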
Flattening of nested data
Nested types are flattened in the record batch in depth-first order. When visiting each field in the nested type tree, its metadata is appended to the top-level nodes array and the buffers associated with that field (but not its children) are appended to the buffers array.
For example, consider the schema:
col1: Struct<a: Int32, b: List<Int64>, c: Float64>
col2: Utf8
The flattened version is:
FieldNode 0: Struct name='col1'
FieldNode 1: Int32 name='a'
FieldNode 2: List name='b'
FieldNode 3: Int64 name='item'  # arbitrary
FieldNode 4: Float64 name='c'
FieldNode 5: Utf8 name='col2'
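The depth-first flattening can be sketched in Python as follows. This is an illustrative model, not the reference implementation: the `Field` dataclass, the `flatten` function, and the `BUFFERS_BY_TYPE` table are hypothetical, and the per-type buffer labels are taken from the descriptions of each logical type in this document (a struct contributes only a validity bitmap, a list adds offsets, Utf8 adds offsets and data, and fixed-width primitives have a validity bitmap and values).

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Buffers contributed by a field itself, excluding its children.
BUFFERS_BY_TYPE = {
    "Struct": ["validity bitmap"],
    "List": ["validity bitmap", "list offsets"],
    "Utf8": ["validity bitmap", "offsets", "data"],
}
PRIMITIVE_BUFFERS = ["validity bitmap", "values"]


@dataclass
class Field:
    name: str
    type_name: str
    children: List["Field"] = field(default_factory=list)


def flatten(fields: List[Field]) -> Tuple[List[str], List[str]]:
    """Visit fields depth-first, collecting field nodes and buffers."""
    nodes: List[str] = []
    buffers: List[str] = []

    def visit(f: Field) -> None:
        index = len(nodes)
        nodes.append(f"{f.type_name} name='{f.name}'")
        for label in BUFFERS_BY_TYPE.get(f.type_name, PRIMITIVE_BUFFERS):
            buffers.append(f"field {index} {label}")
        for child in f.children:  # children follow their parent
            visit(child)

    for f in fields:
        visit(f)
    return nodes, buffers


schema = [
    Field("col1", "Struct", [
        Field("a", "Int32"),
        Field("b", "List", [Field("item", "Int64")]),
        Field("c", "Float64"),
    ]),
    Field("col2", "Utf8"),
]
nodes, buffers = flatten(schema)
```

For this schema the sketch yields six field nodes and twelve buffers.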
For the buffers produced, we would have the following (each type is described in more detail below):
buffer 0: field 0 validity bitmap
buffer 1: field 1 validity bitmap
buffer 2: field 1 values
buffer 3: field 2 validity bitmap
buffer 4: field 2 list offsets
buffer 5: field 3 validity bitmap
buffer 6: field 3 values
buffer 7: field 4 validity bitmap
buffer 8: field 4 values
buffer 9: field 5 validity bitmap
buffer 10: field 5 offsets
buffer 11: field 5 data
Logical types
A logical type consists of a type name and metadata along with an explicit mapping to a physical memory representation. Logical types fall into different categories:
● Types represented as fixed-width primitive arrays (for example, C-style integers and floating-point numbers)
● Types with a memory layout equivalent to a physical nested type (for example, strings use the List representation but are not logically nested types)
Integers
In the first version of Arrow, we provide the standard 8-bit through 64-bit C integer types, both signed and unsigned:
● Signed types: Int8, Int16, Int32, Int64
● Unsigned types: UInt8, UInt16, UInt32, UInt64
The IDL looks like:
table Int {
  bitWidth: int;
  is_signed: bool;
}
The integer endianness is currently set globally at the schema level. If a schema is set to little-endian, then all integer types occurring within it must be little-endian. Integers that are part of other data representations, such as list offsets and union types, must have the same endianness as the entire record batch.
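To show what the endianness setting means for the bytes of a values buffer, here is a short Python sketch using the standard `struct` module. The `FORMAT` table and the `pack_ints` helper are illustrative names, not part of Arrow; they map an Int type's `bitWidth` and `is_signed` to a `struct` format character.

```python
import struct

# struct format characters keyed by (bitWidth, is_signed)
FORMAT = {
    (8, True): "b", (16, True): "h", (32, True): "i", (64, True): "q",
    (8, False): "B", (16, False): "H", (32, False): "I", (64, False): "Q",
}


def pack_ints(values, bit_width, is_signed, little_endian=True):
    """Serialize integers with one byte order for the whole buffer."""
    prefix = "<" if little_endian else ">"
    fmt = f"{prefix}{len(values)}{FORMAT[(bit_width, is_signed)]}"
    return struct.pack(fmt, *values)


little = pack_ints([1, 2], 32, True)
big = pack_ints([1, 2], 32, True, little_endian=False)
# little == b"\x01\x00\x00\x00\x02\x00\x00\x00"
# big    == b"\x00\x00\x00\x01\x00\x00\x00\x02"
```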
Floating-point numbers
We provide three types of floating-point numbers as fixed bit-width primitive arrays:
● Half precision, 16-bit width
● Single precision, 32-bit width
● Double precision, 64-bit width
The IDL looks like:
enum Precision:int {HALF, SINGLE, DOUBLE}
table FloatingPoint {
  precision: Precision;
}
Boolean
The Boolean logical type is represented as a 1-bit wide primitive physical type. The bits are numbered using least-significant bit (LSB) ordering.
Like other fixed bit-width primitive types, Boolean data appears as 2 buffers in the data header (one bitmap for the validity vector and one for the values).
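LSB ordering means that value i lives in byte i // 8 at bit position i % 8, counting from the least significant bit. A minimal Python sketch, with the hypothetical helper names `pack_bits` and `get_bit`:

```python
def pack_bits(bits):
    """Pack booleans into bytes using least-significant-bit numbering."""
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def get_bit(bitmap, i):
    """Read slot i from an LSB-numbered bitmap."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


values = [True, False, True, True, False, False, False, False, True]
bitmap = pack_bits(values)
# Slots 0, 2 and 3 set bits 1 + 4 + 8 = 0x0d in byte 0; slot 8 sets bit 0 of byte 1.
```

The same packing applies to both the validity bitmap and the Boolean values buffer.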
List
The List logical type is the logical (and identically named) counterpart to the List physical type.
In data header form, the list field node contains 2 buffers:
● Validity bitmap
● List offsets
The buffers associated with a list's child field are handled recursively according to the child's logical type, so lists with different child types produce different child buffers.
Utf8 and Binary
We specify two logical types for variable-length bytes:
● Utf8 data is Unicode values with UTF-8 encoding
● Binary is any other variable-length bytes
Both types have the same memory layout as the nested type List<UInt8>, with the constraint that the inner bytes cannot contain null values. From a logical type perspective, they are primitive, not nested types.
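The following Python sketch shows one way a column of strings could be laid out as a validity bitmap, list offsets, and byte data. It rests on assumptions that come from the physical layout rather than from this document: offsets are 32-bit little-endian integers, there is one more offset than there are slots, and the validity bitmap uses LSB numbering. The function name `encode_utf8_column` is hypothetical.

```python
import struct


def encode_utf8_column(strings):
    """Encode optional strings as (validity bitmap, offsets, byte data)."""
    validity = bytearray((len(strings) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, s in enumerate(strings):
        if s is not None:
            validity[i // 8] |= 1 << (i % 8)  # mark slot i as valid
            data += s.encode("utf-8")
        offsets.append(len(data))  # a null slot repeats the previous offset
    offsets_bytes = struct.pack(f"<{len(offsets)}i", *offsets)
    return bytes(validity), offsets_bytes, bytes(data)


validity, offsets, data = encode_utf8_column(["joe", None, "mark"])
# validity == b"\x05", offsets decode to (0, 3, 3, 7), data == b"joemark"
```

Slot i spans `data[offsets[i]:offsets[i + 1]]`, so the null slot in the middle has zero length.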
In data header form, while List<UInt8> would appear as 2 field nodes (List and UInt8) and 4 buffers (2 for each node, as described above), these types have a simplified representation: a single field node (of Utf8 or Binary logical type, with no children) and 3 buffers:
● Validity bitmap
● List offsets
● Byte data
Decimal
TBD
Timestamp
All timestamps are stored as 64-bit integers, in one of four units: second, millisecond, microsecond, or nanosecond.
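As a sketch of how a unit selects the stored integer, the Python below converts a `datetime` into a 64-bit count in the chosen unit. Two things here are assumptions rather than statements from this document: values are counted from the UNIX epoch in UTC, and sub-unit precision is truncated. The names `UNITS_PER_SECOND` and `to_timestamp` are hypothetical.

```python
from datetime import datetime, timezone

UNITS_PER_SECOND = {
    "SECOND": 1,
    "MILLISECOND": 10**3,
    "MICROSECOND": 10**6,
    "NANOSECOND": 10**9,
}
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # assumed reference point


def to_timestamp(moment: datetime, unit: str) -> int:
    """Convert an aware datetime to an integer count of `unit` since EPOCH."""
    delta = moment - EPOCH
    micros = (delta.days * 86_400 + delta.seconds) * 10**6 + delta.microseconds
    value = micros * UNITS_PER_SECOND[unit] // 10**6  # truncates finer precision
    if not -2**63 <= value < 2**63:
        raise OverflowError("value does not fit in a 64-bit integer")
    return value


moment = datetime(2001, 9, 9, 1, 46, 40, 500_000, tzinfo=timezone.utc)
seconds = to_timestamp(moment, "SECOND")
millis = to_timestamp(moment, "MILLISECOND")
```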
Date
We support two different date types:
● Days since the UNIX epoch, as a 32-bit integer
● Milliseconds since the UNIX epoch, as a 64-bit integer
Time
Time supports the same units: second, millisecond, microsecond, and nanosecond. We represent time as the smallest integer that accommodates the indicated unit: 32 bits for second and millisecond, 64 bits for the others.
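The choice of width can be illustrated with a short Python sketch. It assumes the value counts units elapsed since midnight, which this document does not state explicitly; `TIME_UNITS` and `encode_time` are hypothetical names.

```python
from datetime import time

# unit -> (units per second, bit width of the stored integer)
TIME_UNITS = {
    "SECOND": (1, 32),
    "MILLISECOND": (10**3, 32),
    "MICROSECOND": (10**6, 64),
    "NANOSECOND": (10**9, 64),
}


def encode_time(t: time, unit: str):
    """Return (value, bit_width) for a time of day in the given unit."""
    per_second, bit_width = TIME_UNITS[unit]
    seconds = t.hour * 3600 + t.minute * 60 + t.second
    value = seconds * per_second + t.microsecond * per_second // 10**6
    return value, bit_width


last_instant = time(23, 59, 59, 999_999)
ms_value, ms_width = encode_time(last_instant, "MILLISECOND")
us_value, us_width = encode_time(last_instant, "MICROSECOND")
# The largest millisecond value in a day fits in 32 bits;
# the largest microsecond value does not, hence 64 bits.
```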
Dictionary encoding