Apache Arrow official documentation: memory layout

2025-01-14 Update From: SLTechnology News&Howtos

Definition / terminology

Because different projects use different words to describe similar concepts, here is a small glossary to help disambiguate.

Array: a sequence of values of known length, all having the same type.

Slot or array slot: a single logical value in an array of some particular data type.

Contiguous memory region: a sequential virtual address space of a given length. Any byte can be reached by a single pointer offset that is less than the length of the region.

Contiguous memory buffer: a contiguous memory region that stores a multi-value component of an Array. It is sometimes called just a "buffer".

Primitive type: a data type that occupies a fixed-size memory slot, specified in bit width or byte width.

Nested or parametric type: a data type whose full structure depends on one or more other child relative types. Two fully specified nested types are equal if and only if their child types are equal. For example, List<U> is distinct from List<V> if and only if U and V are different relative types.

Relative type or simply type (unqualified): either a specific primitive type or a fully specified nested type. When we say slot, we mean a relative type value, not necessarily any physical storage region.

Logical type: a data type that is implemented using some relative (physical) type. For example, decimal values can be stored as 16 bytes in a fixed-size byte array. Similarly, strings can be stored as List<1-byte>.

Parent and child arrays: names that express the relationship between physical value arrays in a nested type structure. For example, a parent array of type List<T> has an array of type T as its child (see the discussion of lists below).

Leaf node or leaf: a primitive value array, which may or may not be a child array of some array with a nested type.

Requirements, goals, and non-goals

Base requirements:
A physical memory layout that enables zero-deserialization data interchange among a variety of systems that handle flat and nested columnar data, including Spark, Drill, Impala, Kudu, Ibis, ODBC protocols, and proprietary systems that use the open source components.
All array slots are accessible in constant time, with complexity growing linearly in the nesting level.
Capable of representing fully materialized and decoded / decompressed Parquet data.
All contiguous memory buffers are aligned at 64-byte boundaries and padded to a multiple of 64 bytes.
Any relative type can have null slots.
Arrays are immutable once created. Implementations can provide APIs to mutate an array, but applying mutations requires building a new array data structure.
Arrays are relocatable (for example, for RPC / transient storage) without pointer adjustment. Put another way, contiguous memory regions can be migrated to a different address space (for example, through a memcpy-type operation) without changing their contents.

Goals (for this document):
To describe the relative types (physical value types and an initial set of nested types) in enough detail for an unambiguous implementation.
The memory layout and random access pattern of each relative type.
The representation of null values.

Non-goals (for this document):
To enumerate or specify logical types that can be implemented as primitive (fixed-width) value types. For example: signed and unsigned integers, floating point numbers, Boolean values, exact decimals, date and time types, CHAR(K), VARCHAR(K), etc.
To specify standardized metadata or a data layout for RPC or transient file storage.
To define a selection or masking vector construct.
Implementation-specific details.
Details of a user or developer C/C++/Java API.
Any "table" structure composed of named arrays, each with its own type, or any other structure that composes arrays.
Any memory management or reference counting subsystem.
To enumerate or specify the types of encoding or compression supported.

Byte order (Endianness)

By default, the Arrow format is little-endian (the low-order byte is stored at the lowest address). The schema metadata has an endianness field that indicates the byte order of the RecordBatches. Usually this is the byte order of the system that generated the RecordBatch. The main use case is exchanging RecordBatches between systems with the same byte order. Initially, an error is returned when trying to read a schema whose byte order does not match that of the underlying system. The reference implementation focuses on little-endian and provides tests for it. Eventually, automatic conversion through byte swapping may be provided.
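The mismatch rule above can be sketched in a few lines of Python. This is only an illustration, not the reference implementation: `check_byte_order` and its `schema_endianness` argument are hypothetical names standing in for whatever a reader does with the byte-order field of the schema metadata.

```python
import sys


def check_byte_order(schema_endianness: str) -> None:
    """Raise ValueError if the declared byte order differs from the host's.

    `schema_endianness` is a hypothetical stand-in for the byte-order field
    of the schema metadata; it is expected to be "little" or "big".
    """
    if schema_endianness not in ("little", "big"):
        raise ValueError(f"unknown byte order: {schema_endianness!r}")
    if schema_endianness != sys.byteorder:
        # No automatic byte swapping in this sketch: reject the data instead.
        raise ValueError(
            f"schema is {schema_endianness}-endian, host is {sys.byteorder}-endian"
        )


# Little-endian layout: the low-order byte of the value comes first in memory.
encoded = (1).to_bytes(4, byteorder="little", signed=True)
print(encoded)  # b'\x01\x00\x00\x00'

# Data produced on this machine always passes the check.
check_byte_order(sys.byteorder)
```

Rejecting mismatched data keeps the reader simple; byte swapping on read would be a separate, later step.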

Alignment and padding

As mentioned above, all buffers are intended to be aligned in memory at 64-byte boundaries and padded to a length that is a multiple of 64 bytes. The alignment requirement follows best practices for optimized memory access:

Elements in numeric arrays are guaranteed to be read through aligned access.
On some architectures, alignment can help limit partially used cache lines.
64-byte alignment is recommended by the Intel performance guide for data structures larger than 64 bytes (which will be a common case for Arrow arrays).

Requiring padding to a multiple of 64 bytes allows SIMD instructions to be used consistently in loops without additional conditional checks. This should allow for simpler and more efficient code.

The specific padding length was chosen because it matches the largest known SIMD instruction registers available as of April 2016 (Intel AVX-512). Guaranteed padding can also allow some compilers to generate more optimized code directly (for example, Intel's -qopt-assume-safe-padding can be used safely).
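As a rough illustration of the padding rule, the following Python sketch rounds a buffer size up to the next multiple of 64 bytes. The names `ALIGNMENT`, `padded_length` and `is_aligned` are hypothetical helpers introduced here. The sketch covers only length padding; Python's `bytearray` gives no control over the address at which a buffer is allocated, so address alignment is shown only as a check on an integer address.

```python
ALIGNMENT = 64  # bytes; the boundary and padding multiple described above


def padded_length(nbytes: int, alignment: int = ALIGNMENT) -> int:
    """Round nbytes up to the nearest multiple of `alignment`."""
    if nbytes < 0:
        raise ValueError("buffer size must be non-negative")
    return (nbytes + alignment - 1) // alignment * alignment


def is_aligned(address: int, alignment: int = ALIGNMENT) -> bool:
    """Return True if an integer memory address lies on an alignment boundary."""
    return address % alignment == 0


# Ten int32 values need 40 bytes of data, so the buffer is padded to 64 bytes.
print(padded_length(10 * 4))  # 64
print(padded_length(64))      # 64
print(padded_length(65))      # 128
print(is_aligned(128), is_aligned(130))  # True False
```

With every buffer padded this way, a loop that processes 64 bytes per iteration never needs a special case for a partial final block.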

Unless otherwise stated, padding bytes do not need to have a specific value.

Array length

Any array has a known and fixed length, stored as a 32-bit signed integer, so it can hold at most 2^31 - 1 elements. We chose a signed int32 for two reasons:

To enhance compatibility with Java and client languages, which may have varying quality of support for unsigned integers.
To encourage developers to compose smaller arrays (each of which contains contiguous memory in its leaf nodes) into larger array structures that may exceed 2^31 - 1 elements, rather than allocating very large contiguous blocks of memory.

Null count

The number of null slots is a property of the physical array and is considered part of the data structure. The null count is stored as a 32-bit signed integer because it may be as large as the length of the array.
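The length and null count constraints described above can be sketched as follows. `ArrayMeta`, `describe` and `total_length` are hypothetical names used only for illustration, and Python's `None` stands in for a null slot; the format itself does not define such a structure.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

INT32_MAX = 2**31 - 1  # largest value of a 32-bit signed integer


@dataclass(frozen=True)
class ArrayMeta:
    """Length and null count of one array, both limited to the int32 range."""

    length: int
    null_count: int

    def __post_init__(self) -> None:
        if not 0 <= self.length <= INT32_MAX:
            raise ValueError("length does not fit in a signed int32")
        if not 0 <= self.null_count <= self.length:
            raise ValueError("null count must be between 0 and length")


def describe(values: Sequence[Optional[int]]) -> ArrayMeta:
    """Compute the metadata for a sequence in which None marks a null slot."""
    nulls = sum(1 for v in values if v is None)
    return ArrayMeta(length=len(values), null_count=nulls)


def total_length(chunks: List[ArrayMeta]) -> int:
    """Logical length of a structure composed of several smaller arrays."""
    return sum(chunk.length for chunk in chunks)


print(describe([1, None, 3, None]))  # ArrayMeta(length=4, null_count=2)

# Two maximum-size arrays together exceed what a single array could hold.
chunks = [ArrayMeta(INT32_MAX, 0), ArrayMeta(INT32_MAX, 0)]
print(total_length(chunks) > INT32_MAX)  # True
```

Composing several arrays, each within the int32 limit, is the route the text suggests for structures larger than 2^31 - 1 elements.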

Null bitmap

Any relative type can have null value slots, whether primitive or nested.

An array with null values must have a contiguous memory buffer, called the null (or validity) bitmap, whose length is a multiple of 64 bytes (as described above) and large enough to hold at least 1 bit for each array slot.

Whether any array slot is valid (not null) is encoded in the corresponding bit of the bitmap. A 1 (set bit) at index j indicates that the value is not null, while a 0 (bit not set) indicates that the value is null. The bitmap is initialized so that no bit is set at allocation time (this includes padding).

is_valid[j] -> bitmap[j / 8] & (1 << (j % 8))
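A minimal Python sketch of such a bitmap is shown below, assuming the least-significant-bit numbering used by the Arrow format (slot j maps to bit j % 8 of byte j / 8). The helper names `bitmap_nbytes`, `build_validity_bitmap` and `is_valid` are hypothetical, and `None` stands in for a null slot.

```python
from typing import Optional, Sequence


def bitmap_nbytes(length: int) -> int:
    """Bytes needed for `length` validity bits, padded to a multiple of 64."""
    nbytes = (length + 7) // 8      # one bit per slot, rounded up to whole bytes
    return (nbytes + 63) // 64 * 64


def build_validity_bitmap(values: Sequence[Optional[int]]) -> bytearray:
    """Build a bitmap in which bit j is 1 if slot j is not null."""
    bitmap = bytearray(bitmap_nbytes(len(values)))  # all bits start unset
    for j, value in enumerate(values):
        if value is not None:
            bitmap[j // 8] |= 1 << (j % 8)          # set bit j
    return bitmap


def is_valid(bitmap: bytearray, j: int) -> bool:
    """Test bit j: True means slot j holds a value, False means it is null."""
    return bool(bitmap[j // 8] & (1 << (j % 8)))


values = [0, 1, None, 2, None, 3]
bitmap = build_validity_bitmap(values)

print(len(bitmap))                   # 64
print(format(bitmap[0], "08b"))      # 00101011
print([is_valid(bitmap, j) for j in range(len(values))])
# [True, True, False, True, False, True]
```

Because the buffer starts zeroed, the unused bits and the padding bytes read as null, which matches the initialization rule above.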
