Apache Arrow: an introduction to a standard in-memory data format for heterogeneous big data systems
Existing big data analysis systems are each built around their own in-memory data structures, which leads to a lot of repeated work. From the computing engine's point of view, algorithms must be written against project-specific data structures, creating unnecessary coupling between algorithms and APIs. From the data-access point of view, data must be deserialized on load, and every data source needs its own loader. From the ecosystem's point of view, cross-project and cross-language collaboration is invisibly blocked. Can the cost of serializing and deserializing data between different systems be reduced or eliminated? Can algorithms and IO tools be reused across projects? Can a broader collaboration among developers of data analysis systems be fostered? Arrow was born out of this mission.
Unlike most projects, the founding team of Arrow consisted of 5 Apache Members, 6 PMC Chairs, and a number of PMC members and committers from other projects. They went directly to the ASF board and, after obtaining approval, started straight away as a top-level Apache project. For a detailed history of the project, you can read the blog post written by the project chair, Jacques Nadeau. In addition, this Google sheet records the naming process; the name Arrow was chosen because it is the "math symbol for vector. And arrows are fast. Also alphabetically will show up on top." The choice was clearly considered from every angle.
The vision of the Arrow project is to provide a development platform for in-memory analytics, so that data can move between and be processed by heterogeneous big data systems faster:
The project is mainly composed of three parts:
An in-memory data format designed for analytical query engines and data frames
A binary protocol for IPC/RPC
A development platform for building data processing applications
The cornerstone of the whole project is its in-memory columnar data format, whose main features are:
Standardized and language-independent
Supports both flat and hierarchical data structures
Hardware-aware
Column-oriented and designed for in-memory use
For a detailed and accurate format definition, please read the official documentation. This section refers to the official documentation and this blog post by Daniel Abadi.
In practice, engineers usually model the data in a system as a set of two-dimensional tables, where each row represents an entity and each column represents one attribute shared by all entities. Hardware memory, however, is one-dimensional: a program can only read data from memory or disk linearly, in one direction. There are therefore two typical ways to store a two-dimensional table: row storage and column storage. Generally speaking, the former suits OLTP workloads and the latter suits OLAP workloads. Arrow is built for data analysis, so it adopts the latter.
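To make the contrast concrete, here is a minimal sketch (hypothetical Go types, not part of Arrow) of the two layouts for a simple user table with age and name attributes:

// Row-wise layout: all attributes of one entity are adjacent in memory.
type UserRow struct {
    Age  int32
    Name string
}
type RowTable []UserRow

// Column-wise layout: all values of one attribute are adjacent in memory.
type ColumnTable struct {
    Ages  []int32
    Names []string
}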
Any data table can be decomposed into data columns of different types. Take a user table as an example: it might contain attributes such as age (integer), name (varchar) and date of birth (date). Arrow divides all possible column types in a data table into two categories, fixed-width and variable-width, and builds more complex nested data types on top of these two.
Fixed-width data types
The fixed-length data column format is as follows:
type FixedColumn struct {
    data       []byte
    length     int
    nullCount  int
    nullBitmap []byte // a bit value of 0 means null, 1 means not null
}
In addition to the data array (data), it contains:
Array length (length)
Number of null elements (nullCount)
Null bitmap (nullBitmap)
Take the Int32 array: [1, null, 2, 4, 8] as an example, its structure is as follows:
length: 5
nullCount: 1
nullBitmap:

| Byte 0 (null bitmap) | Bytes 1-63  |
|----------------------|-------------|
| 00011101             | 0 (padding) |

data:

| Bytes 0-3 | Bytes 4-7   | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
|-----------|-------------|------------|-------------|-------------|-------------|
| 1         | unspecified | 2          | 4           | 8           | unspecified |
A noteworthy design decision here is that Arrow reserves a full fixed-width slot for every element (cell) of the array, whether or not that element is null. The alternative would be to allocate no space for null elements. The former allows O(1) random access via pointer arithmetic, while the latter has to consult the nullBitmap to compute an element's displacement; for purely sequential access, the latter needs less memory bandwidth and performs better. This is essentially a tradeoff between storage space and random-access performance, and Arrow chose to favor random-access performance.
As can be seen from the layout of nullBitmap, Arrow stores byte data in little-endian order.
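A minimal sketch of how this layout supports O(1) random access (a hypothetical Go helper, not part of the Arrow API, assuming the FixedColumn above holds 32-bit integers and that "encoding/binary" is imported):

func (c *FixedColumn) Int32At(i int) (int32, bool) {
    // A validity bit of 0 means element i is null.
    if c.nullBitmap[i/8]&(1<<uint(i%8)) == 0 {
        return 0, false
    }
    // Fixed 4-byte slots: element i occupies bytes [i*4, i*4+4), stored little-endian.
    return int32(binary.LittleEndian.Uint32(c.data[i*4 : i*4+4])), true
}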
Variable-width data types
The variable-length data column format is as follows:
type VarColumn struct {
    data       []byte
    offsets    []int64
    length     int
    nullCount  int
    nullBitmap []byte // a bit value of 0 means null, 1 means not null
}
As you can see, the only difference from the fixed-width column is the extra offset array (offsets). The first element of offsets is always 0 and the last element equals the length of data, so offsets contains length + 1 entries. For the i-th variable-length element:
pos := column.offsets[i]                        // starting position of element i in data
size := column.offsets[i+1] - column.offsets[i] // size of element i in bytes
An alternative design is to separate the elements inside data with a special delimiter, which performs better for some queries. For example, when scanning a string column for all values that contain a given pair of consecutive letters, the Arrow format has to consult offsets repeatedly while traversing data, whereas the delimiter-based layout can scan data in a single pass. In other scenarios, such as finding all strings equal to "hello world", offsets can be used to skip every value whose length is not 11, so the Arrow format performs better.
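A minimal sketch of that length-based filter (a hypothetical Go helper built on the VarColumn struct above, not part of the Arrow API; null handling omitted for brevity):

func (c *VarColumn) EqualsString(target string) []int {
    var matches []int
    want := int64(len(target))
    for i := 0; i < c.length; i++ {
        start, end := c.offsets[i], c.offsets[i+1]
        if end-start != want {
            continue // length mismatch: skip this value without reading data at all
        }
        if string(c.data[start:end]) == target {
            matches = append(matches, i)
        }
    }
    return matches
}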
Nested Data
In data processing, complex data types such as JSON, structs and unions are very popular with developers; we can group them together as nested data types. Arrow handles nested data types in an elegant way: instead of introducing any concept beyond fixed-width and variable-width data columns, it builds nested types directly from those two. Take the class (Class) information of a university as an example, with two records in the column:
// Record 1
Name: Introduction to Database Systems
Instructor: Daniel Abadi
Students: Alice, Bob, Charlie
Year: 2019

// Record 2
Name: Advanced Topics in Database Systems
Instructor: Daniel Abadi
Students: Andrew, Beatrice
Year: 2020
We can decompose this nested data structure into four columns: Name, Instructor, Students, and Year, where Name and Instructor are variable-length string columns, Year is a fixed-length integer column, and Students is a list-of-strings column (a two-dimensional array). Their storage layouts are as follows:
Name Column:
data: Introduction to Database SystemsAdvanced Topics in Database Systems
offsets: 0, 32, 67
length: 2
nullCount: 0
nullBitmap:

| Byte 0   | Bytes 1-63  |
|----------|-------------|
| 00000011 | 0 (padding) |

Instructor Column:
data: Daniel AbadiDaniel Abadi
offsets: 0, 12, 24
length: 2
nullCount: 0
nullBitmap:

| Byte 0   | Bytes 1-63  |
|----------|-------------|
| 00000011 | 0 (padding) |

Students Column:
data: AliceBobCharlieAndrewBeatrice
students offsets: 0, 5, 8, 15, 21, 29
students length: 5
students nullCount: 0
students nullBitmap:

| Byte 0   | Bytes 1-63  |
|----------|-------------|
| 00011111 | 0 (padding) |

nested student list offsets: 0, 3, 5
nested student list length: 2
nested student list nullCount: 0
nested student list nullBitmap:

| Byte 0   | Bytes 1-63  |
|----------|-------------|
| 00000011 | 0 (padding) |

Year Column:
data: 2019, 2020
length: 2
nullCount: 0
nullBitmap:

| Byte 0   | Bytes 1-63  |
|----------|-------------|
| 00000011 | 0 (padding) |
The Students column is itself a nested data structure, and the outer Class table in turn contains the Students column. As you can see, this construction supports arbitrarily deep nesting, which is an admirable piece of design.
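A minimal sketch of how the two offset levels compose (hypothetical Go types, not the Arrow API; VarColumn is the struct defined earlier):

type ListColumn struct {
    values      VarColumn // inner string column holding all student names back to back
    listOffsets []int64   // nested list offsets: which range of inner values belongs to each record
}

// StudentsOf returns the student names of record i (null handling omitted for brevity).
func (c *ListColumn) StudentsOf(i int) []string {
    first, last := c.listOffsets[i], c.listOffsets[i+1]
    names := make([]string, 0, last-first)
    for j := first; j < last; j++ {
        start, end := c.values.offsets[j], c.values.offsets[j+1]
        names = append(names, string(c.values.data[start:end]))
    }
    return names
}

With the nested list offsets 0, 3, 5 from the example above, StudentsOf(0) would return Alice, Bob, Charlie and StudentsOf(1) would return Andrew, Beatrice.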
Buffer alignment and padding
All Arrow implementations need to take memory address alignment and padding into account for column buffers. It is generally recommended that buffer addresses be aligned to 8 or 64 bytes and that buffer lengths be padded up to a multiple of 8 or 64 bytes. The main purpose is to let computation be vectorized with the SIMD instructions of modern CPUs.
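For illustration, a tiny helper (an assumption of this article, not an Arrow API) that rounds a buffer length up to the next 64-byte boundary:

// paddedLen rounds n up to the nearest multiple of 64 bytes.
func paddedLen(n int) int {
    return (n + 63) &^ 63
}

For example, paddedLen(100) returns 128, so a 100-byte buffer would carry 28 bytes of zero padding.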
Memory-oriented columnar format
Over the past few decades, the vast majority of data engines have used row-oriented storage formats, mainly because early application workloads were dominated by creating, reading, updating and deleting single entities. For such workloads, storing data in columnar format means that reading one entity requires jumping around in memory to collect its different attributes, which is essentially random access. Over time, however, data volumes have grown and workloads have become more complex; data-analysis workloads, which read a few attributes of a large group of entities at a time and then aggregate the results, have become common, and the columnar format has gradually gained ground.
In the Hadoop ecosystem, Apache Parquet and Apache ORC have become the two most popular file storage formats, and their core value is likewise built around a columnar data format. So why do we still need Arrow? It helps to look at data storage from two perspectives:
Storage format: row storage (row-wise/row-based), column storage (column-wise/column-based/columnar)
Primary storage medium: disk-oriented, memory-oriented
Although all three are in column format, Parquet and ORC are disk-oriented, while Arrow is memory-oriented. To understand the difference between disk-oriented design and memory-oriented design, let's take a look at an experiment done by Daniel Abadi.
Daniel Abadi's experiment
On an Amazon EC2 t2.medium instance, create a table with 60,000,000 rows, where each row contains 6 attributes and each attribute value is an int32, so each row requires 24 bytes and the whole table occupies about 1.5 GB. Save one copy of the table in row format and one in column format, then execute a simple query that looks for a specific value in the first column:
SELECT a FROM t WHERE t.a = 477638700
In either version, the CPU's job is to fetch each integer and compare it with the target. However, the row-store version has to scan every full row, i.e. all 1.5 GB of data, while the column-store version only has to scan the first column, i.e. 0.25 GB of data, so the latter should in theory be about six times faster. The actual measurements, however, tell a different story:
The column-store version performed almost the same as the row-store version! The reason is that all CPU optimizations (vectorization/SIMD) were turned off in this experiment, which moves the query's bottleneck onto the CPU. A quick analysis: as a rule of thumb, scanning data from memory into the CPU can reach a throughput of about 30 GB/s, while a modern CPU runs at about 3 GHz, i.e. roughly 3 billion instructions per second. Even if the processor can perform one 32-bit integer comparison per cycle, its throughput is at most 12 GB/s, far below the rate at which memory can deliver data. Therefore, whether 0.25 GB or 1.5 GB has to be moved from memory to the CPU makes no significant difference to the result.
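As a back-of-the-envelope check of the numbers above (assuming one 32-bit comparison per CPU cycle at 3 GHz):

\[
3 \times 10^{9}\ \tfrac{\text{comparisons}}{\text{s}} \times 4\ \tfrac{\text{bytes}}{\text{comparison}} = 12\ \text{GB/s} < 30\ \text{GB/s memory bandwidth}
\]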
If CPU optimizations are turned on, the situation is very different. For column-store data, as long as the integers are laid out contiguously in memory, the compiler can vectorize simple operations such as 32-bit integer comparisons; typically, the vectorized code can compare four 32-bit integers against a given value in a single instruction. Running the same query with this optimization enabled gives the following result:
As expected, there is now roughly a four-fold difference. It is worth noting, however, that the CPU is still the bottleneck here; if memory bandwidth were the bottleneck, we would see a six-fold performance difference between the column and row versions.
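A minimal sketch of the kind of loop involved (hypothetical Go code, not taken from the experiment). When the column is stored as a contiguous []int32, a tight loop like the one below is exactly what a compiler or explicit SIMD code can vectorize, comparing several values per instruction; with row storage, the values of column a are interleaved with the other five attributes, so no such contiguous scan is possible:

// countMatches scans a contiguous column of int32 values and counts those equal to target.
func countMatches(column []int32, target int32) int {
    count := 0
    for _, v := range column {
        if v == target {
            count++
        }
    }
    return count
}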
The experiments above show that for workloads that sequentially scan a small number of attributes, the columnar format beats the row format regardless of whether the data lives on disk or in memory, but the reason it wins is different in each case. When disk is the primary storage, the CPU processes data far faster than it can be moved from disk to the CPU, so the advantage of columnar storage is that better-suited compression algorithms reduce disk IO. When memory is the primary storage, the cost of moving data becomes negligible, and the advantage of columnar storage is that it makes better use of vectorized processing.
This experiment tells us that storage formats are designed against different bottlenecks. The most typical example is compression. In disk-oriented scenarios, a higher compression ratio is almost always worthwhile: trading CPU cycles for space exploits spare CPU capacity and relieves pressure on disk IO. In memory-oriented scenarios, compression only adds to the CPU's burden.
Apache Parquet/ORC vs. Apache Arrow
With this in mind, comparing Parquet/ORC with Arrow becomes easier. Because Parquet and ORC are designed for disk, they need to support compression algorithms with high compression ratios, such as snappy, gzip and zlib; Arrow, being designed for memory, needs almost no compression and prefers to store native binary data directly. Another difference between disk-oriented and memory-oriented designs is that although sequential access beats random access on both disk and memory, the gap is two to three orders of magnitude on disk but usually within one order of magnitude in memory. Therefore, to amortize the cost of one random access, thousands of values need to be read contiguously on disk, while only around ten need to be read contiguously in memory. This difference means that the batch size used in memory scenarios (such as Arrow's 64KB) is smaller than the batch size used in disk scenarios.