
How to use PolarDB-X Vectorization engine

2025-07-08 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains in detail how the PolarDB-X vectorization engine works and how it is used, and is shared here as a reference. After reading it, you should have a working understanding of the topic.

Introduction

PolarDB-X is a cloud-native distributed database developed by Alibaba. It adopts a compute-storage separation architecture in which the compute nodes carry out a large number of expression evaluation tasks. Expression evaluation touches every part of SQL execution and has a major impact on performance. For this reason, PolarDB-X introduces a vectorized execution engine, which improves expression evaluation performance by tens of times.

Defects of traditional database executor

Most database execution engines process one row of data at a time (tuple-at-a-time) and have to parse and check data types at run time in order to handle arbitrarily complex expression structures. We call this kind of expression a "scalar expression". Although this approach is easy to implement and cleanly structured, it has significant defects as the amount of data to be processed grows:

To handle complex expression structures, evaluating a single expression often requires a large number of instructions. In row-at-a-time execution, processing each row requires interpreting the operator tree (instruction interpretation), which adds a large interpretation overhead. According to the paper "MonetDB/X100: Hyper-Pipelining Query Execution", when MySQL executes Query 1 of the TPC-H benchmark, instruction interpretation accounts for 90% of the execution time.

In addition, in the original Volcano design, the internal logic of an operator does not avoid branches, so the CPU has to rely on branch prediction. When a prediction is wrong, the CPU must discard the instructions already in the pipeline and load the instructions of the other branch (for example, the ELSE branch), a process known as a pipeline flush or pipeline break. Frequent branch mispredictions seriously degrade database performance.
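The interpretation overhead described above can be made concrete with a minimal sketch of a tuple-at-a-time evaluator. The `Node` type and the functions `eval_row` and `eval_all_rows` are hypothetical names chosen for illustration; they are not taken from PolarDB-X or MySQL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node kinds for a tiny scalar expression tree. */
typedef enum { NODE_COLUMN, NODE_CONST, NODE_ADD, NODE_MUL } NodeKind;

typedef struct Node {
    NodeKind kind;
    int column;                /* used by NODE_COLUMN */
    double value;              /* used by NODE_CONST */
    const struct Node *left;   /* used by NODE_ADD and NODE_MUL */
    const struct Node *right;
} Node;

/* Evaluate the tree for ONE row. Every call walks the tree again and
 * dispatches on the node kind again, which is the per-row overhead. */
double eval_row(const Node *e, const double *row) {
    assert(e != NULL);
    switch (e->kind) {
    case NODE_COLUMN: return row[e->column];
    case NODE_CONST:  return e->value;
    case NODE_ADD:    return eval_row(e->left, row) + eval_row(e->right, row);
    case NODE_MUL:    return eval_row(e->left, row) * eval_row(e->right, row);
    }
    return 0.0;
}

/* Tuple-at-a-time driver: rows are stored row-major, `width` columns each. */
void eval_all_rows(const Node *e, const double *rows, int width, int n,
                   double *out) {
    for (int i = 0; i < n; i++)
        out[i] = eval_row(e, rows + (size_t)i * width);
}
```

For an expression such as a + b * 2, each row costs several function calls and switch dispatches while performing only two arithmetic operations.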

Vectorization execution system

The database vectorization execution system was first proposed by the paper "MonetDB/X100: Hyper-Pipelining Query Execution". It has the following main points:

It uses a vector-at-a-time execution model, that is, the vector is the unit in which data is organized and processed.

Vectorization primitives are the basic building blocks of vectorized operators, and the whole vectorized execution system is constructed from them. Branches are avoided inside primitives, so they do not suffer from branch mispredictions.

Code generation is used to deal with the code explosion caused by static typing, where every combination of operation and type needs its own primitive.
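The third point can be sketched with the C preprocessor acting as a very small code generator: one template is expanded into one type-specific primitive per type. This is only an illustration of the idea; the macro name `DEFINE_MAP_PLUS` and the `long` variant are assumptions, and the article does not describe how PolarDB-X's own generator is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* One template, expanded once per type. The macro stands in for a real
 * code generator; its name and shape are assumptions for illustration. */
#define DEFINE_MAP_PLUS(NAME, T)                                   \
    void NAME(int n, T *restrict res, const T *restrict v1,        \
              const T *restrict v2, const int *restrict sel) {     \
        assert(n >= 0);                                            \
        if (sel) {                                                 \
            /* only the n selected positions are computed */       \
            for (int j = 0; j < n; j++) {                          \
                int i = sel[j];                                    \
                res[i] = v1[i] + v2[i];                            \
            }                                                      \
        } else {                                                   \
            for (int i = 0; i < n; i++)                            \
                res[i] = v1[i] + v2[i];                            \
        }                                                          \
    }

/* Two generated, type-specific primitives. */
DEFINE_MAP_PLUS(map_plus_double_col_double_col, double)
DEFINE_MAP_PLUS(map_plus_long_col_long_col, long)
```

Each expansion is an ordinary statically typed function, so no type checks remain inside the loop.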

The vectorization engine brings a significant performance improvement to the expression evaluation of PolarDB-X. In the following figure, the horizontal axis is vector size and the vertical axis is throughput. The performance test results of different scalar expressions and vector expressions are as follows:

The performance test results of case expressions are as follows:

Overall process

In PolarDB-X, the execution of vectorized expressions is divided into the following phases:

After the user's SQL has been parsed, the validator checks, infers and corrects the type information of each expression. This stage provides the correct static type information that vectorized execution depends on.

After the optimizer has produced the execution plan, the expression tree is bound: the corresponding vectorization primitives are instantiated and vector subscripts are assigned, which are later used to allocate memory at run time.

In the execution phase, the vectorization primitives are triggered top-down following the Volcano model, with vectors as the run-time data structure.

Runtime structure

Data structure

In the PolarDB-X vectorization execution system, the following data structures are used to store data:

While a vectorized expression is being evaluated, all data is stored in a batch. A batch consists of a number of vectors and one selection array. Each vector consists of a list of values of one specific type (values) and a null array that marks the positions of null values, both stored contiguously in memory. Each bit in the null array uses 0 or 1 to indicate whether the corresponding position in the values list is null.
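A minimal sketch of this layout follows. All type and function names are hypothetical, and the convention that a set bit (1) means null is an assumption, since the text only says that 0 and 1 distinguish the two cases.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef enum { TYPE_INT, TYPE_DOUBLE } VectorType;

typedef struct {
    VectorType type;
    void *values;     /* contiguous array of values of one type */
    uint8_t *nulls;   /* bitmap; ASSUMPTION: bit i == 1 means values[i] is null */
} Vector;

typedef struct {
    int capacity;        /* number of rows each vector can hold */
    int vector_count;
    Vector *vectors;
    int *selection;      /* selected positions, or NULL when all rows are selected */
    int selection_size;
} Batch;

/* Allocate zeroed values and a zeroed null bitmap for one vector. */
bool vector_init(Vector *v, VectorType type, int capacity) {
    assert(v != NULL && capacity > 0);
    size_t width = (type == TYPE_INT) ? sizeof(int32_t) : sizeof(double);
    v->type = type;
    v->values = calloc((size_t)capacity, width);
    v->nulls = calloc(((size_t)capacity + 7) / 8, 1);
    if (!v->values || !v->nulls) {
        free(v->values);
        free(v->nulls);
        v->values = NULL;
        v->nulls = NULL;
        return false;
    }
    return true;
}

void vector_free(Vector *v) {
    free(v->values);
    free(v->nulls);
    v->values = NULL;
    v->nulls = NULL;
}

void vector_set_null(Vector *v, int i) {
    v->nulls[i / 8] |= (uint8_t)(1u << (i % 8));
}

bool vector_is_null(const Vector *v, int i) {
    return ((v->nulls[i / 8] >> (i % 8)) & 1u) != 0;
}
```

Keeping values and null flags in separate contiguous arrays lets a primitive loop over the values without touching the bitmap unless it needs null handling.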

We can write vector(type, index) to identify a vector in a batch. Each vector has its own subscript (index), which gives its position among the vectors in the batch, and type information (type), which specifies the type of its values. Before a vectorized expression is evaluated, the whole expression tree is traversed, a subscript is assigned to each operand and each return value, and memory is finally allocated for the vectors according to these subscripts.
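The subscript assignment can be sketched as a post-order walk over the expression tree. The `Expr` type, the `bind_indexes` function and the rule that input columns keep their own positions while every call gets a fresh subscript are assumptions made for illustration, not the actual PolarDB-X binding logic.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { EXPR_INPUT_REF, EXPR_CALL } ExprKind;

typedef struct Expr {
    ExprKind kind;
    int input_column;       /* EXPR_INPUT_REF: position of the input column */
    struct Expr *args[2];   /* EXPR_CALL: operands, NULL when unused */
    int output_index;       /* vector subscript assigned by binding */
} Expr;

/* Post-order walk. An input reference reuses the subscript of its input
 * column; every call receives a fresh subscript for its result vector.
 * *next_index must start at the number of input columns. */
int bind_indexes(Expr *e, int *next_index) {
    assert(e != NULL && next_index != NULL);
    if (e->kind == EXPR_INPUT_REF) {
        e->output_index = e->input_column;
        return e->output_index;
    }
    for (int k = 0; k < 2; k++) {
        if (e->args[k])
            bind_indexes(e->args[k], next_index);
    }
    e->output_index = (*next_index)++;
    return e->output_index;
}
```

After binding the root, the final value of the counter is the number of vectors the batch has to allocate.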

Delayed materialization

The design of the selection array embodies the idea of delayed materialization (see the paper "Materialization Strategies in a Column-Oriented DBMS"). Delayed materialization means postponing materialization as long as possible in order to reduce memory access overhead. During expression evaluation, it is common to filter out part of the data with a Filter expression and then evaluate further expressions on the rows that remain; each filter affects all the vectors in the batch. Taking the batch in the figure above as an example, suppose we apply the filter condition vector(int, 0) != 1 to the 0th vector and 90% of the data in vector(int, 0) satisfies it (selectivity = 0.9). Without a selection array, we would have to re-materialize 90% of every vector in the batch into another block of memory. If we instead record only the positions that satisfy the filter condition and store them in the selection array, this materialization is avoided. In return, every subsequent vectorized evaluation has to consult the selection array.
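A filter that fills a selection array instead of copying data can be sketched as follows. The function name `select_ne_int_col_int_const` and its signature are hypothetical; the loop writes each candidate position unconditionally and advances the counter only on a match, which is one common way to keep the loop body free of branches.

```c
#include <assert.h>
#include <stddef.h>

/* Keep the positions where col[i] != constant.
 * sel_in == NULL means positions 0..n-1 are all candidates; otherwise the
 * n candidates are listed in sel_in. sel_out must have room for n entries.
 * Returns the number of positions written to sel_out. */
int select_ne_int_col_int_const(int n, const int *col, int constant,
                                const int *sel_in, int *sel_out) {
    assert(n >= 0 && col != NULL && sel_out != NULL);
    int count = 0;
    if (sel_in) {
        for (int j = 0; j < n; j++) {
            int i = sel_in[j];
            sel_out[count] = i;               /* always write the position */
            count += (col[i] != constant);    /* keep it only on a match */
        }
    } else {
        for (int i = 0; i < n; i++) {
            sel_out[count] = i;
            count += (col[i] != constant);
        }
    }
    return count;
}
```

Only the list of positions is produced; none of the vectors in the batch is copied, and a second filter can take the first filter's output as its `sel_in`.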

Vectorization primitive

The vectorization primitive is the unit of execution in the vectorized execution system, and it restricts the degrees of freedom at execution time as far as possible. A primitive does not care about context information and does not need to perform type resolution or function dispatch at run time; it only deals with the vectors passed to it. Primitives are type-specific, that is, a given primitive can only handle specific types.

The body of a vectorization primitive is a tight loop. Inside the loop there are only loads, stores and arithmetic operations, with no branches and no function calls. A simple vectorization primitive looks like this:

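    void map_plus_double_col_double_col(int n,
                                        double *__restrict__ res,
                                        double *__restrict__ vector1,
                                        double *__restrict__ vector2,
                                        int *__restrict__ selection)
    {
        if (selection) {
            /* only the n positions listed in the selection array */
            for (int j = 0; j < n; j++) {
                int i = selection[j];
                res[i] = vector1[i] + vector2[i];
            }
        } else {
            /* no selection array: all n positions */
            for (int i = 0; i < n; i++)
                res[i] = vector1[i] + vector2[i];
        }
    }

The expression (case when a > 1 then b * 2 else a * b) has the following tree structure: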

Because scalar expressions are organized according to the Volcano model and expose a uniform next() interface, a case expression has to evaluate all of its sub-expressions (the when condition a > 1 and both result branches), gather all the results, and only then apply the case semantics. This way of executing cannot stop evaluation early based on the result of the when expression; it evaluates every sub-expression indiscriminately over all the data.

With the vectorized executor, we can use short-circuit evaluation to fix this problem: each sub-expression is given an appropriate selection array, so that it operates only on the positions in the column that it actually needs to compute.
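Short-circuit evaluation with selection arrays can be sketched for an expression of the form case when a > 1 then b * 2 else a * b. The function `case_gt_then_else` is a hypothetical, hand-fused illustration that ignores null handling; a real engine would chain separate primitives, each receiving its own selection array.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluate "case when a > 1 then b * 2 else a * b" over n rows.
 * sel_then and sel_else are scratch arrays with room for n entries. */
void case_gt_then_else(int n, const double *a, const double *b, double *res,
                       int *sel_then, int *sel_else) {
    assert(n >= 0);
    int n_then = 0;
    int n_else = 0;

    /* WHEN a > 1: evaluated once; split positions into two selections. */
    for (int i = 0; i < n; i++) {
        int hit = (a[i] > 1.0);
        sel_then[n_then] = i;
        sel_else[n_else] = i;
        n_then += hit;
        n_else += !hit;
    }

    /* THEN b * 2: computed only where the condition held. */
    for (int j = 0; j < n_then; j++) {
        int i = sel_then[j];
        res[i] = b[i] * 2.0;
    }

    /* ELSE a * b: computed only on the remaining positions. */
    for (int j = 0; j < n_else; j++) {
        int i = sel_else[j];
        res[i] = a[i] * b[i];
    }
}
```

Each result branch touches only its own rows, so no row pays for both b * 2 and a * b.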

That concludes this overview of how to use the PolarDB-X vectorization engine. I hope the content above is of some help to you.

