Analysis of Go Assembly Syntax and MatrixOne Usage Examples

2025-07-13 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article introduces Go assembly syntax and analyzes how MatrixOne uses it. The content is detailed yet easy to follow and should serve as a useful reference. Let's take a look.

What is the MatrixOne database?

MatrixOne is a new-generation hyper-converged heterogeneous database that aims to be a minimalist big data engine, handling TP, AP, stream computing and other workloads within a single architecture. MatrixOne is written in Go, was open-sourced in October 2021, and has now been released as version 0.3. In the performance reports MatrixOne has published, it is not inferior to ClickHouse, the industry-leading OLAP database. For a database implemented in Go to match the performance of one written in C++, one of the most important optimizations is to use Go's assembly support to call SIMD instructions for hardware acceleration.

Introduction to Go assembly

Go is a relatively new high-level language that provides exciting features such as coroutines (goroutines) and fast compilation. In a database engine, however, pure Go sometimes falls short. For example, vectorization is a commonly used acceleration technique in database computing engines, but pure Go cannot call SIMD instructions to get the most out of vectorized code. Similarly, in security-related code, Go cannot directly call the cryptography instructions provided by the CPU. In the C/C++/Rust world, this kind of problem is solved by calling intrinsic functions specific to the CPU architecture. The solution Go provides is Go assembly. This article introduces the syntax features of Go assembly and shows how to use it in several specific scenarios.

This article assumes that the reader already has a basic understanding of computer architecture and assembly language, so common terms (such as "register") are not explained. If you lack this background, you can find learning resources online, such as here.

Unless otherwise noted, the assembly languages referred to in this article are all aimed at x86 (amd64) architecture. With regard to x86 instruction sets, both Intel and AMD officially provide complete instruction set reference documentation. For a quick look-up, you can also use this list. Intel's intrinsics documentation can also be used as a reference.

Why use Go assembly?

Wikipedia summarizes the reasons for using assembly language into three categories:

Direct operation of hardware

Use special CPU instructions

Resolve performance issues

The reasons why Go programmers use assembly are no more than these three categories. If you are faced with problems in these three categories and there are no off-the-shelf libraries available, consider using Go assembly.

Why not use CGO?

Huge function call overhead

Memory management problem

Breaks goroutine semantics: if a CGO function runs in a goroutine, it occupies a separate thread and cannot be scheduled normally by the Go runtime.

Poor portability: cross-compilation requires a full toolchain for the target platform, and deployment on different platforms requires more dependent libraries to be installed.

If the above points are unacceptable in your scenario, try Go assembly.

Features of Go assembly syntax

According to Rob Pike's The Design of the Go Assembler, the assembly language used by Go does not strictly correspond to CPU instructions, but is a "pseudo assembly" called Plan 9 assembly.

The most important thing to know about Go's assembler is that it is not a direct representation of the underlying machine. Some of the details map precisely to the machine, but some do not. This is because the compiler suite needs no assembler pass in the usual pipeline. Instead, the compiler operates on a kind of semi-abstract instruction set, and instruction selection occurs partly after code generation. The assembler works on the semi-abstract form, so when you see an instruction like MOV what the toolchain actually generates for that operation might not be a move instruction at all, perhaps a clear or load. Or it might correspond exactly to the machine instruction with that name. In general, machine-specific operations tend to appear as themselves, while more general concepts like memory move and subroutine call and return are more abstract. The details vary with architecture, and we apologize for the imprecision; the situation is not well-defined.

We don't need to care about the exact correspondence between Plan 9 assembly and machine instructions; we only need to understand the syntax features of Plan 9 assembly. There are some documents available on the Internet, such as here and here.

An example is worth a thousand words. Let's take the simplest 64-bit integer addition as an example to look at the characteristics of Go assembly syntax from different aspects.

// add.go
func Add(x, y int64) int64

// add_amd64.s
#include "textflag.h"

TEXT ·Add(SB), NOSPLIT, $0-24
    MOVQ x+0(FP), AX
    MOVQ y+8(FP), CX
    ADDQ AX, CX
    MOVQ CX, ret+16(FP)
    RET

The four assembly instructions (before RET) do the following in turn:

The first operand x is put into register AX

The second operand y is put into register CX

CX plus AX, with the result put back into CX

CX is written to the stack address where the return value is located

Operand order

There are two commonly used syntaxes for x86 assembly: AT&T syntax and Intel syntax. In AT&T syntax the result (destination) operand comes last and the other operands come first. In Intel syntax the result operand comes first and the other operands come after it.

The assembly of Go is similar to the AT&T syntax in this respect, with the result at the end.

One instruction that is easy to get wrong is CMP. In effect, CMP behaves like a SUB instruction that only modifies the EFLAGS flag bits and leaves its operands unchanged. In Go assembly, however, CMP sets the flags according to the first operand minus the second operand, which is the opposite of SUB (where the first operand is subtracted from the second).
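The operand-order rule can be modelled in plain Go. The sketch below is only an illustration: subq and cmpqLess are hypothetical helper names that mimic what SUBQ a, b and CMPQ a, b followed by JL do in Go assembly. They are not part of any library.

```go
package main

import "fmt"

// subq models `SUBQ a, b`: the result operand comes last,
// so b becomes b - a.
func subq(a, b int64) int64 {
	return b - a
}

// cmpqLess models `CMPQ a, b` followed by `JL`: the flags reflect
// a - b, so the jump is taken when a < b (signed comparison).
func cmpqLess(a, b int64) bool {
	return a < b
}

func main() {
	fmt.Println(subq(3, 10))     // b - a, i.e. 10 - 3
	fmt.Println(cmpqLess(3, 10)) // is 3 < 10?
}
```

Read this way, CMPQ n, $512 followed by JL branches when n is less than 512, while SUBQ $512, n decrements n.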

Some instructions support different register widths. Take ADD with 64-bit operands as an example. In AT&T syntax, the instruction name gets a width suffix and becomes ADDQ, and the registers get a width prefix and become RAX and RCX. In Intel syntax, the instruction name stays the same and only the registers are prefixed.

As you can see from the example above, Go assembly differs from both: the instruction name needs a width suffix, while the register names carry no width prefix.

Function calling convention

The way a programming language passes parameters in a function call is called the function calling convention. By default, mainstream C/C++ compilers on the x86-64 architecture use a register-based approach: the caller passes parameters to the called function in specific registers. As for Go's calling convention, put simply: as of Go 1.18, code generated by the Go compiler uses a register-based internal convention on the amd64, arm64 and ppc64 architectures, while other architectures and hand-written assembly functions use a stack-based approach: the caller places the parameters on the stack in order, the callee accesses them through offsets, and the return value is written to the stack after execution.

In the code above, FP is a virtual register that points to the address of the first parameter on the stack. Parameters and return values are stored sequentially, so the addresses of x, y and the return value on the stack are FP plus offsets 0, 8 and 16, respectively.
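The offsets 0, 8 and 16, as well as the 24 in $0-24 (the combined size of arguments and results, with 0 bytes of local frame), can be checked from Go. The following is a minimal sketch: addFrame is a hypothetical struct introduced only to mirror the argument area of Add. The Go toolchain does not define such a type.

```go
package main

import (
	"fmt"
	"unsafe"
)

// addFrame is a hypothetical struct that mirrors how the arguments and
// the result of Add(x, y int64) int64 are laid out on the stack under
// the stack-based convention used by assembly functions.
type addFrame struct {
	x   int64 // x+0(FP)
	y   int64 // y+8(FP)
	ret int64 // ret+16(FP)
}

// frameLayout returns the offsets of x, y and ret, plus the total size
// of the argument area.
func frameLayout() (xOff, yOff, retOff, size uintptr) {
	var f addFrame
	return unsafe.Offsetof(f.x), unsafe.Offsetof(f.y),
		unsafe.Offsetof(f.ret), unsafe.Sizeof(f)
}

func main() {
	xOff, yOff, retOff, size := frameLayout()
	fmt.Println(xOff, yOff, retOff, size)
}
```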

Avo, a tool that is helpful for writing Go assembly code

Readers familiar with assembly language know that hand-written assembly involves tedious and error-prone steps such as selecting registers and calculating offsets. The avo library was created to solve this kind of problem. For more information on how to use avo, see the examples given in its repo.

text/template

This is a library included with the Go language. It is helpful when writing a lot of repetitive code, such as implementing the same basic operator for different types in vectorized code. For specific usage, see the official documentation, which does not take up space here.
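As a rough illustration of the idea, the sketch below uses text/template to generate the same element-wise add function for several element types. The template text, the elemType struct and the render helper are hypothetical names made up for this example. They are not taken from MatrixOne.

```go
package main

import (
	"os"
	"strings"
	"text/template"
)

// addTmpl is a hypothetical template for a family of element-wise
// add functions, one per element type.
const addTmpl = `{{range .}}
func add{{.Name}}(x, y, r []{{.Type}}) {
	for i := range x {
		r[i] = x[i] + y[i]
	}
}
{{end}}`

// elemType describes one instantiation of the template.
type elemType struct {
	Name string // suffix of the generated function name
	Type string // Go element type
}

// render expands the template once for every entry in types.
func render(types []elemType) (string, error) {
	t, err := template.New("add").Parse(addTmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, types); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	src, err := render([]elemType{
		{Name: "Int8", Type: "int8"},
		{Name: "Int64", Type: "int64"},
	})
	if err != nil {
		panic(err)
	}
	os.Stdout.WriteString(src)
}
```

The same approach works for assembly source: the template body would contain assembly text instead of Go.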

Using macros in Go assembly code

Go assembly code supports macros similar to the C language, and can also be used in scenarios where there is a lot of code repetition. There are many examples in the internal library, such as here.

Applications of Go assembly in the MatrixOne database

Accelerating basic vector operations

In the OLAP database computing engine, vectorization is an indispensable acceleration method. Through vectorization, the unnecessary overhead caused by a large number of simple function calls is eliminated. In order to achieve maximum vectorization performance, it is a very natural choice to use SIMD instructions.

Take vectorized addition of 8-bit integers as an example: add the elements of two arrays and put the results into a third array. Some C/C++ compilers can automatically optimize this operation into a version that uses SIMD instructions. The Go compiler, which is known for its compilation speed, does not perform such optimization. This is a deliberate choice made by the Go language to keep compilation fast. In this example, we show how to use Go assembly to implement addition of int8 vectors with the AVX2 instruction set (assuming the arrays have been padded to a multiple of 32 bytes).
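For reference, the operation being accelerated can be written in plain Go as follows. This is a minimal sketch: int8AddScalar and paddedLen are hypothetical names, and the padding helper simply reflects the 32-byte assumption stated above.

```go
package main

import "fmt"

// int8AddScalar is a plain-Go reference: r[i] = x[i] + y[i].
// int8 addition in Go wraps around on overflow.
// y and r must be at least as long as x.
func int8AddScalar(x, y, r []int8) {
	for i := range x {
		r[i] = x[i] + y[i]
	}
}

// paddedLen rounds n up to a multiple of 32, the width in bytes
// of one 256-bit register.
func paddedLen(n int) int {
	return (n + 31) &^ 31
}

func main() {
	n := 5
	x := make([]int8, paddedLen(n))
	y := make([]int8, paddedLen(n))
	r := make([]int8, paddedLen(n))
	for i := 0; i < n; i++ {
		x[i], y[i] = int8(i), int8(10*i)
	}
	int8AddScalar(x, y, r)
	fmt.Println(len(r), r[:n])
}
```

A reference like this is also handy as the oracle in a unit test for the assembly version.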

Since AVX2 provides sixteen 256-bit registers, we want to use all of them when unrolling the loop. Listing the registers repeatedly by hand is tedious and error-prone, so we use avo to simplify part of the work. The avo code for the vector addition is as follows:

package main

import (
	. "github.com/mmcloughlin/avo/build"
	. "github.com/mmcloughlin/avo/operand"
	. "github.com/mmcloughlin/avo/reg"
)

var unroll = 16   // number of YMM registers used per block iteration
var regWidth = 32 // width of one YMM register in bytes

func main() {
	TEXT("int8AddAvx2Asm", NOSPLIT, "func(x []int8, y []int8, r []int8)")
	x := Mem{Base: Load(Param("x").Base(), GP64())}
	y := Mem{Base: Load(Param("y").Base(), GP64())}
	r := Mem{Base: Load(Param("r").Base(), GP64())}
	n := Load(Param("x").Len(), GP64())

	blocksize := regWidth * unroll // bytes per block iteration
	blockitems := blocksize / 1    // elements per block (1 byte each)
	regitems := regWidth / 1       // elements per register

	// Block loop: process unroll registers per iteration.
	Label("int8AddBlockLoop")
	CMPQ(n, U32(blockitems))
	JL(LabelRef("int8AddTailLoop"))

	xs := make([]VecVirtual, unroll)
	for i := 0; i < unroll; i++ {
		xs[i] = YMM()
		VMOVDQU(x.Offset(regWidth*i), xs[i])
	}
	for i := 0; i < unroll; i++ {
		VPADDB(y.Offset(regWidth*i), xs[i], xs[i])
	}
	for i := 0; i < unroll; i++ {
		VMOVDQU(xs[i], r.Offset(regWidth*i))
	}

	ADDQ(U32(blocksize), x.Base)
	ADDQ(U32(blocksize), y.Base)
	ADDQ(U32(blocksize), r.Base)
	SUBQ(U32(blockitems), n)
	JMP(LabelRef("int8AddBlockLoop"))

	// Tail loop: process one register per iteration.
	Label("int8AddTailLoop")
	CMPQ(n, U32(regitems))
	JL(LabelRef("int8AddDone"))

	VMOVDQU(x, xs[0])
	VPADDB(y, xs[0], xs[0])
	VMOVDQU(xs[0], r)

	ADDQ(U32(regWidth), x.Base)
	ADDQ(U32(regWidth), y.Base)
	ADDQ(U32(regWidth), r.Base)
	SUBQ(U32(regitems), n)
	JMP(LabelRef("int8AddTailLoop"))

	Label("int8AddDone")
	RET()

	// Generate writes the assembly to the file given by the -out flag.
	Generate()
}

Run command

go run int8add.go -out int8add.s

The assembly code generated after that is as follows:

// Code generated by command: go run int8add.go -out int8add.s. DO NOT EDIT.

#include "textflag.h"

// func int8AddAvx2Asm(x []int8, y []int8, r []int8)
// Requires: AVX, AVX2
TEXT ·int8AddAvx2Asm(SB), NOSPLIT, $0-72
    MOVQ x_base+0(FP), AX
    MOVQ y_base+24(FP), CX
    MOVQ r_base+48(FP), DX
    MOVQ x_len+8(FP), BX

int8AddBlockLoop:
    CMPQ    BX, $0x00000200
    JL      int8AddTailLoop
    VMOVDQU (AX), Y0
    VMOVDQU 32(AX), Y1
    VMOVDQU 64(AX), Y2
    VMOVDQU 96(AX), Y3
    VMOVDQU 128(AX), Y4
    VMOVDQU 160(AX), Y5
    VMOVDQU 192(AX), Y6
    VMOVDQU 224(AX), Y7
    VMOVDQU 256(AX), Y8
    VMOVDQU 288(AX), Y9
    VMOVDQU 320(AX), Y10
    VMOVDQU 352(AX), Y11
    VMOVDQU 384(AX), Y12
    VMOVDQU 416(AX), Y13
    VMOVDQU 448(AX), Y14
    VMOVDQU 480(AX), Y15
    VPADDB  (CX), Y0, Y0
    VPADDB  32(CX), Y1, Y1
    VPADDB  64(CX), Y2, Y2
    VPADDB  96(CX), Y3, Y3
    VPADDB  128(CX), Y4, Y4
    VPADDB  160(CX), Y5, Y5
    VPADDB  192(CX), Y6, Y6
    VPADDB  224(CX), Y7, Y7
    VPADDB  256(CX), Y8, Y8
    VPADDB  288(CX), Y9, Y9
    VPADDB  320(CX), Y10, Y10
    VPADDB  352(CX), Y11, Y11
    VPADDB  384(CX), Y12, Y12
    VPADDB  416(CX), Y13, Y13
    VPADDB  448(CX), Y14, Y14
    VPADDB  480(CX), Y15, Y15
    VMOVDQU Y0, (DX)
    VMOVDQU Y1, 32(DX)
    VMOVDQU Y2, 64(DX)
    VMOVDQU Y3, 96(DX)
    VMOVDQU Y4, 128(DX)
    VMOVDQU Y5, 160(DX)
    VMOVDQU Y6, 192(DX)
    VMOVDQU Y7, 224(DX)
    VMOVDQU Y8, 256(DX)
    VMOVDQU Y9, 288(DX)
    VMOVDQU Y10, 320(DX)
    VMOVDQU Y11, 352(DX)
    VMOVDQU Y12, 384(DX)
    VMOVDQU Y13, 416(DX)
    VMOVDQU Y14, 448(DX)
    VMOVDQU Y15, 480(DX)
    ADDQ    $0x00000200, AX
    ADDQ    $0x00000200, CX
    ADDQ    $0x00000200, DX
    SUBQ    $0x00000200, BX
    JMP     int8AddBlockLoop

int8AddTailLoop:
    CMPQ    BX, $0x00000020
    JL      int8AddDone
    VMOVDQU (AX), Y0
    VPADDB  (CX), Y0, Y0
    VMOVDQU Y0, (DX)
    ADDQ    $0x00000020, AX
    ADDQ    $0x00000020, CX
    ADDQ    $0x00000020, DX
    SUBQ    $0x00000020, BX
    JMP     int8AddTailLoop

int8AddDone:
    RET

As you can see, in the avo code we only need to specify the register type for each variable, and avo automatically binds an available register of that type when it generates the assembly. This is convenient in many scenarios. However, avo currently supports only the x86 architecture, so it cannot be used to write assembly for ARM CPUs.
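The control flow of the generated assembly (a block loop of 16 registers, followed by a tail loop of one register) may be easier to see in plain Go. The sketch below models only the loop structure. The names addChunk and int8AddBlocked are hypothetical, and each chunk is added with an ordinary scalar loop rather than SIMD instructions.

```go
package main

import "fmt"

const (
	regWidth  = 32                // bytes (int8 elements) per YMM register
	unroll    = 16                // registers used per block iteration
	blockSize = regWidth * unroll // 512 elements per block iteration
)

// addChunk adds n consecutive elements starting at offset off. In the
// assembly version this work is done on whole registers by
// VMOVDQU / VPADDB / VMOVDQU.
func addChunk(x, y, r []int8, off, n int) {
	for i := off; i < off+n; i++ {
		r[i] = x[i] + y[i]
	}
}

// int8AddBlocked mirrors the block loop and the tail loop, and reports
// how many iterations of each were executed. Like the assembly, it
// ignores a remainder of fewer than 32 elements, which is why the
// input is assumed to be padded.
func int8AddBlocked(x, y, r []int8) (blocks, tails int) {
	off, n := 0, len(x)
	for n >= blockSize { // int8AddBlockLoop
		addChunk(x, y, r, off, blockSize)
		off += blockSize
		n -= blockSize
		blocks++
	}
	for n >= regWidth { // int8AddTailLoop
		addChunk(x, y, r, off, regWidth)
		off += regWidth
		n -= regWidth
		tails++
	}
	return blocks, tails
}

func main() {
	n := 2*blockSize + 3*regWidth
	x, y, r := make([]int8, n), make([]int8, n), make([]int8, n)
	for i := range x {
		x[i], y[i] = 1, 2
	}
	fmt.Println(int8AddBlocked(x, y, r))
}
```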

Instructions that cannot be called directly from the Go language

In addition to SIMD, there are many CPU instructions that cannot be used from the Go language itself, such as cryptography-related instructions. In C/C++, such instructions can be called conveniently through the compiler's built-in intrinsic functions (provided by both gcc and clang). Unfortunately, the Go language does not provide intrinsic functions. In such a scenario, assembly is the only solution. The Go language's own official crypto library contains a lot of assembly code.

Here we take the CRC32C instruction as an example. In MatrixOne's hash table implementation, the hash function for integer keys uses only a single CRC32 instruction, which in theory achieves the highest possible performance. The code is as follows:

TEXT ·Crc32Int64Hash(SB), NOSPLIT, $0-16
    MOVQ   $-1, SI
    CRC32Q data+0(FP), SI
    MOVQ   SI, ret+8(FP)
    RET

In the actual code, a batch version is used in order to eliminate the jump overhead of calling an assembly function and the overhead of passing parameters on the stack. To save space, a simplified version is shown here.
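For readers without the assembly at hand, the hash can be approximated in software with the standard hash/crc32 package. The sketch below is a model under stated assumptions, not MatrixOne's code: crc32Int64HashModel is a hypothetical name, the key is assumed to be a uint64 whose bytes the instruction consumes in little-endian order, and the instruction is assumed to apply no final bit inversion (which the standard library does apply, so it is undone here).

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// castagnoli is the CRC-32C polynomial table, the polynomial used by
// the x86 CRC32 instruction.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// crc32Int64HashModel is a hypothetical software model of
// Crc32Int64Hash: one CRC-32C step over the 8 bytes of the key,
// starting from an all-ones state.
func crc32Int64HashModel(data uint64) uint64 {
	var buf [8]byte
	binary.LittleEndian.PutUint64(buf[:], data)
	// crc32.Checksum starts from an all-ones state and inverts the
	// final value; the inversion is undone to model the bare instruction.
	return uint64(^crc32.Checksum(buf[:], castagnoli))
}

func main() {
	for _, k := range []uint64{0, 1, 2, 1 << 40} {
		fmt.Printf("%d -> %#x\n", k, crc32Int64HashModel(k))
	}
}
```

The result always fits in 32 bits, so a hash table using it would take the bucket index from the low bits.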

Special optimizations that the compiler cannot achieve

The following is part of the algorithm used by MatrixOne for finding the intersection of two ordered 64-bit integer arrays:

loop:
    CMPQ  DX, DI
    JE    done
    CMPQ  R11, R8
    JE    done
    MOVQ  (DX), R10
    MOVQ  R10, (SI)
    CMPQ  R10, (R11)
    SETLE AL
    SETGE BL
    SETEQ CL
    SHLB  $0x03, AL
    SHLB  $0x03, BL
    SHLB  $0x03, CL
    ADDQ  AX, DX
    ADDQ  BX, R11
    ADDQ  CX, SI
    JMP   loop
done:
    ...

The line CMPQ R10, (R11) compares the elements at the current pointer positions of the two arrays. The next few lines advance the pointers of the two input arrays and of the result array based on the result of this comparison. A textual explanation is not as clear as the following equivalent C code:

while (true) {
    if (a == a_end) break;
    if (b == b_end) break;
    int64_t va = *a, vb = *b;  /* values compared once per iteration */
    *c = va;
    if (va <= vb) ++a;
    if (va >= vb) ++b;
    if (va == vb) ++c;
}

In the assembly code, the loop body performs only one comparison of elements, and the pointer updates involve no branch jumps. High-level language compilers cannot achieve this optimization because no high-level language provides semantics that map directly onto the CPU instruction set, such as "modify three different numbers according to three different results of one comparison operation."
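For comparison, the same algorithm can be sketched in pure Go. This is an illustration with hypothetical names (b2i, intersectSorted), not MatrixOne's implementation. The helper b2i plays the role of the SETcc instructions, but the Go compiler remains free to compile it with branches, which is precisely the limitation described above.

```go
package main

import "fmt"

// b2i converts a comparison result to 0 or 1, playing the role of
// SETLE / SETGE / SETEQ in the assembly version.
func b2i(b bool) int {
	if b {
		return 1
	}
	return 0
}

// intersectSorted returns the intersection of two sorted slices.
// As in the assembly, the current element of a is always written to
// the output position, and the position only advances on a match.
func intersectSorted(a, b []int64) []int64 {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	c := make([]int64, n) // at most min(len(a), len(b)) matches
	i, j, k := 0, 0, 0
	for i < len(a) && j < len(b) {
		va, vb := a[i], b[j]
		c[k] = va
		i += b2i(va <= vb)
		j += b2i(va >= vb)
		k += b2i(va == vb)
	}
	return c[:k]
}

func main() {
	fmt.Println(intersectSorted([]int64{1, 3, 5, 7}, []int64{3, 4, 5, 8}))
}
```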

This example is a demonstration of the power of assembly language. With the continuous development of programming languages, the level of abstraction is getting higher and higher, but in the scenario of maximizing performance, assembly languages that deal directly with CPU instructions are still needed.
