Learn how to use NEON to accelerate algorithms in kernel mode

Updated 2025-07-13. From: SLTechnology News&Howtos (shulou)

This article shows how to use NEON to accelerate an algorithm in Linux kernel mode, working through the details with examples.

Starting with the Cortex-A series, ARM processors integrate a NEON processing unit. It can be thought of as a coprocessor designed for matrix operations and similar algorithms. It is especially well suited to image, video, and audio processing, and is widely used.

This article first briefly introduces the NEON processing unit, then explains how to use NEON in kernel mode, and finally works through an example.

1. Introduction to NEON

The best reference is the official document, the Cortex™-A Series Programmer's Guide; the description below is excerpted from it.

1.1 SIMD

NEON is a SIMD (single instruction, multiple data) architecture: one instruction operates on multiple data elements. A NEON register holds many elements at once, and the element width is configurable (8-bit, 16-bit, or 32-bit units, with several units per register). This is its main advantage.

As the accompanying figure shows, the APU needs at least four instructions to complete the add operation, while NEON needs only one. Counting the loads and stores as well, the saving in instructions is even larger.

These features make NEON particularly suitable for block data such as images, video, and audio.
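The four-element add described above can be sketched with intrinsics. This is a minimal illustration rather than code from the guide: the function name `add_u32` and the `HAVE_NEON` macro are made up for the example, and a scalar fallback is included so that the same file also builds on a machine without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* out[i] = a[i] + b[i] for n elements; n must be a multiple of 4. */
void add_u32(const uint32_t *a, const uint32_t *b, uint32_t *out, size_t n)
{
    assert(n % 4 == 0); /* the sketch only handles whole groups of four */
    for (size_t i = 0; i < n; i += 4) {
#if HAVE_NEON
        /* Two vector loads, one vector add covering four lanes, one store. */
        uint32x4_t va = vld1q_u32(a + i);
        uint32x4_t vb = vld1q_u32(b + i);
        vst1q_u32(out + i, vaddq_u32(va, vb));
#else
        /* Scalar fallback: four separate adds for the same four elements. */
        for (size_t k = 0; k < 4; k++)
            out[i + k] = a[i + k] + b[i + k];
#endif
    }
}
```

On the NEON path each group of four elements costs one add instead of four, which is the saving the text refers to.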

1.2 NEON architecture overview

NEON is also a load/store architecture. Its registers are 64 or 128 bits wide and hold vectors of data elements, and it provides a number of instructions that make vector operations convenient.

1.2.1 Commonality with VFP

1.2.2 Data types

The data type is given as a suffix on the instruction, for example VMLAL.S8, where S8 denotes signed 8-bit elements:

1.2.3 Registers

There are 32 64-bit registers, D0 to D31, which can also be viewed as 16 128-bit registers, Q0 to Q15. The register bank is shared with VFP.

The data elements in a register can be 8, 16, or 32 bits wide, configured as needed.

NEON instructions come in several variants: Normal, Long, Wide, Narrow, and Saturating. The first four are distinguished by the element widths of the source and destination registers of the operation.
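The difference between the variants is easiest to see on a single lane. The sketch below models one unsigned 8-bit lane in plain C; the function names are made up for the illustration, and the instruction named in each comment is the ARMv7 form that, to my understanding, behaves this way.

```c
#include <assert.h>
#include <stdint.h>

/* Normal (e.g. VADD.I8): the result has the input width, so it wraps. */
uint8_t lane_add_normal(uint8_t a, uint8_t b) { return (uint8_t)(a + b); }

/* Long (e.g. VADDL.U8): 8-bit inputs, 16-bit result, nothing is lost. */
uint16_t lane_add_long(uint8_t a, uint8_t b) { return (uint16_t)(a + b); }

/* Wide (e.g. VADDW.U8): 16-bit input plus 8-bit input, 16-bit result. */
uint16_t lane_add_wide(uint16_t a, uint8_t b) { return (uint16_t)(a + b); }

/* Narrow (e.g. VMOVN.I16): keep only the low half of a 16-bit lane. */
uint8_t lane_narrow(uint16_t a) { return (uint8_t)(a & 0xFF); }

/* Saturating (e.g. VQADD.U8): clamp to the largest value instead of wrapping. */
uint8_t lane_add_sat(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return sum > 255 ? 255 : (uint8_t)sum;
}

/* Expected results for 200 + 100, which does not fit in 8 bits. */
void variants_demo(void)
{
    assert(lane_add_normal(200, 100) == 44);   /* 300 mod 256 */
    assert(lane_add_long(200, 100) == 300);
    assert(lane_add_sat(200, 100) == 255);
}
```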

1.2.4 Instruction set

1.3 Overview of NEON instruction classification

There are many instructions. For details, see the Cortex™-A Series Programmer's Guide. They can be roughly divided into:

NEON general data processing instructions
NEON shift instructions
NEON logical and compare operations
NEON arithmetic instructions
NEON multiply instructions
NEON load and store element and structure instructions
NEON and VFP pseudo-instructions

A brief list of the instructions:

There is no rotate-left instruction, and a left shift by a negative count is a right shift.
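The signed shift count can be modelled as follows. This is a sketch under the assumption that the instruction meant here is VSHL with a register count (intrinsic `vshl_u8`); `lane_vshl_u8` and `shift8_u8` are names made up for the example, and the scalar path stands in where NEON is not available.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Scalar model of one unsigned 8-bit lane of VSHL with a register count. */
uint8_t lane_vshl_u8(uint8_t value, int8_t shift)
{
    if (shift >= 8 || shift <= -8)
        return 0;                          /* every bit is shifted out */
    if (shift >= 0)
        return (uint8_t)(value << shift);  /* positive count: shift left */
    return (uint8_t)(value >> -shift);     /* negative count: shift right */
}

/* Shift eight lanes, each by its own signed count. */
void shift8_u8(const uint8_t in[8], const int8_t counts[8], uint8_t out[8])
{
#if HAVE_NEON
    vst1_u8(out, vshl_u8(vld1_u8(in), vld1_s8(counts)));
#else
    for (int i = 0; i < 8; i++)
        out[i] = lane_vshl_u8(in[i], counts[i]);
#endif
}

/* Expected lane results for a few counts. */
void vshl_demo(void)
{
    assert(lane_vshl_u8(0x81, 1) == 0x02);   /* the top bit falls off */
    assert(lane_vshl_u8(0x81, -1) == 0x40);  /* negative count shifts right */
    assert(lane_vshl_u8(0x81, 0) == 0x81);
}
```

Because every lane carries its own count, one instruction can move different bits in different directions, which matters for the bit-level example later in the article.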

The load and store instructions are harder to understand and need some explanation.
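One aspect that often causes confusion is the structure load, which de-interleaves data while loading it. The sketch below uses a two-element structure load (VLD2, intrinsic `vld2_u8`) as an illustration; it is not necessarily the case the original explanation covered, and the function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Split 16 interleaved bytes (x0 y0 x1 y1 ...) into eight x and eight y. */
void deinterleave2_u8(const uint8_t src[16], uint8_t even[8], uint8_t odd[8])
{
#if HAVE_NEON
    /* A single VLD2 fills two registers: val[0] = x values, val[1] = y values. */
    uint8x8x2_t v = vld2_u8(src);
    vst1_u8(even, v.val[0]);
    vst1_u8(odd, v.val[1]);
#else
    /* Scalar model of the same data movement. */
    for (int i = 0; i < 8; i++) {
        even[i] = src[2 * i];
        odd[i] = src[2 * i + 1];
    }
#endif
}

/* A ramp 0..15 splits into the even and the odd numbers. */
void deinterleave_demo(void)
{
    uint8_t src[16], even[8], odd[8];
    for (int i = 0; i < 16; i++)
        src[i] = (uint8_t)i;
    deinterleave2_u8(src, even, odd);
    assert(even[0] == 0 && even[7] == 14);
    assert(odd[0] == 1 && odd[7] == 15);
}
```

The three- and four-element forms work the same way, with every third or fourth byte going to the same register.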

1.4 NEON usage

1.4.1 how NEON is used

There are several ways to use NEON:

Automatic vectorization of C code by the compiler. This requires extra compiler options, and there are some points to watch when writing the C code. The result is too unpredictable to be of practical value.

NEON assembly. This is feasible. Assembly is somewhat more complex, but for a core algorithm it is still worth it.

Intrinsics. Compilers such as gcc and armcc provide a number of built-in functions that correspond to NEON instructions and can be called directly from C. In the disassembly these functions turn directly into the corresponding NEON instructions. This approach fits a C environment well and is relatively simple. The rest of this article uses it.

1.4.2 NEON data types in C

Include the arm_neon.h header file, which is in the gcc directory. The types are all vector types.

typedef __builtin_neon_qi int8x8_t __attribute__ ((__vector_size__ (8)));
typedef __builtin_neon_hi int16x4_t __attribute__ ((__vector_size__ (8)));
typedef __builtin_neon_si int32x2_t __attribute__ ((__vector_size__ (8)));
typedef __builtin_neon_di int64x1_t;
typedef __builtin_neon_sf float32x2_t __attribute__ ((__vector_size__ (8)));
typedef __builtin_neon_poly8 poly8x8_t __attribute__ ((__vector_size__ (8)));
typedef __builtin_neon_poly16 poly16x4_t __attribute__ ((__vector_size__ (8)));
typedef __builtin_neon_uqi uint8x8_t __attribute__ ((__vector_size__ (8)));
typedef __builtin_neon_uhi uint16x4_t __attribute__ ((__vector_size__ (8)));
typedef __builtin_neon_usi uint32x2_t __attribute__ ((__vector_size__ (8)));
typedef __builtin_neon_udi uint64x1_t;
typedef __builtin_neon_qi int8x16_t __attribute__ ((__vector_size__ (16)));
typedef __builtin_neon_hi int16x8_t __attribute__ ((__vector_size__ (16)));
typedef __builtin_neon_si int32x4_t __attribute__ ((__vector_size__ (16)));
typedef __builtin_neon_di int64x2_t __attribute__ ((__vector_size__ (16)));
typedef __builtin_neon_sf float32x4_t __attribute__ ((__vector_size__ (16)));
typedef __builtin_neon_poly8 poly8x16_t __attribute__ ((__vector_size__ (16)));
typedef __builtin_neon_poly16 poly16x8_t __attribute__ ((__vector_size__ (16)));
typedef __builtin_neon_uqi uint8x16_t __attribute__ ((__vector_size__ (16)));
typedef __builtin_neon_uhi uint16x8_t __attribute__ ((__vector_size__ (16)));
typedef __builtin_neon_usi uint32x4_t __attribute__ ((__vector_size__ (16)));
typedef __builtin_neon_udi uint64x2_t __attribute__ ((__vector_size__ (16)));

typedef float float32_t;
typedef __builtin_neon_poly8 poly8_t;
typedef __builtin_neon_poly16 poly16_t;

typedef struct int8x8x2_t { int8x8_t val[2]; } int8x8x2_t;
typedef struct int8x16x2_t { int8x16_t val[2]; } int8x16x2_t;
typedef struct int16x4x2_t { int16x4_t val[2]; } int16x4x2_t;
typedef struct int16x8x2_t { int16x8_t val[2]; } int16x8x2_t;
typedef struct int32x2x2_t { int32x2_t val[2]; } int32x2x2_t;
typedef struct int32x4x2_t { int32x4_t val[2]; } int32x4x2_t;
typedef struct int64x1x2_t { int64x1_t val[2]; } int64x1x2_t;
typedef struct int64x2x2_t { int64x2_t val[2]; } int64x2x2_t;
typedef struct uint8x8x2_t { uint8x8_t val[2]; } uint8x8x2_t;
typedef struct uint8x16x2_t { uint8x16_t val[2]; } uint8x16x2_t;
typedef struct uint16x4x2_t { uint16x4_t val[2]; } uint16x4x2_t;
typedef struct uint16x8x2_t { uint16x8_t val[2]; } uint16x8x2_t;
typedef struct uint32x2x2_t { uint32x2_t val[2]; } uint32x2x2_t;
typedef struct uint32x4x2_t { uint32x4_t val[2]; } uint32x4x2_t;
typedef struct uint64x1x2_t { uint64x1_t val[2]; } uint64x1x2_t;
typedef struct uint64x2x2_t { uint64x2_t val[2]; } uint64x2x2_t;
typedef struct float32x2x2_t { float32x2_t val[2]; } float32x2x2_t;
typedef struct float32x4x2_t { float32x4_t val[2]; } float32x4x2_t;
typedef struct poly8x8x2_t { poly8x8_t val[2]; } poly8x8x2_t;
typedef struct poly8x16x2_t { poly8x16_t val[2]; } poly8x16x2_t;
typedef struct poly16x4x2_t { poly16x4_t val[2]; } poly16x4x2_t;
typedef struct poly16x8x2_t { poly16x8_t val[2]; } poly16x8x2_t;

typedef struct int8x8x3_t { int8x8_t val[3]; } int8x8x3_t;
typedef struct int8x16x3_t { int8x16_t val[3]; } int8x16x3_t;
typedef struct int16x4x3_t { int16x4_t val[3]; } int16x4x3_t;
typedef struct int16x8x3_t { int16x8_t val[3]; } int16x8x3_t;
typedef struct int32x2x3_t { int32x2_t val[3]; } int32x2x3_t;
typedef struct int32x4x3_t { int32x4_t val[3]; } int32x4x3_t;
typedef struct int64x1x3_t { int64x1_t val[3]; } int64x1x3_t;
typedef struct int64x2x3_t { int64x2_t val[3]; } int64x2x3_t;
typedef struct uint8x8x3_t { uint8x8_t val[3]; } uint8x8x3_t;
typedef struct uint8x16x3_t { uint8x16_t val[3]; } uint8x16x3_t;
typedef struct uint16x4x3_t { uint16x4_t val[3]; } uint16x4x3_t;
typedef struct uint16x8x3_t { uint16x8_t val[3]; } uint16x8x3_t;
typedef struct uint32x2x3_t { uint32x2_t val[3]; } uint32x2x3_t;
typedef struct uint32x4x3_t { uint32x4_t val[3]; } uint32x4x3_t;
typedef struct uint64x1x3_t { uint64x1_t val[3]; } uint64x1x3_t;
typedef struct uint64x2x3_t { uint64x2_t val[3]; } uint64x2x3_t;
typedef struct float32x2x3_t { float32x2_t val[3]; } float32x2x3_t;
typedef struct float32x4x3_t { float32x4_t val[3]; } float32x4x3_t;
typedef struct poly8x8x3_t { poly8x8_t val[3]; } poly8x8x3_t;
typedef struct poly8x16x3_t { poly8x16_t val[3]; } poly8x16x3_t;
typedef struct poly16x4x3_t { poly16x4_t val[3]; } poly16x4x3_t;
typedef struct poly16x8x3_t { poly16x8_t val[3]; } poly16x8x3_t;

typedef struct int8x8x4_t { int8x8_t val[4]; } int8x8x4_t;
typedef struct int8x16x4_t { int8x16_t val[4]; } int8x16x4_t;
typedef struct int16x4x4_t { int16x4_t val[4]; } int16x4x4_t;
typedef struct int16x8x4_t { int16x8_t val[4]; } int16x8x4_t;
typedef struct int32x2x4_t { int32x2_t val[4]; } int32x2x4_t;
typedef struct int32x4x4_t { int32x4_t val[4]; } int32x4x4_t;
typedef struct int64x1x4_t { int64x1_t val[4]; } int64x1x4_t;
typedef struct int64x2x4_t { int64x2_t val[4]; } int64x2x4_t;
typedef struct uint8x8x4_t { uint8x8_t val[4]; } uint8x8x4_t;
typedef struct uint8x16x4_t { uint8x16_t val[4]; } uint8x16x4_t;
typedef struct uint16x4x4_t { uint16x4_t val[4]; } uint16x4x4_t;
typedef struct uint16x8x4_t { uint16x8_t val[4]; } uint16x8x4_t;
typedef struct uint32x2x4_t { uint32x2_t val[4]; } uint32x2x4_t;
typedef struct uint32x4x4_t { uint32x4_t val[4]; } uint32x4x4_t;
typedef struct uint64x1x4_t { uint64x1_t val[4]; } uint64x1x4_t;
typedef struct uint64x2x4_t { uint64x2_t val[4]; } uint64x2x4_t;
typedef struct float32x2x4_t { float32x2_t val[4]; } float32x2x4_t;
typedef struct float32x4x4_t { float32x4_t val[4]; } float32x4x4_t;
typedef struct poly8x8x4_t { poly8x8_t val[4]; } poly8x8x4_t;
typedef struct poly8x16x4_t { poly8x16_t val[4]; } poly8x16x4_t;
typedef struct poly16x4x4_t { poly16x4_t val[4]; } poly16x4x4_t;
typedef struct poly16x8x4_t { poly16x8_t val[4]; } poly16x8x4_t;
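The `__vector_size__` attribute in these typedefs is the generic GCC vector extension, so its effect can be tried on any machine. The sketch below is only an illustration: `demo_int32x4_t` is a made-up stand-in built on `int32_t`, whereas the real header builds on compiler-internal base types such as `__builtin_neon_si`. It assumes GCC or a compatible compiler such as Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for int32x4_t: four int32_t lanes in 16 bytes. */
typedef int32_t demo_int32x4_t __attribute__((__vector_size__(16)));

/* Element-wise add; the compiler picks vector instructions where it can. */
static inline demo_int32x4_t demo_add(demo_int32x4_t a, demo_int32x4_t b)
{
    return a + b;
}

void vector_type_demo(void)
{
    demo_int32x4_t a = {1, 2, 3, 4};
    demo_int32x4_t b = {10, 20, 30, 40};
    demo_int32x4_t c = demo_add(a, b);

    assert(sizeof(c) == 16);            /* 4 lanes x 4 bytes */
    assert(c[0] == 11 && c[3] == 44);   /* lanes are added independently */
}
```

The struct types such as `uint8x8x2_t` simply bundle two, three, or four of these vectors in a `val[]` array, which is what the structure load and store intrinsics take and return.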

1.4.3 gcc NEON intrinsic functions

These correspond to the NEON instructions; see the gcc manual for details.

2. Rules for using NEON in kernel mode

In Linux, using NEON intrinsics from an application is fairly convenient: include the arm_neon.h header file and use them directly. In kernel mode, however, there are many restrictions, which are described in detail in the Linux kernel document Documentation/arm/kernel_mode_neon.txt. The main points are:

One further point is particularly critical: the code must be compiled with the -ffreestanding option. Without it, the following compilation error occurs when arm_neon.h is used in kernel mode:

CC [M] /work/platform-zynq/drivers/zynq_fpga_driver/mmi_neon/lcd_hw_fs8812_neon.o
In file included from /home/liuwanpeng/lin/lib/gcc/arm-xilinx-linux-gnueabi/4.8.3/include/arm_neon.h:39:0,
                 from /work/platform-zynq/drivers/zynq_fpga_driver/mmi_neon/lcd_hw_fs8812_neon.c:8:
/home/liuwanpeng/lin/lib/gcc/arm-xilinx-linux-gnueabi/4.8.3/include/stdint.h:9:26: error: no include path in which to search for stdint.h
 # include_next <stdint.h>
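As I understand the kernel document cited in this section, its central rules are to keep the NEON code in its own compilation unit, to bracket every call into it with kernel_neon_begin() and kernel_neon_end(), and not to sleep in between. The sketch below shows only that calling pattern. The names `invert_block` and `invert_block_inner` are hypothetical, and the `#else` branch supplies user-space stand-ins (an assumption made purely so the pattern can be tried outside the kernel).

```c
#ifdef __KERNEL__
#include <linux/types.h>
#include <asm/neon.h>   /* kernel_neon_begin(), kernel_neon_end() */

/* Hypothetical: defined in a separate file built with the NEON flags. */
void invert_block_inner(uint8_t *dst, const uint8_t *src, unsigned int len);
#else
#include <assert.h>
#include <stdint.h>

/* User-space stand-ins that only track nesting (assumption for the sketch). */
static int neon_sections_open;
static void kernel_neon_begin(void) { neon_sections_open++; }
static void kernel_neon_end(void)
{
    assert(neon_sections_open > 0);  /* catches an unbalanced end */
    neon_sections_open--;
}

/* Scalar stand-in for the separately compiled NEON routine. */
static void invert_block_inner(uint8_t *dst, const uint8_t *src,
                               unsigned int len)
{
    for (unsigned int i = 0; i < len; i++)
        dst[i] = (uint8_t)~src[i];
}
#endif

/* Wrapper in ordinary kernel code: the only place that opens the section. */
void invert_block(uint8_t *dst, const uint8_t *src, unsigned int len)
{
    kernel_neon_begin();
    invert_block_inner(dst, src, len);  /* must not sleep in here */
    kernel_neon_end();
}
```

Keeping the wrapper and the NEON routine in different files means the compiler cannot emit NEON instructions outside the bracketed region.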

3. Example

NEON is mostly used for image processing, where the smallest unit processed is 8 bits rather than 1 bit. Examples of that kind are easy to find and are not covered in this article. In my project, I needed to transform a block of LCD data bit by bit to form new data. With ordinary ARM instructions (mask, shift, loop) this is very inefficient, so I decided to use NEON's bit-related instructions for the task.

3.1 Task description

As shown in the following figure, each bit needs to be transformed to form new data.

3.2 Algorithm description

The transform is done with vmsk, vshl, vadd, and other bit operations.
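The mask, shift, and add pattern can be sketched as follows. The actual fs8812 bit mapping is defined by the figure in 3.1, which is not available here, so this example assumes a different, simpler mapping purely for illustration: an 8x8 bit transpose in which bit j of source row i becomes bit i of output row j. The name `gather_bits` and the constants are made up, and "vmsk" is taken to mean masking with a bitwise AND.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

#define ROWS 8   /* source rows per group (assumed) */
#define COLS 16  /* bytes per row: one 128-bit register */

/* dst[j*COLS + c] bit i = src[i*COLS + c] bit j, for all 16 columns. */
void gather_bits(const uint8_t *src, uint8_t *dst)
{
    for (int j = 0; j < ROWS; j++) {
#if HAVE_NEON
        uint8x16_t mask = vdupq_n_u8((uint8_t)(1u << j));
        uint8x16_t acc = vdupq_n_u8(0);
        for (int i = 0; i < ROWS; i++) {
            /* Mask out bit j, then move it to position i (the count may be
             * negative). The bits never overlap, so add acts as OR. */
            uint8x16_t bit = vandq_u8(vld1q_u8(src + i * COLS), mask);
            int8x16_t count = vdupq_n_s8((int8_t)(i - j));
            acc = vaddq_u8(acc, vshlq_u8(bit, count));
        }
        vst1q_u8(dst + j * COLS, acc);
#else
        /* Scalar model of the same steps, one column at a time. */
        for (int c = 0; c < COLS; c++) {
            uint8_t acc = 0;
            for (int i = 0; i < ROWS; i++) {
                uint8_t bit = (uint8_t)(src[i * COLS + c] & (1u << j));
                int count = i - j;
                acc += (uint8_t)(count >= 0 ? bit << count : bit >> -count);
            }
            dst[j * COLS + c] = acc;
        }
#endif
    }
}

/* Row 0 all ones, other rows zero: every output byte has only bit 0 set. */
void gather_bits_demo(void)
{
    uint8_t src[ROWS * COLS], dst[ROWS * COLS];
    memset(src, 0, sizeof src);
    memset(src, 0xFF, COLS);
    gather_bits(src, dst);
    for (int j = 0; j < ROWS; j++)
        assert(dst[j * COLS] == 0x01);
}
```

On the NEON path the inner loop handles 16 columns per instruction, where the scalar version needs a mask, a shift, and an add for every single byte.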

3.3 Kernel configuration

The kernel must be configured to support NEON, otherwise functions such as kernel_neon_begin() and kernel_neon_end() will not be compiled in.

In make menuconfig, the option is under Floating point emulation, as shown in the following figure.

If "Support for NEON in kernel mode" is not enabled, the following errors are reported:

mmi_module_amp: Unknown symbol kernel_neon_begin (err 0)
mmi_module_amp: Unknown symbol kernel_neon_end (err 0)
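For reference, with these options enabled the generated .config would be expected to contain entries along the following lines. The symbol names are the ones used by the mainline 32-bit ARM Kconfig as far as I know; a vendor kernel tree may differ, so treat this as a sketch to check against your own tree.

```
CONFIG_VFP=y
CONFIG_NEON=y
CONFIG_KERNEL_MODE_NEON=y
```

Searching the .config for KERNEL_MODE_NEON is a quick way to confirm the setting before building the module.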

3.4 Module Code

Because the NEON code needs its own compilation options, it is built as a separate kernel module. The makefile is as follows:

CFLAGS_MODULE += -O3 -mfpu=neon -mfloat-abi=softfp -ffreestanding

Core code:

#include ...
#include ...
#include <arm_neon.h> // GCC's own header file; must be compiled with -ffreestanding

#define LCD_8812_ROW_BYTES 16
#define LCD_8812_PAGE_ROWS 8
#define LCD_PAGE_BYTES (LCD_8812_ROW_BYTES*LCD_8812_PAGE_ROWS)

int fs8812_cvt_buf(uint8 *dst, uint8 *src)
{
    uint8x16_t V_src[8];
    uint8x16_t V_tmp[8];
    uint8x16_t V_dst[8];
    uint8x16_t V_msk;
    int8x16_t V_shift;
    int8 RSHL_bits[8] = {0 ...
    ... fb_page_x
    // convert the frame_buf for fs8812
    for (page = 0; page ...
