2025-04-06 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article explains in detail how to analyze embedded Linux kernel error tracking technology. It is shared here as a reference, in the hope that readers come away with a working understanding of the topic.
With the wide application of embedded Linux systems, higher requirements are placed on system reliability, especially in fields where life and property are at stake. Such systems are required to reach safety integrity level 3 or higher [1], with a failure rate (the probability of dangerous failure per hour) below 10^-7, which corresponds to a mean time between failures (MTBF) of more than 1141 years. Improving system reliability has therefore become an arduous task. A survey of 14878 controller systems deployed by one company in the industrial field shows that from the beginning of 2004 to the end of September 2007, with continuous improvement of hardware and software, the failure rate according to error reports fell to less than 1/5 of its 2004 level, but the time needed to find errors increased by more than a factor of three.
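The MTBF figure follows directly from the failure-rate bound. As a worked check, assuming a constant failure rate and a 365-day year:

```latex
\[
\mathrm{MTBF} = \frac{1}{\lambda} > \frac{1}{10^{-7}\,\mathrm{h}^{-1}} = 10^{7}\,\mathrm{h},
\qquad
\frac{10^{7}\,\mathrm{h}}{24 \times 365\,\mathrm{h/year}} \approx 1141.6\ \text{years}.
\]
```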
It is true that software problems contribute to the rising time needed to resolve faults, but the main reason is the lack of the necessary means to help diagnose them. Statistical tracking of faults shows that the software errors that are hardest to solve, and that take the longest from discovery to resolution, are concentrated in the core of the operating system, with a large proportion of them in the drivers [2]. Error tracking technology is therefore regarded as an important measure for raising a system's safety integrity level [1]. Most modern operating systems provide a kernel "crash dump" mechanism: when the system goes down, it saves the memory contents to disk [3], sends them over the network to a fault server [3], or starts the kernel debugger directly [4], so that the failure can be analyzed afterwards and the software improved.
Several crash dump mechanisms have been developed for the Linux kernel in recent years:
(1) LKCD (Linux Kernel Crash Dump) mechanism [3]
(2) KDUMP (Linux Kernel Dump) mechanism [4]
(3) KDB mechanism [5]
(4) KGDB mechanism [6].
Comparing the above mechanisms shows that the four have the following three things in common:
(1) they suit applications with abundant computing resources and sufficient storage space.
(2) they place no strict requirement on the recovery time after a system crash.
(3) they mainly target general-purpose hardware platforms, such as the X86 platform.
When one of these mechanisms is used directly in an embedded application, three difficulties arise that cannot be worked around:
(1) insufficient storage space
Embedded systems generally use Flash for storage, but the capacity of the Flash is limited and may be far smaller than the system's main memory. It is therefore not feasible to save the entire memory contents to Flash.
(2) the recording time must be as short as possible
Embedded systems generally require the reset response time to be as short as possible. For some embedded operating systems a reset and restart takes no more than 2 seconds, whereas the kernel crash dump mechanisms available on Linux need at least 30 seconds. Writing Flash is also time-consuming: experiments show that writing 2 MB of data to Flash takes as long as 400 ms.
(3) support for specific hardware platforms is required
Embedded systems use a wide variety of hardware. The four mechanisms mentioned above support the X86 platform well, but their support for other hardware is not mature.
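To put the Flash figure from difficulty (2) in perspective, the following minimal sketch extrapolates linearly from the measured 2 MB in 400 ms. The function names and the linear scaling are assumptions made for illustration; real Flash devices also spend time on erase cycles, which this ignores.

```c
#include <stdint.h>

/* Measured figure from the text: 2 MB written to Flash in about 400 ms. */
#define MEASURED_BYTES (2u * 1024u * 1024u)
#define MEASURED_MS    400u

/* Linear estimate of the Flash write time, in milliseconds, for `bytes`. */
uint64_t flash_write_ms(uint64_t bytes)
{
    return bytes * MEASURED_MS / MEASURED_BYTES;
}

/* Returns 1 if writing `bytes` fits within `budget_ms`, otherwise 0. */
int fits_reset_budget(uint64_t bytes, uint64_t budget_ms)
{
    return flash_write_ms(bytes) <= budget_ms;
}
```

For example, `flash_write_ms(64 * 1024 * 1024)` gives 12800 ms for a hypothetical 64 MB of main memory, well beyond a 2-second reset budget, which is why a full memory dump is ruled out.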
Because of these difficulties, porting any of the four kernel crash dump mechanisms to a specific embedded platform is very hard. In view of these three characteristics of embedded systems, this article introduces LCRT (Linux Crash Record and Trace), an embedded Linux kernel crash information recording mechanism built for a specific platform, which provides an auxiliary means of locating and solving software faults in embedded Linux systems.
1. Analysis of Linux kernel crash.
Analysis of how the Linux kernel handles the various "traps" shows that the kernel can detect errors caused by applications: when an application divides by zero, accesses memory out of bounds, overflows a buffer and so on, the kernel's exception handling routines handle the resulting exception. When an application produces an unrecoverable error, the kernel can simply terminate the offending application, and other applications continue to run normally.
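The isolation of a faulty application can be observed from user space. The following is a minimal sketch, assuming a POSIX system; the function name `run_faulty_child` is illustrative and not part of the original work. A child process writes through a null pointer, the kernel terminates it with a signal, and the parent carries on.

```c
#include <signal.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that writes through a null pointer. Returns the number of
 * the signal that killed the child, 0 if it exited normally, -1 on error.
 */
int run_faulty_child(void)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: the address is read from a volatile variable so the
         * compiler cannot prove it is null and rewrite the access. */
        volatile uintptr_t addr = 0;
        *(volatile unsigned long *)addr = 0;
        _exit(0); /* not reached when the write faults */
    }

    /* Parent: unaffected by the child's fault, it only collects the status. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On Linux the returned value is normally `SIGSEGV`: only the child is terminated, and the calling process keeps running.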
If a bug exists in the Linux kernel itself or in a newly developed kernel module, errors such as "division by zero", "memory access out of bounds" and "buffer overflow" are likewise handled by the kernel's exception handling routines. If the exception handler judges a kernel exception to be "serious and unrecoverable", the result is a "kernel panic", that is, a Linux kernel crash. Figure 1 shows the flow of exception handling in the Linux kernel.
2. Design and implementation of LCRT mechanism.
Analysis of the Linux kernel code shows that the kernel itself provides a "kernel notification mechanism" [7-8] with predefined "kernel event notification chains", which let developers of kernel extensions run additional processing when specific kernel events occur. Study of the kernel source further shows that a notification chain and a notification point are predefined for the "serious and unrecoverable" kernel exceptions described above. The LCRT mechanism can therefore attach itself to the predefined "kernel crash notification chain" [7] invoked in the kernel's panic function, obtain information about the crash site after a kernel crash occurs, and record it in non-volatile memory so that the cause of the crash can be analyzed.
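The idea of a notification chain can be illustrated with a small user-space model. This is a sketch only: every `sim_` name is hypothetical and merely mirrors the shape of the kernel's notifier blocks (a callback, a next pointer and a priority). It is not the kernel API.

```c
#include <stddef.h>

#define SIM_NOTIFY_OK   0
#define SIM_NOTIFY_STOP 1

struct sim_notifier_block {
    int (*notifier_call)(struct sim_notifier_block *nb,
                         unsigned long event, void *data);
    struct sim_notifier_block *next;
    int priority;
};

/* Insert `nb` so that higher-priority blocks are called first. */
void sim_chain_register(struct sim_notifier_block **head,
                        struct sim_notifier_block *nb)
{
    while (*head != NULL && (*head)->priority >= nb->priority)
        head = &(*head)->next;
    nb->next = *head;
    *head = nb;
}

/* Call the registered blocks in order; returns how many were called. */
int sim_chain_call(struct sim_notifier_block *head,
                   unsigned long event, void *data)
{
    int called = 0;

    for (; head != NULL; head = head->next) {
        called++;
        if (head->notifier_call(head, event, data) == SIM_NOTIFY_STOP)
            break;
    }
    return called;
}

/* Hypothetical crash recorder: remembers the message passed at "panic". */
static const char *sim_last_panic_msg;

int sim_lcrt_notify(struct sim_notifier_block *nb,
                    unsigned long event, void *data)
{
    (void)nb;
    (void)event;
    sim_last_panic_msg = (const char *)data;
    return SIM_NOTIFY_OK;
}
```

A recorder registers one block on the chain, and the code that detects the fatal condition calls `sim_chain_call` with the panic message. The recorder then runs without the caller knowing anything about it, which is the decoupling the LCRT design relies on.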
2.1 Design Essentials
The design and implementation of the LCRT mechanism rely on the following specific mechanisms:
(1) Compiler options and kernel dependency
The Linux kernel and its drivers are compiled with GCC [9], the open source compiler of the GNU project [9]. So that the LCRT mechanism can easily extract and record information, the Linux kernel and the related drivers and applications must be compiled with a specific GCC option: -mpoke-function-name [9]. Binaries compiled with this option embed the C-level name of each function, which makes the information recorded while backtracking the function call chain readable.
(2) Linux kernel notify_chain mechanism [8]
The Linux kernel provides a "notification chain" facility and predefines a kernel crash notification chain. When the system enters an "unrecoverable" state in the kernel's exception handling routine, the notification functions registered on that chain are called in chain order.
(3) Stack layout of function calls
Most of the Linux kernel is written in C, and C is also used to develop kernel extensions. The kernel code, and the code that joins the kernel execution environment as loadable kernel modules (LKMs), follow regular calling conventions, so the stack layout produced while this code runs is regular as well. For example, before its body executes, each function saves the return address of the call, the parameters passed in the call, and the base of the stack frame owned by the calling function.
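The regular stack layout in (3) is what makes backtracking possible. The sketch below walks a simulated stack and is not taken from the original work: it assumes an APCS-style frame of four saved words (fp, sp, lr, pc) as used on 32-bit ARM, and it uses word indices into an array where real code would use byte addresses. The name `walk_frames` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Simulated stack of 32-bit words. For a frame pointer `fp` (a word index
 * here), the assumed layout is:
 *   stack[fp]     saved pc
 *   stack[fp - 1] saved lr  (return address into the caller)
 *   stack[fp - 2] saved sp
 *   stack[fp - 3] saved fp  (caller's frame pointer, 0 at the chain head)
 * Stores up to `max_depth` return addresses and returns how many were found.
 */
size_t walk_frames(const uint32_t *stack, uint32_t fp,
                   uint32_t *ret_addrs, size_t max_depth)
{
    size_t depth = 0;

    while (fp >= 3 && depth < max_depth) {
        ret_addrs[depth++] = stack[fp - 1]; /* where this function returns to */
        fp = stack[fp - 3];                 /* step to the caller's frame */
    }
    return depth;
}
```

Bounding the walk with `max_depth` keeps the recording time predictable even if the frame chain is corrupted, which matters when the walk runs inside a crashing kernel.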
2.2 Design idea of LCRT mechanism
The LCRT mechanism is divided into a Linux kernel module [8] part and a Linux user program part. The kernel part is implemented as a loadable kernel module rather than as a direct modification of the Linux kernel. This reduces the coupling between the kernel and the LCRT mechanism and allows each to be upgraded independently. The user program part reads and clears the information that the LCRT mechanism has saved in non-volatile memory.
In view of the characteristics of embedded systems, the design of the LCRT mechanism makes the following decisions:
(1) record the function call chain, which is the information that helps in locating and solving the problem.
(2) in order not to take up too much storage space, selectively save the stack contents used by the functions on the call chain instead of saving everything.
(3) save the recorded information in non-volatile memory, which both preserves it across power loss and shortens the write time.
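Decisions (1) and (2) amount to a small, fixed-size record. The following layout is a hypothetical sketch and not the format used by LCRT: the type names, the caps on depth and name length, and the number of stack words kept per frame are all assumptions chosen for illustration.

```c
#include <stdint.h>
#include <string.h>

#define REC_MAX_FRAMES  8   /* hypothetical cap on call-chain depth */
#define REC_NAME_LEN    24  /* hypothetical cap on a function name */
#define REC_STACK_WORDS 4   /* hypothetical stack words kept per frame */

struct rec_frame {
    char     name[REC_NAME_LEN];
    uint32_t return_addr;
    uint32_t stack[REC_STACK_WORDS];
};

struct crash_record {
    uint32_t         frame_count;
    struct rec_frame frames[REC_MAX_FRAMES];
};

/* Append one frame; returns 0 on success, -1 when the record is full. */
int rec_add_frame(struct crash_record *rec, const char *name,
                  uint32_t return_addr, const uint32_t *words, size_t nwords)
{
    struct rec_frame *f;

    if (rec->frame_count >= REC_MAX_FRAMES)
        return -1;
    f = &rec->frames[rec->frame_count];
    memset(f, 0, sizeof(*f));
    strncpy(f->name, name, REC_NAME_LEN - 1);   /* truncate long names */
    f->return_addr = return_addr;
    if (nwords > REC_STACK_WORDS)
        nwords = REC_STACK_WORDS;               /* keep only the first words */
    if (nwords > 0)
        memcpy(f->stack, words, nwords * sizeof(uint32_t));
    rec->frame_count++;
    return 0;
}
```

Because every field has a fixed upper bound, the size of the whole record is known at compile time, so the write to non-volatile memory has a predictable cost.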
The design of the LCRT mechanism covers the following five aspects:
(1) design a Linux kernel module so that the LCRT mechanism is loaded dynamically and the Linux kernel code is modified as little as possible.
(2) attach the notification function of LCRT to the corresponding predefined Linux kernel notification chain.
(3) obtain the function call information by stack backtracking in the notification handler of the LCRT mechanism.
(4) record the function call information and the stack contents obtained by backtracking in non-volatile memory.
(5) develop a user-space tool that can read the saved information from non-volatile memory.
2.3 Implementation of LCRT mechanism
The LCRT mechanism can be implemented step by step following the design idea of Section 2.2. For reasons of space, this article does not go into the principles and implementation details of Linux kernel modules, and gives only pseudocode for the kernel module of the LCRT mechanism. The load function of the LCRT mechanism is described in pseudocode as follows:
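    int lcrt_init(void)
    {
        printk("Registering my__panic notifier.\n");

        /* Map the non-volatile memory region used for crash records. */
        bt_nvram_ptr = (volatile unsigned char *)ioremap_nocache(BT_NVRAM_BASE, BT_NVRAM_LENGTH);
        bt_nvram_index += sizeof(struct bt_info);

        /* One call is truncated in the source; only its tail survives:
           ... *) bt_nvram_ptr, BT_NVRAM_LENGTH); */

        /* Attach the LCRT handler to the kernel's panic notification chain. */
        notifier_chain_register(&panic_notifier_list, &my_panic_block);
        return 0;
    }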
The notification handler of the LCRT mechanism backtracks the function call relationship and obtains the function names and the contents of the function stacks. For reasons of space, it is described here with the following pseudocode:
    void ll_bt_information(struct pt_regs *pr)
    {
        /* initialization work such as variable definitions */
        do {
            /* get the register information saved when the function starts
               execution from the top of the function stack frame */
            reglist = *(unsigned long *)(*myfp - 8);
            /* get the name of the function from the code area of the function */
            /* extract the function parameter information saved before the
               function body code is executed from the function stack frame */
            /* get the location of the code calling this function from the stack
               frame of this function, and the bottom of the stack frame of the
               function calling this function */
        } while (/* until the chain head of the function call chain */);

        /* get the contents of the function call stack frames */
        /* fill in the recording header of the information record */
        /* save the information obtained in the above loop to the non-volatile memory */
        write_to_nvram((void *)bt_nvram_ptr, &bt_record_header, sizeof(bt_info_t));
    }
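The step "get the name of the function from the code area" depends on the layout that -mpoke-function-name produces. According to the GCC documentation, the name string is placed directly before the function, followed by a marker word whose top 8 bits are set and whose low bits hold the padded length of the name. The sketch below decodes that layout from a simulated code region. The function `read_poked_name` is hypothetical, and how the function's entry address is derived from the saved pc is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POKE_MARKER_MASK 0xff000000u

/*
 * `text` is a simulated code region and `entry_off` the byte offset of a
 * function's first instruction. On success the embedded name is copied to
 * `out` and 0 is returned; -1 means no embedded name was found.
 */
int read_poked_name(const uint8_t *text, size_t entry_off,
                    char *out, size_t out_len)
{
    uint32_t marker;
    size_t name_len, i;
    const char *name;

    if (entry_off < 4 || out_len == 0)
        return -1;
    memcpy(&marker, text + entry_off - 4, sizeof(marker));
    if ((marker & POKE_MARKER_MASK) != POKE_MARKER_MASK)
        return -1;                          /* top 8 bits not set: no name */
    name_len = marker & ~POKE_MARKER_MASK;  /* padded name length in bytes */
    if (name_len == 0 || name_len > entry_off - 4)
        return -1;
    name = (const char *)text + entry_off - 4 - name_len;
    for (i = 0; i + 1 < out_len && i < name_len && name[i] != '\0'; i++)
        out[i] = name[i];
    out[i] = '\0';
    return 0;
}
```

The bounds checks matter here: after a crash the memory being read may itself be corrupted, so the length taken from the marker word is never trusted beyond the start of the region.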
3. Validate and evaluate the LCRT mechanism
3.1 Deploy the LCRT mechanism
The following work must be done to deploy the LCRT mechanism and make it operational:
(1) compile the Linux kernel module of the LCRT mechanism against the target Linux kernel.
(2) load the kernel module of the LCRT mechanism into the Linux kernel.
3.2 Experimental results
To test the LCRT mechanism, a device driver module that crashes the Linux kernel was constructed and named bugguy.ko. The code in bugguy.ko that causes the kernel crash is as follows:
    irqreturn_t my_timer_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
        /* confirm hardware status and clear interrupt status */
        if (ujiffies > 5000) {
            void *ill_pointer = NULL;
            *(unsigned long *)ill_pointer = 0;  /* BUG: write through a null pointer */
        } else {
            ujiffies++;
        }
        return IRQ_HANDLED;
    }
Description: the assignment through ill_pointer is the code that generates the bug.
As the code above shows, the error is caused by dereferencing a null pointer. A null pointer dereference in an interrupt handler crashes the Linux kernel. bugguy.ko is loaded into the Linux kernel on an embedded Linux system where the LCRT mechanism is deployed, so that the interrupt handler that crashes the kernel gets to run. The LCRT mechanism then saves the relevant information to non-volatile memory. After the system is reset, the saved information can be read out with the user-space tool of the LCRT mechanism. The experiment yields the function call chain information shown in Figure 2.
Figure 2 shows that the interrupt handler containing the faulty code is the "culprit" that actually brought the system down. All the recorded information takes up less than 1 KB of storage, and writing it to non-volatile memory takes no more than 50 ms. For such a small cost in space and time, the recorded information is of great help in finding and solving problems.
The experimental results show that with the LCRT mechanism, hidden software defects that may bring down an embedded Linux system can be located quickly. This provides key auxiliary information for subsequent troubleshooting and software improvement, and helps improve the stability and reliability of the embedded Linux kernel.
The LCRT mechanism was developed for ARM-based embedded Linux applications to record the function call chain and stack information at the time of a kernel crash in non-volatile memory. So far, it can record the function call chain when an ARM-based embedded Linux kernel crashes. The function names, the parameter information of each function in the call chain, and the stack frame of each function in the call chain can be obtained directly. The recorded information is an important aid in improving and developing ARM-based embedded Linux applications.
That concludes this look at how to analyze embedded Linux kernel error tracking technology. I hope the content above is of some help.