
How to analyze an ESXi host purple screen (PSOD)

2025-01-16 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains in detail how to analyze a purple screen on an ESXi host. It is intended as a reference for anyone who needs to troubleshoot one.

One: Overview

VMware engineers will be familiar with purple screens. A PSOD (Purple Screen of Death) is a fatal failure of an ESXi host, similar to the blue screen on Microsoft Windows. It is usually caused by a hardware or software fault, such as a software bug, a CPU fault, or a memory leak. When a PSOD occurs, the entire ESXi host stops abruptly, and every virtual machine running on it is affected. All the administrator can do is record the purple screen information and restart the host; if vSphere HA is configured, the affected virtual machines are restarted on other available ESXi hosts.

When you find a purple screen on an ESXi host, record the information on it as soon as possible. The simplest way is to take a screenshot or photo of the screen, because it contains a lot of important information: the ESXi version and build number, the exception type, a register dump, what each CPU was running at the time of the crash, the backtrace, the server uptime, error messages, memory and hardware information, and so on. After the ESXi host has been restarted, a core dump file whose name begins with vmkernel-zdump can be found under /root or /var/core/ on the host. This file can be submitted to VMware technical support for fault analysis; you can also extract the VMkernel log from it with the vmkdump tool and look for clues to the cause of the PSOD. For extracting and identifying the vmkernel-zdump file, see the official KB: https://kb.vmware.com/s/article/1006796?lang=zh_CN
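As a minimal sketch, the check for dump files after a reboot could look like the following Python. The helper name find_zdump_files is hypothetical; the two directories and the vmkernel-zdump prefix are taken from the paragraph above. The function only lists files and does not extract anything from them.

```python
from pathlib import Path


def find_zdump_files(search_dirs=("/root", "/var/core")):
    """Return files whose name starts with vmkernel-zdump, newest first."""
    found = []
    for directory in search_dirs:
        base = Path(directory)
        if not base.is_dir():
            continue  # directory is absent on this system, skip it
        found.extend(p for p in base.glob("vmkernel-zdump*") if p.is_file())
    # Sort by modification time so the most recent crash comes first.
    return sorted(found, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for path in find_zdump_files():
        print(path)
```

The default directories can be overridden, for example to scan a copy of the host's files on a workstation.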

Two: understand the purple screen information

Much of the key information can be read directly from the purple diagnostic screen, and administrators can use it to locate and troubleshoot the fault quickly. The screen contains the following key sections.

Product and build:

VMware ESX Server [Releasebuild-3620759

This part of the purple diagnostic screen identifies the product and build that failed. In this example, the product is ESXi and the build number is 3620759, which corresponds to ESXi 6.0 U2.
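The build number can be pulled out of the product line with a short regular expression. This is only a sketch: parse_build and KNOWN_BUILDS are hypothetical names, and the lookup table contains just the one build that this article identifies.

```python
import re

# Only the build named in this article is listed here (assumption: extend
# this table yourself from VMware's published build numbers).
KNOWN_BUILDS = {3620759: "ESXi 6.0 U2"}


def parse_build(line):
    """Return (build_number, release_name), or None if no build is found."""
    match = re.search(r"Releasebuild-(\d+)", line)
    if match is None:
        return None
    build = int(match.group(1))
    return build, KNOWN_BUILDS.get(build, "unknown release")


if __name__ == "__main__":
    print(parse_build("VMware ESX Server [Releasebuild-3620759"))
```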

Error message:

PCPU 1 locked up. Failed to ack TLB invalidate

This part of the purple diagnostic screen indicates the reported error message. Only a limited number of error messages can be reported. These error messages are discussed later in this article.

CPU register:

frame=0x3a37d98 ip=0x625e94 cr2=0x0 cr3=0x40c66000 cr4=0x16c

es=0xffffffff ds=0xffffffff fs=0xffffffff gs=0xffffffff

eax=0xffffffff ebx=0xffffffff ecx=0xffffffff edx=0xffffffff

ebp=0x3a37ef4 esi=0xffffffff edi=0xffffffff err=-1 eflags=0xffffffff

These are the values that were held in the physical CPU registers when the error occurred. The contents of these registers vary widely, depending on which VMkernel error occurred.
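When the register dump has been copied as text, it can be turned into a dictionary for comparison between incidents. The sketch below assumes the name=value layout shown above; parse_registers is a hypothetical helper, and register names are lowercased so that capitalization differences do not matter.

```python
import re

REGISTER_RE = re.compile(r"(\w+)=(-?(?:0x[0-9a-fA-F]+|\d+))")


def parse_registers(text):
    """Map each register name (lowercased) to its integer value."""
    registers = {}
    for name, value in REGISTER_RE.findall(text):
        # Values are hexadecimal (0x...) except fields such as err=-1.
        base = 16 if "0x" in value.lower() else 10
        registers[name.lower()] = int(value, base)
    return registers


if __name__ == "__main__":
    dump = (
        "frame=0x3a37d98 ip=0x625e94 cr2=0x0 cr3=0x40c66000 cr4=0x16c\n"
        "ebp=0x3a37ef4 esi=0xffffffff edi=0xffffffff err=-1 eflags=0xffffffff"
    )
    for name, value in parse_registers(dump).items():
        print(f"{name:7s} {value:#x}")
```

In the example screen, the ip value (0x625e94) is the same address that appears on the first line of the stack trace, and ebp (0x3a37ef4) is that line's frame address.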

Physical CPU:

*0:1037/helper1-4 1:1107/vmm0:Fagi 2:1121/vmware-vm 3:1122/mks:Franc

This part of the purple diagnostic screen shows the physical CPU that was executing the instruction when the VMkernel error occurred. In this example, the * next to 0 indicates that physical CPU 0 was the one running at the time of the failure. Newer versions of ESX no longer use the *; instead, they mark the CPU with the capitalized prefix CPU. For example, if the same error occurs on a newer version of VMware ESX, the same line is displayed as follows:

CPU0:1037/helper1-4 cpu1:1107/vmm0:Fagi cpu2:1121/vmware-vm cpu3:1122/mks:Franc

This section of the purple diagnostic screen also shows the world (process) that was running on each CPU when the error occurred. In the above example, the world running on the failing CPU is helper1-4.

Note: the process name may have been truncated.
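Both line formats shown above can be handled by one small parser. The following is a sketch under the assumption that each entry has the form cpu:worldID/worldName; the function name parse_pcpu_line is hypothetical. The failing CPU is the one marked with * (older format) or written with the uppercase CPU prefix (newer format).

```python
import re

ENTRY_RE = re.compile(r"(\*\s*)?(CPU|cpu)?(\d+):(\d+)/(\S+)")


def parse_pcpu_line(line):
    """Return ({cpu: (world_id, world_name)}, failing_cpu_or_None)."""
    entries = {}
    failing = None
    for star, prefix, cpu, world_id, name in ENTRY_RE.findall(line):
        cpu = int(cpu)
        # Strip a trailing period that may follow the last entry.
        entries[cpu] = (int(world_id), name.rstrip("."))
        if star or prefix == "CPU":
            failing = cpu
    return entries, failing


if __name__ == "__main__":
    old = "*0:1037/helper1-4 1:1107/vmm0:Fagi 2:1121/vmware-vm 3:1122/mks:Franc"
    new = "CPU0:1037/helper1-4 cpu1:1107/vmm0:Fagi cpu2:1121/vmware-vm cpu3:1122/mks:Franc"
    for line in (old, new):
        entries, failing = parse_pcpu_line(line)
        print(failing, entries[failing])
```

Because world names may be truncated on screen, the parsed names are best treated as prefixes rather than exact identifiers.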

Stack trace:

0x3a37ef4: [0x625e94] Panic+0x17 stack: 0x833ab4, 0x3a37f10, 0x3a37f48

0x3a37f04: [0x625e94] Panic+0x17 stack: 0x833ab4, 0x1, 0x14a03a0

0x3a37f48: [0x64bfa4] TLBDoInvalidate+0x38f stack: 0x3a37f54, 0x40, 0x2

0x3a37f70: [0x66da4d] XMapForceFlush+0x64 stack: 0x0, 0x4d3a, 0x0

0x3a37fac: [0x652b8b] helpFunc+0x2d2 stack: 0x1, 0x14a4580, 0x0

0x3a37ffc: [0x750902] CpuSched_StartWorld+0x109 stack: 0x0, 0x0, 0x0

0x3a38000: [0x0] blk_dev+0xfd76461f stack: 0x0, 0x0, 0x0

The stack trace shows what VMkernel was doing when the error occurred. In this example, VMkernel was trying to invalidate (flush) the TLB, the translation lookaside buffer that caches page-table entries. Evaluating what the kernel was doing when something went wrong makes the stack trace an important tool for diagnosing purple screen errors.
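To read the stack as a call chain, the function names can be extracted from each line. This sketch assumes the line layout shown above and that the first line is the innermost call; parse_stack and call_chain are hypothetical helpers. Dropping frames whose address is 0x0 is also an assumption, made because such a frame does not point at code.

```python
import re

FRAME_RE = re.compile(
    r"(0x[0-9a-fA-F]+):\s*\[(0x[0-9a-fA-F]+)\]\s*(\w+)\+(0x[0-9a-fA-F]+)"
)


def parse_stack(text):
    """Return one dict per stack line, in the order shown on screen."""
    return [
        {
            "frame": int(frame, 16),
            "addr": int(addr, 16),
            "symbol": symbol,
            "offset": int(offset, 16),
        }
        for frame, addr, symbol, offset in FRAME_RE.findall(text)
    ]


def call_chain(frames):
    """List symbols from outermost caller to innermost call, without repeats."""
    names = []
    for frame in reversed(frames):
        if frame["addr"] == 0:
            continue  # assumption: a 0x0 address is not a real code location
        if not names or names[-1] != frame["symbol"]:
            names.append(frame["symbol"])
    return names


if __name__ == "__main__":
    trace = """
    0x3a37ef4: [0x625e94] Panic+0x17 stack: 0x833ab4, 0x3a37f10, 0x3a37f48
    0x3a37f48: [0x64bfa4] TLBDoInvalidate+0x38f stack: 0x3a37f54, 0x40, 0x2
    0x3a37f70: [0x66da4d] XMapForceFlush+0x64 stack: 0x0, 0x4d3a, 0x0
    """
    print(" -> ".join(call_chain(parse_stack(trace))))
```

The resulting symbol list is also a convenient signature for comparing several purple screens from the same host.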

Uptime:

VMK uptime: 7:05:43:45.014 TSC: 1751259712918392

This section shows how long the server has been running since it was last started. In this example, the ESXi host has been running for 7 days, 5 hours, 43 minutes and 45.014 seconds. The TSC value is the number of CPU clock cycles that have elapsed since the server was started.
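The uptime line can be converted into a duration, and dividing the TSC value by that duration gives a rough clock rate. This is a sketch assuming the days:hours:minutes:seconds layout shown above; parse_uptime is a hypothetical helper.

```python
import re
from datetime import timedelta

UPTIME_RE = re.compile(
    r"VMK uptime:\s*(\d+):(\d+):(\d+):(\d+(?:\.\d+)?)\s+TSC:\s*(\d+)"
)


def parse_uptime(line):
    """Return (uptime as timedelta, TSC cycle count) from an uptime line."""
    match = UPTIME_RE.search(line)
    if match is None:
        raise ValueError("not a VMK uptime line")
    days, hours, minutes = (int(match.group(i)) for i in (1, 2, 3))
    seconds = float(match.group(4))
    uptime = timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)
    return uptime, int(match.group(5))


if __name__ == "__main__":
    uptime, tsc = parse_uptime("VMK uptime: 7:05:43:45.014 TSC: 1751259712918392")
    print("uptime:", uptime)
    # Cycles divided by elapsed seconds approximates the TSC frequency.
    print(f"approx. clock: {tsc / uptime.total_seconds() / 1e9:.2f} GHz")
```

For the example line the quotient works out to about 2.8 GHz, which is consistent with the TSC counting clock cycles since boot.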

Core dump:

Starting coredump to disk Starting coredump to disk Dumping using slot 1 of 1...using slot 1 of 1... Log

This part of the purple diagnostic screen represents the VMkernel memory contents that are being copied to the vmkcore partition.

Three: locate the fault through the error message

The previous section described how to read the purple screen; the key field is the error message. The VMkernel error message shown on the purple screen can be used to determine the cause of the problem. Only a limited number of error messages can be generated. The following is a list of known VMkernel error messages.

Type: Console Oops

Error example: COS Error: Oops

Description: the ESX host fails and displays a purple screen when a service console Oops occurs. Unlike most purple screen errors, this error is not triggered by VMkernel; it is triggered by the service console and occurs at the Linux level. These purple screen errors contain additional information from the Linux kernel. For more information about console Oops errors, see Understanding an "Oops" purple diagnostic screen (1006802).

Type: Lost Heartbeat

Error example: Lost Heartbeat

Description: the ESX VMkernel and the service console Linux kernel run on ESX at the same time. The service console Linux kernel runs a process called vmnixhbd, which sends a heartbeat to VMkernel as long as it is able to allocate and free a memory page. If no heartbeat is received before the 30-minute timeout, VMkernel triggers a COS panic and displays a purple diagnostic screen reporting a lost heartbeat. For more information about lost heartbeats, see Understanding a "Lost Heartbeat" purple diagnostic screen (1009525).

Type: Assertion

Error example: ASSERT bora/vmkernel/main/pframe_int.h:527

Description: an assertion checks an assumption that the program relies on, so an assertion failure means that one of those assumptions did not hold. This type of purple screen error is mainly caused by software errors. For more information about assertion error messages, see Understanding ASSERT and NOT_IMPLEMENTED purple diagnostic screens (1019956).

Type: Not Implemented

Error example:

NOT_IMPLEMENTED /build/mts/release/bora-84374/bora/vmkernel/main/util.c:83

Description: a Not Implemented error message occurs when the code encounters a condition that it was not designed to handle. For more information, see Understanding ASSERT and NOT_IMPLEMENTED purple diagnostic screens (1019956).

Type: Spin count exceeded / possible deadlock

Error example: Spin count exceeded (iplLock) - possible deadlock

Description: a VMware ESX host may report on the purple diagnostic screen that the spin count was exceeded and that a deadlock is possible when a thread tries to enter a critical section of code. To enter the critical section, the thread must acquire a spin lock, so it polls the lock repeatedly before executing the code. There is a limit on the number of times the thread may poll; when that limit is exceeded, this error is reported. For more information about spin count exceeded errors, see Understanding a "Spin count exceeded" purple diagnostic screen (1020105).

Type: Failed to ack TLB invalidate

Error example: PCPU 1 locked up. Failed to ack TLB invalidate.

Description: a physical CPU failed to acknowledge a request to invalidate (flush) its TLB. For more information, see Understanding a Failed to ack TLB invalidate purple diagnostic screen (1020214).

The purple diagnostic screen can also appear in the form of an exception. An exception handler is a hardware mechanism designed to handle conditions that change the normal flow of execution (division by zero, page faults, and so on). The handler has no trace mechanism, so you need to use logging (or single-step debugging) to determine whether the handler itself is at fault. The following is a list of common exceptions:

Type: Exception 13 (general protection fault)

Error example: #GP Exception (13) in world 4130:helper13-0 @ 0x41803399e303

Description: a general protection fault (exception 13) occurs in either of the following situations: the page being requested does not belong to the program that requested it (it is not mapped into the program's memory), or the program does not have permission to read or write the page. For more information about exception 13 or page faults, see Understanding Exception 13 and Exception 14 purple diagnostic screen events (1020181).

Type: Exception 14 (page fault)

Error example: #PF Exception type 14 in world 136:helper0-0 @ 0x4a8e6e

Description: a page fault (exception 14) occurs when the requested page has not been successfully loaded into memory. For more information about exception 14 or page faults, see Understanding Exception 13 and Exception 14 purple diagnostic screen events (1020181).

Type: Exception 18 (machine check exception)

Error example: Machine Check Exception: Unable to continue

Error example: Hardware (Machine) Error

Description: machine check exceptions (MCE) are generated by the hardware and reported by the host. Consult your hardware vendor when an MCE occurs. By evaluating the information displayed, you can determine which individual component reported the error. For more information about MCE, see Decoding Machine Check Exception (MCE) output after a purple screen error (1005184).
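The two lists above can be condensed into a small lookup that maps a captured error message to its type. This is a sketch only: the patterns are built from the example messages in this article, so real messages with different wording may not match, and the names PATTERNS and classify are hypothetical.

```python
import re

# (regular expression, error type) pairs derived from the examples above.
PATTERNS = [
    (r"COS Error: Oops", "Console Oops"),
    (r"Lost Heartbeat", "Lost Heartbeat"),
    (r"^ASSERT\b", "Assertion"),
    (r"^NOT_IMPLEMENTED\b", "Not Implemented"),
    (r"Spin count exceeded", "Spin count exceeded / possible deadlock"),
    (r"Failed to ack TLB invalidate", "Failed to ack TLB invalidate"),
    (r"#\s*GP Exception", "Exception 13 (general protection fault)"),
    (r"#\s*PF Exception", "Exception 14 (page fault)"),
    (r"Machine Check Exception|Hardware \(Machine\) Error",
     "Exception 18 (machine check exception)"),
]


def classify(message):
    """Return the error type for a purple screen message, or 'unrecognized'."""
    text = message.strip()
    for pattern, kind in PATTERNS:
        if re.search(pattern, text):
            return kind
    return "unrecognized"


if __name__ == "__main__":
    print(classify("PCPU 1 locked up. Failed to ack TLB invalidate."))
    print(classify("Spin count exceeded (iplLock) - possible deadlock"))
```

An "unrecognized" result simply means the message is not one of the examples covered here; it says nothing about severity.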

Four: analyze multiple errors of the same host

When several purple diagnostic screens have appeared on the same ESXi host, you can compare them to determine whether the problem is related to hardware or software. To do this, look for patterns in the following parts of the purple diagnostic screens:

Error messages and stack traces:

If the error message and stack vary widely across multiple VMkernel errors, the software is not always failing with the same error. Although this is not conclusive, it is likely to mean a hardware problem.

If the error message and stack are always the same across multiple VMkernel errors, the software is always failing with the same error. Although this is not conclusive, it is likely to mean a software problem. For more information about the error messages that appear, see the list of error messages in section Three.

Physical CPU:

If the physical CPU value is always the same across multiple VMkernel errors, the software is always failing on the same physical CPU. Although this is not conclusive, it is likely to mean a problem with that CPU. For more information, see KB1003560.

World:

If the world is always the same across multiple VMkernel errors, the error occurs whenever VMkernel receives instructions from that world. Although this is not conclusive, it probably means that the world sending the instructions is triggering the VMkernel error.
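The comparison described in this section can be sketched as a few set operations over recorded incidents. The Psod record and recurring_patterns function are hypothetical, and the thresholds (identical every time versus different every time) are a simplifying assumption; the hints are deliberately worded as possibilities because none of these patterns is conclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Psod:
    message: str   # error message line
    stack: tuple   # symbol names from the stack trace
    pcpu: int      # physical CPU that was running at the time
    world: str     # world running on that CPU


def recurring_patterns(incidents):
    """Return non-conclusive hints based on what repeats across incidents."""
    hints = []
    if len(incidents) < 2:
        return hints  # nothing to compare
    signatures = {(p.message, p.stack) for p in incidents}
    if len(signatures) == 1:
        hints.append("same error and stack every time: possibly software")
    elif len(signatures) == len(incidents):
        hints.append("error and stack differ every time: possibly hardware")
    if len({p.pcpu for p in incidents}) == 1:
        hints.append(f"always PCPU {incidents[0].pcpu}: possibly that CPU")
    if len({p.world for p in incidents}) == 1:
        hints.append(f"always world {incidents[0].world}: possibly that world")
    return hints


if __name__ == "__main__":
    first = Psod("Failed to ack TLB invalidate", ("Panic", "TLBDoInvalidate"), 0, "helper1-4")
    second = Psod("Failed to ack TLB invalidate", ("Panic", "TLBDoInvalidate"), 0, "helper1-4")
    for hint in recurring_patterns([first, second]):
        print(hint)
```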

Five: exception list reference

Exception type 0 #DE: divide error

Exception type 1 #DB: debug exception

Exception type 2 NMI: non-maskable interrupt

Exception type 3 #BP: breakpoint exception

Exception type 4 #OF: overflow (INTO instruction)

Exception type 5 #BR: bounds check (BOUND instruction)

Exception type 6 #UD: invalid opcode

Exception type 7 #NM: coprocessor not available

Exception type 8 #DF: double fault

Exception type 10 #TS: invalid TSS

Exception type 11 #NP: segment not present

Exception type 12 #SS: stack segment fault

Exception type 13 #GP: general protection fault

Exception type 14 #PF: page fault

Exception type 16 #MF: coprocessor error

Exception type 17 #AC: alignment check

Exception type 18 #MC: machine check exception

Exception type 19 #XF: SIMD floating point exception

Exception types 20-31: reserved

Exception types 32-255: user-defined (clock scheduler)
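The exception list can be combined with a parser for exception messages such as the #GP and #PF examples in section Three. The following sketch is based only on those example formats; VECTORS, describe_vector and parse_exception are hypothetical names.

```python
import re

# Vector numbers and mnemonics from the list above.
VECTORS = {
    0: "#DE divide error", 1: "#DB debug exception",
    2: "NMI non-maskable interrupt", 3: "#BP breakpoint exception",
    4: "#OF overflow", 5: "#BR bounds check", 6: "#UD invalid opcode",
    7: "#NM coprocessor not available", 8: "#DF double fault",
    10: "#TS invalid TSS", 11: "#NP segment not present",
    12: "#SS stack segment fault", 13: "#GP general protection fault",
    14: "#PF page fault", 16: "#MF coprocessor error",
    17: "#AC alignment check", 18: "#MC machine check exception",
    19: "#XF SIMD floating point exception",
}

EXCEPTION_RE = re.compile(
    r"#?\s*(\w+) Exception\s*(?:type\s*)?\(?(\d+)\)?\s+in world (\d+):(\S+)"
    r"(?:\s+@\s+(0x[0-9a-fA-F]+))?"
)


def describe_vector(number):
    """Return the description of an exception type number."""
    if number in VECTORS:
        return VECTORS[number]
    if 20 <= number <= 31:
        return "reserved"
    if 32 <= number <= 255:
        return "user-defined"
    return "unknown"  # numbers not covered by the list above


def parse_exception(line):
    """Extract vector, world and address from an exception message."""
    match = EXCEPTION_RE.search(line)
    if match is None:
        return None
    mnemonic, vector, world_id, world_name, address = match.groups()
    return {
        "mnemonic": mnemonic,
        "vector": int(vector),
        "description": describe_vector(int(vector)),
        "world_id": int(world_id),
        "world_name": world_name,
        "address": int(address, 16) if address else None,
    }


if __name__ == "__main__":
    print(parse_exception("#PF Exception type 14 in world 136:helper0-0 @ 0x4a8e6e"))
```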

Six: example

In a real environment we encountered the following purple screen. From the information on the screen we can tell that the failing ESXi host was running ESXi 6.0 U2 (build 3620759) and had been running normally for 35 days, 18 hours and 32 minutes since the last boot.

The key error information on the purple screen is PF Exception 14 in world 33168:memMapKernal. Searching the VMware KB for this information finds the following articles:

https://kb.vmware.com/s/article/1020181?lang=zh_CN#q=esxi%E7%B4%AB%E5%B1%8F

https://kb.vmware.com/s/article/2071752?lang=zh_CN#q=esxi%E7%B4%AB%E5%B1%8F

According to the KB:

A page fault (exception 14) occurs if the requested page has not been successfully loaded into memory. There are both normal and abnormal page faults:

A normal page fault causes the page to be loaded from swap into physical memory. The program can then continue execution once the data has been correctly loaded into physical memory.

An abnormal page fault occurs if the page is not in memory and the operating system cannot load it from swap into physical memory.

Combined with the MemMapKernal field, we can tentatively conclude that this purple screen was caused by a memory fault on the ESXi host. Possible causes are a failure while loading memory, a memory overflow, or, in this example, the virtual memory sharing mechanism used by Horizon View.

That concludes this overview of how to analyze an ESXi host purple screen. I hope the content above is of some help to you.
