How to analyze VMware ESXi downtime

2026-02-08 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/01 Report--

Analyzing a VMware ESXi outage can be daunting if you have not done it before. This article walks through one real case: how the logs and the core dump were located, what they showed, and which fix was suggested.

Recently, a failure of a video conferencing system was traced to an outage of the ESXi host it ran on. The analysis went as follows:

The environment is ESXi 6.0, managed by vCenter 6.7 U1. After the outage, the host was restarted before anyone took a photo of the screen.

1. Collecting the ESXi system logs through vCenter returned only the current logs. Nothing from before the outage was included.

2. Connect to the ESXi host over SSH. cd /var/log shows no compressed history log files. However, vmksummary.log shows when the host came back up, along with a note that a core dump was found, which roughly brackets the time of the failure.

2019-04-16T19:54:13Z bootstop: Host has booted

2019-04-16T19:54:13Z bootstop: partition core dump found
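The check in step 2 can be sketched in Python. This is only a minimal illustration, assuming vmksummary.log entries have the shape shown above (timestamp, the word bootstop:, then a message). The function names parse_bootstop_events and core_dump_boots are made up for this example.

```python
from datetime import datetime, timezone


def parse_bootstop_events(lines):
    """Return (timestamp, message) pairs for 'bootstop:' entries."""
    events = []
    for line in lines:
        # Assumed shape: "<timestamp>Z bootstop: <message>"
        parts = line.strip().split(" ", 2)
        if len(parts) < 3 or parts[1] != "bootstop:":
            continue
        try:
            ts = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%SZ")
        except ValueError:
            continue  # skip lines whose timestamp has another format
        events.append((ts.replace(tzinfo=timezone.utc), parts[2]))
    return events


def core_dump_boots(events):
    """Timestamps of boots whose summary mentions a core dump."""
    return [ts for ts, msg in events if "core dump found" in msg]


# The two lines quoted from vmksummary.log above
sample = [
    "2019-04-16T19:54:13Z bootstop: Host has booted",
    "2019-04-16T19:54:13Z bootstop: partition core dump found",
]
events = parse_bootstop_events(sample)
print(core_dump_boots(events))
```

A boot that reports a core dump means the failure happened at some point before that timestamp, so it gives an upper bound when searching the older logs.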

3. cd /scratch shows a log folder, and ls log lists a large number of compressed history log files.

It turns out that the logs had been redirected here.

4. Reading through the history log files around the time of the failure turned up nothing useful.
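Searching the compressed history logs by time, as in step 4, can be sketched as follows. This is an illustration under stated assumptions: the rotated files are gzip-compressed text, and each entry starts with an ISO 8601 timestamp as in the excerpts in this article. The helper names, the file name vmkernel.0.gz, and the demo log lines are placeholders, not real host output.

```python
import gzip
import tempfile
from pathlib import Path


def looks_like_timestamp(stamp):
    """True for a 19-character prefix shaped like 2019-04-15T11:31:36."""
    return len(stamp) == 19 and stamp[:4].isdigit() and stamp[10] == "T"


def grep_rotated_logs(log_dir, start, end, pattern="*.gz"):
    """Return (file name, line) pairs whose timestamp lies in [start, end]."""
    results = []
    for path in sorted(Path(log_dir).glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
            for line in fh:
                stamp = line[:19]
                # Equal-length ISO 8601 strings sort in chronological order,
                # so plain string comparison is enough here.
                if looks_like_timestamp(stamp) and start <= stamp <= end:
                    results.append((path.name, line.rstrip("\n")))
    return results


# Demo on a temporary directory with placeholder entries
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "vmkernel.0.gz"
    with gzip.open(sample, "wt", encoding="utf-8") as fh:
        fh.write("2019-04-15T11:20:00.000Z placeholder entry, too early\n")
        fh.write("2019-04-15T11:31:30.000Z placeholder entry, in window\n")
    hits = grep_rotated_logs(tmp, "2019-04-15T11:30:00", "2019-04-15T11:32:00")
    print(hits)
```

On the host itself, the directory argument would be the redirected log folder found in step 3.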

5. ls /scratch/core shows a dump file named vmkernel-zdump.1, which basically confirms that the host hit a purple screen (PSOD) before it went down.

6. How can this file be opened and analyzed? VMware has a KB article covering this. Following the KB, run:

vmkdump_extract -l vmkernel-zdump.1

This creates a vmkernel.log.1 file that can be viewed with cat, vi, or any other text tool:

2019-04-15T11:31:36.550Z cpu30:32805) WARNING: Heartbeat: 781: PCPU 26 didn't have a heartbeat for 21 seconds; may be locked up.

2019-04-15T11:31:36.550Z cpu26:33339) ALERT: NMI: NMI IPI recvd. We Halt. eip(base):ebp:cs [0x3080cd(0x41800d800000):0x1:0x4010](Src0x1, CPU26)

2019-04-15T11:31:36.550Z cpu30:32805) World: 9729: PRDA 0x418047800000 ss 0x0 ds 0x10b es 0x10b fs 0x10b gs 0x0

2019-04-15T11:31:36.550Z cpu30:32805) World: 9731: TR 0x4020 GDT 0x4392ef421000 (0x402f) IDT 0x41800d8c9000 (0xfff)

2019-04-15T11:31:36.550Z cpu26:33339) 0x4390d1d9b560: [0x41800db080cd] MemNode_NUMANodeMask2MemNodeMask@vmkernel#nover+0x25 stack: 0x1

2019-04-15T11:31:36.550Z cpu30:32805) World: 9732: CR0 0x80010031 CR3 0x6c4ed1000 CR4 0x42768

2019-04-15T11:31:36.550Z cpu26:33339) 0x4390d1d9b580: [0x41800db45622] MemDistributeNUMAPolicy@vmkernel#nover+0x27a stack: 0x0

2019-04-15T11:31:36.550Z cpu26:33339) 0x4390d1d9b6c0: [0x41800db4616d] MemDistribute_Alloc@vmkernel#nover+0x299 stack: 0xe59bb55

2019-04-15T11:31:36.550Z cpu26:33339) 0x4390d1d9b820: [0x41800d8181f0] PagePool_AllocCustom@vmkernel#nover+0x2f0 stack: 0x4390d1d9bac0

2019-04-15T11:31:36.550Z cpu26:33339) 0x4390d1d9b8e0: [0x41800d820c04] vmk_MemPoolAlloc@vmkernel#nover+0x37c stack: 0x41800dfad8b1

2019-04-15T11:31:36.550Z cpu26:33339) 0x4390d1d9bd90: [0x41800dfad8b1] fusion_get_seq_num@#+0xd9 stack: 0x43034ef4cc40

2019-04-15T11:31:36.550Z cpu26:33339) 0x4390d1d9bea0: [0x41800dfa2adb] megasas_hotplug_work@#+0x16b stack: 0x0

2019-04-15T11:31:36.550Z cpu26:33339) 0x4390d1d9bf20: [0x41800d82245f] VmkTimerQueueWorldFunc@vmkernel#nover+0x21f stack: 0x0

2019-04-15T11:31:36.550Z cpu26:33339) 0x4390d1d9bfd0: [0x41800da13dae] CpuSched_StartWorld@vmkernel#nover+0xa2 stack: 0x0

2019-04-15T11:31:36.600Z cpu30:32805) Panic: 798: Saved backtrace: pcpu 26 Heartbeat NMI

2019-04-15T11:31:36.600Z cpu30:32805) pcpu 26 Heartbeat NMI: 0x4390d1d9b560: [0x41800db080cd] MemNode_NUMANodeMask2MemNodeMask@vmkernel#nov

2019-04-15T11:31:36.600Z cpu30:32805) pcpu 26 Heartbeat NMI: 0x4390d1d9b580: [0x41800db45622] MemDistributeNUMAPolicy@vmkernel#nover+0x27a

2019-04-15T11:31:36.600Z cpu30:32805) pcpu 26 Heartbeat NMI: 0x4390d1d9b6c0: [0x41800db4616d] MemDistribute_Alloc@vmkernel#nover+0x299 stac

2019-04-15T11:31:36.600Z cpu30:32805) pcpu 26 Heartbeat NMI: 0x4390d1d9b820: [0x41800d8181f0] PagePool_AllocCustom@vmkernel#nover+0x2f0 sta

2019-04-15T11:31:36.600Z cpu30:32805) pcpu 26 Heartbeat NMI: 0x4390d1d9b8e0: [0x41800d820c04] vmk_MemPoolAlloc@vmkernel#nover+0x37c stack:

2019-04-15T11:31:36.600Z cpu30:32805) pcpu 26 Heartbeat NMI: 0x4390d1d9bd90: [0x41800dfad8b1] fusion_get_seq_num@#+0xd9 stack:

2019-04-15T11:31:36.600Z cpu30:32805) pcpu 26 Heartbeat NMI: 0x4390d1d9bea0: [0x41800dfa2adb] megasas_hotplug_work@#+0x16b stac

2019-04-15T11:31:36.600Z cpu30:32805) pcpu 26 Heartbeat NMI: 0x4390d1d9bf20: [0x41800d82245f] VmkTimerQueueWorldFunc@vmkernel#nover+0x21f s

2019-04-15T11:31:36.600Z cpu30:32805) pcpu 26 Heartbeat NMI: 0x4390d1d9bfd0: [0x41800da13dae] CpuSched_StartWorld@vmkernel#nover+0xa2 stack

2019-04-15T11:31:36.623Z cpu30:32805) VMware ESXi 6.0.0 [Releasebuild-3073146 x86_64]

PCPU 26: no heartbeat (2/2 IPIs received)

This basically confirms the immediate cause of the outage: the ESXi host lost contact with one CPU (PCPU 26 had no heartbeat for 21 seconds).
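Reading the heartbeat warning and the backtrace out of the extracted log can be sketched as below. This is a rough illustration that assumes the line layout seen in the excerpt above. The regular expressions and function names are written for this example and are not part of any VMware tool.

```python
import re

HEARTBEAT_RE = re.compile(r"PCPU (\d+) didn't have a heartbeat for (\d+) seconds")
# Matches frames such as:
#   cpu26:33339) 0x4390d1d9b560: [0x41800db080cd] Symbol@module#version+0x25
FRAME_RE = re.compile(
    r"cpu(\d+):\d+\)\s*0x[0-9a-f]+:\s*\[0x[0-9a-f]+\]\s*"
    r"(\w+)@([^#\s]*)#\S*?\+0x[0-9a-f]+"
)


def stalled_pcpu(lines):
    """Return (pcpu, seconds) from the first heartbeat warning, or None."""
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


def backtrace_for_cpu(lines, cpu):
    """Return (symbol, module) pairs logged by the given CPU, in log order."""
    frames = []
    for line in lines:
        match = FRAME_RE.search(line)
        if match and int(match.group(1)) == cpu:
            frames.append((match.group(2), match.group(3)))
    return frames


# Lines copied from the excerpt above
sample = [
    "2019-04-15T11:31:36.550Z cpu30:32805) WARNING: Heartbeat: 781: PCPU 26 didn't have a heartbeat for 21 seconds; may be locked up.",
    "2019-04-15T11:31:36.550Z cpu26:33339) 0x4390d1d9b560: [0x41800db080cd] MemNode_NUMANodeMask2MemNodeMask@vmkernel#nover+0x25 stack: 0x1",
    "2019-04-15T11:31:36.550Z cpu26:33339) 0x4390d1d9bd90: [0x41800dfad8b1] fusion_get_seq_num@#+0xd9 stack: 0x43034ef4cc40",
    "2019-04-15T11:31:36.550Z cpu26:33339) 0x4390d1d9bea0: [0x41800dfa2adb] megasas_hotplug_work@#+0x16b stack: 0x0",
    "2019-04-15T11:31:36.600Z cpu30:32805) Panic: 798: Saved backtrace: pcpu 26 Heartbeat NMI",
]
pcpu, seconds = stalled_pcpu(sample)
frames = backtrace_for_cpu(sample, pcpu)
outside_vmkernel = [symbol for symbol, module in frames if module != "vmkernel"]
print(pcpu, seconds, outside_vmkernel)
```

Frames that are not attributed to vmkernel (here they appear with an empty module name) can be useful extra keywords when searching for known issues.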

One possible cause is that too many vCPUs have been allocated. Check that the number of vCPUs is less than the number of logical CPUs (LCPUs).
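The vCPU check can be expressed as a small comparison. The sketch below assumes the check is made per VM, and the host size and VM inventory are invented values for illustration only, not figures from the failed host.

```python
def oversized_vms(logical_cpus, vm_vcpus):
    """Names of VMs whose vCPU count is not less than the host's LCPU count."""
    return sorted(name for name, vcpus in vm_vcpus.items() if vcpus >= logical_cpus)


# Hypothetical host and inventory
host_lcpus = 32
vms = {"video-conf": 8, "db01": 16, "big-vm": 32}
flagged = oversized_vms(host_lcpus, vms)
print(flagged)
```

Any VM that shows up in the result would need its vCPU count reduced to pass the check.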

7. Searching for the keywords "PSOD no heartbeat" did not turn up a root cause. Users elsewhere have reported similar problems, and the only suggestion found was to upgrade ESXi.

8. The "Resolved Issues" section of the ESXi 6.0 U2 release notes shows that a "no heartbeat" problem has been fixed:

ESXi host fails with a purple diagnostic screen showing multiple Correctable Machine Check Interrupt (CMCI) messages

The ESXi host may fail with a purple diagnostic screen because multiple CMCIs, logged in the vmkernel.log file, cause the CPU to become unresponsive for a short period. Entries similar to the following appear on the purple diagnostic screen:

PCPU: no heartbeat (2 IPIs received)

0xXXXXXXXXXXXX: [0xXXXXXXXXXXXX] MCEReapMCABanks@vmkernel#nover+0x195

0xXXXXXXXXXXXX: [0xXXXXXXXXXXXX] IRQ_DoInterrupt@vmkernel#nover+0x33e

0xXXXXXXXXXXXX: [0xXXXXXXXXXXXX] BH_DrainAndDisableInterrupts@vmkernel#nover+0xf3
