Analysis and handling of abnormal mcelog load on a production server

2025-02-27 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/02 Report

1. Overview of the problem

The affected servers are HP machines running Nginx, deployed with redundancy. On one of them, mcelog was generating a high load and writing log entries every second, which affected the service running on that server.

tail -f /var/log/mcelog

# Note: the following messages repeat in a continuous loop

Transaction: Memory scrubbing error
MemCtrl: Corrected patrol scrub error
Error overflow
Corrected error

# Other relevant information

CPU16 BANK 9 MCE11
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER MS_CHANNEL1_ERR
Transaction: Memory scrubbing error
MemCtrl: Corrected patrol scrub error

STATUS cc0048c0000800c1 MCGSTATUS 0
MCGCAP 1000812 APICID 8 SOCKETID 0
CPUID Vendor Intel Family 6 Model 45
Hardware event. This is not a software error.
MCE 10
CPU 16 BANK 9
MISC 90011000010008c ADDR 15e0e2000
TIME 1495308194 Sun May 21 03:23:14 2017
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER MS_CHANNEL1_ERR
Transaction: Memory scrubbing error
MemCtrl: Corrected patrol scrub error

STATUS cc0003c0000800c1 MCGSTATUS 0
MCGCAP 1000812 APICID 9 SOCKETID 0
CPUID Vendor Intel Family 6 Model 45
Hardware event. This is not a software error.
MCE 11
CPU 17 BANK 9
MISC 90011000010008c ADDR 15e0f8000
TIME 1495308194 Sun May 21 03:23:14 2017
MCG status:
MCi status:
Error overflow
Corrected error
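To make the repeating output easier to count and compare, the records can be parsed mechanically. The following is a minimal Python sketch, assuming the daemon's text format shown above, where each record starts with an "MCE <n>" line. The class, regex and function names are hypothetical, and only the fields used later in this analysis are extracted.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class McelogRecord:
    """One machine-check record from the mcelog daemon's text log."""
    mce: int                       # index from the "MCE <n>" line
    cpu: Optional[int] = None      # logical CPU that reported the error
    bank: Optional[int] = None     # machine-check bank number
    addr: Optional[int] = None     # physical address from the ADDR field
    time: Optional[int] = None     # Unix timestamp from the TIME line
    status: Optional[int] = None   # raw MCi_STATUS value
    socket: Optional[int] = None   # SOCKETID
    channel: Optional[int] = None  # n from "MS_CHANNELn_ERR"


MCE_START = re.compile(r"^MCE (\d+)$")
CPU_BANK = re.compile(r"^CPU (\d+) BANK (\d+)")
ADDR = re.compile(r"\bADDR ([0-9a-fA-F]+)")
TIME = re.compile(r"^TIME (\d+)")
STATUS = re.compile(r"^STATUS ([0-9a-fA-F]+)")
SOCKET = re.compile(r"\bSOCKETID (\d+)")
CHANNEL = re.compile(r"MS_CHANNEL(\d+)_ERR")


def parse_mcelog(text: str) -> List[McelogRecord]:
    records: List[McelogRecord] = []
    current: Optional[McelogRecord] = None
    for raw in text.splitlines():
        line = raw.strip()
        start = MCE_START.match(line)
        if start:
            current = McelogRecord(mce=int(start.group(1)))
            records.append(current)
            continue
        if current is None:
            continue  # partial record before the first "MCE <n>" header
        m = CPU_BANK.match(line)
        if m:
            current.cpu, current.bank = int(m.group(1)), int(m.group(2))
        m = ADDR.search(line)
        if m:
            current.addr = int(m.group(1), 16)
        m = TIME.match(line)
        if m:
            current.time = int(m.group(1))
        m = STATUS.match(line)
        if m:
            current.status = int(m.group(1), 16)
        m = SOCKET.search(line)
        if m:
            current.socket = int(m.group(1))
        m = CHANNEL.search(line)
        if m:
            current.channel = int(m.group(1))
    return records


# Abridged from the "MCE 10" record in the log above.
SAMPLE = """\
Hardware event. This is not a software error.
MCE 10
CPU 16 BANK 9
MISC 90011000010008c ADDR 15e0e2000
TIME 1495308194 Sun May 21 03:23:14 2017
MCA: MEMORY CONTROLLER MS_CHANNEL1_ERR
STATUS cc0003c0000800c1 MCGSTATUS 0
MCGCAP 1000812 APICID 9 SOCKETID 0
"""

if __name__ == "__main__":
    for rec in parse_mcelog(SAMPLE):
        print(rec.cpu, rec.bank, hex(rec.addr), rec.socket, rec.channel)
        # prints: 16 9 0x15e0e2000 0 1
```

Lines before the first "MCE <n>" header (such as the partial record at the top of the excerpt) are skipped rather than guessed at.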

tail -f /var/log/messages

2. Brief description of mcelog

2.1) What is mcelog?

A tool that checks for and logs hardware errors (machine checks) on x86 Linux systems, especially memory and CPU errors.

2.2) How does mcelog run?

Triggered periodically by cron (less efficient).

As a daemon (the current form on CentOS), logging to /var/log/mcelog by default.

2.3) Installing mcelog

Install it with yum install mcelog, or compile it from source.

3. Problem analysis

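3.1) Error messages

Transaction: Memory scrubbing error
MemCtrl: Corrected patrol scrub error
Error overflow
Corrected error

These messages point to a memory problem: the memory controller is reporting errors found while scrubbing memory, and errors logged by mcelog are generally hardware faults rather than software bugs. The errors were corrected (by ECC), but "Error overflow" means further errors occurred while earlier ones were still unread, and the messages repeat continuously.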
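The STATUS field of each record is the raw MCi_STATUS register value, and its architectural bits support this reading. As a minimal sketch, the Python below decodes only the flag bits 63:57 and the 16-bit MCA error code as laid out in Intel's machine-check architecture; it ignores the model-specific bits 31:16 and all other fields, and the names are hypothetical.

```python
from typing import Dict

# Architectural MCi_STATUS flag bits (Intel SDM, Vol. 3, machine-check architecture).
FLAG_BITS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # overflow: an error occurred while an earlier one was still unread
    "UC": 61,     # uncorrected error; 0 means the hardware corrected it
    "EN": 60,     # error was enabled in MCi_CTL
    "MISCV": 59,  # MCi_MISC holds valid data
    "ADDRV": 58,  # MCi_ADDR holds valid data
    "PCC": 57,    # processor context corrupt
}

# Memory-controller transaction types (MMM field of the compound error code).
MEM_TRANSACTIONS = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_status(status: int) -> Dict[str, object]:
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    mca_code = status & 0xFFFF  # bits 15:0, the architectural MCA error code
    decoded: Dict[str, object] = {"flags": flags, "mca_code": mca_code}
    # Memory-controller errors use the pattern 0000 0000 1MMM CCCC.
    if (mca_code & 0xFF80) == 0x0080:
        decoded["transaction"] = MEM_TRANSACTIONS.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        decoded["channel"] = None if channel == 0xF else channel  # 0xF = not specified
    return decoded
```

For STATUS cc0048c0000800c1 from the log, this reports VAL, OVER, MISCV and ADDRV set, UC clear (a corrected error), transaction "memory scrubbing error" and channel 1, which agrees with the Error overflow, Corrected error and MS_CHANNEL1_ERR lines that mcelog printed.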

3.2) Additional information

A Machine Check Exception (MCE) is a class of hardware error reported by the CPU. Possible causes include:

memory errors, cache faults, CPU faults, and possibly problems with the motherboard or bus.

The errors in the log are reported on CPU16 BANK 9, CPU17 BANK 9, and so on.

Definition of a physical bank:

To keep the CPU working normally, a traditional memory system must deliver the data the CPU needs in a single transfer cycle. The amount of data the CPU can accept per transfer cycle is the width of the CPU data bus, measured in bits. Memory and CPU exchange data through the northbridge chip on the motherboard, and the memory bus is as wide as the CPU data bus. A set of memory that supplies this full bus width is called a physical bank.

I had hoped to use the BANK values together with the log above to work out which slot might be faulty, but have not found a way; hints are welcome. Note that the BANK in mcelog output refers to the CPU's machine-check register bank (the MCi_STATUS, MCi_ADDR and MCi_MISC registers), which the MCA line identifies here as the memory controller. It is not a physical memory bank in the sense defined above, so it does not point to a DIMM slot by itself.
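Even without mapping BANK to a slot, the records narrow the search: every record in the excerpt above that carries a SOCKETID line reports SOCKETID 0, and every record reports MS_CHANNEL1_ERR, whatever the CPU number. A minimal Python sketch that tallies records by (socket, channel), assuming the daemon's text format above where each record starts with an "MCE <n>" line; the function and pattern names are hypothetical.

```python
import re
from collections import Counter

RECORD_START = re.compile(r"^MCE \d+[ \t]*$", re.MULTILINE)
SOCKET = re.compile(r"\bSOCKETID (\d+)")
CHANNEL = re.compile(r"MS_CHANNEL(\d+)_ERR")


def errors_by_socket_channel(text: str) -> Counter:
    """Count memory-controller records per (socket, channel) pair."""
    counts: Counter = Counter()
    # Text before the first "MCE <n>" line is a partial record; skip it.
    for chunk in RECORD_START.split(text)[1:]:
        sock = SOCKET.search(chunk)
        chan = CHANNEL.search(chunk)
        if sock and chan:
            counts[(int(sock.group(1)), int(chan.group(1)))] += 1
    return counts
```

If one (socket, channel) pair dominates, the DIMMs on that channel are the first candidates for replacement. Which physical slots belong to which channel is board-specific, so check the HP documentation for the server model.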

3.3) Check the server indicator lights

They were normal. This was surprising, but if the problem has only just appeared, the indicators may not show a fault immediately.

3.4) Consult peers

Their advice: this is generally a hardware problem; back up the data and replace the memory.

4. Handling sequence (renzhiyuan.blog.51cto.com)

4.1) Migrate the service smoothly to another server so that it keeps running normally.

4.2) Back up the data and make sure the backup is usable.

4.3) Do not reboot. First try dropping the page cache, dentries and inodes to rule out a caching problem.
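On Linux, clearing the page cache, dentries and inodes is usually done as root by writing to /proc/sys/vm/drop_caches, the same as running "sync; echo 3 > /proc/sys/vm/drop_caches" in a shell. A minimal Python sketch; the function name is hypothetical, and the path parameter exists only so it can be pointed at a scratch file for testing.

```python
import os

DROP_CACHES = "/proc/sys/vm/drop_caches"


def drop_caches(level: int = 3, path: str = DROP_CACHES) -> None:
    """Free clean caches: 1 = page cache, 2 = dentries and inodes, 3 = both.

    Writing to the real /proc file requires root.
    """
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2 or 3")
    os.sync()  # flush dirty pages first; only clean cache can be dropped
    with open(path, "w") as f:
        f.write(f"{level}\n")
```

Dropping caches only releases clean page cache, dentries and inodes; it does not touch the DIMMs, so if the memory itself is faulty the corrected errors will keep appearing.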

4.4) If the load stays high, consider stopping the mcelog service.
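To judge whether the load is "high", one rough measure is how many records share the same TIME second (the excerpt above already shows two records stamped 1495308194). A minimal Python sketch, assuming the text log format above; the function names and the threshold are hypothetical, not mcelog settings. On systemd-based systems the service itself is usually stopped with "systemctl stop mcelog".

```python
import re
from collections import Counter

TIME_LINE = re.compile(r"^[ \t]*TIME (\d+)", re.MULTILINE)


def peak_errors_per_second(text: str) -> int:
    """Largest number of records stamped with the same TIME second."""
    per_second = Counter(TIME_LINE.findall(text))
    return max(per_second.values(), default=0)


def load_looks_high(text: str, threshold: int = 10) -> bool:
    # threshold is a hypothetical cut-off; choose one that fits your service
    return peak_errors_per_second(text) >= threshold
```

Stopping mcelog only removes the logging load; it does not correct the underlying hardware fault.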

4.5) HP servers have built-in hardware diagnostics (for example, the Integrated Management Log available through iLO); check these first.

4.6) Prepare memory modules of the same specification and try replacing them. It is best not to move any existing module to a different slot. Servers usually do not have many modules, so swapping them is feasible; if you can determine which slot is faulty, replace that module first.

4.7) If replacing the memory does not help, consider other hardware faults and arrange for repair.

4.8) Record all of the progress and results above, and report them to your manager promptly.

© 2024 shulou.com SLNews company. All rights reserved.