Linux Memory Error Diagnosis

my-show-time 2021-01-14 21:16

Some concepts first

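DRAM (Dynamic Random Access Memory) is the most common type of system memory. ECC is short for "Error Checking and Correcting"; ECC memory is a memory module that implements error checking and correction. EDAC stands for Error Detection And Correction, the Linux kernel subsystem that reports such errors.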

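Memory errors come in two types, CE and UE. CE is short for Correctable Error and UE for Uncorrectable Error. A CE is recovered by the hardware and does not affect normal operation for the time being; the module can be replaced at the next convenient downtime. A UE cannot be recovered and usually brings the system down.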

System messages log

[root@my-host mg4a]# grep kernel /var/log/messages
Jan 14 19:01:11 my-host kernel: mce: [Hardware Error]: Machine check events logged
Jan 14 19:01:12 my-host kernel: EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#1_Chan#1_DIMM#0 (channel:5 slot:0 page:0x554c02 offset:0x3c0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:0 ha:1 channel_mask:2 rank:0)
[root@my-host mg4a]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch5_ce_count:1
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch5_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow0/ch5_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow0/ch5_ce_count:0
[root@my-host mg4a]# dmidecode -t 1
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.

Handle 0x0044, DMI type 1, 27 bytes
System Information
Manufacturer: LENOVO
Product Name: Lenovo System x3750 M4 -[8753IH5]-
Version: 03
Serial Number: 06FF367
UUID: C4EF8080-7926-11E5-8B14-6C0B849B418E
Wake-up Type: Other
SKU Number: XxXxXxX
Family: System X

The messages log from another machine

Jun 27 13:53:25 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 27 13:53:25 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8de3b1960
Jun 27 13:53:25 irora30 kernel: EDAC MC2: CE page 0x8de3b1, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080a13
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008de3b1960
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Jun 27 14:19:27 irora30 auditd[5571]: Audit daemon rotating log files
Jun 27 19:09:23 irora30 auditd[5571]: Audit daemon rotating log files
Jun 27 23:59:21 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 28 02:15:55 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8d9ea5960
Jun 28 02:15:55 irora30 kernel: EDAC MC2: CE page 0x8d9ea5, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080813
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008d9ea5960
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 28 03:08:25 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8ded39960
Jun 28 03:08:25 irora30 kernel: EDAC MC2: CE page 0x8ded39, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080813
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008ded39960
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
Jun 28 03:45:13 irora30 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-to-date.
Jun 28 04:44:25 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 09:34:22 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 10:02:30 irora30 ansible-command: Invoked with warn=True executable=None _uses_shell=True _raw_params=df -hl /var|awk 'NR>1 && int($5) > 80' removes=None creates=None chdir=None
Jun 28 14:23:49 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 19:09:25 irora30 auditd[5571]: Audit daemon rotating log files
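
In the amd64_edac lines above, the page and offset values combine into the reported ERROR_ADDRESS: page 0x8de3b1 shifted left by 12 bits plus offset 0x960 gives 0x8de3b1960. The sketch below reproduces that arithmetic; the 4 KiB page size, the function names and the regular expression are assumptions for illustration, based only on the lines shown here.

```python
import re

PAGE_SHIFT = 12  # assumption: 4 KiB pages, which matches the addresses in the log

# Assumed pattern for the "EDAC MCn: CE page ..." line format shown above.
CE_LINE = re.compile(
    r"EDAC MC(\d+): CE page (0x[0-9a-f]+), offset (0x[0-9a-f]+), "
    r".*row (\d+), channel (\d+)"
)


def physical_address(page, offset):
    """Combine a page number and an in-page offset into one address."""
    return (page << PAGE_SHIFT) | offset


def parse_amd64_ce(line):
    """Return controller, row, channel and address, or None if no match."""
    match = CE_LINE.search(line)
    if match is None:
        return None
    mc, page, offset, row, channel = match.groups()
    return {
        "mc": int(mc),
        "row": int(row),
        "channel": int(channel),
        "address": physical_address(int(page, 16), int(offset, 16)),
    }


SAMPLE = ('Jun 27 13:53:25 irora30 kernel: EDAC MC2: CE page 0x8de3b1, '
          'offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, '
          'label "": amd64_edac')
```

For `SAMPLE`, the parsed controller, row and channel are 2, 5 and 0, the same position that the repeated errors in this log point to.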

Confirming the fault and locating the faulty memory slot

[root@irora30 ~]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count

/sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow5/ch0_ce_count:294
/sys/devices/system/edac/mc/mc3/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow5/ch0_ce_count:0
[root@irora30 ~]#

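  • count: a non-zero value means memory errors have been recorded.
  • mc: the memory controller index (on this machine, one per CPU node).
  • csrow: the chip-select row within that controller, which the author uses to narrow down the slot.
  • ch*: the channel within that row.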
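
The grep above can also be done programmatically. The following is a minimal sketch, assuming the same `mc*/csrow*/ch*_ce_count` sysfs layout shown in this article; the function name `nonzero_ce_counts` and its `root` parameter are hypothetical, and `root` exists mainly so the function can be pointed at a copy of the tree.

```python
import glob
import os
import re

# Matches the tail of paths such as .../mc2/csrow5/ch0_ce_count
COUNT_FILE = re.compile(r"mc(\d+)/csrow(\d+)/ch(\d+)_ce_count$")


def nonzero_ce_counts(root="/sys/devices/system/edac/mc"):
    """Return (mc, csrow, channel, count) for every counter above zero."""
    hits = []
    pattern = os.path.join(root, "mc*", "csrow*", "ch*_ce_count")
    for path in sorted(glob.glob(pattern)):
        match = COUNT_FILE.search(path.replace(os.sep, "/"))
        if match is None:
            continue
        with open(path) as handle:
            count = int(handle.read().strip())
        if count > 0:
            mc, csrow, channel = (int(part) for part in match.groups())
            hits.append((mc, csrow, channel, count))
    return hits
```

On the machine above, calling `nonzero_ce_counts()` would be expected to report only the mc2/csrow5/ch0 counter. On kernels that do not expose this per-csrow layout the function simply returns an empty list.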

Installed memory

Memory Component    Status
Proc 1 DIMM 1A     16384 MB 1333 MHz
Proc 1 DIMM 2I     Not installed Not installed
Proc 1 DIMM 3E     Not installed Not installed
Proc 1 DIMM 4C     Not installed Not installed
Proc 1 DIMM 5K     Not installed Not installed
Proc 1 DIMM 6G     Not installed Not installed
Proc 1 DIMM 7B     16384 MB 1333 MHz
Proc 1 DIMM 8J     Not installed Not installed
Proc 1 DIMM 9F     Not installed Not installed
Proc 1 DIMM 10D     Not installed Not installed
Proc 1 DIMM 11L     Not installed Not installed
Proc 1 DIMM 12H     Not installed Not installed
Proc 2 DIMM 1A     16384 MB 1333 MHz
Proc 2 DIMM 2I     Not installed Not installed
Proc 2 DIMM 3E     Not installed Not installed
Proc 2 DIMM 4C     Not installed Not installed
Proc 2 DIMM 5K     Not installed Not installed
Proc 2 DIMM 6G     Not installed Not installed
Proc 2 DIMM 7B     16384 MB 1333 MHz
Proc 2 DIMM 8J     Not installed Not installed
Proc 2 DIMM 9F     Not installed Not installed
Proc 2 DIMM 10D     Not installed Not installed
Proc 2 DIMM 11L     Not installed Not installed
Proc 2 DIMM 12H     Not installed Not installed
Proc 3 DIMM 1A     16384 MB 1333 MHz
Proc 3 DIMM 2I     Not installed Not installed
Proc 3 DIMM 3E     Not installed Not installed
Proc 3 DIMM 4C     Not installed Not installed
Proc 3 DIMM 5K     Not installed Not installed
Proc 3 DIMM 6G     Not installed Not installed
Proc 3 DIMM 7B     16384 MB 1333 MHz
Proc 3 DIMM 8J     Not installed Not installed
Proc 3 DIMM 9F     Not installed Not installed
Proc 3 DIMM 10D     Not installed Not installed
Proc 3 DIMM 11L     Not installed Not installed
Proc 3 DIMM 12H     Not installed Not installed
Proc 4 DIMM 1A     16384 MB 1333 MHz
Proc 4 DIMM 2I     Not installed Not installed
Proc 4 DIMM 3E     Not installed Not installed
Proc 4 DIMM 4C     Not installed Not installed
Proc 4 DIMM 5K     Not installed Not installed
Proc 4 DIMM 6G     Not installed Not installed
Proc 4 DIMM 7B     16384 MB 1333 MHz
Proc 4 DIMM 8J     Not installed Not installed
Proc 4 DIMM 9F     Not installed Not installed
Proc 4 DIMM 10D     Not installed Not installed
Proc 4 DIMM 11L     Not installed Not installed
Proc 4 DIMM 12H     Not installed Not installed

Using the EDAC tools to detect server memory faults

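With the spread of virtualization and in-memory stores such as Redis and BDB, more and more servers are configured with large amounts of memory. Take the DELL R620: with two CPUs installed, its 24 memory slots support up to 768GB. Fault detection for memory with error correction, such as ECC and REG modules, is a real headache: a faulty module may keep running for months or even years, but with bad luck it can fail at any moment. Fortunately Linux offers edac-utils, a diagnostic tool for memory error correction that can check a server for latent memory faults.
The following uses CentOS as an example to show how to use edac-utils.
Before using edac-utils you need to understand the server's hardware layout. Take the DELL R620 as an example (other models such as the HP DL360P G8 and IBM X3650 M4 also use E5-2600 series CPUs and the C600 series chipset, so they are broadly similar). The mapping between its CPU memory controllers, channels and memory slots is as follows.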

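Processor 0 (one memory controller)
Channel 0: memory slots A1, A5 and A9
Channel 1: memory slots A2, A6 and A10
Channel 2: memory slots A3, A7 and A11
Channel 3: memory slots A4, A8 and A12

Processor 1 (one memory controller)
Channel 0: memory slots B1, B5 and B9
Channel 1: memory slots B2, B6 and B10
Channel 2: memory slots B3, B7 and B11
Channel 3: memory slots B4, B8 and B12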
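
The layout above follows a simple rule: within a processor, the slot number is the DIMM position times four, plus the channel, plus one. A small sketch of that rule follows; the function name `slot_label` is hypothetical, and the rule is inferred only from the R620 layout listed here, so other boards may number their slots differently.

```python
CHANNELS_PER_CPU = 4   # channels 0..3 per processor, as listed above
DIMMS_PER_CHANNEL = 3  # three slots on each channel, as listed above
CPU_LETTERS = "AB"     # processor 0 uses A slots, processor 1 uses B slots


def slot_label(cpu, channel, dimm):
    """Map (cpu, channel, dimm position) to a slot label such as 'A5'."""
    if cpu not in range(len(CPU_LETTERS)):
        raise ValueError("cpu out of range")
    if channel not in range(CHANNELS_PER_CPU):
        raise ValueError("channel out of range")
    if dimm not in range(DIMMS_PER_CHANNEL):
        raise ValueError("dimm out of range")
    number = dimm * CHANNELS_PER_CPU + channel + 1
    return f"{CPU_LETTERS[cpu]}{number}"
```

For example, `slot_label(0, 0, 1)` gives A5, the second slot on channel 0 of processor 0, and `slot_label(1, 3, 2)` gives B12.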

1. Install the edac-utils tool

yum install -y libsysfs edac-utils

2. Run the check command; the correction report looks like this

edac-util -v
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: A1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: A2
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: A3
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: A4
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: A5
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: A6
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: A7
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: A8
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: A9
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: A10
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: A11
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: A12

mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: B1
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: B2
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: B3
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: B4
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: B5
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: B6
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: B7
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: B8
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#0_DIMM#2: B9
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#1_DIMM#2: B10
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#2_DIMM#2: B11
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#3_DIMM#2: B12

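Where:

mc0 denotes memory controller 0;
CPU_SrcID#0 denotes source CPU 0;
Chan#0 (Channel#0 in some listings) denotes channel 0;
DIMM#0 denotes memory slot 0;
Corrected Errors is the number of errors corrected so far;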

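Using the CPU channel to memory slot mapping listed earlier, each entry returned by edac-utils can be labeled with its slot.
In the original run this gave 6312 corrected errors on slot A1, 6459 on slot B1 and 535 on slot B3: three modules with latent faults. The next step is to contact the vendor for replacement.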
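
Picking out the DIMMs with a non-zero count can be automated. The sketch below is an illustration only: the function name `dimms_with_errors` and the regular expression are assumptions, and the sample text uses the "N Corrected Errors" line format that appears in this article's edac-util listings rather than output captured from a live run.

```python
import re

# Assumed per-DIMM line shape: "mcN: csrowN: <label>: <count> Corrected Errors"
ROW = re.compile(r"mc(\d+): csrow(\d+): (\S+): (\d+) Corrected Errors")

# Sample lines in the format shown in this article.
SAMPLE = """\
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: 0 Corrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors
"""


def dimms_with_errors(report):
    """Return (mc, csrow, label, count) for each DIMM line with count > 0."""
    found = []
    for line in report.splitlines():
        match = ROW.search(line)
        if match is None:
            continue  # summary lines and UE lines do not carry a DIMM label
        count = int(match.group(4))
        if count > 0:
            found.append(
                (int(match.group(1)), int(match.group(2)), match.group(3), count)
            )
    return found
```

Feeding the function the text printed by `edac-util -v` leaves only the entries worth mapping back to a physical slot; for `SAMPLE` that is the single Chan#1_DIMM#2 entry with 11 corrected errors.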

Mapping for 12 DIMMs

mc0: csrow0: CPU#0Channel#0_DIMM#0: A1
mc0: csrow0: CPU#0Channel#1_DIMM#0: A2
mc0: csrow0: CPU#0Channel#2_DIMM#0: A3
mc0: csrow1: CPU#0Channel#0_DIMM#1: A4
mc0: csrow1: CPU#0Channel#1_DIMM#1: A5
mc0: csrow1: CPU#0Channel#2_DIMM#1: A6

mc1: csrow0: CPU#1Channel#0_DIMM#0: B1
mc1: csrow0: CPU#1Channel#1_DIMM#0: B2
mc1: csrow0: CPU#1Channel#2_DIMM#0: B3
mc1: csrow1: CPU#1Channel#0_DIMM#1: B4
mc1: csrow1: CPU#1Channel#1_DIMM#1: B5
mc1: csrow1: CPU#1Channel#2_DIMM#1: B6

Mapping for 20 DIMMs

mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors A1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors B1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors C1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors D1
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors A2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors B2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors C2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors D2
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: 0 Corrected Errors A3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors B3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: 0 Corrected Errors C3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: 0 Corrected Errors D3
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors

4x16 mapping
mc0: csrow0: CPU#0Channel#0_DIMM#0: 0 Corrected Errors 8a
mc0: csrow0: CPU#0Channel#1_DIMM#0: 0 Corrected Errors 5b
mc0: csrow0: CPU#0Channel#2_DIMM#0: 0 Corrected Errors 2c
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU#0Channel#0_DIMM#1: 1 Corrected Errors 7d
mc0: csrow1: CPU#0Channel#1_DIMM#1: 0 Corrected Errors 4e
mc0: csrow1: CPU#0Channel#2_DIMM#1: 0 Corrected Errors 1f
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU#0Channel#0_DIMM#2: 0 Corrected Errors 6G
mc0: csrow2: CPU#0Channel#1_DIMM#2: 0 Corrected Errors 3h

References:
https://www.cnblogs.com/luckyall/p/11225772.html
http://www.voidcn.com/article/p-gvfvakvy-btw.html
