Using the EDAC tools to detect server memory faults
Author: cokll
Date: 2017-03-29
As virtualization and in-memory data stores such as Redis and BDB have become common, more and more servers are configured with large amounts of memory. The Dell R620, for example, has 24 DIMM slots in a dual-CPU configuration and supports up to 960 GB of memory. With ECC and registered (REG) memory, which corrects errors on the fly, detecting faults is a real headache: a failing DIMM can keep running for months or even years, but with bad luck it can bring the server down at any moment. Fortunately Linux provides edac-utils, a diagnostic tool for memory error correction, which can be used to check a server's memory for latent faults. The rest of this post uses CentOS to show how edac-utils is used.

Before using edac-utils you need to understand the server's hardware layout. Take the Dell R620 as the example (other models such as the HP DL360p G8 and IBM x3650 M4 also use E5-2600 series CPUs and the C600 series chipset, so they are broadly the same). Its CPU memory controllers map to channels and DIMM slots as follows.

Processor 0 (one memory controller)
    Channel 0: slots A1, A5 and A9
    Channel 1: slots A2, A6 and A10
    Channel 2: slots A3, A7 and A11
    Channel 3: slots A4, A8 and A12

Processor 1 (one memory controller)
    Channel 0: slots B1, B5 and B9
    Channel 1: slots B2, B6 and B10
    Channel 2: slots B3, B7 and B11
    Channel 3: slots B4, B8 and B12

1. Install edac-utils

    yum install -y libsysfs edac-utils

2. Run the check. Each entry in the output is shown here with the slot it corresponds to.

    > edac-util -v
    mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: A1
    mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: A2
    mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: A3
    mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: A4
    mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: A5
    mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: A6
    mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: A7
    mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: A8
    mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: A9
    mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: A10
    mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: A11
    mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: A12
    mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: B1
    mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: B2
    mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: B3
    mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: B4
    mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: B5
    mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: B6
    mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: B7
    mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: B8
    mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#0_DIMM#2: B9
    mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#1_DIMM#2: B10
    mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#2_DIMM#2: B11
    mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#3_DIMM#2: B12

Here mc0 is memory controller 0, CPU_SrcID#0 is source CPU 0, Chan#0 is channel 0 and DIMM#0 is DIMM slot 0 on that channel. "Corrected Errors" is the number of errors that have been corrected so far. Using the CPU, channel and slot mapping listed earlier, every line that edac-util returns can be labelled with a physical slot. In this case slot A1 showed 6312 corrected errors, slot B1 showed 6459 and slot B3 showed 535. Those three DIMMs have latent faults, so the next step is to contact the vendor and have them replaced.

Mapping for 12 DIMMs

    mc0: csrow0: CPU#0Channel#0_DIMM#0: A1
    mc0: csrow0: CPU#0Channel#1_DIMM#0: A2
    mc0: csrow0: CPU#0Channel#2_DIMM#0: A3
    mc0: csrow1: CPU#0Channel#0_DIMM#1: A4
    mc0: csrow1: CPU#0Channel#1_DIMM#1: A5
    mc0: csrow1: CPU#0Channel#2_DIMM#1: A6
    mc1: csrow0: CPU#1Channel#0_DIMM#0: B1
    mc1: csrow0: CPU#1Channel#1_DIMM#0: B2
    mc1: csrow0: CPU#1Channel#2_DIMM#0: B3
    mc1: csrow1: CPU#1Channel#0_DIMM#1: B4
    mc1: csrow1: CPU#1Channel#1_DIMM#1: B5
    mc1: csrow1: CPU#1Channel#2_DIMM#1: B6

Mapping for 20 DIMMs

    mc0: 0 Uncorrected Errors with no DIMM info
    mc0: 0 Corrected Errors with no DIMM info
    mc0: csrow0: 0 Uncorrected Errors
    mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors A1
    mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors B1
    mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors C1
    mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors D1
    mc0: csrow1: 0 Uncorrected Errors
    mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors A2
    mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors B2
    mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors C2
    mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors D2
    mc0: csrow2: 0 Uncorrected Errors
    mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: 0 Corrected Errors A3
    mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors B3
    mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: 0 Corrected Errors C3
    mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: 0 Corrected Errors D3
    mc1: 0 Uncorrected Errors with no DIMM info
    mc1: 0 Corrected Errors with no DIMM info
    mc1: csrow0: 0 Uncorrected Errors
    mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
    mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
    mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
    mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
    mc1: csrow1: 0 Uncorrected Errors
    mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
    mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
    mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
    mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors

4x16 mapping

    mc0: csrow0: CPU#0Channel#0_DIMM#0: 0 Corrected Errors 8a
    mc0: csrow0: CPU#0Channel#1_DIMM#0: 0 Corrected Errors 5b
    mc0: csrow0: CPU#0Channel#2_DIMM#0: 0 Corrected Errors 2c
    mc0: csrow1: 0 Uncorrected Errors
    mc0: csrow1: CPU#0Channel#0_DIMM#1: 1 Corrected Errors 7d
    mc0: csrow1: CPU#0Channel#1_DIMM#1: 0 Corrected Errors 4e
    mc0: csrow1: CPU#0Channel#2_DIMM#1: 0 Corrected Errors 1f
    mc0: csrow2: 0 Uncorrected Errors
    mc0: csrow2: CPU#0Channel#0_DIMM#2: 0 Corrected Errors 6G
    mc0: csrow2: CPU#0Channel#1_DIMM#2: 0 Corrected Errors 3h
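Labelling each line by hand gets tedious on a 24-slot machine, so the lookup can be scripted. The following is only a minimal sketch, assuming the Dell R620 layout described in this post (four channels per CPU, slot number = DIMM index x 4 + channel + 1, letter A for CPU 0 and B for CPU 1) and the `CPU_SrcID#..._Chan#..._DIMM#...: N Corrected Errors` line format shown in the listings. The function names, the threshold parameter and the sample text with its counts are made up for illustration; other platforms, such as the 12-DIMM and 4x16 layouts in this post, label their slots differently and would need their own mapping.

```python
import re

# Matches per-DIMM lines such as
# "mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 3 Corrected Errors".
# Summary lines ("mc0: csrow0: 0 Uncorrected Errors") do not match.
LINE_RE = re.compile(
    r"mc\d+: csrow\d+: CPU_SrcID#(?P<cpu>\d+)_Ha#\d+_Chan#(?P<chan>\d+)"
    r"_DIMM#(?P<dimm>\d+): (?P<count>\d+) Corrected Errors"
)

CHANNELS_PER_CPU = 4  # assumption: R620 layout, four channels per CPU


def slot_label(cpu, chan, dimm):
    """Map (CPU, channel, DIMM index) to an R620 slot label such as 'A5'."""
    letter = chr(ord("A") + cpu)  # CPU 0 -> A, CPU 1 -> B
    return f"{letter}{dimm * CHANNELS_PER_CPU + chan + 1}"


def corrected_errors_by_slot(text):
    """Return {slot label: corrected error count} for edac-util -v output."""
    counts = {}
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # not a per-DIMM corrected-error line
        label = slot_label(int(m["cpu"]), int(m["chan"]), int(m["dimm"]))
        counts[label] = counts.get(label, 0) + int(m["count"])
    return counts


def suspect_slots(counts, threshold=0):
    """Slots whose corrected error count exceeds the (hypothetical) threshold."""
    return sorted(label for label, n in counts.items() if n > threshold)


# Made-up sample in the format shown above; the counts are illustrative only.
SAMPLE = """\
mc0: 0 Uncorrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 3 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 7 Corrected Errors
"""

if __name__ == "__main__":
    counts = corrected_errors_by_slot(SAMPLE)
    for slot in suspect_slots(counts):
        print(slot, counts[slot])
```

On a real server the text would come from running edac-util -v (for example via subprocess) instead of the SAMPLE string. A threshold of 0 flags every slot with at least one corrected error; how many corrected errors justify a replacement is a judgment call that this sketch does not settle.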