1、前言
本文目的主要是针对华为RH2288V3磁盘阵列状态预警监控,由于目前想获取磁盘状态只能去机房查看硬盘状态灯、IBMC管理后台报警信息两种方式,于是在PVE系统上安装了阵列制造商的程序并自己写了个简单脚本检测告警。
本文环境
服务器:华为RH2288V3
阵列卡:SAS3108
系统:PVE8.3.0
2、安装阵列制造商的检测程序
2.1确认阵列卡
# lspci | grep -i raid
01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)阵列卡为"Logic MegaRAID SAS-3 3108"
2.2替换镜像源
默认的Debian源很多在国内无法加载,所以替换为清华镜像源
编辑镜像源,并注释原来的,粘贴新的地址
nano /etc/apt/sources.listdeb https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm main contrib non-free
deb https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-updates main contrib non-free
deb https://mirrors.tuna.tsinghua.edu.cn/debian-security bookworm-security main contrib non-free2.3下载安装MegaRAID Storage Manager (MSM)
https://www.broadcom.cn/support
下载的zip包里面只有RPM格式的安装包,而PVE是基于debian的,所以还需要使用alien把rpm转化为deb再安装
apt install alien
tar zxvf 17.05.02.01_MSM_linux-x86.tar.gz
cd disk
alien --scripts *.rpm
dpkg --install lib-utils2_1.00-3_all.deb
dpkg --install megaraid-storage-manager_17.05.02-2_all.deb默认安装到目录/usr/local/MegaRAID Storage Manager/StorCLI/
如果提示路径太长,可以把MegaRAID Storage Manager重命名为MegaRAID
2.4测试安装效果
查看所有阵列信息,这个输出会很长
# /usr/local/MegaRAID/StorCLI/storcli64 -AdpAllInfo -aALL
Adapter #0
==============================================================================
Versions
================
Product Name : SAS3108
Serial No :
FW Package Build: 24.16.0-0114
Mfg. Data
================
Mfg. Date : 00/00/00
Rework Date : 00/00/00
Revision No :
Battery FRU : N/A
Image Versions in Flash:
================
BIOS Version : 6.32.02.0_4.17.08.00_0x06150500
FW Version : 4.660.00-8313
NVDATA Version : 3.1605.00-0015
Ctrl-R Version : 5.17-1302
Boot Block Version : 3.07.00.00-0003
Pending Images in Flash
================
None
PCI Info
================
Controller Id : 0000
Vendor Id : 1000
Device Id : 005d
SubVendorId : 19e5
SubDeviceId : d207
Host Interface : PCIE
ChipRevision : C0
Number of Frontend Port: 0
Device Interface : PCIE
Number of Backend Port: 8
Port : Address
0 500e004aaaaaaa1f
1 0000000000000000
2 0000000000000000
3 0000000000000000
4 0000000000000000
5 0000000000000000
6 0000000000000000
7 0000000000000000
HW Configuration
================
SAS Address : 5a0039665c3fd96e
BBU : Absent
Alarm : Present
NVRAM : Present
Serial Debugger : Present
Memory : Present
Flash : Present
Memory Size : 1024MB
TPM : Absent
On board Expander: Absent
Upgrade Key : Absent
Temperature sensor for ROC : Present
Temperature sensor for controller : Absent
ROC temperature : 75 degree Celcius
Settings
================
Current Time : 2:32:46 4/16, 2025
Predictive Fail Poll Interval : 300sec
Interrupt Throttle Active Count : 16
Interrupt Throttle Completion : 50us
Rebuild Rate : 30%
PR Rate : 30%
BGI Rate : 30%
Check Consistency Rate : 30%
Reconstruction Rate : 30%
Cache Flush Interval : 4s
Max Drives to Spinup at One Time : 4
Delay Among Spinup Groups : 2s
Physical Drive Coercion Mode : 1GB
Cluster Mode : Disabled
Alarm : Enabled
Auto Rebuild : Enabled
Battery Warning : Disabled
Ecc Bucket Size : 15
Ecc Bucket Leak Rate : 1440 Minutes
Restore HotSpare on Insertion : Enabled
Expose Enclosure Devices : Enabled
Maintain PD Fail History : Enabled
Host Request Reordering : Enabled
Auto Detect BackPlane Enabled : SGPIO/i2c SEP
Load Balance Mode : Auto
Use FDE Only : No
Security Key Assigned : No
Security Key Failed : No
Security Key Not Backedup : No
Default LD PowerSave Policy : Controller Defined
Maximum number of direct attached drives to spin up in 1 min : 120
Auto Enhanced Import : Yes
Any Offline VD Cache Preserved : No
Allow Boot with Preserved Cache : No
Disable Online Controller Reset : No
PFK in NVRAM : Yes
Use disk activity for locate : No
POST delay : 90 seconds
Capabilities
================
RAID Level Supported : RAID0, RAID1, RAID5, RAID6, RAID00, RAID10, RAID50, RAID60, PRL 11, PRL 11 with spanning, SRL 3 supported, PRL11-RLQ0 DDF layout with no span, PRL11-RLQ0 DDF layout with span
Supported Drives : SAS, SATA
Boot Volume Supported : NO
Allowed Mixing:
Mix in Enclosure Allowed
Mix of SAS/SATA of HDD type in VD Allowed
Status
================
ECC Bucket Count : 0
Limitations
================
Max Arms Per VD : 32
Max Spans Per VD : 8
Max Arrays : 128
Max Number of VDs : 64
Max Parallel Commands : 928
Max SGE Count : 60
Max Data Transfer Size : 8192 sectors
Max Strips PerIO : 128
Max LD per array : 16
Min Strip Size : 64 KB
Max Strip Size : 1.0 MB
Max Configurable CacheCade Size: 0 GB
Current Size of CacheCade : 0 GB
Current Size of FW Cache : 815 MB
Device Present
================
Virtual Drives : 6
Degraded : 0
Offline : 0
Physical Devices : 14
Disks : 12
Critical Disks : 0
Failed Disks : 0
Supported Adapter Operations
================
Rebuild Rate : Yes
CC Rate : Yes
BGI Rate : Yes
Reconstruct Rate : Yes
Patrol Read Rate : Yes
Alarm Control : Yes
Cluster Support : No
BBU : Yes
Spanning : Yes
Dedicated Hot Spare : Yes
Revertible Hot Spares : Yes
Foreign Config Import : Yes
Self Diagnostic : Yes
Allow Mixed Redundancy on Array : No
Global Hot Spares : Yes
Deny SCSI Passthrough : No
Deny SMP Passthrough : No
Deny STP Passthrough : No
Support Security : No
Snapshot Enabled : No
Support the OCE without adding drives : Yes
Support PFK : Yes
Support PI : Yes
Support Boot Time PFK Change : No
Disable Online PFK Change : No
Support LDPI Type1 : No
Support LDPI Type2 : No
Support LDPI Type3 : No
PFK TrailTime Remaining : 0 days 0 hours
Support Shield State : Yes
Block SSD Write Disk Cache Change: No
Point In Time Progress: Yes
Supported VD Operations
================
Read Policy : Yes
Write Policy : Yes
IO Policy : Yes
Access Policy : Yes
Disk Cache Policy : Yes
Reconstruction : Yes
Deny Locate : No
Deny CC : No
Allow Ctrl Encryption: No
Enable LDBBM : No
Support Breakmirror : Yes
Power Savings : No
Supported PD Operations
================
Force Online : Yes
Force Offline : Yes
Force Rebuild : Yes
Deny Force Failed : No
Deny Force Good/Bad : No
Deny Missing Replace : No
Deny Clear : No
Deny Locate : No
Support Temperature : Yes
NCQ : Yes
Disable Copyback : No
Enable JBOD : No
Enable Copyback on SMART : Yes
Enable Copyback to SSD on SMART Error : Yes
Enable SSD Patrol Read : No
PR Correct Unconfigured Areas : Yes
Enable Spin Down of UnConfigured Drives : Yes
Disable Spin Down of hot spares : No
Spin Down time : 30
T10 Power State : No
Error Counters
================
Memory Correctable Errors : 0
Memory Uncorrectable Errors : 0
High Availability Properties
================
Topology Type : None
Cluster Information
================
Cluster Permitted : No
Cluster Active : No
Default Settings
================
Phy Polarity : 0
Phy PolaritySplit : 0
Background Rate : 30
Strip Size : 256kB
Flush Time : 4 seconds
Write Policy : WB
Read Policy : RA
Cache When BBU Bad : Disabled
Cached IO : No
SMART Mode : Mode 6
Alarm Disable : Yes
Coercion Mode : 1GB
ZCR Config : Unknown
Dirty LED Shows Drive Activity : No
BIOS Continue on Error : No
Spin Down Mode : None
Allowed Device Type : SAS/SATA Mix
Allow Mix in Enclosure : Yes
Allow HDD SAS/SATA Mix in VD : Yes
Allow SSD SAS/SATA Mix in VD : No
Allow HDD/SSD Mix in VD : No
Allow SATA in Cluster : No
Max Chained Enclosures : 16
Disable Ctrl-R : No
Enable Web BIOS : No
Direct PD Mapping : No
BIOS Enumerate VDs : Yes
Restore Hot Spare on Insertion : Yes
Expose Enclosure Devices : Yes
Maintain PD Fail History : Yes
Disable Puncturing : No
Zero Based Enclosure Enumeration : No
PreBoot CLI Enabled : No
LED Show Drive Activity : No
Cluster Disable : Yes
SAS Disable : No
Auto Detect BackPlane Enable : SGPIO/i2c SEP
Use FDE Only : No
Enable Led Header : Yes
Delay during POST : 0
EnableCrashDump : Yes
Disable Online Controller Reset : No
EnableLDBBM : No
Un-Certified Hard Disk Drives : Allow
Treat Single span R1E as R10 : No
Max LD per array : 16
Power Saving option : Don't Auto spin down Configured Drives
Max power savings option is not allowed for LDs. Only T10 power conditions are to be used.
Default spin down time in minutes: 30
Enable JBOD : No
TTY Log In Flash : No
Auto Enhanced Import : Yes
BreakMirror RAID Support : Yes
Disable Join Mirror : Yes
Enable Shield State : Yes
Time taken to detect CME : 60s
Exit Code: 0x00我们只需要关注"Device Present"部分,如果"Degraded","Offline","Critical Disks","Failed Disks",都为"0"就判断状态磁盘正常,否则就有故障。
"Device Present"后面一共8行,只要有4个0就OK。
Device Present
================
Virtual Drives : 6
Degraded : 0
Offline : 0
Physical Devices : 14
Disks : 12
Critical Disks : 0
Failed Disks : 0 2.5获取状态信息
用一个简单的组合命令:
/usr/local/MegaRAID/StorCLI/storcli64 -AdpAllInfo -aALL | grep -A 8 'Device Present' | grep 0 | wc -l2.6放入脚本
#!/bin/bash
PRESENT=$(/usr/local/MegaRAID Storage Manager/StorCLI/storcli64 -AdpAllInfo -aALL | grep -A 8 "Device Present" | grep 0 | wc -l)
if [[ $PRESENT -eq 4 ]]; then
echo 'All are OK' && exit 0
else
echo 'All are OK' && exit 2
fi2.7测试脚本
# bash /tmp/check_MegaRAID.sh
All are OK