跳到主要内容

某药业集团核心数据库ASM磁盘组故障处理

提示

本文为站长原创文章,版权所有,未经允许,禁止转载!

信息

数据库环境信息:
硬件:IBM X3850 x5
操作系统:Oracle Enterprise Linux 5.8 64bit
oracle版本:Oracle 11.2.0.4 RAC 2 nodes

事件概述

最初在给客户搭建这套RAC时,基于客户对数据安全的要求,因此在配置ASM Diskgroup时配置ASM Disk Group为Normal Redundancy,normal冗余模式标准要求最少有两套存储系统来存放相关数据,客户这边有3套存储系统,所以其中数据文件存放在两套主存储上,从3套存储中各自划分一个lun(磁盘)用于存放OCR和Votedisk信息。

在一次日常例行巡检中发现磁盘组存在故障,下为处理过程:

[grid@xxyy2 ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Thu Sep 29 22:03:46 2016
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

SQL> set linesize 999
SQL> set pagesize 999
SQL> select FAILGROUP,MODE_STATUS,HEADER_STATUS,STATE,NAME,path from v$asm_disk order by path;
FAILGROUP MODE_STATUS HEADER_STATUS STATE NAME PATH
-------------------- -------------- ------------------------ ---------------- ------------------------------------------------------------ --------------------------------------------------
FG1 ONLINE MEMBER NORMAL CT1DSK01 /dev/oracleasm/disks/CT1DSK01
FG1 ONLINE MEMBER NORMAL CT1DSK02 /dev/oracleasm/disks/CT1DSK02
FG1 ONLINE MEMBER NORMAL CT1DSK03 /dev/oracleasm/disks/CT1DSK03
FG1 ONLINE MEMBER NORMAL CT1DSK04 /dev/oracleasm/disks/CT1DSK04
ONLINE MEMBER NORMAL /dev/oracleasm/disks/CT2DSK01
ONLINE MEMBER NORMAL /dev/oracleasm/disks/CT2DSK02
ONLINE MEMBER NORMAL /dev/oracleasm/disks/CT2DSK03
ONLINE MEMBER NORMAL /dev/oracleasm/disks/CT2DSK04
GRIDDG_0000 ONLINE MEMBER NORMAL GRIDDG_0000 /dev/oracleasm/disks/GRIDDG01
ONLINE MEMBER NORMAL /dev/oracleasm/disks/GRIDDG02
GRIDDG_0003 ONLINE MEMBER NORMAL GRIDDG_0003 /dev/oracleasm/disks/GRIDDG03
FG2 OFFLINE UNKNOWN FORCING _DROPPED_0002_DBDG
FG2 OFFLINE UNKNOWN FORCING _DROPPED_0003_DBDG
FG2 OFFLINE UNKNOWN FORCING _DROPPED_0004_DBDG
FG2 OFFLINE UNKNOWN FORCING _DROPPED_0001_DBDG

15 rows selected.

SQL> select GROUP_NUMBER,DISK_NUMBER,MOUNT_STATUS,HEADER_STATUS,MODE_STATUS,STATE,FAILGROUP,LABEL,path from v$asm_disk_stat order by 1,5;
GROUP_NUMBER DISK_NUMBER MOUNT_STATUS HEADER_STATUS MODE_STATUS STATE FAILGROUP PATH
------------ ----------- -------------- ------------------------ -------------- ---------------- -------------------- --------------------------------------------------
1 3 MISSING UNKNOWN OFFLINE FORCING FG2
1 2 MISSING UNKNOWN OFFLINE FORCING FG2
1 1 MISSING UNKNOWN OFFLINE FORCING FG2
1 4 MISSING UNKNOWN OFFLINE FORCING FG2
1 5 CACHED MEMBER ONLINE NORMAL FG1 /dev/oracleasm/disks/CT1DSK02
1 0 CACHED MEMBER ONLINE NORMAL FG1 /dev/oracleasm/disks/CT1DSK01
1 7 CACHED MEMBER ONLINE NORMAL FG1 /dev/oracleasm/disks/CT1DSK04
1 6 CACHED MEMBER ONLINE NORMAL FG1 /dev/oracleasm/disks/CT1DSK03
4 0 CACHED MEMBER ONLINE NORMAL GRIDDG_0000 /dev/oracleasm/disks/GRIDDG01
4 2 CACHED MEMBER ONLINE NORMAL GRIDDG_0003 /dev/oracleasm/disks/GRIDDG03
10 rows selected.

SQL> select GROUP_NUMBER,NAME,BLOCK_SIZE,STATE,TYPE,TOTAL_MB,FREE_MB,HOT_USED_MB,OFFLINE_DISKS,VOTING_FILES from V$ASM_DISKGROUP;
GROUP_NUMBER NAME BLOCK_SIZE STATE TYPE TOTAL_MB FREE_MB HOT_USED_MB OFFLINE_DISKS VO
------------ -------------------- ---------- ---------------------- ------------ ---------- ---------- ----------- ------------- --
1 DBDG 4096 MOUNTED NORMAL 4194304 3060248 0 4 N
4 GRIDDG 4096 MOUNTED NORMAL 3955 3163 0 0 Y

# 正常来说要求votedisk要求有3个,这里只有2个了。
[grid@xxyy2 ~]$ crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE 4343bf6e93834f52bf5e6e4e77056a7d (/dev/oracleasm/disks/GRIDDG01) [GRIDDG]
2. ONLINE 7022852b40554fc0bf83e6685ac74ff3 (/dev/oracleasm/disks/GRIDDG03) [GRIDDG]
Located 2 voting disk(s).

故障处理流程

可以看到,有一组failure group(FG2)已经掉了,即其中一套存储系统离线。后来跟客户沟通得知其一台存储出现过断电情况,下面为处理步骤:

1.丢失的磁盘处理

# 确认丢失的磁盘可以读写后用 dd 命令把磁盘头清掉
dd if=/dev/zero of=/dev/mapper/GRIDDG02 bs=1024k count=100
dd if=/dev/zero of=/dev/mapper/DBDG_Lun01_Mirror bs=1024k count=100
dd if=/dev/zero of=/dev/mapper/DBDG_Lun02_Mirror bs=1024k count=100
dd if=/dev/zero of=/dev/mapper/DBDG_Lun03_Mirror bs=1024k count=100
dd if=/dev/zero of=/dev/mapper/DBDG_Lun04_Mirror bs=1024k count=100

2.重新扫描磁盘


[root@xxyy1 ~]# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Scanning system for ASM disks...

[root@xxyy1 ~]# oracleasm listdisks
CT1DSK01
CT1DSK02
CT1DSK03
CT1DSK04
GRIDDG01
GRIDDG03

3.重建丢失的磁盘

[root@xxyy1 ~]# /etc/init.d/oracleasm createdisk CT2DSK01 /dev/mapper/DBDG_Lun01_Mirror
Marking disk "CT2DSK01" as an ASM disk: [ OK ]
[root@xxyy1 ~]# /etc/init.d/oracleasm createdisk CT2DSK02 /dev/mapper/DBDG_Lun02_Mirror
Marking disk "CT2DSK02" as an ASM disk: [ OK ]
[root@xxyy1 ~]# /etc/init.d/oracleasm createdisk CT2DSK03 /dev/mapper/DBDG_Lun03_Mirror
Marking disk "CT2DSK03" as an ASM disk: [ OK ]
[root@xxyy1 ~]# /etc/init.d/oracleasm createdisk CT2DSK04 /dev/mapper/DBDG_Lun04_Mirror
Marking disk "CT2DSK04" as an ASM disk: [ OK ]

[root@xxyy1 ~]# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Scanning system for ASM disks...

[root@xxyy1 ~]# oracleasm listdisks
CT1DSK01
CT1DSK02
CT1DSK03
CT1DSK04
CT2DSK01
CT2DSK02
CT2DSK03
CT2DSK04
GRIDDG01
GRIDDG02
GRIDDG03

SQL> select GROUP_NUMBER,FAILGROUP,MODE_STATUS,HEADER_STATUS,STATE,NAME,path from v$asm_disk order by path;
GROUP_NUMBER FAILGROUP MODE_STATUS HEADER_STATUS STATE NAME PATH
------------ ------------------------------------------------------------ -------------- ------------------------ ---------------- ------------------------------------------------------------ --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 FG1 ONLINE MEMBER NORMAL CT1DSK01 /dev/oracleasm/disks/CT1DSK01
1 FG1 ONLINE MEMBER NORMAL CT1DSK02 /dev/oracleasm/disks/CT1DSK02
1 FG1 ONLINE MEMBER NORMAL CT1DSK03 /dev/oracleasm/disks/CT1DSK03
1 FG1 ONLINE MEMBER NORMAL CT1DSK04 /dev/oracleasm/disks/CT1DSK04
0 ONLINE PROVISIONED NORMAL /dev/oracleasm/disks/CT2DSK01
0 ONLINE PROVISIONED NORMAL /dev/oracleasm/disks/CT2DSK02
0 ONLINE PROVISIONED NORMAL /dev/oracleasm/disks/CT2DSK03
0 ONLINE PROVISIONED NORMAL /dev/oracleasm/disks/CT2DSK04
4 GRIDDG_0000 ONLINE MEMBER NORMAL GRIDDG_0000 /dev/oracleasm/disks/GRIDDG01
4 GRIDDG_0006 ONLINE MEMBER NORMAL GRIDDG_0006 /dev/oracleasm/disks/GRIDDG02
4 GRIDDG_0003 ONLINE MEMBER NORMAL GRIDDG_0003 /dev/oracleasm/disks/GRIDDG03
1 FG2 OFFLINE UNKNOWN FORCING _DROPPED_0001_DBDG
1 FG2 OFFLINE UNKNOWN FORCING _DROPPED_0002_DBDG
1 FG2 OFFLINE UNKNOWN FORCING _DROPPED_0010_DBDG
1 FG2 OFFLINE UNKNOWN FORCING _DROPPED_0004_DBDG
1 FG2 OFFLINE UNKNOWN FORCING _DROPPED_0008_DBDG
1 FG2 OFFLINE UNKNOWN FORCING _DROPPED_0011_DBDG
1 FG2 OFFLINE UNKNOWN FORCING _DROPPED_0003_DBDG
1 FG2 OFFLINE UNKNOWN FORCING _DROPPED_0009_DBDG

19 rows selected.

[grid@xxyy1 ~]$ sqlplus / as sysasm
SQL*Plus: Release 11.2.0.4.0 Production on Fri Sep 30 00:43:40 2016
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

SQL> alter diskgroup DBDG add FAILGROUP fg02 disk '/dev/oracleasm/disks/CT2DSK01' name CT2DSK001 ;
Diskgroup altered.

SQL> alter diskgroup DBDG add FAILGROUP fg02 disk '/dev/oracleasm/disks/CT2DSK02' name CT2DSK002 ;
Diskgroup altered.

SQL> alter diskgroup DBDG add FAILGROUP fg02 disk '/dev/oracleasm/disks/CT2DSK03' name CT2DSK003 ;
Diskgroup altered.

SQL> alter diskgroup DBDG add FAILGROUP fg02 disk '/dev/oracleasm/disks/CT2DSK04' name CT2DSK004 ;
Diskgroup altered.

SQL> alter system set asm_power_limit=8;
System altered.

SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

[grid@xxyy1 ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Fri Sep 30 00:47:30 2016
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

4.监控磁盘rebalance进度

信息

注意,在磁盘rebalance过程中会导致大量磁盘I/O消耗,所以不要在业务繁忙阶段进行rebalance,或者根据情况调整asm_power_limit参数值。

SQL> select * from V$ASM_OPERATION;
GROUP_NUMBER OPERATION STATE POWER ACTUAL SOFAR EST_WORK EST_RATE EST_MINUTES ERROR_CODE
------------ ---------- -------- ---------- ---------- ---------- ---------- ---------- ----------- ----------------------------------------------------------------------------------------
1 REBAL RUN 8 8 16147 425638 6889 59

查看存储光纤多路径链路是否正常

[root@xxyy1 ~]# multipath -ll
DBDG_Lun01_Mirror (360014380125db2740000b000001c0000) dm-7 HP,HSV340
size=1.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 2:0:3:1 sdk 8:160 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 3:0:3:1 sdl 8:176 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 2:0:2:1 sdaf 65:240 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
`- 3:0:2:1 sdag 66:0 active ready running
GridDG_Lun02 (360014380125db2740000b00000180000) dm-11 HP,HSV340
size=2.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 3:0:3:5 sds 65:32 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 2:0:3:5 sdt 65:48 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 2:0:2:5 sdan 66:112 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
`- 3:0:2:5 sdao 66:128 active ready running
DBDG_Lun03_Mirror (360014380125db2740000b00000240000) dm-9 HP,HSV340
size=1.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 2:0:3:3 sdp 8:240 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 2:0:2:3 sdaj 66:48 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
`- 3:0:2:3 sdak 66:64 active ready running
GridDG_Lun01 (360014380125db69f00006000009a0000) dm-6 HP,HSV340
size=2.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 3:0:0:5 sde 8:64 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 3:0:1:5 sdj 8:144 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 2:0:0:5 sdz 65:144 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
`- 2:0:1:5 sdae 65:224 active ready running
DBDG_Lun04 (360014380125db69f0000600000aa0000) dm-5 HP,HSV340
size=1.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 3:0:0:4 sdd 8:48 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 3:0:1:4 sdi 8:128 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 2:0:0:4 sdy 65:128 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
`- 2:0:1:4 sdad 65:208 active ready running
DBDG_Lun03 (360014380125db69f0000600000a60000) dm-4 HP,HSV340
size=1.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 3:0:0:3 sdc 8:32 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 3:0:1:3 sdh 8:112 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 2:0:0:3 sdx 65:112 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
`- 2:0:1:3 sdac 65:192 active ready running
DBDG_Lun02_Mirror (360014380125db2740000b00000200000) dm-8 HP,HSV340
size=1.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 2:0:3:2 sdm 8:192 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 2:0:2:2 sdah 66:16 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
`- 3:0:2:2 sdai 66:32 active ready running
DBDG_Lun02 (360014380125db69f0000600000a20000) dm-3 HP,HSV340
size=1.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 3:0:0:2 sdb 8:16 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 3:0:1:2 sdg 8:96 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 2:0:0:2 sdw 65:96 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
`- 2:0:1:2 sdab 65:176 active ready running
DBDG_Lun04_Mirror (360014380125db2740000b00000280000) dm-10 HP,HSV340
size=1.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 3:0:3:4 sdq 65:0 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 2:0:3:4 sdr 65:16 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 2:0:2:4 sdal 66:80 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
`- 3:0:2:4 sdam 66:96 active ready running
DBDG_Lun01 (360014380125db69f00006000009e0000) dm-2 HP,HSV340
size=1.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 3:0:0:1 sda 8:0 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 3:0:1:1 sdf 8:80 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 2:0:0:1 sdv 65:80 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
`- 2:0:1:1 sdaa 65:160 active ready running
GridDG_Lun03 (3600c0ff000db40951514fb5201000000) dm-12 HP,MSA2312fc
size=1.9G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
`- 3:0:4:0 sdu 65:64 active ready running

等待rebalance完成后,故障解决完毕。