从某种意义上说,IT运维管理是一项高风险行业。工作涉及领域广、专业化程度高,并且技术更新升级快。往往是我们还不完全了解一项技术特性的时候,一项新技术就已经跃跃欲试了。
当系统出现问题、发生变更配置,出现不熟悉的故障时,冷静下来搞清楚问题原因之后再下手操作是最稳妥的策略。同时,在下手之前要准备好“后手”,也就是退路。最好能有类似备份、镜像之类的以防万一。实际场景中,没有准备的入手,往往带来更大的祸事,最后小病成癌症。
本篇记录一下笔者进行一个实验环境调试的过程,从一个小错误,发展成ASM不可访问的大问题。
1、环境介绍
笔者使用Oracle 11g,具体版本为11.2.0.4。
SQL> select * from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
PL/SQL Release 11.2.0.4.0 - Production
CORE 11.2.0.4.0 Production
TNS for Linux: Version 11.2.0.4.0 - Production
NLSRTL Version 11.2.0.4.0 – Production
底层使用ASM+GI组织资源。
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
ONLINE ONLINE ncr-standby-asm
ora.LISTENER.lsnr
ONLINE ONLINE ncr-standby-asm
ora.RECO.dg
ONLINE ONLINE ncr-standby-asm
ora.asm
ONLINE ONLINE ncr-standby-asm Started
ora.ons
OFFLINE OFFLINE ncr-standby-asm
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.cssd
1 ONLINE ONLINE ncr-standby-asm
ora.diskmon
1 OFFLINE OFFLINE
ora.evmd
1 ONLINE ONLINE ncr-standby-asm
ora.sicsstb.db
1 ONLINE ONLINE ncr-standby-asm Open
2、创建Oracle Wallet失败
因为要试验Oracle加密机制,需要创建wallet文件。根据经验,需要显示的在sqlnet.ora中加入wallet文件所在位置的参数信息。
sqlnet.ora是Oracle Net Service的核心配置三文件之一。通常是作为控制参数配置文件使用。在Grid环境下,通常监听器程序是在$GRID_HOME目录结构下的。
[grid@NCR-Standby-Asm ~]$ which lsnrctl
/u01/app/grid/product/11.2.0/grid/bin/lsnrctl
[grid@NCR-Standby-Asm ~]$ lsnrctl status
LSNRCTL for Linux: Version 11.2.0.4.0 - Production on 10-JUN-2015 09:28:41
Copyright (c) 1991, 2013, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC1521)))
STATUS of the LISTENER
------------------------
Alias LISTENER
Version TNSLSNR for Linux: Version 11.2.0.4.0 - Production
Start Date 08-JUN-2015 14:20:23
Uptime 1 days 19 hr. 8 min. 17 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener Parameter File /u01/app/grid/product/11.2.0/grid/network/admin/listener.ora
Listener Log File /u01/app/grid/diag/tnslsnr/NCR-Standby-Asm/listener/alert/log.xml
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1521)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=localhost)(PORT=1521)))
Services Summary...
Service "+ASM" has 1 instance(s).
Instance "+ASM", status READY, has 1 handler(s) for this service...
Service "sicsstb" has 1 instance(s).
Instance "sicsstb", status READY, has 1 handler(s) for this service...
Service "sicsstbXDB" has 1 instance(s).
Instance "sicsstb", status READY, has 1 handler(s) for this service...
The command completed successfully
笔者想当然觉得这个参数配置应该是Grid下面的sqlnet.ora文件中设置。
[grid@NCR-Standby-Asm ~]$ cd /u01/app/grid/product/11.2.0/grid/network/admin/
[grid@NCR-Standby-Asm admin]$ ls -l
total 20
-rw-r--r-- 1 grid oinstall 432 May 5 09:26 listener.ora
-rw-r--r-- 1 grid oinstall 366 May 5 09:26 listener.ora.bak.ncr-standby-asm
drwxr-xr-x 2 grid oinstall 4096 May 5 09:19 samples
-rw-r--r-- 1 grid oinstall 381 Dec 17 2012 shrept.lst
-rw-r--r-- 1 grid oinstall 215 May 5 09:26 sqlnet.ora
[grid@NCR-Standby-Asm admin]$ cat sqlnet.ora
# sqlnet.ora Network Configuration File: /u01/app/grid/product/11.2.0/grid/network/admin/sqlnet.ora
# Generated by Oracle configuration tools.
NAMES.DIRECTORY_PATH= (TNSNAMES, EZCONNECT)
ADR_BASE = /u01/app/grid
ENCRYPTION_WALLET_LOCATION=(SOURCE=(METHOD=FILE)(METHOD_DATA=(DIRECTORY=/u01/app/grid)))
按照经验,重启启动监听器,加载配置文件。进入sqlplus创建wallet文件。
[oracle@NCR-Standby-Asm grid]$ sqlplus /nolog
SQL*Plus: Release 11.2.0.4.0 Production on Mon Jun 8 13:58:15 2015
Copyright (c) 1982, 2013, Oracle. All rights reserved.
SQL> conn / as sysdba
Connected.
SQL> alter system set encryption key authenticated by "test";
alter system set encryption key authenticated by "test"
*
ERROR at line 1:
ORA-28368: cannot auto-create wallet
官方提示对ora-28368的解释如下:
[oracle@NCR-Standby-Asm ~]$ oerr ora 28368
28368, 0000, "cannot auto-create wallet"
// *Cause: The database failed to auto create an Oracle wallet. The Oracle
// process may not have proper file permissions or a wallet may
// already exist.
// *Action: Confirm that proper directory permissions are granted to the Oracle
// user and that neither an encrypted or obfuscated wallet exists in
// the specified wallet location and try again
确认目录权限,理论上没有什么问题应该。
[oracle@NCR-Standby-Asm ~]$ cd /u01/app/grid
[oracle@NCR-Standby-Asm grid]$ ls -l
total 32
drwxr-x--- 3 grid oinstall 4096 May 5 09:26 admin
drwxr-xr-x 4 grid oinstall 4096 May 5 09:26 cfgtoollogs
drwxr-xr-x 2 grid oinstall 4096 May 5 09:28 checkpoints
drwxr-xr-x 2 grid oinstall 4096 May 5 09:23 Clusterware
drwxrwxr-x 5 grid oinstall 4096 Jun 3 17:21 diag
drwxr-xr-x 3 grid oinstall 4096 May 5 09:23 NCR-Standby-Asm
drwxr-xr-x 3 grid oinstall 4096 Jun 4 13:59 oradiag_grid
drwxr-xr-x 3 grid oinstall 4096 May 5 08:56 product
[oracle@NCR-Standby-Asm grid]$ cd ..
[oracle@NCR-Standby-Asm app]$ ls -l | grep grid
drwxr-xr-x 10 grid oinstall 4096 Jun 4 13:59 grid
drwxrwx--- 6 grid oinstall 4096 May 26 09:11 oraInventory
用户oracle虽然没有直接owner权限,那么更换一个oracle用户可以访问到的目录。
[grid@NCR-Standby-Asm admin]$ cat sqlnet.ora
# sqlnet.ora Network Configuration File: /u01/app/grid/product/11.2.0/grid/network/admin/sqlnet.ora
# Generated by Oracle configuration tools.
NAMES.DIRECTORY_PATH= (TNSNAMES, EZCONNECT)
ADR_BASE = /u01/app/grid
ENCRYPTION_WALLET_LOCATION=(SOURCE=(METHOD=FILE)(METHOD_DATA=(DIRECTORY=/u02/app/oracle)))
SQL*Plus: Release 11.2.0.4.0 Production on Mon Jun 8 14:07:53 2015
Copyright (c) 1982, 2013, Oracle. All rights reserved.
SQL> conn / as sysdba
Connected.
SQL> alter system set encryption key authenticated by "test";
alter system set encryption key authenticated by "test"
*
ERROR at line 1:
ORA-28368: cannot auto-create wallet
更改为wallet推荐目录。
[grid@NCR-Standby-Asm admin]$ cat sqlnet.ora
# sqlnet.ora Network Configuration File: /u01/app/grid/product/11.2.0/grid/network/admin/sqlnet.ora
# Generated by Oracle configuration tools.
NAMES.DIRECTORY_PATH= (TNSNAMES, EZCONNECT)
ADR_BASE = /u01/app/grid
ENCRYPTION_WALLET_LOCATION=(SOURCE=(METHOD=FILE)(METHOD_DATA=(DIRECTORY=/u02/app/oracle/admin/sicsstb/WALLET)))
问题依然不行。之后,笔者尝试更新u01(grid目录)和u02(oracle目录)权限,连带更新子文件夹和连带文件,更新为755和775,故障依然。(注意这个问题!!!)
之后检查MOS,在community社区论坛中一篇讨论:在grid和rac结构下,设置wallet要在Oracle的目录中。死马当活马医,笔者实验将sqlnet.ora拷贝到$ORACLE_HOME目录下。
[grid@NCR-Standby-Asm admin]$ cp sqlnet.ora /u02/app/oracle/product/11.2.0/dbhome_1/network/admin
[oracle@NCR-Standby-Asm admin]$ sqlplus /nolog
SQL*Plus: Release 11.2.0.4.0 Production on Mon Jun 8 14:29:28 2015
Copyright (c) 1982, 2013, Oracle. All rights reserved.
SQL> conn / as sysdba
Connected.
SQL> alter system set encryption key authenticated by "test";
System altered.
创建成功,目录中wallet文件生成。
[oracle@NCR-Standby-Asm admin]$ pwd
/u02/app/oracle/product/11.2.0/dbhome_1/network/admin
[oracle@NCR-Standby-Asm admin]$ cd /u02/app/oracle/admin/sicsstb/WALLET
[oracle@NCR-Standby-Asm WALLET]$ ls -l
total 4
-rw-r--r-- 1 oracle asmadmin 2845 Jun 8 14:29 ewallet.p12
自此,大难临头。
3、ASM不可见
之后备份测试中,出现故障。
RMAN> backup database;
Starting backup at 08-JUN-15
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=72 device type=DISK
(篇幅原因,有省略……)
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on ORA_DISK_1 channel at 06/08/2015 14:44:35
ORA-19504: failed to create file "+RECO"
ORA-15055: unable to connect to ASM instance
ORA-12547: TNS:lost contact
+RECO Diskgroup不可见。检查一下是否ASM启动。
[grid@NCR-Standby-Asm ~]$ crsctl stat res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
ONLINE ONLINE ncr-standby-asm
ora.LISTENER.lsnr
ONLINE ONLINE ncr-standby-asm
ora.RECO.dg
ONLINE ONLINE ncr-standby-asm
ora.asm
ONLINE ONLINE ncr-standby-asm Started
ora.ons
OFFLINE OFFLINE ncr-standby-asm
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.cssd
1 ONLINE ONLINE ncr-standby-asm
ora.diskmon
1 OFFLINE OFFLINE
ora.evmd
1 ONLINE ONLINE ncr-standby-asm
ora.sicsstb.db
1 ONLINE INTERMEDIATE ncr-standby-asm Stuck Archiver
ASM启动,从grid以sysasm用户登录,发现ASM Diskgroup可见,没有mount问题故障。但是,从oracle用户登录之后,ASM DiskGroup全部不可见,v$disk_group视图返回零行记录。
强制关闭数据库之后,发现不能启动。
SQL> conn / as sysdba
Connected to an idle instance.
SQL> startup
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file ‘+DATA/sicsstb/spfilesicsstb.ora‘
ORA-17503: ksfdopn:10 Failed to open file +DATA/sicsstb/spfilesicsstb.ora
ORA-12547: TNS:lost contact
显然是Oracle Instance无法访问到ASM实例。从之前的情况,可以猜测是文件权限方面的问题。从grid用户启动asmca,可以看到两个DISKGROUP正常,但是从Oracle角度,不能看到Diskgroup。
经过查资料,发现问题就在笔者贸然的设置/u01和/u02的权限上。大部分文件权限是755或者775是没有问题的。但是对于两个$ORACLE_HOME/bin目录下的oracle文件,权限是有一个s权限的。
[root@NCR-Standby-Asm disks]# chmod 6751 /u01/app/grid/product/11.2.0/grid/bin/oracle
[root@NCR-Standby-Asm disks]# chmod 6751 /u02/app/oracle/product/11.2.0/dbhome_1/bin/oracle
权限修改成功。
[root@NCR-Standby-Asm bin]# ls -l | grep oracle
-rwsr-s--x 1 grid oinstall 210094583 May 25 15:28 oracle
-rwxrwxr-x 1 grid oinstall 210094583 May 25 15:28 oracleO
之后故障消失。
4、结论
这个问题,的确是让笔者出了一身冷汗。好在是实验测试环境,没有给生产造成影响。教训是对于一些不确定性的操作,不要贸然动手,搞清楚之后再进行处理。对于一些权限类型操作,不要贸然增加影响范围。原有环境可能有各种问题,但是起码可以运行。我们进行配置优化,出发点起码不能破坏原有的环境。
原文:http://blog.itpub.net/17203031/viewspace-1696033/