首页 > 其他 > 详细

“小过铸大错”——记录一次实验环境故障过程

时间:2015-06-13 02:15:08      阅读:330      评论:0      收藏:0      [点我收藏+]

 

从某种意义上说,IT运维管理是一项高风险行业。工作涉及领域广、专业化程度高,并且技术更新升级快。往往是我们还不完全了解一项技术特性的时候,一项新技术就已经跃跃欲试了。

当系统出现问题、发生变更配置,出现不熟悉的故障时,冷静下来搞清楚问题原因之后再下手操作是最稳妥的策略。同时,在下手之前要准备好“后手”,也就是退路。最好能有类似备份、镜像之类的以防万一。实际场景中,没有准备的入手,往往带来更大的祸事,最后小病成癌症。

本篇记录一下笔者进行一个实验环境调试的过程,从一个小错误,发展成ASM不可访问的大问题。

 

1、环境介绍

 

笔者使用Oracle 11g,具体版本为11.2.0.4

 

 

SQL> select * from v$version;

 

BANNER

--------------------------------------------------------------------------------

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production

PL/SQL Release 11.2.0.4.0 - Production

CORE    11.2.0.4.0 Production

 

TNS for Linux: Version 11.2.0.4.0 - Production

NLSRTL Version 11.2.0.4.0 – Production

 

 

底层使用ASM+GI组织资源。

 

 

--------------------------------------------------------------------------------

NAME     TARGET  STATE        SERVER    STATE_DETAILS      

--------------------------------------------------------------------------------

Local Resources

--------------------------------------------------------------------------------

ora.DATA.dg

               ONLINE  ONLINE       ncr-standby-asm 

ora.LISTENER.lsnr

               ONLINE  ONLINE       ncr-standby-asm  

ora.RECO.dg

               ONLINE  ONLINE       ncr-standby-asm  

ora.asm

               ONLINE  ONLINE       ncr-standby-asm          Started

ora.ons

               OFFLINE OFFLINE      ncr-standby-asm   

--------------------------------------------------------------------------------

Cluster Resources

--------------------------------------------------------------------------------

ora.cssd

      1        ONLINE ONLINE       ncr-standby-asm    

ora.diskmon

      1        OFFLINE OFFLINE                

ora.evmd

      1        ONLINE  ONLINE       ncr-standby-asm   

ora.sicsstb.db

      1        ONLINE  ONLINE       ncr-standby-asm          Open

 

 

2、创建Oracle Wallet失败

 

因为要试验Oracle加密机制,需要创建wallet文件。根据经验,需要显示的在sqlnet.ora中加入wallet文件所在位置的参数信息。

sqlnet.oraOracle Net Service的核心配置三文件之一。通常是作为控制参数配置文件使用。在Grid环境下,通常监听器程序是在$GRID_HOME目录结构下的。

 

 

[grid@NCR-Standby-Asm ~]$ which lsnrctl

/u01/app/grid/product/11.2.0/grid/bin/lsnrctl

[grid@NCR-Standby-Asm ~]$ lsnrctl status

 

LSNRCTL for Linux: Version 11.2.0.4.0 - Production on 10-JUN-2015 09:28:41

 

Copyright (c) 1991, 2013, Oracle.  All rights reserved.

 

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC1521)))

STATUS of the LISTENER

------------------------

Alias                     LISTENER

Version                   TNSLSNR for Linux: Version 11.2.0.4.0 - Production

Start Date                08-JUN-2015 14:20:23

Uptime                    1 days 19 hr. 8 min. 17 sec

Trace Level               off

Security                  ON: Local OS Authentication

SNMP                      OFF

Listener Parameter File   /u01/app/grid/product/11.2.0/grid/network/admin/listener.ora

Listener Log File         /u01/app/grid/diag/tnslsnr/NCR-Standby-Asm/listener/alert/log.xml

Listening Endpoints Summary...

  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1521)))

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=localhost)(PORT=1521)))

Services Summary...

Service "+ASM" has 1 instance(s).

  Instance "+ASM", status READY, has 1 handler(s) for this service...

Service "sicsstb" has 1 instance(s).

  Instance "sicsstb", status READY, has 1 handler(s) for this service...

Service "sicsstbXDB" has 1 instance(s).

  Instance "sicsstb", status READY, has 1 handler(s) for this service...

The command completed successfully

 

 

笔者想当然觉得这个参数配置应该是Grid下面的sqlnet.ora文件中设置。

 

 

[grid@NCR-Standby-Asm ~]$ cd /u01/app/grid/product/11.2.0/grid/network/admin/

[grid@NCR-Standby-Asm admin]$ ls -l

total 20

-rw-r--r-- 1 grid oinstall  432 May  5 09:26 listener.ora

-rw-r--r-- 1 grid oinstall  366 May  5 09:26 listener.ora.bak.ncr-standby-asm

drwxr-xr-x 2 grid oinstall 4096 May  5 09:19 samples

-rw-r--r-- 1 grid oinstall  381 Dec 17  2012 shrept.lst

-rw-r--r-- 1 grid oinstall  215 May  5 09:26 sqlnet.ora

 

[grid@NCR-Standby-Asm admin]$ cat sqlnet.ora

# sqlnet.ora Network Configuration File: /u01/app/grid/product/11.2.0/grid/network/admin/sqlnet.ora

# Generated by Oracle configuration tools.

 

NAMES.DIRECTORY_PATH= (TNSNAMES, EZCONNECT)

 

ADR_BASE = /u01/app/grid

 

ENCRYPTION_WALLET_LOCATION=(SOURCE=(METHOD=FILE)(METHOD_DATA=(DIRECTORY=/u01/app/grid)))

 

 

按照经验,重启启动监听器,加载配置文件。进入sqlplus创建wallet文件。

 

[oracle@NCR-Standby-Asm grid]$ sqlplus /nolog

 

SQL*Plus: Release 11.2.0.4.0 Production on Mon Jun 8 13:58:15 2015

 

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

 

SQL> conn / as sysdba

Connected.

SQL> alter system set encryption key authenticated by "test";  

alter system set encryption key authenticated by "test"

*

ERROR at line 1:

ORA-28368: cannot auto-create wallet

 

 

官方提示对ora-28368的解释如下:

 

 

[oracle@NCR-Standby-Asm ~]$ oerr ora 28368

28368, 0000, "cannot auto-create wallet"

// *Cause:  The database failed to auto create an Oracle wallet. The Oracle

//          process may not have proper file permissions or a wallet may

//          already exist.

// *Action: Confirm that proper directory permissions are granted to the Oracle

//          user and that neither an encrypted or obfuscated wallet exists in

//          the specified wallet location and try again

 

 

确认目录权限,理论上没有什么问题应该。

 

 

[oracle@NCR-Standby-Asm ~]$ cd /u01/app/grid

[oracle@NCR-Standby-Asm grid]$ ls -l

total 32

drwxr-x--- 3 grid oinstall 4096 May  5 09:26 admin

drwxr-xr-x 4 grid oinstall 4096 May  5 09:26 cfgtoollogs

drwxr-xr-x 2 grid oinstall 4096 May  5 09:28 checkpoints

drwxr-xr-x 2 grid oinstall 4096 May  5 09:23 Clusterware

drwxrwxr-x 5 grid oinstall 4096 Jun  3 17:21 diag

drwxr-xr-x 3 grid oinstall 4096 May  5 09:23 NCR-Standby-Asm

drwxr-xr-x 3 grid oinstall 4096 Jun  4 13:59 oradiag_grid

drwxr-xr-x 3 grid oinstall 4096 May  5 08:56 product

[oracle@NCR-Standby-Asm grid]$ cd ..

[oracle@NCR-Standby-Asm app]$ ls -l | grep grid

drwxr-xr-x 10 grid oinstall 4096 Jun  4 13:59 grid

drwxrwx---  6 grid oinstall 4096 May 26 09:11 oraInventory

 

 

用户oracle虽然没有直接owner权限,那么更换一个oracle用户可以访问到的目录。

 

 

 

[grid@NCR-Standby-Asm admin]$ cat sqlnet.ora

# sqlnet.ora Network Configuration File: /u01/app/grid/product/11.2.0/grid/network/admin/sqlnet.ora

# Generated by Oracle configuration tools.

 

NAMES.DIRECTORY_PATH= (TNSNAMES, EZCONNECT)

 

ADR_BASE = /u01/app/grid

 

ENCRYPTION_WALLET_LOCATION=(SOURCE=(METHOD=FILE)(METHOD_DATA=(DIRECTORY=/u02/app/oracle)))

 

SQL*Plus: Release 11.2.0.4.0 Production on Mon Jun 8 14:07:53 2015

 

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

 

SQL> conn / as sysdba

Connected.

 

SQL> alter system set encryption key authenticated by "test";

alter system set encryption key authenticated by "test"

*

ERROR at line 1:

ORA-28368: cannot auto-create wallet

 

 

更改为wallet推荐目录。

 

 

[grid@NCR-Standby-Asm admin]$ cat sqlnet.ora

# sqlnet.ora Network Configuration File: /u01/app/grid/product/11.2.0/grid/network/admin/sqlnet.ora

# Generated by Oracle configuration tools.

 

NAMES.DIRECTORY_PATH= (TNSNAMES, EZCONNECT)

 

ADR_BASE = /u01/app/grid

 

ENCRYPTION_WALLET_LOCATION=(SOURCE=(METHOD=FILE)(METHOD_DATA=(DIRECTORY=/u02/app/oracle/admin/sicsstb/WALLET)))

 

 

问题依然不行。之后,笔者尝试更新u01grid目录)和u02oracle目录)权限,连带更新子文件夹和连带文件,更新为755775,故障依然。(注意这个问题!!!)

之后检查MOS,在community社区论坛中一篇讨论:在gridrac结构下,设置wallet要在Oracle的目录中。死马当活马医,笔者实验将sqlnet.ora拷贝到$ORACLE_HOME目录下。

 

 

[grid@NCR-Standby-Asm admin]$ cp sqlnet.ora /u02/app/oracle/product/11.2.0/dbhome_1/network/admin

 

[oracle@NCR-Standby-Asm admin]$ sqlplus /nolog

 

SQL*Plus: Release 11.2.0.4.0 Production on Mon Jun 8 14:29:28 2015

 

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

 

SQL> conn / as sysdba

Connected.

SQL> alter system set encryption key authenticated by "test";

System altered.

 

 

创建成功,目录中wallet文件生成。

 

 

[oracle@NCR-Standby-Asm admin]$ pwd

/u02/app/oracle/product/11.2.0/dbhome_1/network/admin

 

[oracle@NCR-Standby-Asm admin]$ cd /u02/app/oracle/admin/sicsstb/WALLET

[oracle@NCR-Standby-Asm WALLET]$ ls -l

total 4

-rw-r--r-- 1 oracle asmadmin 2845 Jun  8 14:29 ewallet.p12

 

 

自此,大难临头。

 

3ASM不可见

 

之后备份测试中,出现故障。

 

 

RMAN> backup database;

 

Starting backup at 08-JUN-15

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=72 device type=DISK

(篇幅原因,有省略……

RMAN-00571: ===========================================================

RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============

RMAN-00571: ===========================================================

RMAN-03009: failure of backup command on ORA_DISK_1 channel at 06/08/2015 14:44:35

ORA-19504: failed to create file "+RECO"

ORA-15055: unable to connect to ASM instance

ORA-12547: TNS:lost contact

 

 

+RECO Diskgroup不可见。检查一下是否ASM启动。

 

 

[grid@NCR-Standby-Asm ~]$ crsctl stat res -t

--------------------------------------------------------------------------------

NAME           TARGET  STATE        SERVER                   STATE_DETAILS      

--------------------------------------------------------------------------------

Local Resources

--------------------------------------------------------------------------------

ora.DATA.dg

               ONLINE  ONLINE       ncr-standby-asm                              

ora.LISTENER.lsnr

               ONLINE  ONLINE       ncr-standby-asm                             

ora.RECO.dg

               ONLINE  ONLINE       ncr-standby-asm                             

ora.asm

               ONLINE  ONLINE       ncr-standby-asm          Started             

ora.ons

               OFFLINE OFFLINE      ncr-standby-asm                             

--------------------------------------------------------------------------------

Cluster Resources

--------------------------------------------------------------------------------

ora.cssd

      1        ONLINE  ONLINE       ncr-standby-asm                             

ora.diskmon

      1        OFFLINE OFFLINE                                                   

ora.evmd

      1        ONLINE  ONLINE       ncr-standby-asm                             

ora.sicsstb.db

      1        ONLINE  INTERMEDIATE ncr-standby-asm          Stuck Archiver 

 

 

ASM启动,从gridsysasm用户登录,发现ASM Diskgroup可见,没有mount问题故障。但是,从oracle用户登录之后,ASM DiskGroup全部不可见,v$disk_group视图返回零行记录。

强制关闭数据库之后,发现不能启动。

 

 

SQL> conn / as sysdba

Connected to an idle instance.

SQL> startup

ORA-01078: failure in processing system parameters

ORA-01565: error in identifying file ‘+DATA/sicsstb/spfilesicsstb.ora‘

ORA-17503: ksfdopn:10 Failed to open file +DATA/sicsstb/spfilesicsstb.ora

ORA-12547: TNS:lost contact

 

 

显然是Oracle Instance无法访问到ASM实例。从之前的情况,可以猜测是文件权限方面的问题。从grid用户启动asmca,可以看到两个DISKGROUP正常,但是从Oracle角度,不能看到Diskgroup

经过查资料,发现问题就在笔者贸然的设置/u01/u02的权限上。大部分文件权限是755或者775是没有问题的。但是对于两个$ORACLE_HOME/bin目录下的oracle文件,权限是有一个s权限的。

 

 

[root@NCR-Standby-Asm disks]# chmod 6751 /u01/app/grid/product/11.2.0/grid/bin/oracle

[root@NCR-Standby-Asm disks]# chmod 6751 /u02/app/oracle/product/11.2.0/dbhome_1/bin/oracle

 

 

权限修改成功。

 

 

[root@NCR-Standby-Asm bin]# ls -l | grep oracle

-rwsr-s--x 1 grid oinstall 210094583 May 25 15:28 oracle

-rwxrwxr-x 1 grid oinstall 210094583 May 25 15:28 oracleO

 

 

之后故障消失。

 

4、结论

 

这个问题,的确是让笔者出了一身冷汗。好在是实验测试环境,没有给生产造成影响。教训是对于一些不确定性的操作,不要贸然动手,搞清楚之后再进行处理。对于一些权限类型操作,不要贸然增加影响范围。原有环境可能有各种问题,但是起码可以运行。我们进行配置优化,出发点起码不能破坏原有的环境。


“小过铸大错”——记录一次实验环境故障过程

原文:http://blog.itpub.net/17203031/viewspace-1696033/

(0)
(0)
   
举报
评论 一句话评论(0
关于我们 - 联系我们 - 留言反馈 - 联系我们:wmxa8@hotmail.com
© 2014 bubuko.com 版权所有
打开技术之扣,分享程序人生!