October 29, 2012

LMON (ospid: 11863): terminating the instance due to error 481

A two-node RAC suddenly began a reconfiguration, and very soon the first instance was terminated with the error:
LMON (ospid: 11863): terminating the instance due to error 481

So we have two questions:
1. Why did the reconfiguration suddenly begin?
2. Why was the first instance terminated during the reconfiguration?

1. Why did the reconfiguration suddenly begin?
From the first node's LMON trace file:

* DRM RCFG called (swin 0) 

*** 2012-10-22 01:23:17.277
  CGS recovery timeout = 85 sec
Begin DRM(27911) (swin 0)
* drm quiesce

*** 2012-10-22 01:25:28.038
* Request pseudo reconfig due to drm quiesce hang 

*** 2012-10-22 01:25:28.051
kjxgmrcfg: Reconfiguration started, type 6
CGS/IMR TIMEOUTS:
  CSS recovery timeout = 31 sec (Total CSS waittime = 65)
  IMR Reconfig timeout = 75 sec
  CGS rcfg timeout = 85 sec
kjxgmcs: Setting state to 94 0.
...............

From the above we can see that the DRM (Dynamic Resource Mastering) quiesce step hung and timed out, hence the pseudo reconfiguration began.
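
As a side note, a quick way to locate these markers in the LMON trace is a simple grep; the diag path below is an assumption and depends on your diagnostic_dest layout:

grep -nE "Begin DRM|drm quiesce|Reconfiguration started" \
    $ORACLE_BASE/diag/rdbms/<dbname>/<sid>/trace/<sid>_lmon_*.trc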

We can also see that the reconfiguration type is 6.
We often see types 1, 2, and 3 as reconfiguration reasons, but type 6 is not a common one.
After checking with Oracle Support, they confirmed that type 6 means DRM.

OK. So now the cause of the reconfiguration is clear: a DRM quiesce timeout.
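
Related DRM activity, including time spent in the quiesce step, can also be checked from the v$dynamic_remaster_stats view on 11g. A quick look, assuming SYSDBA access:

sqlplus / as sysdba <<'EOF'
-- per-instance DRM statistics, including quiesce/freeze/sync times
SELECT * FROM v$dynamic_remaster_stats;
EOF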

Next let's see why the first instance was terminated during the reconfiguration step.
Since this reconfiguration was triggered neither by an interconnect failure nor by a controlfile heartbeat issue, instance eviction was not absolutely required.

Checking the LMON trace below, we can see the reason why LMON terminated the instance:
*** 2012-10-22 01:25:29.650
 All grantable enqueues granted
2012-10-22 01:25:29.650271 : * Begin lmon rcfg step KJGA_RCFG_PCMREPLAY 

*** 2012-10-22 01:30:56.000
2012-10-22 01:30:56.000338 : * kjfclmsync: waited 327 secs for lmses to finish parallel rcfg work, terminating instance 
kjzduptcctx: Notifying DIAG for crash event
...............

From the kjfclmsync line above it is clear that during the reconfiguration the LMS processes failed to finish their parallel reconfiguration work in time (maybe hung, blocked on some resource, or hitting a bug), hence LMON terminated the instance.

Next we should check the LMS process trace file. But since we had set max_dump_file_size for the DB, and the LMS trace file had already reached its maximum size and had not been updated for a long time, there was no LMS information for us to refer to.
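
To avoid losing LMS diagnostics next time, the trace size limit can be raised. A minimal sketch, assuming an spfile is in use and a SYSDBA connection (UNLIMITED is the documented value; use a bounded size if disk space is a concern):

sqlplus / as sysdba <<'EOF'
-- raise the per-process trace file size cap on all instances
ALTER SYSTEM SET max_dump_file_size='UNLIMITED' SCOPE=BOTH SID='*';
EOF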

The issue can be avoided by setting the below parameter to disable DRM:
_gc_read_mostly_locking=FALSE
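
For reference, a minimal sketch of setting it cluster-wide (underscore parameters must be double-quoted, an spfile is assumed, and a restart is needed for it to take effect; as always, confirm with Oracle Support before touching hidden parameters):

sqlplus / as sysdba <<'EOF'
-- disable DRM read-mostly locking on all instances; effective after restart
ALTER SYSTEM SET "_gc_read_mostly_locking"=FALSE SCOPE=SPFILE SID='*';
EOF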


October 11, 2012

A few troubles when doing PSU patching on AIX RAC

While helping a friend patch an AIX RAC server, I ran into a few troubles; I record them here.

1. opatch failed, reporting:

Required amount of space(23428.225MB) is not available.
UtilSession failed: 
Prerequisite check "CheckSystemSpace" failed.
Log file location: /oracle/app/11.2.0/grid/cfgtoollogs/opatch/opatch2012-10-07_18-26-07PM_1.log
OPatch failed with error code 73
The mount point had 20 GB of free space, but we still got the error.
For this known issue we can use the below parameter to make opatch skip the space check:
OPatch.SKIP_VERIFY_SPACE=true
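
For example, the property can be passed directly on the opatch command line (the patch location below is illustrative):

cd /oracle/11.2.0.3.3PSU/13919095
$ORACLE_HOME/OPatch/opatch apply OPatch.SKIP_VERIFY_SPACE=true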

2. opatch failed and reported:
Patching component oracle.rdbms, 11.2.0.3.0...

Copy failed from '/oracle/11.2.0.3.3PSU/13919095/custom/server/13919095/files/lib/libclsra11.so' to '/oracle/app/11.2.0/grid/lib/libclsra11.so'...
Please verify all applications associated with the Oracle Home '/oracle/app/11.2.0/grid' are shut down. If this is AIX, please perform solution documented in Note 739963.1 on https://myoraclesupport.oracle.com.

Do you want to retry copying the file? [y|n]

opatch reports a file is in use, while the fuser command shows no process holding it.
This is a known issue on AIX (documented in Note 739963.1), and the fix is to run the below command as root:

/usr/sbin/slibclean
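
To double-check that the AIX loader has really released the library, something like the below can be used (genkld lists the shared objects cached by the kernel; the library name is taken from the failing copy above):

genkld | grep clsra     # no output means the library is no longer cached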

4. opatch failed and reported:
.storage is locked by another opatch session.
This is because another opatch session had just terminated (due to the third issue) and left its lock file there.
The solution is to remove the *lock* file under $ORACLE_HOME/.storage.
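
A sketch of the cleanup (the exact lock file name may vary, hence the wildcard; make sure no opatch session is actually running first):

ls -l $ORACLE_HOME/.storage/          # inspect what is there first
rm $ORACLE_HOME/.storage/*lock*       # remove the stale opatch lock file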

5. Finally the patch succeeded, but starting the grid service on the second node failed with the below error:
2012-10-07 13:23:18.069: [    AGFW][5912] {0:0:31} Created alert :  (:CRSAGF00123:) :  Failed to start the agent process:  /oracle/grid/11.2.0.3/bin/oraagent Category: -2 Operation: setpcred Loc:  spawnproc26 OS error: 2 Other : child couldn't setpcred

On Metalink there is only one bug report with the same error message and the same case as ours: CRS failed to start up after patching.

No workaround.


