A two-node RAC cluster suddenly began a reconfiguration, and very soon afterwards the first instance terminated with this error:
LMON (ospid: 11863): terminating the instance due to error 481
So we have two questions:
1. Why did the reconfiguration suddenly begin?
2. Why did the first instance terminate during the reconfiguration?
1: Why did the reconfiguration suddenly begin?
From the first node's LMON trace file:
* DRM RCFG called (swin 0)
*** 2012-10-22 01:23:17.277
CGS recovery timeout = 85 sec
Begin DRM(27911) (swin 0)
* drm quiesce
*** 2012-10-22 01:25:28.038
* Request pseudo reconfig due to drm quiesce hang
*** 2012-10-22 01:25:28.051
kjxgmrcfg: Reconfiguration started, type 6
CGS/IMR TIMEOUTS:
CSS recovery timeout = 31 sec (Total CSS waittime = 65)
IMR Reconfig timeout = 75 sec
CGS rcfg timeout = 85 sec
kjxgmcs: Setting state to 94 0.
...............
From the trace above we can see that the reconfiguration began because the DRM (Dynamic Resource Mastering, i.e. dynamic block remastering) quiesce step hung and timed out, so LMON requested a pseudo reconfiguration.
We can also see that the reconfiguration type is 6.
We often see types 1, 2 and 3 as the reconfiguration reason, but type 6 is not a common one.
After checking with Oracle Support, they confirmed that type 6 means DRM.
So the reason for the reconfiguration is now clear: a DRM quiesce timeout.
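To get a feel for how long the DRM quiesce was stuck before LMON gave up, we can subtract the two `***` timestamps from the trace above. This is only a rough sketch: the helper name `elapsed_seconds` is made up for illustration, and the first timestamp marks when that group of trace lines was written, so the real quiesce start may be slightly later.

```python
from datetime import datetime

# Timestamp layout used by the "***" lines in the LMON trace.
FMT = "%Y-%m-%d %H:%M:%S.%f"

def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two trace timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()

# Timestamp printed just before "Begin DRM(27911)" / "* drm quiesce".
drm_begin = "2012-10-22 01:23:17.277"
# Timestamp printed just before "Request pseudo reconfig due to drm quiesce hang".
pseudo_rcfg = "2012-10-22 01:25:28.038"

print(f"{elapsed_seconds(drm_begin, pseudo_rcfg):.3f} s")
```

So the quiesce made no progress for a little over two minutes before the pseudo reconfiguration was requested.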
Next, let's see why the first instance terminated during the reconfiguration.
Since this reconfiguration was triggered by neither an interconnect failure nor a controlfile heartbeat issue, an instance eviction is not absolutely required.
The LMON trace below shows why LMON terminated the instance:
*** 2012-10-22 01:25:29.650
All grantable enqueues granted
2012-10-22 01:25:29.650271 : * Begin lmon rcfg step KJGA_RCFG_PCMREPLAY
*** 2012-10-22 01:30:56.000
2012-10-22 01:30:56.000338 : * kjfclmsync: waited 327 secs for lmses to finish parallel rcfg work, terminating instance
kjzduptcctx: Notifying DIAG for crash event
...............
From the kjfclmsync line it is clear that during the reconfiguration the LMS processes failed to finish their work in time (they may have hung, been blocked on some resource, or hit a bug), so LMON timed out waiting and terminated the instance.
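As a cross-check, the "waited 327 secs" in the message can be compared with the two microsecond timestamps in the trace. The snippet below is a minimal sketch, not an Oracle tool: the regular expression and the helper `stamped_events` are assumptions that only match the line layout seen in this excerpt.

```python
import re
from datetime import datetime

# The two timestamped lines copied from the LMON trace above.
TRACE = """\
2012-10-22 01:25:29.650271 : * Begin lmon rcfg step KJGA_RCFG_PCMREPLAY
2012-10-22 01:30:56.000338 : * kjfclmsync: waited 327 secs for lmses to finish parallel rcfg work, terminating instance
"""

# Matches "<date> <time with microseconds> : * <message>".
STAMPED = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) : \* (.*)$")

def stamped_events(text):
    """Return (timestamp, message) pairs for every timestamped trace line."""
    events = []
    for line in text.splitlines():
        m = STAMPED.match(line)
        if m:
            when = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
            events.append((when, m.group(2)))
    return events

events = stamped_events(TRACE)
# Time between entering the PCMREPLAY step and LMON giving up.
gap = (events[-1][0] - events[0][0]).total_seconds()
print(f"{gap:.6f} s")
```

The gap is a bit over 326 seconds, which matches the 327 seconds reported by kjfclmsync: LMON spent the whole time in the KJGA_RCFG_PCMREPLAY step waiting for LMS.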
The next step would be to check the LMS process trace files. However, because max_dump_file_size is set for this DB, the LMS trace file had already reached its maximum size and had not been updated for a long time, so there is no LMS information for us to refer to.
The issue can be avoided by setting the parameter below, which disables the read-mostly locking part of DRM:
_gc_read_mostly_locking=FALSE