September 12, 2012

CSSD terminated from clssnmvDiskPingMonitorThread without disk timeout countdown

The clusterware on one node suddenly terminated.
Below are the messages from the cluster's alert.log:

2012-09-11 11:41:30.328
[ctssd(22428)]CRS-2409:The clock on host node8 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2012-09-11 12:30:03.122
[cssd(21061)]CRS-1606:The number of voting files available, 0, is less than the minimum number of voting files required, 1, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /prod/grid/11.2.0/grid/log/node8/cssd/ocssd.log
2012-09-11 12:30:03.123
[cssd(21061)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /prod/grid/11.2.0/grid/log/node8/cssd/ocssd.log
2012-09-11 12:30:03.233
[cssd(21061)]CRS-1652:Starting clean up of CRSD resources.

Let's check ocssd.log:
2012-09-11 12:30:02.543: [    CSSD][1113450816]clssnmSendingThread: sending status msg to all nodes
2012-09-11 12:30:02.543: [    CSSD][1113450816]clssnmSendingThread: sent 5 status msgs to all nodes
2012-09-11 12:30:03.122: [    CSSD][1082489152](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 1 configured voting disks available, need 1
2012-09-11 12:30:03.123: [    CSSD][1082489152]###################################
2012-09-11 12:30:03.123: [    CSSD][1082489152]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread
2012-09-11 12:30:03.123: [    CSSD][1082489152]###################################
2012-09-11 12:30:03.123: [    CSSD][1082489152](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally

So it looks like an I/O problem with the voting disk.
But after checking, we found no abnormal or error messages at either the OS level or the storage side.

Then I reviewed ocssd.log again and, to my surprise, found no long disk timeout countdown in the log.

We know that if ocssd fails to reach the voting disk, it starts a 200-second countdown (the long disk timeout). Only if the voting disk is still unavailable after 200 seconds does ocssd terminate itself.
The countdown information should look like:
clssscMonitorThreads clssnmvDiskPingThread not scheduled for 16020 msecs
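
A quick way to check a log for this is to scan it for the countdown line and the abort line. The following is a minimal sketch in Python, not an Oracle tool: the function name `check_abort` is hypothetical, the regular expressions are taken only from the log lines quoted in this post (other versions may word them differently), and the 200-second window is the timeout mentioned above.

```python
import re
from datetime import datetime, timedelta

# Patterns are assumptions based on the ocssd.log lines quoted in this post.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): ")
COUNTDOWN = re.compile(r"clssnmvDiskPingThread not scheduled for (\d+) msecs")
ABORT = re.compile(r"clssscExit: CSSD aborting from thread (\w+)")


def check_abort(lines, window_seconds=200):
    """Find the first CSSD abort and the countdown lines that preceded it.

    Returns (abort_time, thread_name, countdown_hits) or None if no abort
    line is found. countdown_hits is a list of (timestamp, msecs) tuples
    logged within window_seconds before the abort.
    """
    hits = []
    for line in lines:
        m = TS.match(line)
        if not m:
            continue  # skip lines without a leading timestamp
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        countdown = COUNTDOWN.search(line)
        if countdown:
            hits.append((ts, int(countdown.group(1))))
            continue
        abort = ABORT.search(line)
        if abort:
            start = ts - timedelta(seconds=window_seconds)
            recent = [h for h in hits if h[0] >= start]
            return ts, abort.group(1), recent
    return None
```

Passing the lines of ocssd.log (for example `check_abort(open(path))`) and getting back an empty countdown list would mean CSSD aborted without any such countdown being logged in the preceding 200 seconds.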

But in our case there is no such countdown; ocssd terminated from clssnmvDiskPingMonitorThread all of a sudden.
This looks like a bug rather than a voting disk I/O issue. I will raise an SR with Oracle Support for further checking.

3 Comments:

Kane Zhang said...

Oracle Support confirmed it is an unpublished bug: 13869978

Anonymous said...

We faced a similar issue, which also pointed to a bug with a workaround posted on Metalink. One of the RAC instances restarted, and within a couple of minutes the DB instance was up and running. Do you think it will restart any cluster resources besides the DB?

Kane Zhang said...

Yes.
In the bug described in this post, CSSD terminates, which means the whole cluster stack on this node terminates and restarts automatically.
RAC relies on the cluster layer, so all RAC resources (DB, LISTENER, VIP, etc.) are also terminated and restarted.
