June 2, 2012

What will happen if we hang the CSSD process in RAC?

In 11gR2 we have many new background processes, including two CSSD-related processes: cssdmonitor and cssdagent. The cssdagent process is in charge of respawning ocssd.bin.

Both of these processes are in charge of monitoring the CSSD state and watching for system hang problems.

[root@node1 node1]# ps -ef|grep cssd
root 479  1 0 00:30 ? 00:00:01 /prod/grid/app/11.2.0/grid/bin/cssdmonitor
root 471  1 0 00:30 ? 00:00:01 /prod/grid/app/11.2.0/grid/bin/cssdagent
grid 568  1 0 00:30 ? 00:00:04 /prod/grid/app/11.2.0/grid/bin/ocssd.bin
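
Both of them are also registered as resources of the ohasd stack. A quick way to confirm that (a generic check, not taken from the original test) is crsctl; it should list ora.cssd and ora.cssdmonitor with TARGET and STATE set to ONLINE:

[root@node1 ~]# /prod/grid/app/11.2.0/grid/bin/crsctl stat res -t -init | grep cssd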


Let's see what will happen if we hang the ocssd.bin process:
[root@node1 ~]# kill -SIGSTOP 568
[root@node1 ~]# kill -SIGSTOP 568
[root@node1 ~]# kill -SIGSTOP 568
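
A quick sanity check (a generic Linux check, not part of the original test): after SIGSTOP the process should show up in the stopped (T) state, and sending SIGCONT would have resumed it if we wanted to abort the experiment before the agent's limit was reached. Output below is illustrative:

[root@node1 ~]# ps -o pid,stat,cmd -p 568
  PID STAT CMD
  568 T    /prod/grid/app/11.2.0/grid/bin/ocssd.bin
[root@node1 ~]# # kill -SIGCONT 568   would wake ocssd.bin up again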


After about half a minute the node rebooted. Let's check the logs. We can find the information below in alert$hostname.log:
2012-05-27 09:39:01.558
[ohasd(4545)]CRS-8011:reboot advisory message from host: node1, component: mo224552, with time stamp: L-2012-02-05-22:51:26.126
[ohasd(4545)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-05-27 09:39:01.634
[ohasd(4545)]CRS-8011:reboot advisory message from host: node1, component: ag014510, with time stamp: L-2012-05-26-05:35:13.032
[ohasd(4545)]CRS-8013:reboot advisory message text: Rebooting after limit 28500 exceeded; disk timeout 28180, network timeout 28500, last heartbeat from CSSD at epoch seconds 1337981671.618, 28541 milliseconds ago based on invariant clock value of 57493073
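
The 28500 ms limit in this message is just under the default 30-second CSS misscount, presumably leaving a margin for performing the reboot itself. If you want to see the value configured on your own cluster (again, a generic check rather than output from this test), crsctl can report it; on a default 11.2 installation it is 30 seconds:

[root@node1 ~]# crsctl get css misscount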


Below is the information from the cssdagent and cssdmonitor logs:
2012-05-26 05:34:53.160: [ USRTHRD][2915036048] clsnproc_needreboot: Impending reboot at 50% of limit 28500; disk timeout 28180, network timeout 28500, last heartbeat from CSSD at epoch seconds 1337981671.618, 14871 milliseconds ago based on invariant clock 57493073; now polling at 100 ms
2012-05-26 05:35:02.748: [ USRTHRD][2915036048] clsnproc_needreboot: Impending reboot at 75% of limit 28500; disk timeout 28180, network timeout 28500, last heartbeat from CSSD at epoch seconds 1337981671.618, 21401 milliseconds ago based on invariant clock 57493073; now polling at 100 ms
2012-05-26 05:35:08.907: [ USRTHRD][2915036048] clsnproc_needreboot: Impending reboot at 90% of limit 28500; disk timeout 28180, network timeout 28500, last heartbeat from CSSD at epoch seconds 1337981671.618, 25681 milliseconds ago based on invariant clock 57493073; now polling at 100 ms
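
The percentages line up with the timestamps in these messages: 50% of 28500 ms is 14250 ms and the first warning was written 14871 ms after the last heartbeat, 75% is 21375 ms versus 21401 ms logged, and 90% is 25650 ms versus 25681 ms logged, with the agent polling every 100 ms as the deadline approaches.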

Since CSSD itself was hung, its own log contains no further information after the hang:
2012-05-26 05:34:25.052: [ CSSD][2894060432]clssnmSendingThread: sending status msg to all nodes
2012-05-26 05:34:25.052: [ CSSD][2894060432]clssnmSendingThread: sent 4 status msgs to all nodes
2012-05-26 05:34:30.525: [ CSSD][2894060432]clssnmSendingThread: sending status msg to all nodes
2012-05-26 05:34:30.525: [ CSSD][2894060432]clssnmSendingThread: sent 4 status msgs to all nodes
----HERE REBOOTED----

2012-05-27 09:39:24.030: [ CSSD][3046872768]clssscmain: Starting CSS daemon, version 11.2.0.1.0, in (clustered) mode with uniqueness value 1306460363
2012-05-27 09:39:24.032: [ CSSD][3046872768]clssscmain: Environment is production
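
Once the node is back up, it is worth confirming that the whole stack restarted cleanly (generic commands, not output captured from this test); both should report that the High Availability Services, CRS, CSS and EVM daemons are online:

[root@node1 ~]# crsctl check crs
[root@node1 ~]# crsctl check cluster -all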

