June 11, 2012

CSSD died with no error during GM grock activity

Recently I observed a case where a node crashed and rebooted: while clusterware was starting on the node, its CSSD suddenly died during GM (Group Management) grock activity without logging any error, and the server rebooted as a result.

CSSD shows no errors:

2012-06-08 09:18:10.495: [ CSSD][1106151744]clssgmTestSetLastGrockUpdate: grock(CLSN.oratab), updateseq(0) msgseq(1), lastupdt<(nil)>, ignoreseq(0)
2012-06-08 09:18:10.495: [ CSSD][1106151744]clssgmAddMember: granted member(0) flags(0x2) node(3) grock (0x2aaab0063ac0/CLSN.oratab)
2012-06-08 09:18:10.495: [ CSSD][1106151744]clssgmCommonAddMember: global lock grock CLSN.oratab member(0/Remote) node(3) flags 0x2 0xb00c1df0
2012-06-08 09:18:10.495: [ CSSD][1106151744]clssgmHandleGrockRcfgUpdate: grock(CLSN.oratab), updateseq(1), status(0), sendresp(1)
2012-06-08 09:18:10.533: [ CSSD][1106151744]clssgmTestSetLastGrockUpdate: grock(CLSN.oratab), updateseq(1) msgseq(2), lastupdt<0x2aaab0054dc0>, ignoreseq(0)
2012-06-08 09:18:10.533: [ CSSD][1106151744]clssgmRemoveMember: grock CLSN.oratab, member number 0 (0x2aaab00c1df0) node number 3 state 0x0 grock type 3
2012-06-08 09:18:10.533: [ CSSD][1106151744]clssgmResetGrock: grock(CLSN.oratab) reset=1
2012-06-08 09:18:10.533: [ CSSD][1106151744]clssgmHandleGrockRcfgUpdate: grock(CLSN.oratab), updateseq(2), status(0), sendresp(1)
2012-06-08 09:18:10.535: [ CSSD][1106151744]clssgmTestSetLastGrockUpdate: grock(CLSN.oratab), updateseq(2) msgseq(3), lastupdt<0x2aaab0097f50>, ignoreseq(0)
2012-06-08 09:18:10.535: [ CSSD][1106151744]clssgmDeleteGrock: (0x2aaab0063ac0) grock(CLSN.oratab) deleted
<<----CSSD died here: it was doing GM grock maintenance when it suddenly died, and the server then rebooted; nothing abnormal is logged---->>
2012-06-08 09:28:30.052: [ CSSD][684039456]clsu_load_ENV_levels: Module = CSSD, LogLevel = 2, TraceLevel = 0
2012-06-08 09:28:30.052: [ CSSD][684039456]clsu_load_ENV_levels: Module = GIPCNM, LogLevel = 2, TraceLevel = 0
2012-06-08 09:28:30.052: [ CSSD][684039456]clsu_load_ENV_levels: Module = GIPCGM, LogLevel = 2, TraceLevel = 0
2012-06-08 09:28:30.052: [ CSSD][684039456]clsu_load_ENV_levels: Module = GIPCCM, LogLevel = 2, TraceLevel = 0
2012-06-08 09:28:30.052: [ CSSD][684039456]clsu_load_ENV_levels: Module = CLSF, LogLevel = 0, TraceLevel = 0
2012-06-08 09:28:30.052: [ CSSD][684039456]clsu_load_ENV_levels: Module = SKGFD, LogLevel = 0, TraceLevel = 0
2012-06-08 09:28:30.052: [ CSSD][684039456]clsu_load_ENV_levels: Module = GPNP, LogLevel = 1, TraceLevel = 0
2012-06-08 09:28:30.052: [ CSSD][684039456]clsu_load_ENV_levels: Module = OLR, LogLevel = 0, TraceLevel = 0
[ CSSD][684039456]clsugetconf : Configuration type [4].
2012-06-08 09:28:30.052: [ CSSD][684039456]clssscmain: Starting CSS daemon, version 11.2.0.2.0, in (clustered) mode with uniqueness value 1339162110
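
In a longer ocssd log, the point where the daemon stopped can be found by looking for the largest jump between consecutive timestamps. The following is only a minimal sketch, assuming every relevant line starts with a `YYYY-MM-DD HH:MM:SS.mmm:` timestamp as in the excerpt above. The names `parse_ts` and `largest_gap` are hypothetical helpers, not part of any Oracle tool.

```python
import re
from datetime import datetime

# Matches the leading timestamp of a clusterware log line, e.g.
# "2012-06-08 09:18:10.535: [ CSSD]..."
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}):")


def parse_ts(line):
    """Return the line's timestamp, or None for lines without one."""
    m = TS_RE.match(line)
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")


def largest_gap(lines):
    """Return (gap_seconds, line_before, line_after) for the biggest
    jump between consecutive timestamped lines, or None if there are
    fewer than two timestamped lines."""
    prev_ts = prev_line = None
    best = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # e.g. the untimestamped clsugetconf line
        if prev_ts is not None:
            gap = (ts - prev_ts).total_seconds()
            if best is None or gap > best[0]:
                best = (gap, prev_line, line)
        prev_ts, prev_line = ts, line
    return best


# Three lines copied from the excerpt above.
sample = [
    "2012-06-08 09:18:10.533: [ CSSD][1106151744]clssgmResetGrock: grock(CLSN.oratab) reset=1",
    "2012-06-08 09:18:10.535: [ CSSD][1106151744]clssgmDeleteGrock: (0x2aaab0063ac0) grock(CLSN.oratab) deleted",
    "2012-06-08 09:28:30.052: [ CSSD][684039456]clsu_load_ENV_levels: Module = CSSD, LogLevel = 2, TraceLevel = 0",
]
gap, before, after = largest_gap(sample)
print(f"{gap:.3f}s of silence after: {before}")
```

On the sample lines this points at the clssgmDeleteGrock entry as the last thing written before the daemon went silent.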


About two seconds later, CSSDMONITOR detected that CSSD had died and called sync (rebooting the node). The following is from the CSSDMONITOR log:
2012-06-08 09:18:12.479: [ CSSCLNT][1116543296]clsssRecvMsg: got a disconnect from the server while waiting for message type 27
2012-06-08 09:18:12.479: [ CSSCLNT][1113389376]clsssRecvMsg: got a disconnect from the server while waiting for message type 22
2012-06-08 09:18:12.479: [ USRTHRD][1113389376] clsnwork_queue: posting worker thread
2012-06-08 09:18:12.479: [ USRTHRD][1113389376] clsnpollmsg_main: exiting check loop
2012-06-08 09:18:12.479: [GIPCXCPT][1116543296]gipcInternalSend: connection not valid for send operation endp 0x86da430 [0000000000000162] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=031d7676-63f8ad6d-9852))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_cihcissdb759_)(GIPCID=63f8ad6d-031d7676-9884))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 9884, flags 0x3861e, usrFlags 0x20010 }, ret gipcretConnectionLost (12)
2012-06-08 09:18:12.479: [GIPCXCPT][1116543296]gipcSendSyncF [clsssServerRPC : clsss.c : 6271]: EXCEPTION[ ret gipcretConnectionLost (12) ] failed to send on endp 0x86da430 [0000000000000162] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=031d7676-63f8ad6d-9852))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_cihcissdb759_)(GIPCID=63f8ad6d-031d7676-9884))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 9884, flags 0x3861e, usrFlags 0x20010 }, addr 0000000000000000, buf 0x428d0d80, len 80, flags 0x8000000
2012-06-08 09:18:12.479: [ CSSCLNT][1116543296]clsssServerRPC: send failed with err 12, msg type 7
2012-06-08 09:18:12.479: [ CSSCLNT][1116543296]clsssCommonClientExit: RPC failure, rc 3
2012-06-08 09:18:12.479: [ CSSCLNT][1094465856]clsssRecvMsg: got a disconnect from the server while waiting for message type 1
2012-06-08 09:18:12.479: [ CSSCLNT][1094465856]clssgsGroupGetStatus: communications failed (0/3/-1)
2012-06-08 09:18:12.479: [ CSSCLNT][1094465856]clssgsGroupGetStatus: returning 8
2012-06-08 09:18:12.479: [ USRTHRD][1114966336] clsnwork_process_work: calling sync

2012-06-08 09:18:12.479: [ USRTHRD][1094465856] clsnomon_status: Communications failure with CSS detected. Waiting for sync to complete...

I checked the OS log; no errors were generated during that period:

Jun 8 09:18:02 node759 snmpd[21008]: Connection from UDP: [127.0.0.1 ]:27643
Jun 8 09:18:02 node759 snmpd[21008]: Received SNMP packet(s) from UDP: [127.0.0.1]:27643
Jun 8 09:18:02 node759 snmpd[21008]: Connection from UDP: [127.0.0.1]:27643
Jun 8 09:18:02 node759 snmpd[21008]: Connection from UDP: [127.0.0.1]:26972
Jun 8 09:18:02 node759 snmpd[21008]: Received SNMP packet(s) from UDP: [127.0.0.1]:26972
Jun 8 09:18:02 node759 snmpd[21008]: Connection from UDP: [127.0.0.1]:17368
Jun 8 09:18:02 node759 snmpd[21008]: Received SNMP packet(s) from UDP: [127.0.0.1]:17368
<<<----the node suddenly rebooted here, with nothing logged----->>>
Jun 8 09:23:03 node759 syslogd 1.4.1: restart.
Jun 8 09:23:03 node759 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Jun 8 09:23:03 node759 kernel: g enabled (TM1)
Jun 8 09:23:03 node759 kernel: Intel(R) Xeon(R) CPU X7550 @ 2.00GHz stepping 06
Jun 8 09:23:03 node759 kernel: CPU 3: Syncing TSC to CPU 0.
Jun 8 09:23:03 node759 kernel: CPU 3: synchronized TSC with CPU 0 (last diff -6 cycles, maxerr 477 cycles)
Jun 8 09:23:03 node759 kernel: SMP alternatives: switching to SMP code
Jun 8 09:23:03 node759 kernel: Booting processor 4/32 APIC 0x10
Jun 8 09:23:03 node759 kernel: Initializing CPU#4
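
To see how the three logs line up, the key timestamps can be merged into one timeline. This is a rough sketch rather than a real log parser: the event list is typed in by hand from the excerpts in this post, the source labels (`ocssd`, `cssdmonitor`, `syslog`) are my own, and the year for the syslog lines is an assumption because syslog does not record it.

```python
from datetime import datetime

YEAR = 2012  # assumption: syslog lines carry no year, so it is supplied here


def clusterware_ts(text):
    """Parse a clusterware-style timestamp: 2012-06-08 09:18:10.535"""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def syslog_ts(text, year=YEAR):
    """Parse a syslog-style timestamp: Jun 8 09:18:02 (English month names)"""
    return datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S")


# (timestamp, source, note) taken from the three excerpts in this post.
events = [
    (clusterware_ts("2012-06-08 09:18:10.535"), "ocssd", "last line: clssgmDeleteGrock"),
    (clusterware_ts("2012-06-08 09:18:12.479"), "cssdmonitor", "disconnect from CSS, calling sync"),
    (syslog_ts("Jun 8 09:18:02"), "syslog", "last snmpd line before the reboot"),
    (syslog_ts("Jun 8 09:23:03"), "syslog", "syslogd restart"),
    (clusterware_ts("2012-06-08 09:28:30.052"), "ocssd", "Starting CSS daemon"),
]

# Sort by time and print each event with its offset from the first one.
events.sort(key=lambda e: e[0])
start = events[0][0]
for ts, source, note in events:
    offset = (ts - start).total_seconds()
    print(f"{ts:%H:%M:%S.%f} +{offset:8.3f}s {source:<12} {note}")
```

Sorted this way, the last CSSD line and the CSSDMONITOR disconnect sit 1.944 seconds apart, and both fall in the silent window of the OS log between the last snmpd entry and the syslogd restart.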


Searching online, I found only one bug report describing a similar issue:
Bug 13954099: CSSD DIES SUDDENLY WITHOUT ANY ERRORS

As in our case, the cluster version in the bug report is 11.2.0.2, and CSSD died during GM grock activity without any errors.
CSSDMONITOR then rebooted the node after finding that CSSD had died.

The problem is not reproducible: the next time we brought up the clusterware on the node, it succeeded, and the issue has not happened again.

No patch is available yet.
