July 29, 2012

Interconnect failed due to switch issue: ping got duplicate response

On a two-node cluster, the clusterware on the second node suddenly terminated and failed to rejoin the cluster. Below are the error messages from alert_node2.log:

2012-07-27 03:15:35.090
[cssd(23234)]CRS-1612:Network communication with node server001 (1) missing for 50% of timeout interval.  Removal of this node from cluster in 14.120 seconds
2012-07-27 03:15:42.105
[cssd(23234)]CRS-1611:Network communication with node server001 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 7.110 seconds
2012-07-27 03:15:47.115
[cssd(23234)]CRS-1610:Network communication with node server001 (1) missing for 90% of timeout interval.  Removal of this node from cluster in 2.100 seconds
2012-07-27 03:15:49.213
[cssd(23234)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /d001/product/oracle/11.2.0.2/grid/log/server002/cssd/ocssd.log.
2012-07-27 03:15:49.213
[cssd(23234)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /d001/product/oracle/11.2.0.2/grid/log/server002/cssd/ocssd.log
2012-07-27 03:15:49.269
[cssd(23234)]CRS-1603:CSSD on node server002 shutdown by user.
2012-07-27 03:15:49.371
[cssd(23234)]CRS-1660:The CSS daemon shutdown has completed
2012-07-27 03:15:49.394
[ohasd(2277)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'server002'.

The CRS-1603 line is misleading. We have seen multiple times that the log shows "CSSD on node server002 shutdown by user" although nobody touched it; CSSD had actually terminated itself because of a fatal error.
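To make the real sequence easier to see, here is a minimal Python sketch that lists the CRS codes logged before the "shutdown by user" message. It assumes the alert log keeps the `[daemon(pid)]CRS-nnnn:message` layout shown above; the function names `crs_events` and `codes_before` are hypothetical helpers, and the sample is an abridged copy of the log lines from this post.

```python
import re

# Matches alert-log entries of the form "[daemon(pid)]CRS-nnnn:message".
# Assumption: the layout is the one shown in the excerpt above.
ENTRY = re.compile(r"\[(\w+)\((\d+)\)\](CRS-\d+):(.*)")

def crs_events(lines):
    """Return (code, message) pairs for every CRS entry, in log order."""
    events = []
    for line in lines:
        m = ENTRY.match(line.strip())
        if m:
            events.append((m.group(3), m.group(4).strip()))
    return events

def codes_before(events, target="CRS-1603"):
    """List the CRS codes logged before the first occurrence of target."""
    codes = [code for code, _ in events]
    if target not in codes:
        return []
    return codes[:codes.index(target)]

# Abridged sample taken from the alert log above (messages shortened).
sample = """\
2012-07-27 03:15:47.115
[cssd(23234)]CRS-1610:Network communication with node server001 (1) missing for 90% of timeout interval.
2012-07-27 03:15:49.213
[cssd(23234)]CRS-1609:This node is unable to communicate with other nodes in the cluster
2012-07-27 03:15:49.213
[cssd(23234)]CRS-1656:The CSS daemon is terminating due to a fatal error
2012-07-27 03:15:49.269
[cssd(23234)]CRS-1603:CSSD on node server002 shutdown by user.
""".splitlines()

print(codes_before(crs_events(sample)))
# ['CRS-1610', 'CRS-1609', 'CRS-1656']
```

If CRS-1609 or CRS-1656 shows up right before CRS-1603, the "shutdown by user" message is most likely the tail end of a self-termination and not a manual stop.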

The errors below are from the CSSD log on node 2:

2012-07-27 03:09:38.870: [GIPCHGEN][1105815872] gipchaNodeCreate: adding new node 0x19b8c2c0 { host 'server001', haName 'CSS_crs_transd', srcLuid 0bc2420b-7a5ec50d, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [0 : 0], createTime 5573754, sentRegister 0, localMonitor 0, flags 0x0 }
....
....
2012-07-27 03:09:38.884: [GIPCHALO][1104238912] gipchaLowerDropMsg: dropping because of node failure msg 0x2aaaacd91708 { len 1160, seq 1, type gipchaHdrTypeRecvEstablish (5), lastSeq 0, lastAck 0, minAck 0, flags 0x0, srcLuid 64ff8432-f6871f2e, dstLuid 00000000-00000000, msgId 1 }, node 0x19c3cb80 { host 'cintrnddb001', haName 'CSS_crs_transd', srcLuid 0bc2420b-a6f56356, dstLuid 64ff8432-f6871f2e numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [0 : 0], createTime 5573764, sentRegister 1, localMonitor 1, flags 0x8020 }


It looks like something is wrong with the interconnect, so let's ping the private IP:
oracle $ ping server001-priv
PING server001-priv (10.0.0.56) 56(84) bytes of data.
64 bytes from server001-priv (10.0.0.56): icmp_seq=1 ttl=64 time=61.3 ms
64 bytes from server001-priv (10.0.0.56): icmp_seq=1 ttl=64 time=63.3 ms (DUP)
64 bytes from server001-priv (10.0.0.56): icmp_seq=1 ttl=64 time=66.8 ms (DUP)
64 bytes from server001-priv (10.0.0.56): icmp_seq=1 ttl=64 time=70.0 ms (DUP)

Pay attention to the "(DUP)" markers: they mean that a single ping request received more than one reply (here, four replies to icmp_seq=1). This is definitely incorrect. The network team checked and confirmed that something was wrong at the switch level. After they corrected it, our RAC was back to normal.
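If you want to check for this automatically, a rough Python sketch like the one below can count the extra replies per sequence number in captured ping output. It assumes the Linux iputils output format shown above (each reply line contains `icmp_seq=N`); `duplicate_replies` is a hypothetical helper name, and the sample is the ping output from this post.

```python
import re
from collections import Counter

# Assumption: every reply line contains "icmp_seq=N", as in the output above.
REPLY = re.compile(r"icmp_seq=(\d+)")

def duplicate_replies(ping_output):
    """Map each icmp_seq to the number of extra replies it received."""
    seen = Counter()
    for line in ping_output.splitlines():
        m = REPLY.search(line)
        if m:
            seen[int(m.group(1))] += 1
    # Keep only sequence numbers that were answered more than once.
    return {seq: n - 1 for seq, n in seen.items() if n > 1}

sample = """\
PING server001-priv (10.0.0.56) 56(84) bytes of data.
64 bytes from server001-priv (10.0.0.56): icmp_seq=1 ttl=64 time=61.3 ms
64 bytes from server001-priv (10.0.0.56): icmp_seq=1 ttl=64 time=63.3 ms (DUP)
64 bytes from server001-priv (10.0.0.56): icmp_seq=1 ttl=64 time=66.8 ms (DUP)
64 bytes from server001-priv (10.0.0.56): icmp_seq=1 ttl=64 time=70.0 ms (DUP)
"""

print(duplicate_replies(sample))
# {1: 3}
```

Counting by sequence number instead of searching for the "(DUP)" text keeps the check independent of how a particular ping version labels duplicates. An empty result means every request got at most one reply.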
