June 4, 2012

cluster node eviction caused by bond switch

The third node of a three-node production cluster suddenly got rebooted.
Let's check.

Below is what we find in the rebooted node's alert$hostname.log:

[cssd(16352)]CRS-1610:Network communication with node node803 (3) missing for 90% of timeout interval. Removal of this node from cluster in 2.190 seconds
2012-06-02 00:45:17.904
[cssd(16352)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /prod/product/oracle/11.2.0.2/grid/log/cihcispdb804/cssd/ocssd.log.


From the CRS-1609 message above we can see that there seems to be some kind of network issue.

Let's go through the ocssd.log for more details:

2012-06-02 00:45:01.675: [ CSSD][1111071040]clssnmSendingThread: sending status msg to all nodes
2012-06-02 00:45:01.675: [ CSSD][1111071040]clssnmSendingThread: sent 4 status msgs to all nodes
2012-06-02 00:45:02.677: [ CSSD][1109494080]clssnmPollingThread: node node802 (1) at 50% heartbeat fatal, removal in 14.900 seconds
2012-06-02 00:45:02.677: [ CSSD][1109494080]clssnmPollingThread: node cihcispdb802 (1) is impending reconfig, flag 197646, misstime 15100
2012-06-02 00:45:02.678: [ CSSD][1109494080]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2012-06-02 00:45:02.678: [ CSSD][1102502208]clssnmvDHBValidateNCopy: node 1, cihcispdb802, has a disk HB, but no network HB, DHB has rcfg 211384656, wrtcnt, 22587619, LATS 1736660004, lastSeqNo 22587575, uniqueness 1332725246, timestamp 1338612301/2887464976
..........
..........
2012-06-02 00:45:10.696: [ CSSD][1109494080]clssnmPollingThread: node cihcispdb802 (1) at 75% heartbeat fatal, removal in 6.880 seconds
2012-06-02 00:45:10.696: [ CSSD][1109494080]clssnmPollingThread: node cihcispdb803 (3) at 75% heartbeat fatal, removal in 7.200 seconds
..........
..........
2012-06-02 00:45:17.580: [ CSSD][1109494080]clssnmPollingThread: Removal started for node cihcispdb802 (1), flags 0x3040e, state 3, wt4c 0

2012-06-02 00:45:17.580: [ CSSD][1109494080]clssnmMarkNodeForRemoval: node 1, cihcispdb802 marked for removal
..........
..........

2012-06-02 00:45:17.581: [    CSSD][3390307616]NMEVENT_SUSPEND [00][00][00][0e]
..........
..........
2012-06-02 00:45:17.589: [ CSSD][1097791808]clssnmvDiskKillCheck: not evicted, file ORCL:ASM_DISK_01 flags 0x00000000, kill block unique 0, my unique 1336874747
..........
..........
2012-06-02 00:45:17.902: [ CSSD][1109494080]clssnmPollingThread: Removal started for node node803 (3), flags 0x6040e, state 3, wt4c 0
..........
..........
2012-06-02 00:45:17.904: [ CSSD][1112648000]clssnmCheckDskInfo: Surviving cohort: 1,3

2012-06-02 00:45:17.904: [ CSSD][1112648000](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, cihcispdb804, is smaller than cohort of 2 nodes led by node 1, node802, based on map type 2
2012-06-02 00:45:17.904: [GIPCGMOD][1094637888] gipcmodGipcCallback: EXCEPTION[ ret gipcretInvalidObject (3) ] failed during request completion for req (nil), endp 0x1ebda200
2012-06-02 00:45:17.904: [ CSSD][1112648000]###################################
2012-06-02 00:45:17.904: [ CSSD][1112648000]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread

2012-06-02 00:45:17.904: [ CSSD][1112648000]###################################


From the above messages we know that after the rebooted node (node804) lost network communication with the other two nodes, it started trying to remove them from the cluster by writing kill information into the voting disk.

It then read the kill information that node802 had written to the voting disk and realized that node802 and node803 still formed a two-node cohort, which is bigger than its own one-node cohort, so node804 chose to abort itself instead of evicting the other two nodes (a simplified sketch of this arbitration follows the quoted log below):

2012-06-02 00:45:17.904: [ CSSD][1112648000]clssnmCheckDskInfo: Surviving cohort: 1,3
2012-06-02 00:45:17.904: [ CSSD][1112648000](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, node804, is smaller than cohort of 2 nodes led by node 1, node802, based on map type 2
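
To make the decision easier to follow, here is a minimal Python sketch of the arbitration rule visible in the log. This is not Oracle's code, just an illustration: the larger cohort survives, and on a tie I assume the cohort containing the lowest node number wins; the local node aborts if it is not part of the surviving cohort.

def surviving_cohort(cohorts):
    # Larger cohort wins; on a tie, assume the cohort holding the
    # lowest node number wins (hence the -min(c) tie-breaker).
    return max(cohorts, key=lambda c: (len(c), -min(c)))

def decide(local_node, cohorts):
    winner = surviving_cohort(cohorts)
    if local_node in winner:
        print("local node %d survives, evicting the others" % local_node)
    else:
        print("CSSNM00008-style abort: local node %d kills itself, "
              "its cohort is smaller than %s" % (local_node, sorted(winner)))

# The situation from the log: node804 is node number 2 and is alone,
# while node802 (1) and node803 (3) can still see each other.
decide(local_node=2, cohorts=[{2}, {1, 3}])

Running it reproduces the same choice node804 made: abort the local node because its one-node cohort is smaller than the two-node cohort led by node802.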


Now we are sure the issue was caused by the network.
Then we can go and check the OS log (a small sketch for checking the bonding state follows the log below):

Jun 2 00:45:35 node804 kernel: nx_nic[eth0]: Device is DOWN. Fail count[8]
Jun 2 00:45:35 node804 kernel: nx_nic[eth0]: Firmware hang detected. Severity code=0 Peg number=0 Error code=0 Return address=0
Jun 2 00:45:35 node804 kernel: nx_nic: Flash Version: Firmware[4.0.534], BIOS[2.1.0]
Jun 2 00:45:35 cihcispdb804 kernel: nx_nic: No memory on card. Load Cut through.
Jun 2 00:45:37 node804 kernel: bonding: bond1: link status definitely down for interface eth1, disabling it

Jun 2 00:45:37 node804 kernel: bonding: bond1: making interface eth5 the new active one.
Jun 2 00:45:37 node804 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it

Jun 2 00:45:37 node804 kernel: bonding: bond0: making interface eth4 the new active one.
Jun 2 00:45:39 node804 kernel: nx_nic: Loading the firmware from flash.
Jun 2 00:45:42 node804 kernel: nx_nic: Gen2 strapping detected
Jun 2 00:45:45 cihcispdb804 kernel: nx_nic HP NC375i Integrated Quad Port Multifunction Gigabit Server Adapter Board S/N Chip id 0x1

Jun 2 00:45:45 node804 kernel: nx_nic HP NC375i Integrated Quad Port Multifunction Gigabit Server Adapter Board S/N Chip id 0x1
Jun 2 00:45:45 node804 kernel: nx_nic HP NC375i Integrated Quad Port Multifunction Gigabit Server Adapter Board S/N Chip id 0x1
Jun 2 00:45:46 node804 kernel: nx_nic[eth1]: NIC Link is up
Jun 2 00:45:46 node804 kernel: bonding: bond1: link status definitely up for interface eth1.
Jun 2 00:45:46 node804 kernel: bonding: bond1: making interface eth1 the new active one.
Jun 2 00:45:48 node804 kernel: nx_nic[eth0]: NIC Link is up
Jun 2 00:45:48 cihcispdb804 kernel: bonding: bond0: link status definitely up for interface eth0.
Jun 2 00:45:48 node804 kernel: bonding: bond0: making interface eth0 the new active one.
Jun 2 00:45:50 node804 kernel: nx_nic[eth0]: Device is UP.
Jun 2 00:45:54 node804 kernel: nx_nic[eth2]: NIC Link is up
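
As a side note (not from the original logs), when you suspect a bond failover it is quick to read /proc/net/bonding/<bond>, which the Linux bonding driver exposes. A small Python sketch that prints the currently active slave and the link status of each slave interface:

import glob

for path in sorted(glob.glob("/proc/net/bonding/bond*")):
    bond = path.rsplit("/", 1)[-1]
    slave = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("Currently Active Slave:"):
                print("%s active slave: %s" % (bond, line.split(":", 1)[1].strip()))
            elif line.startswith("Slave Interface:"):
                slave = line.split(":", 1)[1].strip()
            elif line.startswith("MII Status:") and slave:
                print("  %s -> %s" % (slave, line.split(":", 1)[1].strip()))

Comparing its output before and after an incident makes it obvious whether eth0/eth1 have failed over to eth4/eth5, as in the kernel messages above.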


Why can a bond switchover make the Oracle interconnect fail?

Don't ask me, I don't know either!

In our environment I have run into this dozens of times: the cluster fails when the bond switches over,
and once the bond switches back, the cluster recovers again.
(During that time ping and ssh still worked, and even a multicast check, roughly like the sketch below, found no problem.)
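
A multicast check along those lines can be done with the following minimal sketch (not Oracle's official multicast test script; the group 230.0.1.0 and port 42424 are assumptions you should adjust to whatever your Grid Infrastructure release actually uses). Run the receiver on one node first, then the sender on another, both bound to the private-interconnect IP:

import socket, struct, sys

GROUP, PORT = "230.0.1.0", 42424           # assumed group/port, adjust as needed
role, local_ip = sys.argv[1], sys.argv[2]  # "recv"/"send" and the private NIC IP

if role == "recv":
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", PORT))
    # join the multicast group on the private-interconnect interface
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton(local_ip))
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    s.settimeout(30)
    try:
        data, addr = s.recvfrom(1024)
        print("got %r from %s -- multicast OK" % (data, addr[0]))
    except socket.timeout:
        print("nothing received in 30s -- multicast may be blocked")
else:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # send the test datagram out through the private-interconnect interface
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF, socket.inet_aton(local_ip))
    s.sendto(b"mcast-ping", (GROUP, PORT))
    print("sent test datagram to %s:%d" % (GROUP, PORT))

In our case such checks found no problem, which makes it even stranger that the bond failover still broke the interconnect.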

And on the internet you can also find some articles about the same problem.

Maybe that is one of the reasons why, in 11g, Oracle suggests using HAIP instead of bonding for the private interconnect.
