September 22, 2012

Cluster on first node failed to start up after reboot due to GPNP hang on third node

3-node RAC: during a PSU patching activity, the cluster on the first node failed to start up after patching.

Below are the errors in alertnode1.log:

2012-09-22 03:01:39.894
[ohasd(28563)]CRS-2112:The OLR service started on node cintrnpdb001.
2012-09-22 03:01:39.917
[ohasd(28563)]CRS-1301:Oracle High Availability Service started on node cintrnpdb001.
2012-09-22 03:01:39.918
[ohasd(28563)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2012-09-22 03:01:53.765
[gpnpd(28771)]CRS-2328:GPNPD started on node cintrnpdb001.
2012-09-22 03:01:56.228
[cssd(28831)]CRS-1713:CSSD daemon is started in clustered mode
2012-09-22 03:11:55.298
[/prod/product/oracle/11.2.0.2/grid/bin/cssdagent(28819)]CRS-5818:Aborted command 'start for resource: ora.cssd 1 1' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:0:2} in /prod/product/oracle/11.2.0.2/grid/log/cintrnpdb001/agent/ohasd/oracssdagent_root/oracssdagent_root.log.
2012-09-22 03:11:55.299
[cssd(28831)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /prod/product/oracle/11.2.0.2/grid/log/cintrnpdb001/cssd/ocssd.log
2012-09-22 03:11:55.299
[cssd(28831)]CRS-1603:CSSD on node cintrnpdb001 shutdown by user.


During the whole time, ocssd.log kept throwing the error below:
2012-09-22 03:00:00.546: [    GPNP][2692574496]clsgpnpm_newWiredMsg: [at clsgpnpm.c:741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http://www.grid-pnp.org/2005/12/gpnp-errors#"]
2012-09-22 03:00:02.558: [    GPNP][2692574496]clsgpnpm_newWiredMsg: [at clsgpnpm.c:741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http://www.grid-pnp.org/2005/12/gpnp-errors#"]
2012-09-22 03:00:04.568: [    GPNP][2692574496]clsgpnpm_newWiredMsg: [at clsgpnpm.c:741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http://www.grid-pnp.org/2005/12/gpnp-errors#"]

The above shows that CSSD failed to get the GPnP profile from GPNPD.
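As a quick sanity check, gpnptool can be used to ask gpnpd for the profile directly. A minimal sketch, run as the grid software owner, with the grid home path taken from the logs above:

# ask the local gpnpd daemon for the current GPnP profile
/prod/product/oracle/11.2.0.2/grid/bin/gpnptool get

# discover which gpnpd daemons in the cluster are reachable via mDNS
/prod/product/oracle/11.2.0.2/grid/bin/gpnptool find

If the local gpnpd cannot return a profile, the next step is to look at its own log.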

Let's check gpnpd.log to see what's wrong:
2012-09-22 02:46:51.594: [    GPNP][1088584000]clsgpnp_profileCallUrlInt: [at clsgpnp.c:2104] put-profile call to url "tcp://cintrnpdb003:33612" disco "mdns:service:gpnp._tcp.local.://cintrnpdb003:33612/agent=gpnpd,cname=crs_transp,host=cintrnpdb003,pid=20890/gpnpd h:cintrnpdb003 c:crs_transp" [f=0 claimed- host:cintrnpdb001 cname:crs_transp seq:6 auth:CN=GPnP_peer]
2012-09-22 02:47:05.424: [  OCRMSG][1085512000]GIPC error [29] msg [gipcretConnectionRefused]
2012-09-22 02:47:26.446: [  OCRMSG][1085512000]GIPC error [29] msg [gipcretConnectionRefused]

From the above we can see that when gpnpd on the first node started up, it sent a put-profile call to gpnpd on the third node but got no response (the connection was refused).

So at this point we can suspect that gpnpd on the third node is in a bad state.

After I killed gpnpd.bin on the third node and let it restart, the cluster on the first node started up successfully right away.
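For reference, the workaround was roughly the following (a sketch, assuming ohasd is up on the third node and will respawn gpnpd.bin automatically; the PID is a placeholder to be read from the ps output):

# on the third node: locate the hung gpnpd.bin and kill it, ohasd will respawn it
ps -ef | grep [g]pnpd.bin
kill -9 <gpnpd_pid>

# verify a new gpnpd.bin process has appeared, then on the first node:
/prod/product/oracle/11.2.0.2/grid/bin/crsctl check crs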

September 12, 2012

CSSD terminated from clssnmvDiskPingMonitorThread without disk timeout countdown

The clusterware on one node suddenly terminated.
Below are the messages from the cluster alert.log:

2012-09-11 11:41:30.328
[ctssd(22428)]CRS-2409:The clock on host node8 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2012-09-11 12:30:03.122
[cssd(21061)]CRS-1606:The number of voting files available, 0, is less than the minimum number of voting files required, 1, resulting in CSSD termination to
ensure data integrity; details at (:CSSNM00018:) in /prod/grid/11.2.0/grid/log/node8/cssd/ocssd.log
2012-09-11 12:30:03.123
[cssd(21061)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /prod/grid/11.2.0/grid/log/node8/cssd/ocssd.log 
2012-09-11 12:30:03.233 [cssd(21061)]CRS-1652:Starting clean up of CRSD resources.

Let's check ocssd.log:
2012-09-11 12:30:02.543: [    CSSD][1113450816]clssnmSendingThread: sending status msg to all nodes
2012-09-11 12:30:02.543: [    CSSD][1113450816]clssnmSendingThread: sent 5 status msgs to all nodes
2012-09-11 12:30:03.122: [    CSSD][1082489152](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 1 configured voting disks available, need 1
2012-09-11 12:30:03.123: [    CSSD][1082489152]###################################
2012-09-11 12:30:03.123: [    CSSD][1082489152]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread
2012-09-11 12:30:03.123: [    CSSD][1082489152]###################################
2012-09-11 12:30:03.123: [    CSSD][1082489152](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally

So it looks like an I/O issue with the voting disk.
But after checking, we found no abnormal or error messages at either the OS level or on the storage side.

Then I reviewed ocssd.log again and, to my surprise, found no long-disk-timeout countdown in the log.

We know that if ocssd fails to reach the voting disk, it starts a long-disk-timeout countdown (200 seconds by default). Only if the voting disk is still unavailable after those 200 seconds does ocssd terminate itself.
The countdown messages should look like this:
clssscMonitorThreads clssnmvDiskPingThread not scheduled for 16020 msecs

But in our case there was no such countdown; ocssd simply terminated from clssnmvDiskPingMonitorThread all of a sudden.
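This is easy to confirm with a grep against ocssd.log (a sketch; the log path is the one from the alert messages, and the patterns are just the strings discussed above):

# count the 'not scheduled for' countdown messages before the abort -- here there were none
grep -c "clssnmvDiskPingThread not scheduled" /prod/grid/11.2.0/grid/log/node8/cssd/ocssd.log

# and confirm the abort came straight from the disk ping monitor thread
grep "aborting from thread clssnmvDiskPingMonitorThread" /prod/grid/11.2.0/grid/log/node8/cssd/ocssd.log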
So this looks more like a bug than a real voting-disk I/O issue. I will raise an SR with Oracle Support for further checking.

September 5, 2012

RAC: "CRS-0184: Cannot communicate with the CRS daemon“ due to EVMD

One node of a RAC cluster ran into trouble: its CRSD kept terminating and restarting, and any attempt to query CRSD returned "CRS-0184: Cannot communicate with the CRS daemon".

The error below repeats in alertnode1.log:

2012-08-28 19:32:32.995
[ohasd(26038)]CRS-2765:Resource 'ora.crsd' has failed on server 'racnode001'.
2012-08-28 19:32:38.319
[crsd(13117)]CRS-1012:The OCR service started on node racnode001.
2012-08-28 19:42:54.134
[crsd(13117)]CRS-0810:Cluster Ready Service aborted due to failure to communicate 
with Event Management Service with error [1]. 
Details at (:CRSD00120:) in /prod/grid/11.2.0.2/grid/log/racnode001/crsd/crsd.log.
2012-08-28 19:42:54.804
[ohasd(26038)]CRS-2765:Resource 'ora.crsd' has failed on server 'racnode001'.
2012-08-28 19:42:59.175
[crsd(10695)]CRS-1012:The OCR service started on node cihcispdb001.
2012-08-28 19:53:15.226
[crsd(10695)]CRS-0810:Cluster Ready Service aborted due to failure to communicate 
with Event Management Service with error [1]. 
Details at (:CRSD00120:) in /prod/grid/11.2.0.2/grid/log/racnode001/crsd/crsd.log.
2012-08-28 19:53:15.575
[ohasd(26038)]CRS-2765:Resource 'ora.crsd' has failed on server 'racnode001'.

From the above log we can see that CRSD keeps failing and restarting.

Let's check crsd.log for the details:

2012-08-29 18:07:25.462: [ default][3503137264] CRS Daemon Starting
2012-08-29 18:07:25.464: [ default][1106962752] Policy Engine is not initialized yet!
2012-08-29 18:07:25.465: [ default][3503137264] ENV Logging level for Module: AGENT  1
2012-08-29 18:07:25.465: [ default][3503137264] ENV Logging level for Module: AGFW  0
.............
.............
2012-08-29 18:15:40.648: [ COMMCRS][3503137264]Authorization failed, network error
2012-08-29 18:15:55.584: [ CRSMAIN][1106962752] Policy Engine is not initialized yet!
2012-08-29 18:16:11.408: [ COMMCRS][3503137264]Authorization failed, network error
2012-08-29 18:16:25.586: [ CRSMAIN][1106962752] Policy Engine is not initialized yet!
2012-08-29 18:16:41.668: [ COMMCRS][3503137264]Authorization failed, network error
2012-08-29 18:16:55.597: [ CRSMAIN][1106962752] Policy Engine is not initialized yet!
2012-08-29 18:17:12.929: [ COMMCRS][3503137264]Authorization failed, network error
2012-08-29 18:17:25.609: [ CRSMAIN][1106962752] Policy Engine is not initialized yet!
2012-08-29 18:17:43.190: [ COMMCRS][3503137264]Authorization failed, network error
2012-08-29 18:17:43.191: [ CRSMAIN][3503137264] Created alert : (:CRSD00120:) :  Could not init EVM, error: 1
2012-08-29 18:17:43.191: [    CRSD][3503137264][PANIC] CRSD exiting: Could not init EVMMgr.1
2012-08-29 18:17:43.191: [    CRSD][3503137264] Done.
2012-08-29 18:17:44.287: [ default][3316335088] First attempt: init CSS context succeeded.
2012-08-29 18:17:44.290: [  clsdmt][1078745408]PID for the Process [29342], connkey 1
2012-08-29 18:17:44.291: [  clsdmt][1078745408]Creating PID [29342] file for home /prod/grid/11.2.0.2/grid 
host racnode001 bin crs to /prod/grid/11.2.0.2/grid/crs/init/
2012-08-29 18:17:44.291: [  clsdmt][1078745408]Writing PID [29342] to the file 
[/prod/grid/11.2.0.2/grid/crs/init/racnode001.pid]
2012-08-29 18:17:45.182: [ default][1078745408] Policy Engine is not initialized yet!

We can see that when CRSD starts up, it fails at the Event Manager initialization step. Let's check what's wrong with the Event Manager (EVMD).
Below is from evmd.log:

2012-08-29 16:12:18.636: [ COMMCRS][54275088]clsc_send: waiting for space to write 
on icon (0x9d6fdc0). 0/88 sent
2012-08-29 16:12:26.660: [ COMMCRS][54275088]clsc_send: waiting for space to write 
on icon (0x9d6fdc0). 0/88 sent
2012-08-29 16:12:34.676: [ COMMCRS][54275088]clsc_send: waiting for space to write 
on icon (0x9d6fdc0). 0/88 sent
2012-08-29 16:12:42.711: [ COMMCRS][54275088]clsc_send: waiting for space to write 
on icon (0x9d6fdc0). 0/88 sent

EVMD keeps reporting the error message above. It means that when EVMD tries to write to the connection endpoint (the "icon") under "/prod/grid/11.2.0.2/grid/auth/evm", it fails for lack of space.

But the mount point has enough space (a quick check is sketched below), so the reason EVMD keeps throwing the error above is probably one of the following:
1. a bug, or
2. at some point in the past the mount point ran out of space, EVMD got stuck in that state, and even though the space was released later, EVMD never recovered automatically.
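The space check itself is nothing special (a sketch; the path is the auth directory mentioned above, and checking inodes in addition to blocks is my own habit, not something taken from the logs):

df -h /prod/grid/11.2.0.2/grid/auth/evm     # free blocks on the mount point
df -i /prod/grid/11.2.0.2/grid/auth/evm     # free inodes on the mount point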

The solution is simple: reload EVMD, and CRSD recovers automatically as well.
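One way to reload it on 11.2 is to bounce the OHASD-managed init resource as root (a sketch of the idea rather than the exact commands I ran):

# restart the EVMD init resource; CRSD should then manage to initialize EVM on its next attempt
/prod/grid/11.2.0.2/grid/bin/crsctl stop resource ora.evmd -init
/prod/grid/11.2.0.2/grid/bin/crsctl start resource ora.evmd -init

After the reload, the cluster alert log shows EVMD coming up and CRSD finally starting successfully: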

2012-08-29 20:42:44.276
[crsd(6075)]CRS-1012:The OCR service started on node racnode001.
2012-08-29 20:52:58.684
[crsd(6075)]CRS-0810:Cluster Ready Service aborted due to failure to communicate with Event Management Service with error [1]. Details at (:CRSD00120:) in /prod/grid/11.2.0.2/grid/log/racnode001/crsd/crsd.log.
2012-08-29 20:52:58.850
[ohasd(26038)]CRS-2765:Resource 'ora.crsd' has failed on server 'racnode001'.
2012-08-29 20:53:03.131
[crsd(4699)]CRS-1012:The OCR service started on node racnode001.
2012-08-29 21:03:17.894
[crsd(4699)]CRS-0810:Cluster Ready Service aborted due to failure to communicate with Event Management Service with error [1]. Details at (:CRSD00120:) in /prod/grid/11.2.0.2/grid/log/racnode001/crsd/crsd.log.
2012-08-29 21:03:18.609
[ohasd(26038)]CRS-2765:Resource 'ora.crsd' has failed on server 'racnode001'.
2012-08-29 21:03:22.962
[crsd(28965)]CRS-1012:The OCR service started on node racnode001.
2012-08-29 21:11:48.965
[evmd(21764)]CRS-1401:EVMD started on node racnode001. 
2012-08-29 21:11:52.774
[crsd(28965)]CRS-1201:CRSD started on node racnode001.

Then CRSD is back to normal:
racnode001 [+ASM1] /prod/grid/11.2.0.2/grid/log/racnode001/evmd > crsctl check crs
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online


September 2, 2012

Blocker=Waiter: self deadlock caused within one session

We all know about deadlocks: a deadlock normally happens when multiple sessions compete for the same resources.

But in some particular cases, a deadlock can also occur within a single session. Below is an example:

SQL> create index obj_ind4$ on obj$(name);
create index obj_ind4$ on obj$(name)
                          *
ERROR at line 1:
ORA-00604: error occurred at recursive SQL level 1
ORA-00060: deadlock detected while waiting for resource

There is currently no lock on obj$, yet when we try to create an index on it, it fails with a deadlock error.

Below is from the trace file:

*** 2012-07-02 09:13:45.155
*** SERVICE NAME:(SYS$USERS) 2012-07-02 09:13:45.154
*** SESSION ID:(135.46) 2012-07-02 09:13:45.154
DEADLOCK DETECTED
[Transaction Deadlock]
Current SQL statement for this session:
update obj$ set obj#=:6,type#=:7,ctime=:8,mtime=:9,stime=:10,status=:11,dataobj#=:13,flags=:14,oid$=:15,spare1=:16, spare2=:17 where owner#=:1 and name=:2 and namespace
=:3 and(remoteowner=:4 or remoteowner is null and :4 is null)and(linkname=:5 or linkname is null and :5 is null)and(subname=:12 or subname is null and :12 is null)
The following deadlock is not an ORACLE error. It is a
deadlock due to user error in the design of an application
or from issuing incorrect ad-hoc SQL. The following
information may aid in determining the deadlock:
Deadlock graph:
                       ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name          process session holds waits  process session holds waits
TM-00000012-00000000        21     135     S             21     135          SX
session 135: DID 0001-0015-0000001D     session 135: DID 0001-0015-0000001D
Rows waited on:
Session 135: obj - rowid = 00000025 - AAAAAlAABAAAADhAAA
  (dictionary objn - 37, file - 1, block - 225, slot - 0)
Information on the OTHER waiting sessions:
End of information on OTHER waiting sessions.
===================================================

From the above we can clearly see that the blocker is session 135 (process 21) and the waiter is the same session 135, which means the deadlock is caused by that session itself.

From the current SQL the reason is obvious:
to create an object, Oracle has to record the object's details in the obj$ dictionary table through recursive DML (the update shown above). That DML requires a row-exclusive (SX, level 3) TM lock on obj$.

And since we are creating the index on obj$ itself, the session already holds a share-mode (S) TM lock on the table (visible in the deadlock graph above), which blocks any DML on obj$.
That's where the deadlock comes from.


