April 30, 2012

Trouble Shooting: CRS-4534: Cannot communicate with Event Manager

There is a three nodes Rac cluster.
All DBs are up and running fine on all three nodes.
oracle $ ps -ef|grep smon|wc -l
43
oracle $ ps -ef|grep tns|wc -l
50
oracle $ ps -ef|grep d.bin
root      8177     1  0 Apr28 ?        00:18:29 /stage/grid/11.2.0/grid/bin/orarootagent.bin
oracle    8550     1 22 Apr28 ?        08:52:59 /stage/grid/11.2.0/grid/bin/oraagent.bin
root     26480     1  0 Apr28 ?        00:03:26 /stage/grid/11.2.0/grid/bin/ohasd.bin reboot
oracle   29210     1  0 Apr28 ?        00:00:00 /stage/grid/11.2.0/grid/bin/mdnsd.bin
oracle   29232     1  0 Apr28 ?        00:01:09 /stage/grid/11.2.0/grid/bin/gpnpd.bin
oracle   29257     1  0 Apr28 ?        00:05:32 /stage/grid/11.2.0/grid/bin/gipcd.bin
root     29307     1  0 Apr28 ?        00:00:13 /stage/grid/11.2.0/grid/bin/cssdmonitor
root     29334     1  0 Apr28 ?        00:00:10 /stage/grid/11.2.0/grid/bin/cssdagent
oracle   29351     1  8 Apr28 ?        03:25:51 /stage/grid/11.2.0/grid/bin/ocssd.bin
root     29540     1  0 Apr28 ?        00:20:57 /stage/grid/11.2.0/grid/bin/orarootagent.bin
oracle   29556     1  0 Apr28 ?        00:00:18 /stage/grid/11.2.0/grid/bin/diskmon.bin -d -f
root     30160     1  0 Apr28 ?        00:00:33 /stage/grid/11.2.0/grid/bin/octssd.bin reboot
oracle   30350     1  0 Apr28 ?        00:00:04 /stage/grid/11.2.0/grid/bin/evmd.bin
root     31947     1  0 Apr28 ?        00:03:50 /stage/grid/11.2.0/grid/bin/crsd.bin reboot

From above output red part we aslo can see evmd.bin process is up.
But when use crsctl command, it shows that Event Manger not work on node 2 and node 3.
oracle $ crsctl check cluster -all
**************************************************************
node1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
node2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
**************************************************************
node3:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
**************************************************************

Checked cluster alert.log, it also indicate EVMD started normal:
2012-04-28 18:29:48.154
[evmd(30579)]CRS-1401:EVMD started on node node2.

Below is from evmd.log:
2012-04-28 18:29:17.518: [ EVMD][1928129552] EVMD Starting
2012-04-28 18:29:17.518: [ EVMD][1928129552] Initializing OCR
2012-04-28 18:29:17.518: [ EVMD][1928129552] EVMD running in cluster mode
2012-04-28 18:29:18.524: [ EVMD][1928129552] Waiting for OCR. Seconds elapsed: 0
2012-04-28 18:29:48.136: [ EVMD][1928129552] Get OCR context succeeded
2012-04-28 18:29:48.153: [ EVMD][1928129552] Initializing Diagnostics Settings
2012-04-28 18:29:48.224: [ EVMD][1928129552] Authorization database built successfully.
2012-04-28 18:29:49.105: [ EVMAPP][1928129552] EVMD Started
2012-04-28 18:29:49.123: [ EVMD][1928129552] Authorization database built successfully.

All seems fine. Then why "CRS-4534: Cannot communicate with Event Manager"?

Let's check the current status for evmd:
oracle $ crsctl stat res -t -init
-------------------------------------------------------
NAME             TARGET          STATE       SERVER   
-------------------------------------------------------
ora.evmd      1  ONLINE   INTERMEDIATE node3

The current status INTERMEDIATE means the evmd did starting but not totally completed.

Below recorded in oracle_agent.logfile:
2012-04-28 18:45:50.936: [    AGFW][1113385280] {0:0:2} ora.evmd 1 1 state changed from: UNKNOWN to: OFFLINE
.................

2012-04-28 18:46:23.441: [    AGFW][1114609984] {0:0:2} ora.evmd 1 1 state changed from: UNKNOWN to: STARTING
.................

2012-04-28 20:41:27.768: [    AGFW][1116698944] {0:0:2} ora.evmd 1 1 state changed from: STARTING to: PARTIAL
.................

We  can see the evmd was starting in progress, status changed from UNKNOWN --> OFFLINE --> STARTING --> PARTIAL.
And next the status should change from PARTIAL to ONLINE.
But from the logfile, we don't see the final step, it never happen.

It failed at the final step due to below error message:
2012-04-28 20:43:27.787: [ora.evmd][1112496448] {0:0:2} [check] clsn_agent::abort {
2012-04-28 20:43:27.787: [ora.evmd][1112496448] {0:0:2} [check] abort {
2012-04-28 20:43:27.787: [ora.evmd][1112496448] {0:0:2} [check] abort command: check
2012-04-28 20:43:27.787: [ora.evmd][1112496448] {0:0:2} [check] tryActionLock {
2012-04-28 20:44:27.816: [ora.evmd][1112496448] {0:0:2} [check] did not get lock
2012-04-28 20:44:27.816: [ora.evmd][1112496448] {0:0:2} [check] tryActionLock }
2012-04-28 20:44:27.816: [ora.evmd][1112496448] {0:0:2} [check] clsn_agent::abort: Exception LockAcquireTimeoutException
2012-04-28 20:44:27.816: [ora.evmd][1112496448] {0:0:2} [check] clsn_agent::abort, agent exiting }

This is very similar to bug 11807012: oraagent.bin Exits After Check Timed Out

But this bug should have been fixed in 11.2.0.2.3,  and our server already patched 11gR2 Patch Set Update 4.

It looks like that PSU don't real fully solved that bug.


Raised SR and will check with Oracle Support further.

1 Comments:

Post a Comment