There is a three nodes Rac cluster.
All DBs are up and running fine on all three nodes.
oracle $ ps -ef|grep smon|wc -l
43
oracle $ ps -ef|grep tns|wc -l
50
oracle $ ps -ef|grep d.bin
root 8177 1 0 Apr28 ? 00:18:29 /stage/grid/11.2.0/grid/bin/orarootagent.bin
oracle 8550 1 22 Apr28 ? 08:52:59 /stage/grid/11.2.0/grid/bin/oraagent.bin
root 26480 1 0 Apr28 ? 00:03:26 /stage/grid/11.2.0/grid/bin/ohasd.bin reboot
oracle 29210 1 0 Apr28 ? 00:00:00 /stage/grid/11.2.0/grid/bin/mdnsd.bin
oracle 29232 1 0 Apr28 ? 00:01:09 /stage/grid/11.2.0/grid/bin/gpnpd.bin
oracle 29257 1 0 Apr28 ? 00:05:32 /stage/grid/11.2.0/grid/bin/gipcd.bin
root 29307 1 0 Apr28 ? 00:00:13 /stage/grid/11.2.0/grid/bin/cssdmonitor
root 29334 1 0 Apr28 ? 00:00:10 /stage/grid/11.2.0/grid/bin/cssdagent
oracle 29351 1 8 Apr28 ? 03:25:51 /stage/grid/11.2.0/grid/bin/ocssd.bin
root 29540 1 0 Apr28 ? 00:20:57 /stage/grid/11.2.0/grid/bin/orarootagent.bin
oracle 29556 1 0 Apr28 ? 00:00:18 /stage/grid/11.2.0/grid/bin/diskmon.bin -d -f
root 30160 1 0 Apr28 ? 00:00:33 /stage/grid/11.2.0/grid/bin/octssd.bin reboot
oracle 30350 1 0 Apr28 ? 00:00:04 /stage/grid/11.2.0/grid/bin/evmd.bin
root 31947 1 0 Apr28 ? 00:03:50 /stage/grid/11.2.0/grid/bin/crsd.bin reboot
From above output red part we aslo can see evmd.bin process is up.
But when use crsctl command, it shows that Event Manger not work on node 2 and node 3.
oracle $ crsctl check cluster -all
**************************************************************
node1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
node2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
**************************************************************
node3:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
**************************************************************
Checked cluster alert.log, it also indicate EVMD started normal:
2012-04-28 18:29:48.154
[evmd(30579)]CRS-1401:EVMD started on node node2.
Below is from evmd.log:
2012-04-28 18:29:17.518: [ EVMD][1928129552] EVMD Starting
2012-04-28 18:29:17.518: [ EVMD][1928129552] Initializing OCR
2012-04-28 18:29:17.518: [ EVMD][1928129552] EVMD running in cluster mode
2012-04-28 18:29:18.524: [ EVMD][1928129552] Waiting for OCR. Seconds elapsed: 0
2012-04-28 18:29:48.136: [ EVMD][1928129552] Get OCR context succeeded
2012-04-28 18:29:48.153: [ EVMD][1928129552] Initializing Diagnostics Settings
2012-04-28 18:29:48.224: [ EVMD][1928129552] Authorization database built successfully.
2012-04-28 18:29:49.105: [ EVMAPP][1928129552] EVMD Started
2012-04-28 18:29:49.123: [ EVMD][1928129552] Authorization database built successfully.
All seems fine. Then why "CRS-4534: Cannot communicate with Event Manager"?
Let's check the current status for evmd:
oracle $ crsctl stat res -t -init
-------------------------------------------------------
NAME TARGET STATE SERVER
-------------------------------------------------------
ora.evmd 1 ONLINE INTERMEDIATE node3
The current status INTERMEDIATE means the evmd did starting but not totally completed.
Below recorded in oracle_agent.logfile:
2012-04-28 18:45:50.936: [ AGFW][1113385280] {0:0:2} ora.evmd 1 1 state changed from: UNKNOWN to: OFFLINE
.................
2012-04-28 18:46:23.441: [ AGFW][1114609984] {0:0:2} ora.evmd 1 1 state changed from: UNKNOWN to: STARTING
.................
2012-04-28 20:41:27.768: [ AGFW][1116698944] {0:0:2} ora.evmd 1 1 state changed from: STARTING to: PARTIAL
.................
We can see the evmd was starting in progress, status changed from UNKNOWN --> OFFLINE --> STARTING --> PARTIAL.
And next the status should change from PARTIAL to ONLINE.
But from the logfile, we don't see the final step, it never happen.
It failed at the final step due to below error message:
2012-04-28 20:43:27.787: [ora.evmd][1112496448] {0:0:2} [check] clsn_agent::abort {
2012-04-28 20:43:27.787: [ora.evmd][1112496448] {0:0:2} [check] abort {
2012-04-28 20:43:27.787: [ora.evmd][1112496448] {0:0:2} [check] abort command: check
2012-04-28 20:43:27.787: [ora.evmd][1112496448] {0:0:2} [check] tryActionLock {
2012-04-28 20:44:27.816: [ora.evmd][1112496448] {0:0:2} [check] did not get lock
2012-04-28 20:44:27.816: [ora.evmd][1112496448] {0:0:2} [check] tryActionLock }
2012-04-28 20:44:27.816: [ora.evmd][1112496448] {0:0:2} [check] clsn_agent::abort: Exception LockAcquireTimeoutException
2012-04-28 20:44:27.816: [ora.evmd][1112496448] {0:0:2} [check] clsn_agent::abort, agent exiting }
This is very similar to bug 11807012: oraagent.bin Exits After Check Timed Out
But this bug should have been fixed in 11.2.0.2.3, and our server already patched 11gR2 Patch Set Update 4.
It looks like that PSU don't real fully solved that bug.
Raised SR and will check with Oracle Support further.
More......
GE, AGFA. experienced in 9i/10g/11g/12c, AdvanceReplication, Stream, GoldenGate, RAC, DataGuard, ODA, EXADATA, GridControl. In my career I have always been assigned to handle most complicated and critical oracle issues without solution online, and I hope to record some of them via this blog.
April 30, 2012
Trouble Shooting: CRS-4534: Cannot communicate with Event Manager
April 24, 2012
rowid paging benefit from 11g new feature
Consider below request:
We need to get the 900 to 1000 records after ordered by time.
We call this kind of request as "paging". Because in most case we encounter this kind of request in web page's redirect buttons: page 1, page 2, page 3.....
We can jump to any page via clicking the conrrespoding buttion.
There are two kinds of paging method in oracle: Rownum paging and Rowid paging.
Almost all the documents which i read all announce that rowid is a better method then rownum for paging.
They said:
For the first a few pages, two methods have similiar performance.But as long as the page number become bigger and bigger, rowid paging perform better and better then rownum paging.
Pay attention to the red part. Is that really true? Let's make a experiment.
First test in 11g. Create test table as below:
11gR2 >create table test(id number,status VARCHAR2(7),type VARCHAR2(19),created date);
Table created.
11gR2 >insert into test select OBJECT_ID,STATUS,OBJECT_TYPE,CREATED from dba_objects;
12926 rows created.
11gR2 >alter table test modify created not null;
Table altered.
11gR2 >create index test_ind1 on test(CREATED);
Index created.
11gR2 >ANALYZE TABLE TEST compute statistics;
Table analyzed.
We are going to read row 900-1000 from table test order by created.
Test 11g paging with rownum method:
11gR2 >select *
from (
select rownum rn,t.*
from
(select id,status,type,created from test order by created) t
where rownum<1000)
where rn >900;
99 rows selected.
--------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 999 | 50949 | 9 (0)| 00:00:01 |
|* 1 | VIEW | | 999 | 50949 | 9 (0)| 00:00:01 |
|* 2 | COUNT STOPKEY | | | | | |
| 3 | VIEW | | 999 | 37962 | 9 (0)| 00:00:01 |
| 4 | TABLE ACCESS BY INDEX ROWID| TEST | 12926 | 277K| 9 (0)| 00:00:01 |
| 5 | INDEX FULL SCAN | TEST_IND1 | 999 | | 4 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
22 consistent gets
0 physical reads
0 redo size
3703 bytes sent via SQL*Net to client
590 bytes received via SQL*Net from client
8 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
99 rows processed
Test 11g paging with rowid method:
11gR2 >select /*+ ordered use_nl(p s) */ *
from (
select rownum rn,rd
from (select rowid rd from test order by created)
t where rownum<1000) p,
test s
where rn>900 and p.rd=s.rowid; 2 3 4 5 6 7
99 rows selected.
-----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 999 | 52947 | 1003 (0)| 00:00:13 |
| 1 | NESTED LOOPS | | 999 | 52947 | 1003 (0)| 00:00:13 |
|* 2 | VIEW | | 999 | 24975 | 4 (0)| 00:00:01 |
|* 3 | COUNT STOPKEY | | | | | |
| 4 | VIEW | | 999 | 11988 | 4 (0)| 00:00:01 |
| 5 | INDEX FULL SCAN | TEST_IND1 | 12926 | 239K| 4 (0)| 00:00:01 |
| 6 | TABLE ACCESS BY USER ROWID| TEST | 1 | 28 | 1 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
19 consistent gets
0 physical reads
0 redo size
5450 bytes sent via SQL*Net to client
590 bytes received via SQL*Net from client
8 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
99 rows processed
From the red part we can see rowid paging(19 logic read) do better then rownum paging(22 logic read).
The reason is rowid method reduce the "TABLE ACCESS BY ROWID" for first 900 rows.
Let's check in 10g:
Build the Test table as excatly the same with 11g, created the index and gathered stats.
Test 11g paging with rownum method:
10gR2 >select *
from (
select rownum rn,t.*
from
(select id,status,type,created from test order by created) t
where rownum<1000)
where rn >900;
99 rows selected.
--------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 999 | 50949 | 17 (0)| 00:00:01 |
|* 1 | VIEW | | 999 | 50949 | 17 (0)| 00:00:01 |
|* 2 | COUNT STOPKEY | | | | | |
| 3 | VIEW | | 999 | 37962 | 17 (0)| 00:00:01 |
| 4 | TABLE ACCESS BY INDEX ROWID| TEST | 45620 | 1113K| 17 (0)| 00:00:01 |
| 5 | INDEX FULL SCAN | TEST_IND1 | 999 | | 3 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
19 consistent gets
0 physical reads
0 redo size
3842 bytes sent via SQL*Net to client
558 bytes received via SQL*Net from client
8 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
99 rows processed
19 logic read, very close to 11g. It is a normal value and close to our expected.
Test 11g paging with rowid method:
10gR2 >select /*+ ordered use_nl(p s) */ *
from (
select rownum rn,rd
from (select rowid rd from test order by created)
t where rownum<200) p,
test s
where rn>100 and p.rd=s.rowid;
99 rows selected.
-----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 199 | 11144 | 261 (1)| 00:00:04 |
| 1 | NESTED LOOPS | | 199 | 11144 | 261 (1)| 00:00:04 |
|* 2 | VIEW | | 199 | 4975 | 62 (2)| 00:00:01 |
|* 3 | COUNT STOPKEY | | | | | |
| 4 | VIEW | | 45620 | 534K| 62 (2)| 00:00:01 |
| 5 | INDEX FULL SCAN | TEST_IND1 | 45620 | 846K| 62 (2)| 00:00:01 |
| 6 | TABLE ACCESS BY USER ROWID| TEST | 1 | 31 | 1 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
108 consistent gets
0 physical reads
0 redo size
5252 bytes sent via SQL*Net to client
558 bytes received via SQL*Net from client
8 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
99 rows processed
wow, why logic turn to 108? The execution plan is same with 11g, but why so big difference in logic read?
rownum rowid
11g: 22 19
10g: 19 108
Why?
Below is the Reason:
11gR2 >select rowid,id from test where rowid in ('AAADSHAABAAAH3hAAA','AAADSHAABAAAH3hAAB','AAADSHAABAAAH3hAAC','AAADSHAABAAAH3hAAD','AAADSHAABAAAH3hAAE','AAADSHAABAAAH3hAAF','AAADSHAABAAAH3hAAG','AAADSHAABAAAH3hAAH','AAADSHAABAAAH3hAAI','AAADSHAABAAAH3hAAJ');
ROWID ID
------------------ ----------
AAADSHAABAAAH3hAAA 20
AAADSHAABAAAH3hAAB 46
AAADSHAABAAAH3hAAC 28
AAADSHAABAAAH3hAAD 15
AAADSHAABAAAH3hAAE 29
AAADSHAABAAAH3hAAF 3
AAADSHAABAAAH3hAAG 25
AAADSHAABAAAH3hAAH 41
AAADSHAABAAAH3hAAI 54
AAADSHAABAAAH3hAAJ 40
10 rows selected.
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
2 consistent gets
0 physical reads
0 redo size
875 bytes sent via SQL*Net to client
524 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
10 rows processed
10gR2 >select rowid,id from test where rowid in ('AAAzo5AABAAAKnWAAA','AAAzo5AABAAAKnWAAB','AAAzo5AABAAAKnWAAC','AAAzo5AABAAAKnWAAD','AAAzo5AABAAAKnWAAE','AAAzo5AABAAAKnWAAF','AAAzo5AABAAAKnWAAG','AAAzo5AABAAAKnWAAH','AAAzo5AABAAAKnWAAI','AAAzo5AABAAAKnWAAJ');
ROWID ID
------------------ ----------
AAAzo5AABAAAKnWAAA 30
AAAzo5AABAAAKnWAAB 8
AAAzo5AABAAAKnWAAC 14
AAAzo5AABAAAKnWAAD 34
AAAzo5AABAAAKnWAAE 45
AAAzo5AABAAAKnWAAF 39
AAAzo5AABAAAKnWAAG 47
AAAzo5AABAAAKnWAAH 51
AAAzo5AABAAAKnWAAI 11
AAAzo5AABAAAKnWAAJ 48
10 rows selected.
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
10 consistent gets
0 physical reads
0 redo size
861 bytes sent via SQL*Net to client
492 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
10 rows processed
A Rowid is made by: OOOOOOFFFBBBBBBRRR
O is object id,F is file id,B is block id,R is row number.
From above we can see, read 10 rows from same block in 11g cost 2 block read while in 10g cost 10 block read.
Further research also confirmed in 10g, every rowid will gernerate an logic read, even all the rowid are come from same block.
But in 11g some new machanism are introduced, if next rowid is in the same block with current row, then current block can be reuse.
More......
April 9, 2012
MV refresh cost long time due to internal recursive call
We have a job defined to do mutiple MV refresh. Most MVs can refresh normally, but there are two MVs cost very long time to complete.
No locks/latch contention found during the refresh.
Below is the log:
START_TIME OBJECT_NAME TIME_COST(s)
------------------- ------------------------------ -------------
2012-01-06 00:45:44 MV_APP_CURR_ST_1_HR_YLD_AVAIL 17.65
2012-01-06 00:46:02 MV_APP_CURR_STATE_YIELD_AVAIL 2.7
2012-01-06 00:46:05 MV_APP_CURRENT_STATE 220.34
2012-01-06 00:49:45 MV_APP_CURR_POLE_1_HR_Y_AVAIL 22.93
2012-01-06 00:50:08 MV_APP_CURR_STATE_POLE_Y_AVAIL 3.62
2012-01-06 00:50:12 MV_APP_CURRENT_STATE_POLE 125.59
From above we can see MV_APP_CURRENT_STATE and MV_APP_CURRENT_STATE_POLE cost very long time.
When check these MVs, found their definition are simply. I tried to run MVs' SELECT part, both can return all record in 1 second.
But they will take 100+ seconds when refresh the MV with below sentence:
dbms_mview.refresh('"GAPC"."MV_APP_CURRENT_STATE_POLE"', atomic_refresh ='FALSE');
After make an 10046 trace, found during the MV refresh, below internal recursive SQL cost most of time:
********************************************************************************
SQL ID: 6v8ygt9zbxsnh Plan Hash: 2840274064
select count(*), max(scn)
from
( select d.scn scn from sys.sumdelta$ d, sys.sumdep$ o
where o.sumobj# = :1 and o.p_obj# = d.tableobj# and d.spare1 =
0 and d.scn > :2 union all select m.commit_scn scn
from sys.sumdelta$ d, sys.sumdep$ o, sys.snap_xcmt$ m where
o.sumobj# = :1 and o.p_obj# = d.tableobj# and d.spare1 = 1 and
d.xid = m.xid and m.commit_scn > :2 )
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 7 0.00 0.00 0 1 0 0
Execute 7 0.00 0.00 0 0 0 0
Fetch 7 31.25 367.18 99567 127778 0 7
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 21 31.26 367.19 99567 127779 0 7
Notice red parts, above internel recursive SQL cost 367 seconds while total MV refresh time is 370 seconds.
After checking above SQL, find its slowness mainly caused by table sumdelta$ which is 1.5GB big:
SQL> select TABLEOBJ#,count(*) from sys.sumdelta$ group by TABLEOBJ#;
TABLEOBJ# COUNT(*)
---------- ----------
11784739 1536910
2857695 449018
7506186 8808703
This table contains 10 millions records. And most record are last year's.
SQL> select owner,object_id,object_type,object_name,status from dba_objects where object_id in (11784739,2857695,7506186);
OWNER OBJECT_ID OBJECT_TYPE OBJECT_NAME STATUS
---------- ---------- ------------------- ------------------------------ -------
GAPC 11784739 TABLE MV_EVENT_METER_ONE_DAY VALID
GAPC 2857695 TABLE MV_APP_CURRENT_ALERT_STATE VALID
These records in sys.sumdelta$ do related to our issue MVs.
Till now the issue is clearly. Due to some bug, expired data not purged normally in sys.sumdelta$.
This table grew bigger and bigger, and make the SQL related to it become rather slow.
After truncate this table sys.sumdelta$.
Those issue MVs finished in a few seconds.
More......
April 6, 2012
buffer busy waits, resmgr:cpu quantum, and Suspending MMON action ..
When checking on one DB, below queries hang:
bdr1 >select s.sid,p.SPID,s.serial# from v$session s, v$process p where s.PADDR=p.ADDR and s.sid= (select sid from v$mystat where rownum<2);
SID SPID SERIAL#
--- ----- ------
591 10422 25701
bdr1 >SELECT INST_ID,DECODE(request,0,'Holder: ','Waiter: ')||sid sess, id1, id2, lmode, request, type,ctime
FROM gV$LOCK WHERE (id1, id2, type) IN (SELECT id1, id2, type from gV$LOCK WHERE request>0)
ORDER BY id1, request;
---Hang there
After waited 5 minutes but no lucky, i decide to have a check.
First check the wait event:
bdr1 >select event from v$session where sid=591;
EVENT
------------------------------------------
buffer busy waits
bdr1 >select P1,P2,P3 from v$session where sid=591;
P1 P2 P3
---- ------- ----------
81 5925 21
Wait for 10 seconds, then query again:
bdr1 >select P1,P2,P3 from v$session where sid=591;
P1 P2 P3
---- ------- ----------
81 5925 21
Above output means the session is waiting buffer busy waits on file 81 block 5925 for more then ten seconds.
It is so wired, i definitely haven't encounter this before. As we all known, buffer busy waits occurs when another session pining one block during an DML, and it is a very light operation. How can a session wait one block for ten seconds?
Let's check what the block is:
bdr1 >alter system dump datafile 81 block 5925;
System altered.
The trace file shows this block is an undo segment header block:
index state cflags wrap# uel scn dba parent-xid nub stmt_num cmt
-------------------------------------------------------------
0x00 9 0x00 0x30998 0x0013 0x0b4f.57013a1b 0x198009af 0x0000.000.00000000 0x00000001 0x00000000 1332741051
0x01 9 0x00 0x30999 0x002f 0x0b4f.57013c9b 0x198009af 0x0000.000.00000000 0x00000001 0x00000000 1332741600
0x02 9 0x00 0x30999 0x001d 0x0b4f.5701397a 0x198009af 0x0000.000.00000000 0x00000001 0x00000000 1332740819
0x03 9 0x00 0x3099a 0x0021 0x0b4f.57013d32 0x198009b2 0x0000.000.00000000 0x00000001 0x00000000 1332741601
0x04 9 0x00 0x30999 0x0020 0x0b4f.57013d05 0x198009b2 0x0000.000.00000000 0x00000004 0x00000000 1332741600
0x05 10 0x00 0x3099a 0x001d 0x0b4f.57015ee9 0x198009ab 0x0000.000.00000000 0x00000001 0x00000000 0
......
bdr1 >SELECT SEGMENT_TYPE,OWNER||'.'||SEGMENT_NAME FROM DBA_EXTENTS where file_id= 81 AND block_id BETWEEN 5925 AND 5925+BLOCKS -1;
SEGMENT_TYPE OWNER||'.'||SEGMENT_NAME
------------------ ---------------------------------------------
TYPE2 UNDO SYS._SYSSMU3_424634979$
TYPE2 UNDO SYS._SYSSMU9_2057384446$
Tried to flush shared_pool and buffer_cache, but appear useless:
bdr1 >alter system flush shared_pool;
System altered.
bdr1 >select P1,P2,P3 from v$session where sid=591;
P1 P2 P3
---- ------- ----------
81 5925 21
bdr1 >alter system flush buffer_cache;
System altered.
bdr1 >select P1,P2,P3 from v$session where sid=591;
P1 P2 P3
---- ------- ----------
81 5925 21
Let's check which session is holding that undo segment header block:
bdr1 >select BLOCKING_INSTANCE,BLOCKING_SESSION from v$session where sid=591;
BLOCKING_INSTANCE BLOCKING_SESSION
----------------- ----------------
1 55
The block is holding by SID 55. Let's check the session 55:
SSID SERIAL# LOGON_TIME STATE STATUS EVENT PROGRAM
----- ---------- ----------- -------- ----- ------ --------
55 18167 26-MAR-12 02:00 WAITING ACTIVE resmgr:cpu quantum
oracle@ora03.globaltax.ge.com (M000)
Pay attention to the red part's event: resmgr:cpu quantum. It is related to resouce manager.
This session is MMON slave process.
Watched some time and confirmed this session is hang.
Let's check the alert.log and see weather we can find some clue:
Thu Mar 29 19:00:40 2012
Suspending MMON action 'AWR Auto CPU USAGE Task' for 82800 seconds
Thu Mar 29 19:10:42 2012
Suspending MMON slave action kewfmadsa_ for 82800 seconds
Thu Mar 29 23:05:58 2012
Suspending MMON action 'undo usage' for 82800 seconds
From the alert.log we can see that the MMON suspended, and unluckily at that moment it was pining the undo segment header block, and therefore blocking other sessions.
Suspending MMON is a new feature in 11G. If the system is so over-loaded that it takes over 15 minutes to gather statistics or other MMON tasks, this error is expected. It is a functionality enhancement in 11g, as it prevents MMON from locking resources those other processes might be waiting for. In 10g , mmon slaves are allowed to run indefinitely.
Let's check the OS load:
bdr1 >ho free -m
total used free shared buffers cached
Mem: 128874 128107 766 0 302 117527
-/+ buffers/cache: 10277 118596
Swap: 66463 4350 62113
Cpu(s): 12.8%us, 2.4%sy, 0.0%ni, 84.4%id, 0.3%wa, 0.0%hi, 0.1%si, 0.0%st
The load is quite low, then why MMON suspend?
The MMON is charge of AWR, let's check the AWR report:
Enter value for num_days: 3
Listing the last 3 days of Completed Snapshots
Snap
Instance DB Name Snap Id Snap Started Level
------------ ------------ --------- ------------------ -----
bdr1 BDR1 31208 24 Mar 2012 00:00 1
................
................
31251 25 Mar 2012 19:00 1
31252 25 Mar 2012 20:00 1
31253 25 Mar 2012 21:00 1
31254 25 Mar 2012 22:00 1
31255 25 Mar 2012 23:00 1
31256 26 Mar 2012 00:00 1
31257 26 Mar 2012 01:00 1
Specify the Begin and End Snapshot Ids
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Enter value for begin_snap:
bdr1 >
bdr1 >ho date
Fri Mar 30 05:41:54 EDT 2012
We can see currently is Mar30th, but the latest AWR report is gernerated at Mar26th. MMON facing the issue since then.
Let's check the alert.log what happend at Mar26th:
Sun Mar 25 22:03:08 2012
DBMS_STATS: GATHER_STATS_JOB encountered errors. Check the trace file.
ORA-20011: Approximate NDV failed: ORA-29913: error in executing ODCIEXTTABLEOPEN callout
ORA-29400: data cartridge error
KUP-04040: file outdata_to_show.txt in STADM_DIR not found
Mon Mar 26 02:00:00 2012
Closing scheduler window
Closing Resource Manager plan via scheduler window
Clearing Resource Manager plan via parameter
Setting Resource Manager plan SCHEDULER[0xD9D6EFAULT_MAINTENANCE_PLAN via scheduler window
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Mon Mar 26 02:28:21 2012
Suspending MMON action 'undo usage' for 82800 seconds
Mon Mar 26 03:10:25 2012
Process 0x0xdcbda2a8 appears to be hung in Auto SQL Tuning task
Current time = 1332745824, process death time = 1332745800
Attempting to kill process 0x0xdcbda2a8 with OS pid = 11769
OSD kill succeeded for process 0xdcbda2a8
Till now, we can obviously consider this issue somehow related to resouce manager.
After created an SR, Oracle Support confirmed this is an resource manager's bug, the solution is to disable resource manager:
1. alter system set resource_manager_plan='';
2. For each window_name:
execute dbms_scheduler.set_attribute('WEEKNIGHT_WINDOW','RESOURCE_PLAN','');
More......