While checking on one DB, the queries below hung:
bdr1 >select s.sid,p.SPID,s.serial# from v$session s, v$process p where s.PADDR=p.ADDR and s.sid= (select sid from v$mystat where rownum<2);
SID SPID SERIAL#
--- ----- ------
591 10422 25701
bdr1 >SELECT INST_ID,DECODE(request,0,'Holder: ','Waiter: ')||sid sess, id1, id2, lmode, request, type,ctime
FROM gV$LOCK WHERE (id1, id2, type) IN (SELECT id1, id2, type from gV$LOCK WHERE request>0)
ORDER BY id1, request;
--- hangs here
After waiting five minutes with no luck, I decided to take a closer look.
First, check the wait event:
bdr1 >select event from v$session where sid=591;
EVENT
------------------------------------------
buffer busy waits
bdr1 >select P1,P2,P3 from v$session where sid=591;
P1 P2 P3
---- ------- ----------
81 5925 21
Wait for 10 seconds, then query again:
bdr1 >select P1,P2,P3 from v$session where sid=591;
P1 P2 P3
---- ------- ----------
81 5925 21
The output above means the session has been waiting on buffer busy waits for file 81, block 5925 for more than ten seconds.
That is really weird; I had definitely never encountered this before. As we all know, a buffer busy wait occurs when another session is pinning a block during DML, and it is normally a very short wait. How can a session wait on one block for ten seconds?
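As a sanity check (this query is not part of the original output), the meaning of P1/P2/P3 for this wait event can be confirmed from v$event_name:
select parameter1, parameter2, parameter3
  from v$event_name
 where name = 'buffer busy waits';
-- expected: file#, block#, class# (class numbers beyond the fixed set correspond to undo header/undo block buffers)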
Let's check what the block is:
bdr1 >alter system dump datafile 81 block 5925;
System altered.
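The dump is written to the session's trace file; one way to locate it (not shown in the original post, assuming 11g ADR is in use) is:
select value
  from v$diag_info
 where name = 'Default Trace File';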
The trace file shows this block is an undo segment header block:
index state cflags wrap# uel scn dba parent-xid nub stmt_num cmt
-------------------------------------------------------------
0x00 9 0x00 0x30998 0x0013 0x0b4f.57013a1b 0x198009af 0x0000.000.00000000 0x00000001 0x00000000 1332741051
0x01 9 0x00 0x30999 0x002f 0x0b4f.57013c9b 0x198009af 0x0000.000.00000000 0x00000001 0x00000000 1332741600
0x02 9 0x00 0x30999 0x001d 0x0b4f.5701397a 0x198009af 0x0000.000.00000000 0x00000001 0x00000000 1332740819
0x03 9 0x00 0x3099a 0x0021 0x0b4f.57013d32 0x198009b2 0x0000.000.00000000 0x00000001 0x00000000 1332741601
0x04 9 0x00 0x30999 0x0020 0x0b4f.57013d05 0x198009b2 0x0000.000.00000000 0x00000004 0x00000000 1332741600
0x05 10 0x00 0x3099a 0x001d 0x0b4f.57015ee9 0x198009ab 0x0000.000.00000000 0x00000001 0x00000000 0
......
bdr1 >SELECT SEGMENT_TYPE,OWNER||'.'||SEGMENT_NAME FROM DBA_EXTENTS where file_id= 81 AND block_id BETWEEN 5925 AND 5925+BLOCKS -1;
SEGMENT_TYPE OWNER||'.'||SEGMENT_NAME
------------------ ---------------------------------------------
TYPE2 UNDO SYS._SYSSMU3_424634979$
TYPE2 UNDO SYS._SYSSMU9_2057384446$
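As a side check (not part of the original session), one could also see whether any active transactions are still bound to those undo segments:
select n.name, r.status, r.xacts
  from v$rollname n, v$rollstat r
 where n.usn = r.usn
   and n.name in ('_SYSSMU3_424634979$', '_SYSSMU9_2057384446$');
-- xacts = number of active transactions in each undo segment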
I tried flushing the shared pool and the buffer cache, but it made no difference:
bdr1 >alter system flush shared_pool;
System altered.
bdr1 >select P1,P2,P3 from v$session where sid=591;
P1 P2 P3
---- ------- ----------
81 5925 21
bdr1 >alter system flush buffer_cache;
System altered.
bdr1 >select P1,P2,P3 from v$session where sid=591;
P1 P2 P3
---- ------- ----------
81 5925 21
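Instead of re-querying P1/P2/P3 by hand, the wait duration can also be read directly (a hedged alternative, not in the original output):
select event, state, seconds_in_wait
  from v$session
 where sid = 591;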
Let's check which session is holding that undo segment header block:
bdr1 >select BLOCKING_INSTANCE,BLOCKING_SESSION from v$session where sid=591;
BLOCKING_INSTANCE BLOCKING_SESSION
----------------- ----------------
1 55
The block is held by SID 55. Let's check session 55:
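The exact statement is not shown in the original post, but it was presumably something along these lines:
select sid, serial#, logon_time, state, status, event, program
  from v$session
 where sid = 55;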
  SID SERIAL# LOGON_TIME       STATE   STATUS EVENT              PROGRAM
----- ------- ---------------- ------- ------ ------------------ ------------------------------------
   55   18167 26-MAR-12 02:00  WAITING ACTIVE resmgr:cpu quantum oracle@ora03.globaltax.ge.com (M000)
Pay attention to the event: resmgr:cpu quantum. It is related to Resource Manager.
This session is an MMON slave process (M000).
I watched it for a while and confirmed the session was hung.
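resmgr:cpu quantum means the session is waiting for Resource Manager to allocate it a slice of CPU. A quick way (assumed, not shown in the original) to see which resource plan is currently active:
show parameter resource_manager_plan
select name, is_top_plan
  from v$rsrc_plan;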
Let's check the alert.log and see whether we can find some clue:
Thu Mar 29 19:00:40 2012
Suspending MMON action 'AWR Auto CPU USAGE Task' for 82800 seconds
Thu Mar 29 19:10:42 2012
Suspending MMON slave action kewfmadsa_ for 82800 seconds
Thu Mar 29 23:05:58 2012
Suspending MMON action 'undo usage' for 82800 seconds
From the alert.log we can see that MMON was suspended, and unluckily at that moment it was pinning the undo segment header block, thereby blocking other sessions.
Suspending MMON is a new feature in 11g. If the system is so overloaded that gathering statistics or other MMON tasks takes more than 15 minutes, this message is expected. It is a functional enhancement in 11g, since it prevents MMON from holding resources that other processes might be waiting for. In 10g, MMON slaves were allowed to run indefinitely.
Let's check the OS load:
bdr1 >ho free -m
total used free shared buffers cached
Mem: 128874 128107 766 0 302 117527
-/+ buffers/cache: 10277 118596
Swap: 66463 4350 62113
Cpu(s): 12.8%us, 2.4%sy, 0.0%ni, 84.4%id, 0.3%wa, 0.0%hi, 0.1%si, 0.0%st
The load is quite low, so why was MMON suspended?
MMON is in charge of AWR, so let's check the AWR snapshots:
Enter value for num_days: 3
Listing the last 3 days of Completed Snapshots
Snap
Instance DB Name Snap Id Snap Started Level
------------ ------------ --------- ------------------ -----
bdr1 BDR1 31208 24 Mar 2012 00:00 1
................
................
31251 25 Mar 2012 19:00 1
31252 25 Mar 2012 20:00 1
31253 25 Mar 2012 21:00 1
31254 25 Mar 2012 22:00 1
31255 25 Mar 2012 23:00 1
31256 26 Mar 2012 00:00 1
31257 26 Mar 2012 01:00 1
Specify the Begin and End Snapshot Ids
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Enter value for begin_snap:
bdr1 >
bdr1 >ho date
Fri Mar 30 05:41:54 EDT 2012
We can see it is currently Mar 30th, but the latest AWR snapshot was generated on Mar 26th. MMON has been facing this issue since then.
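The same conclusion can be cross-checked directly against the AWR repository (a hedged alternative to the awrrpt listing above):
select max(end_interval_time) last_snap
  from dba_hist_snapshot
 where dbid = (select dbid from v$database);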
Let's check the alert.log to see what happened around Mar 26th:
Sun Mar 25 22:03:08 2012
DBMS_STATS: GATHER_STATS_JOB encountered errors. Check the trace file.
ORA-20011: Approximate NDV failed: ORA-29913: error in executing ODCIEXTTABLEOPEN callout
ORA-29400: data cartridge error
KUP-04040: file outdata_to_show.txt in STADM_DIR not found
Mon Mar 26 02:00:00 2012
Closing scheduler window
Closing Resource Manager plan via scheduler window
Clearing Resource Manager plan via parameter
Setting Resource Manager plan SCHEDULER[0xD9D6]:DEFAULT_MAINTENANCE_PLAN via scheduler window
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Mon Mar 26 02:28:21 2012
Suspending MMON action 'undo usage' for 82800 seconds
Mon Mar 26 03:10:25 2012
Process 0x0xdcbda2a8 appears to be hung in Auto SQL Tuning task
Current time = 1332745824, process death time = 1332745800
Attempting to kill process 0x0xdcbda2a8 with OS pid = 11769
OSD kill succeeded for process 0xdcbda2a8
By this point it was clear the issue was somehow related to Resource Manager.
After I opened an SR, Oracle Support confirmed this is a Resource Manager bug; the workaround is to disable Resource Manager:
1. alter system set resource_manager_plan='';
2. For each window_name:
execute dbms_scheduler.set_attribute('WEEKNIGHT_WINDOW','RESOURCE_PLAN','');
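As a sketch (assuming the default 11g scheduler windows; adjust to your environment), the two steps can be scripted against every window:
alter system set resource_manager_plan = '';
begin
  -- clear the resource plan attribute on every scheduler window
  for w in (select window_name from dba_scheduler_windows) loop
    dbms_scheduler.set_attribute(w.window_name, 'RESOURCE_PLAN', '');
  end loop;
end;
/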