April 6, 2012

buffer busy waits, resmgr:cpu quantum, and Suspending MMON action ..

While checking one DB, the queries below hung:
bdr1 >select s.sid,p.SPID,s.serial# from v$session s, v$process p where s.PADDR=p.ADDR and s.sid= (select sid from v$mystat where rownum<2);
SID SPID  SERIAL#
--- ----- ------
591 10422 25701

bdr1 >SELECT INST_ID,DECODE(request,0,'Holder: ','Waiter: ')||sid sess, id1, id2, lmode, request, type,ctime
FROM gV$LOCK WHERE (id1, id2, type) IN (SELECT id1, id2, type from gV$LOCK WHERE request>0)
ORDER BY id1, request;

--- hangs here

After waiting 5 minutes with no luck, I decided to investigate.
First, check the wait event:
bdr1 >select event from v$session where sid=591;
EVENT
------------------------------------------
buffer busy waits

bdr1 >select P1,P2,P3 from v$session where sid=591;
P1   P2      P3
---- ------- ----------
81   5925    21


After waiting 10 seconds, query again:
bdr1 >select P1,P2,P3 from v$session where sid=591;
P1   P2      P3
---- ------- ----------
81   5925    21


The output above means the session has been waiting on buffer busy waits for file 81, block 5925 for more than ten seconds.
That is weird; I have definitely never encountered this before. As we all know, a buffer busy wait occurs when another session is pinning a block during DML, and it is normally a very short wait. How can a session wait on one block for ten seconds?
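As a reminder, the meaning of P1/P2/P3 for any wait event can be read from v$event_name; for buffer busy waits they are file#, block#, and class#:

```sql
-- Look up what P1/P2/P3 mean for this wait event
SELECT name, parameter1, parameter2, parameter3
  FROM v$event_name
 WHERE name = 'buffer busy waits';
```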

Let's check what the block is:
bdr1 >alter system dump datafile 81 block 5925;
System altered.
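On 11g, the trace file the dump was written to can be located via v$diag_info (on 10g you would look under user_dump_dest using the SPID instead):

```sql
-- Find the trace file the dump was written to (11g)
SELECT value
  FROM v$diag_info
 WHERE name = 'Default Trace File';
```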


The trace file shows this block is an undo segment header block:
index state cflags wrap# uel scn dba parent-xid nub stmt_num cmt
-------------------------------------------------------------

0x00 9 0x00 0x30998 0x0013 0x0b4f.57013a1b 0x198009af 0x0000.000.00000000 0x00000001 0x00000000 1332741051
0x01 9 0x00 0x30999 0x002f 0x0b4f.57013c9b 0x198009af 0x0000.000.00000000 0x00000001 0x00000000 1332741600
0x02 9 0x00 0x30999 0x001d 0x0b4f.5701397a 0x198009af 0x0000.000.00000000 0x00000001 0x00000000 1332740819
0x03 9 0x00 0x3099a 0x0021 0x0b4f.57013d32 0x198009b2 0x0000.000.00000000 0x00000001 0x00000000 1332741601
0x04 9 0x00 0x30999 0x0020 0x0b4f.57013d05 0x198009b2 0x0000.000.00000000 0x00000004 0x00000000 1332741600
0x05 10 0x00 0x3099a 0x001d 0x0b4f.57015ee9 0x198009ab 0x0000.000.00000000 0x00000001 0x00000000 0
......

bdr1 >SELECT SEGMENT_TYPE,OWNER||'.'||SEGMENT_NAME FROM DBA_EXTENTS where file_id= 81 AND block_id BETWEEN 5925 AND 5925+BLOCKS -1;
SEGMENT_TYPE       OWNER||'.'||SEGMENT_NAME
------------------ ---------------------------------------------
TYPE2 UNDO         SYS._SYSSMU3_424634979$
TYPE2 UNDO         SYS._SYSSMU9_2057384446$
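Incidentally, the P3 value (class# = 21 here) already hints that this is not an ordinary data block: classes beyond the fixed set correspond to undo segment headers and blocks. As a rough sketch (the exact numbering is version-dependent), the class names can be eyeballed with:

```sql
-- v$waitstat lists block classes in class# order,
-- so rownum only approximates class#
SELECT rownum AS class#, class
  FROM v$waitstat;
```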


I tried flushing the shared pool and the buffer cache, but it made no difference:
bdr1 >alter system flush shared_pool;
System altered.
bdr1 >select P1,P2,P3 from v$session where sid=591;
P1   P2      P3
---- ------- ----------
81   5925    21


bdr1 >alter system flush buffer_cache;
System altered.
bdr1 >select P1,P2,P3 from v$session where sid=591;
P1   P2      P3
---- ------- ----------
81   5925    21


Let's check which session is holding that undo segment header block:
bdr1 >select BLOCKING_INSTANCE,BLOCKING_SESSION from v$session where sid=591;
BLOCKING_INSTANCE BLOCKING_SESSION
----------------- ----------------
1                 55

The block is held by SID 55. Let's check session 55:
SID  SERIAL# LOGON_TIME      STATE   STATUS EVENT              PROGRAM
---- ------- --------------- ------- ------ ------------------ ------------------------------------
55   18167   26-MAR-12 02:00 WAITING ACTIVE resmgr:cpu quantum oracle@ora03.globaltax.ge.com (M000)
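The output above came from a query along these lines (the exact column list is assumed):

```sql
-- Inspect the blocking session
SELECT s.sid, s.serial#, s.logon_time, s.state,
       s.status, s.event, s.program
  FROM v$session s
 WHERE s.sid = 55;
```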

Pay attention to the event: resmgr:cpu quantum. It is related to Resource Manager.

This session is an MMON slave process (M000).
I watched it for a while and confirmed the session was hung.

Let's check the alert.log and see whether we can find a clue:
Thu Mar 29 19:00:40 2012
Suspending MMON action 'AWR Auto CPU USAGE Task' for 82800 seconds
Thu Mar 29 19:10:42 2012
Suspending MMON slave action kewfmadsa_ for 82800 seconds
Thu Mar 29 23:05:58 2012
Suspending MMON action 'undo usage' for 82800 seconds

From the alert.log we can see that MMON was suspended, and unluckily at that moment it was pinning the undo segment header block, thereby blocking other sessions.

Suspending MMON is a new feature in 11g. If the system is so overloaded that statistics gathering or other MMON tasks take more than 15 minutes, this behavior is expected. It is a functional enhancement in 11g, as it prevents MMON from holding resources that other processes might be waiting for. In 10g, MMON slaves were allowed to run indefinitely.


Let's check the OS load (free -m, plus the CPU line from top):
bdr1 >ho free -m
             total       used       free     shared    buffers     cached
Mem:        128874     128107        766          0        302     117527
-/+ buffers/cache:      10277     118596
Swap:        66463       4350      62113

Cpu(s): 12.8%us,  2.4%sy,  0.0%ni, 84.4%id,  0.3%wa,  0.0%hi,  0.1%si,  0.0%st



The load is quite low, so why was MMON suspended?
MMON is in charge of AWR, so let's check the AWR snapshots:
Enter value for num_days: 3
Listing the last 3 days of Completed Snapshots
                                                        Snap
Instance     DB Name        Snap Id    Snap Started    Level
------------ ------------ --------- ------------------ -----
bdr1         BDR1             31208 24 Mar 2012 00:00      1
................
................
                              31251 25 Mar 2012 19:00      1
                              31252 25 Mar 2012 20:00      1
                              31253 25 Mar 2012 21:00      1
                              31254 25 Mar 2012 22:00      1
                              31255 25 Mar 2012 23:00      1
                              31256 26 Mar 2012 00:00      1
                              31257 26 Mar 2012 01:00      1


Specify the Begin and End Snapshot Ids
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Enter value for begin_snap:
bdr1 >
bdr1 >ho date
Fri Mar 30 05:41:54 EDT 2012

We can see it is currently Mar 30th, but the latest AWR snapshot was generated on Mar 26th. MMON has been facing this issue since then.
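The same conclusion can be drawn straight from the workload repository, for example:

```sql
-- When did MMON last complete an AWR snapshot?
SELECT MAX(end_interval_time) AS last_snap
  FROM dba_hist_snapshot
 WHERE dbid = (SELECT dbid FROM v$database);
```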

Let's check the alert.log for what happened on Mar 26th:
Sun Mar 25 22:03:08 2012
DBMS_STATS: GATHER_STATS_JOB encountered errors.  Check the trace file.
ORA-20011: Approximate NDV failed: ORA-29913: error in executing ODCIEXTTABLEOPEN callout
ORA-29400: data cartridge error
KUP-04040: file outdata_to_show.txt in STADM_DIR not found
Mon Mar 26 02:00:00 2012
Closing scheduler window
Closing Resource Manager plan via scheduler window
Clearing Resource Manager plan via parameter
Setting Resource Manager plan SCHEDULER[0xD9D6EFAULT_MAINTENANCE_PLAN via scheduler window
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Mon Mar 26 02:28:21 2012
Suspending MMON action 'undo usage' for 82800 seconds
Mon Mar 26 03:10:25 2012
Process 0x0xdcbda2a8 appears to be hung in Auto SQL Tuning task
Current time = 1332745824, process death time = 1332745800
Attempting to kill process 0x0xdcbda2a8 with OS pid = 11769
OSD kill succeeded for process 0xdcbda2a8


By now it is clear that this issue is somehow related to Resource Manager.
After I created an SR, Oracle Support confirmed this is a Resource Manager bug; the workaround is to disable Resource Manager:
1. alter system set resource_manager_plan='';
2. For each window_name:
execute dbms_scheduler.set_attribute('WEEKNIGHT_WINDOW','RESOURCE_PLAN',''); 
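To cover every maintenance window rather than just WEEKNIGHT_WINDOW, step 2 can be scripted, for example (window names taken from dba_scheduler_windows):

```sql
-- Clear the resource plan attribute on every scheduler window
BEGIN
  FOR w IN (SELECT window_name FROM dba_scheduler_windows) LOOP
    dbms_scheduler.set_attribute(
      name      => 'SYS.' || w.window_name,
      attribute => 'RESOURCE_PLAN',
      value     => '');
  END LOOP;
END;
/
```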

2 Comments:

Anonymous said...

Very good investigation.

Anonymous said...

Great Work....
