August 7, 2012

SYSTEMSTATE DUMP: Shooting DB Hang, sqlplus Hang, self deadlock latch, PMON deadlock latch

One DB is hang. Try to login with "sqlplus / as sysdba" also fail, just hang there.

I use "-prelim" option to login and make a systemstate dump.

Below is the ass.awk output for systemstate dump:

System State 1
~~~~~~~~~~~~~~~~
1:
2:  0: waiting for 'pmon timer'    <---- PMON is fine at this moment
3:  0: waiting for 'rdbms ipc message'
4:  0: waiting for 'VKTM Logical Idle Wait'
5:  0: waiting for 'rdbms ipc message'
6:  0: waiting for 'DIAG idle wait'
7:  0: waiting for 'rdbms ipc message'
8:  0: waiting for 'DIAG idle wait'
9:  0: waiting for 'rdbms ipc message'
10: 0: waiting for 'rdbms ipc message'
11: 0: waiting for 'rdbms ipc message'
12: 0: waiting for 'rdbms ipc message'
13: 0: waiting for 'rdbms ipc message'
14: 0: waiting for 'rdbms ipc message'
15: 0: waiting for 'rdbms ipc message'
16: 0: waiting for 'rdbms ipc message'
17: 0: waiting for 'rdbms ipc message'
18: 0: waiting for 'rdbms ipc message'
19: 0: waiting for 'rdbms ipc message'
20: 0: waiting for 'rdbms ipc message'
21: 0: waiting for 'rdbms ipc message'
22: 0: waiting for 'rdbms ipc message'
23: 0: waiting for 'rdbms ipc message'
24: 0: waiting for 'rdbms ipc message'
25: 0: waiting for 'rdbms ipc message'
26: 0: waiting for 'rdbms ipc message'
27: 0: waiting for 'rdbms ipc message'
28: 0: waiting for 'rdbms ipc message'
29: 0: waiting for 'rdbms ipc message'
30: 0: waiting for 'rdbms ipc message'
31: 0: waiting for 'rdbms ipc message'
32: 0: waiting for 'smon timer'
33: 0: waiting for 'rdbms ipc message'
34: 0: waiting for 'latch free'        [Latch 60009fc8]
35: 0: waiting for 'rdbms ipc message'
36:                                    [Latch 60009fc8]
37: 0: waiting for 'enq: PV - syncstart'[Enqueue PV-00000000-00000000]
38:                                    [Latch 60009fc8]
39: 0: waiting for 'SQL*Net message from client'
40: 0: waiting for 'VKRM Idle'
41: 0: waiting for 'rdbms ipc message'
42: 0: waiting for 'rdbms ipc message'
43: 0: waiting for 'rdbms ipc message'
44: 0: waiting for 'Streams AQ: qmn coordinator idle wait'
45: 0: waiting for 'enq: PR - contention'[Enqueue PR-00000000-00000000]
46:                                    [Latch 60009fc8]
47: 0: waiting for 'Streams AQ: qmn slave idle wait'
48:                                    [Latch 60009fc8]
49: 0: waiting for 'SQL*Net message from client'
50:                                    [Latch 60009fc8]
51: 9: waited for 'Streams AQ: waiting for time management or cleanup tasks'
52: 0: waiting for 'SQL*Net message from client'
Blockers
~~~~~~~~

        Above is a list of all the processes. If they are waiting for a resource
        then it will be given in square brackets. Below is a summary of the
        waited upon resources, together with the holder of that resource.
        Notes:
        ~~~~~
         o A process id of '???' implies that the holder was not found in the
           systemstate.

                    Resource Holder State
              Latch 60009fc8    ??? Blocker
Enqueue PV-00000000-00000000    34: 34: is waiting for Latch 60009fc8
Enqueue PR-00000000-00000000    34: 34: is waiting for Latch 60009fc8

Object Names
~~~~~~~~~~~~
Latch 60009fc8   process allocation
Enqueue PV-00000000-00000000
Enqueue PR-00000000-00000000

Obviously the root blocker is latch: 60009fc8. But unluckily the systemstate dump don't have the infor that which session is holding the latch:
alpcispdb552[oracle]_pgbs> cat pgbs_diag_2254.trc|grep 60009fc8
      waiting for 60009fc8  process allocation level=7
        address=0x60009fc8, number=0x9, tries=0x0
          p1: 'address'=0x60009fc8
      waiting for 60009fc8  process allocation level=7
      waiting for 60009fc8  process allocation level=7
      waiting for 60009fc8  process allocation level=7
      waiting for 60009fc8  process allocation level=7
      waiting for 60009fc8  process allocation level=7

Then let's choose any one blocked process to check:
PROCESS 34: MMON
  ----------------------------------------
  SO: 0x1972e95a8, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x1972e95a8, name=process, file=ksu.h LINE:12451, pg=0
  (process) Oracle pid:34, ser:1, calls cur/top: 0x189db28e8/0x19f52b748
            flags : (0x2) SYSTEM
            flags2: (0x20),  flags3: (0x0)
            intr error: 0, call error: 0, sess error: 0, txn error 0
            intr queue: empty
    ksudlp FALSE at location: 0
  (post info) last post received: 0 0 68
              last post received-location: kso2.h LINE:467 ID:ksoreq_reply
              last process to post me: 19f2ef058 165 0
              last post sent: 0 0 26
              last post sent-location: ksa2.h LINE:282 ID:ksasnd
              last process posted by me: 1982dec38 1 6
    (latch info) wait_event=0 bits=0
        Location from where call was made: ksu.h LINE:13713 ID:ksu_unreserve:
      waiting for 60009fc8  process allocation level=7
        Location from where latch is held: ksu.h LINE:13677 ID:ksuapc: proc addr
        Context saved from call: 6965578584
        state=busy [holder orapid=0] wlstate=free [value=0]
          gotten 57043 times wait, failed first 54 sleeps 0
          gotten 26253 times nowait, failed: 1
        possible holder pid = 0 ospid=26755
    Process Group: DEFAULT, pseudo proc: 0x19734a4b8
    O/S info: user: oracle, term: UNKNOWN, ospid: 2339
    OSD pid info: Unix process pid: 2339, image: oracle@server552(MMON)

Above blocked process is MMON, pay attention to red part, it is waiting for 60009fc8, and latch's possible holder ospid: 26755.

Let’s check ospid 26755:

PROCESS 50:
  ----------------------------------------
  SO: 0x19f2ef058, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x19f2ef058, name=process, file=ksu.h LINE:12451, pg=0
  (process) Oracle pid:50, ser:165, calls cur/top: (nil)/(nil)
            flags : (0x0) -
            flags2: (0x8000),  flags3: (0x0)
            intr error: 0, call error: 0, sess error: 0, txn error 0
            intr queue: empty
    ksudlp FALSE at location: 0
  (post info) last post received: 0 0 0
              last post received-location: No post
              last process to post me: none
              last post sent: 0 0 0
              last post sent-location: No post
              last process posted by me: none
    (latch info) wait_event=0 bits=0
        Location from where call was made: ksu.h LINE:13721 ID:ksudlp:
        Context saved from call: 6965620824
      waiting for 60009fc8  process allocation level=7
        Location from where latch is held: ksu.h LINE:13677 ID:ksuapc: proc addr
        Context saved from call: 6965578584
        state=busy [holder orapid=0] wlstate=free [value=0]
          gotten 57043 times wait, failed first 54 sleeps 0
          gotten 26253 times nowait, failed: 1
        possible holder pid = 0 ospid=26755
    O/S info: user: oracle, term: UNKNOWN, ospid: 26755
    OSD pid info: Unix process pid: 26755, image: oracle@alpcispdb552)

We can see the Process 26755 is blocked by itself. A self deadlock latch issue.

Then i kill the Process 26755 from OS level with "kill -9" to get the latch released. But unluckly the DB is still hang.

I make a latest systemstate and check:

System State 1
~~~~~~~~~~~~~~~~
1:
2:  0: waiting for 'latch free'        [Latch 60009fc8]  ---- PMON is blocked
3:  0: waiting for 'rdbms ipc message'
4:  0: waiting for 'VKTM Logical Idle Wait'
5:  0: waiting for 'rdbms ipc message'
6:
7:  0: waiting for 'rdbms ipc message'
8:  0: waiting for 'DIAG idle wait'
9:  0: waiting for 'rdbms ipc message'
10: 0: waiting for 'rdbms ipc message'
11: 0: waiting for 'rdbms ipc message'
12: 0: waiting for 'rdbms ipc message'
13: 0: waiting for 'rdbms ipc message'
14: 0: waiting for 'rdbms ipc message'
15: 0: waiting for 'rdbms ipc message'
16: 0: waiting for 'rdbms ipc message'
17: 0: waiting for 'rdbms ipc message'
18: 0: waiting for 'rdbms ipc message'
19: 0: waiting for 'rdbms ipc message'
20: 0: waiting for 'rdbms ipc message'
21: 0: waiting for 'rdbms ipc message'
22: 0: waiting for 'rdbms ipc message'
23: 0: waiting for 'rdbms ipc message'
24: 0: waiting for 'rdbms ipc message'
25: 0: waiting for 'rdbms ipc message'
26: 0: waiting for 'rdbms ipc message'
27: 0: waiting for 'rdbms ipc message'
28: 0: waiting for 'rdbms ipc message'
29: 0: waiting for 'rdbms ipc message'
30: 0: waiting for 'rdbms ipc message'
31: 0: waiting for 'rdbms ipc message'
32: 0: waiting for 'smon timer'
33: 0: waiting for 'rdbms ipc message'
34: 0: waiting for 'latch free'        [Latch 60009fc8]
35: 0: waiting for 'rdbms ipc message'
36:                                    [Latch 60009fc8]
37: 0: waiting for 'enq: PV - syncstart'[Enqueue PV-00000000-00000000]
38:                                    [Latch 60009fc8]
39: 0: waiting for 'SQL*Net message from client'
40: 0: waiting for 'VKRM Idle'
41: 0: waiting for 'rdbms ipc message'
42: 0: waiting for 'rdbms ipc message'
43: 0: waiting for 'rdbms ipc message'
44: 0: waiting for 'Streams AQ: qmn coordinator idle wait'
45: 0: waiting for 'enq: PR - contention'[Enqueue PR-00000000-00000000]
46:                                    [Latch 60009fc8]
47: 0: waiting for 'Streams AQ: qmn slave idle wait'
48:                                    [Latch 60009fc8]
49: 0: waiting for 'SQL*Net message from client'
50:                                    [Latch 60009fc8]
51: 9: waited for 'Streams AQ: waiting for time management or cleanup tasks'
52: 0: waiting for 'SQL*Net message from client'
Blockers
~~~~~~~~

        Above is a list of all the processes. If they are waiting for a resource
        then it will be given in square brackets. Below is a summary of the
        waited upon resources, together with the holder of that resource.
        Notes:
        ~~~~~
         o A process id of '???' implies that the holder was not found in the
           systemstate.

                    Resource Holder State
              Latch 60009fc8    ??? Blocker
Enqueue PV-00000000-00000000    34: 34: is waiting for Latch 60009fc8
Enqueue PR-00000000-00000000    34: 34: is waiting for Latch 60009fc8

Object Names
~~~~~~~~~~~~
Latch 60009fc8   process allocation
Enqueue PV-00000000-00000000
Enqueue PR-00000000-00000000

We can see pmon is also blocked on latch 60009fc8.

And below is the process dump for PMON:

PROCESS 2: PMON
  ----------------------------------------
  SO: 0x19935b370, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x19935b370, name=process, file=ksu.h LINE:12451, pg=0
  (process) Oracle pid:2, ser:1, calls cur/top: 0x19f525a88/0x19f525a88
            flags : (0xe) SYSTEM
            flags2: (0x0),  flags3: (0x0)
            intr error: 0, call error: 0, sess error: 0, txn error 0
            intr queue: empty
    ksudlp FALSE at location: 0
  (post info) last post received: 0 0 68
              last post received-location: kso2.h LINE:467 ID:ksoreq_reply
              last process to post me: 19f2edfd8 2 2
              last post sent: 0 0 26
              last post sent-location: ksa2.h LINE:282 ID:ksasnd
              last process posted by me: 1982dec38 1 6
    (latch info) wait_event=0 bits=0
        Location from where call was made: ksu.h LINE:13721 ID:ksudlp:
        Context saved from call: 6814711808
      waiting for 60009fc8  process allocation level=7
        Location from where latch is held: ksu.h LINE:13677 ID:ksuapc: proc addr
        Context saved from call: 6965578584
        state=busy [holder orapid=0] wlstate=free [value=0]
          gotten 57043 times wait, failed first 54 sleeps 0
          gotten 26253 times nowait, failed: 1
        possible holder pid = 0 ospid=26755
    Process Group: DEFAULT, pseudo proc: 0x19734a4b8
    O/S info: user: oracle, term: UNKNOWN, ospid: 1311
    OSD pid info: Unix process pid: 1311, image: oracle@server552(PMON)

We can see PMON is waiting on latch 60009fc8, blocked by the process that we just killed.

It is clear to us now:
process 26755 was in self deadlock state on latch 60009fc8 and blocked other processes. Later i killed process 26755 to release the latch, then pmon should go to clean the process and release the resource.
But before pmon start to work, pmon itself also require the same latch: 60009fc8 and hence blocked and can't progress.
So it is a typiclly pmon deadlock.

We had to bounce the DB at last.

More......