December 20, 2012

Performance Shooting: How much can SEQUENCE slow down a SQL

How much can SEQUENCE drag down speed of SQL?
10%? 20%? Below my working case show you SEQUENCE drag down 600% speed of one SQL.

It is during my recently shooting of a job's performance issue, i notice this SQL in the job block.
This is quite a simple SQL, two tables join, only thing secial is that it is using a sequence GGGC_SSBA.SEQ_MIM_CCCC_PPPP_SSS.NEXTVAL:

dwxxxx2 > SELECT /*+ USE_HASH(S1 TM)*/    GGGC_SSBA.SEQ_MIM_CCCC_PPPP_SSS.NEXTVAL        ,330404
  ,TO_DATE('2012-12-18 06:45:18', 'YYYY-MM-DD HH24:MI:SS')
  ,TO_DATE('2012-12-18 06:45:18', 'YYYY-MM-DD HH24:MI:SS')
  ,TO_DATE('2100-12-31', 'YYYY-MM-DD')    ,S1.MATNR AS MATNR      ,S1.VKORG AS
  VKORG   ,S1.VTWEG AS VTWEG      ,S1.KONDM AS KONDM      ,S1.DWERK AS DWERK
  ,S1.VMSTA AS VMSTA      ,S1.VMSTD AS VMSTD      ,S1.LVORM AS LVORM
  ,S1.DT_TM_XTR AS DT_TM_XTR      ,S1.CDHDR_UDATE AS CDHDR_UDATE  
  ,S1.PROVG AS PROVG FROM    GEPC_TEMP_TABLE.KKK_MIM_XXXX_RRRR_SSS TM
  GEPC_SOURCE.EEEIE_ZZZZZ_MMML_SLS S1 WHERE      TM.SOURCE_ROWID = S1.ROWID;

643418 rows selected.
Elapsed: 00:01:00.21

Execution Plan
----------------------------------------------------------
Plan hash value: 3276368232
--------------------------------------------------------------------------------------------
| Id  | Operation           | Name                  | Rows  | Bytes | Cost (%CPU)| Time    |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |                       | 36837 |  2590K|   759   (3)| 00:00:14|
|   1 |  SEQUENCE           | SEQ_MIM_CCCC_PPPP_SSS |       |       |            |         |
|*  2 |   HASH JOIN         |                       | 36837 |  2590K|   759   (3)| 00:00:14|
|   3 |    TABLE ACCESS FULL| KKK_MIM_XXXX_RRRR_SSS | 36837 |   359K|    29   (0)| 00:00:01|
|   4 |    TABLE ACCESS FULL| EEEIE_ZZZZZ_MMML_SLS  |   643K|    38M|   724   (3)| 00:00:14|
--------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("TM"."SOURCE_ROWID"="S1".ROWID)

Statistics
----------------------------------------------------------
     128804  recursive calls
      96793  db block gets
      77309  consistent gets
        661  physical reads
   21141944  redo size
   25998756  bytes sent via SQL*Net to client
     472358  bytes received via SQL*Net from client
      42896  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
     643418  rows processed

From above red part we see the SQL esclape time is one minute.
That is too long after considerring the query's base tables data volumn, it should not take that much time.

We also notice that big "recursive calls" and "db block gets" Statistics.
It is very clear that they are caused by the sequence in the SQL.

So let's remove the sequence from the query and check again:
dwxxxx2 > SELECT /*+ USE_HASH(S1 TM)*/    330404
  ,TO_DATE('2012-12-18 06:45:18', 'YYYY-MM-DD HH24:MI:SS')
  ,TO_DATE('2012-12-18 06:45:18', 'YYYY-MM-DD HH24:MI:SS')
  ,TO_DATE('2100-12-31', 'YYYY-MM-DD')    ,S1.MATNR AS MATNR      ,S1.VKORG AS
  VKORG   ,S1.VTWEG AS VTWEG      ,S1.KONDM AS KONDM      ,S1.DWERK AS DWERK
  ,S1.VMSTA AS VMSTA      ,S1.VMSTD AS VMSTD      ,S1.LVORM AS LVORM
  ,S1.DT_TM_XTR AS DT_TM_XTR      ,S1.CDHDR_UDATE AS CDHDR_UDATE  
  ,S1.PROVG AS PROVG FROM    GEPC_TEMP_TABLE.KKK_MIM_XXXX_RRRR_SSS TM
  GEPC_SOURCE.EEEIE_ZZZZZ_MMML_SLS S1 WHERE      TM.SOURCE_ROWID = S1.ROWID;

643418 rows selected.
Elapsed: 00:00:10.23

Execution Plan
----------------------------------------------------------
Plan hash value: 2074598172
-------------------------------------------------------------------------------------------
| Id  | Operation          | Name                  | Rows  | Bytes | Cost (%CPU)| Time    |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                       | 36837 |  2590K|   759   (3)| 00:00:14|
|*  1 |  HASH JOIN         |                       | 36837 |  2590K|   759   (3)| 00:00:14|
|   2 |   TABLE ACCESS FULL| KKK_MIM_XXXX_RRRR_SSS | 36837 |   359K|    29   (0)| 00:00:01|
|   3 |   TABLE ACCESS FULL| EEEIE_ZZZZZ_MMML_SLS  |   643K|    38M|   724   (3)| 00:00:14|
-------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("TM"."SOURCE_ROWID"="S1".ROWID)

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
      44998  consistent gets
          0  physical reads
          0  redo size
   21501260  bytes sent via SQL*Net to client
     472358  bytes received via SQL*Net from client
      42896  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
     643418  rows processed

Without the sequence the job completes in 10 seconds.
In another word, the sequence drag down 600% speed of this SQL.

Let's check the sequence attribute:
dwxxxx2 >select * from dba_sequences where SEQUENCE_NAME='SEQ_MIM_CCCC_PPPP_SSS';
SEQUENCE_OWNER SEQUENCE_NAME         INCREMENT_BY C O CACHE_SIZE LAST_NUMBER
-------------- --------------------- ------------ - - ---------- -----------
GGGC_SSBA      SEQ_MIM_CCCC_PPPP_SSS            1 N N         20   109594869

The CACHE_SIZE is 20.
20 is a default value, it is not suitable in our case.
I increase the value to 500 and re-execute the SQL.

This time:
dwxxxx2 > SELECT /*+ USE_HASH(S1 TM)*/    GGGC_SSBA.SEQ_MIM_CCCC_PPPP_SSS.NEXTVAL        ,330404
  ,TO_DATE('2012-12-18 06:45:18', 'YYYY-MM-DD HH24:MI:SS')
  ,TO_DATE('2012-12-18 06:45:18', 'YYYY-MM-DD HH24:MI:SS')
  ,TO_DATE('2100-12-31', 'YYYY-MM-DD')    ,S1.MATNR AS MATNR      ,S1.VKORG AS
  VKORG   ,S1.VTWEG AS VTWEG      ,S1.KONDM AS KONDM      ,S1.DWERK AS DWERK
  ,S1.VMSTA AS VMSTA      ,S1.VMSTD AS VMSTD      ,S1.LVORM AS LVORM
  ,S1.DT_TM_XTR AS DT_TM_XTR      ,S1.CDHDR_UDATE AS CDHDR_UDATE  
  ,S1.PROVG AS PROVG FROM    GEPC_TEMP_TABLE.KKK_MIM_XXXX_RRRR_SSS TM
  GEPC_SOURCE.EEEIE_ZZZZZ_MMML_SLS S1 WHERE      TM.SOURCE_ROWID = S1.ROWID;

643418 rows selected.
Elapsed: 00:00:29.60

Execution Plan
----------------------------------------------------------
Plan hash value: 3276368232
--------------------------------------------------------------------------------------------
| Id  | Operation           | Name                  | Rows  | Bytes | Cost (%CPU)| Time    |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |                       | 36837 |  2590K|   759   (3)| 00:00:14|
|   1 |  SEQUENCE           | SEQ_MIM_CCCC_PPPP_SSS |       |       |            |         |
|*  2 |   HASH JOIN         |                       | 36837 |  2590K|   759   (3)| 00:00:14|
|   3 |    TABLE ACCESS FULL| KKK_MIM_XXXX_RRRR_SSS | 36837 |   359K|    29   (0)| 00:00:01|
|   4 |    TABLE ACCESS FULL| EEEIE_ZZZZZ_MMML_SLS  |   643K|    38M|   724   (3)| 00:00:14|
--------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("TM"."SOURCE_ROWID"="S1".ROWID)

Statistics
----------------------------------------------------------
       5153  recursive calls
       3870  db block gets
      46287  consistent gets
          0  physical reads
     844460  redo size
   25998756  bytes sent via SQL*Net to client
     472358  bytes received via SQL*Net from client
      42896  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
     643418  rows processed

SQL completes in 30 seconds. Also "db block gets" and "recursive calls" is much smaller then first serneriao--cache is 20.

30 seconds means still the sequence drag down 3 times of the sql's speed.
It is still not reasonable.
The DB have mutiple samiliar queries, I don't have enough free time to look deep into this. I may raise a ticket with oracle support to check this once i get free time.
More......

December 7, 2012

MMON Slave Process Reports ORA-7445 [kglic0] Error, Plus Database Hangs

A DB suddenly hang during a load. User come to me.
When i login the server, i can see the issue DB consuming 100% CPU on the server.

Tried "sqlplus / as sysdba" but also hang.
I gernerate a systemstate dump and notice that the PMON is blocked:

PROCESS 2: PMON
    O/S info: user: oracle, term: UNKNOWN, ospid: 1193 
    OSD pid info: Unix process pid: 1193, image: oracle@alpcisddb484.corporate.ge.com (PMON)
      waiting for 60108b48 Child shared pool level=7 child#=3 
        Location from where latch is held: kgh.h LINE:6387 ID:kghalo: 
        Context saved from call: 0
        state=busy [holder orapid=89] wlstate=free [value=0]
          waiters [orapid (seconds since: put on list, posted, alive check)]:
possible holder pid = 89 ospid=10478

Let's check which session is holding latch 60108b48, and what is the session waiting for:
PROCESS 89: 
    O/S info: user: oracle, term: UNKNOWN, ospid: 10478 
    OSD pid info: Unix process pid: 10478, image: oracle@alpcisddb484.corporate.ge.com
      holding    (efd=8) 60108b48 Child shared pool level=7 child#=3 
    Current Wait Stack:
     0: waiting for 'library cache: mutex X'
        idn=0xfc57b2b7, value=0xd100000000, where=0x4f
        wait_id=57 seq_num=60 snap_id=1
        wait times: snap=45 min 11 sec, exc=45 min 11 sec, total=45 min 11 sec
        wait times: max=infinite, heur=45 min 11 sec
        wait counts: calls=0 os=0
        in_wait=1 iflags=0x5a2

Let’s see which session is holding mutex 0xfc57b2b7 and what is the session waiting for:
PROCESS 42: M000
    SO: 0x19f40dda8, type: 4, owner: 0x19f2de3f8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0x19f2de3f8, name=session, file=ksu.h LINE:12459, pg=0
    (session) sid: 209 ser: 521 trans: (nil), creator: 0x19f2de3f8
    Current Wait Stack:
      Not in wait; last wait ended 45 min 10 sec ago 

OK, till now we found the root blocking is M000.
It is holding mutex:
PROCESS 42: M000
    SO: 0x19f40dda8, type: 4, owner: 0x19f2de3f8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0x19f2de3f8, name=session, file=ksu.h LINE:12459, pg=0
    (session) sid: 209 ser: 521 trans: (nil), creator: 0x19f2de3f8
    Current Wait Stack:
      Not in wait; last wait ended 45 min 10 sec ago 
      KGL-UOL SO Cache(total=240, free=0)
      KGX Atomic Operation Log 0x17d569fc0
       Mutex 0x19eb2b4c8(209, 0) idn 1fc57b2b7 oper EXCL
       Library Cache uid 209 efd 7 whr 49 slp 0
       oper=0 pt1=(nil) pt2=(nil) pt3=(nil)

But above also shows that M000 itself not in a wait since 45 mins back. What is M000 doing and why it doesn't release that mutex all along?
It is un-reasonable.

After review the process state of M000, i notice the process is already dead:
SO: 0x19f2de3f8, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x19f2de3f8, name=process, file=ksu.h LINE:12451, pg=0
  (process) Oracle pid:42, ser:16, calls cur/top: 0x1908c1eb8/0x1906a28f8
            flags : (0x3) DEAD
            flags2: (0x8030),  flags3: (0x0) 
            intr error: 0, call error: 0, sess error: 0, txn error 0
            intr queue: empty
    ksudlp FALSE at location: 0
    Cleanup details:
      Found dead = 44 min 27 sec ago
      Total Cleanup attempts = 10, Cleanup time = 15 min 49 sec, Cleanup timer = 15 min 45 sec
      Last attempt (full) ended 54 sec ago, Length = 2 min 30 sec, Cleanup timer = 2 min 30 sec

OK. Now it is clear to us that M000's death leading to the DB hang.
But why M000 suddenly dead?
After checking alert.log i find blow:
Thu Dec 06 22:00:48 2012
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x1060] [PC:0x90ED147, kglic0()+1075] [flags: 0x0, count: 1]
Errors in file /x484/dump01/oracle/hrxt/diag/rdbms/hrxdt/hrxdt_m000_10618.trc  (incident=6337):
ORA-07445: exception encountered: core dump [kglic0()+1075] [SIGSEGV] [ADDR:0x1060] [PC:0x90ED147] [Address not mapped to object] []

Now all are clear.
After searching on metalink with above symbol, our scenario excatly matches this bug:
MMON Slave Process Reports ORA-7445 [kglic0] Error, Plus Database Hangs [ID 1487720.1]
More......

December 2, 2012

solve ORA-1658 VIA errorstack

Sometimes a issue itself is not complicated, but it is hard due to other reason, such as lack of neccesary information.

For example, one day user come to us and complain that their application job got below error:
F|Nov 29 20:47:34|UNXSERVER|E|27451|GXXS01|conv|ABORTED-SQLCODE -1658 AT 1223 ErrText = |
F|Nov 29 20:47:34|UNXSERVER|E|27451|GXXS01|conv|ENDING DUE TO SERIOUS SQL ERROR|
F|Nov 29 20:47:34|UNXSERVER|E|27462|GXXS02|conv|ABORTED-SQLCODE -1658 AT 1223 ErrText = |
F|Nov 29 20:47:34|UNXSERVER|E|27462|GXXS02|conv|ENDING DUE TO SERIOUS SQL ERROR|


All they can offer us is above error message and a DB name.
Then how should a DBA look into it?

Above error is not a standard DB error message which should begin with: "ORA-".
But from "SQLCODE -1658", we know it should be a DB error, it is not in a standard format due to their application catch the error message and throw it out with its own exception defination.

Let's see what is 1658:
server003[oracle]_db058> oerr ora 1658
01658, 00000, "unable to create INITIAL extent for segment in tablespace %s"// *Cause: Failed to find sufficient contiguous space to allocate INITIAL
// extent for segment being created.
// *Action: Use ALTER TABLESPACE ADD DATAFILE to add additional space to the
// tablespace or retry with a smaller value for INITIAL


OK, seems it is a simple tablespace free size issue.
But after checking in the DB, found there is not error message in alert.log.
The database have 30+ samiliar very tiny tablespaces.
So what tablespace caused the issue:

Tablespace                    Free MB   Total MB        PCT
------------------------- ----------- ---------- ----------
ICDM_20140801                       5         10         50
ICDM_20130901                       5         10         50
ICDM_20131201                       5         10         50
ICDM_20130501                       5         10         50
ICDM_20130701                       5         10         50
ICDM_20140101                       5         10         50
ICDM_20130801                       5         10         50
ICDM_20131001                       5         10         50
ICDM_20130301                       5         10         50
ICDM_20131101                       5         10         50
ICDM_20140201                       5         10         50
ICDM_20140501                       5         10         50
ICDM_20140601                       5         10         50
ICDM_20130401                       5         10         50
ICDM_20140301                       5         10         50
ICDM_20130601                       5         10         50
ICDM_20140401                       5         10         50
ICDM_20140701                       5         10         50
.............

Still user doesn't know anything, they got the error when running a job. But they don't know what is the current SQL and which table got the error on which tablespace.

Then i enabled the errorstack:
alter system set events '1658 trace name errorstack level 1';
Through above command i require to gernerate errorstack for error id ORA-1658 at DB level.

Then i ask user to re-run their job again.
Of course it fails again with error ORA-1658, but with the help of errorstack, this time we get the detail in alert.log:
Errors in file /server/backup/oracle/dump/getsstm2/diag/rdbms/db058/db058/trace/db058_ora_13006.trc:
ORA-01658: unable to create INITIAL extent for segment in tablespace FILE_20130101


OK, now we got the tablespace name: FILE_20130101.
After adding 4gb to FILE_20130101, the issue solved.

More......