July 30, 2012

RMAN senior: duplicate database from tape without target DB, "RMAN-05548: The set of duplicated tablespaces is not self-contained"

Today a user asked us to restore a table from an old backup.
The table is small, only a few thousand rows, but the DB is big: 200 GB. We don't have enough space to restore the whole DB.

We decided to restore only the table and its tablespace, using the "duplicate database ... skip tablespace" clause to skip the irrelevant tablespaces:

server001[oracle]_ORCLPROD> rman target / auxiliary  sys/oracle@temp2 catalog rmanadmin/rman_psswd@rmancatalog

Recovery Manager: Release 11.2.0.2.0 - Production on Sun Jul 29 04:23:34 2012

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

connected to target database: ORCLPROD (DBID=1650044020)
connected to recovery catalog database
connected to auxiliary database: TEMP2 (not mounted)

RMAN> run
{
set until time "to_date('03-JUL-2012 12:00:00','dd-mon-yyyy hh24:mi:ss')";
allocate auxiliary  channel ch1 type 'sbt_tape' parms 'ENV=(NB_ORA_SERVER=netbacksev1_b,NB_ORA_CLIENT=server001_b)';
allocate auxiliary  channel ch2 type 'sbt_tape' parms 'ENV=(NB_ORA_SERVER=netbacksev1_b,NB_ORA_CLIENT=server001_b)';
duplicate target database to temp2 nofilenamecheck skip tablespace USERS,TBS1,TBS2,TBS3,TBS4,TBS5,TBS6,TBS7;
}

executing command: SET until clause
..........
..........
Violation: ORA-39906: Constraint SYS_C0050693 between table GETS_SO.GETS_SO_ATS_PROJECT in tablespace OTD and table GETS_SO.GETS_SO_SWPT_SSO_SHOP_ORD in tablespace SWPT_TABLES.
Violation: ORA-39908: Index GETS_SO.GETS_SO_NRE_BUSINESS_MST_PK in tablespace OTX_INDEXES enforces primary constraints  of table GETS_SO.GETS_SO_NRE_BUSINESS_MST in tablespace OTD.
Violation: ORA-39908: Index GETS_SO.GETS_SO_NRE_CUSTOMER_MST_PK in tablespace OTX_INDEXES enforces primary constraints  of table GETS_SO.GETS_SO_NRE_CUSTOMER_MST in tablespace OTD.
Violation: ORA-39908: Index GETS_SO.GETS_SO_NRE_ESR_DETAIL_PK in tablespace OTX_INDEXES enforces primary constraints  of table GETS_SO.GETS_SO_NRE_ESR_DETAIL in tablespace OTD.
Violation: ORA-39907: Index GETS_SO.GETS_SO_NRE_ESR_DSGN_TSK_IE2 in tablespace OTX_INDEXES points to table GETS_SO.GETS_SO_NRE_ESR_DSGN_TSK in tablespace OTD.
Violation: ORA-39907: Index GETS_SO.GETS_SO_NRE_ESR_RPT_TSK_IE1 in tablespace OTX_INDEXES points to table GETS_SO.GETS_SO_NRE_ESR_RPT_TSK in tablespace OTD.
..........
..........

released channel: ch1
released channel: ch2
released channel: ch3
released channel: ch4
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 07/29/2012 05:13:00
RMAN-05501: aborting duplication of target database
RMAN-05548: The set of duplicated tablespaces is not self-contained

The RMAN-05548 error is an enhancement in 11g. When we skip tablespaces, Oracle checks the dependencies of the remaining tablespaces through the transportable tablespace (TTS) self-containment check. If the check fails, RMAN refuses to restore.

How do we get around that? The only method is a no-target duplicate.
The tablespace check relies on a connection to the target database, so with this method Oracle cannot run the dependency check against the target DB:
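Before attempting the duplicate, you can run the same self-containment check yourself with DBMS_TTS and see the violations up front. A minimal sketch, run on the target DB; the tablespace names here are just the ones appearing in the violation messages above, substitute the set you intend to keep:

```sql
-- Run the TTS self-containment check, including referential constraints
-- (the same class of check behind RMAN-05548); any ORA-39906/39907/39908
-- style violations then show up in the TRANSPORT_SET_VIOLATIONS view.
EXEC DBMS_TTS.TRANSPORT_SET_CHECK('OTD,OTX_INDEXES,SWPT_TABLES', TRUE);
SELECT violations FROM transport_set_violations;
```

If the view comes back empty, the kept set is self-contained and the skip-tablespace duplicate should pass the check.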

no-target duplicate:
rman auxiliary / catalog rmanadmin/rman_psswd@rmancatalog
set dbid 21111;
run
{
set until time "to_date('03-JUL-2012 12:00:00','dd-mon-yyyy hh24:mi:ss')";
allocate auxiliary  channel ch1 type 'sbt_tape' parms 'ENV=(NB_ORA_SERVER=netbacksev1_b,NB_ORA_CLIENT=server001_b)';
allocate auxiliary  channel ch2 type 'sbt_tape' parms 'ENV=(NB_ORA_SERVER=netbacksev1_b,NB_ORA_CLIENT=server001_b)';
duplicate database to temp backup location  'sbt_tape' nofilenamecheck skip tablespace USERS,TBS1,TBS2,TBS3,TBS4,TBS5,TBS6,TBS7;
}


July 21, 2012

From 11.2.0.2.5, Oracle no longer supports ADJUST SCN

Today, after an incomplete recovery and opening a DB, I decided to increase the SCN for the DB.
I set "_allow_error_simulation=true" and ran:

alter session set events 'IMMEDIATE trace name adjust_scn level 4490';
ALTER SESSION SET EVENTS '10015 TRACE NAME ADJUST_SCN LEVEL 4490';
at open/mount state several times, but the SCN didn't change!

The alert.log showed that Oracle didn't even try to bump the SCN.
Very weird.
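A quick way to confirm whether the events had any effect is to compare the database SCN with the datafile-header checkpoint SCNs before and after; these are standard views, shown here only as a sketch:

```sql
-- If the adjust_scn event had worked, both values would jump
-- after the next checkpoint / reopen.
SELECT current_scn FROM v$database;
SELECT file#, checkpoint_change# FROM v$datafile_header;
```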

Then I decided to try "_minimum_giga_scn": I added "_minimum_giga_scn=7000" to init.ora and started the DB:

SYS @ apci > startup
ORA-01078: failure in processing system parameters
LRM-00101: unknown parameter name '_minimum_giga_scn'

Kidding? After removing the parameter, the DB started normally again.
Let's query the hidden parameter:

SYS @ apci > select ksppinm from x$ksppi where ksppinm like '%giga%';
no rows selected

Let's check different version databases:
10.2.0.4, no problem as we expected:
SYS @ 10gDB> select ksppinm from x$ksppi where ksppinm like '%giga%';
KSPPINM
--------------------------------------------------------------------------------
_minimum_giga_scn

11.2.0.2.4, parameter still there:
SYS @ XXDB >select ksppinm from x$ksppi where ksppinm like '%giga%';
KSPPINM
--------------------------------------------------------------------------------
_minimum_giga_scn

11.2.0.2.5, from this version on the parameter disappears:
SQL>  select ksppinm from x$ksppi where ksppinm like '%giga%';
no rows selected

11.2.0.2.6:
SYS @ DB1 > select ksppinm from x$ksppi where ksppinm like '%giga%';
no rows selected

11.2.0.3:
SYS @ DB2 > select ksppinm from x$ksppi where ksppinm like '%giga%';
no rows selected

We can see Oracle removed the ADJUST SCN function in 11.2.0.2.5, so we can't rely on Oracle's own mechanism to adjust the SCN any more.
Is there any alternative?

Yes: we can first raise the checkpoint_change# value in the datafile header block to an appropriate value, and then re-create the controlfile.
The new controlfile will then read our new value as the current SCN.

Below are the steps:

SYS @ test8 > select current_scn from v$database;
CURRENT_SCN
-----------
    4324729

BBED> set offset 484
        OFFSET          484

BBED> dump
 File: /d735/data02/oracle/test8/data/system01.dbf (1)
 Block: 1                Offsets:  484 to  995           Dba:0x00400001
------------------------------------------------------------------------
 3ffe4100 00000000 59d60a2f 01000000 01000000 62010000 1000db8a 02000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

BBED> modify /x dddddddd
Warning: contents of previous BIFILE will be lost. Proceed? (Y/N) Y
 File: /d735/data02/oracle/test8/data/system01.dbf (1)
 Block: 1                Offsets:  484 to  995           Dba:0x00400001
------------------------------------------------------------------------
 dddddddd 00000000 59d60a2f 01000000 01000000 62010000 1000db8a 02000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

BBED> sum apply;
Check value for File 1, Block 1:
current = 0x5036, required = 0x5036
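To see what the edit actually did: the first four bytes at offset 484 are the low-order part of the checkpoint SCN, stored little-endian, so byte-swapping the dumped values and converting from hex gives the before/after SCN base. Plain SQL is enough for the conversion (assuming the usual header layout at this offset):

```sql
-- before: bytes 3f fe 41 00 (little-endian) -> 0x0041FE3F
SELECT TO_NUMBER('41FE3F', 'XXXXXX')     AS old_scn_base FROM dual;   -- 4324927
-- after:  bytes dd dd dd dd              -> 0xDDDDDDDD
SELECT TO_NUMBER('DDDDDDDD', 'XXXXXXXX') AS new_scn_base FROM dual;   -- 3722304989
```

The new base of 3722304989 lines up with the CURRENT_SCN of 3722305178 reported after the resetlogs open below; the small difference is the SCN advancing during open.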

SYS @ test8 > startup nomount;
ORACLE instance started.

Total System Global Area  730714112 bytes
Fixed Size                  2230080 bytes
Variable Size             318769344 bytes
Database Buffers          322961408 bytes
Redo Buffers               86753280 bytes
SYS @ test8 > CREATE CONTROLFILE REUSE DATABASE "TEST8" RESETLOGS  NOARCHIVELOG
  2      MAXLOGFILES 200
  3      MAXLOGMEMBERS 4
 ...........
 ...........
 29    '/d735/data02/oracle/test8/data/user03_new.dbf'
 30  CHARACTER SET UTF8
 31  ;
Control file created.

SYS @ test8 > alter database open resetlogs;
Database altered.

SYS @ test8 > select current_scn from v$database;
CURRENT_SCN
-----------
 3722305178

Done.

July 5, 2012

Instance evicted due to HAIP failure

In a RAC cluster, one server's instances were abruptly evicted, then re-joined the cluster after 10 minutes. Below is from the DB's alert log:

Wed Jun 27 00:34:55 2012
Errors in file /q001/product/ora_base/diag/rdbms/ttpoltq/ttpoltq1/trace/ttpoltq1_ora_21759.trc  (incident=139278):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 169.254.25.210 not found. Check output from ifconfig command
.............
ORA-29740: evicted by instance number 3, group incarnation 60
LMON (ospid: 20361): terminating the instance due to error 29740
Instance terminated by LMON, pid = 20361

From the ORA-27303 line we know there must be something wrong with the interconnect.

Checking showed the interfaces on the 10.0.0.* network were working fine.
The GPnP logfile also showed nothing abnormal.

Then the issue must be caused by HAIP.
Let's check CSSD's logfile:

oracle $ cat ocssd.log|egrep -i 'haip|disk'
2012-06-27 00:31:44.417: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 7910 msecs
2012-06-27 00:31:52.433: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 15920 msecs
2012-06-27 00:32:00.449: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 23940 msecs
2012-06-27 00:32:08.170: [    CSSD][1105717568]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 31660 msecs
2012-06-27 00:32:16.481: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 39970 msecs
2012-06-27 00:32:24.497: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 47980 msecs
2012-06-27 00:32:32.513: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 56000 msecs
2012-06-27 00:32:40.945: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 64430 msecs
2012-06-27 00:32:49.236: [    CSSD][1105717568]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 72720 msecs
2012-06-27 00:32:58.565: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 82040 msecs
2012-06-27 00:33:06.581: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 90060 msecs
2012-06-27 00:33:14.597: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 98070 msecs
2012-06-27 00:33:15.539: [    CSSD][1109621056]clssnmvDiskCheck: (ORCL:ASM_DISK01) No I/O completed after 50% maximum time, 200000 ms, will be considered unusable in 99990 ms
2012-06-27 00:33:22.614: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 106090 msecs
2012-06-27 00:33:30.629: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 114100 msecs
2012-06-27 00:33:34.069: [    CSSD][1105717568]clssgmFenceClient: fencing client (0x19278b0), member 1 in group haip.cluster_interconnect, no share, death fence 1, SAGE fence 0
2012-06-27 00:33:34.069: [    CSSD][1105717568]clssgmUnreferenceMember: global grock haip.cluster_interconnect member 1 refcount is 1
2012-06-27 00:33:34.069: [    CSSD][1093089600]clssgmTermMember: Terminating member 1 (0x18afc30) in grock haip.cluster_interconnect
2012-06-27 00:33:34.069: [    CSSD][1093089600]clssgmUnreferenceMember: global grock haip.cluster_interconnect member 1 refcount is 0
2012-06-27 00:33:34.070: [    CSSD][1111198016]clssgmRemoveMember: grock haip.cluster_interconnect, member number 1 (0x18afc30) node number 1 state 0x0 grock type 2
2012-06-27 00:33:34.070: [    CSSD][1111198016]clssgmGrantLocks: 1-> new master (3/3) group haip.cluster_interconnect
2012-06-27 00:33:34.072: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6c30034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:33:34.072: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6c30034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(109)
2012-06-27 00:33:34.242: [    CSSD][1105717568]clssgmJoinGrock: global grock haip.cluster_interconnect new client 0x2aaab0611a70 with con 0x12622491, requested num 1, flags 0x4000100
2012-06-27 00:33:34.242: [    CSSD][1105717568]clssgmAddGrockMember: adding member to grock haip.cluster_interconnect
2012-06-27 00:33:34.242: [    CSSD][1111198016]clssgmAddMember: Adding fencing for member 1, group haip.cluster_interconnect, death 1, SAGE 0
2012-06-27 00:33:34.242: [    CSSD][1111198016]clssgmAddMember: member (1/0x2aaab051c000) added. pbsz(0) prsz(0) flags 0x0 to grock (0x2aaab00e3190/haip.cluster_interconnect)
2012-06-27 00:33:34.242: [    CSSD][1111198016]clssgmCommonAddMember: global group grock haip.cluster_interconnect member(1/Local) node(1) flags 0x0 0x423b6300
2012-06-27 00:33:34.245: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6c70034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:33:34.245: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6c70034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(110)
2012-06-27 00:33:34.562: [    CSSD][1105717568]clssgmExitGrock: client 1 (0x2aaab0611a70), grock haip.cluster_interconnect, member 1
2012-06-27 00:33:34.562: [    CSSD][1105717568]clssgmUnregisterPrimary: Unregistering member 1 (0x2aaab051c000) in global grock haip.cluster_interconnect
2012-06-27 00:33:34.562: [    CSSD][1105717568]clssgmUnreferenceMember: global grock haip.cluster_interconnect member 1 refcount is 0
2012-06-27 00:33:34.562: [    CSSD][1111198016]clssgmRemoveMember: grock haip.cluster_interconnect, member number 1 (0x2aaab051c000) node number 1 state 0x4 grock type 2
2012-06-27 00:33:34.564: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6c80034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:33:34.565: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6c80034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(111)
2012-06-27 00:33:34.627: [    CSSD][1105717568]clssgmJoinGrock: global grock haip.cluster_interconnect new client 0x2aaab04022b0 with con 0x1262256c, requested num 1, flags 0x4000100
2012-06-27 00:33:34.627: [    CSSD][1105717568]clssgmAddGrockMember: adding member to grock haip.cluster_interconnect
2012-06-27 00:33:34.627: [    CSSD][1111198016]clssgmAddMember: Adding fencing for member 1, group haip.cluster_interconnect, death 1, SAGE 0
2012-06-27 00:33:34.627: [    CSSD][1111198016]clssgmAddMember: member (1/0x2aaab04c0f80) added. pbsz(0) prsz(0) flags 0x0 to grock (0x2aaab00e3190/haip.cluster_interconnect)
2012-06-27 00:33:34.627: [    CSSD][1111198016]clssgmCommonAddMember: global group grock haip.cluster_interconnect member(1/Local) node(1) flags 0x0 0x423b6300
2012-06-27 00:33:34.629: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6cb0034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:33:34.630: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6cb0034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(112)
2012-06-27 00:33:39.647: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 123120 msecs
2012-06-27 00:33:47.663: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 131130 msecs
2012-06-27 00:33:55.680: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 139150 msecs
2012-06-27 00:34:04.115: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 147580 msecs
2012-06-27 00:34:05.658: [    CSSD][1109621056]clssnmvDiskCheck: (ORCL:ASM_DISK01) No I/O completed after 75% maximum time, 200000 ms, will be considered unusable in 49880 ms
2012-06-27 00:34:11.712: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 155180 msecs
2012-06-27 00:34:19.728: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 163190 msecs
2012-06-27 00:34:28.314: [    CSSD][1105717568]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 171770 msecs
2012-06-27 00:34:35.732: [    CSSD][1109621056]clssnmvDiskCheck: (ORCL:ASM_DISK01) No I/O completed after 90% maximum time, 200000 ms, will be considered unusable in 19810 ms
2012-06-27 00:34:36.762: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 180220 msecs
2012-06-27 00:34:43.044: [    CSSD][1111198016]clssgmRemoveMember: grock haip.cluster_interconnect, member number 3 (0x2aaaac27b800) node number 3 state 0x0 grock type 2
2012-06-27 00:34:43.044: [    CSSD][1111198016]clssgmGrantLocks: 3-> new master (2/2) group haip.cluster_interconnect
2012-06-27 00:34:43.046: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x7200034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:34:43.047: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x7200034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(113)
2012-06-27 00:34:43.291: [    CSSD][1111198016]clssgmAddMember: member (3/0x2aaaac5844a0) added. pbsz(0) prsz(0) flags 0x0 to grock (0x2aaab00e3190/haip.cluster_interconnect)
2012-06-27 00:34:43.291: [    CSSD][1111198016]clssgmCommonAddMember: global group grock haip.cluster_interconnect member(3/Remote) node(3) flags 0x0 0xac5844a0
2012-06-27 00:34:43.293: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x7400034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:34:43.293: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x7400034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(114)
2012-06-27 00:34:43.636: [    CSSD][1111198016]clssgmRemoveMember: grock haip.cluster_interconnect, member number 3 (0x2aaaac5844a0) node number 3 state 0x0 grock type 2
2012-06-27 00:34:43.638: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x74f0034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(3)
2012-06-27 00:34:43.639: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x74f0034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(115)
2012-06-27 00:34:43.730: [    CSSD][1111198016]clssgmAddMember: member (3/0x1dba9d0) added. pbsz(0) prsz(0) flags 0x0 to grock (0x2aaab00e3190/haip.cluster_interconnect)
2012-06-27 00:34:43.731: [    CSSD][1111198016]clssgmCommonAddMember: global group grock haip.cluster_interconnect member(3/Remote) node(3) flags 0x0 0x1dba9d0
2012-06-27 00:34:43.733: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x74a0034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:34:43.733: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x74a0034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(116)
2012-06-27 00:34:45.457: [    CSSD][1105717568]clssgmExitGrock: client 1 (0x2aaab04022b0), grock haip.cluster_interconnect, member 1
2012-06-27 00:34:45.457: [    CSSD][1105717568]clssgmUnregisterPrimary: Unregistering member 1 (0x2aaab04c0f80) in global grock haip.cluster_interconnect
2012-06-27 00:34:45.457: [    CSSD][1105717568]clssgmUnreferenceMember: global grock haip.cluster_interconnect member 1 refcount is 0
2012-06-27 00:34:45.457: [    CSSD][1111198016]clssgmRemoveMember: grock haip.cluster_interconnect, member number 1 (0x2aaab04c0f80) node number 1 state 0x4 grock type 2
2012-06-27 00:34:45.459: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x7640034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:34:45.459: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x7640034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(117)
2012-06-27 00:43:28.566: [    CSSD][1105717568]clssgmJoinGrock: global grock haip.cluster_interconnect new client 0x2aaab43103e0 with con 0x126435d1, requested num 1, flags 0x4000100
2012-06-27 00:43:28.566: [    CSSD][1105717568]clssgmAddGrockMember: adding member to grock haip.cluster_interconnect
2012-06-27 00:43:28.566: [    CSSD][1111198016]clssgmAddMember: Adding fencing for member 1, group haip.cluster_interconnect, death 1, SAGE 0
2012-06-27 00:43:28.566: [    CSSD][1111198016]clssgmAddMember: member (1/0x2aaab46fbd20) added. pbsz(0) prsz(0) flags 0x0 to grock (0x2aaab00e3190/haip.cluster_interconnect)
2012-06-27 00:43:28.566: [    CSSD][1111198016]clssgmCommonAddMember: global group grock haip.cluster_interconnect member(1/Local) node(1) flags 0x0 0x423b6300
2012-06-27 00:43:28.568: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x2aa0035) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:43:28.569: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x2aa0035) of grock(haip.cluster_interconnect) received all acks, grock update sequence(118)
2012-06-27 00:43:44.470: [    CSSD][1105717568]clssgmExecuteClientRequest: VOTEDISKQUERY recvd from proc 32 (0x2aaab4269a70)
2012-06-27 00:43:45.958: [    CSSD][1105717568]clssgmExecuteClientRequest: VOTEDISKQUERY recvd from proc 32 (0x2aaab4269a70)
2012-06-27 00:43:48.770: [    CSSD][1105717568]clssgmExecuteClientRequest: VOTEDISKQUERY recvd from proc 34 (0x2aaab05c4340)
2012-06-27 00:43:48.771: [    CSSD][1105717568]clssgmExecuteClientRequest: VOTEDISKQUERY recvd from proc 34 (0x2aaab05c4340)
From the above we can see the cluster started counting down the voting-disk timeout:
2012-06-27 00:31:44.417: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 7910 msecs

The long disk timeout threshold is 200 seconds in 11gR2.
During the countdown, the cluster ran a check on HAIP and concluded HAIP had already died:
2012-06-27 00:33:34.069: [ CSSD][1105717568]clssgmFenceClient: fencing client (0x19278b0), member 1 in group haip.cluster_interconnect, no share, death fence 1, SAGE fence 0

The cluster started to clean up HAIP and tried multiple times to restart it.

Later the voting disk came back online, so the timeout countdown stopped, HAIP was brought back up, the instances were restarted, and everything returned to normal.


July 3, 2012

MMAN in idle state but blocking other sessions waiting on 'SGA: allocation forcing component growth'

We found many blocked sessions in one instance of a three-node RAC DB.
Most sessions were either hung or running slowly; tracing showed the blocker was the MMAN process.

Below is from the trace:

Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (apccfp1.apccfp11)
                   os id: 15557
              process id: 64, oracle@alpcispdb604 (J001)
              session id: 2
        session serial #: 23481
    }
    is waiting for 'latch free' with wait info:
    {
                      p1: 'address'=0x6003f518
                      p2: 'number'=0x189
                      p3: 'tries'=0x0
            time in wait: 22.106728 sec
           timeout after: never
                 wait id: 3995
                blocking: 0 sessions
             current sql: UPDATE T_XXXX HSC SET (HSC.HOSTAPP_STATE_ID, HSC..................
            wait history:
              * time between current wait and wait #1: 0.023292 sec
              1.       event: 'gc current block 2-way'
                 time waited: 0.000410 sec
                     wait id: 3994            p1: ''=0x1
                                              p2: ''=0x1a32e
                                              p3: ''=0x1
              * time between wait #1 and #2: 0.000438 sec
              2.       event: 'library cache pin'
                 time waited: 0.000327 sec
                     wait id: 3993            p1: 'handle address'=0x1b7e8b1f0
                                              p2: 'pin address'=0x13b609ef0
                                              p3: '100*mode+namespace'=0xffffffff00160002
              * time between wait #2 and #3: 0.000158 sec
              3.       event: 'library cache lock'
                 time waited: 0.000374 sec
                     wait id: 3992            p1: 'handle address'=0x1b7e8b1f0
                                              p2: 'lock address'=0x172458968
                                              p3: '100*mode+namespace'=0xffffffff00160002
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (apccfp1.apccfp11)
                   os id: 15454
              process id: 65, oracle@alpcispdb604
              session id: 28
        session serial #: 33079
    }
    which is waiting for 'SGA: allocation forcing component growth' with wait info:
    {
            time in wait: 0.047688 sec (last interval)
            time in wait: 4.538964 sec (total)
      heur. time in wait: 4.539095 sec
           timeout after: never
                 wait id: 44447
                blocking: 17 sessions
             current sql: 
            wait history:
              * time between current wait and wait #1: 0.000000 sec
              1.       event: 'SGA: allocation forcing component growth'
                 time waited: 0.000053 sec
                     wait id: 44535
              * time between wait #1 and #2: 0.000000 sec
              2.       event: 'SGA: allocation forcing component growth'
                 time waited: 0.050851 sec (last interval)
                 time waited: 4.491223 sec (total)
                     wait id: 44447
              * time between wait #2 and #3: 0.000000 sec
              3.       event: 'SGA: allocation forcing component growth'
                 time waited: 0.000122 sec
                     wait id: 44534
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (apccfp1.apccfp11)
                   os id: 25858
              process id: 18, oracle@alpcispdb604 (MMAN)
              session id: 451
        session serial #: 1
    }
    which is waiting for 'rdbms ipc message' with wait info:
    {
                      p1: 'timeout'=0x1f
            time in wait: 0.047679 sec
           timeout after: 0.262321 sec
                 wait id: 3739777
                blocking: 18 sessions
             current sql: 
            wait history:
              * time between current wait and wait #1: 0.000060 sec
              1.       event: 'rdbms ipc message'
                 time waited: 0.050904 sec
                     wait id: 3739776         p1: 'timeout'=0x24
              * time between wait #1 and #2: 0.000067 sec
              2.       event: 'rdbms ipc message'
                 time waited: 0.050946 sec
                     wait id: 3739775         p1: 'timeout'=0x29
              * time between wait #2 and #3: 0.000055 sec
              3.       event: 'rdbms ipc message'
                 time waited: 0.050844 sec
                     wait id: 3739774         p1: 'timeout'=0x2e
    }

Chain 1 Signature: 'rdbms ipc message'<='SGA: allocation forcing component growth'<='latch free'
Chain 1 Signature Hash: 0xd39c75b6
-------------------------------------------------------------------------------

There are multiple other chains, but every chain's root is MMAN.
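The same chains can also be pulled live from V$WAIT_CHAINS (available since 11g) instead of reading the hanganalyze trace; a minimal query:

```sql
-- Walk the blocking chains; the root blocker of each chain is the row
-- whose blocker_sid is NULL (here that would be the MMAN session).
SELECT chain_id, instance, sid, sess_serial#,
       blocker_sid, wait_event_text, in_wait_secs, num_waiters
FROM   v$wait_chains
ORDER  BY chain_id, in_wait_secs DESC;
```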

From the above we can see some sessions were short of memory and were waiting for MMAN to grow the SGA component for them.

These blocked sessions were themselves holding resources, so in turn they blocked even more sessions.

And during the whole time, MMAN stayed idle and did nothing.
We tried to flush the shared pool, but that hung too, also blocked by MMAN. Since MMAN is a critical background process we couldn't kill it, so in the end we had to bounce the instance.


July 1, 2012

Diagnosing ORA-04030: koh-kghu sessi and a large amount of free memory

Recently one of our prod DBs repeatedly got ORA-04030 errors:

ORA-04030: out of process memory when trying allocate 64024 bytes (qmxeventInitCt,qmxeventInitctx:buffer)
ORA-04030: out of process memory when trying allocate 824504 bytes (pga heap,kco buffer)
ORA-04030: out of process memory when trying allocate 16328 bytes (PLSQL Opt Pool,pdz7M04_Create)

ORA-04030 (pga heap,kco buffer) may indicate the kco buffer in the PGA heap is consuming too much space and causing the 04030.

But in our case each ORA-04030 reports different parameters.
That means some other allocation type took most of the PGA and left no room for the other types' PGA requests.


The trace file shows the ORA-04030 comes from a daily expdp job.
Below is the current SQL:

*** 2012-06-30 22:25:39.959
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=b1n2qbfhwsyhb) -----
CREATE TABLE "ET$036AA8ED0002"
   (    "ABSTRACT_ID",
        "FEED_ID",
        "SEC_STATE",
        "IS_ACTIVE",
        "CREATED_BY",
        "CREATED_DATE",
        "UPDATED_BY",
        "UPDATED_DATE",
        "XML_RSS_ABSTRACT",
        "XML_ABSTRACT_SOURCE"
   ) ORGANIZATION EXTERNAL
    ( TYPE ORACLE_DATAPUMP DEFAULT DIRECTORY "DATA_RAD" ACCESS PARAMETERS ( DEBUG = (0 , 0) DATAPUMP INTERNAL TABLE "POCDBA02"."T_NR_ABSTRACT"  JOB ( "SYSTEM
","SYS_EXPORT_FULL_20",1) WORKERID 1 PARALLEL 1 VERSION '11.2.0.2' ENCRYPTPASSWORDISNULL  COMPRESSION DISABLED  ENCRYPTION DISABLED ) LOCATION ('bogus.dat')
)  PARALLEL 1 REJECT LIMIT UNLIMITED
    AS SELECT /*+ PARALLEL(KU$,1) */ "ABSTRACT_ID", "FEED_ID", "SEC_STATE", "IS_ACTIVE", "CREATED_BY", "CREATED_DATE", "UPDATED_BY", "UPDATED_DATE", SYS_XMLT
_2_SC("XML_RSS_ABSTRACT"), SYS_XMLT_2_SC("XML_ABSTRACT_SOURCE")
    FROM RELATIONAL("PXXX02"."T_NR_AXXXCT" NOT XMLTYPE) KU$

Below is the PGA map from the trace file; notice the "CURRENT PROC" line:
-------------------------
Private memory usage per Oracle process
Top 10 processes:
-------------------------
(percentage is of 4529 MB total allocated memory)
91% pid 62: 4104 MB used of 4107 MB allocated  <= CURRENT PROC
 2% pid 86: 16 MB used of 83 MB allocated (64 MB freeable)
 1% pid 37: 38 MB used of 41 MB allocated (1088 KB freeable)
 1% pid 38: 28 MB used of 29 MB allocated
 0% pid 15: 13 MB used of 13 MB allocated (64 KB freeable)
 0% pid 13: 13 MB used of 13 MB allocated (64 KB freeable)
 0% pid 14: 13 MB used of 13 MB allocated (64 KB freeable)
 0% pid 59: 6482 KB used of 13 MB allocated (6272 KB freeable)
 0% pid 20: 12 MB used of 13 MB allocated
 0% pid 19: 8853 KB used of 13 MB allocated
From the current proc we see the expdp process consumed 4000+ MB of PGA and hit the maximum limit at OS level. By default on 64-bit Linux, the maximum PGA for a single process is 4 GB: 65536 memory maps of 64 KB each. The map count is controlled by the OS parameter:
vm.max_map_count=65536
and the 64 KB map size by the DB parameter:
_realfree_heap_pagesize_hint 65536
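To watch a live process's PGA by category without taking a dump, V$PROCESS_MEMORY can be queried; a sketch, where the SPID value is hypothetical (it would be the OS pid of the expdp worker):

```sql
-- Per-category PGA usage for one process; categories include
-- 'SQL', 'PL/SQL', 'Freeable' and 'Other'.
SELECT p.spid, pm.category,
       ROUND(pm.allocated/1048576) alloc_mb,
       ROUND(pm.used/1048576)      used_mb
FROM   v$process p
JOIN   v$process_memory pm
       ON pm.pid = p.pid AND pm.serial# = p.serial#
WHERE  p.spid = '11903'   -- hypothetical OS pid
ORDER  BY pm.allocated DESC;
```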

The easiest solution is to raise the limit so the process can allocate more memory, but the better way is to identify why this process consumed so much memory in the first place.
Let's look into the trace file:
=======================================
TOP 10 MEMORY USES FOR THIS PROCESS
---------------------------------------
*** 2012-06-30 22:25:39.707
74% 3041 MB, 66260 chunks: "free memory               "
         session heap    ds=0x2b4fb3e26d30  dsprt=0xb7eec80
25% 1010 MB, 64838 chunks: "qmxtgCreateTempLob        "
         koh-kghu sessi  ds=0x2b4fb6638bd0  dsprt=0x2b4fb3e26d30
 0% 8859 KB, 2185 chunks: "kxsFrame4kPage            "
         session heap    ds=0x2b4fb3e26d30  dsprt=0xb7eec80
 0% 6019 KB, 378 chunks: "pmuccst: adt/record       "  PL/SQL
         koh-kghu sessi  ds=0x2b4fb492b110  dsprt=0x2b4fb3e26d30
 0% 5562 KB,  46 chunks: "XVM Storage               "
         XVM subheap of  ds=0x2b4fb498b278  dsprt=0x2b4fb3d89b38
 0% 3300 KB, 207 chunks: "pmucalm coll              "  PL/SQL
         koh-kghu sessi  ds=0x2b4fb48facf8  dsprt=0x2b4fb3e26d30
 0% 2312 KB, 2627 chunks: "permanent memory          "  SQL
         kxs-heap-p      ds=0x2b4fb5d31898  dsprt=0x2b4fb3e26d30
 0% 2281 KB, 185 chunks: "pl/sql vc2                "  PL/SQL
         koh-kghu sessi  ds=0x2b4fb66a8e30  dsprt=0x2b4fb3e26d30
 0% 1319 KB,  85 chunks: "static frame of inst      "  PL/SQL
         koh-kghu sessi  ds=0x2b4fb5bbfa70  dsprt=0x2b4fb3e26d30
 0% 1288 KB, 160 chunks: "kxsFrame8kPage            "
         session heap    ds=0x2b4fb3e26d30  dsprt=0xb7eec80
We can see the largest consumer belongs to the session heap (UGA), and its type is "free memory", which is very weird.

I restarted the expdp job, and while it was running the PGA usage kept increasing; when the PGA allocation reached 2 GB, I took a PGA heap dump:
news2 >oradebug setospid 11903 
Statement processed. 
news2 >oradebug dump heapdump 536870913 
Statement processed.
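The dump level 536870913 is not a random number: it is level 1 (PGA heap) with the 0x20000000 flag bit set, which in common oradebug usage also dumps the five largest subheaps (worth verifying against your own version's behavior). A quick decomposition:

```python
level = 536870913
# Decompose the oradebug heapdump level into its flag bits.
print(hex(level))                # 0x20000001
assert level == (1 << 29) | 1    # 0x20000000 (top-5 subheaps) + 1 (PGA heap)
```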

Let's check our trace file:
server803.corporate.ge.com[oracle]_news2> cat db2_dw00_11903.trc.2|grep -i "total heap size"
HEAP DUMP heap name="session heap"  desc=0x2b88a22b6d30
Total heap size    =1768309088
HEAP DUMP heap name="koh-kghu sessi"  desc=0x2b88a487cd60
Total heap size    =429703184
HEAP DUMP heap name="koh-kghu sessi"  desc=0x2b88a2db6110
Total heap size    = 10794656
HEAP DUMP heap name="koh-kghu sessi"  desc=0x2b88a239d2b8
Total heap size    =  4149616
HEAP DUMP heap name="koh-kghu sessi"  desc=0x2b88a22de820
Total heap size    =  1272824
HEAP DUMP heap name="koh-kghu sessi"  desc=0x2b88a2da2c48
Total heap size    =  1178344
HEAP DUMP heap name="pga heap"  desc=0xb7e9240
Total heap size    = 11246712
HEAP DUMP heap name="XVMsubheap"  desc=0x2b88a2214b38
Total heap size    =  5709704
HEAP DUMP heap name="XVM subheap of"  desc=0x2b88a2e67d00
Total heap size    =  5708240
HEAP DUMP heap name="qmxdContextEnc"  desc=0x2b88a2e67c00
Total heap size    =     4216
HEAP DUMP heap name="KSFQ heap"  desc=0x2b88a21d7f30
Total heap size    =  1074104
HEAP DUMP heap name="koh-kghu call "  desc=0x2b88a2740010
Total heap size    =   940200
HEAP DUMP heap name="KFK_IO_SUBHEAP"  desc=0x2b88a20684a8
Total heap size    =   476688
HEAP DUMP heap name="PLS PGA hp"  desc=0x2b88a2210bc0
Total heap size    =   135280
HEAP DUMP heap name="top call heap"  desc=0xb7eea60
Total heap size    =   720680
HEAP DUMP heap name="callheap"  desc=0x2b88a2311108
Total heap size    =   285856
HEAP DUMP heap name="qbcqtcHTHeap"  desc=0x2b88a2354b00
Total heap size    =     6400
HEAP DUMP heap name="qcpi4.c.kgght"  desc=0x2b88a2354738
Total heap size    =     3624
HEAP DUMP heap name="kggec.c.kggfa"  desc=0x2b88a2354478
Total heap size    =     1528
HEAP DUMP heap name="PLS CGA hp"  desc=0x2b88a2b83ee8
Total heap size    =     1992
HEAP DUMP heap name="callheap"  desc=0xb7edc18
Total heap size    =   131064
HEAP DUMP heap name="callheap"  desc=0xb7edb60
Total heap size    =    38000
HEAP DUMP heap name="kti call subhe"  desc=0x2b88a220d7e8
Total heap size    =    36968
HEAP DUMP heap name="top uga heap"  desc=0xb7eec80
Total heap size    =1769554808
HEAP DUMP heap name="session heap"  desc=0x2b88a22b6d30
Total heap size    =1768309088
HEAP DUMP heap name="koh-kghu sessi"  desc=0x2b88a487cd60
Total heap size    =429703184
HEAP DUMP heap name="koh-kghu sessi"  desc=0x2b88a2db6110
Total heap size    = 10794656
HEAP DUMP heap name="koh-kghu sessi"  desc=0x2b88a239d2b8
Total heap size    =  4149616
HEAP DUMP heap name="koh-kghu sessi"  desc=0x2b88a22de820
Total heap size    =  1272824
HEAP DUMP heap name="koh-kghu sessi"  desc=0x2b88a2da2c48
Total heap size    =  1178344
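With this many heaps in the grep output, a tiny throwaway script helps rank them by size. A sketch, with a few of the trace lines above pasted in as sample input:

```python
import re

# Sample lines in the same format as the "Total heap size" grep output above.
trace = """\
HEAP DUMP heap name="top uga heap"  desc=0xb7eec80
Total heap size    =1769554808
HEAP DUMP heap name="session heap"  desc=0x2b88a22b6d30
Total heap size    =1768309088
HEAP DUMP heap name="pga heap"  desc=0xb7e9240
Total heap size    = 11246712
"""

# Pair each heap name with the size on the following line.
heaps = re.findall(r'heap name="([^"]+)".*\nTotal heap size\s*=\s*(\d+)', trace)
for name, size in sorted(heaps, key=lambda h: -int(h[1])):
    print(f"{name:16s} {int(size)/2**20:10.1f} MB")
```

The two ~1.7 GB heaps jump straight to the top, which is what points the investigation at the UGA.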

Notice the outstandingly large entry: "top uga heap" is big, and it is driven by "session heap".
Now let's examine the session heap:
HEAP DUMP heap name="session heap"  desc=0x2b88a22b6d30
...........
Total heap size    =1768309088
Total free space   =1296798672
...........
FIVE LARGEST SUB HEAPS for heap name="session heap"   desc=0x2b88a22b6d30
  Subheap ds=0x2b88a487cd60  heap name=  koh-kghu sessi  size=       430756008
   owner=(nil)  latch=(nil)
  Subheap ds=0x2b88a2db6110  heap name=  koh-kghu sessi  size=        10821000
   owner=(nil)  latch=(nil)
  Subheap ds=0x2b88a239d2b8  heap name=  koh-kghu sessi  size=         4159400
   owner=(nil)  latch=(nil)
  Subheap ds=0x2b88a22de820  heap name=  koh-kghu sessi  size=         1276088
   owner=(nil)  latch=(nil)
  Subheap ds=0x2b88a2da2c48  heap name=  koh-kghu sessi  size=         1181128
   owner=(nil)  latch=(nil)

We can see the session heap is 1.7 GB. About 0.4 GB is consumed by "koh-kghu sessi", and the other 1.2 GB is free!
Even though that space is free, it belongs only to the session heap and cannot be reused by allocations in other heaps.
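Just to make the proportions concrete, the free/used split from the session heap dump above works out to:

```python
total_heap = 1_768_309_088   # session heap "Total heap size"
total_free = 1_296_798_672   # session heap "Total free space"
koh_kghu   =   430_756_008   # largest subheap, "koh-kghu sessi"

print(f"free:     {total_free/2**30:.2f} GB ({100*total_free/total_heap:.0f}% of the heap)")
print(f"koh-kghu: {koh_kghu/2**30:.2f} GB")
```

So roughly three quarters of a 1.7 GB heap is "free" yet still pinned inside this one session's heap.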

Why did "koh-kghu sessi" consume so much space, and why is so much free space being held in the session heap when it shouldn't be?

After searching on Metalink, it looks like we may be hitting two bugs here: one for the free memory, one for "koh-kghu sessi":
1.ORA-04030 encountered by a process which allocate large free memory [ID 1439928.1]
2.Bug 10194031 - EXPDP of OR XML leaks memory / runs slow [ID 10194031.8]
