How much can a SEQUENCE slow down a SQL statement?
10%? 20%? The working case below shows a sequence making one SQL six times slower.
While troubleshooting a job's performance issue recently, I noticed this SQL in the job's block.
It is quite a simple SQL, a two-table join; the only special thing is that it uses a sequence, GGGC_SSBA.SEQ_MIM_CCCC_PPPP_SSS.NEXTVAL:
dwxxxx2 > SELECT /*+ USE_HASH(S1 TM)*/ GGGC_SSBA.SEQ_MIM_CCCC_PPPP_SSS.NEXTVAL ,330404
,TO_DATE('2012-12-18 06:45:18', 'YYYY-MM-DD HH24:MI:SS')
,TO_DATE('2012-12-18 06:45:18', 'YYYY-MM-DD HH24:MI:SS')
,TO_DATE('2100-12-31', 'YYYY-MM-DD') ,S1.MATNR AS MATNR ,S1.VKORG AS
VKORG ,S1.VTWEG AS VTWEG ,S1.KONDM AS KONDM ,S1.DWERK AS DWERK
,S1.VMSTA AS VMSTA ,S1.VMSTD AS VMSTD ,S1.LVORM AS LVORM
,S1.DT_TM_XTR AS DT_TM_XTR ,S1.CDHDR_UDATE AS CDHDR_UDATE
,S1.PROVG AS PROVG FROM GEPC_TEMP_TABLE.KKK_MIM_XXXX_RRRR_SSS TM
,GEPC_SOURCE.EEEIE_ZZZZZ_MMML_SLS S1 WHERE TM.SOURCE_ROWID = S1.ROWID;
643418 rows selected.
Elapsed: 00:01:00.21
Execution Plan
----------------------------------------------------------
Plan hash value: 3276368232
--------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 36837 | 2590K| 759 (3)| 00:00:14|
| 1 | SEQUENCE | SEQ_MIM_CCCC_PPPP_SSS | | | | |
|* 2 | HASH JOIN | | 36837 | 2590K| 759 (3)| 00:00:14|
| 3 | TABLE ACCESS FULL| KKK_MIM_XXXX_RRRR_SSS | 36837 | 359K| 29 (0)| 00:00:01|
| 4 | TABLE ACCESS FULL| EEEIE_ZZZZZ_MMML_SLS | 643K| 38M| 724 (3)| 00:00:14|
--------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("TM"."SOURCE_ROWID"="S1".ROWID)
Statistics
----------------------------------------------------------
128804 recursive calls
96793 db block gets
77309 consistent gets
661 physical reads
21141944 redo size
25998756 bytes sent via SQL*Net to client
472358 bytes received via SQL*Net from client
42896 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
643418 rows processed
From the timing above we can see the SQL's elapsed time is one minute.
That is too long considering the base tables' data volume; the query should not take that much time.
We also notice the large "recursive calls" and "db block gets" figures in the statistics.
It is very clear that they are caused by the sequence in the SQL.
So let's remove the sequence from the query and check again:
dwxxxx2 > SELECT /*+ USE_HASH(S1 TM)*/ 330404
,TO_DATE('2012-12-18 06:45:18', 'YYYY-MM-DD HH24:MI:SS')
,TO_DATE('2012-12-18 06:45:18', 'YYYY-MM-DD HH24:MI:SS')
,TO_DATE('2100-12-31', 'YYYY-MM-DD') ,S1.MATNR AS MATNR ,S1.VKORG AS
VKORG ,S1.VTWEG AS VTWEG ,S1.KONDM AS KONDM ,S1.DWERK AS DWERK
,S1.VMSTA AS VMSTA ,S1.VMSTD AS VMSTD ,S1.LVORM AS LVORM
,S1.DT_TM_XTR AS DT_TM_XTR ,S1.CDHDR_UDATE AS CDHDR_UDATE
,S1.PROVG AS PROVG FROM GEPC_TEMP_TABLE.KKK_MIM_XXXX_RRRR_SSS TM
,GEPC_SOURCE.EEEIE_ZZZZZ_MMML_SLS S1 WHERE TM.SOURCE_ROWID = S1.ROWID;
643418 rows selected.
Elapsed: 00:00:10.23
Execution Plan
----------------------------------------------------------
Plan hash value: 2074598172
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 36837 | 2590K| 759 (3)| 00:00:14|
|* 1 | HASH JOIN | | 36837 | 2590K| 759 (3)| 00:00:14|
| 2 | TABLE ACCESS FULL| KKK_MIM_XXXX_RRRR_SSS | 36837 | 359K| 29 (0)| 00:00:01|
| 3 | TABLE ACCESS FULL| EEEIE_ZZZZZ_MMML_SLS | 643K| 38M| 724 (3)| 00:00:14|
-------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("TM"."SOURCE_ROWID"="S1".ROWID)
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
44998 consistent gets
0 physical reads
0 redo size
21501260 bytes sent via SQL*Net to client
472358 bytes received via SQL*Net from client
42896 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
643418 rows processed
Without the sequence, the query completes in 10 seconds.
In other words, the sequence makes this SQL six times slower.
Let's check the sequence attribute:
dwxxxx2 >select * from dba_sequences where SEQUENCE_NAME='SEQ_MIM_CCCC_PPPP_SSS';
SEQUENCE_OWNER SEQUENCE_NAME INCREMENT_BY C O CACHE_SIZE LAST_NUMBER
-------------- --------------------- ------------ - - ---------- -----------
GGGC_SSBA SEQ_MIM_CCCC_PPPP_SSS 1 N N 20 109594869
The CACHE_SIZE is 20.
20 is the default value, and it is not suitable in our case.
I increased the cache value to 500 and re-executed the SQL.
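For reference, a minimal sketch of the change (the sequence name is taken from the listing above; ALTER SEQUENCE ... CACHE is standard syntax):
ALTER SEQUENCE GGGC_SSBA.SEQ_MIM_CCCC_PPPP_SSS CACHE 500;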
This time:
dwxxxx2 > SELECT /*+ USE_HASH(S1 TM)*/ GGGC_SSBA.SEQ_MIM_CCCC_PPPP_SSS.NEXTVAL ,330404
,TO_DATE('2012-12-18 06:45:18', 'YYYY-MM-DD HH24:MI:SS')
,TO_DATE('2012-12-18 06:45:18', 'YYYY-MM-DD HH24:MI:SS')
,TO_DATE('2100-12-31', 'YYYY-MM-DD') ,S1.MATNR AS MATNR ,S1.VKORG AS
VKORG ,S1.VTWEG AS VTWEG ,S1.KONDM AS KONDM ,S1.DWERK AS DWERK
,S1.VMSTA AS VMSTA ,S1.VMSTD AS VMSTD ,S1.LVORM AS LVORM
,S1.DT_TM_XTR AS DT_TM_XTR ,S1.CDHDR_UDATE AS CDHDR_UDATE
,S1.PROVG AS PROVG FROM GEPC_TEMP_TABLE.KKK_MIM_XXXX_RRRR_SSS TM
,GEPC_SOURCE.EEEIE_ZZZZZ_MMML_SLS S1 WHERE TM.SOURCE_ROWID = S1.ROWID;
643418 rows selected.
Elapsed: 00:00:29.60
Execution Plan
----------------------------------------------------------
Plan hash value: 3276368232
--------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 36837 | 2590K| 759 (3)| 00:00:14|
| 1 | SEQUENCE | SEQ_MIM_CCCC_PPPP_SSS | | | | |
|* 2 | HASH JOIN | | 36837 | 2590K| 759 (3)| 00:00:14|
| 3 | TABLE ACCESS FULL| KKK_MIM_XXXX_RRRR_SSS | 36837 | 359K| 29 (0)| 00:00:01|
| 4 | TABLE ACCESS FULL| EEEIE_ZZZZZ_MMML_SLS | 643K| 38M| 724 (3)| 00:00:14|
--------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("TM"."SOURCE_ROWID"="S1".ROWID)
Statistics
----------------------------------------------------------
5153 recursive calls
3870 db block gets
46287 consistent gets
0 physical reads
844460 redo size
25998756 bytes sent via SQL*Net to client
472358 bytes received via SQL*Net from client
42896 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
643418 rows processed
Now the SQL completes in 30 seconds, and the "db block gets" and "recursive calls" statistics are much smaller than in the first scenario (cache size 20).
But 30 seconds still means the sequence makes the SQL three times slower.
That is still not reasonable.
The DB has multiple similar queries, and I don't have enough free time to dig deeper into this right now; I may raise a ticket with Oracle Support once I do.
December 20, 2012
Performance Shooting: How much can SEQUENCE slow down a SQL
December 7, 2012
MMON Slave Process Reports ORA-7445 [kglic0] Error, Plus Database Hangs
A DB suddenly hung during a load, and the user came to me.
When I logged in to the server, I could see the affected DB consuming 100% of the CPU.
Trying "sqlplus / as sysdba" also just hung.
I generated a systemstate dump and noticed that PMON was blocked:
PROCESS 2: PMON
O/S info: user: oracle, term: UNKNOWN, ospid: 1193
OSD pid info: Unix process pid: 1193, image: oracle@alpcisddb484.corporate.ge.com (PMON)
waiting for 60108b48 Child shared pool level=7 child#=3
Location from where latch is held: kgh.h LINE:6387 ID:kghalo:
Context saved from call: 0
state=busy [holder orapid=89] wlstate=free [value=0]
waiters [orapid (seconds since: put on list, posted, alive check)]:
possible holder pid = 89 ospid=10478
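As an aside, when an instance is still responsive, latch holders can also be found straight from SQL; a minimal sketch using the standard v$latchholder view (here the instance was hung, so the systemstate dump was the only option):
select pid, sid, laddr, name from v$latchholder;   -- look for the latch address seen in the dump (60108b48)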
Let's check which process is holding latch 60108b48, and what that process is waiting for:
PROCESS 89:
O/S info: user: oracle, term: UNKNOWN, ospid: 10478
OSD pid info: Unix process pid: 10478, image: oracle@alpcisddb484.corporate.ge.com
holding (efd=8) 60108b48 Child shared pool level=7 child#=3
Current Wait Stack:
0: waiting for 'library cache: mutex X'
idn=0xfc57b2b7, value=0xd100000000, where=0x4f
wait_id=57 seq_num=60 snap_id=1
wait times: snap=45 min 11 sec, exc=45 min 11 sec, total=45 min 11 sec
wait times: max=infinite, heur=45 min 11 sec
wait counts: calls=0 os=0
in_wait=1 iflags=0x5a2
Let's see which session is holding mutex 0xfc57b2b7 and what that session is waiting for:
PROCESS 42: M000
SO: 0x19f40dda8, type: 4, owner: 0x19f2de3f8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x19f2de3f8, name=session, file=ksu.h LINE:12459, pg=0
(session) sid: 209 ser: 521 trans: (nil), creator: 0x19f2de3f8
Current Wait Stack:
Not in wait; last wait ended 45 min 10 sec ago
OK, so far we have found that the root blocker is M000.
It is holding the mutex:
PROCESS 42: M000
SO: 0x19f40dda8, type: 4, owner: 0x19f2de3f8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x19f2de3f8, name=session, file=ksu.h LINE:12459, pg=0
(session) sid: 209 ser: 521 trans: (nil), creator: 0x19f2de3f8
Current Wait Stack:
Not in wait; last wait ended 45 min 10 sec ago
KGL-UOL SO Cache(total=240, free=0)
KGX Atomic Operation Log 0x17d569fc0
Mutex 0x19eb2b4c8(209, 0) idn 1fc57b2b7 oper EXCL
Library Cache uid 209 efd 7 whr 49 slp 0
oper=0 pt1=(nil) pt2=(nil) pt3=(nil)
But the above also shows that M000 itself has not been in a wait for the last 45 minutes. What is M000 doing, and why has it not released that mutex all this time?
That is unreasonable.
After reviewing the process state of M000, I noticed the process is already dead:
SO: 0x19f2de3f8, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x19f2de3f8, name=process, file=ksu.h LINE:12451, pg=0
(process) Oracle pid:42, ser:16, calls cur/top: 0x1908c1eb8/0x1906a28f8
flags : (0x3) DEAD
flags2: (0x8030), flags3: (0x0)
intr error: 0, call error: 0, sess error: 0, txn error 0
intr queue: empty
ksudlp FALSE at location: 0
Cleanup details:
Found dead = 44 min 27 sec ago
Total Cleanup attempts = 10, Cleanup time = 15 min 49 sec, Cleanup timer = 15 min 45 sec
Last attempt (full) ended 54 sec ago, Length = 2 min 30 sec, Cleanup timer = 2 min 30 sec
OK. Now it is clear that M000's death led to the DB hang.
But why did M000 suddenly die?
After checking the alert.log I found the following:
Thu Dec 06 22:00:48 2012
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x1060] [PC:0x90ED147, kglic0()+1075] [flags: 0x0, count: 1]
Errors in file /x484/dump01/oracle/hrxt/diag/rdbms/hrxdt/hrxdt_m000_10618.trc (incident=6337):
ORA-07445: exception encountered: core dump [kglic0()+1075] [SIGSEGV] [ADDR:0x1060] [PC:0x90ED147] [Address not mapped to object] []
Now everything is clear.
After searching MOS (Metalink) for the symbol above, our scenario exactly matches this bug:
MMON Slave Process Reports ORA-7445 [kglic0] Error, Plus Database Hangs [ID 1487720.1]
December 2, 2012
Solve ORA-1658 via errorstack
Sometimes an issue itself is not complicated, but it is hard to work on for other reasons, such as a lack of necessary information.
For example, one day a user came to us and complained that their application job got the error below:
F|Nov 29 20:47:34|UNXSERVER|E|27451|GXXS01|conv|ABORTED-SQLCODE -1658 AT 1223 ErrText = |
F|Nov 29 20:47:34|UNXSERVER|E|27451|GXXS01|conv|ENDING DUE TO SERIOUS SQL ERROR|
F|Nov 29 20:47:34|UNXSERVER|E|27462|GXXS02|conv|ABORTED-SQLCODE -1658 AT 1223 ErrText = |
F|Nov 29 20:47:34|UNXSERVER|E|27462|GXXS02|conv|ENDING DUE TO SERIOUS SQL ERROR|
All they could offer us was the above error message and a DB name.
So how should a DBA look into it?
The error above is not in the standard database format, which would begin with "ORA-".
But from "SQLCODE -1658" we know it is a database error; it is just not in the standard format because the application catches the error and re-throws it with its own exception definition.
Let's see what error 1658 is:
server003[oracle]_db058> oerr ora 1658
01658, 00000, "unable to create INITIAL extent for segment in tablespace %s"
// *Cause:  Failed to find sufficient contiguous space to allocate INITIAL
//          extent for segment being created.
// *Action: Use ALTER TABLESPACE ADD DATAFILE to add additional space to the
// tablespace or retry with a smaller value for INITIAL
OK, it seems to be a simple tablespace free-space issue.
But after checking in the DB, I found no error message in the alert.log.
The database has 30+ similar, very tiny tablespaces.
So which tablespace caused the issue?
Tablespace                  Free MB   Total MB        PCT
------------------------- ----------- ---------- ----------
ICDM_20140801 5 10 50
ICDM_20130901 5 10 50
ICDM_20131201 5 10 50
ICDM_20130501 5 10 50
ICDM_20130701 5 10 50
ICDM_20140101 5 10 50
ICDM_20130801 5 10 50
ICDM_20131001 5 10 50
ICDM_20130301 5 10 50
ICDM_20131101 5 10 50
ICDM_20140201 5 10 50
ICDM_20140501 5 10 50
ICDM_20140601 5 10 50
ICDM_20130401 5 10 50
ICDM_20140301 5 10 50
ICDM_20130601 5 10 50
ICDM_20140401 5 10 50
ICDM_20140701 5 10 50
.............
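The listing above came from a simple free-space report; a minimal sketch of such a query against the standard dba_data_files and dba_free_space views (column formatting assumed):
select d.tablespace_name,
       round(nvl(f.free_mb, 0))                     as free_mb,
       round(d.total_mb)                            as total_mb,
       round(100 * nvl(f.free_mb, 0) / d.total_mb)  as pct
  from (select tablespace_name, sum(bytes)/1024/1024 total_mb
          from dba_data_files group by tablespace_name) d,
       (select tablespace_name, sum(bytes)/1024/1024 free_mb
          from dba_free_space group by tablespace_name) f
 where d.tablespace_name = f.tablespace_name(+)
 order by pct;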
The user still didn't know anything beyond the fact that the error occurred while running a job; they could not tell us the current SQL, nor which table hit the error on which tablespace.
So I enabled an errorstack dump:
alter system set events '1658 trace name errorstack level 1';
The command above asks the instance to generate an errorstack trace at the database level whenever ORA-1658 is raised.
Then I asked the user to re-run their job.
Of course it failed again with ORA-1658, but with the help of the errorstack, this time we got the detail in the alert.log:
Errors in file /server/backup/oracle/dump/getsstm2/diag/rdbms/db058/db058/trace/db058_ora_13006.trc:
ORA-01658: unable to create INITIAL extent for segment in tablespace FILE_20130101
OK, now we have the tablespace name: FILE_20130101.
After adding 4 GB to FILE_20130101, the issue was solved.
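One last housekeeping step worth remembering: once the culprit is found, the event should be switched off again; a sketch mirroring the syntax used above:
alter system set events '1658 trace name errorstack off';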
November 27, 2012
LOCK diagnose: NOWAIT lock requests could hang in RAC (blocked in compatible mode)
November 5, 2012
DEBUG RMAN: "RMAN-20052: invalid datafile create SCN" due to PLUGIN_SCN out of SYNC
The error below persisted for one database during the catalog resync operation:
RMAN> resync catalog;
starting full resync of recovery catalog
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of resync command on default channel at 11/05/2012 07:10:03
RMAN-20052: invalid datafile create SCN
Target DB version:
11.2.0.2.6
Catalog DB version:
11.2.0.2.6
Catalog schema version:
catalog > Select * from rcver ;
VERSION
------------
11.02.00.03
After comparing the target and the catalog, I found the create SCNs of all datafiles and tempfiles are in sync in both databases:
CATALOG DB:
catalog > select file#,CREATION_CHANGE# from RC_DATAFILE where DB_NAME='ATPPSS' order by 1;
FILE# CREATION_CHANGE#
---------- ----------------
1 7
2 2178
3 969548
4 18151
5 1002153
6 1053956
7 1054333
8 1501572
9 1502015
10 1506616
11 1515561
12 1516921
13 5361004
14 5779799
14 rows selected.
TARGET DB:
target > select file#,CREATION_CHANGE# from v$datafile;
FILE# CREATION_CHANGE#
---------- ----------------
1 7
2 2178
3 969548
4 18151
5 1002153
6 1053956
7 1054333
8 1501572
9 1502015
10 1506616
11 1515561
12 1516921
13 5361004
14 5779799
14 rows selected.
After debugging the RMAN session, we can see the error is reported while DBMS_RCVCAT.CHECKDATAFILE runs for file# 13:
DBGRPC: krmxrpc - channel default kpurpc2 err=0 db=rcvcat proc=RMANADMIN.DBMS_RCVCAT.CHECKDATAFILE excl: 0
DBGRCVCAT: checkDatafile - size changed for file#: 12 from 131072 to 153600
DBGRCVCAT: incResynActions - Not debugging
DBGRCVCAT: setCloneName: file#=12, creation_fname=1516921, plugin_change#=0, old_clone_fname=, new_clone_fname=NONE
DBGRESYNC: channel default: Resyncing datafile +DATA/atppss/datafile/example.3079.785280031 [23:38:41.110] (resync)
DBGRESYNC: channel default: Calling checkDataFile for fileno 13 size 131072 [23:38:41.110] (resync)
DBGRPC: krmxrpc - channel default kpurpc2 err=20052 db=rcvcat proc=RMANADMIN.DBMS_RCVCAT.CHECKDATAFILE excl: 97
DBGRPC: krmxrpc - channel default kpurpc2 err=0 db=rcvcat proc=RMANADMIN.DBMS_RCVCAT.CANCELCKPT excl: 0
DBGRCVCAT: cancelCkpt - rollback, released all locks
DBGRPC: krmxrpc - channel default kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.CFILEUSECURRENT excl: 0
Below is the logic in DBMS_RCVCAT.CHECKDATAFILE that raises error 20052:
PROCEDURE CHECKDATAFILE()
IF (plugged_readonly = 'NO' AND create_scn > this_ckp_scn) THEN
raise_application_error(-20052, 'Invalid datafile create SCN');
ELSIF (plugged_readonly = 'YES' AND plugin_scn > this_ckp_scn) THEN
raise_application_error(-20055, 'Invalid datafile plugin SCN');
END IF;
...........
IF (create_scn = dfRec.create_scn AND
plugin_scn = dfRec.plugin_scn) THEN
...........
ELSIF ((case when plugged_readonly = 'NO' then
create_scn else plugin_scn end) >
(case when dfRec.plugged_readonly = 'NO' then
dfRec.create_scn else dfRec.plugin_scn end)) THEN
...........
ELSE -- (create_scn < dfRec.create_scn)
-- The client passed us a create SCN for this datafile that is
-- less than the SCN in the rcvcat. I.e., the target database
-- controlfile now contains a previous incarnation of this datafile.
-- This can happen only if the user has some old controlfile, or
-- has done a resetlogs and not told us about it.
IF (plugged_readonly = 'NO') THEN
raise_application_error(-20052, 'Invalid datafile create SCN');
ELSE
raise_application_error(-20055, 'Invalid datafile plugin SCN');
END IF;
END IF;
All the judgment is based on three arguments: plugin_scn, plugged_readonly and create_scn. Let's check these three values in both the target DB and the catalog DB:
CATALOG DB:
catalog > select CREATE_SCN,PLUGGED_READONLY,PLUGIN_SCN from df where DBINC_KEY='3199420' and FILE# =13;
CREATE_SCN PLU PLUGIN_SCN
---------- --- ----------
5361004 NO 992188
TARGET DB:
target > select CREATION_CHANGE#,PLUGGED_READONLY,PLUGIN_CHANGE# from v$datafile where file#=13;
CREATION_CHANGE# PLU PLUGIN_CHANGE#
---------------- --- --------------
5361004 NO 0
Now we can clearly see that the PLUGIN_SCN/PLUGIN_CHANGE# mismatch is the reason we got the error.
Pluggable databases are a new feature coming in 12c and are not available in any released version yet.
So currently the PLUGIN_CHANGE# of every datafile of every database should be 0.
The PLUGIN_SCN of 992188 in the catalog DB is obviously not correct. Unregistering and re-registering the DB in the catalog may or may not solve this issue.
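If we do go that route, the standard RMAN commands look roughly like the sketch below (connect to both target and catalog; note that unregistering removes all of this database's backup metadata from the catalog, so treat it with care):
RMAN> unregister database;
RMAN> register database;
RMAN> resync catalog;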
October 29, 2012
LMON (ospid: 11863): terminating the instance due to error 481
On a two-node RAC, reconfiguration suddenly began and very soon the first instance terminated with the error:
LMON (ospid: 11863): terminating the instance due to error 481
So we have two questions:
1. Why did the reconfiguration suddenly begin?
2. Why did the first instance terminate during the reconfiguration?
1. Why did the reconfiguration suddenly begin?
From the first node's LMON trace file:
* DRM RCFG called (swin 0)
*** 2012-10-22 01:23:17.277
CGS recovery timeout = 85 sec
Begin DRM(27911) (swin 0)
* drm quiesce
*** 2012-10-22 01:25:28.038
* Request pseudo reconfig due to drm quiesce hang
*** 2012-10-22 01:25:28.051
kjxgmrcfg: Reconfiguration started, type 6
CGS/IMR TIMEOUTS:
CSS recovery timeout = 31 sec (Total CSS waittime = 65)
IMR Reconfig timeout = 75 sec
CGS rcfg timeout = 85 sec
kjxgmcs: Setting state to 94 0.
...............
From the above we can see it was a DRM (dynamic resource remastering) quiesce timeout, and hence the reconfiguration began.
We can also see that the reconfiguration type is 6.
We usually see types 1, 2 and 3 as reconfiguration reasons; type 6 is not a common one.
Oracle Support also confirmed that type 6 means DRM.
OK, so the reconfiguration is explained: a DRM quiesce timeout.
Next let's see why the first instance terminated during the reconfiguration step.
Since this reconfiguration was triggered neither by an interconnect failure nor by a controlfile heartbeat problem, instance eviction was not absolutely required.
From the LMON trace below we can see why LMON terminated the instance:
*** 2012-10-22 01:25:29.650
All grantable enqueues granted
2012-10-22 01:25:29.650271 : * Begin lmon rcfg step KJGA_RCFG_PCMREPLAY
*** 2012-10-22 01:30:56.000
2012-10-22 01:30:56.000338 : * kjfclmsync: waited 327 secs for lmses to finish parallel rcfg work, terminating instance
kjzduptcctx: Notifying DIAG for crash event
...............
From the kjfclmsync line it is clear that during the reconfiguration the LMS processes failed to finish their parallel work in time (perhaps hung, blocked on some resource, or hitting a bug), so LMON terminated the instance.
The next step would be to check the LMS trace file. But because max_dump_file_size is set for this DB, the LMS trace file had already reached its maximum size and had not been updated for a long time, so there is no information in it for us to refer to.
The issue can be avoided by setting the parameter below to disable DRM:
_gc_read_mostly_locking=FALSE
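A minimal sketch of setting it (this is an underscore parameter, so normally only under Oracle Support's guidance; it takes effect after an instance restart):
ALTER SYSTEM SET "_gc_read_mostly_locking" = FALSE SCOPE=SPFILE SID='*';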
October 11, 2012
A few troubles when doing PSU patching on AIX RAC
While helping a friend patch an AIX RAC cluster, I ran into a few problems; I am recording them here.
1. opatch failed, reporting:
Required amount of space(23428.225MB) is not available.
UtilSession failed:
Prerequisite check "CheckSystemSpace" failed.
Log file location: /oracle/app/11.2.0/grid/cfgtoollogs/opatch/opatch2012-10-07_18-26-07PM_1.log
OPatch failed with error code 73
For this issue we can pass the parameter below to opatch to skip the space check:
OPatch.SKIP_VERIFY_SPACE=true
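A sketch of how such a property can be passed on the opatch command line (the patch location and any other options are placeholders; only the property name comes from the workaround above):
$ $ORACLE_HOME/OPatch/opatch apply OPatch.SKIP_VERIFY_SPACE=true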
2. opatch failed and reported:
Patching component oracle.rdbms, 11.2.0.3.0...
Copy failed from '/oracle/11.2.0.3.3PSU/13919095/custom/server/13919095/files/lib/libclsra11.so' to '/oracle/app/11.2.0/grid/lib/libclsra11.so'...
Please verify all applications associated with the Oracle Home '/oracle/app/11.2.0/grid' are shut down. If this is AIX, please perform solution documented in Note 739963.1 on https://myoraclesupport.oracle.com.
Do you want to retry copying the file? [y|n]
opatch reports that a file is in use, while the fuser command shows no process holding the file.
This is a known issue on AIX, and the fix is to run the command below as root:
/usr/sbin/slibclean
4. opatch failed and reported:
.storage is locked by another opatch session.
This happens because another opatch session had just terminated (due to the third issue) and left its lock file behind.
The solution is to remove the *lock* file under $ORACLE_HOME/.storage.
5. Finally the patch succeeded, but starting the grid services on the second node failed with the error below:
2012-10-07 13:23:18.069: [ AGFW][5912] {0:0:31} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /oracle/grid/11.2.0.3/bin/oraagent Category: -2 Operation: setpcred Loc: spawnproc26 OS error: 2 Other : child couldn't setpcred
On MOS there is only one bug report with the same error message and the same scenario as ours: CRS fails to start up after patching.
No workaround is given.
September 22, 2012
cluster on first node failed to startup after reboot due to GPNP hang on third node
On a three-node RAC, during a PSU activity, the cluster stack on the first node failed to start up after patching.
Below is the error in alertnode1.log:
2012-09-22 03:01:39.894
[ohasd(28563)]CRS-2112:The OLR service started on node cintrnpdb001.
2012-09-22 03:01:39.917
[ohasd(28563)]CRS-1301:Oracle High Availability Service started on node cintrnpdb001.
2012-09-22 03:01:39.918
[ohasd(28563)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2012-09-22 03:01:53.765
[gpnpd(28771)]CRS-2328:GPNPD started on node cintrnpdb001.
2012-09-22 03:01:56.228
[cssd(28831)]CRS-1713:CSSD daemon is started in clustered mode
2012-09-22 03:11:55.298
[/prod/product/oracle/11.2.0.2/grid/bin/cssdagent(28819)]CRS-5818:Aborted command 'start for resource: ora.cssd 1 1' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:0:2} in /prod/product/oracle/11.2.0.2/grid/log/cintrnpdb001/agent/ohasd/oracssdagent_root/oracssdagent_root.log.
2012-09-22 03:11:55.299
[cssd(28831)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /prod/product/oracle/11.2.0.2/grid/log/cintrnpdb001/cssd/ocssd.log
2012-09-22 03:11:55.299
[cssd(28831)]CRS-1603:CSSD on node cintrnpdb001 shutdown by user.
The whole time, ocssd.log kept throwing the error below:
2012-09-22 03:00:00.546: [ GPNP][2692574496]clsgpnpm_newWiredMsg: [at clsgpnpm.c:741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http://www.grid-pnp.org/2005/12/gpnp-errors#"]
2012-09-22 03:00:02.558: [ GPNP][2692574496]clsgpnpm_newWiredMsg: [at clsgpnpm.c:741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http://www.grid-pnp.org/2005/12/gpnp-errors#"]
2012-09-22 03:00:04.568: [ GPNP][2692574496]clsgpnpm_newWiredMsg: [at clsgpnpm.c:741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http://www.grid-pnp.org/2005/12/gpnp-errors#"]
The information above shows that CSSD failed to get the GPnP profile.
Let's check gpnpd.log to see what is wrong:
2012-09-22 02:46:51.594: [ GPNP][1088584000]clsgpnp_profileCallUrlInt: [at clsgpnp.c:2104] put-profile call to url "tcp://cintrnpdb003:33612" disco "mdns:service:gpnp._tcp.local.://cintrnpdb003:33612/agent=gpnpd,cname=crs_transp,host=cintrnpdb003,pid=20890/gpnpd h:cintrnpdb003 c:crs_transp" [f=0 claimed- host:cintrnpdb001 cname:crs_transp seq:6 auth:CN=GPnP_peer]
2012-09-22 02:47:05.424: [ OCRMSG][1085512000]GIPC error [29] msg [gipcretConnectionRefused]
2012-09-22 02:47:26.446: [ OCRMSG][1085512000]GIPC error [29] msg [gipcretConnectionRefused]
From the above we can see that when the first node's gpnpd started up, it sent a call to the third node's gpnpd but got no response.
So we can suspect that gpnpd on the third node has a problem.
After I killed gpnpd.bin on the third node and let it restart, the cluster on the first node started up successfully right away.
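A minimal sketch of that kill-and-respawn step on the third node (assuming a standard 11.2 grid install, where ohasd restarts the gpnpd daemon automatically after it dies):
$ ps -ef | grep gpnpd.bin | grep -v grep     # note the PID
$ kill -9 <pid>                              # ohasd respawns gpnpd.bin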
September 12, 2012
CSSD terminated from clssnmvDiskPingMonitorThread without disk timeout countdown
The clusterware on one node suddenly terminated.
Below is the message in the clusterware alert.log:
2012-09-11 11:41:30.328
[ctssd(22428)]CRS-2409:The clock on host node8 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchroniza
tion Service is running in observer mode.
2012-09-11 12:30:03.122
[cssd(21061)]CRS-1606:The number of voting files available, 0, is less than the minimum number of voting files required, 1, resulting in CSSD termination to
ensure data integrity; details at (:CSSNM00018:) in /prod/grid/11.2.0/grid/log/node8/cssd/ocssd.log
2012-09-11 12:30:03.123
[cssd(21061)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /prod/grid/11.2.0/grid/log/node8/cssd/ocssd.log
2012-09-11 12:30:03.233 [cssd(21061)]CRS-1652:Starting clean up of CRSD resources.
Let's check ocssd.log:
2012-09-11 12:30:02.543: [ CSSD][1113450816]clssnmSendingThread: sending status msg to all nodes
2012-09-11 12:30:02.543: [ CSSD][1113450816]clssnmSendingThread: sent 5 status msgs to all nodes
2012-09-11 12:30:03.122: [ CSSD][1082489152](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 1 configured voting disks available, need 1
2012-09-11 12:30:03.123: [ CSSD][1082489152]###################################
2012-09-11 12:30:03.123: [ CSSD][1082489152]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread
2012-09-11 12:30:03.123: [ CSSD][1082489152]###################################
2012-09-11 12:30:03.123: [ CSSD][1082489152](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
So it looks like an I/O problem with the voting disk.
But after checking, we found no abnormal or error messages at either the OS level or the storage side.
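For completeness, the voting disk configuration and its state can also be checked from the cluster side; a sketch using the standard crsctl command:
$ crsctl query css votedisk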
Then I reviewed ocssd.log again and, to my surprise, found no long-disk-timeout countdown in the log.
We know that if ocssd fails to reach the voting disk, it starts a 200-second countdown; only if the voting disk is still unavailable after those 200 seconds does ocssd terminate itself.
The countdown information should be like:
clssscMonitorThreads clssnmvDiskPingThread not scheduled for 16020 msecs
But in our case there was no such countdown; ocssd just terminated from clssnmvDiskPingMonitorThread all of a sudden.
That points to a bug rather than a real voting disk I/O issue. I will raise an SR with Oracle Support for further checking.
September 5, 2012
RAC: "CRS-0184: Cannot communicate with the CRS daemon“ due to EVMD
One node of a RAC cluster got into trouble: its CRSD kept terminating and restarting, and any attempt to query CRSD returned "CRS-0184: Cannot communicate with the CRS daemon".
The error below repeats in alertnode1.log:
2012-08-28 19:32:32.995
[ohasd(26038)]CRS-2765:Resource 'ora.crsd' has failed on server 'racnode001'.
2012-08-28 19:32:38.319
[crsd(13117)]CRS-1012:The OCR service started on node racnode001.
2012-08-28 19:42:54.134
[crsd(13117)]CRS-0810:Cluster Ready Service aborted due to failure to communicate with Event Management Service with error [1]. Details at (:CRSD00120:) in /prod/grid/11.2.0.2/grid/log/racnode001/crsd/crsd.log.
2012-08-28 19:42:54.804
[ohasd(26038)]CRS-2765:Resource 'ora.crsd' has failed on server 'racnode001'.
2012-08-28 19:42:59.175
[crsd(10695)]CRS-1012:The OCR service started on node cihcispdb001.
2012-08-28 19:53:15.226
[crsd(10695)]CRS-0810:Cluster Ready Service aborted due to failure to communicate with Event Management Service with error [1]. Details at (:CRSD00120:) in /prod/grid/11.2.0.2/grid/log/racnode001/crsd/crsd.log.
2012-08-28 19:53:15.575
[ohasd(26038)]CRS-2765:Resource 'ora.crsd' has failed on server 'racnode001'.
From the above log we can see CRSD keeps failing and restarting.
Let's check crsd.log for the details:
2012-08-29 18:07:25.462: [ default][3503137264] CRS Daemon Starting
2012-08-29 18:07:25.464: [ default][1106962752] Policy Engine is not initialized yet!
2012-08-29 18:07:25.465: [ default][3503137264] ENV Logging level for Module: AGENT 1
2012-08-29 18:07:25.465: [ default][3503137264] ENV Logging level for Module: AGFW 0
.............
.............
2012-08-29 18:15:40.648: [ COMMCRS][3503137264]Authorization failed, network error
2012-08-29 18:15:55.584: [ CRSMAIN][1106962752] Policy Engine is not initialized yet!
2012-08-29 18:16:11.408: [ COMMCRS][3503137264]Authorization failed, network error
2012-08-29 18:16:25.586: [ CRSMAIN][1106962752] Policy Engine is not initialized yet!
2012-08-29 18:16:41.668: [ COMMCRS][3503137264]Authorization failed, network error
2012-08-29 18:16:55.597: [ CRSMAIN][1106962752] Policy Engine is not initialized yet!
2012-08-29 18:17:12.929: [ COMMCRS][3503137264]Authorization failed, network error
2012-08-29 18:17:25.609: [ CRSMAIN][1106962752] Policy Engine is not initialized yet!
2012-08-29 18:17:43.190: [ COMMCRS][3503137264]Authorization failed, network error
2012-08-29 18:17:43.191: [ CRSMAIN][3503137264] Created alert : (:CRSD00120:) : Could not init EVM, error: 1
2012-08-29 18:17:43.191: [ CRSD][3503137264][PANIC] CRSD exiting: Could not init EVMMgr.1
2012-08-29 18:17:43.191: [ CRSD][3503137264] Done.
2012-08-29 18:17:44.287: [ default][3316335088] First attempt: init CSS context succeeded.
2012-08-29 18:17:44.290: [ clsdmt][1078745408]PID for the Process [29342], connkey 1
2012-08-29 18:17:44.291: [ clsdmt][1078745408]Creating PID [29342] file for home /prod/grid/11.2.0.2/grid host racnode001 bin crs to /prod/grid/11.2.0.2/grid/crs/init/
2012-08-29 18:17:44.291: [ clsdmt][1078745408]Writing PID [29342] to the file [/prod/grid/11.2.0.2/grid/crs/init/racnode001.pid]
2012-08-29 18:17:45.182: [ default][1078745408] Policy Engine is not initialized yet!
We can see that when CRSD starts up, it fails at the Event Manager initialization step. Let's check what is wrong with the Event Manager.
Below is from evmd.log:
2012-08-29 16:12:18.636: [ COMMCRS][54275088]clsc_send: waiting for space to write on icon (0x9d6fdc0). 0/88 sent
2012-08-29 16:12:26.660: [ COMMCRS][54275088]clsc_send: waiting for space to write on icon (0x9d6fdc0). 0/88 sent
2012-08-29 16:12:34.676: [ COMMCRS][54275088]clsc_send: waiting for space to write on icon (0x9d6fdc0). 0/88 sent
2012-08-29 16:12:42.711: [ COMMCRS][54275088]clsc_send: waiting for space to write on icon (0x9d6fdc0). 0/88 sent
EVMD keeps reporting the error above. It means that when it tries to write the icon under "/prod/grid/11.2.0.2/grid/auth/evm", it fails due to lack of space.
But the mount point has enough free space, so the reason EVMD keeps throwing the error is probably one of:
1. a bug, or
2. at some point in the past the mount point really did run out of space and EVMD got stuck in that state, and even though space was released later, EVMD never recovered on its own.
The solution was simple: I reloaded EVMD, and CRSD also became fine automatically:
2012-08-29 20:42:44.276
[crsd(6075)]CRS-1012:The OCR service started on node racnode001.
2012-08-29 20:52:58.684
[crsd(6075)]CRS-0810:Cluster Ready Service aborted due to failure to communicate with Event Management Service with error [1]. Details at (:CRSD00120:) in /prod/grid/11.2.0.2/grid/log/racnode001/crsd/crsd.log.
2012-08-29 20:52:58.850
[ohasd(26038)]CRS-2765:Resource 'ora.crsd' has failed on server 'racnode001'.
2012-08-29 20:53:03.131
[crsd(4699)]CRS-1012:The OCR service started on node racnode001.
2012-08-29 21:03:17.894
[crsd(4699)]CRS-0810:Cluster Ready Service aborted due to failure to communicate with Event Management Service with error [1]. Details at (:CRSD00120:) in /prod/grid/11.2.0.2/grid/log/racnode001/crsd/crsd.log.
2012-08-29 21:03:18.609
[ohasd(26038)]CRS-2765:Resource 'ora.crsd' has failed on server 'racnode001'.
2012-08-29 21:03:22.962
[crsd(28965)]CRS-1012:The OCR service started on node racnode001.
2012-08-29 21:11:48.965
[evmd(21764)]CRS-1401:EVMD started on node racnode001.
2012-08-29 21:11:52.774
[crsd(28965)]CRS-1201:CRSD started on node racnode001.
Then CRSD became normal:
racnode001 [+ASM1] /prod/grid/11.2.0.2/grid/log/racnode001/evmd > crsctl check crs
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
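For reference, one way to bounce just the EVM daemon on 11.2 is through its init resource (run as root from the grid home; killing the evmd.bin process and letting ohasd respawn it achieves the same effect):
# crsctl stop res ora.evmd -init
# crsctl start res ora.evmd -init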
September 2, 2012
Blocker=Waiter: Self Dead Lock caused within one session
We all know about deadlocks: they happen when multiple sessions compete for the same resources.
But in some particular cases, a deadlock can also be caused within a single session; below is an example:
SQL> create index obj_ind4$ on obj$(name);
create index obj_ind4$ on obj$(name)
*
ERROR at line 1:
ORA-00604: error occurred at recursive SQL level 1
ORA-00060: deadlock detected while waiting for resource
There is currently no lock on obj$, but when we try to create an index on it, it fails with a deadlock.
Below is from the trace file:
*** 2012-07-02 09:13:45.155
*** SERVICE NAME:(SYS$USERS) 2012-07-02 09:13:45.154
*** SESSION ID:(135.46) 2012-07-02 09:13:45.154
DEADLOCK DETECTED
[Transaction Deadlock]
Current SQL statement for this session:
update obj$ set obj#=:6,type#=:7,ctime=:8,mtime=:9,stime=:10,status=:11,dataobj#=:13,flags=:14,oid$=:15,spare1=:16, spare2=:17 where owner#=:1 and name=:2 and namespace =:3 and(remoteowner=:4 or remoteowner is null and :4 is null)and(linkname=:5 or linkname is null and :5 is null)and(subname=:12 or subname is null and :12 is null)
The following deadlock is not an ORACLE error. It is a deadlock due to user error in the design of an application or from issuing incorrect ad-hoc SQL. The following information may aid in determining the deadlock:
Deadlock graph:
                       ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name          process session holds waits  process session holds waits
TM-00000012-00000000        21     135     S             21     135          SX
session 135: DID 0001-0015-0000001D     session 135: DID 0001-0015-0000001D
Rows waited on:
  Session 135: obj - rowid = 00000025 - AAAAAlAABAAAADhAAA
  (dictionary objn - 37, file - 1, block - 225, slot - 0)
Information on the OTHER waiting sessions:
End of information on OTHER waiting sessions.
===================================================
From the above we can clearly see that the blocker (process 21, session 135) is also the waiter, which means the deadlock is caused by the session itself.
From the current SQL the reason is obvious:
to create an object, Oracle has to maintain the object's row in the obj$ dictionary table through recursive SQL. That recursive DML requires a mode-3 (row exclusive, SX) TM lock on obj$.
But since we are creating the index on obj$ itself, the session already holds a share-mode (S) TM lock on the table, which does not allow DML on obj$.
That is where the deadlock comes from.
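As a side note, the TM resource in the deadlock graph can be tied back to obj$: TM-00000012 is hexadecimal 12, i.e. object id 18, which is OBJ$. A sketch of watching those TM locks from another session while the CREATE INDEX runs (v$lock is a standard view; 18 is assumed to be the obj# of OBJ$, as in a stock dictionary):
select sid, type, id1, lmode, request
  from v$lock
 where type = 'TM'
   and id1 = 18;   -- 18 = 0x12, the object id of OBJ$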
August 7, 2012
SYSTEMSTATE DUMP: Shooting DB Hang, sqlplus Hang, self deadlock latch, PMON deadlock latch
A DB was hanging. Trying to log in with "sqlplus / as sysdba" also failed; it just hung there.
I used the "-prelim" option to log in and take a systemstate dump.
Below is the ass.awk output for the systemstate dump:
System State 1
~~~~~~~~~~~~~~~~
1:
2:  0: waiting for 'pmon timer'   <---- PMON is fine at this moment
3:  0: waiting for 'rdbms ipc message'
4:  0: waiting for 'VKTM Logical Idle Wait'
5:  0: waiting for 'rdbms ipc message'
6:  0: waiting for 'DIAG idle wait'
7:  0: waiting for 'rdbms ipc message'
8:  0: waiting for 'DIAG idle wait'
9:  0: waiting for 'rdbms ipc message'
10: 0: waiting for 'rdbms ipc message'
11: 0: waiting for 'rdbms ipc message'
12: 0: waiting for 'rdbms ipc message'
13: 0: waiting for 'rdbms ipc message'
14: 0: waiting for 'rdbms ipc message'
15: 0: waiting for 'rdbms ipc message'
16: 0: waiting for 'rdbms ipc message'
17: 0: waiting for 'rdbms ipc message'
18: 0: waiting for 'rdbms ipc message'
19: 0: waiting for 'rdbms ipc message'
20: 0: waiting for 'rdbms ipc message'
21: 0: waiting for 'rdbms ipc message'
22: 0: waiting for 'rdbms ipc message'
23: 0: waiting for 'rdbms ipc message'
24: 0: waiting for 'rdbms ipc message'
25: 0: waiting for 'rdbms ipc message'
26: 0: waiting for 'rdbms ipc message'
27: 0: waiting for 'rdbms ipc message'
28: 0: waiting for 'rdbms ipc message'
29: 0: waiting for 'rdbms ipc message'
30: 0: waiting for 'rdbms ipc message'
31: 0: waiting for 'rdbms ipc message'
32: 0: waiting for 'smon timer'
33: 0: waiting for 'rdbms ipc message'
34: 0: waiting for 'latch free' [Latch 60009fc8]
35: 0: waiting for 'rdbms ipc message'
36: [Latch 60009fc8]
37: 0: waiting for 'enq: PV - syncstart'[Enqueue PV-00000000-00000000]
38: [Latch 60009fc8]
39: 0: waiting for 'SQL*Net message from client'
40: 0: waiting for 'VKRM Idle'
41: 0: waiting for 'rdbms ipc message'
42: 0: waiting for 'rdbms ipc message'
43: 0: waiting for 'rdbms ipc message'
44: 0: waiting for 'Streams AQ: qmn coordinator idle wait'
45: 0: waiting for 'enq: PR - contention'[Enqueue PR-00000000-00000000]
46: [Latch 60009fc8]
47: 0: waiting for 'Streams AQ: qmn slave idle wait'
48: [Latch 60009fc8]
49: 0: waiting for 'SQL*Net message from client'
50: [Latch 60009fc8]
51: 9: waited for 'Streams AQ: waiting for time management or cleanup tasks'
52: 0: waiting for 'SQL*Net message from client'
Blockers
~~~~~~~~
Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
 o A process id of '???' implies that the holder was not found in the
   systemstate.
                    Resource                   Holder State
            Latch 60009fc8                        ??? Blocker
Enqueue PV-00000000-00000000    34: 34: is waiting for Latch 60009fc8
Enqueue PR-00000000-00000000    34: 34: is waiting for Latch 60009fc8
Object Names
~~~~~~~~~~~~
Latch 60009fc8       process allocation
Enqueue PV-00000000-00000000
Enqueue PR-00000000-00000000
Obviously the root blocker is latch 60009fc8. But unluckily the systemstate dump does not show which process is holding that latch:
alpcispdb552[oracle]_pgbs> cat pgbs_diag_2254.trc|grep 60009fc8
      waiting for 60009fc8 process allocation level=7
        address=0x60009fc8, number=0x9, tries=0x0
      p1: 'address'=0x60009fc8
      waiting for 60009fc8 process allocation level=7
      waiting for 60009fc8 process allocation level=7
      waiting for 60009fc8 process allocation level=7
      waiting for 60009fc8 process allocation level=7
      waiting for 60009fc8 process allocation level=7
Then let's pick any one of the blocked processes to check:
PROCESS 34: MMON
  ----------------------------------------
  SO: 0x1972e95a8, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x1972e95a8, name=process, file=ksu.h LINE:12451, pg=0
  (process) Oracle pid:34, ser:1, calls cur/top: 0x189db28e8/0x19f52b748
            flags : (0x2) SYSTEM
            flags2: (0x20),  flags3: (0x0)
            intr error: 0, call error: 0, sess error: 0, txn error 0
            intr queue: empty
  ksudlp FALSE at location: 0
  (post info) last post received: 0 0 68
              last post received-location: kso2.h LINE:467 ID:ksoreq_reply
              last process to post me: 19f2ef058 165 0
              last post sent: 0 0 26
              last post sent-location: ksa2.h LINE:282 ID:ksasnd
              last process posted by me: 1982dec38 1 6
  (latch info) wait_event=0 bits=0
    Location from where call was made: ksu.h LINE:13713 ID:ksu_unreserve:
      waiting for 60009fc8 process allocation level=7
      Location from where latch is held: ksu.h LINE:13677 ID:ksuapc: proc addr
      Context saved from call: 6965578584
      state=busy [holder orapid=0] wlstate=free [value=0]
        gotten 57043 times wait, failed first 54 sleeps 0
        gotten 26253 times nowait, failed: 1
      possible holder pid = 0 ospid=26755
  Process Group: DEFAULT, pseudo proc: 0x19734a4b8
  O/S info: user: oracle, term: UNKNOWN, ospid: 2339
  OSD pid info: Unix process pid: 2339, image: oracle@server552(MMON)
The blocked process above is MMON. Note that it is waiting for latch 60009fc8, and the latch's possible holder is ospid 26755.
Let's check ospid 26755:
PROCESS 50:
  ----------------------------------------
  SO: 0x19f2ef058, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x19f2ef058, name=process, file=ksu.h LINE:12451, pg=0
  (process) Oracle pid:50, ser:165, calls cur/top: (nil)/(nil)
            flags : (0x0) -
            flags2: (0x8000),  flags3: (0x0)
            intr error: 0, call error: 0, sess error: 0, txn error 0
            intr queue: empty
  ksudlp FALSE at location: 0
  (post info) last post received: 0 0 0
              last post received-location: No post
              last process to post me: none
              last post sent: 0 0 0
              last post sent-location: No post
              last process posted by me: none
  (latch info) wait_event=0 bits=0
    Location from where call was made: ksu.h LINE:13721 ID:ksudlp:
    Context saved from call: 6965620824
      waiting for 60009fc8 process allocation level=7
      Location from where latch is held: ksu.h LINE:13677 ID:ksuapc: proc addr
      Context saved from call: 6965578584
      state=busy [holder orapid=0] wlstate=free [value=0]
        gotten 57043 times wait, failed first 54 sleeps 0
        gotten 26253 times nowait, failed: 1
      possible holder pid = 0 ospid=26755
  O/S info: user: oracle, term: UNKNOWN, ospid: 26755
  OSD pid info: Unix process pid: 26755, image: oracle@alpcispdb552)
We can see that process 26755 is blocked by itself: a self-deadlock on the latch.
So I killed process 26755 at the OS level with "kill -9" to get the latch released. But unluckily the DB was still hung.
I took a fresh systemstate dump and checked again:
System State 1
~~~~~~~~~~~~~~~~
1:
2: 0: waiting for 'latch free' [Latch 60009fc8] ---- PMON is blocked
3: 0: waiting for 'rdbms ipc message'
4: 0: waiting for 'VKTM Logical Idle Wait'
5: 0: waiting for 'rdbms ipc message'
6:
7: 0: waiting for 'rdbms ipc message'
8: 0: waiting for 'DIAG idle wait'
9: 0: waiting for 'rdbms ipc message'
10: 0: waiting for 'rdbms ipc message'
11: 0: waiting for 'rdbms ipc message'
12: 0: waiting for 'rdbms ipc message'
13: 0: waiting for 'rdbms ipc message'
14: 0: waiting for 'rdbms ipc message'
15: 0: waiting for 'rdbms ipc message'
16: 0: waiting for 'rdbms ipc message'
17: 0: waiting for 'rdbms ipc message'
18: 0: waiting for 'rdbms ipc message'
19: 0: waiting for 'rdbms ipc message'
20: 0: waiting for 'rdbms ipc message'
21: 0: waiting for 'rdbms ipc message'
22: 0: waiting for 'rdbms ipc message'
23: 0: waiting for 'rdbms ipc message'
24: 0: waiting for 'rdbms ipc message'
25: 0: waiting for 'rdbms ipc message'
26: 0: waiting for 'rdbms ipc message'
27: 0: waiting for 'rdbms ipc message'
28: 0: waiting for 'rdbms ipc message'
29: 0: waiting for 'rdbms ipc message'
30: 0: waiting for 'rdbms ipc message'
31: 0: waiting for 'rdbms ipc message'
32: 0: waiting for 'smon timer'
33: 0: waiting for 'rdbms ipc message'
34: 0: waiting for 'latch free' [Latch 60009fc8]
35: 0: waiting for 'rdbms ipc message'
36: [Latch 60009fc8]
37: 0: waiting for 'enq: PV - syncstart'[Enqueue PV-00000000-00000000]
38: [Latch 60009fc8]
39: 0: waiting for 'SQL*Net message from client'
40: 0: waiting for 'VKRM Idle'
41: 0: waiting for 'rdbms ipc message'
42: 0: waiting for 'rdbms ipc message'
43: 0: waiting for 'rdbms ipc message'
44: 0: waiting for 'Streams AQ: qmn coordinator idle wait'
45: 0: waiting for 'enq: PR - contention'[Enqueue PR-00000000-00000000]
46: [Latch 60009fc8]
47: 0: waiting for 'Streams AQ: qmn slave idle wait'
48: [Latch 60009fc8]
49: 0: waiting for 'SQL*Net message from client'
50: [Latch 60009fc8]
51: 9: waited for 'Streams AQ: waiting for time management or cleanup tasks'
52: 0: waiting for 'SQL*Net message from client'
Blockers
~~~~~~~~
Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of '???' implies that the holder was not found in the
systemstate.
Resource Holder State
Latch 60009fc8 ??? Blocker
Enqueue PV-00000000-00000000 34: 34: is waiting for Latch 60009fc8
Enqueue PR-00000000-00000000 34: 34: is waiting for Latch 60009fc8
Object Names
~~~~~~~~~~~~
Latch 60009fc8 process allocation
Enqueue PV-00000000-00000000
Enqueue PR-00000000-00000000
We can see PMON is now also blocked on latch 60009fc8.
Below is the process dump for PMON:
PROCESS 2: PMON
  ----------------------------------------
  SO: 0x19935b370, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x19935b370, name=process, file=ksu.h LINE:12451, pg=0
  (process) Oracle pid:2, ser:1, calls cur/top: 0x19f525a88/0x19f525a88
            flags : (0xe) SYSTEM
            flags2: (0x0),  flags3: (0x0)
            intr error: 0, call error: 0, sess error: 0, txn error 0
            intr queue: empty
  ksudlp FALSE at location: 0
  (post info) last post received: 0 0 68
              last post received-location: kso2.h LINE:467 ID:ksoreq_reply
              last process to post me: 19f2edfd8 2 2
              last post sent: 0 0 26
              last post sent-location: ksa2.h LINE:282 ID:ksasnd
              last process posted by me: 1982dec38 1 6
  (latch info) wait_event=0 bits=0
    Location from where call was made: ksu.h LINE:13721 ID:ksudlp:
    Context saved from call: 6814711808
      waiting for 60009fc8 process allocation level=7
      Location from where latch is held: ksu.h LINE:13677 ID:ksuapc: proc addr
      Context saved from call: 6965578584
      state=busy [holder orapid=0] wlstate=free [value=0]
        gotten 57043 times wait, failed first 54 sleeps 0
        gotten 26253 times nowait, failed: 1
      possible holder pid = 0 ospid=26755
  Process Group: DEFAULT, pseudo proc: 0x19734a4b8
  O/S info: user: oracle, term: UNKNOWN, ospid: 1311
  OSD pid info: Unix process pid: 1311, image: oracle@server552(PMON)
We can see PMON is waiting on latch 60009fc8, blocked by the very process we just killed.
It is clear now:
process 26755 was in a self-deadlock on latch 60009fc8 and was blocking other processes. When I killed process 26755 to release the latch, PMON had to step in to clean up the dead process and release its resources.
But before PMON could do that cleanup, it needed the same latch, 60009fc8, so it too was blocked and could not make progress.
So it is a typical PMON deadlock.
We had to bounce the DB in the end.
July 30, 2012
RMAN senior: duplicate database from tape without target DB, "RMAN-05548: The set of duplicated tablespaces is not self-contained"
Today a user asked us to restore a table from an old backup.
The table is small, a few thousand rows, but the DB is big, 200 GB, and we don't have enough space to restore the whole DB.
We decided to restore only the table and its tablespace, using the "duplicate database ... skip tablespace" clause to skip the irrelevant tablespaces:
server001[oracle]_ORCLPROD> rman target / auxiliary sys/oracle@temp2 catalog rmanadmin/rman_psswd@rmancatalog
Recovery Manager: Release 11.2.0.2.0 - Production on Sun Jul 29 04:23:34 2012
Copyright (c) 1982, 2009, Oracle and/or its affiliates. All rights reserved.
connected to target database: ORCLPROD (DBID=1650044020)
connected to recovery catalog database
connected to auxiliary database: TEMP2 (not mounted)
RMAN> run
{
set until time "to_date('03-JUL-2012 12:00:00','dd-mon-yyyy hh24:mi:ss')";
allocate auxiliary channel ch1 type 'sbt_tape' parms 'ENV=(NB_ORA_SERVER=netbacksev1_b,NB_ORA_CLIENT=server001_b)';
allocate auxiliary channel ch2 type 'sbt_tape' parms 'ENV=(NB_ORA_SERVER=netbacksev1_b,NB_ORA_CLIENT=server001_b)';
duplicate target database to temp2 nofilenamecheck skip tablespace USERS,TBS1,TBS2,TBS3,TBS4,TBS5,TBS6,TBS7;
}
executing command: SET until clause
..........
..........
Violation: ORA-39906: Constraint SYS_C0050693 between table GETS_SO.GETS_SO_ATS_PROJECT in tablespace OTD and table GETS_SO.GETS_SO_SWPT_SSO_SHOP_ORD in tablespace SWPT_TABLES.
Violation: ORA-39908: Index GETS_SO.GETS_SO_NRE_BUSINESS_MST_PK in tablespace OTX_INDEXES enforces primary constraints of table GETS_SO.GETS_SO_NRE_BUSINESS_MST in tablespace OTD.
Violation: ORA-39908: Index GETS_SO.GETS_SO_NRE_CUSTOMER_MST_PK in tablespace OTX_INDEXES enforces primary constraints of table GETS_SO.GETS_SO_NRE_CUSTOMER_MST in tablespace OTD.
Violation: ORA-39908: Index GETS_SO.GETS_SO_NRE_ESR_DETAIL_PK in tablespace OTX_INDEXES enforces primary constraints of table GETS_SO.GETS_SO_NRE_ESR_DETAIL in tablespace OTD.
Violation: ORA-39907: Index GETS_SO.GETS_SO_NRE_ESR_DSGN_TSK_IE2 in tablespace OTX_INDEXES points to table GETS_SO.GETS_SO_NRE_ESR_DSGN_TSK in tablespace OTD.
Violation: ORA-39907: Index GETS_SO.GETS_SO_NRE_ESR_RPT_TSK_IE1 in tablespace OTX_INDEXES points to table GETS_SO.GETS_SO_NRE_ESR_RPT_TSK in tablespace OTD.
..........
..........
released channel: ch1
released channel: ch2
released channel: ch3
released channel: ch4
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 07/29/2012 05:13:00
RMAN-05501: aborting duplication of target database
RMAN-05548: The set of duplicated tablespaces is not self-contained
The RMAN-05548 check is an enhancement in 11g: when we skip some tablespaces, Oracle checks the dependencies of the remaining tablespaces through the TTS transport-check procedure, and if that check fails, RMAN refuses to restore.
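The check itself is the same self-containment test used for transportable tablespaces, and it can be run manually with the standard DBMS_TTS package; a sketch (the tablespace list here is only illustrative):
EXEC SYS.DBMS_TTS.TRANSPORT_SET_CHECK('OTD,OTX_INDEXES', TRUE);
SELECT * FROM SYS.TRANSPORT_SET_VIOLATIONS;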
How do we get around that? The only method I found is a targetless duplicate.
Since the tablespace check relies on the target database, with this method Oracle cannot run the dependency check against the target DB:
no-target duplicate:
rman auxiliary / catalog rmanadmin/rman_psswd@rmancatalog
set dbid 21111;
run
{
set until time "to_date('03-JUL-2012 12:00:00','dd-mon-yyyy hh24:mi:ss')";
allocate auxiliary channel ch1 type 'sbt_tape' parms 'ENV=(NB_ORA_SERVER=netbacksev1_b,NB_ORA_CLIENT=server001_b)';
allocate auxiliary channel ch2 type 'sbt_tape' parms 'ENV=(NB_ORA_SERVER=netbacksev1_b,NB_ORA_CLIENT=server001_b)';
duplicate database to temp backup location 'sbt_tape' nofilenamecheck skip tablespace USERS,TBS1,TBS2,TBS3,TBS4,TBS5,TBS6,TBS7;
}
July 21, 2012
From 11.2.0.2.5, Oracle no longer supports ADJUST SCN
Today, after an incomplete recovery and opening a DB, I decided to increase the SCN of the DB.
I set "_allow_error_simulation=true" and used:
alter session set events 'IMMEDIATE trace name adjust_scn level 4490';
ALTER SESSION SET EVENTS '10015 TRACE NAME ADJUST_SCN LEVEL 4490';
The alert.log showed that Oracle did not even try to bump the SCN.
Very weird.
Then I decided to try "_minimum_giga_scn": I added "_minimum_giga_scn=7000" to the init.ora and started the DB:
SYS @ apci > startup
ORA-01078: failure in processing system parameters
LRM-00101: unknown parameter name '_minimum_giga_scn'
Seriously? I removed the parameter and the DB started normally again.
Let's query the hidden parameter:
SYS @ apci > select ksppinm from x$ksppi where ksppinm like '%giga%';
no rows selected
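On versions where the parameter still exists, its current value can be read by joining to x$ksppcv; a sketch (x$ fixed tables, so query as SYS):
select i.ksppinm, v.ksppstvl
  from x$ksppi i, x$ksppcv v
 where i.indx = v.indx
   and i.ksppinm like '%giga%';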
Let's check databases of different versions:
10.2.0.4, no problem, as we expected:
SYS @ 10gDB> select ksppinm from x$ksppi where ksppinm like '%giga%';
KSPPINM
--------------------------------------------------------------------------------
_minimum_giga_scn
11.2.0.2.4, the parameter is still there:
SYS @ XXDB >select ksppinm from x$ksppi where ksppinm like '%giga%';
KSPPINM
--------------------------------------------------------------------------------
_minimum_giga_scn
11.2.0.2.5, from this version the parameter disappears:
SQL> select ksppinm from x$ksppi where ksppinm like '%giga%';
no rows selected
11.2.0.2.6:
SYS @ DB1 > select ksppinm from x$ksppi where ksppinm like '%giga%';
no rows selected
11.2.0.3:
SYS @ DB2 > select ksppinm from x$ksppi where ksppinm like '%giga%';
no rows selected
We can see Oracle removed the adjust-SCN mechanism in 11.2.0.2.5, so we can no longer use Oracle's own function to adjust the SCN.
Is there any alternative?
Yes: we can first increase the checkpoint_change# value in the datafile header block to an appropriate value, and then re-create the controlfile.
The new controlfile will then take our new value as the current SCN.
Below are the steps:
SYS @ test8 > select current_scn from v$database;
CURRENT_SCN
-----------
    4324729
BBED> set offset 484
        OFFSET          484
BBED> dump
 File: /d735/data02/oracle/test8/data/system01.dbf (1)
 Block: 1                Offsets:  484 to  995           Dba:0x00400001
------------------------------------------------------------------------
 3ffe4100 00000000 59d60a2f 01000000 01000000 62010000 1000db8a 02000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
BBED> modify /x dddddddd
Warning: contents of previous BIFILE will be lost. Proceed? (Y/N) Y
 File: /d735/data02/oracle/test8/data/system01.dbf (1)
 Block: 1                Offsets:  484 to  995           Dba:0x00400001
------------------------------------------------------------------------
 dddddddd 00000000 59d60a2f 01000000 01000000 62010000 1000db8a 02000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
BBED> sum apply;
Check value for File 1, Block 1:
current = 0x5036, required = 0x5036
SYS @ test8 > startup nomount;
ORACLE instance started.
Total System Global Area  730714112 bytes
Fixed Size                  2230080 bytes
Variable Size             318769344 bytes
Database Buffers          322961408 bytes
Redo Buffers               86753280 bytes
SYS @ test8 > CREATE CONTROLFILE REUSE DATABASE "TEST8" RESETLOGS NOARCHIVELOG
  2  MAXLOGFILES 200
  3  MAXLOGMEMBERS 4
...........
...........
 29  '/d735/data02/oracle/test8/data/user03_new.dbf'
 30  CHARACTER SET UTF8
 31  ;
Control file created.
SYS @ test8 > alter database open resetlogs;
Database altered.
SYS @ test8 > select current_scn from v$database;
CURRENT_SCN
-----------
 3722305178
Done.
July 5, 2012
Instance evicted due to HAIP failure
On a RAC cluster, one server's instance was abruptly evicted and re-joined the cluster after 10 minutes. Below is from the DB's alert log:
Wed Jun 27 00:34:55 2012
Errors in file /q001/product/ora_base/diag/rdbms/ttpoltq/ttpoltq1/trace/ttpoltq1_ora_21759.trc (incident=139278):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 169.254.25.210 not found. Check output from ifconfig command
.............
ORA-29740: evicted by instance number 3, group incarnation 60
LMON (ospid: 20361): terminating the instance due to error 29740
Instance terminated by LMON, pid = 20361
The ORA-27303 line above ("requested interface 169.254.25.210 not found") tells us something is wrong with the interconnect, and 169.254.* is an HAIP address, not a physical one.
After checking, the physical interfaces on the 10.0.0.* interconnect network were all working fine.
The GPnP log file also showed no abnormal information.
So the issue had to be with HAIP itself.
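(To confirm which addresses the instances are actually using for the interconnect, a query like the one below, run from any open instance, is handy; on 11.2 with HAIP enabled it normally lists the 169.254.* addresses rather than the physical ones. The prompt and instance name here are only illustrative.)

SYS @ ttpoltq3 > select inst_id, name, ip_address, is_public from gv$cluster_interconnects order by inst_id;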
Let's check CSSD's logfile:
oracle $ cat ocssd.log|egrep -i 'haip|disk'
2012-06-27 00:31:44.417: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 7910 msecs
2012-06-27 00:31:52.433: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 15920 msecs
2012-06-27 00:32:00.449: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 23940 msecs
2012-06-27 00:32:08.170: [ CSSD][1105717568]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 31660 msecs
2012-06-27 00:32:16.481: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 39970 msecs
2012-06-27 00:32:24.497: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 47980 msecs
2012-06-27 00:32:32.513: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 56000 msecs
2012-06-27 00:32:40.945: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 64430 msecs
2012-06-27 00:32:49.236: [ CSSD][1105717568]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 72720 msecs
2012-06-27 00:32:58.565: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 82040 msecs
2012-06-27 00:33:06.581: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 90060 msecs
2012-06-27 00:33:14.597: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 98070 msecs
2012-06-27 00:33:15.539: [ CSSD][1109621056]clssnmvDiskCheck: (ORCL:ASM_DISK01) No I/O completed after 50% maximum time, 200000 ms, will be considered unusable in 99990 ms
2012-06-27 00:33:22.614: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 106090 msecs
2012-06-27 00:33:30.629: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 114100 msecs
2012-06-27 00:33:34.069: [ CSSD][1105717568]clssgmFenceClient: fencing client (0x19278b0), member 1 in group haip.cluster_interconnect, no share, death fence 1, SAGE fence 0
2012-06-27 00:33:34.069: [ CSSD][1105717568]clssgmUnreferenceMember: global grock haip.cluster_interconnect member 1 refcount is 1
2012-06-27 00:33:34.069: [ CSSD][1093089600]clssgmTermMember: Terminating member 1 (0x18afc30) in grock haip.cluster_interconnect
2012-06-27 00:33:34.069: [ CSSD][1093089600]clssgmUnreferenceMember: global grock haip.cluster_interconnect member 1 refcount is 0
2012-06-27 00:33:34.070: [ CSSD][1111198016]clssgmRemoveMember: grock haip.cluster_interconnect, member number 1 (0x18afc30) node number 1 state 0x0 grock type 2
2012-06-27 00:33:34.070: [ CSSD][1111198016]clssgmGrantLocks: 1-> new master (3/3) group haip.cluster_interconnect
2012-06-27 00:33:34.072: [ CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6c30034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:33:34.072: [ CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6c30034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(109)
2012-06-27 00:33:34.242: [ CSSD][1105717568]clssgmJoinGrock: global grock haip.cluster_interconnect new client 0x2aaab0611a70 with con 0x12622491, requested num 1, flags 0x4000100
2012-06-27 00:33:34.242: [ CSSD][1105717568]clssgmAddGrockMember: adding member to grock haip.cluster_interconnect
2012-06-27 00:33:34.242: [ CSSD][1111198016]clssgmAddMember: Adding fencing for member 1, group haip.cluster_interconnect, death 1, SAGE 0
2012-06-27 00:33:34.242: [ CSSD][1111198016]clssgmAddMember: member (1/0x2aaab051c000) added. pbsz(0) prsz(0) flags 0x0 to grock (0x2aaab00e3190/haip.cluster_interconnect)
2012-06-27 00:33:34.242: [ CSSD][1111198016]clssgmCommonAddMember: global group grock haip.cluster_interconnect member(1/Local) node(1) flags 0x0 0x423b6300
2012-06-27 00:33:34.245: [ CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6c70034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:33:34.245: [ CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6c70034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(110)
2012-06-27 00:33:34.562: [ CSSD][1105717568]clssgmExitGrock: client 1 (0x2aaab0611a70), grock haip.cluster_interconnect, member 1
2012-06-27 00:33:34.562: [ CSSD][1105717568]clssgmUnregisterPrimary: Unregistering member 1 (0x2aaab051c000) in global grock haip.cluster_interconnect
2012-06-27 00:33:34.562: [ CSSD][1105717568]clssgmUnreferenceMember: global grock haip.cluster_interconnect member 1 refcount is 0
2012-06-27 00:33:34.562: [ CSSD][1111198016]clssgmRemoveMember: grock haip.cluster_interconnect, member number 1 (0x2aaab051c000) node number 1 state 0x4 grock type 2
2012-06-27 00:33:34.564: [ CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6c80034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:33:34.565: [ CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6c80034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(111)
2012-06-27 00:33:34.627: [ CSSD][1105717568]clssgmJoinGrock: global grock haip.cluster_interconnect new client 0x2aaab04022b0 with con 0x1262256c, requested num 1, flags 0x4000100
2012-06-27 00:33:34.627: [ CSSD][1105717568]clssgmAddGrockMember: adding member to grock haip.cluster_interconnect
2012-06-27 00:33:34.627: [ CSSD][1111198016]clssgmAddMember: Adding fencing for member 1, group haip.cluster_interconnect, death 1, SAGE 0
2012-06-27 00:33:34.627: [ CSSD][1111198016]clssgmAddMember: member (1/0x2aaab04c0f80) added. pbsz(0) prsz(0) flags 0x0 to grock (0x2aaab00e3190/haip.cluster_interconnect)
2012-06-27 00:33:34.627: [ CSSD][1111198016]clssgmCommonAddMember: global group grock haip.cluster_interconnect member(1/Local) node(1) flags 0x0 0x423b6300
2012-06-27 00:33:34.629: [ CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6cb0034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:33:34.630: [ CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6cb0034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(112)
2012-06-27 00:33:39.647: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 123120 msecs
2012-06-27 00:33:47.663: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 131130 msecs
2012-06-27 00:33:55.680: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 139150 msecs
2012-06-27 00:34:04.115: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 147580 msecs
2012-06-27 00:34:05.658: [ CSSD][1109621056]clssnmvDiskCheck: (ORCL:ASM_DISK01) No I/O completed after 75% maximum time, 200000 ms, will be considered unusable in 49880 ms
2012-06-27 00:34:11.712: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 155180 msecs
2012-06-27 00:34:19.728: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 163190 msecs
2012-06-27 00:34:28.314: [ CSSD][1105717568]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 171770 msecs
2012-06-27 00:34:35.732: [ CSSD][1109621056]clssnmvDiskCheck: (ORCL:ASM_DISK01) No I/O completed after 90% maximum time, 200000 ms, will be considered unusable in 19810 ms
2012-06-27 00:34:36.762: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 180220 msecs
2012-06-27 00:34:43.044: [ CSSD][1111198016]clssgmRemoveMember: grock haip.cluster_interconnect, member number 3 (0x2aaaac27b800) node number 3 state 0x0 grock type 2
2012-06-27 00:34:43.044: [ CSSD][1111198016]clssgmGrantLocks: 3-> new master (2/2) group haip.cluster_interconnect
2012-06-27 00:34:43.046: [ CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x7200034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:34:43.047: [ CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x7200034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(113)
2012-06-27 00:34:43.291: [ CSSD][1111198016]clssgmAddMember: member (3/0x2aaaac5844a0) added. pbsz(0) prsz(0) flags 0x0 to grock (0x2aaab00e3190/haip.cluster_interconnect)
2012-06-27 00:34:43.291: [ CSSD][1111198016]clssgmCommonAddMember: global group grock haip.cluster_interconnect member(3/Remote) node(3) flags 0x0 0xac5844a0
2012-06-27 00:34:43.293: [ CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x7400034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:34:43.293: [ CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x7400034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(114)
2012-06-27 00:34:43.636: [ CSSD][1111198016]clssgmRemoveMember: grock haip.cluster_interconnect, member number 3 (0x2aaaac5844a0) node number 3 state 0x0 grock type 2
2012-06-27 00:34:43.638: [ CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x74f0034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(3)
2012-06-27 00:34:43.639: [ CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x74f0034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(115)
2012-06-27 00:34:43.730: [ CSSD][1111198016]clssgmAddMember: member (3/0x1dba9d0) added. pbsz(0) prsz(0) flags 0x0 to grock (0x2aaab00e3190/haip.cluster_interconnect)
2012-06-27 00:34:43.731: [ CSSD][1111198016]clssgmCommonAddMember: global group grock haip.cluster_interconnect member(3/Remote) node(3) flags 0x0 0x1dba9d0
2012-06-27 00:34:43.733: [ CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x74a0034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:34:43.733: [ CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x74a0034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(116)
2012-06-27 00:34:45.457: [ CSSD][1105717568]clssgmExitGrock: client 1 (0x2aaab04022b0), grock haip.cluster_interconnect, member 1
2012-06-27 00:34:45.457: [ CSSD][1105717568]clssgmUnregisterPrimary: Unregistering member 1 (0x2aaab04c0f80) in global grock haip.cluster_interconnect
2012-06-27 00:34:45.457: [ CSSD][1105717568]clssgmUnreferenceMember: global grock haip.cluster_interconnect member 1 refcount is 0
2012-06-27 00:34:45.457: [ CSSD][1111198016]clssgmRemoveMember: grock haip.cluster_interconnect, member number 1 (0x2aaab04c0f80) node number 1 state 0x4 grock type 2
2012-06-27 00:34:45.459: [ CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x7640034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:34:45.459: [ CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x7640034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(117)
2012-06-27 00:43:28.566: [ CSSD][1105717568]clssgmJoinGrock: global grock haip.cluster_interconnect new client 0x2aaab43103e0 with con 0x126435d1, requested num 1, flags 0x4000100
2012-06-27 00:43:28.566: [ CSSD][1105717568]clssgmAddGrockMember: adding member to grock haip.cluster_interconnect
2012-06-27 00:43:28.566: [ CSSD][1111198016]clssgmAddMember: Adding fencing for member 1, group haip.cluster_interconnect, death 1, SAGE 0
2012-06-27 00:43:28.566: [ CSSD][1111198016]clssgmAddMember: member (1/0x2aaab46fbd20) added. pbsz(0) prsz(0) flags 0x0 to grock (0x2aaab00e3190/haip.cluster_interconnect)
2012-06-27 00:43:28.566: [ CSSD][1111198016]clssgmCommonAddMember: global group grock haip.cluster_interconnect member(1/Local) node(1) flags 0x0 0x423b6300
2012-06-27 00:43:28.568: [ CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x2aa0035) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:43:28.569: [ CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x2aa0035) of grock(haip.cluster_interconnect) received all acks, grock update sequence(118)
2012-06-27 00:43:44.470: [ CSSD][1105717568]clssgmExecuteClientRequest: VOTEDISKQUERY recvd from proc 32 (0x2aaab4269a70)
2012-06-27 00:43:45.958: [ CSSD][1105717568]clssgmExecuteClientRequest: VOTEDISKQUERY recvd from proc 32 (0x2aaab4269a70)
2012-06-27 00:43:48.770: [ CSSD][1105717568]clssgmExecuteClientRequest: VOTEDISKQUERY recvd from proc 34 (0x2aaab05c4340)
2012-06-27 00:43:48.771: [ CSSD][1105717568]clssgmExecuteClientRequest: VOTEDISKQUERY recvd from proc 34 (0x2aaab05c4340)
From the CSSD log we can see that the voting disk ping thread had stopped being scheduled from 00:31:44, which started the countdown against the disk timeout:
2012-06-27 00:31:44.417: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 7910 msecs
The long disk I/O timeout threshold is 200 seconds in 11gR2, which matches the "200000 ms" maximum time being counted down in the clssnmvDiskCheck messages above.
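(The CSS thresholds can also be read straight from the clusterware as the Grid Infrastructure owner; on a default 11gR2 installation disktimeout comes back as 200 seconds and misscount, the network heartbeat limit, as 30 seconds.)

grid $ crsctl get css disktimeout
grid $ crsctl get css misscount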
During this countdown the cluster also ran its check on HAIP and concluded that HAIP had already died:
2012-06-27 00:33:34.069: [ CSSD][1105717568]clssgmFenceClient: fencing client (0x19278b0), member 1 in group haip.cluster_interconnect, no share, death fence 1, SAGE fence 0
The cluster then started to clean up HAIP and tried multiple times to restart it.
Later the voting disk came back online, so the disk timeout countdown stopped; HAIP was brought back up, the evicted instances rejoined, and everything returned to normal.
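(If you want to watch HAIP itself recover, it is registered as a local "-init" resource in 11.2, so its state can be checked as the Grid owner; the resource name below is the standard one.)

grid $ crsctl stat res ora.cluster_interconnect.haip -init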