June 30, 2012

RMAN blocked by ARCH process, waiting “enq: WL - contention”

Today we got a file system alert: one server's arch mount point was 90% full.

After checking, I found that one database's archived logs occupied 90% of the space,
and that database's archivelog backup job had been running for 10 days, still hanging there with no progress.

Checking inside the database, I found the RMAN job was blocked on the wait event shown below:

SID              : 244
OSUSER           : oracle
USERNAME         : SYS
STATE            : WAITING
STATUS           : ACTIVE
MODULE           : rman@server621 (TNS V1-V3)
BLOCKING_SESSION : 58
EVENT            : enq: WL - contention
PROGRAM          : rman@server621 (TNS V1-V3)
SQL_TEXT
--------------------------------------
begin dbms_rcvman.setDatabase(upper(:dbname), :rlscn, :rltime, :fhdbi, :db_unique_name, TRUE); dbms_rcvman.setCanApplyAnyRedo(TRUE); dbms_rcvman.setCanConvertCf(TRUE); if (:canhandletts = 1) then   dbms_rcvman.setCanHandleTransportableTbs(TRUE); end if; end;
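
For reference, a blocked session and its blocker can be listed with a query of this shape (just a sketch against v$session; not the exact query that produced the output above):

-- list sessions that are currently blocked, together with who is blocking them
select s.sid, s.username, s.osuser, s.status, s.event,
       s.blocking_session, s.module, s.program
  from v$session s
 where s.blocking_session is not null;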

It is blocked by session 58, which is an archiver (ARC1) process:
   SID     SPID   PROGRAM
------ -------- --------------------------
    58    10899   oracle@server621 (ARC1)

Kill the blocking archiver (ARC1) process; Oracle will restart it automatically.
After that, the RMAN backup job began to progress normally.


June 29, 2012

hack the core bootstrap$: replace bootstrap$ with a user table

bootstrap$ is the most fundamental table in the database. It contains the definitions of Oracle's most critical dictionary tables, such as obj$, tab$, ts$, etc.

During startup, Oracle reads bootstrap$ to build the dictionary for the whole database.
Any modification to bootstrap$ may destroy your database, and Oracle Support won't help you with that.

Below is the bootstrap$ structure and one row from it:

SYS @ geded > select count(*) from bootstrap$;
  COUNT(*)
----------
        60

SYS @ geded > select * from bootstrap$ where line#=18;

     LINE#       OBJ#
---------- ----------
SQL_TEXT
-------------------------------------------------------------------------------------
        18         18
CREATE TABLE OBJ$("OBJ#" NUMBER NOT NULL,"DATAOBJ#" NUMBER,"OWNER#" NUMBER NOT NULL,"NAME" VARCHAR2(30) NOT NULL,"NAMESPACE" NUMBER NOT NULL,"SUBNAME" VARCHAR2(30),"TYPE#" NUMBER NOT NULL,"C
TIME" DATE NOT NULL,"MTIME" DATE NOT NULL,"STIME" DATE NOT NULL,"STATUS" NUMBER NOT NULL,"REMOTEOWNER" VARCHAR2(30),"LINKNAME" VARCHAR2(128),"FLAGS" NUMBER,"OID$" RAW(16),"SPARE1" NUMBER,"SP
ARE2" NUMBER,"SPARE3" NUMBER,"SPARE4" VARCHAR2(1000),"SPARE5" VARCHAR2(1000),"SPARE6" DATE) PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 STORAGE (  INITIAL 16K NEXT 104K MINEXTENTS 1 MAXEXT
ENTS 2147483645 PCTINCREASE 0 OBJNO 18 EXTENTS (FILE 1 BLOCK 240))

SYS @ geded > desc bootstrap$;
Name              Null?    Type
-------------- ---------- ---------------
LINE#          NOT NULL   NUMBER
OBJ#           NOT NULL   NUMBER
SQL_TEXT       NOT NULL   VARCHAR2(4000)

Below is one possible error caused by a bootstrap$ failure during startup:
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00704: bootstrap process failure.

Today, let's hack bootstrap$ and even replace it with our own table:
Kevin.Zhang >  select count(*) from bootstrap$;
  COUNT(*)
----------
        60

Kevin.Zhang > shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.

Kevin.Zhang > startup upgrade;
ORACLE instance started.
Total System Global Area  730714112 bytes
Fixed Size                  2230080 bytes
Variable Size             318769344 bytes
Database Buffers          322961408 bytes
Redo Buffers               86753280 bytes
Database mounted.
Database opened.

Kevin.Zhang > create table KILLBOOT as select * from bootstrap$;
Table created.

Kevin.Zhang > delete from KILLBOOT where LINE#=59;
1 row deleted.

Kevin.Zhang > commit;
Commit complete.

Kevin.Zhang > delete from bootstrap$;
60 rows deleted.

Kevin.Zhang > commit;
Commit complete.

Here we use an internal procedure, DBMS_DDL_INTERNAL.SWAP_BOOTSTRAP, to swap bootstrap$ with our new table KILLBOOT.
In fact DBMS_DDL_INTERNAL.SWAP_BOOTSTRAP does only one thing: it updates kcvfhrdb (the bootstrap root DBA) in the super block (file 1, block 1):
Kevin.Zhang > exec DBMS_DDL_INTERNAL.SWAP_BOOTSTRAP('KILLBOOT');
PL/SQL procedure successfully completed.
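
To see the effect directly, the datafile headers can be dumped before and after the swap and compared; the root DBA recorded for file 1 (the kcvfhrdb field) should move from bootstrap$'s block to KILLBOOT's block. A sketch (the dump goes to the session's trace file, and the exact layout varies by version):

-- dump all datafile headers to the current session's trace file,
-- then look for the "root dba" entry of file 1
alter session set events 'immediate trace name file_hdrs level 10';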

Kevin.Zhang > update obj$ set name='BOOTSTRAP_DEL' where name='BOOTSTRAP$';
1 row updated.

Kevin.Zhang > commit;
Commit complete.

Kevin.Zhang > shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.

Kevin.Zhang > startup;
ORACLE instance started.
Total System Global Area  730714112 bytes
Fixed Size                  2230080 bytes
Variable Size             318769344 bytes
Database Buffers          322961408 bytes
Redo Buffers               86753280 bytes
Database mounted.
Database opened.

Kevin.Zhang > select count(*) from bootstrap$;
  COUNT(*)
----------
        59

Kevin.Zhang > select count(*) from bootstrap_del;
  COUNT(*)
----------
         0

Kevin.Zhang > select count(*) from KILLBOOT;
  COUNT(*)
----------
        59


We are done. Let's prove that we have succeeded in replacing bootstrap$ with our new table KILLBOOT:
Kevin.Zhang > select dbms_rowid.rowid_relative_fno(rowid) file_id,dbms_rowid.rowid_block_number(rowid) block_id from bootstrap$ where rownum<10;

   FILE_ID   BLOCK_ID
---------- ----------
         1      60817
         1      60817
         1      60817
         1      60817
         1      60817
         1      60817
         1      60817
         1      60817
         1      60817

9 rows selected.

From the block_id we can clearly see that these rows now come from a block belonging to the KILLBOOT table.
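
As a cross-check, the block can be mapped back to its owning segment via dba_extents, using the file/block numbers reported above (a sketch):

-- which segment owns file 1, block 60817? It should now be KILLBOOT.
select owner, segment_name, segment_type
  from dba_extents
 where file_id = 1
   and 60817 between block_id and block_id + blocks - 1;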

June 27, 2012

LMS terminating the instance due to error 4031

One of our two-node RAC databases hit ORA-04031 due to an imbalance among the shared pool subheaps.
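
For context, a quick way to see how much free memory each SGA pool has on each instance is a query like the one below. Note that v$sgastat does not break the shared pool down by subheap, so a heap dump is still needed to confirm a per-subheap imbalance; this is only a sketch:

-- free memory per SGA pool on each instance
select inst_id, pool, round(bytes/1024/1024, 1) free_mb
  from gv$sgastat
 where name = 'free memory'
 order by inst_id, pool;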

After tuning the shared pool parameters, I decided to bounce the instances one by one.
But while I was bouncing the instance on the first node, the instance on the second node also crashed.

Below is from the alert log of the second instance:

Wed Jun 27 04:57:28 2012
Reconfiguration started (old inc 4, new inc 6)
List of instances:
 2 (myinst: 2)
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Wed Jun 27 04:57:28 2012
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Wed Jun 27 04:57:28 2012
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Wed Jun 27 04:57:28 2012
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
Wed Jun 27 04:57:28 2012
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
 parallel recovery started with 6 processes
Started redo scan
Completed redo scan
 read 202 KB redo, 74 data blocks need recovery
Wed Jun 27 04:57:32 2012
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /d001/product/ora_base/diag/rdbms/ttpoltd/ttpoltd2/trace/ttpoltd2_lms2_942.trc:
ORA-04031: unable to allocate 4136 bytes of shared memory ("shared pool","unknown object","sga heap(3,0)","gcs dynamic r")
LMS2 (ospid: 942): terminating the instance due to error 4031
Wed Jun 27 04:57:33 2012
System state dump requested by (instance=2, osid=942 (LMS2)), summary=[abnormal instance termination].
System State dumped to trace file /d001/product/ora_base/diag/rdbms/ttpoltd/ttpoltd2/trace/ttpoltd2_diag_920.trc
Wed Jun 27 04:57:34 2012
ORA-1092 : opitsk aborting process
Errors in file /d001/product/ora_base/diag/rdbms/ttpoltd/ttpoltd2/trace/ttpoltd2_diag_920.trc:
ORA-00601: cleanup lock conflict
Dumping diagnostic data in directory=[cdmp_20120627045733], requested by (instance=2, osid=942 (LMS2)), summary=[abnormal instance termination].
Wed Jun 27 04:57:35 2012
ORA-1092 : opitsk aborting process
Wed Jun 27 04:57:35 2012
License high water mark = 38
Instance terminated by LMS2, pid = 942
USER (ospid: 17522): terminating the instance
Instance terminated by USER, pid = 17522

From the log above we can see that during the instance reconfiguration, LMS hit ORA-04031 and could not make progress with the reconfiguration, so LMS terminated the instance.

June 23, 2012

04031 and large pool, parallel and PX msg pool

Parallel execution requires the "PX msg pool". In the past, the "PX msg pool" was allocated from the shared pool:

SYS @ test > select * from v$sgastat where  upper(name) like '%PX%';
POOL         NAME                                                    BYTES
------------ -------------------------------------------------- ----------
shared pool  PX subheap desc                                           256
shared pool  PX msg pool struct                                         56
shared pool  PX QC deq stats                                          1696
shared pool  ksu: ksusg_emptypxsta                                    5200
shared pool  PX server deq stats                                      1696
shared pool  PX list of chunk lists                                   1088
shared pool  PX subheap                                              66792
shared pool  PX QC msg stats                                          2904
shared pool  PX server msg stats                                      2904
shared pool  PX msg pool                                           8192000

Since 10g, if Automatic Shared Memory Management (ASMM) is enabled, the "PX msg pool" is allocated from the large pool:
SYS @ test > select * from v$sgastat where name like '%PX%';
POOL         NAME                                                    BYTES
------------ -------------------------------------------------- ----------
shared pool  PX subheap desc                                           256
shared pool  PX msg pool struct                                         56
shared pool  PX QC deq stats                                          1696
shared pool  PX server deq stats                                      1696
shared pool  PX list of chunk lists                                   1088
shared pool  PX subheap                                             757656
shared pool  PX QC msg stats                                          2904
shared pool  PX server msg stats                                      2904
large pool   PX msg pool                                           8088608

Ignore the other names starting with PX in the shared pool; most of them are just descriptors for the "PX msg pool". For example, "PX msg pool struct" describes the structure of the "PX msg pool", and "PX server msg stats" records statistics for it.

Recently we hit ORA-04031 in the large pool:

ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")

Obviously it is caused by insufficient space in the large pool.
The current large pool size is 16m, which fits most cases, for example parallel degree 10.

But in our case the offending SQL runs with parallel degree 20:
*** 2012-06-20 22:11:37.731
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=5a10rbj2tx6j4) -----
select substrb(dump(val,16,0,32),1,120) ep, cnt from (select /*+ no_expand_table(t) index_rs(t) parallel(t,20) parallel_index(t,20) dbms_stats cursor_shari
ng_exact use_weak_name_resl dynamic_sampling(0) no_monitoring no_substrb_pad */max(substrb("BUSINXXXMENT",1,32)) val,count(*) cnt from "XXXAVETH"."XXX_EXT" t where substrb("BUSINXXXMENT",1,32) is not null group by nlssort(substrb("BUSINXXXMENT",1,32), 'NLS_SORT = binary')) order by nlssor
t(val,'NLS_SORT = binary')

If the SGA has enough free memory this is not an issue, because large_pool_size=16m only defines a lower bound for the large pool. When the large pool runs short of space, ASMM will try to grant it more:
SYS @ test > select * from v$sgastat where POOL ='large pool';
POOL         NAME                                               BYTES
------------ -------------------------------------------------- ----------
large pool   PX msg pool                                        8088608
large pool   free memory                                        300000
<---- here, from another session, I ran one SQL with parallel 20 ---->
SYS @ test > /
POOL         NAME                                               BYTES
------------ -------------------------------------------------- ----------
large pool   PX msg pool                                        26287976
large pool   free memory                                        2883640
large pool   KGH: NO ACCESS                                     188512

We can see the large pool extends automatically. Also notice "KGH: NO ACCESS": it marks memory that ASMM has taken from the large pool and handed to the buffer cache. But in our case the SGA had already reached sga_max_size, so there was no spare memory left to give back to the large pool.

The offending SQL comes from a daily gather-stats job:
exec DBMS_STATS.GATHER_SCHEMA_STATS(ownname=>'XXXVETH',estimate_percent=>20,block_sample=>false,method_opt=>'For all columns size 225',GRANULARITY=>'ALL',degree=>20,cascade=>true);

To fix the issue, we can either:
1. increase the large pool size, or
2. reduce the degree in the gather-stats job.
Both options are sketched below.
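
A rough sketch of both options (the 64m value and the lower degree of 8 are examples only, not recommendations):

-- option 1: raise the floor of the large pool
alter system set large_pool_size = 64m scope=both;

-- option 2: rerun the stats job with a lower degree
begin
  dbms_stats.gather_schema_stats(
    ownname          => 'XXXVETH',
    estimate_percent => 20,
    block_sample     => false,
    method_opt       => 'For all columns size 225',
    granularity      => 'ALL',
    degree           => 8,
    cascade          => true);
end;
/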


June 12, 2012

Real meaning of, and how to simulate, "ORA-01152: file was not restored from a sufficiently old backup"

I have seen people ask many times about the meaning of "ORA-01152: file n was not restored from a sufficiently old backup" and about the difference between ORA-01152 and "ORA-01113: file needs media recovery".

First of all, ORA-01152 is not related to the controlfile. I say so because some people claim that "ORA-01152: file was not restored from a sufficiently old backup" is caused by an old controlfile.
That is not true. A simple proof: even if you recreate the controlfile from scratch, you can still get the error.

So what is the real meaning of "ORA-01152: file was not restored from a sufficiently old backup"?

To figure it out, let's first look at how Oracle knows whether a recovery is necessary during startup.
In fact Oracle makes this judgement from the following two flags in the datafile header block:
1. OFFSET 0138-0139: DB_state.
00 means the DB is down, 04 means the DB is running.
When a DB is up and running, this flag is 04. During a normal shutdown, Oracle changes this flag to 00.
So if Oracle reads 04 from this flag during startup, it knows the latest shutdown did not complete cleanly, and a recovery will be necessary.

2. OFFSET 0484-0489: checkpoint_change#.
During a clean shutdown, Oracle performs a full checkpoint, so the checkpoint_change# of all datafiles will be in sync.
So if during startup Oracle finds that checkpoint_change# is inconsistent among the datafiles, a recovery will be required (a quick SQL check is sketched below).
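
These checks can also be approximated from SQL without dumping block headers. The FUZZY column of v$datafile_header roughly plays the role of the "DB is running" state flag, and CHECKPOINT_CHANGE# is the same SCN discussed above (a sketch):

-- compare datafile header checkpoint SCNs; a mismatch, or FUZZY=YES, means recovery is needed
select file#, status, fuzzy, checkpoint_change#
  from v$datafile_header;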



OK. Now we know how Oracle judges the need for recovery, from the two flags DB_state and checkpoint_change#.
It is then simple math that there are four combinations:
Scenario 1: DB_state=00 and checkpoint_change# are consistent.
This means the latest shutdown was a successful clean shutdown. Oracle won't try to perform any recovery.

Scenario 2: DB_state=04 and checkpoint_change# are inconsistent.
95% of database crashes fall into this scenario. When a DB crashes, Oracle has no time to change DB_state to 00, and checkpoint_change# is not consistent among the datafiles.
In this scenario, a recovery is required at startup; if you try to open resetlogs without recovering, you will get the error below:
ORA-01113: file 5 needs media recovery
ORA-01110: data file 5: '/data/oracle/data/miao/miao/stream01.dbf'
This is a common error that we know very well.

Scenario 3: DB_state=04 and checkpoint_change# are consistent.
How can this happen? For example, you execute "alter system checkpoint;" and immediately afterwards execute:
shutdown abort;
Then checkpoint_change# is the same across the datafiles, but DB_state remains 04.

The next time Oracle starts up, it reads DB_state=04, which means, as explained above, that the latest close of the datafile was not a normal close.
The error you get here is the same as in scenario 2:
ORA-01113: file 5 needs media recovery

Scenario 4: DB_state=00 and checkpoint_change# are inconsistent.
In this case, you will get the error "ORA-01152: file n was not restored from a sufficiently old backup".

But how can this happen? How can checkpoint_change# be inconsistent while DB_state=00, which indicates a clean shutdown?


For example:
At time T1, you shut down the database and take a closed (cold) clean backup.
Then you start the DB up, do some operations, and switch the logfile multiple times. At time T2 you shut the DB down normally again.

Now you delete some of the T2 datafiles and replace them with the datafiles backed up at T1.
All these datafiles now have DB_state=00, because they all come from clean shutdowns, but checkpoint_change# is no longer consistent among them.

In this scenario, when you try to open the DB resetlogs without recovery, you will get the error below:
ORA-01152: file 1 was not restored from a sufficiently old backup
ORA-01110: data file 1: '/data/oracle/data/miao/miao/system01.dbf'
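
For reference, the scenario can be walked through on a throwaway database roughly as follows. This is only a sketch: the datafile path is the example path from above, and depending on the version you may have to attempt an incomplete recovery and cancel it before the resetlogs open raises ORA-01152:

-- T1: clean shutdown, then take a cold copy of one datafile
shutdown immediate
host cp /data/oracle/data/miao/miao/system01.dbf /tmp/system01_T1.dbf

-- generate redo, switch logs a few times, then clean shutdown again at T2
startup
alter system switch logfile;
alter system switch logfile;
shutdown immediate

-- put back the T1 copy: the state flag says "clean", but checkpoint SCNs now disagree
host cp /tmp/system01_T1.dbf /data/oracle/data/miao/miao/system01.dbf
startup mount
alter database open resetlogs;   -- per the scenario above, this reports ORA-01152 for the replaced file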


June 11, 2012

CSSD died with no error during GM grock activity

Recently I observed a case where a node crashed and rebooted: while the node's clusterware was starting, its CSSD suddenly died during GM (Group Management) grock activity without any error, and the server rebooted because of that.

The CSSD log shows no errors:

2012-06-08 09:18:10.495: [ CSSD][1106151744]clssgmTestSetLastGrockUpdate: grock(CLSN.oratab), updateseq(0) msgseq(1), lastupdt<(nil)>, ignoreseq(0)
2012-06-08 09:18:10.495: [ CSSD][1106151744]clssgmAddMember: granted member(0) flags(0x2) node(3) grock (0x2aaab0063ac0/CLSN.oratab)
2012-06-08 09:18:10.495: [ CSSD][1106151744]clssgmCommonAddMember: global lock grock CLSN.oratab member(0/Remote) node(3) flags 0x2 0xb00c1df0
2012-06-08 09:18:10.495: [ CSSD][1106151744]clssgmHandleGrockRcfgUpdate: grock(CLSN.oratab), updateseq(1), status(0), sendresp(1)
2012-06-08 09:18:10.533: [ CSSD][1106151744]clssgmTestSetLastGrockUpdate: grock(CLSN.oratab), updateseq(1) msgseq(2), lastupdt<0x2aaab0054dc0>, ignoreseq(0)
2012-06-08 09:18:10.533: [ CSSD][1106151744]clssgmRemoveMember: grock CLSN.oratab, member number 0 (0x2aaab00c1df0) node number 3 state 0x0 grock type 3
2012-06-08 09:18:10.533: [ CSSD][1106151744]clssgmResetGrock: grock(CLSN.oratab) reset=1
2012-06-08 09:18:10.533: [ CSSD][1106151744]clssgmHandleGrockRcfgUpdate: grock(CLSN.oratab), updateseq(2), status(0), sendresp(1)
2012-06-08 09:18:10.535: [ CSSD][1106151744]clssgmTestSetLastGrockUpdate: grock(CLSN.oratab), updateseq(2) msgseq(3), lastupdt<0x2aaab0097f50>, ignoreseq(0)
2012-06-08 09:18:10.535: [ CSSD][1106151744]clssgmDeleteGrock: (0x2aaab0063ac0) grock(CLSN.oratab) deleted
<<----here CSSD died; it was doing some GM grock maintenance, suddenly died, and then the server rebooted. We see no abnormal information---->>
2012-06-08 09:28:30.052: [ CSSD][684039456]clsu_load_ENV_levels: Module = CSSD, LogLevel = 2, TraceLevel = 0
2012-06-08 09:28:30.052: [ CSSD][684039456]clsu_load_ENV_levels: Module = GIPCNM, LogLevel = 2, TraceLevel = 0
2012-06-08 09:28:30.052: [ CSSD][684039456]clsu_load_ENV_levels: Module = GIPCGM, LogLevel = 2, TraceLevel = 0
2012-06-08 09:28:30.052: [ CSSD][684039456]clsu_load_ENV_levels: Module = GIPCCM, LogLevel = 2, TraceLevel = 0
2012-06-08 09:28:30.052: [ CSSD][684039456]clsu_load_ENV_levels: Module = CLSF, LogLevel = 0, TraceLevel = 0
2012-06-08 09:28:30.052: [ CSSD][684039456]clsu_load_ENV_levels: Module = SKGFD, LogLevel = 0, TraceLevel = 0
2012-06-08 09:28:30.052: [ CSSD][684039456]clsu_load_ENV_levels: Module = GPNP, LogLevel = 1, TraceLevel = 0
2012-06-08 09:28:30.052: [ CSSD][684039456]clsu_load_ENV_levels: Module = OLR, LogLevel = 0, TraceLevel = 0
[ CSSD][684039456]clsugetconf : Configuration type [4].
2012-06-08 09:28:30.052: [ CSSD][684039456]clssscmain: Starting CSS daemon, version 11.2.0.2.0, in (clustered) mode with uniqueness value 1339162110


Very soon CSSDMONITOR found that CSSD had died, so it synced (rebooted) the node. Below is from the CSSDMONITOR log:
2012-06-08 09:18:12.479: [ CSSCLNT][1116543296]clsssRecvMsg: got a disconnect from the server while waiting for message type 27
2012-06-08 09:18:12.479: [ CSSCLNT][1113389376]clsssRecvMsg: got a disconnect from the server while waiting for message type 22
2012-06-08 09:18:12.479: [ USRTHRD][1113389376] clsnwork_queue: posting worker thread
2012-06-08 09:18:12.479: [ USRTHRD][1113389376] clsnpollmsg_main: exiting check loop
2012-06-08 09:18:12.479: [GIPCXCPT][1116543296]gipcInternalSend: connection not valid for send operation endp 0x86da430 [0000000000000162] { gipcEndpoint : l
ocalAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=031d7676-63f8ad6d-9852))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_cihcissdb759_)(GIPCI
D=63f8ad6d-031d7676-9884))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 9884, flags 0x3861e, usrFlags 0x20010 }, ret g
ipcretConnectionLost (12)
2012-06-08 09:18:12.479: [GIPCXCPT][1116543296]gipcSendSyncF [clsssServerRPC : clsss.c : 6271]: EXCEPTION[ ret gipcretConnectionLost (12) ] failed to send o
n endp 0x86da430 [0000000000000162] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=031d7676-63f8ad6d-9852))', remoteAddr 'clsc://(AD
DRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_cihcissdb759_)(GIPCID=63f8ad6d-031d7676-9884))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0,
pidPeer 9884, flags 0x3861e, usrFlags 0x20010 }, addr 0000000000000000, buf 0x428d0d80, len 80, flags 0x8000000
2012-06-08 09:18:12.479: [ CSSCLNT][1116543296]clsssServerRPC: send failed with err 12, msg type 7
2012-06-08 09:18:12.479: [ CSSCLNT][1116543296]clsssCommonClientExit: RPC failure, rc 3
2012-06-08 09:18:12.479: [ CSSCLNT][1094465856]clsssRecvMsg: got a disconnect from the server while waiting for message type 1
2012-06-08 09:18:12.479: [ CSSCLNT][1094465856]clssgsGroupGetStatus: communications failed (0/3/-1)
2012-06-08 09:18:12.479: [ CSSCLNT][1094465856]clssgsGroupGetStatus: returning 8
2012-06-08 09:18:12.479: [ USRTHRD][1114966336] clsnwork_process_work: calling sync

2012-06-08 09:18:12.479: [ USRTHRD][1094465856] clsnomon_status: Communications failure with CSS detected. Waiting for sync to complete...

I checked the OS log, and no errors were generated during that period of time:

Jun 8 09:18:02 node759 snmpd[21008]: Connection from UDP: [127.0.0.1 ]:27643
Jun 8 09:18:02 node759 snmpd[21008]: Received SNMP packet(s) from UDP: [127.0.0.1]:27643
Jun 8 09:18:02 node759 snmpd[21008]: Connection from UDP: [127.0.0.1]:27643
Jun 8 09:18:02 node759 snmpd[21008]: Connection from UDP: [127.0.0.1]:26972
Jun 8 09:18:02 node759 snmpd[21008]: Received SNMP packet(s) from UDP: [127.0.0.1]:26972
Jun 8 09:18:02 node759 snmpd[21008]: Connection from UDP: [127.0.0.1]:17368
Jun 8 09:18:02 node759 snmpd[21008]: Received SNMP packet(s) from UDP: [127.0.0.1]:17368
<<<----here the server suddenly rebooted with no information----->>>
Jun 8 09:23:03 node759 syslogd 1.4.1: restart.
Jun 8 09:23:03 node759 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Jun 8 09:23:03 node759 kernel: g enabled (TM1)
Jun 8 09:23:03 node759 kernel: Intel(R) Xeon(R) CPU X7550 @ 2.00GHz stepping 06
Jun 8 09:23:03 node759 kernel: CPU 3: Syncing TSC to CPU 0.
Jun 8 09:23:03 node759 kernel: CPU 3: synchronized TSC with CPU 0 (last diff -6 cycles, maxerr 477 cycles)
Jun 8 09:23:03 node759 kernel: SMP alternatives: switching to SMP code
Jun 8 09:23:03 node759 kernel: Booting processor 4/32 APIC 0x10
Jun 8 09:23:03 node759 kernel: Initializing CPU#4


Searching the web, I found only one bug report describing a similar issue:
Bug 13954099: CSSD DIES SUDDENLY WITHOUT ANY ERRORS

Just like our case, in that bug report the cluster version is also 11.2.0.2, and CSSD died during GM grock activity without any errors.
CSSDMONITOR then rebooted the node after finding that CSSD had died.

The problem is not reproducible; the next time we brought the cluster up on that node it succeeded, and the issue has not happened again.

No patch is available yet.


June 8, 2012

identify root session among hundreds of locks using hanganalyze

Today one of our most critical production databases reported a locking issue.
When I logged into the DB there were already more than 200 locks, and most of them were TM locks.
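
The listing below can be produced with the classic holder/waiter query against v$lock, something along these lines (a sketch; the Holder/Waiter labels come from the DECODE on the request column):

-- sessions holding or waiting on the same lock resources
select decode(l.request, 0, 'Holder: ', 'Waiter: ') || l.sid sess,
       l.id1, l.id2, l.lmode, l.request, l.type, l.ctime
  from v$lock l
 where (l.id1, l.id2, l.type) in
       (select id1, id2, type from v$lock where request > 0)
 order by l.id1, l.request;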

Below is the list:
SESS                   ID1        ID2      LMODE    REQUEST TY      CTIME
--------------- ---------- ---------- ---------- ---------- -- ----------
Holder: 4738         16737          0          2          0 TM       4021
Holder: 2080         16737          0          2          0 TM       4021
............
............

Holder: 3480         17347          0          3          0 TM       5958
Waiter: 4975         17347          0          0          2 TM       5792
Waiter: 1015         17347          0          0          2 TM       4090
Waiter: 1058         17347          0          0          2 TM       5832

............
Holder: 4729        983064    3155735          6          0 TX       3229
Waiter: 2301        983064    3155735          0          6 TX       2327
Waiter: 1194       2752522    2373725          0          6 TX       2931
Waiter: 3359       2752522    2373725          0          6 TX       2397


286 rows selected.

There are 286 lock holder/waiter rows and dozens of distinct blockers, and of course we can't kill them all.
So here is the problem: how do we identify the root session?

In 11g there is a powerful view, V$WAIT_CHAINS, but our DB is 10g.
The BLOCKING_SESSION column in v$session is useful, but not with so many locks and so many blockers.

A systemstate dump is also not an option, since I lost most of my scripts, including ass109.awk, after my computer was reinstalled.

The remaining choice is hanganalyze, so let's do it:
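
A hanganalyze dump can be taken from a SYSDBA session with oradebug, roughly like this (a sketch; level 3 is a typical starting point):

-- attach oradebug to our own process and request a hanganalyze dump
oradebug setmypid
oradebug hanganalyze 3
-- the trace file is written to user_dump_dest; oradebug tracefile_name shows its path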

==============
HANG ANALYSIS:
==============
Found 106 objects waiting for
<0/4634/18424/0x600086e0/28223/enq: TM - contention>
Found 105 objects waiting for
<0/3317/59720/0x20005f58/27464/enq: TM - contention>
...............
...............
Found 20 objects waiting for
<0/4974/30755/0x60041418/29550/enq: TM - contention>
Open chains found:

Chain 1 : :
<0/3480/26248/0x200229e8/190/db file sequential read>
-- <0/4634/18424/0x600086e0/28223/enq: TM - contention>
-- <0/679/55269/0xa004e990/13104/enq: TM - contention>
-- <0/793/864/0xa0054088/25644/enq: TM - contention>
Chain 2 : :
<0/3797/1/0x20001048/24805/log file parallel write>
-- <0/707/60023/0xa001bb38/27013/log file sync>
Other chains found:
Chain 3 : :
<0/689/49815/0xa0016c28/15330/enq: TM - contention>
-- <0/793/864/0xa0054088/25644/enq: TM - contention>
...............
...............
Chain 104 : :       ---------(oops.  104 chains!!)
<0/5041/14667/0x6000b650/6616/PX Deq: Table Q Normal>
State of nodes
([nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor):
[678]/0/679/55269/0xa05c3618/13104/NLEAF/1/8/[4633][3316]/792
[681]/0/682/5041/0xa05c7650/22742/IGN/9/10//none
[682]/0/683/33667/0xa05c8bb8/27585/IGN/11/12//none
[687]/0/688/35783/0xa05cf6c0/13708/IGN/13/14//none
...............
...............
[5063]/0/5064/1/0x608d2e40/24807/IGN/1657/1658//none
[5064]/0/5065/1/0x608d43a8/24799/IGN/1659/1660//none
====================
END OF HANG ANALYSIS
====================


From the hanganalyze output above it is clear that the root chains (the "Open chains" part) are:
Open chains found:
Chain 1 : :
<0/3480/26248/0x200229e8/190/db file sequential read>
-- <0/4634/18424/0x600086e0/28223/enq: TM - contention>
-- <0/679/55269/0xa004e990/13104/enq: TM - contention>
-- <0/793/864/0xa0054088/25644/enq: TM - contention>
Chain 2 : :
<0/3797/1/0x20001048/24805/log file parallel write>
-- <0/707/60023/0xa001bb38/27013/log file sync>


And the LEAF is:
server444[oracle]_hrpp1> cat supp_ora_17698.trc|grep -i single|grep -v none
[level 5] : 83 node dumps -- [SINGLE_NODE] [SINGLE_NODE_NW] [IGN_DMP]
server444[oracle]_hrpp1> cat supp_ora_17698.trc|grep -i leaf|grep -v NLEAF
[level 4] : 2 node dumps -- [REMOTE_WT] [LEAF] [LEAF_NW]
[3479]/0/3480/26248/0x20729710/190/LEAF/3/4//4633
[3796]/0/3797/1/0x208d18d8/24805/LEAF/28/29//706



Now it is clear that sessions 3480 and 3797 are the prime suspects.
After checking these two sessions in the DB, we identified session 3480 as the root session, since it is also a PL/SQL Developer session:

SID               : 3480
SERIAL#           : 26248
SPID              : 190
OSUSER            : 324979
USERNAME          : SUPXXXHOC
MACHINE           : MYHOMEGDC\MYH136092
LOGON_TIME        : 07-JUN-12 16:17
STATE             : WAITED SHORT TIME
STATUS            : ACTIVE
SQL_ID            : 5xzua1wzkfwpy
PREV_SQL_ID       : fvgnun67x1bcg
WAIT_TIME         : -1
SECONDS_IN_WAIT   : 1
EVENT             : db file sequential read
BLOCKING_INSTANCE :
BLOCKING_SESSION  :
PROGRAM           : plsqldev.exe
SQL_TEXT          : DELETE FROM T_HAD_XXXX_AUDIT WHERE FORM_ID = :B1


After killing that session, all 280+ locks were released immediately.

June 6, 2012

ORA-04031 caused by parallel: "PQ Slave mismatch"

One of our RAC databases repeatedly got ORA-04031 errors:
DDE: Problem Key 'ORA 4031' was flood controlled (0x6) (incident: 146179)
ORA-04031: unable to allocate 4064 bytes of shared memory ("shared pool","select di.inst_id,di.didbi,d...","sga heap(1,0)","kglsim heap")


When the ORA-04031 error occurs, we cannot log in to the DB or execute any command in it, including "alter system flush shared_pool;".

From the dump file we can see that most of the space in the shared pool is occupied by the SQL AREA:
Allocation Name         Size
"free memory "     188926712
"SQLA "            429067088


We also notice the following section of the dump:
LibraryHandle: Address=0x945ae1c8 Hash=e0b78783 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
ObjectName: Name=select di.inst_id,di.didbi,di.didbn,
to_date(di.dicts,'MM/DD/RRHH24:MI:SS','NLS_CALENDAR=Gregorian'),
to_number(di.dirls),to_date(di.dirlc,'MM/DD/RR HH24:MI:SS','NLS_CALENDAR=Gregorian'),
to_number(di.diprs),to_date(di.diprc,'MM/DD/RR HH24:MI:SS','NLS_CALENDAR=Gregorian'),
decode(di.dimla,0,'NOARCHIVELOG',1,'ARCHIVELOG','MANUAL'),to_number(di.discn),
to_number(di.difas),decode(bitand(di.diflg,256),256,'CREATED',decode(bitand(di.diflg,1024),
1024,'STANDBY',decode(bitand(di.diflg,32768),32768,'CLONE',decode(bitand
FullHashValue=25c9709f1fd8a1c460008a64e0b78783 Namespace=SQL AREA(00) Type=CURSOR(00) Identifier=3770124163 OwnerIdn=0
Statistics: InvalidationCount=54 ExecutionCount=0 LoadCount=60 ActiveLocks=1 TotalLockCount=59 TotalPinCount=1

Counters: BrokenCount=1 RevocablePointer=1 KeepDependency=1 BucketInUse=58 HandleInUse=58 HandleReferenceCount=0
Concurrency: DependencyMutex=0x945ae278(0, 58, 0, 0) Mutex=0x945ae2f8(479, 787, 0, 6)
Flags=RON/PIN/TIM/PN0/DBN/[10012841]
WaitersLists:
Lock=0x945ae258[0x945ae258,0x945ae258]
Pin=0x945ae238[0x945ae238,0x945ae238]
Timestamp: Current=06-03-2012 20:06:06
HandleReference: Address=0x945af0d8 Handle=(nil) Flags=[00]
LibraryObject: Address=0xa11940b0 HeapMask=0000-0001-0001-0000 Flags=EXS[0000] Flags2=[0000] PublicFlags=[0000]
ChildTable: size='64'
Child: id='0' Table=0xa1194f60 Reference=0xa1194988 Handle=0xa99918f8
Child: id='1' Table=0xa1194f60 Reference=0xa1194c50 Handle=0x85e39480
Child: id='2' Table=0xa1194f60 Reference=0xa1194f18 Handle=0xa9fbb930
Child: id='3' Table=0xa1194f60 Reference=0xab8006e8 Handle=0xa3927ca0
Child: id='4' Table=0xa1194f60 Reference=0xab8009b0 Handle=0xa9741b18
Child: id='5' Table=0xa1194f60 Reference=0xab800c78 Handle=0xa95b7860
Child: id='6' Table=0xa1194f60 Reference=0xab800f40 Handle=0xa388ac00
Child: id='7' Table=0xa1194f60 Reference=0xab801208 Handle=0xa38f3330
Child: id='8' Table=0xa1194f60 Reference=0x9e291120 Handle=0x858cb740
Child: id='9' Table=0xa1194f60 Reference=0x9e2913e8 Handle=0x858f89f8
Child: id='10' Table=0xa1194f60 Reference=0x9e2916b0 Handle=0x94f22198
Child: id='11' Table=0xa1194f60 Reference=0x9e291978 Handle=0xa31e6520
Child: id='12' Table=0xa1194f60 Reference=0x9e291c40 Handle=0x85926718
Child: id='13' Table=0xa1194f60 Reference=0xab3ee430 Handle=0x85829f60
Child: id='14' Table=0xa1194f60 Reference=0xab3ee6f8 Handle=0x857ab5c8
Child: id='15' Table=0xa1194f60 Reference=0xab3ee9c0 Handle=0x946d70b8
Child: id='16' Table=0x9e291d98 Reference=0xab3eec88 Handle=0x8591b818
Child: id='17' Table=0x9e291d98 Reference=0xab3eef50 Handle=0x85ff3670
Child: id='18' Table=0x9e291d98 Reference=0x9edc72e8 Handle=0x856316f0
Child: id='19' Table=0x9e291d98 Reference=0x9edc75b0 Handle=0x9a420f10
Child: id='20' Table=0x9e291d98 Reference=0x9edc7878 Handle=0xa9cf3b48
Child: id='21' Table=0x9e291d98 Reference=0x9edc7b40 Handle=0x85420040
Child: id='22' Table=0x9e291d98 Reference=0x9edc7e08 Handle=0xa96637d0
Child: id='23' Table=0x9e291d98 Reference=0xaba186e8 Handle=0x85830f10
Child: id='24' Table=0x9e291d98 Reference=0xaba189b0 Handle=0x9a2dc110
Child: id='25' Table=0x9e291d98 Reference=0xaba18c78 Handle=0x9a2077d8
Child: id='26' Table=0x9e291d98 Reference=0xaba18f40 Handle=0xa3e05e30
Child: id='27' Table=0x9e291d98 Reference=0xaba19208 Handle=0xa9999908
Child: id='28' Table=0x9e291d98 Reference=0x7eab32e8 Handle=0x858c09b8
Child: id='29' Table=0x9e291d98 Reference=0x7eab35b0 Handle=0x85948c90
Child: id='30' Table=0x9e291d98 Reference=0x7eab3878 Handle=0xa38c0c98
Child: id='31' Table=0x9e291d98 Reference=0x7eab3b40 Handle=0xa3ecae20
Child: id='32' Table=0xaba19360 Reference=0x7eab3e08 Handle=0xa3dc11a8
Child: id='33' Table=0xaba19360 Reference=0x9e454120 Handle=0x85b7dc70
Child: id='34' Table=0xaba19360 Reference=0x9e4543e8 Handle=0xa97c05d0
Child: id='35' Table=0xaba19360 Reference=0x9e4546b0 Handle=0xa95de978
.........
.........


From the above we can see that the SQL has a great many child cursors, which means the SQL was not being shared. This situation can be the cause of ORA-04031.

We need to find out why the SQL was not shared.

After bouncing the DB, I wrote a script to monitor it. The next day I checked the output and found the following:

SYS @ xxx13 >select sql_id,version_COUNT,SHARABLE_MEM/1024/1024,PERSISTENT_MEM/1024/1024 from gv$sqlarea where version_COUNT>30 or sql_id='6004acmhbg1w3' order by sql_id,inst_id;
SQL_ID        VERSION_COUNT SHARABLE_MEM/1024/1024 PERSISTENT_MEM/1024/1024
------------- ------------- ---------------------- ------------------------
0v6s91manuhz8           175             26.5997515               6.95877075
0v6s91manuhz8           148             22.4999056               5.88513184
0v6s91manuhz8            44             .348004341               .079528809
3j9yx7t5abcyg            71             8.32154465               4.36545563
3j9yx7t5abcyg            62             7.26753616               3.81208801
6wtwv38h14q3q            39             8.43435574               2.30687714
6wtwv38h14q3q            36             7.98299313               2.12942505

SYS @ xxx13 >select to_char(substr(reason,58,25)),count(*) from gV$SQL_SHARED_CURSOR where (sql_id='6004acmhbg1w3' and CHILD_NUMBER<10) or CHILD_NUMBER>30 group by to_char(substr(reason,58,25));
TO_CHAR(SUBSTR(REASON,58,25))       COUNT(*)
--------------------------------- ----------
PQ Slave mismatch(5)                     436


The "PQ Slave mismatch" reason means that during a parallel query, the slave processes failed to share the cursor and re-parsed it again.
Let's check the parallel settings in the DB:
show parameter parallel
parallel_max_servers integer 685
parallel_server      boolean TRUE



parallel_max_servers is set to 685.
This issue looks like a bug, but as a temporary workaround we can try to disable parallel execution in the DB.
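
A sketch of the temporary workaround (values are examples; setting parallel_max_servers to 0 effectively disables parallel execution instance-wide):

-- stop-gap: disable parallel query until the sharing problem is addressed
alter system set parallel_max_servers = 0 scope=memory sid='*';
-- or simply cap it at a lower value:
-- alter system set parallel_max_servers = 64 scope=both sid='*';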


June 4, 2012

cluster node eviction caused by bond switch

The third node of a three-node production cluster suddenly got a reboot.
Let's check.

Below is from the rebooted node's alert$hostname.log:

[cssd(16352)]CRS-1610:Network communication with node node803 (3) missing for 90% of timeout interval. Removal of this node from cluster in 2.190 seconds
2012-06-02 00:45:17.904
[cssd(16352)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008: in /prod/product/oracle/11.2.0.2/grid/log/cihcispdb804/cssd/ocssd.log.


From the messages above, it looks like there is some network issue.

Let's go through the ocssd.log for more details:

2012-06-02 00:45:01.675: [ CSSD][1111071040]clssnmSendingThread: sending status msg to all nodes
2012-06-02 00:45:01.675: [ CSSD][1111071040]clssnmSendingThread: sent 4 status msgs to all nodes
2012-06-02 00:45:02.677: [ CSSD][1109494080]clssnmPollingThread: node node802 (1) at 50% heartbeat fatal, removal in 14.900 seconds
2012-06-02 00:45:02.677: [ CSSD][1109494080]clssnmPollingThread: node cihcispdb802 (1) is impending reconfig, flag 197646, misstime 15100
2012-06-02 00:45:02.678: [ CSSD][1109494080]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2012-06-02 00:45:02.678: [ CSSD][1102502208]clssnmvDHBValidateNCopy: node 1, cihcispdb802, has a disk HB, but no network HB, DHB has rcfg 211384656, wrtcn
t, 22587619, LATS 1736660004, lastSeqNo 22587575, uniqueness 1332725246, timestamp 1338612301/2887464976
..........
..........
2012-06-02 00:45:10.696: [ CSSD][1109494080]clssnmPollingThread: node cihcispdb802 (1) at 75% heartbeat fatal, removal in 6.880 seconds
2012-06-02 00:45:10.696: [ CSSD][1109494080]clssnmPollingThread: node cihcispdb803 (3) at 75% heartbeat fatal, removal in 7.200 seconds
..........
..........
2012-06-02 00:45:17.580: [ CSSD][1109494080]clssnmPollingThread: Removal started for node cihcispdb802 (1), flags 0x3040e, state 3, wt4c 0

2012-06-02 00:45:17.580: [ CSSD][1109494080]clssnmMarkNodeForRemoval: node 1, cihcispdb802 marked for removal
..........
..........

2012-06-02 00:45:17.581: [    CSSD][3390307616]NMEVENT_SUSPEND [00][00][00][0e]
..........
..........
2012-06-02 00:45:17.589: [ CSSD][1097791808]clssnmvDiskKillCheck: not evicted, file ORCL:ASM_DISK_01 flags 0x00000000, kill block unique 0, my unique 1336874747
..........
..........
2012-06-02 00:45:17.902: [ CSSD][1109494080]clssnmPollingThread: Removal started for node node803 (3), flags 0x6040e, state 3, wt4c 0
..........
..........
2012-06-02 00:45:17.904: [ CSSD][1112648000]clssnmCheckDskInfo: Surviving cohort: 1,3

2012-06-02 00:45:17.904: [ CSSD][1112648000](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, cihcispdb804, is smaller than cohort of 2 nodes led by node 1, node802, based on map type 2
2012-06-02 00:45:17.904: [GIPCGMOD][1094637888] gipcmodGipcCallback: EXCEPTION[ ret gipcretInvalidObject (3) ] failed during request completion for req (nil), endp 0x1ebda200
2012-06-02 00:45:17.904: [ CSSD][1112648000]###################################
2012-06-02 00:45:17.904: [ CSSD][1112648000]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread

2012-06-02 00:45:17.904: [ CSSD][1112648000]###################################


From the messages above we know that after the rebooted node lost network communication with the other two nodes, it started trying to remove them from the cluster by writing kill information into the voting disk.

It then read, via the voting disk, the kill information written by node 1 and judged that node802 and node803 formed a two-node cohort, which is bigger than its own single-node cohort, so it chose to kill itself instead of killing the other two nodes:

2012-06-02 00:45:17.904: [ CSSD][1112648000]clssnmCheckDskInfo: Surviving cohort: 1,3
2012-06-02 00:45:17.904: [ CSSD][1112648000](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, node
804, is smaller than cohort of 2 nodes led by node 1, node802, based on map type 2


Now we are sure the issue was caused by the network.
Next, let's check the OS log:

Jun 2 00:45:35 node804 kernel: nx_nic[eth0]: Device is DOWN. Fail count[8]
Jun 2 00:45:35 node804 kernel: nx_nic[eth0]: Firmware hang detected. Severity code=0 Peg number=0 Error code=0 Return address=0
Jun 2 00:45:35 node804 kernel: nx_nic: Flash Version: Firmware[4.0.534], BIOS[2.1.0]
Jun 2 00:45:35 cihcispdb804 kernel: nx_nic: No memory on card. Load Cut through.
Jun 2 00:45:37 node804 kernel: bonding: bond1: link status definitely down for interface eth1, disabling it

Jun 2 00:45:37 node804 kernel: bonding: bond1: making interface eth5 the new active one.
Jun 2 00:45:37 node804 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it

Jun 2 00:45:37 node804 kernel: bonding: bond0: making interface eth4 the new active one.
Jun 2 00:45:39 node804 kernel: nx_nic: Loading the firmware from flash.
Jun 2 00:45:42 node804 kernel: nx_nic: Gen2 strapping detected
Jun 2 00:45:45 cihcispdb804 kernel: nx_nic HP NC375i Integrated Quad Port Multifunction Gigabit Server Adapter Board S/N Chip id 0x1

Jun 2 00:45:45 node804 kernel: nx_nic HP NC375i Integrated Quad Port Multifunction Gigabit Server Adapter Board S/N Chip id 0x1
Jun 2 00:45:45 node804 kernel: nx_nic HP NC375i Integrated Quad Port Multifunction Gigabit Server Adapter Board S/N Chip id 0x1
Jun 2 00:45:46 node804 kernel: nx_nic[eth1]: NIC Link is up
Jun 2 00:45:46 node804 kernel: bonding: bond1: link status definitely up for interface eth1.
Jun 2 00:45:46 node804 kernel: bonding: bond1: making interface eth1 the new active one.
Jun 2 00:45:48 node804 kernel: nx_nic[eth0]: NIC Link is up
Jun 2 00:45:48 cihcispdb804 kernel: bonding: bond0: link status definitely up for interface eth0.
Jun 2 00:45:48 node804 kernel: bonding: bond0: making interface eth0 the new active one.
Jun 2 00:45:50 node804 kernel: nx_nic[eth0]: Device is UP.
Jun 2 00:45:54 node804 kernel: nx_nic[eth2]: NIC Link is up


Why can a bond switchover make the Oracle interconnect fail?

Don't ask me, I don't know either!

In our environment I have encountered dozens of cases where the cluster failed when the bond switched over.
And once the bond switched back, the cluster recovered again.
(During that time ping and ssh still succeeded, and even a multicast check found no problem.)

And on the internet you can also find some articles about the same problem.

Maybe that is one of the reasons why, in 11g, Oracle suggests using HAIP instead of bonding for the private interconnect.


June 2, 2012

what will happen if we hang the CSSD process in RAC

In 11gR2 we have many new background processes, including two CSSD-related processes, cssdmonitor and cssdagent. The cssdagent process is in charge of respawning ocssd.bin.

Both of these processes are in charge of monitoring the CSSD state and detecting system hang problems.

[root@node1 node1]# ps -ef|grep cssd
root 479  1 0 00:30 ? 00:00:01 /prod/grid/app/11.2.0/grid/bin/cssdmonitor
root 471  1 0 00:30 ? 00:00:01 /prod/grid/app/11.2.0/grid/bin/cssdagent
grid 568  1 0 00:30 ? 00:00:04 /prod/grid/app/11.2.0/grid/bin/ocssd.bin


Let's see what happens if we hang the ocssd.bin process:
[root@node1 ~]# kill -SIGSTOP 568
[root@node1 ~]# kill -SIGSTOP 568
[root@node1 ~]# kill -SIGSTOP 568


After a few seconds the node got a reboot. Let's check the logs. We can find the following information in alert$hostname.log:
2012-05-27 09:39:01.558
[ohasd(4545)]CRS-8011:reboot advisory message from host: node1, component: mo224552, with time stamp: L-2012-02-05-22:51:26.126
[ohasd(4545)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-05-27 09:39:01.634
[ohasd(4545)]CRS-8011:reboot advisory message from host: node1, component: ag014510, with time stamp: L-2012-05-26-05:35:13.032
[ohasd(4545)]CRS-8013:reboot advisory message text: Rebooting after limit 28500 exceeded; disk timeout 28180, network timeout 28500, last heartbeat from CSSD at epoch seconds 1337981671.618, 28541 milliseconds ago based on invariant clock value of 57493073


Below is the information from the cssdagent and cssdmonitor logs:
2012-05-26 05:34:53.160: [ USRTHRD][2915036048] clsnproc_needreboot: Impending reboot at 50% of limit 28500; disk timeout 28180, network timeout 28500, last heartbeat from CSSD at epoch seconds 1337981671.618, 14871 milliseconds ago based on invariant clock 57493073; now polling at 100 ms
2012-05-26 05:35:02.748: [ USRTHRD][2915036048] clsnproc_needreboot: Impending reboot at 75% of limit 28500; disk timeout 28180, network timeout 28500, last heartbeat from CSSD at epoch seconds 1337981671.618, 21401 milliseconds ago based on invariant clock 57493073; now polling at 100 ms
2012-05-26 05:35:08.907: [ USRTHRD][2915036048] clsnproc_needreboot: Impending reboot at 90% of limit 28500; disk timeout 28180, network timeout 28500, last heartbeat from CSSD at epoch seconds 1337981671.618, 25681 milliseconds ago based on invariant clock 57493073; now polling at 100 ms

Since CSSD itself was hung, its own log contains no further information:
2012-05-26 05:34:25.052: [ CSSD][2894060432]clssnmSendingThread: sending status msg to all nodes
2012-05-26 05:34:25.052: [ CSSD][2894060432]clssnmSendingThread: sent 4 status msgs to all nodes
2012-05-26 05:34:30.525: [ CSSD][2894060432]clssnmSendingThread: sending status msg to all nodes
2012-05-26 05:34:30.525: [ CSSD][2894060432]clssnmSendingThread: sent 4 status msgs to all nodes
----HERE REBOOTED----

2012-05-27 09:39:24.030: [ CSSD][3046872768]clssscmain: Starting CSS daemon, version 11.2.0.1.0, in (clustered) mode with uniqueness value 1306460363
2012-05-27 09:39:24.032: [ CSSD][3046872768]clssscmain: Environment is production


