May 30, 2012

ctssd can prevent the cluster from starting when a significant time gap exists

ctssd (the Oracle Cluster Time Synchronization Service daemon) is a nice new feature introduced in 11gR2. It is used to close the time gap among cluster nodes, replacing the old NTP-based approach.


By default ctssd runs in observer mode, which means it detects time gaps but takes no corrective action when it finds one. In this mode you will see messages like the following in alert$hostname.log:
[ctssd(13443)]CRS-2409:The clock on host node2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.

To make ctssd active, you need to disable NTP and any other vendor time synchronization service; ctssd will then switch to active mode automatically:
2012-05-23 20:36:55.586
[ctssd(4781)]CRS-2408:The clock on host node2 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
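
You can confirm which mode ctssd is currently in with crsctl (cluvfy comp clocksync -n all gives a more detailed report). The output below is what 11.2 typically prints; the exact wording may vary slightly by version:

$ crsctl check ctss
CRS-4701: The Cluster Time Synchronization Service is in Active mode.
CRS-4702: Offset (in msec): 0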


It looks good, but ctssd still has its limitations.
For example, if the time gap among nodes is very large, ctssd cannot close it in one step.
Instead, it slews the clock by a small amount every second, and that can take days to bring the nodes in sync when the gap is big.

What is worse, ctssd still has quite a few bugs. For example, I found a repeatable one: if ctssd is running in active mode and the time gap is large, then after the first node comes up, the clusterware on the remaining nodes fails to start.

On those nodes ctssd first checks the offset, reports that it cannot correct it, and then terminates.
----alert$hostname.log:
2012-05-25 00:09:40.690
[ctssd(24841)]CRS-2401:The Cluster Time Synchronization Service started on host node2.
2012-05-25 00:09:40.690
[ctssd(24841)]CRS-2402:The Cluster Time Synchronization Service aborted on host node2. Details at  in /prod/grid/app/11.2.0/grid/log/node2/ctssd/octssd.log.

----octssd.log:
2012-05-25 00:09:40.431: [    CTSS][2988911504]ctssslave_msh: Forming connection with CTSS master node [1]
2012-05-25 00:09:40.432: [    CTSS][2988911504]ctssslave_msh: Successfully connected to master [1]
2012-05-25 00:09:40.433: [    CTSS][2988911504]ctssslave_swm: The magnitude [86413354867 usec] of the offset [-86413354867 usec] is larger than [86400000000 usec] sec which is the CTSS limit.
2012-05-25 00:09:40.433: [    CTSS][2988911504]ctsselect_mmg9_3: Failed in clsctsselect_select_mode [12]: Time offset is too much to be corrected
2012-05-25 00:09:40.690: [    CTSS][2978421648]ctss_checkcb: clsdm requested check alive. Returns [40000050]
2012-05-25 00:09:40.690: [    CTSS][3046659776]ctss_init: Spawn completed. Waiting for threads to join
2012-05-25 00:09:40.690: [    CTSS][2988911504]ctsselect_mmg: CTSS daemon exiting [12].
2012-05-25 00:09:40.690: [    CTSS][2988911504]CTSS daemon aborting

CRSD then also fails to start, but the error it reports is quite misleading:
2012-05-25 00:09:44.235
[ohasd(24419)]CRS-2765:Resource 'ora.crsd' has failed on server 'node2'.
2012-05-25 00:09:44.387
[crsd(24930)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /prod/grid/app/11.2.0/grid/log/node2/crsd/crsd.log.



The solution is either to close the time gap manually, or to push ctssd back into observer mode by enabling a vendor time service such as NTP.
After that, the remaining nodes can be brought up successfully.
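
For the first option, a minimal sketch of closing the gap by hand on a failing node and then restarting its stack (run as root; the timestamp below is only a placeholder for the correct cluster time):

# date -s "2012-05-25 00:15:00"
# /prod/grid/app/11.2.0/grid/bin/crsctl start crs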


May 28, 2012

ORA-00904: "TZ_VERSION": invalid identifier during upgrade, since utlu112i.sql didn't run

We all know that the Pre-Upgrade Information Tool script (for example utlu112i.sql when the target is 11.2.0.x) has to be run before an upgrade.
Beyond requirement checks such as tablespace free space, the script also makes some modifications to dictionary tables.
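
For reference, the script ships in the new Oracle home under rdbms/admin; a typical way to run it is to copy it to the source server and execute it as SYSDBA in the source database before starting the upgrade:

SQL> spool /tmp/utlu112i.log
SQL> @/tmp/utlu112i.sql
SQL> spool off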

So if you forget (or fail) to run utlu112i.sql before the upgrade, then when catupgrd.sql runs during the upgrade you will hit errors such as:
TO_NUMBER(value$) != (SELECT tz_version from registry$database))*
ERROR at line 6:
ORA-00904: "TZ_VERSION": invalid identifier

By the time you get this error it is too late to realize that you forgot to run the pre-upgrade script.
And since the database is already in upgrade mode, you cannot simply re-run the pre-check script anymore.
 
So what should we do now?  Do we have to run the pre-check script in the source DB and restart from the beginning?
 
In fact we can fix the problem as follows:
1. If you forgot to run utlu112i.sql for an 11.2.0.2 target, execute the following SQL to fix it:
vecs1>select STATUS from v$instance;
STATUS
------------
OPEN MIGRATE
vecs1>ALTER TABLE registry$database ADD (tz_version NUMBER);
Table altered.
vecs1>ALTER PACKAGE dbms_registry COMPILE BODY;
ERROR at line 1:
ORA-04023: Object SYS.DBMS_STANDARD could not be validated or authorized
vecs1>ALTER VIEW dba_registry_database COMPILE;
View altered.
vecs1>ALTER PUBLIC SYNONYM DBA_REGISTRY_DATABASE COMPILE;
Synonym altered.

2. If you forgot to run utlu112i.sql for an 11.2.0.3 target, one more statement is needed:
ALTER TABLE registry$database ADD (tz_version NUMBER);
UPDATE registry$database set tz_version = 14;    --- one extra SQL for 11.2.0.3
ALTER PACKAGE dbms_registry COMPILE BODY;
ALTER VIEW dba_registry_database COMPILE;
ALTER PUBLIC SYNONYM DBA_REGISTRY_DATABASE COMPILE;
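
The hard-coded 14 corresponds to the database's current timezone file version; in utlu112i.sql the value actually comes from that version (see the excerpt further below), so it is worth looking it up and confirming the fix before continuing, for example:

SQL> -- the value utlu112i.sql would have used for tz_version:
SQL> select version from v$timezone_file;
SQL> -- confirm the fix is in place before re-running catupgrd.sql:
SQL> select tz_version from registry$database;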

Then re-run the catupgrd.sql script; you will not hit the "TZ_VERSION": invalid identifier error anymore.

Where did I get these statements? From the pre-upgrade script itself.
For example, you can find the following in the 11.2.0.3 utlu112i.sql file:
EXECUTE IMMEDIATE
'ALTER TABLE registry$database ADD (tz_version NUMBER)';
EXECUTE IMMEDIATE
'UPDATE registry$database set tz_version = :1'
USING db_tz_version;
EXECUTE IMMEDIATE
'ALTER PACKAGE dbms_registry COMPILE BODY';
EXECUTE IMMEDIATE
'ALTER VIEW dba_registry_database COMPILE';
EXECUTE IMMEDIATE
'ALTER PUBLIC SYNONYM DBA_REGISTRY_DATABASE COMPILE';

For other version upgrades where you forgot to run the pre-check script, you can likewise open the corresponding utluxxxi.sql file, find the relevant part, execute it in the target database while it is in upgrade mode, and then re-run catupgrd.sql successfully.

May 24, 2012

Application Crashed after Increasing SGA

Last Sunday I performed a planned cross-server production database migration, from 10.2.0.4 to 11.2.0.3.
Everything appeared fine at the beginning: restore and recovery succeeded without errors, and so did the upgrade.
After all the tasks on my side were completed, I gave the user the TNS entry for the new 11g DB, and they updated their application configuration file with it.
After that, everything seemed to go out of control, and the whole application service was flooded with similar errors.


Below is one:
[Oracle JDBC Driver][Oracle]
ORA-20001: ZKP_UTILS.my_caller:
ORA-20001: ZKP_UTILS.parse_call_stack:
ORA-06502: PL/SQL: numeric or value error: character to number conversion error
ORA-20001: ZKP_UTILS.my_caller:
ORA-20001: ZKP_UTILS.parse_call_stack:
ORA-06502: PL/SQL: numeric or value error: character to number conversion error
ORA-06512: at "ZKAPADM.ZKP_UTILS", line 146
ORA-06512: at "ZKAPADM.SCHEDULE_JOB", line 1015



The error message does not show the innermost layer. After some hard digging, we finally identified the root cause in the following piece of code:
v_call_stack :=dbms_utility.format_call_stack;
v_line := SUBSTR(v_call_stack, 1, (i-1));
v_line_no := TO_NUMBER(SUBSTR(v_line, 13, 8));



The purpose of the above code is to pick out the line number (the "1" in the last line of the output below) and convert it to a number:
SQL> select  dbms_utility.format_call_stack from  dual;
FORMAT_CALL_STACK
--------------------------------------------------------------------------------
----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
0x1a68e9570         1  anonymous block


Let's reproduce the error:
SQL> select TO_NUMBER(SUBSTR('0x1a68e9570         1  anonymous block',  13, 8)) FROM DUAL;
ERROR at line 1:
ORA-01722: invalid number



The same SQL ran fine in the old DB:
SQL> select TO_NUMBER(SUBSTR('0xfad349f8         1  anonymous block',  13, 8)) FROM DUAL;
TO_NUMBER(SUBSTR('0XFAD349F81ANONYMOUSBLOCK',13,8))
---------------------------------------------------
                                                  1



Notice that in the old DB the handle in the last line of dbms_utility.format_call_stack is "0xfad349f8" (8 hex digits), while in the new DB it is "0x1a68e9570" (9 hex digits).
That one extra character shifts the rest of the line by one position, so SUBSTR no longer picks up the expected "1", and hence the character to number conversion error.
Let's verify again:
---old 10g DB
SQL> select length(dbms_utility.format_call_stack) from dual;
LENGTH(DBMS_UTILITY.FORMAT_CALL_STACK)
--------------------------------------
                                   120

---new 11g DB
SQL> select length(dbms_utility.format_call_stack) from dual;
LENGTH(DBMS_UTILITY.FORMAT_CALL_STACK)
--------------------------------------
                                   121



That single extra byte is the dark horse behind the "character to number conversion error".
And since this piece of code is used all over the application, it brought the whole application down.
Now that we have found the issue, the solution is quite simple. We can update the user's code, changing this line:
v_line_no := TO_NUMBER(SUBSTR(v_line, 13, 8));
to

v_line_no := TO_NUMBER(SUBSTR(v_line, 13, 9));


That way it can cope with the extra byte and still extract the line number.
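
As a side note, a parse that does not rely on fixed column positions would avoid this class of problem altogether. A rough sketch using REGEXP_SUBSTR (just a suggestion, not the fix we applied):

-- take the first whitespace-delimited run of digits after the handle on the stack line
v_line_no := TO_NUMBER(TRIM(REGEXP_SUBSTR(v_line, '[[:space:]]+[0-9]+[[:space:]]', 1, 1)));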
But at that moment we had neither the time nor the appetite for the risk of rolling untested code changes across the whole application.

Is there any other way to solve the issue without touching the application code?


If we do not want to modify the application code, then the simplest idea that comes to mind is:
is there a way to control the length of the DBMS_UTILITY.FORMAT_CALL_STACK output?
Some hint, or some parameter?



I searched both Google and Metalink with the keywords "length, FORMAT_CALL_STACK" and found nothing useful.
So we could only guess, based on what we already knew.
Comparing the old DB with the new DB:
server : RedHat 4 vs RedHat 5
version: 10.2.0.4 vs 11.2.0.3

There were too many differences worth suspecting.
Which one should we chase first?



To make a sound judgement, we first have to figure out what on earth "0x1a68e9570" actually represents.
Let's review again:

SQL> select  dbms_utility.format_call_stack from  dual;
FORMAT_CALL_STACK
--------------------------------------------------------------------------------
----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
0x1a68e9570         1  anonymous block



Pay attention to the column headings: they indicate that "0x1a68e9570" is a handle.
So where does the handle point? Such a large number obviously cannot be an object_id, so another plausible interpretation is that the handle points somewhere into the library cache.
And we know that library cache objects are managed on chains.
So the question in my head was: is the length of "0x1a68e9570" related to the length of the chains in the library cache?

If a chain is short, then 8 hex digits are enough to address all the objects on it; but if the chain is long, 9 digits may be needed.
So now we have a guess. Next, if the guess is correct, how can we shorten the chains?

Maybe some undocumented parameter could do that, but the more direct thought in my mind was: if we reduce the size of the library cache, we reduce the length of its chains, right?
To reduce the size of the library cache, we can reduce the size of the SGA:
SQL> show parameter sga
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
lock_sga                             boolean     FALSE
pre_page_sga                         boolean     FALSE
sga_max_size                         big integer 3000M
sga_target                           big integer 3000M



Currently the size is 3000M, and:
SQL> select length(dbms_utility.format_call_stack) from dual;
LENGTH(DBMS_UTILITY.FORMAT_CALL_STACK)
--------------------------------------
                                   121    -----  the damn 121


Since the migration was from 10g to 11g, many init parameters were tuned. One consideration was that an 11g DB runs more background processes, which eat more memory, so the SGA size was increased.
Oops. Could all the trouble we got here really have been caused by increasing the SGA?
Let's change the SGA back to 2500M and see:

SQL> alter system set sga_max_size='2500m' scope=spfile;
System altered.
SQL> alter system set sga_target='2500m' scope=spfile;
System altered.

SQL> startup force;
ORACLE instance started.
Database mounted.
Database opened.

SQL> select length(dbms_utility.format_call_stack) from dual;
LENGTH(DBMS_UTILITY.FORMAT_CALL_STACK)
--------------------------------------
                                   120



It worked!
Soon the application team confirmed that everything was back to normal, and the whole application was running fine again.
Before that day I could have named a thousand ways to crash an application, but none of them would have been "increasing the SGA".
What can I say? If even increasing the SGA can crash an application, what else can't happen in the Oracle world?




May 19, 2012

Fixing corrupt blocks caused by a server crash

A server hung: swap space was exhausted and 95% of CPU was being consumed by sys.
After the sysadmin team rebooted the server, I brought all the databases back up.


Then things went crazy: alerts flooded into our mailbox reporting corrupt blocks on several different databases.

A check revealed several different types of corruption, including:
1. ordinary data/index block fractured corruption.
2. UNDO block corruption (luckily not the system undo).
3. SYSAUX table corruption: CSC(0x0b57.10b514f6) higher than block SCN(0x0000.00000000).
and.....
4. block corruption in the core bootstrap object con$.

After three hours of work most of them were finally fixed. Below is one example:
/d333/oracle/cdixx/diag/diag/rdbms/cdixx/cdixx/trace/cdi1d_j000_6213.trc (incident=205700):
ORA-01578: ORACLE data block corrupted (file # 51, block # 173106)
ORA-01110: data file 51: '/d333/data01/oracle/cdixx/data/cdi1data04.dbf'

Let's check file 51, block 173106:
SYS @ cdi1d > select OWNER,SEGMENT_NAME,SEGMENT_TYPE,BLOCK_ID from dba_extents where FILE_ID=51 and 173106>=BLOCK_ID and 173106<=(BLOCK_ID+BLOCKS);
OWNER     SEGMENT_NAME    SEGMENT_TYPE   BLOCK_ID
--------- ------------ --------------- ----------
CDI1XXX01     V_WOXXXR           TABLE     172869

When we query the table:
SYS @ cdi1d > select /*+full(t)*/ count(*) from CDI1XXX01.V_WOXXXR t;
ERROR:
ORA-01578: ORACLE data block corrupted (file # 51, block # 173106)
ORA-01110: data file 1: '/d333/data01/oracle/cdixx/data/cdi1data04.dbf'
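
Before going to BBED, it can also be worth cataloguing the full extent of the damage. A quick sketch (not part of the original session, and assuming an 11g database with RMAN access):

RMAN> validate datafile 51;
SQL> select file#, block#, blocks, corruption_type from v$database_block_corruption;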

Now let's correct the block:
BBED> set file 1;
FILE# 1

BBED> dump block 173106;
File: /d333/data01/oracle/cdixx/data/cdi1data04.dbf (1)
Block: 173106 Offsets: 0 to 511 Dba:0x0042a432
------------------------------------------------------------------------
06c20000 32a4c20c 00000000 0000ff04 b50e0000 01000000 bacb0300 72642324
570b0000 03003200 45a3c20c 0f002200 1f590000 a8254106 34167400 00a0570b
0f7f2124 03001500 0e100400 6e0a4209 d7bb8200 0080570b 68332324 17002400
a0040000 c3a44006 86013100 1b001f2a 00000000 00000000 00000000 00014500
00009c00 6e081608 6b320000 45000100 02000300 04000500 06000700 08000900

BBED> modify /x 33 offset 14
File: /d333/data01/oracle/cdixx/data/cdi1data04.dbf (1)
Block: 173106 Offsets: 14 to 525 Dba:0x0042a432
------------------------------------------------------------------------
3304b50e 00000100 0000bacb 03007264 2324570b 00000300 320045a3 c20c0f00
22001f59 0000a825 41063416 740000a0 570b0f7f 21240300 15000e10 04006e0a
4209d7bb 82000080 570b6833 23241700 2400a004 0000c3a4 40068601 31001b00
1f2a0000 00000000 00000000 00000001 45000000 9c006e08 16086b32 00004500
01000200 03000400 05000600 07000800 09000a00 0b000c00 0d000e00 0f001000

BBED> modify /x 33 offset 16380
File: /d333/data01/oracle/cdixx/data/cdi1data04.dbf (1)
Block: 173106 Offsets: 16380 to 16383 Dba:0x0042a432
------------------------------------------------------------------------
33060000

BBED> sum apply
Check value for File 1, Block 173106:
current = 0x0eb5, required = 0x0eb5

BBED> verify
DBVERIFY - Verification starting
FILE = /d333/data01/oracle/cdixx/data/cdi1data04.dbf
BLOCK = 173106

Block Checking: DBA = 214082610, Block Type = KTB-managed data block
data header at 0x2ac48be2c27c
kdbchk: xaction header lock count mismatch
trans=3 ilk=27 nlo=26
Block 173106 failed with check code 6108

DBVERIFY - Verification complete
Total Blocks Examined : 1
Total Blocks Processed (Data) : 1
Total Blocks Failing (Data) : 1
Total Blocks Processed (Index): 0
Total Blocks Failing (Index): 0
Total Blocks Empty : 0
Total Blocks Marked Corrupt : 0
Total Blocks Influx : 0
Message 531 not found; product=RDBMS; facility=BBED


Now let's query the table again:
SYS @ cdi1d > select /*+full(t)*/ count(*) from CDI1XXX01.V_WOXXXR t;
COUNT(*)
----------
472070
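
As an extra sanity check (not part of the original session), the segment structure can also be validated:

SQL> analyze table CDI1XXX01.V_WOXXXR validate structure cascade;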

Fixed.

