May 24, 2012

Application Crashed after Increasing SGA

Last Sunday, i performed an planned production database crossing server migration, from 10.2.0.4 to 11.2.0.3.
Everything appeared to be fine at beginning. Restore and Recover succeed without errors as well as upgrade.
After all tasks pending at my side completed, i provide user the TNS for the new 11g DB. And user updated their application configure file with that.
After then, everything seemed out of control, the whole application service flooded with similar errors.


Below is one:
[Oracle JDBC Driver][Oracle]
ORA-20001: ZKP_UTILS.my_caller:
ORA-20001: ZKP_UTILS.parse_call_stack:
ORA-06502: PL/SQL: numeric or value error: character to number conversion error
ORA-20001: ZKP_UTILS.my_caller:
ORA-20001: ZKP_UTILS.parse_call_stack:
ORA-06502: PL/SQL: numeric or value error: character to number conversion error
ORA-06512: at "ZKAPADM.ZKP_UTILS", line 146
ORA-06512: at "ZKAPADM.SCHEDULE_JOB", line 1015



The error message not shew the innermost layer. After a hard checking, we finally identify the root cause is below part code:
v_call_stack :=dbms_utility.format_call_stack;
v_line := SUBSTR(v_call_stack, 1, (i-1));
v_line_no := TO_NUMBER(SUBSTR(v_line, 13, 8));



The purpose of above code is to catch below red part string and convert to number format:SQL> select  dbms_utility.format_call_stack from  dual;
FORMAT_CALL_STACK
--------------------------------------------------------------------------------
----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
0x1a68e9570         1  anonymous block


Let's re-produce the error:
SQL> select TO_NUMBER(SUBSTR('0x1a68e9570         1  anonymous block',  13, 8)) FROM DUAL;
ERROR at line 1:
ORA-01722: invalid number



Above SQL was running fine in old DB:
SQL> select TO_NUMBER(SUBSTR('0xfad349f8         1  anonymous block',  13, 8)) FROM DUAL;
TO_NUMBER(SUBSTR('0XFAD349F81ANONYMOUSBLOCK',13,8))
---------------------------------------------------
                                                  1



We notice that the old DB's dbms_utility.format_call_stack last line first part's output is "0xfad349f8"---8 bytes,on the other hand, the new DB is "0x1a68e9570"---9 bytes,
And that one extra byte make the SUBSTR function not get the expected "1", and hence got the number character conversion error.
Let's verify again:
---old 10g DB
SQL> select length(dbms_utility.format_call_stack) from dual;
LENGTH(DBMS_UTILITY.FORMAT_CALL_STACK)
--------------------------------------
                                   120

---new 11g DB
SQL> select length(dbms_utility.format_call_stack) from dual;
LENGTH(DBMS_UTILITY.FORMAT_CALL_STACK)
--------------------------------------
                                   121



The 1 byte is the black horse which leading to "character to number conversion error ".
And the issue codes are everywhere in the application, so it leaded the whole application crashed.
Since we already found the issue, the solution is quite simple. We can update user's code, change below part:
v_line_no := TO_NUMBER(SUBSTR(v_line, 13, 8));
to

v_line_no := TO_NUMBER(SUBSTR(v_line, 13, 9));


Then it can handle the extra 1 byte and get the character string.
But at that very moment, we didn't have time and couldn't afford the risk to update the whole application's with untested code.

Anyway other way for us to solve the issue without update application code?


If we don't want to modify app's code, then the simplest solution in our mind should be:
Is there a way that we can control the length of DBMS_UTILITY.FORMAT_CALL_STACK output?
Some hints or some parameters?



Then i searched on both on google and metalink with key word " length, FORMAT_CALL_STACK", don't find any useful information.
So we can only guess that based on our current knowledge.
Compare the old DB VS new DB:
server : redhat4 VS redhat5
version: 10.2.0.4 VS 11.2.0.3

There are too many places worth us to suspect.
Which way should we choose?



To make a precision judgement, we must first figure out what the hell of the "0x1a68e9570" represent?
Let's review again:

SQL> select  dbms_utility.format_call_stack from  dual;
FORMAT_CALL_STACK
--------------------------------------------------------------------------------
----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
0x1a68e9570         1  anonymous block



Pay attention to the red part, it indicated the "0x1a68e9570" is a handle.
Then where the handle point to? Obviously a so big number can't be an object_id, then one other possible choice is the handle point to somewhere in library cache。
And we know library cache objects are managed by chains.
So, what in my head head was; is length of "0x1a68e9570" related to the length of chains in library cache?

If a chain is short, then 8 bytes are enough to represent all objects on the chain. But if the chain is long, then 9 bytes will be required.
So we have an guess now, next, if our guess is correct, how can we reduce the length of the chain based on our guess?

Maybe some undocumented parameter can do that, but more directly thought in my mind is: if we reduce the size of library cache, then we can reduce the length of chains right?
To reduce the size of library cache, we can reduce the size of SGA:
SQL> show parameter sga
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
lock_sga                             boolean     FALSE
pre_page_sga                         boolean     FALSE
sga_max_size                         big integer 3000M
sga_target                           big integer 3000M



Currently the size is 3000m and :
SQL> select length(dbms_utility.format_call_stack) from dual;
LENGTH(DBMS_UTILITY.FORMAT_CALL_STACK)
--------------------------------------
                                   121    -----  the damn 121


Since the migration is from 10g to 11g, many init parameters were turned. One of them is consider 11g DB have more background processes which will eat more memory, so the SGA size increased.
oops, really all the issues we got here are caused by the increasing of SGA?
Let's try to change the SGA back to 2500M and see:

SQL> alter system set sga_max_size='2500m' scope=spfile;
System altered.
SQL> alter system set sga_target='2500m' scope=spfile;
System altered.

SQL> startup force;
ORACLE instance started.
Database mounted.
Database opened.

SQL> select length(dbms_utility.format_call_stack) from dual;
LENGTH(DBMS_UTILITY.FORMAT_CALL_STACK)
--------------------------------------
                                   120



It worked!
And soon application team confirmed everything back to normal, whole application were running fine then.
Before that day, i can tell one thousand causes of crashing an application, but none of them would be increasing sga.
What can i say now? Even increasing SGA can crash an application, what else can't happen in oracle world? 


0 Comments:

Post a Comment