March 31, 2012

ORA-1628: max # extents 32765 reached for rollback segment _SYSSMU6_621766836$

One DB suddenly got many blow errors in a period of time:
Wed Mar 28 23:53:58 2012
ORA-1628: max # extents  32765 reached for rollback segment _SYSSMU6_621766836$
Wed Mar 28 23:53:58 2012
ORA-1628: max # extents  32765 reached for rollback segment _SYSSMU7_1070497209$


This error means an undo segment reached its max extents so that it couldn't extend any more:
SQL> select SEGMENT_NAME,MAX_EXTENTS from DBA_ROLLBACK_SEGS;
SEGMENT_NAME                   MAX_EXTENTS
------------------------------ -----------
_SYSSMU59_1810053014$                32765
_SYSSMU58_3572456133$                32765
_SYSSMU57_2783840325$                32765
_SYSSMU56_2339751936$                32765
_SYSSMU55_3550317550$                32765

..........


It is none business with the size of undo tablespace, infact there are plenty of free space:
Tablespace                    Free MB   Total MB        PCT
------------------------- ----------- ---------- ----------
UNDORBS                        21,483      24000    89.5125

The DB use Automatic Undo Management (AUM) :

SQL> show parameter undo
NAME TYPE VALUE
------------------------------------
undo_management string AUTO
undo_retention integer 10800
undo_tablespace string UNDORBS

Since it uses AUM, there is no way for us to turn the max_extents attribute:
SQL> select MAX_EXTENTS from DBA_ROLLBACK_SEGS where SEGMENT_NAME='_SYSSMU10$';
MAX_EXTENTS
-----------
32765

SQL> alter rollback segment "_SYSSMU10$" storage (MAXEXTENTs unlimited);
Rollback segment altered.

Above command was executed successful, but if we checked again, we can find the attribute still not changed:
SQL> select MAX_EXTENTS from DBA_ROLLBACK_SEGS where SEGMENT_NAME='_SYSSMU10$';
MAX_EXTENTS
-----------
32765

Even setting "_smu_debug_mode" is useless:
SQL> alter system set "_smu_debug_mode" = 4;
System altered.

SQL> alter rollback segment "_SYSSMU10$" storage (MAXEXTENTs unlimited);
ERROR at line 1:
ORA-25150: ALTERING of extent parameters not permitted

Pay attention to this, when get ORA-1628 with AUM, in most cases, the root cause is not simply the max extents attribute too small. In most cases it is related to some bug.

Let's check our case:
SQL> select to_char(begin_time,'MM/DD/YYYY HH24:MI') begin_time, UNXPSTEALCNT, EXPSTEALCNT , NOSPACEERRCNT, TUNED_UNDORETENTION from gv$undostat;
BEGIN_TIME       UNXPSTEALCNT EXPSTEALCNT NOSPACEERRCNT TUNED_UNDORETENTION
---------------- ------------ ----------- ------------- -------------------

03/30/2012 03:24            0           0             0            23941735
03/30/2012 03:14            0           0             0            23806293
03/30/2012 03:04            0           0             0            23670851
...........

...........
03/29/2012 00:14            0           0             0            42082827
03/29/2012 00:04            0           0             0            39077268
03/28/2012 23:54         1081        1081             5             9930112
03/28/2012 23:44            0           0             9          2367948800
03/28/2012 22:14            0           0             0                   0
03/28/2012 22:04            0           0             0          2346086400


From red part, we can obviously see the auto turned undo retention is not a normal value.

And pay attention the time point with those unusual retention time, the same time the DB got the ORA-1628 error.

And it make sense now. At that very period time, oracle miss judged a very big value for auto tuned undo retention time.
Hence make many commited undo blocks keeping un-expired, and force the rollback segments growing bigger and bigger, and at last leading to:
ORA-1628: max # extents 32765 reached for rollback segment _SYSSMU6_621766836$

OK, since we find out the root cause. The solution is simple, we can set below parameter to control the undo retention time:
SQL> alter system set "_smu_debug_mode" = 33554432;
System altered.

Let's check the undo retention time again:
SQL> select to_char(begin_time,'MM/DD/YYYY HH24:MI') begin_time, UNXPSTEALCNT, EXPSTEALCNT , NOSPACEERRCNT, TUNED_UNDORETENTION from gv$undostat;
BEGIN_TIME       UNXPSTEALCNT EXPSTEALCNT NOSPACEERRCNT TUNED_UNDORETENTION
---------------- ------------ ----------- ------------- -------------------
03/30/2012 03:44            0           0             0               10800
03/30/2012 03:34            0           0             0               10800
03/30/2012 03:24            0           0             0            23941735
03/30/2012 03:14            0           0             0            23806293
03/30/2012 03:04            0           0             0            23670851
...........
...........

More......

March 29, 2012

oracle and UTC

I was trouble shooting an repeatly appreared ORA-1555 issue. The ORA-1555 was caused by a SQL with a small query duration while the auto turned undo retention time is far larger then that.

Till now we are already close to the root of the issue. But that's not the topic i want to say here, maybe in later's post. Here i want to share a trick i found during my trouble shooting. It proved oracle use UTC as its standard.

When checking the undo segment header dump, i found below:
index state cflags wrap# uel  scn             dba      cmt
------------------------------------------------------------------------------
0x00 9 0x00 0x02de 0x0007 0x0b50.2b608af5 0x02401bb0 1333068666
0x01 9 0x00 0x02de 0x0002 0x0b50.2b66f219 0x02401bb3 1333070396
0x02 9 0x00 0x02de 0x0042 0x0b50.2b66f2a3 0x02401bb3 1333070396
0x03 9 0x00 0x02de 0x000d 0x0b50.2b5ccf81 0x02c05c70 1333067236
0x04 9 0x00 0x02dd 0x0037 0x0b50.2b55621f 0x02c05c4a 1333065389
0x05 9 0x00 0x02de 0x001b 0x0b50.2b5cf63e 0x02c05c71 1333067756
0x06 9 0x00 0x02de 0x0000 0x0b50.2b606d49 0x00000000 1333068599
0x07 9 0x00 0x02de 0x0017 0x0b50.2b61aa83 0x02401bb2 1333068893
0x08 9 0x00 0x02de 0x0027 0x0b50.2b5d14ba 0x02c05cbf 1333067788
0x09 9 0x03 0x02de 0x0015 0x0b50.2b6b3d38 0x02401c81 1333071273
0x0a 9 0x00 0x02de 0x0012 0x0b50.2b6b495d 0x02401c81 1333071316
0x0b 9 0x00 0x02de 0x0001 0x0b50.2b66ef7e 0x02401bb2 1333070395
0x0c 9 0x00 0x02dd 0x0025 0x0b50.2b5ce170 0x02c05c71 1333067657
........
........


9 in state column indicates all transaction here are not activity now, and fclags indiates all transcations are committed normal(no dead transaction).

All seems good till i notice the cmt column. cmt means commit time. And it is not record as an SCN format which we usually see in oracle. It is stored as an unix time format.

Let‘s take red part 1333071317 as example. I tried to conver it into oracle time:
SQL> select sysdate,TO_DATE('19700101000000','YYYYMMDDHH24MISS')+NUMTODSINTERVAL('1333071316', 'SECOND') cmt from dual;
SYSDATE CMT
------------------- -------------------
2012-03-29-22.23.32 2012-03-30-01.35.16



What? The transcation commit time was beyond current time? I saw the future?

It did confuse me for a while till i suddenly figure out that the commit time should use the UAT, while our lcoal machine is using EDT.

Check below:
cihcispdb774.corporate.ge.com[oracle]_gldr12> date
Thu Mar 29 22:24:21 EDT 2012
cihcispdb774.corporate.ge.com[oracle]_gldr12> date -u
Fri Mar 30 02:24:27 UTC 2012



And:
cihcispdb774.corporate.ge.com[oracle]_gldr12> date +%s
1333074371


Now when compared to UTC they are all matched.


What is UTC?
According to the US Navel Observatory, UTC time is "Coordinated Universal Time," a coordinated civil time scale "kept by time laboratories around the world ... determined using highly precise atomic clocks." This is approximately the same as Greenwich Mean Time (GMT), but is more precise. A number of nations, such as Morocco, Iceland, and Mali; use UTC directly as their civil time. In the winter, UTC is the civil time for the UK. US Civil Time is determined by a set number of hours offset from UTC. For instance, Arizona (which doesn't keep Daylight Savings Time) is always UTC+7. Beijing, China's timezone is UTC-8

More......

March 18, 2012

Understand the DBV output

Below is a DBV output of a totally corrupted block:
Page 104503 is marked corrupt
Corrupt block relative dba: 0x04019837 (file 16, block 104503)
Bad header found during dbv:
Data in bad block:
type: 201 format: 0 rdba: 0x6bf21791
last change scn: 0x2cd7.f5a66aae seq: 0x7 flg: 0x0e
spare1: 0x63 spare2: 0x81 spare3: 0x8c0e
consistency value in tail: 0x910ff227
check value in block header: 0x4fb6
computed block checksum: 0xe10f



1.type: 201
No this type of block. Mostly we see 6 here, which means : Table/cluster/index segment data block. 


2.format: 0

No this format of  block, below are some possible format.
8i~9i:  all : 0x02

10g  :  2k : 0x62
            4k : 0x82
            8k : 0xa2
          16k : 0xc2
logfile   all : 0x22

3.rdba: 0x6bf21791
rdba(rfile#+block) infor in block header is wrong, it should be the same with below part:

relative dba: 0x04019837 (file 16, block 104503)

4.last change scn: 0x2cd7.f5a66aae seq: 0x7 flg: 0x0e
consistency value in tail: 0x910ff227

facture tail is wrong, should be 0x6aae0e07 (Red part)

5. flg: 0x0e

Wrong. Blow are some posible block flg:
#define KCBHFNEW 0x01 /* new block - zeroed data area */
#define KCBHFDLC 0x02 /* Delayed Logging Change advance SCN/seq */
#define KCBHFCKV 0x04 /* ChecK Value saved-block xor's to zero */
#define KCBHFTMP 0x08 /* Temporary block */

We often see 0x06 in an formated block which means 02+04.

6. spare1: 0x63 spare2: 0x81 spare3: 0x8c0e

Wrong, wrong, wrong, all spare should be 0x0 in any formated block.

7.check value in block header: 0x4fb6
checksum value recorded on block is wrong, should be the same with below part:
computed block checksum: 0xe10f

Gernerally speaking, this data block is totally corruput. It looks like an un-formated block.
Even though there were some data, now are already gone.

More......

March 14, 2012

a wierd connecting Issue, root reachable but oracle not

A server's netbackup failed with below error:
RMAN-03009: failure of backup command on CH1 channel at 03/12/2012 23:16:59
ORA-19506: failed to create sequential file, name="VIGP_TAPECOPY_03122012_1_11668", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
VxBSAValidateFeatureId: Failed with error:
Server Status: cannot connect on socket
ORA-19600: input file is backup piece (/prod/backup/oracle/vigp/oracle/vigp1/arch/VIGP_ARCH_03022012_1_11668)


After checking the reason is netbackup server not reachable:
[oracle@node1 ~]$ ping netback_b
ping: unknown host netback_b


The settng for the /etc/hosts is correct:
[oracle@node1 ~]$ cat /etc/hosts|grep netback_b
2.1.0.10 netback_b netback_b.corporate.ge.com


And the wierd thing coming:
[oracle@node1 ~]$ su - root
[root@node1 ~]# ping netback_b
PING netback_b (2.1.0.10) 56(84) bytes of data.
64 bytes from netback_b (2.1.0.10): icmp_seq=1 ttl=252 time=1.11 ms
64 bytes from netback_b (2.1.0.10): icmp_seq=2 ttl=252 time=0.401 ms


Oracle it is not ping-able, but via root it is ping-able.
Why?

Let's trace the session:
[oracle@node1 ~]$strace -f -o /tmp/strace.log ping netback_b

Below is in the trace file:
..........
4347 open("/etc/nsswitch.conf", O_RDONLY) = -1 EACCES (Permission denied)

4347 open("/opt/CA/SharedComponents/lib/libnss_dns.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
4347 open("/opt/CA/CAlib/libnss_dns.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)



From above red part we find out why.
[oracle@node1 ~]$ ls -ltr /etc/nsswitch.conf
-rw------- 1 root root 1730 Feb 23 09:07 /etc/nsswitch.conf


Oracle don't have access to above file.
After grant permission to oracle, the ping succeed for oracle, so is the netbackup.
More......

March 12, 2012

DB get an outage caused by INODE

A three nodes Rac, the third node, when user connect, get below error:
ORA-02002: error while writing to audit trail
ORA-09925: Unable to create audit trail file
Linux-x86_64 Error: 28: No space left on device

The error seems the audit mount point is full.

Log on the server and check, the audit directory is:
/prod/product/corp_bac1/audit/trail

Let's check the audit directory mount point:
node3[oracle]_corp_bac3> df -h|grep /prod/product
                       34G   27G  5.1G  84% /prod/product

Wired, the mount point is not full:
node3[oracle]_corp_bac3> df -h|grep /prod/product
                       34G   27G  5.1G  84% /prod/product

Then why set get the error: "Linux-x86_64 Error: 28: No space left on device"?

Switch to audit directory and run ls -ltr, the screen flash with:
06:54 corp_bac3_ora_20624_2e.aud
06:54 corp_bac3_ora_20621_38.aud
06:54 corp_bac3_ora_20618_32.aud
06:54 corp_bac3_ora_20615_38.aud
06:54 corp_bac3_ora_20612_33.aud
06:54 corp_bac3_ora_20688_3c.aud
06:54 corp_bac3_ora_20685_34.aud
06:54 corp_bac3_ora_20682_3e.aud
06:54 corp_bac3_ora_20679_3e.aud
06:54 corp_bac3_ora_20676_3e.aud
06:54 corp_bac3_ora_20673_34.aud
06:54 corp_bac3_ora_20669_37.aud
06:54 corp_bac3_ora_20666_33.aud
06:54 corp_bac3_ora_20663_3c.aud
06:54 corp_bac3_ora_20660_32.aud
.........
.........
It showed hundreds of audit files generated in one minute.

I canceled the ls -ltr, and want to caculated the amount number of audit file under the directory:
[oracle@node3 trail]$ ls -F |grep /|wc

Above command is still running even after ten minutes, so i had to cancel it.

Till now it is already clear to me the issue is very possible caused by Inode resource used up:
$ df -h -i
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/mapper/rootvg-rootvol
                         15M    134K     15M    1% /
/dev/cciss/c0d0p1        65K      46     65K    1% /boot
tmpfs                    16M      74     16M    1% /dev/shm
/dev/mapper/rootvg-lv_data
                        5.6M    4.4K    5.6M    1% /data
/dev/mapper/cihcispdb716_grid_vg-gridvol
                        2.2M    473K    1.7M   22% /prod/grid
/dev/mapper/cihcispdb716_product_vg-productvol
                        4.3M    4.3M      0K  100% /prod/product
3.24.148.15:/vol/database_backup
                         31M     37K     31M    1% /prod/backup
3.24.148.13:/vol/cis_orasoft
                         31M    5.5M     25M   19% /cis_orasoft

From above red part it shows Inodes used up. That's why we can't create any file even there are enough space left:
node3[oracle]_corp_bac3> df -h|grep /prod/product
                       34G   27G  5.1G  84% /prod/product

Cleaned some of the audit file and check:
[oracle@node3 11.2.0]$ df -h -i /prod/product
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/mapper/cihcispdb716_product_vg-productvol
                        4.3M    3.8M    444K   90% /prod/product

Connect the third node again, succeed.


More......