June 19, 2012

server hang, large number of defunct processes

Today, one of our servers hung, and in the end we had to reboot it.

Root cause: swap space was used up, and heavy swap in/out made the server hang.
(We also observed more than 290 defunct processes; these were also victims of the server's swap issue.)
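
For reference, here is a quick sketch of how swap exhaustion can be confirmed interactively; we did not save this output at the time, and the sa19 file name assumes the default sysstat data directory for June 19:

# current memory and swap usage
free -m
# historical paging activity (pswpin/s, pswpout/s) from the sysstat data file
sar -W -f /var/log/sa/sa19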

Below is from the sar log:

00:00:02          CPU     %user     %nice   %system   %iowait    %steal     %idle
16:50:01          all      5.63      0.00      2.20      0.65      0.00     91.53
17:00:01          all      5.62      0.00      2.23      0.62      0.00     91.53
17:10:01          all      6.20      0.00      2.40      0.81      0.00     90.60
17:20:02          all      5.42      0.00      8.18      1.29      0.00     85.10  
17:30:01          all      6.65      0.00     27.56      1.95      0.00     63.84
17:40:04          all      5.80      0.00     40.05      2.02      0.00     52.12
17:50:35          all      5.70      0.00     61.28      1.88      0.00     31.15
18:00:43          all      4.90      0.00     70.22      1.57      0.00     23.32
18:10:03          all      3.09      0.00     94.97      0.89      0.00      1.05
18:20:14          all      7.12      0.00     74.69      4.30      0.00     13.90
18:30:01          all      6.96      0.00     44.49      4.82      0.00     43.73
18:40:28          all      5.34      0.00     78.41      2.24      0.00     14.01
18:50:23          all      3.69      0.00     91.39      1.39      0.00      3.53
19:00:09          all      2.77      0.00     94.81      0.86      0.00      1.56
19:20:31          all      1.54      0.00     98.41      0.02      0.00      0.02

From the above we can see the issue started at 17:20:02, with most of the CPU being consumed in the %system column; in our case this was caused by heavy swap in/out:
server735[oracle]_oemagent> vmstat 2 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r   b     swpd   free  buff    cache si   so  bi    bo    in   cs    us sy id wa st
32  5 33554424 567228 14248 30837612 1464 228 19614 9346  2038 330692 4 66 24  6  0
273 3 33554424 571564 14268 30836660 194  22  2626  2074  1151 189433 1 89  8  2  0
143 2 33554424 775564 14552 30841844 16   2   1378  7442  1428 21556  4 96  0  0  0
88  4 33554424 772480 14796 30853676 190  24  6733  3824  1260 22365  2 98  0  0  0
67  7 33554424 703896 15088 30910156 3176 14  28644 8846  1692 24264  5 95  0  0  0
55  5 33554424 606004 15260 31001636 4398 2   40868 12894 1806 24636  7 93  0  0  0
85  8 33554424 567052 15540 31037660 1824 22  17618 13838 2153 24594  5 94  0  1  0
72  2 33554424 567220 15708 31042560 1152 20  10834 13300 2691 30036  4 87  5  4  0
107 9 33554424 571468 15684 31036588 384  28  1368  4594  1410 23576  3 97  0  0  0
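
In the vmstat output, "si" and "so" are kilobytes swapped in from and out to disk per second, and "swpd" stayed pegged at 33554424 KB (about 32 GB), which matches the swap space being fully used. Sustained non-zero si/so together with "sy" in the 90s is a classic swap-thrashing pattern. A minimal sketch of how to watch it live on the box:

# sample swap-in/out and system CPU every 2 seconds
vmstat 2
# list swap devices and how much of each is in use
swapon -s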

Later, due to the high load on the server, some Oracle processes failed to reap (wait on) their children when the child processes exited, which generated a large number of defunct processes:
[root@server735 ~]# ps -ef | grep defunct
oracle 7949  1115  0 20:04 ? 00:00:00 [oracle] 
oracle 9234  1115  0 20:07 ? 00:00:01 [sh] 
oracle 9959  1666  0 18:01 ? 00:00:00 [sh] 
oracle 9993  15748 0 18:02 ? 00:00:02 [sh] 
oracle 10077 557   0 20:10 ? 00:00:00 [sh] 
oracle 10097 1666  0 18:02 ? 00:00:00 [sh] 
oracle 10180 16385 0 20:10 ? 00:00:00 [sh] 
oracle 10352 5749  0 20:11 ? 00:00:00 [sh] 
oracle 10362 16385 0 20:11 ? 00:00:03 [sh] 
oracle 10378 29201 0 20:11 ? 00:00:00 [sh] 
oracle 11016 12434 0 18:04 ? 00:00:00 [sh] 
oracle 11114 16025 0 20:13 ? 00:00:01 [sh] 
oracle 11115 19455 0 20:13 ? 00:00:00 [sh] 
oracle 11116 19524 0 20:13 ? 00:00:01 [sh] 
oracle 11336 19524 0 20:14 ? 00:00:00 [sh] 
oracle 11601 5030  0 20:14 ? 00:00:01 [sh] 
.....

From the timestamps it is clear that these defunct processes were generated after the server's swap issue started.
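
A defunct (zombie) entry only disappears once its parent calls wait() on it, so the only real fixes are for the parent to recover, for the parent to be restarted, or a reboot. To quantify the problem and see which parents are stuck, something along these lines can be used (a sketch; the bracketed grep pattern just avoids counting the grep itself):

# count defunct processes
ps -ef | grep -c '[d]efunct'
# group zombies by parent PID to see which parents are not reaping their children
ps -eo stat,ppid,comm | awk '$1 ~ /^Z/ {print $2}' | sort | uniq -c | sort -rn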

In the end we had to reboot the server. The sysadmin team could not find any clue as to why the server's swap space was suddenly used up and why all the CPU was being consumed by swap in/out.
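
One check worth doing next time before the reboot is to see which processes actually hold the swap space. On kernels that expose VmSwap in /proc/<pid>/status (an assumption about this box; older kernels do not have it), per-process swap usage can be listed like this:

# list the top swap consumers, largest first (values are in kB)
for pid in /proc/[0-9]*; do
  awk -v p="${pid##*/}" '/^VmSwap/ {print $2, p}' "$pid/status" 2>/dev/null
done | sort -rn | head -20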

As a temporary workaround, I reduced the size of the database's SGA on the server.
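
The SGA change was made along these lines (a sketch; the 8G value is illustrative and not the size actually used, and lowering sga_target dynamically with SCOPE=BOTH assumes the instance runs on an spfile):

# lower the SGA target for the instance and confirm the new setting
sqlplus / as sysdba <<'EOF'
ALTER SYSTEM SET sga_target = 8G SCOPE=BOTH;
SHOW PARAMETER sga_target
EOF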
