January 20, 2017

Solving a critical performance crisis in a national project caused by Oracle connection slowness

In a live national healthcare IT system, one of the core gateway systems had become very slow.

The gateway distributes patients' admission and treatment information. Every patient may generate dozens of system messages during a hospital visit: reporting at the front desk, taking X-rays, paying the bill, doctors writing up examination reports, and so on.

The gateway is designed to process 100 messages per minute, but we found it could only handle about 40. As a result, tens of thousands of messages accumulated in the queue, and at peak times patients sometimes had to wait hours before being attended to because their information had not yet shown up in the hospital's IT system.
Many teams and engineers were pulled into this crisis.

After debugging the gateway, I concluded that the slowness was caused by slow connections from the gateway system to the Oracle database.
The gateway establishes a new database connection for every message it processes, in order to perform a few updates, so it connects to the database many times per second.
And establishing those connections was slow.

Note that for an application using long-lived connections, even a connect time of up to 30 seconds is tolerable, because it connects once and then keeps the session for a long period. This gateway, however, connects so frequently that any delay in establishing a connection translates directly into terrible performance.


To quantify the problem, I wrote a script that repeatedly connects from the gateway server to the Oracle database and records the time spent on each connection.
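The idea is simply a loop that opens a fresh sqlplus session and measures the wall-clock time of the connect. On a Unix-like shell, a minimal sketch looks like this (the test account gwtest/xxxx and the TNS alias GWDB are placeholders; the real test used the gateway's own connect string):

#!/bin/sh
# Time a complete connect/disconnect of a brand-new sqlplus session on every iteration.
while :
do
    start=`perl -MTime::HiRes=time -e 'printf "%.2f", time'`
    echo "exit" | sqlplus -S -L gwtest/xxxx@GWDB > /dev/null
    end=`perl -MTime::HiRes=time -e 'printf "%.2f", time'`
    elapsed=`echo "$end $start" | awk '{printf "%.2f", $1 - $2}'`
    echo "time elapsed : $elapsed"
done

Here is the time spent for each connection: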

time elapsed : 0,31
time elapsed : 0,31
time elapsed : 3,94
time elapsed : 0,33
time elapsed : 0,30
time elapsed : 0,33
time elapsed : 0,36
time elapsed : 0,30
time elapsed : 0,34
time elapsed : 0,25
time elapsed : 0,22
time elapsed : 0,34
time elapsed : 0,36
time elapsed : 0,30
time elapsed : 0,31
time elapsed : 0,28
time elapsed : 0,28
time elapsed : 0,27
time elapsed : 0,30
time elapsed : 3,81
time elapsed : 0,28
...........
...........

From the output above we can see that it takes around 0.3 seconds to establish a connection, which is already bad; worse, some connections intermittently take more than 3 seconds.

However, once a connection is established, executing SQL is fast, with no noticeable delay. The slowness occurs only while establishing the connection.

This is where the different teams started arguing.
The DBA experts thought it was a network issue.
The Solaris administrators argued that Oracle connection slowness had to be either a database issue or a network issue.
The network experts insisted there was nothing wrong with the network.

So which part of the stack was responsible: the Windows OS on the client server, the Solaris OS on the database server, the Oracle database, or the network layer?

OK, time to investigate. As a first step, I enabled SQL*Net tracing both for the Oracle client on the gateway server and for the database listener on the database server.
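The tracing was switched on with the usual SQL*Net trace parameters, roughly as follows (trace level 16 is the full support level; the directories and the listener name LISTENER are examples, adjust them to your environment):

# sqlnet.ora on the gateway (client) server
TRACE_LEVEL_CLIENT = 16
TRACE_DIRECTORY_CLIENT = /path/to/client_trace
TRACE_TIMESTAMP_CLIENT = ON

# listener.ora on the database server
TRACE_LEVEL_LISTENER = 16
TRACE_DIRECTORY_LISTENER = /path/to/listener_trace
TRACE_TIMESTAMP_LISTENER = ON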
The client trace captured the delay. The excerpt below shows the client stuck waiting for the listener's response, so the slowness has nothing to do with the client server:

[14-APR-2016 05:58:39:329] nsprecv: entry
[14-APR-2016 05:58:39:329] nsprecv: reading from transport...
[14-APR-2016 05:58:39:329] nttrd: entry
[14-APR-2016 05:58:41:201] nttrd: socket 516 had bytes read=8
[14-APR-2016 05:58:41:201] nttrd: exit

For the same connection, the listener's trace shows it waiting for the Solaris kernel to fork the shadow (server) process and return its PID:
[14-APR-2016 05:58:39:566] nsevwait: 4 registered connection(s)
[14-APR-2016 05:58:39:566] nsevwait: 0 pre-posted event(s)
[14-APR-2016 05:58:39:566] nsevwait: waiting for transport event (1 thru 5)...
[14-APR-2016 05:58:41:358] nsevwait: 1 newly-posted event(s)
[14-APR-2016 05:58:41:358] nsevfnt: cxd: 0x222b28 stage 0: NT events set:
        READ
[14-APR-2016 05:58:41:358] nsevfnt: cxd: 0x222b28 stage 0: NS events set:

With this information we know the delay lies in the Solaris kernel, not in the Oracle client, the listener, or the network layer.
I then repeated the test while tracing the Solaris side. Although dtrace and guds captured no contention, interrupts, or locking during the period, truss did capture some useful information.
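The truss run was along these lines (a sketch; -f follows forked children, -d prefixes each line with the elapsed time since the trace started, and 22626 is the listener's PID):

truss -f -d -o /tmp/listener.truss -p 22626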
Below are the high-level events we got from truss:

22626/1: 8.6926 fork1() = 3728 
3728/1: 8.6926 fork1() (returning as child ...) = 22626 
3728/1: 8.7229 fork1() = 3731 
3731/1: 8.7229 fork1() (returning as child ...) = 3728 
3731/1: 12.0249 write(17, " N T P 0 3 7 3 1\n", 10) = 10 
22626/1: 12.0252 read(16, " N T P 0 3 7 3 1\n", 64) = 10 

From the above we can see that process 22626 (the Oracle listener) forked process 3728, process 3728 then forked process 3731 (the Oracle shadow process), and only at the end did process 3731 write its PID back; the whole sequence took more than 3 seconds.

Meanwhile, the Oracle listener was simply looping, waiting for this PID to come back, for the whole period.

So let's dig deeper and see what process 3731 (the Oracle shadow process) was doing that made it take around 3 seconds to return its PID to the listener:

3731/1: 9.3187 shmat(16777280, 0x380000000, IPC_ALLOC) = 0x380000000 
22626/1: pollsys(0x100333970, 4, 0x00000000, 0x00000000) (sleeping...) 
22626/2: 10.2679 nanosleep(0xFFFFFFFF785FBDF0, 0xFFFFFFFF785FBDE0) = 0 
22626/2: 10.2686 so_socket(PF_INET, SOCK_STREAM, IPPROTO_TCP, "", SOV_DEFAULT) = 14 
22626/2: 10.2689 fcntl(14, F_SETFD, 0x00000001) = 0 
22626/2: 10.2691 setsockopt(14, SOL_SOCKET, SO_LINGER, 0xFFFFFFFF785FBEDC, 8, SOV_DEFAULT) = 0 
22626/2: 10.2773 connect(14, 0x100341D88, 16, SOV_DEFAULT) Err#146 ECONNREFUSED 
22626/2: 10.2777 shutdown(14, SHUT_RDWR, SOV_DEFAULT) Err#134 ENOTCONN 
22626/2: 10.2780 close(14) = 0 
22626/2: 10.2783 so_socket(PF_INET6, SOCK_STREAM, IPPROTO_TCP, "", SOV_DEFAULT) = 14 
22626/2: 10.2785 fcntl(14, F_SETFD, 0x00000001) = 0 
22626/2: 10.2787 setsockopt(14, SOL_SOCKET, SO_LINGER, 0xFFFFFFFF785FBEDC, 8, SOV_DEFAULT) = 0 
22626/2: 10.2790 connect(14, 0x100341ED8, 32, SOV_DEFAULT) Err#128 ENETUNREACH 
22626/2: 10.2791 shutdown(14, SHUT_RDWR, SOV_DEFAULT) Err#134 ENOTCONN 
22626/2: 10.2793 close(14) = 0 
22626/2: nanosleep(0xFFFFFFFF785FBDF0, 0xFFFFFFFF785FBDE0) (sleeping...) 
3731/1: 11.7921 shmat(16777281, 0x3C0000000, IPC_ALLOC) = 0x3C0000000 
3731/1: 11.7931 mprotect(0x380000000, 8192, PROT_READ) = 0 

From the above we can see that most of the time was spent by process 3731 in shmat() calls, attaching to the shared memory segments at 0x380000000 and 0x3C0000000.

And those segments belong to the database's SGA, as the memory map of the SMON background process shows (how to pull such a map is sketched after the listing):

2767: ora_smon_QPROD 
Address               Kbytes      RSS  Anon  Locked  Pgsz  Mode   Mapped File
0000000380000000       45056    45056     -       -    8K  rwxs-  [ dism shmid=0x100001e ]
0000000382C00000       32768        -     -       -     -  rwxs-  [ dism shmid=0x100001e ]
0000000384C00000       53248    53248     -       -    8K  rwxs-  [ dism shmid=0x100001e ]
00000003C0000000     9699328        -     -       -     -  rwxs-  [ dism shmid=0x100001f ]
0000000610000000     9043968  9043968     -       -    8K  rwxs-  [ dism shmid=0x100001f ]
0000000840000000        4096     4096     -       -    8K  rwxs-  [ dism shmid=0x1000020 ]
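A map like this can be obtained with pmap against any of the instance's background processes, for example (the listing above is trimmed to the shared memory segments):

pmap -xs 2767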

This is where I noticed that the database was using DISM (Dynamic Intimate Shared Memory), so very possibly we were hitting a problem in Solaris DISM handling, as there is a series of known defects associated with DISM.

This 11gR2 database was using the memory_target/memory_max_target parameters (Automatic Memory Management). With these two parameters in play, Oracle manages its shared memory on Solaris through the DISM mechanism.
To stop Oracle from using DISM, we simply drop memory_target/memory_max_target and size the instance with sga_target/sga_max_size instead. With sga_target/sga_max_size (and sga_max_size no larger than sga_target), Oracle uses ISM rather than DISM on Solaris, which should let us bypass any DISM defects.
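The change itself is just a few spfile edits and an instance restart, roughly like this (a sketch in SQL*Plus; the 16G/2G figures are placeholders, the real values were sized to bring the instance down to about 18G as mentioned below):

-- switch from AMM (memory_target) to ASMM (sga_target); sizes are placeholders
ALTER SYSTEM RESET memory_target SCOPE=SPFILE;
ALTER SYSTEM RESET memory_max_target SCOPE=SPFILE;
ALTER SYSTEM SET sga_target=16G SCOPE=SPFILE;
ALTER SYSTEM SET sga_max_size=16G SCOPE=SPFILE;
ALTER SYSTEM SET pga_aggregate_target=2G SCOPE=SPFILE;
SHUTDOWN IMMEDIATE
STARTUP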

I then sent out my suggestion: test with sga_target/sga_max_size instead of memory_target/memory_max_target, and reduce the Oracle memory size defined in the spfile from 40G to 18G. That is when the doubts started: nobody seemed to believe that slowness in the gateway system could be fixed by swapping memory parameters and shrinking the database memory.
It sounded irrelevant, even ridiculous.

Well, in my career I have seen plenty of things more bizarre than this issue; you can find many of them on my blog. This one, for example:
http://dbakevin.blogspot.com.au/2012/05/application-crashed-after-increasing.html

Eventually they implemented the change and we re-ran the connection test:

time elapsed : 0,13
time elapsed : 0,16
time elapsed : 0,17
time elapsed : 0,14
time elapsed : 0,13
time elapsed : 0,14
time elapsed : 0,12
time elapsed : 0,13
time elapsed : 0,13
time elapsed : 0,13
time elapsed : 0,13
time elapsed : 0,13
time elapsed : 0,11
time elapsed : 0,15
time elapsed : 0,17
time elapsed : 0,13
time elapsed : 0,12
time elapsed : 0,15
time elapsed : 0,15
time elapsed : 0,15

From the output we can see the typical connection time has dropped from about 0.3 seconds to about 0.15 seconds, and the annoying intermittent 3-second connections are gone. Counting the eliminated spikes, we cut the average connection time by roughly 60%.
On the gateway itself we saw a tangible improvement: from 40 messages per minute to 60+ messages per minute.

This alleviated the performance problem and got us out of the immediate crisis. Still, it did not satisfy me: 0.15 seconds per connection is still a long time. We also knew that more and more regional hospitals would be joining the national project, so by the end of the year the load would be at least 50% higher than it was at that moment. I was therefore asked to further tune the capacity of the gateway system.

The shared server connection model then came to mind: it should help.

Why?

We now know the bottleneck is the Solaris kernel forking a dedicated server process for every new connection.
With shared server connections, Oracle keeps a pool of pre-started server processes behind a set of dispatchers: an incoming call is handled by a dispatcher and serviced by a shared server taken from the pool, which goes back to the pool once the work is done.
So, in theory, we can bypass the fork-a-shadow-process step and its associated slowness entirely.
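A minimal setup along these lines looks roughly like the following (a sketch: the dispatcher and server counts, the host name and the tnsnames alias GW_SHARED are placeholders, and the instance's default service stands in for the dedicated service that was actually created for the gateway):

-- on the database: enable dispatchers and a pool of shared servers
ALTER SYSTEM SET dispatchers='(PROTOCOL=TCP)(DISPATCHERS=3)' SCOPE=BOTH;
ALTER SYSTEM SET shared_servers=20 SCOPE=BOTH;
ALTER SYSTEM SET max_shared_servers=50 SCOPE=BOTH;

# tnsnames.ora on the gateway server: force the shared server path
GW_SHARED =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = dbserver)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = QPROD)
      (SERVER = SHARED)
    )
  )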


I then created a separate shared server service on the database, dedicated to the gateway, updated the gateway's tnsnames entry accordingly, and measured the connection time again:

time elapsed : 0,03
time elapsed : 0,05
time elapsed : 0,05
time elapsed : 0,03
time elapsed : 0,06
time elapsed : 0,04
time elapsed : 0,03
time elapsed : 0,03
time elapsed : 0,05
time elapsed : 0,05
time elapsed : 0,05
time elapsed : 0,05
time elapsed : 0,03
time elapsed : 0,06
time elapsed : 0,05
time elapsed : 0,05
time elapsed : 0,06
time elapsed : 0,05
time elapsed : 0,05
time elapsed : 0,05
time elapsed : 0,05
time elapsed : 0,03
time elapsed : 0,03
time elapsed : 0,03
time elapsed : 0,03
time elapsed : 0,06
time elapsed : 0,05

With that, the gateway performance issue was well and truly solved: connections now take only a few hundredths of a second, the gateway's capacity is close to 100 messages per minute, and that meets the demand.
