One of our databases was crashing frequently.
Here was the entry in alert.log:
Fri Oct 11 04:11:22 2013
Errors in file c:\oracle\product\10.2.0\admin\qprod\bdump\qprod_arc0_2380.trc:
ORA-00202: Message 202 not found; No message file for product=RDBMS, facility=ORA; arguments: [E:\ORACLE\ORADATA\QPROD\CONTROL02.CTL]
ORA-27091: Message 27091 not found; No message file for product=RDBMS, facility=ORA
ORA-27070: Message 27070 not found; No message file for product=RDBMS, facility=ORA
OSD-04006: ReadFile() failure, unable to read from file
O/S-Error: (OS 1453) Insufficient quota to complete the requested service.
The database failed to perform an I/O on its control file because of OS error 1453, so the instance terminated.
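To spot this pattern in alert.log without reading it by eye, the error codes can be pulled out with a regular expression. The following is only a minimal sketch: the function name extract_error_codes is made up for illustration, and the sample text is the entry shown above.

```python
import re

# The alert.log entry from above, one error per line.
ALERT_ENTRY = "\n".join([
    r"ORA-00202: Message 202 not found; No message file for product=RDBMS, "
    r"facility=ORA; arguments: [E:\ORACLE\ORADATA\QPROD\CONTROL02.CTL]",
    "ORA-27091: Message 27091 not found; No message file for product=RDBMS, facility=ORA",
    "ORA-27070: Message 27070 not found; No message file for product=RDBMS, facility=ORA",
    "OSD-04006: ReadFile() failure, unable to read from file",
    "O/S-Error: (OS 1453) Insufficient quota to complete the requested service.",
])


def extract_error_codes(text):
    """Return (list of ORA-/OSD- codes, OS error number or None)."""
    # ORA- and OSD- codes are a prefix followed by five digits.
    oracle_codes = re.findall(r"\b(?:ORA|OSD)-\d{5}\b", text)
    # The Windows error number appears as "O/S-Error: (OS nnnn)".
    match = re.search(r"O/S-Error: \(OS (\d+)\)", text)
    os_error = int(match.group(1)) if match else None
    return oracle_codes, os_error


codes, os_error = extract_error_codes(ALERT_ENTRY)
print(codes, os_error)
```

The OS error number is the useful part here: the ORA- messages only say that a read failed, while the OS error says why.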
Microsoft documents OS error 1453 as ERROR_WORKING_SET_QUOTA.
The working set is a memory structure: the set of a process's pages that are resident in physical memory.
So the above error indicates that Oracle failed to allocate memory from physical memory.
From the trace file we could see that physical memory was nearly exhausted when the DB crashed, with only 79 MB of 8191 MB available:
Dump file c:\oracle\product\10.2.0\admin\qprod\bdump\qprod_arc0_2380.trc
Fri Oct 11 04:11:15 2013
ORACLE V10.2.0.4.0 - 64bit Production vsnsta=0
vsnsql=14 vsnxtr=3
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Windows NT Version V5.2 Service Pack 2
CPU : 4 - type 8664, 4 Physical Cores
Process Affinity : 0x0000000000000000
Memory (Avail/Total): Ph:79M/8191M, Ph+PgF:11824M/20032M
Instance name: qprod
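The "Memory (Avail/Total)" line in the trace header is easy to check by script. Below is a minimal sketch that parses it and computes the free percentages. The function names are made up for illustration, and I am assuming that Ph means physical memory and Ph+PgF means physical memory plus page file.

```python
import re

# The memory line from the trace file header above.
MEMORY_LINE = "Memory (Avail/Total): Ph:79M/8191M, Ph+PgF:11824M/20032M"


def parse_memory_line(line):
    """Return available/total MB for physical memory and physical + page file."""
    m = re.search(r"Ph:(\d+)M/(\d+)M, Ph\+PgF:(\d+)M/(\d+)M", line)
    if m is None:
        raise ValueError("not a Memory (Avail/Total) line")
    ph_avail, ph_total, pgf_avail, pgf_total = map(int, m.groups())
    return {
        "ph_avail_mb": ph_avail,
        "ph_total_mb": ph_total,
        "pgf_avail_mb": pgf_avail,
        "pgf_total_mb": pgf_total,
    }


def free_percent(avail_mb, total_mb):
    """Percentage of memory still available."""
    return 100.0 * avail_mb / total_mb


mem = parse_memory_line(MEMORY_LINE)
print("physical free %%: %.2f" % free_percent(mem["ph_avail_mb"], mem["ph_total_mb"]))
print("phys+pagefile free %%: %.2f" % free_percent(mem["pgf_avail_mb"], mem["pgf_total_mb"]))
```

For this trace, less than 1% of physical memory was available, while more than half of physical plus page file was still free. So the shortage was in physical memory specifically.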
In the Windows event log, we could also see errors indicating a memory issue.
Based on the above, the error the DB got was caused by the server running out of memory during the issue period.
(Note that when memory runs out, different Oracle processes might die with different ORA- errors and symptoms in alert.log, but the root cause is clear: out of memory.)
We configured a tool to monitor the server's memory usage.
From its log we could see that during the issue period, 2.8 GB of free memory was consumed within a few minutes.
We inferred that the 2.8 GB of free memory was used up within two minutes, and then the database died. Later the memory was released (and because the database had died, even more memory was free afterwards).
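A drop like this can be flagged automatically from the monitoring log. Here is a minimal sketch, assuming the log has been reduced to (seconds, free MB) samples. The sample values below are hypothetical and only mimic the shape of what we saw; they are not our real monitoring data.

```python
def find_rapid_drops(samples, min_drop_mb, max_window_s):
    """Find periods where free memory fell by min_drop_mb within max_window_s.

    samples: list of (timestamp_seconds, free_mb), sorted by timestamp.
    Returns a list of (start_ts, end_ts, drop_mb).
    """
    drops = []
    i = 0
    while i < len(samples):
        start_ts, start_free = samples[i]
        hit = None
        for j in range(i + 1, len(samples)):
            end_ts, end_free = samples[j]
            if end_ts - start_ts > max_window_s:
                break  # later samples are outside the window
            if start_free - end_free >= min_drop_mb:
                hit = j
                break
        if hit is None:
            i += 1
        else:
            drops.append((start_ts, samples[hit][0], start_free - samples[hit][1]))
            i = hit  # continue searching after the detected drop
    return drops


# Hypothetical samples: one reading per minute around the issue period.
SAMPLES = [(0, 2900), (60, 2850), (120, 1500), (180, 80), (240, 75), (300, 2600)]

print(find_rapid_drops(SAMPLES, min_drop_mb=2500, max_window_s=180))
```

With these made-up samples the function reports one drop of 2820 MB between second 0 and second 180. The later recovery is ignored because only decreases are counted.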
I checked the activity history inside the database, including PGA and SGA history. There was no activity, and the database's memory usage was stable and unchanged during the issue period.
So from then on we worked on identifying which external process suddenly consumed all the memory during that period.
The weirdest part was this: the system-wide logs showed all available memory being used up within a few minutes, but when we looked at each process's monitoring log, every process's memory usage was stable and unchanged during the issue period.
All the memory really was consumed, yet according to the monitoring logs no process was responsible for it.
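The mismatch can be put into numbers by comparing two snapshots: how much the system-wide used memory grew versus how much the per-process totals grew. Below is a minimal sketch. The snapshot layout and all figures are hypothetical, chosen only to illustrate the arithmetic.

```python
def unattributed_growth_mb(before, after):
    """Growth in used memory that no monitored process accounts for.

    Each snapshot is a dict with:
      "used_mb":      system-wide used physical memory
      "processes_mb": {process name: memory usage in MB}
    """
    used_growth = after["used_mb"] - before["used_mb"]
    names = set(before["processes_mb"]) | set(after["processes_mb"])
    process_growth = sum(
        after["processes_mb"].get(name, 0) - before["processes_mb"].get(name, 0)
        for name in names
    )
    return used_growth - process_growth


# Hypothetical snapshots taken before and during the issue period.
BEFORE = {"used_mb": 5300, "processes_mb": {"oracle.exe": 4200, "other": 600}}
AFTER = {"used_mb": 8100, "processes_mb": {"oracle.exe": 4200, "other": 610}}

print(unattributed_growth_mb(BEFORE, AFTER))
```

A large unattributed value like this suggests the memory was taken outside of user processes, for example by the kernel or a driver, which is where we ended up looking.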
Since the issue always happened at around 4:00 AM local time, I logged into the server at 4:00 AM and monitored the system closely for a few days.
I could see that the issue recurred every day at 4:00 AM. On some days the database died because memory was exhausted; on other days it hung until the memory was released, and survived.
Over the past few days I tried different debugging tools to look into the kernel. Today I finally noticed that during the issue period, one strange thread of the SYSTEM process became heavily active (it ranked top in CPU usage at 13.48%), while it was not there at normal times.
We saw that during the issue period, vmmemctl.sys in the kernel process suddenly became heavily active, and very soon physical memory usage rose from 60% to 100%.
Explanation for vmmemctl.sys:
The memory ballooning function is not related to the shared/reserved settings in the VMware properties.
Memory ballooning is handled through a driver (vmmemctl.sys) that is installed as part of the VMware Tools.
This driver is loaded in the guest OS to interact with the VMkernel and is leveraged to reclaim memory pages when ESX memory resources are in demand and available physical pages cannot meet requirements.
When memory demands rise on the ESX host, the VMkernel will instruct the balloon driver to "inflate" and consume memory in the running guest OS, forcing the guest operating system to leverage its own native memory management techniques to handle changing conditions.
Free pages are typically released first, but the guest OS may decide to page some memory out to its pagefile on the virtual disk.
The reclaimed memory is then used by ESX to satisfy memory demands of other running workloads, but will be relinquished back to the guest OS when memory demands decrease by "deflating" the balloon driver.
Balloon driver activity can be viewed either through VirtualCenter performance monitoring graphs or ESXTOP on the local host.
From this explanation it is now clear that during the issue period, memory ran short on the ESX host, so it grabbed memory from its guest OSes, including our DB server.
Below are the steps to check the ballooning from ESX host:
a. Run esxtop.
b. Type m for memory.
c. Type f for fields.
d. Select the letter J for Memory Ballooning Statistics (MCTL).
e. Look at the MCTLSZ value. MCTLSZ (MB) displays the amount of guest physical memory reclaimed by the balloon driver.
Below are the steps to disable memory ballooning:

Disabling ballooning via the vSphere Client

To set the maximum balloon size to zero:
1. Using the vSphere Client, connect to the vCenter Server or the ESXi/ESX host where the virtual machine resides.
2. Log into the ESXi/ESX host as a user with administrative rights.
3. Shut down the virtual machine.
4. Right-click the virtual machine listed on the Inventory panel and click Edit Settings.
5. Click the Options tab, then under Advanced, click General.
6. Click Configuration Parameters.
7. Click Add row and add the parameter sched.mem.maxmemctl in the text box.
8. Click on the row next to it and add 0 in the text box.
9. Click OK to save changes.

To re-enable the balloon driver in a virtual machine:
1. Using the vSphere Client, connect to the vCenter Server or the ESXi/ESX host where the virtual machine resides.
2. Shut down the virtual machine if it is powered on.
3. SSH to the ESXi/ESX host. For more information, see Connecting to an ESX host using an SSH client (1019852).
4. Change directory to the datastore where the virtual machine's configuration file resides.
5. Back up the virtual machine's configuration file.
6. Edit the virtual machine's configuration file (virtual_machine_name.vmx) and remove this entry: sched.mem.maxmemctl = "0"
7. Save and close the file.
8. Power on the virtual machine.

Note: You cannot remove the entry via the Configuration Parameters UI once it has been added. You must edit the configuration file (.vmx) for the virtual machine to remove the entry.

-------------------------------------------------------

Disabling ballooning via the Windows registry

To disable ballooning on the virtual machine:

Note: This procedure modifies the Windows registry. Before making any registry modifications, ensure that you have a current and valid backup of the registry and the virtual machine. For more information on backing up and restoring the registry, see the Microsoft Knowledge Base article 136393.

1. Log into the guest OS.
2. Click Start > Run, type regedit, and press Enter. The Registry Editor window opens.
3. Navigate to: \HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\VMMEMCTL
4. Change the Start key from 2 to 4.
5. Save the setting and restart the guest OS.

-------------------------------------------------------

Disabling ballooning via VMware Tools uninstallation/reinstallation

1. Uninstall VMware Tools from the guest OS.
2. Reinstall VMware Tools using the Custom Settings option, and deselect the Memory Control Drivers.
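After applying the vSphere Client method, it can be worth confirming that the entry really landed in the virtual machine's configuration file. Below is a minimal sketch that scans the text of a .vmx file for sched.mem.maxmemctl. The function names and the sample file content are hypothetical, and I am assuming the usual key = "value" line layout with # for comment lines.

```python
def balloon_limit_from_vmx(vmx_text):
    """Return the sched.mem.maxmemctl value as an int, or None if it is not set."""
    for raw_line in vmx_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        key, sep, value = line.partition("=")
        if sep and key.strip() == "sched.mem.maxmemctl":
            # Values are written in double quotes, e.g. "0".
            return int(value.strip().strip('"'))
    return None


def ballooning_disabled(vmx_text):
    """True only when the maximum balloon size is explicitly set to zero."""
    return balloon_limit_from_vmx(vmx_text) == 0


# Hypothetical .vmx content, for illustration only.
SAMPLE_VMX = """\
displayName = "qprod-db"
memsize = "8192"
sched.mem.maxmemctl = "0"
"""

print(ballooning_disabled(SAMPLE_VMX))
```

If the entry is missing, the function returns None and ballooning_disabled reports False, which matches the re-enable procedure above where the entry is removed from the file.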