2017-04-19

Unexpected node reboot in RAC

Issue #1: The node rebooted, but the log files do not show any error or cause.

Cause:
If the node was rebooted by one of the Oracle processes but the log files do not show any error, then the culprit is one of the oprocd, cssdmonitor, or cssdagent processes. This happens when the cluster node hangs for a while, or when one or more critical CRS processes cannot get scheduled on a CPU. Because those processes run at real-time priority, the underlying issue is more likely memory starvation (low free memory) than literal CPU starvation: the kernel was swapping pages heavily or was busy scanning memory to identify pages to free. An OS scheduling bug could also be at play.
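
To look for signs of the memory pressure described above on a Linux node, one option is to read /proc/meminfo and see how little memory is free and how much swap is in use. The following is only a minimal sketch: the function names parse_meminfo and memory_summary are made up for this note, and the figures it prints are a rough hint, not a diagnosis. MemFree does not count reclaimable page cache, so a low value on its own proves nothing.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: integer_value}.

    Most values are in kB; counters such as HugePages_Total have no unit.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def memory_summary(info):
    """Return the percentage of free memory and the swap in use (kB)."""
    total = info.get("MemTotal", 0)
    free = info.get("MemFree", 0)
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    free_pct = 100.0 * free / total if total else 0.0
    return {"free_pct": free_pct, "swap_used_kb": swap_used}


if __name__ == "__main__":
    path = "/proc/meminfo"  # Linux only
    if os.path.exists(path):
        with open(path) as f:
            print(memory_summary(parse_meminfo(f.read())))
    else:
        print("no /proc/meminfo on this host")
```

Running it periodically (for example from cron) and keeping the output gives a history to compare against the time of the reboot, since the Oracle logs themselves show nothing in this scenario.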

Solution:
1) Set diagwait to 13 if the CRS version is 11.1 or lower.
2) If the platform is AIX, tune the AIX VM parameters as suggested in Document 811293.1 (RAC and Oracle Clusterware Best Practices and Starter Kit (AIX)).
3) If the platform is Linux, set up hugepages and set kernel parameter vm.min_free_kbytes to reserve 512MB. Setting up hugepages is probably the single most important thing to do on Linux. Note that memory_target cannot be set when using hugepages.
4) If the platform is Linux and the kernel is 2.6.18 (e.g. OEL5, Red Hat 5, SLES 10) or lower, set kernel parameter swappiness to 100.
Note that there is no need to set swappiness to 100 on Linux kernel 2.6.32 (e.g. OEL6, Red Hat 6, SLES 11) or higher.
5) Disable Transparent HugePages on SLES11, RHEL6, OEL6 and UEK2 kernels.
6) Check whether a large amount of memory is allocated to the IO buffer cache. Ask the OS vendor for ways to reduce the size of the IO buffer cache or to increase the rate at which memory is reclaimed from it.
7) Increase the amount of memory.
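
The Linux items above (3 and 5) can be checked with a short script. What follows is a sketch under assumptions, not an Oracle-supplied tool: the function names are hypothetical, the 512MB threshold is simply the value from item 3 expressed in kB, and the two Transparent HugePages paths are the usual mainline and Red Hat 6 locations, which may differ on other kernels.

```python
import re

MIN_FREE_KB_TARGET = 512 * 1024  # 512MB from item 3, expressed in kB

# Assumed locations of the Transparent HugePages switch.
THP_PATHS = (
    "/sys/kernel/mm/transparent_hugepage/enabled",
    "/sys/kernel/mm/redhat_transparent_hugepage/enabled",
)


def thp_mode(text):
    """Return the active mode from text like '[always] madvise never'."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def hugepages_total(meminfo_text):
    """Return HugePages_Total from /proc/meminfo text, or 0 if absent."""
    match = re.search(r"^HugePages_Total:\s+(\d+)", meminfo_text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def review(min_free_kb, hugepages, thp):
    """Return a list of findings; an empty list means nothing to flag."""
    findings = []
    if min_free_kb < MIN_FREE_KB_TARGET:
        findings.append("vm.min_free_kbytes is %d, below 512MB" % min_free_kb)
    if hugepages == 0:
        findings.append("no hugepages are configured")
    if thp is not None and thp != "never":
        findings.append("Transparent HugePages mode is '%s'" % thp)
    return findings


def read_text(path):
    """Return the file contents, or None if the file cannot be read."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return None


if __name__ == "__main__":
    min_free = read_text("/proc/sys/vm/min_free_kbytes")
    meminfo = read_text("/proc/meminfo")
    if min_free is None or meminfo is None:
        print("required /proc files not found (not a Linux host?)")
    else:
        thp_text = next((t for t in map(read_text, THP_PATHS) if t), None)
        thp = thp_mode(thp_text) if thp_text else None
        for finding in review(int(min_free), hugepages_total(meminfo), thp):
            print(finding)
```

Swappiness is deliberately left out of the check because the recommended value depends on the kernel version, as item 4 explains.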