2012-08-21

Top 5 Oracle RAC Instance Crash Issues

Top 5 RAC Instance Crash Issues [ID 1375405.1]


Applies to:

Oracle Server - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.

Purpose

This note summarizes the top 5 issues that may cause a RAC instance crash, together with hot issues reported for earlier releases, for example 10.2.0.5.

Scope

Issues #1 to #5 apply to 11gR2 Real Application Clusters only. An "Issue for <release>" entry applies to the mentioned release only.

Details

Issue #1: ORA-29770 LMHB Terminate Instance

Symptoms:
LMON (ospid: 31216) waits for event 'control file sequential read' for 88 secs.
Errors in file /oracle/base/diag/rdbms/prod/prod3/trace/prod3_lmhb_31304.trc (incident=2329):
ORA-29770: global enqueue process LMON (OSID 31216) is hung for more than 70 seconds
LMHB (ospid: 31304) is terminating the instance.

or
LMON (ospid: 8594) waits for event 'control file sequential read' for 118 secs.
ERROR: LMON is not healthy and has no heartbeat.
ERROR: LMHB (ospid: 8614) is terminating the instance.
Possible Causes:
Bug 8888434  LMHB crashes the instance with LMON waiting on controlfile read
Bug 11890804 LMHB crashes instance with ORA-29770 after long "control file sequential read" waits
Solutions:
Bug 8888434 has been fixed in 11.2.0.2+
Bug 11890804 has been fixed in 11.2.0.3+
Please refer to Document 1197674.1, Document 8888434.8 and Document 11890804.8 for more details.
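
The wait messages quoted in the symptoms follow a regular pattern, so an alert log can be scanned for them before LMHB acts. The following is a minimal Python sketch, not an Oracle tool: the function name find_long_waits and the default threshold are illustrative assumptions, with the 70-second default taken from the ORA-29770 text above.

```python
import re

# Matches the alert-log wait message quoted in the symptoms, e.g.
# "LMON (ospid: 31216) waits for event 'control file sequential read' for 88 secs."
WAIT_RE = re.compile(
    r"^(?P<proc>\w+) \(ospid: (?P<ospid>\d+)\) waits for event "
    r"'(?P<event>[^']+)' for (?P<secs>\d+) secs\."
)


def find_long_waits(lines, threshold_secs=70):
    """Return (process, ospid, event, seconds) for waits above threshold_secs.

    threshold_secs defaults to 70 only because the ORA-29770 message in this
    note mentions 70 seconds; treat it as an assumption, not an Oracle setting.
    """
    hits = []
    for line in lines:
        m = WAIT_RE.match(line.strip())
        if m and int(m.group("secs")) > threshold_secs:
            hits.append(
                (m.group("proc"), int(m.group("ospid")),
                 m.group("event"), int(m.group("secs")))
            )
    return hits


# Sample lines copied from the symptoms above.
sample = [
    "LMON (ospid: 31216) waits for event 'control file sequential read' for 88 secs.",
    "LMHB (ospid: 31304) is terminating the instance.",
    "LMON (ospid: 8594) waits for event 'control file sequential read' for 118 secs.",
]

for hit in find_long_waits(sample):
    print(hit)
```

Lines that do not match the wait pattern, such as the LMHB termination message, are ignored.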

Issue #2: Instance crash with ORA-481

Symptoms:
1. PMON (ospid: 12585): terminating the instance due to error 481
LMON trace shows:
Begin DRM(107) (swin 0)
* drm quiesce <kjxgmrcfg: Reconfiguration started, type 6

LMSx trace shows:
2011-07-05 10:53:44.218905 : Start affinity expansion for pkey 81885.0
2011-07-05 10:53:44.498923 : Expand failed: pkey 81885.0, 229 shadows traversed, 153 replayed 1 retries

2. PMON (ospid: 4915562): terminating the instance due to error 481
Sat Oct 01 19:21:37 2011
System state dump requested by (instance=2, osid=4915562 (PMON)), summary=[abnormal instance termination].
Possible Causes:
1. Bug 11875294 LMS gets stuck during DRM, Instance crashed with ORA-481
2. HAIP is not online on some of the cluster nodes, or HAIP is online on all cluster nodes but is not pingable
Solutions:
1. Bug 11875294 has been fixed in 11.2.0.3. The workaround is:
Disable read-mostly locking by setting:
_gc_read_mostly_locking=FALSE
Please refer to Document 11875294.8 for more information.

2. Fix the HAIP issue per Document 1383737.1
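
To find which process terminated the instance and with which error, the termination line can be parsed directly. This is a hedged Python sketch: the name parse_termination is hypothetical, and the pattern covers only the two message forms quoted in this note (the "PMON (ospid: ...)" form above and the shorter "LMD0: ..." form quoted later for 10.2.0.5).

```python
import re

# Covers both forms quoted in this note:
#   "PMON (ospid: 12585): terminating the instance due to error 481"
#   "LMD0: terminating instance due to error 481"
TERM_RE = re.compile(
    r"^(?P<proc>\w+)(?: \(ospid: (?P<ospid>\d+)\))?: "
    r"terminating (?:the )?instance due to error (?P<err>\d+)"
)


def parse_termination(line):
    """Return (process, ospid or None, error number), or None if no match."""
    m = TERM_RE.match(line.strip())
    if m is None:
        return None
    ospid = int(m.group("ospid")) if m.group("ospid") else None
    return (m.group("proc"), ospid, int(m.group("err")))


print(parse_termination("PMON (ospid: 12585): terminating the instance due to error 481"))
print(parse_termination("LMD0: terminating instance due to error 481"))
```

A result with error 481 only points to this group of symptoms. The LMON and LMS traces still have to be checked to tell the DRM bug from the HAIP problem.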

Issue #3: ORA-600 [kjbmprlst:shadow], ORA-600 [kjbrref:pkey], ORA-600 [kjbmocvt:rid], ORA-600 [kjbclose_remaster:!drm], ORA-600 [kjbrasr:pkey], instance crash

Symptoms:
RAC instance crashes with ORA-600 [kjbmprlst:shadow], ORA-600 [kjbrref:pkey], ORA-600 [kjbmocvt:rid], ORA-600 [kjbclose_remaster:!drm] or ORA-600 [kjbrasr:pkey]
Possible Causes:
This group of ORA-600 errors is related to DRM (dynamic resource remastering) messaging or read-mostly locking. Quite a few bugs are involved:
Document 9458781.8 Missing close message to master leaves closed lock dangling crashing the instance with assorted Internal error
Document 9835264.8 ORA-600 [kjbrasr:pkey] / ORA-600 [kjbmocvt:rid] in RAC with dynamic remastering
Document 10200390.8 ORA-600[kjbclose_remaster:!drm] in RAC with fix for 9979039
Document 10121589.8 ORA-600 [kjbmprlst:shadow] can occur in RAC
Document 11785390.8 Stack corruption / incorrect behaviour possible in RAC
Document 12408350.8 ORA-600 [kjbrasr:pkey] in RAC with read mostly locking
Document 12834027.8 ORA-600 [kjbmprlst:shadow] / ORA-600 [kjbrasr:pkey] with RAC read mostly locking
Solutions:
Most of the above bugs are fixed in 11.2.0.3, so applying the 11.2.0.3 patchset should avoid them. The exception is Bug 12834027, which will be fixed in 12.1. The workaround for that bug is:

Disable DRM
or
Disable read-mostly object locking
e.g. run with "_gc_read_mostly_locking"=FALSE

Please refer to the Document numbers listed above for the explanation and solution of each bug.
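
Because this issue is identified by the first argument of the ORA-600 error, a small check can separate these DRM-related failures from other internal errors. The Python sketch below is illustrative only. The helper name drm_related_ora600 is hypothetical, the argument set is copied from the symptoms above, and the pattern accepts both the short form used in this note and the usual "ORA-00600: internal error code, arguments: [...]" alert-log form.

```python
import re

# First arguments listed in the symptoms of this issue.
DRM_RELATED_ARGS = {
    "kjbmprlst:shadow",
    "kjbrref:pkey",
    "kjbmocvt:rid",
    "kjbclose_remaster:!drm",
    "kjbrasr:pkey",
}

# Captures the first bracketed argument after ORA-600 / ORA-00600.
ORA600_RE = re.compile(r"ORA-0*600[^\[\]]*\[([^\]]+)\]")


def drm_related_ora600(line):
    """Return the first ORA-600 argument if it belongs to this issue, else None."""
    m = ORA600_RE.search(line)
    if m is None:
        return None
    arg = m.group(1).strip()
    return arg if arg in DRM_RELATED_ARGS else None


print(drm_related_ora600("ORA-600[kjbrref:pkey]"))
print(drm_related_ora600(
    "ORA-00600: internal error code, arguments: [kjbmprlst:shadow], [], []"
))
print(drm_related_ora600("ORA-600 [kclpdc_21]"))
```

A match only narrows the search to the documents listed above. It does not say which of those bugs was hit.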

Issue #4: Dumps on kcldle / kclfplz / kcbbxsv_l2 / kclfprm using flash

Symptoms:
ORA-7445[kcldle]
ORA-7445[kclfplz]
ORA-7445[kcbbxsv_l2]
ORA-7445[kclfprm] reported in alert log
Possible Causes:
These dumps are caused by various bugs that were closed against base Bug 12337941: Dumps on kcldle / kclfplz / kcbbxsv_l2 / kclfprm using flash
Solutions:
The bug has been fixed in 11.2.0.3. Either apply the patchset or use the workaround: disable the flash cache.
Refer to Document 12337941.8 for more details.

Issue #5: LMS gets ORA-600 [kclpdc_21] and instance crashes

Symptoms:
ORA-600[kclpdc_21] reported in alert log
Possible Causes:
Document 10040035.8  LMS gets ORA-600 [kclpdc_21] and instance crashes 
Solutions:
The bug has been fixed in 11.2.0.3
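
The fix versions stated for Issues #1 to #5 can be collected into a small lookup to see which bugs a given 11gR2 release is still exposed to. This Python sketch rests on several assumptions: it compares base and patchset versions only and ignores PSUs and one-off patches, it includes only bugs for which this note states a fix version (the Issue #3 bugs described just as "most fixed in 11.2.0.3" are omitted, apart from Bug 12834027), and the names FIXED_IN and unfixed_bugs are hypothetical.

```python
# Bug number -> first release containing the fix, as stated in this note.
FIXED_IN = {
    "8888434": (11, 2, 0, 2),   # Issue #1
    "11890804": (11, 2, 0, 3),  # Issue #1
    "11875294": (11, 2, 0, 3),  # Issue #2
    "12834027": (12, 1),        # Issue #3 (stated as "will be fixed in 12.1")
    "12337941": (11, 2, 0, 3),  # Issue #4
    "10040035": (11, 2, 0, 3),  # Issue #5
}


def parse_version(text):
    """Turn a dotted version such as '11.2.0.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def unfixed_bugs(version_text):
    """Return the bug numbers whose fix release is later than version_text."""
    version = parse_version(version_text)
    return sorted(bug for bug, fixed in FIXED_IN.items() if version < fixed)


for release in ("11.2.0.1", "11.2.0.2", "11.2.0.3"):
    print(release, unfixed_bugs(release))
```

Bug numbers are sorted as strings. For 11.2.0.3 only Bug 12834027 remains, which matches the workaround advice under Issue #3.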

Issue for 10.2.0.5

Symptoms:
1. LMS reports ORA-600 [kjccgmb:1] and the instance crashes with "LMS<n>: terminating instance due to error 484"
2. Instance crash with:
Received an instance abort message from instance 2 (reason 0x0)
Please check instance 2 alert and LMON trace files for detail.
LMD0: terminating instance due to error 481
Possible Causes:
1. Bug 11893577 - LMD CRASHED WITH ORA-00600 [KJCCGMB:1]
2. Bug 9577274 - 1OFF:UNABLE TO VIEW REQUEST OUTPUT AND LOG AFTER APPLYING FIX TO ISSUE IN BUG 9400041
Solutions:
1. For 10.2.0.5.0, please apply merge patch 12616787 only
2. For 10.2.0.5.5, please apply merge patch 13470618 only
At the time of writing, the patches are available only for certain platforms. It is not necessary to apply both of the above patches on any 10.2.0.5.x release.
