Friday, December 30, 2022

RAC DB Crash by ORA-00600: internal error code, arguments: [ksm_mga_pseg_cbk_attach:map_null]

Problem:

While implementing a change that required restarting the RAC nodes one by one on a 2-node 19.5 RAC database, and while one node in the cluster (node2) was being restarted, the other node (node1) started to report the error below and went into a hung state. We were not able to abort the DB instance, nor to restart the clusterware or the Linux OS:

ORA-00600: internal error code, arguments: [ksm_mga_pseg_cbk_attach:map_null]

Analysis:

The above-mentioned error didn't come alone; it was accompanied by a bunch of other ORA-00600 errors:

ORA 600 [ipc_recreate_que_2]
ORA 600 [ORA-00600: internal error code, arguments: [ipc_re
ORA 600 [IPC Assert]
ORA 603
ORA 600 [17090]


Along with the only clear ORA error that explained what was actually going on:

ORA-27300: OS system dependent operation:open failed with status: 23
ORA-27301: OS failure message: Too many open files in system
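To confirm that it is really the system-wide open-files limit being exhausted (rather than a per-process ulimit), the kernel counters can be compared directly; a minimal check, assuming a Linux system and root access:

# Currently allocated file handles (first field) vs. the system-wide ceiling:
cat /proc/sys/fs/file-nr
cat /proc/sys/fs/file-max

# How many files currently exist under /dev/shm:
ls /dev/shm | wc -l

When the allocated figure in file-nr gets close to file-max, every process on the box starts failing with "Too many open files in system", which matches the ORA-27301 message above.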

We decided to forcefully reboot the hung node (node1) from the hardware console, since the OS itself had hung. To be honest, this is one of the toughest decisions a DBA can take on a cluster node, as it is known to corrupt the clusterware files, especially when fsck runs at startup.

What made the issue worse is that the clusterware took too long to come up on node2 (the node which was restarted gracefully), as it was having trouble reading the votedisks. Remember the other node (node1) which went hung and was forcefully rebooted? That explains it!
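For reference, the voting disk status can be checked from the Grid Infrastructure home once the stack is at least partially up; a minimal sketch, assuming it is run as the grid owner (or root) with the GI environment set:

# List the voting disks and their state:
crsctl query css votedisk

# Overall status of the lower-stack clusterware resources on the local node:
crsctl stat res -t -init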

You may think that everything went fine after the startup of node2, the hero managed to kill the beast, and the movie ended? NO.
Ten minutes later, the node1 DB instance joined the cluster back, and node2 crashed with the same above-mentioned bunch of ORA-600 errors.
This scenario kept repeating (one node joins the cluster, the other node crashes), until I started up both DB instances with a pfile.
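For completeness, creating a pfile from the spfile and starting the instance with it looks roughly like the sketch below; the file name and the /tmp path are placeholders, not the actual names from this incident:

# On each node, as the oracle software owner:
sqlplus -S / as sysdba <<'EOF'
CREATE PFILE='/tmp/init_node1.ora' FROM SPFILE;
SHUTDOWN ABORT
STARTUP PFILE='/tmp/init_node1.ora'
EOF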


Solution:

Oracle Support referred us to unpublished Bug 30851951, which is well known to hit 19c RAC databases.

In short, there is something called MGA (Managed Global Area) memory, a region shared between processes alongside the PGA, which Oracle exposes through files under /dev/shm (if I'm not mistaken!). The bug causes the MGA to keep opening an endless number of files under /dev/shm, until the system reaches its maximum open-files limit and goes hung.
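If you suspect you are hitting this bug, one rough way to see the leak in action is to watch the number of entries under /dev/shm climb over time on the affected node; a simple sketch (the one-minute interval is arbitrary):

# Print a timestamped count of /dev/shm entries every minute;
# on an affected system the count keeps growing and never drops back.
while true; do
  echo "$(date '+%F %T')  $(ls /dev/shm | wc -l)"
  sleep 60
done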

This means you have to pick one of the following solutions:

1- Apply the bug fix patch 30851951, which is available for 19.5.
2- Apply the 19.8 RU patch or higher.
3- Workaround: Set a value for the pga_aggregate_limit parameter (a quick sketch of this follows the list).
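For the workaround in option 3, the parameter can be changed dynamically across all instances; a minimal sketch, where the 2G value is only an illustrative figure and should be sized to your own workload:

sqlplus -S / as sysdba <<'EOF'
-- Value below is an example only, not a recommendation:
ALTER SYSTEM SET pga_aggregate_limit = 2G SCOPE=BOTH SID='*';
SHOW PARAMETER pga_aggregate_limit
EOF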

The best solution I can see here is to apply the bug fix 30851951, whereas applying the 19.8 RU patch is a major change to the system, and setting a value for the pga_aggregate_limit parameter is well known to make the system prone to more PGA-related bugs on 19c RAC!
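If you go with the one-off fix, it is applied like any other interim patch with OPatch; a rough outline, assuming the patch has already been downloaded and unzipped under /u01/stage (a placeholder path), and that the patch README is followed for the exact rolling and conflict-check steps:

# As the oracle software owner, with the local DB instance stopped:
cd /u01/stage/30851951
$ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -ph ./
$ORACLE_HOME/OPatch/opatch apply

# Confirm the patch is registered in the inventory:
$ORACLE_HOME/OPatch/opatch lsinventory | grep 30851951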

Honestly speaking, throughout the years and after witnessing similar RAC-related bug incidents in the past (since RAC 11g), I have developed a strong feeling that RAC technology itself, because of its bugs, contributes more to SLA breaches than to the stability of the system!
