Tuesday, October 12, 2021

Cluster Log Records Showing Different Time Than OS Time on 19c RAC

 Problem:

On the OS the time shows 16:00, while the timestamps in the Clusterware and DB logs show 20:00.

OS Time:

Cluster's Log Time:

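The mismatch can also be seen from the shell (a sketch; the timestamps are illustrative, and the alert log path assumes the default 19c ADR location under the Grid user's $ORACLE_BASE with the node name nodex1):

# date
Tue Oct 12 16:00:05 UTC 2021

# tail -1 $ORACLE_BASE/diag/crs/nodex1/crs/trace/alert.log
2021-10-12 20:00:05.123 [ ... ]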
Analysis:

The configured timezone in the cluster environment config file was different from the OS timezone:

OS Timezone is UTC:

# cat /etc/sysconfig/clock
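The relevant line looks like this (content shown is illustrative of a UTC setting):

ZONE="UTC"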


Cluster Timezone is Dubai:

# cat $GRID_HOME/crs/install/s_crsconfig_nodex1_env.txt
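The relevant entry in this file is the TZ line (the value shown is illustrative):

TZ=Asia/Dubai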

 

Solution:

Change the cluster timezone to match the OS timezone:

# vi $GRID_HOME/crs/install/s_crsconfig_nodex1_env.txt
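Inside the file, change the TZ entry to match the OS timezone (a sketch; substitute your own OS timezone value):

TZ=UTC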

Note: The clusterware and DB logs will pick up the new timezone once you restart the clusterware.
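For reference, the clusterware can be restarted node by node as the root user with the standard crsctl commands:

# crsctl stop crs
# crsctl start crs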


Wednesday, October 6, 2021

ora_lms Processes Leaking Memory on 19c RAC and Crashing the Server Due To Lack of Memory (Yet Another Memory Leak Bug!)

Problem:

A few weeks after creating a RAC DB on a 19c cluster, I started to notice that the ora_lms processes on each RAC node were consuming a lot of memory, and they kept consuming more without releasing it, until the server ran out of memory and started swapping, leading the DB and the clusterware to crash.
By the way, the ora_lms processes don't consume memory at a constant pace (e.g. 100 MB a day); the growth depends on the load on the server: the higher the load, the more memory they consume. Such a bug can bring down a busy DB system in a few days.

Analysis:

The snapshot below is from the top command after pressing M to list the top memory consumers. I have multiple CPUs on each RAC node, so Oracle automatically created two ora_lms processes, which ironically doubles the impact of this memory leak bug, considering that you have no control over the number of ora_lms processes to run ( Doc ID 1392248.1 ).
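The same check can also be done from the command line (a sketch; the pattern assumes the default ora_lms* background process naming), for example to track the resident set size (RSS, in KB) over time:

# ps -eo pid,rss,etime,args | grep '[o]ra_lms'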


I opened an SR with Oracle Support; it took me 8 months of pulling and pushing, along with dozens of escalation calls, before their development team recognized it as a bug:

Bug 32961288 - LMS CONSUMING MORE MEMORY AFTER SERVER UP FOR SOMETIME

Workaround:

Initially, I worked around this bug by restarting the RAC DB instances node by node, which released the memory consumed by ora_lms, but within a few weeks the issue would happen again!
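For reference, the rolling restart is the usual srvctl sequence run against one instance at a time (the database and instance names below are placeholders):

$ srvctl stop instance -db <db_unique_name> -instance <instance_name>
$ srvctl start instance -db <db_unique_name> -instance <instance_name>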

The confirmed workaround was to implement HugePages. Implementing HugePages did the trick and made this bug disappear.

While implementing HugePages, make sure you do the following (an OS-side configuration sketch follows after the SQL command below):
    - Set the HugePage size to its default value, which is 2 MB per page in Linux.
    - Force the DB instance to use HugePages at startup by setting this static parameter:
    SQL> alter system set use_large_pages=only scope=spfile sid='*';
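As a rough sketch of the OS-side setup (the page count is a placeholder; calculate it from the combined SGA size of the instances on the node, as described in the links below):

# grep Hugepagesize /proc/meminfo
Hugepagesize:       2048 kB

# echo "vm.nr_hugepages = <number_of_pages>" >> /etc/sysctl.conf
# sysctl -p

Also make sure the oracle user's memlock limit in /etc/security/limits.conf covers the HugePages allocation (in KB), then restart the instances and verify the pages are actually being used:

# grep -i hugepages /proc/meminfo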

The following links will help you with the HugePages implementation on the DB server:

https://docs.oracle.com/en/database/oracle/oracle-database/19/unxar/administering-oracle-database-on-linux.html#GUID-02E9147D-D565-4AF8-B12A-8E6E9F74BEEA

https://oracle-base.com/articles/linux/configuring-huge-pages-for-oracle-on-linux-64

Conclusion:

What triggers the ora_lms memory leak bug remains a mystery to me! I use lots of hidden parameters on this DB, and I tried resetting them one by one without any luck. Luckily, implementing HugePages works around it, which is much better than waiting for the Oracle development team to release a bug fix!