Thursday, April 23, 2020

CRS-6706: Oracle Clusterware Release patch level ('3291738383') does not match Software patch level ('724960844')

Problem:
After patching an Oracle 19.3 GRID_HOME on an Oracle Restart setup with the 19.5 RU patch [30125133], I was not able to start up Oracle Restart HAS due to this error:

#  $GRID_HOME/bin/crsctl start has
CRS-6706: Oracle Clusterware Release patch level ('3291738383') does not match Software patch level ('724960844'). Oracle Clusterware cannot be started.
CRS-4000: Command Start failed, or completed with errors.

Analysis:
Although patching the GRID_HOME with the 19.5 RU appeared to complete successfully, something went wrong during the patching process.
While trying to find a solution, I landed on Oracle Note (Doc ID 1639285.1), which describes a similar problem on a RAC setup, but it didn't offer a solution for an Oracle Restart setup, which is my case. So I thought I'd write in this post about the solution that worked for me.
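Before jumping to the fix, you can see where the two levels come from by comparing the patches installed in the Grid home with the patch level registered by the stack. The commands below are only a diagnostic sketch (run them as the Grid home owner; if kfod rejects these options on your version, opatch lspatches alone still shows what is installed in the home):

[As the GRID_HOME owner]
$ $GRID_HOME/OPatch/opatch lspatches        # patches actually installed in the Grid home
$ $GRID_HOME/bin/kfod op=patches            # patch list known to the stack
$ $GRID_HOME/bin/kfod op=patchlvl           # computed software patch level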

Solution:
Running the following commands will fix/complete an incomplete patching of the GRID_HOME on an Oracle Restart setup.
They have to be run as the root user while Oracle Restart HAS is stopped:
# $GRID_HOME/crs/install/roothas.sh -unlock
# $GRID_HOME/crs/install/roothas.sh -prepatch 
# $GRID_HOME/crs/install/roothas.sh -postpatch

Oracle Restart HAS will start up automatically after the last command.
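Once HAS is back up, it's worth confirming that the release and software patch levels now match and that the stack is healthy. A quick verification sketch (kfod availability and options may vary by version):

[As the GRID_HOME owner]
$ $GRID_HOME/bin/crsctl check has     # expect: CRS-4638: Oracle High Availability Services is online
$ $GRID_HOME/bin/kfod op=patchlvl     # should now report a patch level consistent with the release patch level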

Tuesday, April 14, 2020

dbalarm Script Updated!

I've added one more feature to the system monitoring script dbalarm: it now monitors the number of inodes on each mounted filesystem on the system, to provide an early warning before the inodes get exhausted.
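For anyone curious how such a check can be done, below is a minimal standalone sketch of the idea (not the actual dbalarm code); it flags any mounted filesystem whose inode usage crosses a threshold, assumed here to be 95%. In dbalarm the alert would go out by mail; the sketch just prints a warning:

#!/bin/bash
# Warn when inode usage on any mounted filesystem exceeds the threshold.
INODE_THRESHOLD=95    # percentage, adjust to taste

# -P forces one line per filesystem (POSIX format), -i reports inodes instead of blocks.
df -iP | awk -v th="$INODE_THRESHOLD" 'NR>1 {
  use=$5; sub(/%/,"",use);          # strip the % sign from the IUse% column
  if (use+0 >= th)
    printf "WARNING: %s mounted on %s is using %s%% of its inodes\n", $1, $6, use
}'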

I've published a post explaining the impact of running out of inodes:
http://dba-thoughts.blogspot.com/2020/04/no-space-left-on-device-error-while.html

To download the latest version of the dbalarm script:
https://www.dropbox.com/s/a8p5q454dw01u53/dbalarm.sh?dl=0

To read more about the dbalarm script and how it monitors the DB server:
http://dba-tips.blogspot.com/2014/02/database-monitoring-script-for-ora-and.html

Soon I'll update the same in the bundle tool as well.

Sunday, April 12, 2020

19c Clusterware fail to Startup due to CRS-41053: checking Oracle Grid Infrastructure for file permission issues CRS-4000

On a 19c cluster, I got this error when trying to start the clusterware on one RAC node:

[root@fzppon05vs1n ~]# crsctl start crs
CRS-41053: checking Oracle Grid Infrastructure for file permission issues
PRVG-11960 : Set user ID bit is not set for file "/u01/grid/12.2.0.3/bin/extjob" on node "fzppon05vs1n".
PRVG-2031 : Owner of file "/u01/grid/12.2.0.3/bin/extjob" did not match the expected value on node "fzppon05vs1n". [Expected = "root(0)" ; Found = "oracle(54321)"]
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.


Before you rush to change any file permissions, read the solutions below carefully, because most probably it's not a permission issue!

I've faced this error on many occasions, and each time I fixed it with a different solution. Here is a list of all of those solutions; any one of them may work for you.

Solution #1: Make sure / and /var filesystems are not full

If the / or /var filesystem is 100% full, this may cause CRS-41053 when starting up the clusterware. If that is the case, free up space under the full filesystem; one quick command can do the magic by cleaning up the yum cache:
 
[As root]
# yum clean all
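To confirm you are really hitting this scenario (and that the cleanup freed enough space), here is a quick illustrative check; any Use% at or near 100% is the red flag:

[As root]
# df -h / /var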

 
Solution #2: kill all duplicate ohasd services
 
Before trying to restart the OS, I thought I'd check the clusterware background processes first, and here is the catch:

[root@fzppon05vs1n ~]# ps -ef | grep -v grep| grep '\.bin'
root     19786     1  1 06:18 ?        00:00:39 /u01/grid/12.2.0.3/bin/ohasd.bin reboot
root     19788     1  0 06:18 ?        00:00:00 /u01/grid/12.2.0.3/bin/ohasd.bin reboot
root     19850     1  0 06:18 ?        00:00:13 /u01/grid/12.2.0.3/bin/orarootagent.bin
root     19958     1  0 06:18 ?        00:00:14 /u01/grid/12.2.0.3/bin/oraagent.bin
...

Lots of ohasd.bin processes were found running, while there is supposed to be only one ohasd.bin process.

Checking all ohasd related processes:

[root@fzppon05vs1n ~]# ps -ef | grep -v grep | grep ohasd
root      1900     1  0 06:17 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
root      1947  1900  0 06:17 ?     00:00:00 /bin/sh /etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
root      19786     1  1 06:18 ?        00:00:00 /u01/grid/12.2.0.3/bin/ohasd.bin reboot
root      19788     1  0 06:18 ?        00:00:00 /u01/grid/12.2.0.3/bin/ohasd.bin reboot


Now, let's kill all ohasd processes and give it a try:

[root@fzppon05vs1n ~]# kill -9 1900  1947 19786 19788            

Or kill all ohasd-related processes (the init.ohasd wrappers and ohasd.bin) with a single command:
[root@fzppon05vs1n ~]# ps -ef | grep -E 'init\.ohasd|ohasd\.bin' | grep -v grep | awk '{print $2}' | xargs -r kill -9

Start the clusterware:

[root@fzppon05vs1n ~]# crsctl start crs
CRS-4123: Oracle High Availability Services has been started.


Voilà! Started up.
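Once it's up, it's worth double-checking that only a single ohasd.bin survived and that the stack is healthy; something along these lines (just a verification sketch):

[root@fzppon05vs1n ~]# ps -ef | grep -v grep | grep -c 'ohasd.bin'    # should return 1
[root@fzppon05vs1n ~]# crsctl check crs                               # HAS, CRS, CSS and EVM should all report online
[root@fzppon05vs1n ~]# crsctl stat res -t -init                       # state of the local (init) resources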

In case this didn't work, move on to the next solution ...
 
 
Solution #3: Reboot the server
 
This sounds like an IT Service Desk solution, but restarting the machine is known to fix 50% or more of weird cluster startup issues :-)
  
 
Solution #4: Re-configuring the clusterware
 
In case your clusterware still doesn't start up after the above workarounds, you may need to consider re-configuring the clusterware using the following commands.

Note: Re-configuring the clusterware should be done on the malfunctioning node only; it is not supposed to have any impact on the other, working cluster nodes:

# $GRID_HOME/crs/install/rootcrs.sh -deconfig -force
# $GRID_HOME/root.sh
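After root.sh completes, the node should rejoin the cluster on its own; here is a short verification sketch (assuming $GRID_HOME points to your Grid home):

[As root]
# $GRID_HOME/bin/crsctl check cluster -all     # CRS, CSS and EVM should be online on every node
# $GRID_HOME/bin/crsctl stat res -t            # confirm the node's resources are coming back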

 
Conclusion:

CRS-41053 may look vague; moreover, it may mention a different file than extjob in the error message. Don't rush to change the file's ownership as advised by the error message. Instead:
- Make sure no filesystem is 100% full.
- Next, check for any redundant clusterware background processes; kill them, then try to start up the clusterware.
- If the clusterware is still failing, restart the node, check again for any redundant processes, kill any you find, and start the clusterware.
- Lastly, if your clusterware still doesn't come up, use the silver bullet and reconfigure the clusterware on the malfunctioning node.