Sunday, April 12, 2020

19c Clusterware Fails to Start Up due to CRS-41053: checking Oracle Grid Infrastructure for file permission issues CRS-4000

On a 19c cluster, I got this error when trying to start the clusterware on one RAC node:

[root@fzppon05vs1n ~]# crsctl start crs
CRS-41053: checking Oracle Grid Infrastructure for file permission issues
PRVG-11960 : Set user ID bit is not set for file "/u01/grid/12.2.0.3/bin/extjob" on node "fzppon05vs1n".
PRVG-2031 : Owner of file "/u01/grid/12.2.0.3/bin/extjob" did not match the expected value on node "fzppon05vs1n". [Expected = "root(0)" ; Found = "oracle(54321)"]
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.


Before you rush to change any file permissions, read the solutions below carefully, because most probably it's not a permission issue!

I've faced this error on many occasions, and each time I fixed it with a different solution. Here is the list of all of them; any one may work for you.

Solution #1: Make sure / and /var filesystems are not full

If the / or /var filesystem is 100% full, this can cause CRS-41053 when starting up the clusterware. If that is the case, free up space under the full filesystem; one quick command can do the magic by cleaning up the yum cache:
 
[As root]
# yum clean all
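
Before or after the cleanup, a quick df check confirms which filesystem is actually the full one (df -h is a standard Linux command; / and /var below are just the usual suspects, adjust to your layout):

[As root]
# df -h / /var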

 
Solution #2: Kill all duplicate ohasd processes
 
Before trying to restart the OS, I thought I'd check the clusterware background processes first, and here is the catch:

[root@fzppon05vs1n ~]# ps -ef | grep -v grep| grep '\.bin'
root     19786     1  1 06:18 ?        00:00:39 /u01/grid/12.2.0.3/bin/ohasd.bin reboot

root     19788     1  0 06:18 ?        00:00:00 /u01/grid/12.2.0.3/bin/ohasd.bin reboot
root     19850     1  0 06:18 ?        00:00:13 /u01/grid/12.2.0.3/bin/orarootagent.bin
root     19958     1  0 06:18 ?        00:00:14 /u01/grid/12.2.0.3/bin/oraagent.bin

...

I found lots of ohasd.bin processes running, while there is supposed to be only one ohasd.bin process.

Checking all ohasd-related processes:

[root@fzppon05vs1n ~]# ps -ef | grep -v grep | grep ohasd
root      1900     1  0 06:17 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
root      1947  1900  0 06:17 ?     00:00:00 /bin/sh /etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
root      19786     1  1 06:18 ?        00:00:00 /u01/grid/12.2.0.3/bin/ohasd.bin reboot
root      19788     1  0 06:18 ?        00:00:00 /u01/grid/12.2.0.3/bin/ohasd.bin reboot


Now, let's kill all ohasd processes and give it a try:

[root@fzppon05vs1n ~]# kill -9 1900  1947 19786 19788            

Or simply kill all init.ohasd processes using the following single command:
[root@fzppon05vs1n ~]# ps -ef | grep 'init.ohasd' | grep -v grep | awk '{print $2}' | xargs -r kill -9            
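
Note that the one-liner above only matches the init.ohasd wrapper scripts. If the stray ohasd.bin processes from the first listing survive it, a slightly broader pattern catches both (assuming nothing unrelated on the node has "ohasd" in its command line):

[root@fzppon05vs1n ~]# ps -ef | grep -v grep | grep ohasd | awk '{print $2}' | xargs -r kill -9
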
Start the clusterware:

[root@fzppon05vs1n ~]# crsctl start crs
CRS-4123: Oracle High Availability Services has been started.


Voilà! Started up.
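
If you want to double-check that the stack really came back, the standard crsctl checks do the job (the resource listing will of course differ per environment):

[root@fzppon05vs1n ~]# crsctl check crs
[root@fzppon05vs1n ~]# crsctl stat res -t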

In case this didn't work, move to the next solution ...
 
 
Solution #3: Reboot the server
 
This sounds like an IT Service Desk solution, but restarting the machine is known to fix 50% or more of weird cluster startup issues :-)
  
 
Solution #4: Re-configuring the clusterware
 
In case your clusterware still doesn't start up after the above workarounds, you may need to consider re-configuring the clusterware using the following commands:

Note: Re-configuring the clusterware should happen on the malfunctioning node only; it's not supposed to have any impact on the other working cluster nodes:

# $GRID_HOME/crs/install/rootcrs.sh -deconfig -force
# $GRID_HOME/root.sh
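
Once root.sh completes successfully, a quick sanity check that the node rejoined the cluster can be done with the standard commands below (the output depends on your node names and setup):

# $GRID_HOME/bin/crsctl check cluster -all
# $GRID_HOME/bin/olsnodes -n -s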

 
Conclusion:

CRS-41053 may look vague; moreover, it may mention a different file other than extjob in the error message. Don't rush to change the file's ownership as advised by the error message:
- First, make sure no filesystem is 100% full.
- Second, check for any redundant clusterware background processes, kill them, then try to start up the clusterware.
- If the clusterware is still failing, restart the node and check again for any redundant processes; if you find any, kill them and start the cluster.
- Lastly, if your clusterware still doesn't come up, use the silver bullet and reconfigure the clusterware on the malfunctioning node.
