On a 19c cluster node, I got this error when trying to start one RAC node:
[root@fzppon05vs1n ~]# crsctl start crs
CRS-41053: checking Oracle Grid Infrastructure for file permission issues
PRVG-11960 : Set user ID bit is not set for file "/u01/grid/12.2.0.3/bin/extjob" on node "fzppon05vs1n".
PRVG-2031 : Owner of file "/u01/grid/12.2.0.3/bin/extjob" did not match the expected value on node "fzppon05vs1n". [Expected = "root(0)" ; Found = "oracle(54321)"]
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.
Before you rush to change any file permissions, read the solutions below carefully, because most probably it's not a permission issue!
I've faced this error on many occasions, and each time I fixed it with a different solution. Here is a list of all the solutions; any one of them may work for you.
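Before touching anything, it may help to simply record the current ownership and permissions of the flagged file for later reference, for example using the path from the error above:
[root@fzppon05vs1n ~]# ls -l /u01/grid/12.2.0.3/bin/extjob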
Solution #1: Make sure / and /var filesystems are not full
If the / or /var filesystem is 100% full, it may cause CRS-41053 when starting up the clusterware. If that is the case, free up space under the full filesystem; one quick command can do the magic by cleaning the yum cache:
[As root]
# yum clean all
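To confirm whether / or /var is really the culprit (and that the cleanup actually reclaimed space), a quick check like the following helps; df and du are standard tools, and the du paths below are just common space consumers that may differ on your system:
[As root]
# df -h / /var
# du -sh /var/log /var/cache/yum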
Solution #2: Kill the redundant clusterware background processes
Before trying to restart the OS, I thought to check the clusterware background processes first, and here is the catch:
[root@fzppon05vs1n ~]# ps -ef | grep -v grep| grep '\.bin'
root 19786 1 1 06:18 ? 00:00:39 /u01/grid/12.2.0.3/bin/ohasd.bin reboot
root 19788 1 0 06:18 ? 00:00:00 /u01/grid/12.2.0.3/bin/ohasd.bin reboot
root 19850 1 0 06:18 ? 00:00:13 /u01/grid/12.2.0.3/bin/orarootagent.bin
root 19958 1 0 06:18 ? 00:00:14 /u01/grid/12.2.0.3/bin/oraagent.bin
...
Lots of ohasd.bin processes were found running, while there is supposed to be only one ohasd.bin process.
Checking all ohasd-related processes:
[root@fzppon05vs1n ~]# ps -ef | grep -v grep | grep ohasd
root 1900 1 0 06:17 ? 00:00:00 /bin/sh /etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
root 1947 1900 0 06:17 ? 00:00:00 /bin/sh /etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
root 19786 1 1 06:18 ? 00:00:00 /u01/grid/12.2.0.3/bin/ohasd.bin reboot
root 19788 1 0 06:18 ? 00:00:00 /u01/grid/12.2.0.3/bin/ohasd.bin reboot
Now, let's kill all ohasd processes and give it a try:
[root@fzppon05vs1n ~]# kill -9 1900 1947 19786 19788
Or simply kill all init.ohasd processes using the following single command:
[root@fzppon05vs1n ~]# ps -ef | grep 'init.ohasd' | grep -v grep | awk '{print $2}' | xargs -r kill -9
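Before starting the stack, a quick sanity check (re-using the same ps filter from above) can confirm that no stale ohasd processes are left behind; it should return no output:
[root@fzppon05vs1n ~]# ps -ef | grep -v grep | grep ohasd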
Start the clusterware:
[root@fzppon05vs1n ~]# crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
Voilà! Started up.
In case this didn't work, move to the next solution ...
Solution #3: Reboot the server
This sounds like an IT Service Desk solution, but restarting the machine is known to fix 50% or more of weird cluster startup issues :-)
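After the reboot, the clusterware normally tries to auto-start on its own; a check like the following (using crsctl from the Grid home path shown in this environment) tells you whether it came up before you move on to the next solution:
[root@fzppon05vs1n ~]# /u01/grid/12.2.0.3/bin/crsctl check crs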
Solution #4: Re-configuring the clusterware
In case your clusterware still doesn't start up after the above workarounds, you may need to consider re-configuring the clusterware using the following commands.
Note: Re-configuring the clusterware should happen on the malfunctioning node only; it is not supposed to have any impact on the other working cluster nodes:
# $GRID_HOME/crs/install/rootcrs.sh -deconfig -force
# $GRID_HOME/root.sh
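Once root.sh completes successfully on the node, a check like the following (standard crsctl commands; output will of course differ per environment) can confirm that the stack is up and the node's resources are coming back online:
# $GRID_HOME/bin/crsctl check crs
# $GRID_HOME/bin/crsctl stat res -t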
Conclusion:
CRS-41053 may look vague; moreover, it may mention a different file than extjob in the error message. Don't rush to change the file's ownership as advised by the error message. Instead:
- First, make sure no filesystem is 100% full.
- Second, check for any redundant clusterware background processes, kill them, then try to start up the clusterware.
- If the clusterware is still failing, restart the node and check again for any redundant processes; if any are found, kill them and start the clusterware.
- Lastly, if the clusterware still doesn't come up, use the silver bullet and re-configure the clusterware on the malfunctioning node.