Monday, November 11, 2019

CRS-4000: Command Start failed, or completed with errors

Problem:
 
While restarting the clusterware on one cluster node, I got this error:

[root@fzppon06vs1n~]# crsctl start cluster
CRS-2679: Attempting to clean 'ora.ctssd' on 'fzppon06vs1n'
CRS-2672: Attempting to start 'ora.drivers.acfs' on 'fzppon06vs1n'
CRS-2672: Attempting to start 'ora.evmd' on 'fzppon06vs1n'
CRS-2680: Clean of 'ora.ctssd' on 'fzppon06vs1n' failed
CRS-2676: Start of 'ora.evmd' on 'fzppon06vs1n' succeeded
CRS-2674: Start of 'ora.drivers.acfs' on 'fzppon06vs1n' failed
CRS-2672: Attempting to start 'ora.drivers.acfs' on 'fzppon06vs1n'
CRS-2679: Attempting to clean 'ora.ctssd' on 'fzppon06vs1n'
CRS-2680: Clean of 'ora.ctssd' on 'fzppon06vs1n' failed
CRS-2679: Attempting to clean 'ora.ctssd' on 'fzppon06vs1n'
CRS-2680: Clean of 'ora.ctssd' on 'fzppon06vs1n' failed
CRS-2674: Start of 'ora.drivers.acfs' on 'fzppon06vs1n' failed
CRS-2672: Attempting to start 'ora.storage' on 'fzppon06vs1n'
CRS-2676: Start of 'ora.storage' on 'fzppon06vs1n' succeeded
CRS-4000: Command Start failed, or completed with errors.



Analysis:
 
When checking the clusterware alert log, I found that the log had stopped at this line:

2019-10-07 12:15:45.495 [EVMD(23031)]CRS-8500: Oracle Clusterware EVMD process is starting with operating system process ID 23031

I checked the time on both cluster nodes, and it was in sync.
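
As a rough illustration of such a clock check, here is a minimal sketch that compares two epoch timestamps. The function name `clock_drift_seconds` and the node name `node2` are hypothetical, and the ssh lines assume passwordless access between the nodes; this is not necessarily the exact check used here.

```shell
# Hypothetical helper: print the absolute difference between two epoch timestamps.
clock_drift_seconds() {
    drift=$(( $1 - $2 ))
    # Strip a leading minus sign to get the absolute value.
    echo "${drift#-}"
}

# Usage sketch (assumption: 'node2' is a placeholder for the other cluster node):
#   local_ts=$(date +%s)
#   remote_ts=$(ssh node2 date +%s)
#   clock_drift_seconds "$local_ts" "$remote_ts"
```

A result of a second or two is mostly ssh round-trip noise. On a Grid Infrastructure installation, `crsctl check ctss` and `cluvfy comp clocksync -n all` report on time synchronization as well.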

I then tried to stop the clusterware forcefully and start it again:
[root@fzppon06vs1n~]# crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'fzppon06vs1n'
CRS-2679: Attempting to clean 'ora.ctssd' on 'fzppon06vs1n'
CRS-2680: Clean of 'ora.ctssd' on 'fzppon06vs1n' failed
CRS-2679: Attempting to clean 'ora.ctssd' on 'fzppon06vs1n'
CRS-2680: Clean of 'ora.ctssd' on 'fzppon06vs1n' failed
CRS-2679: Attempting to clean 'ora.ctssd' on 'fzppon06vs1n'
CRS-2680: Clean of 'ora.ctssd' on 'fzppon06vs1n' failed
CRS-2679: Attempting to clean 'ora.ctssd' on 'fzppon06vs1n'
CRS-2680: Clean of 'ora.ctssd' on 'fzppon06vs1n' failed
CRS-2679: Attempting to clean 'ora.ctssd' on 'fzppon06vs1n'
CRS-2680: Clean of 'ora.ctssd' on 'fzppon06vs1n' failed
CRS-2799: Failed to shut down resource 'ora.cssd' on 'fzppon06vs1n'
CRS-2799: Failed to shut down resource 'ora.cssdmonitor' on 'fzppon06vs1n'
CRS-2799: Failed to shut down resource 'ora.ctssd' on 'fzppon06vs1n'
CRS-2799: Failed to shut down resource 'ora.gipcd' on 'fzppon06vs1n'
CRS-2799: Failed to shut down resource 'ora.gpnpd' on 'fzppon06vs1n'
CRS-2795: Shutdown of Oracle High Availability Services-managed resources on 'fzppon06vs1n' has failed
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors.
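
With this much repeated output, it can help to summarize which resources are actually failing. The following is a minimal sketch, assuming the `crsctl` output has been saved to a file or is piped in on stdin. The function name `failed_resources` is hypothetical, and it only looks at the three failure codes that appear in the output above (CRS-2674, CRS-2680, CRS-2799).

```shell
# Hypothetical helper: read crsctl output on stdin and print each resource
# named in a CRS-2674 (start failed), CRS-2680 (clean failed) or
# CRS-2799 (shutdown failed) line, followed by the number of such lines.
failed_resources() {
    grep -E '^CRS-(2674|2680|2799):' |
        # Keep only the first single-quoted token, which is the resource name.
        sed -E "s/^[^']*'([^']*)'.*/\1/" |
        LC_ALL=C sort | uniq -c | awk '{print $2, $1}'
}

# Usage sketch (assumption: the output was captured to a file first):
#   crsctl stop crs -f 2>&1 | tee /tmp/crs_stop.log
#   failed_resources < /tmp/crs_stop.log
```

Applied to the stop output above, this kind of summary would make it clear that 'ora.ctssd' accounts for most of the failures.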


It looks like there is a problem with stopping the cssd and ctssd services as well.

Solution:
 
I restarted the node, and the clusterware came up properly without errors:
[root@fzppon06vs1n ~]# sync;sync;sync; init 6

Analyzing this problem was challenging because no errors were reported in the clusterware logs while the clusterware was hung during startup. So far, restarting the RAC node has been one of the silver-bullet troubleshooting techniques for many nonsensical clusterware behaviors ;-)