Currently, OnApp's failure response is not at an acceptable production level. Losing the management network, and possibly the VM networks, can leave the management station unable to tell whether the HV is still connected to storage. In that situation there is no way to guarantee that recovery will not boot the same VM twice.
In the current failover scenario, only software checks (SSH, SNMP, ping, etc.) are implemented, and all of them run over the HV management network and the VM network (ping).
I'd suggest adding an option to run a custom (self-written) script or command from the CP and wait for its output string or exit code, as is already implemented in OnApp Recipes. This script or command should be the last step in OnApp failover before the HV is considered completely offline and VM migration from it begins. Its goal is to confirm, without using the management interface, that the node is completely offline or has been restarted. It could be an ipmitool command for Supermicro servers, but it could equally use any other remote management interface, so OnApp won't be tied to Supermicro servers.
Another way to establish that the HV is completely offline before migrating VMs from it: run a custom "Power Cycle Command" from the CP without logging in to the HV over the management network. In this scenario the output or response code still needs to be checked (again, similar to OnApp Recipes). Either way, the command would only run in the failover steps if it is defined.
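To make the idea concrete, here is a minimal sketch of how the CP could run such a command and decide from its exit code and output whether the HV is confirmed offline. The function name `fence_confirmed` and its parameters are hypothetical, not part of OnApp; the sketch only assumes that the operator supplies a command and, optionally, a string expected in its output.

```python
import subprocess
from typing import Optional, Sequence


def fence_confirmed(
    command: Sequence[str],
    expected_output: Optional[str] = None,
    timeout_s: float = 30.0,
) -> bool:
    """Return True only if the operator-defined command proves the HV is offline.

    Hypothetical helper: any failure (no command, timeout, command not found,
    non-zero exit code, unexpected output) is treated as "not confirmed".
    """
    if not command:
        return False  # nothing defined, so nothing was confirmed
    try:
        result = subprocess.run(
            list(command),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False  # command hung or could not be started
    if result.returncode != 0:
        return False  # exit code check
    if expected_output is not None and expected_output not in result.stdout:
        return False  # output string check
    return True
```

With ipmitool, for example, the command could be `ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> chassis power status` with the expected output `Chassis Power is off`; the address and credentials are placeholders. Failover would start migrating VMs only when the function returns True, and would skip the step entirely when no command is defined.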
Ideally, the HA functionality would work without needing the CP server, with some level of coordination between the HVs.
Storage heartbeating should be enforced to ensure that an HV is no longer connected to the storage before its VMs are booted elsewhere, to avoid corruption.
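As a rough illustration of what storage heartbeating could look like, the sketch below assumes each HV periodically writes a timestamp to its own file on the shared storage, and that recovery checks the age of that timestamp before booting the HV's VMs elsewhere. The file layout, function names, and threshold are all assumptions for illustration, not OnApp's actual mechanism.

```python
import time
from pathlib import Path
from typing import Optional


def write_heartbeat(path: Path, now: Optional[float] = None) -> None:
    """Called periodically on the HV: record the current time on shared storage."""
    ts = time.time() if now is None else now
    tmp = path.with_suffix(".tmp")
    tmp.write_text(f"{ts:.3f}\n")
    tmp.replace(path)  # swap the new file in so readers never see a partial write


def hv_storage_is_silent(
    path: Path, stale_after_s: float, now: Optional[float] = None
) -> bool:
    """Return True only if the HV's last heartbeat is older than the threshold.

    A missing or unreadable heartbeat proves nothing, so it is treated as
    "not silent" and recovery should not boot the VMs on that basis alone.
    """
    current = time.time() if now is None else now
    try:
        last = float(path.read_text().strip())
    except (OSError, ValueError):
        return False
    return (current - last) > stale_after_s
```

The threshold would need to be several heartbeat intervals long, and comparing timestamps across hosts assumes their clocks are reasonably in sync.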