How should I manage a CloudBoot compute resource failure?
Example scenario: the data center power fails, and then the generator fails to start.
Provided that all servers went offline at the same time, and that all disks and compute resources are brought back online before any VSs are booted, the disks should not be degraded.
In this case VDisks will show as in sync in the Integrated Storage left panel for all data store zones. Once all compute resources are stable, start powering up VSs in small batches and then progressively larger if no issues are identified.
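The "small batches, then progressively larger" start-up order can be planned ahead of time. The following is only a minimal sketch: the function name plan_batches, the doubling rule, and the VS identifiers are illustrative assumptions, not part of OnApp. It only prints the planned batches; starting the VSs and checking for issues between batches is left to the operator.

```shell
#!/bin/sh
# plan_batches FIRST_SIZE VS_ID...
# Hypothetical helper: prints one batch of VS identifiers per line.
# Assumption: each batch is twice the size of the previous one.
plan_batches() {
    size="$1"; shift
    # Refuse a non-positive first batch size, which would never consume any IDs.
    [ "$size" -ge 1 ] || return 1
    while [ "$#" -gt 0 ]; do
        batch=""
        count=0
        # Take up to $size identifiers for the current batch.
        while [ "$count" -lt "$size" ] && [ "$#" -gt 0 ]; do
            batch="${batch:+$batch }$1"
            shift
            count=$((count + 1))
        done
        echo "$batch"
        size=$((size * 2))
    done
}
```

For example, plan_batches 2 followed by a list of VS identifiers prints the first two identifiers on the first line, the next four on the second line, and so on. Move on to the next batch only while the VDisks continue to show as in sync.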
If the servers go down at different times (for example, because in-cabinet UPSs last for different amounts of time), or if VSs are booted before all compute resources are back up, the disks could be in a degraded state and would need to be repaired.
Wait until all compute resources are back online. If some compute resources or disk drives fail to come back online, either attempt to bring the compute resource or disk back online or, if it is never coming back, forget the content on those disks. Once the system is stable again, repair the disks. To repair, use the 'Repair All' section on the Diagnostics page (Integrated Storage > Compute Zones > Diagnostics).
Example scenario: compute resource power supply fails
When the power supply of a compute resource fails, OnApp detects the compute resource as offline. The failover process then starts and boots the VSs on other compute resources, provided those compute resources have sufficient free resources. If the read-local path policy is enabled, VSs will only start on compute resources that hold disk content for all stripes.
Ensure that the failover timeout is set to more than 2 minutes so that the storage layer works correctly with failover.
At this stage any VDisks with content on the offline compute resource will be degraded but VSs should be running.
Example scenario: If the compute resource cannot be fixed
If the compute resource cannot be fixed, perform the following operations:
On the backup server or another compute resource run:
onappstore forgetfromall forgetlist=<node_id>
Repeat for each node from the now offline compute resource.
Go to the Diagnostics page (Integrated Storage > Compute Zones > Diagnostics) and run Repair All with the 'partial memberlist found' parameter.
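When the offline compute resource hosted several nodes, the forget step in the procedure above can be wrapped in a small loop. This is a minimal sketch: the function name forget_nodes and the DRY_RUN switch are illustrative assumptions, and the only OnApp command used is the onappstore forgetfromall call shown above. By default it only prints the commands so that they can be reviewed before anything is forgotten.

```shell
#!/bin/sh
# Hypothetical wrapper around the forget command from the procedure above.
# DRY_RUN=1 (the default) prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

# forget_nodes NODE_ID...
forget_nodes() {
    for node_id in "$@"; do
        cmd="onappstore forgetfromall forgetlist=${node_id}"
        if [ "$DRY_RUN" = "1" ]; then
            echo "$cmd"
        else
            # Stop at the first failure so it can be investigated.
            $cmd || { echo "forget failed for ${node_id}" >&2; return 1; }
        fi
    done
}
```

Call it with the node IDs of the failed compute resource as arguments, check the printed commands, then set DRY_RUN=0 and run it again on the backup server or another compute resource.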
Example scenario: If the compute resource can be fixed
To fix a compute resource:
- Boot the compute resource back up.
- Check the Diagnostics page to make sure all nodes are active (Integrated Storage > Compute Zones > Diagnostics).
- Do a Repair All on the degraded VDisks.