Cluster Roll feature enhancements now available

Revital Vladimirsky

Product Manager

January 27, 2022

Spot’s Ocean includes a powerful feature called “cluster roll.” This feature simplifies applying changes to Kubernetes worker nodes. Typical changes include applying a new image, modifying or adding user data, and updating security groups. A cluster roll applies these changes without requiring you to disable the Ocean autoscaler. It also removes the need to manually attach new nodes or remove replaced nodes from the cluster.

Cluster roll performs as designed for the majority of use cases. However, in some cases the cluster roll process did not support the desired change. We also encountered a few situations where the cluster roll entered a failed state. When this happened, too little information was logged to quickly identify and fix the underlying issue.

Feature Enhancements

We are pleased to share the following list of feature enhancements for cluster roll. They are now available to all Ocean users, with a few exceptions noted in the relevant feature descriptions.

Replace one instance with multiple instances

Cluster roll can now replace a single instance with multiple smaller instances. This avoids a cluster roll failure when only smaller instance types are configured in the Ocean cluster prior to initiating the roll. Rather than replacing each existing instance with one of the same type, Ocean provisions the most relevant infrastructure during the cluster roll, based on the workloads currently running on the nodes chosen for rolling. This improvement is especially helpful when you have modified the list of allowed instance types, or when your goal is to remove a specific instance type and replace it with multiple smaller ones. It can also improve utilization when it is not possible to scale down an entire node: an underutilized node can instead be replaced with a smaller instance. Please note: This feature is not yet enabled for Ocean clusters running on Microsoft Azure.

Interrupted instances are taken into consideration

During a roll, an instance might be interrupted. We have improved how Ocean handles this situation: Ocean now tracks which replacement instance(s) are associated with the interrupted instance. As a result, such interruptions no longer cause the cluster roll to fail.

Detailed output for each old instance

Each old instance now reports one of four statuses:

REPLACED – The instance was successfully replaced by a new instance.
TO_BE_REPLACED – Ocean has not yet tried to replace the instance. This is usually because the instance is not part of the current batch.
COULD_NOT_BE_REPLACED – The instance was not replaced. This generally happens when no replacement instance became healthy within the grace period, or when the autoscaler could not launch a node to satisfy the workload.
NOT_REPLACED_DUE_TO_PDB – Replacing the instance would violate the PDB (Pod Disruption Budget) configuration of one of the pods running on the node. Note: This status is only relevant when respectPdb is set to “true”.
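
As a rough illustration of how these statuses might be consumed, the sketch below tallies them and lists the instances that were left unreplaced. The input shape (a mapping of instance ID to status string) and the function names are assumptions made for this example, not the actual Ocean response format.

```python
from collections import Counter
from enum import Enum


class InstanceRollStatus(Enum):
    """The four per-instance statuses described above."""
    REPLACED = "REPLACED"
    TO_BE_REPLACED = "TO_BE_REPLACED"
    COULD_NOT_BE_REPLACED = "COULD_NOT_BE_REPLACED"
    NOT_REPLACED_DUE_TO_PDB = "NOT_REPLACED_DUE_TO_PDB"


def summarize_roll(instances):
    """Count instances per status.

    `instances` is a hypothetical mapping of instance ID -> status string;
    the real roll output may be shaped differently.
    """
    counts = Counter(InstanceRollStatus(s) for s in instances.values())
    return {status.value: counts.get(status, 0) for status in InstanceRollStatus}


def needs_attention(instances):
    """Return the sorted IDs of instances that the roll left unreplaced."""
    unresolved = {
        InstanceRollStatus.COULD_NOT_BE_REPLACED.value,
        InstanceRollStatus.NOT_REPLACED_DUE_TO_PDB.value,
    }
    return sorted(i for i, s in instances.items() if s in unresolved)
```

Parsing each status through the enum means an unexpected status string raises a ValueError instead of being silently miscounted.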

Support PDB during roll

A new parameter, respectPdb, can be set through the API or the UI. When it is set to “true”, Ocean will not replace a node if doing so would violate a PDB.
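
To make the parameter concrete, here is a minimal sketch of assembling a roll request body in Python. Only respectPdb comes from the text above; the "roll" wrapper, the batchSizePercentage field, and the helper name are assumptions for illustration, so check the Ocean API reference for the exact schema before relying on them.

```python
import json


def build_roll_request(respect_pdb=True, batch_size_percentage=20):
    """Build a hypothetical cluster roll request body.

    Only `respectPdb` is taken from the article. The "roll" wrapper and
    `batchSizePercentage` are assumed field names, used here for illustration.
    """
    if not 1 <= batch_size_percentage <= 100:
        raise ValueError("batch_size_percentage must be between 1 and 100")
    return {
        "roll": {
            "batchSizePercentage": batch_size_percentage,  # assumed field
            "respectPdb": bool(respect_pdb),               # from the article
        }
    }


# Serialize the body; Python's True becomes JSON true.
print(json.dumps(build_roll_request(), indent=2))
```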

Roll a specific VNG/node

There may be use cases where you want to roll a specific workload that is running across multiple VNGs (Virtual Node Groups). You can choose one or more VNGs to roll together. When more than one VNG is chosen, Ocean creates one cluster roll that includes all of the nodes in all of the specified VNGs. The batch size is applied across all the affected nodes from each of the VNGs. In addition to this new capability, you can also roll a specific instance/node if needed.
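
The idea that one batch size spans every selected VNG can be sketched as follows. This is not Ocean's actual scheduling logic: the percentage-based batch size, the round-up rule, and the node ordering are all assumptions chosen to keep the example small.

```python
def plan_batches(nodes_by_vng, batch_size_percentage):
    """Split the nodes of all selected VNGs into roll batches.

    `nodes_by_vng` is a hypothetical mapping of VNG ID -> list of node names.
    The batch size is computed over the pooled nodes, not per VNG.
    """
    if not 1 <= batch_size_percentage <= 100:
        raise ValueError("batch_size_percentage must be between 1 and 100")

    # Pool every node from every selected VNG into a single roll.
    all_nodes = [node for vng in sorted(nodes_by_vng) for node in nodes_by_vng[vng]]
    if not all_nodes:
        return []

    # Integer ceiling of len * pct / 100, with at least one node per batch (assumed).
    batch_size = max(1, (len(all_nodes) * batch_size_percentage + 99) // 100)
    return [all_nodes[i:i + batch_size] for i in range(0, len(all_nodes), batch_size)]
```

With five nodes spread over two VNGs and a 40% batch size, the batch size works out to two nodes, so a single batch can contain nodes from both VNGs.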

Minimum health percentage

The parameter batchMinHealthyPercentage sets the minimum percentage of healthy instances required in a single batch. If the percentage of healthy instances in a batch falls below this value, the cluster roll fails. The valid range is 1-100; if the parameter is null, the default of 50% applies. Instances that were not replaced due to PDB are considered healthy. You can override this behavior by setting respectPdb to “true”. Please note: This feature is not yet enabled for Ocean clusters running on Microsoft Azure.
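
The threshold check described above can be sketched in a few lines. The function and argument names are hypothetical, and the sketch models only the stated rules: the 1-100 range, the 50% default when the value is null, and counting instances that were not replaced due to PDB as healthy.

```python
DEFAULT_MIN_HEALTHY_PERCENTAGE = 50  # default stated in the article


def batch_passes(healthy, pdb_blocked, batch_size, batch_min_healthy_percentage=None):
    """Return True if a batch meets the minimum healthy percentage.

    healthy      -- replacement instances that became healthy
    pdb_blocked  -- instances not replaced due to PDB (counted as healthy)
    batch_size   -- total instances in the batch
    """
    if batch_min_healthy_percentage is None:
        batch_min_healthy_percentage = DEFAULT_MIN_HEALTHY_PERCENTAGE
    if not 1 <= batch_min_healthy_percentage <= 100:
        raise ValueError("batch_min_healthy_percentage must be between 1 and 100")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    # Compare in integers to avoid floating-point rounding:
    # (healthy + pdb_blocked) / batch_size >= percentage / 100
    counted_healthy = healthy + pdb_blocked
    return counted_healthy * 100 >= batch_min_healthy_percentage * batch_size
```

For a batch of four instances under the default threshold, two healthy instances pass and one does not, unless a second instance was skipped because of a PDB.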