AKS Day 2 management made easy

Anita Tragler

Sr. Product Manager - Containers

August 1, 2023

You’ve deployed your Azure Kubernetes Service (AKS) cluster into production. Now what? Deploying AKS clusters is cause for celebration, but don’t rest on your laurels for too long. You are now in the Day 2 Kubernetes management phase, and the operational challenges are on the rise.

The Kubernetes application lifecycle is broken into three main phases. They are often referred to as Days, but realistically, they take much longer than 24 hours!

Day 0 defines the design along with planning, requirements, and architecture.
Day 1 involves getting the application into production, including configuration, deployment, installation, and validation.
Day 2 signifies ongoing operations and maintenance, including application updates and lifecycle, Kubernetes upgrades and infrastructure maintenance, security and compliance, monitoring and troubleshooting, and cost optimization.

Day 2 is the longest phase and spans the life of the application, keeping the lights on until it is terminated or replaced by something else.

When the AKS cluster is first deployed, there are only a handful of applications, perhaps a few hundred pods on a small set of nodes. Initially, it may seem easy for the operations team, including DevOps and site reliability engineers (SREs), to manage and debug manually. But as time progresses, the environment grows: more users, more applications from multiple teams, and more clusters to manage, with hundreds of nodes and thousands of pods. The complexity grows exponentially with too many moving parts and increased chances for errors, misconfigurations, and exposure to security vulnerabilities. How well you execute on Day 2 can make or break your company’s cloud native strategy.

These are just a few factors you need to consider as you enter Day 2:

Kubernetes upgrades: Upstream Kubernetes has an aggressive release cycle, with a new minor version roughly every four months. How are you going to upgrade the Kubernetes version in a cluster that serves hundreds of applications, including mission-critical, big-data, edge, livestreaming, emerging AI/ML, and legacy apps? What if there is an upgrade failure and you need to roll back?
Security patching: How often are you going to patch for security vulnerabilities in your Kubernetes components and node operating system or images?
Configuration management: How are you going to roll out application and configuration updates to your infrastructure? Additionally, how will you manage things like new security policies, new network subnets, ephemeral disk storage, multi-arch (ARM) support, and Windows Server windows2019 (Hyper-V Gen1) to windows2022 (Hyper-V Gen2) upgrades?
Infrastructure failure: How are you going to detect and fix misconfigurations and infrastructure failures, such as node pools in a failed provisioning state or unhealthy nodes that are not responding?
Cost optimization: Lastly, how are you going to manage and optimize your infrastructure so that cloud costs don’t spiral out of control? You need to be able to monitor infrastructure and application resource utilization and make adjustments on the fly.

Existing Kubernetes monitoring and open-source tooling may not be sufficient. You will need a combination of Microsoft Azure’s own tools and vendor add-ons for observability, telemetry, configuration management, and cost optimization. These tools should not only provide metrics and generate alerts but also optimize resources, resolve issues through automation, and perform day-to-day tasks.

Automation is the key to a successful Kubernetes Day 2 operations strategy. Having a solid Day 2 plan with the right tools and automation framework will help you achieve your company’s cloud business goals.

Spot Ocean Roll: Wanna roll with it?

Spot Ocean for AKS provides serverless Kubernetes optimization for Azure and now adds support for Ocean’s Roll feature, or rolling updates, to help simplify your Day 2 cloud operations.

This is the first in a series of three blogs on Ocean Day 2 management for AKS. This blog introduces Ocean’s Roll feature, how it works, and when to use it. In the next blog, we will deep dive into custom automation with Ocean’s Roll feature for AKS Kubernetes upgrades, and the last blog will cover more scenarios for Ocean AKS Day 2 management.

Spot Ocean’s ‘Roll’ capability enables users to automate and customize roll-out of Kubernetes upgrades, patching of node images, and security updates as well as pushing configuration updates with minimum disruption to workloads and without having to disable autoscaling. The Roll feature can also be used to automatically rebalance infrastructure across Spot and regular nodes to improve availability and reduce costs as well as replace unhealthy nodes and failed node pools. The Roll feature can save a lot of time and frustration by easily automating Day 2 operations.

Ocean for AKS supports the Roll capability where a rolling update can be executed or scheduled for:

the Ocean managed Kubernetes cluster, referred to as ‘Cluster Roll’
a specific virtual node group (VNG), or ‘VNG Roll’
specific node pools, or ‘Node Pool Roll’
only a set of nodes, or ‘Node Roll’

How does it work? Getting ready to batch and roll

When an Ocean Roll is triggered for the cluster or for multiple VNGs, the roll is broken into customizable batches (default batchSize=20%), where a batch is a subset of nodes (and node pools) to roll. A batch roll is considered successful when a minimum percentage of its nodes has been replaced by healthy nodes. This threshold is defined by batchMinHealthyPercentage, which is set to 50% by default. The roll then moves on to the next batch until all batches are successfully rolled.
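To make the batching arithmetic concrete, here is a minimal sketch in Python. It is not Ocean’s implementation: the function names are hypothetical, and rounding each batch up to a whole node is an assumption. Only the 20% and 50% defaults come from the description above.

```python
def plan_batches(nodes, batch_size_pct=20):
    """Split nodes into consecutive batches of about batch_size_pct percent each.

    Assumption: the per-batch node count is rounded up, with a minimum of one.
    """
    if not 0 < batch_size_pct <= 100:
        raise ValueError("batch_size_pct must be between 1 and 100")
    if not nodes:
        return []
    # Integer ceiling of len(nodes) * pct / 100, avoiding float rounding.
    per_batch = max(1, (len(nodes) * batch_size_pct + 99) // 100)
    return [nodes[i:i + per_batch] for i in range(0, len(nodes), per_batch)]


def batch_succeeded(replaced_healthy, batch_len, min_healthy_pct=50):
    """True when at least min_healthy_pct percent of the batch was replaced by healthy nodes."""
    return replaced_healthy * 100 >= min_healthy_pct * batch_len


nodes = [f"node-{i}" for i in range(10)]
batches = plan_batches(nodes)
print([len(b) for b in batches])   # ten nodes at 20% gives five batches of two
print(batch_succeeded(1, 2))       # one of two nodes replaced meets the 50% bar
```

With ten nodes and the default settings, each batch holds two nodes, and a batch counts as successful once at least one of them has been replaced by a healthy node.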

For each batch, Ocean attempts to launch new nodes in new node pools to match the workload requirements (CPU and memory requests) of the existing batch of nodes being rolled. New node pools may be of the same or different VM sizes depending on updated VNG configuration as well as current availability and cost of markets.
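As a rough illustration of what “matching the workload requirements” means, the sketch below sums the CPU and memory requests of the pods running on a batch of nodes. The Pod type, its field names, and the function name are hypothetical; how Ocean actually sizes the replacement capacity is not described here.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    node: str
    cpu_millicores: int   # CPU request, in millicores
    memory_mib: int       # memory request, in MiB


def batch_requests(pods, batch_nodes):
    """Return the total (cpu_millicores, memory_mib) requested by pods on the batch's nodes."""
    batch = set(batch_nodes)
    on_batch = [p for p in pods if p.node in batch]
    cpu = sum(p.cpu_millicores for p in on_batch)
    mem = sum(p.memory_mib for p in on_batch)
    return cpu, mem


pods = [
    Pod("web-1", "node-a", 500, 512),
    Pod("web-2", "node-a", 250, 256),
    Pod("job-1", "node-b", 1000, 1024),
]
print(batch_requests(pods, ["node-a"]))
```

The new nodes for this batch would need to offer at least that much allocatable CPU and memory, regardless of which VM sizes are chosen.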

While the nodes in new node pools are starting up, the existing node pools in the batch are locked for scaling and nodes are cordoned, marked as unschedulable (taint node.kubernetes.io/unschedulable). Ocean autoscaler can continue to scale up, if needed, using other node pools, including the newly created node pools when they are ready.
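For readers less familiar with cordoning, the sketch below models its effect on a node object using plain Python dictionaries shaped like a Kubernetes Node. It does not call the Kubernetes API, and the helper names are hypothetical. The taint key is the one mentioned above, applied with the NoSchedule effect.

```python
import copy

UNSCHEDULABLE_TAINT = {"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}


def cordon(node):
    """Return a copy of the node marked unschedulable, with the matching taint."""
    updated = copy.deepcopy(node)
    spec = updated.setdefault("spec", {})
    spec["unschedulable"] = True
    taints = spec.setdefault("taints", [])
    if UNSCHEDULABLE_TAINT not in taints:
        taints.append(dict(UNSCHEDULABLE_TAINT))
    return updated


def is_schedulable(node):
    """True when the node is neither flagged unschedulable nor carrying the taint."""
    spec = node.get("spec", {})
    if spec.get("unschedulable", False):
        return False
    return all(t.get("key") != UNSCHEDULABLE_TAINT["key"] for t in spec.get("taints", []))


old_node = {"metadata": {"name": "aks-pool1-0"}, "spec": {}}
cordoned = cordon(old_node)
print(is_schedulable(old_node), is_schedulable(cordoned))
```

Pods already running on a cordoned node keep running; the node simply stops receiving new pods, which is why scale-up has to come from other node pools.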

Once the new nodes (and node pools) are ready, Ocean will safely drain nodes and gracefully evict pods. Evicted pods are immediately rescheduled to new nodes in the new node pools.

Ocean gives evicted pods sufficient time to wrap up tasks and terminate gracefully, honoring the terminationGracePeriodSeconds setting in each pod’s PodSpec. Ocean also provides a customizable cluster-level drain timeout (DrainTimeout=300s) to handle misbehaving pods. When the workload controller restarts the pods, they are scheduled right away on the new nodes that are waiting for them. This ensures minimum downtime for mission-critical workloads in production. Ocean verifies that newly launched nodes are healthy and pods are running properly.

Ocean marks the Roll as COMPLETE when all batches have executed, all existing nodes have been processed, and at least the required share of them (batchMinHealthyPercentage=50%) has been replaced by healthy nodes with pods running on the new nodes. Finally, the existing node pools are scaled down (nodes deleted) and removed.

During the Roll, Ocean respects pod constraints, including the Spot label spotinst.io/restrict-scale-down, pod disruption budgets (PDBs), and pod topology spread constraints. If respectPDB=True and evicting a pod would violate its PDB, Ocean does not evict the pod: the node is not replaced and its status is set to NOT_REPLACED_DUE_TO_PDB. Ocean then continues with the other nodes.
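The PDB behavior can be sketched as a simple decision function. This is an illustrative model, not Ocean’s logic: the Budget type, the function name, and the REPLACED status string are hypothetical, and the check looks at a single point in time rather than retrying evictions. Only respectPDB and NOT_REPLACED_DUE_TO_PDB come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    min_available: int   # minAvailable from the PDB
    healthy: int         # pods of this app that are currently healthy

    @property
    def disruptions_allowed(self):
        return max(0, self.healthy - self.min_available)


def node_status(node_pod_apps, budgets, respect_pdb=True):
    """Decide whether a node can be drained, given the apps of the pods it hosts."""
    if respect_pdb:
        for app in node_pod_apps:
            budget = budgets.get(app)
            # An eviction is refused when the budget has no disruptions left.
            if budget is not None and budget.disruptions_allowed == 0:
                return "NOT_REPLACED_DUE_TO_PDB"
    return "REPLACED"  # hypothetical status name for a successfully rolled node


budgets = {"web": Budget(min_available=2, healthy=2), "batch": Budget(min_available=1, healthy=3)}
print(node_status(["web", "batch"], budgets))
print(node_status(["web", "batch"], budgets, respect_pdb=False))
```

In this example the node hosting a web pod is skipped, because web already sits at its minimum of two healthy pods, while the same node is rolled when the PDB is ignored.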

If some nodes or node pools could not be replaced due to PDB or other reasons, you can fix the issue or choose to ignore PDB (set respectPDB=False) and rerun or schedule the roll for failed nodes.

In the ‘Rolls’ tab of the UI console, or through the Roll API, you can track the status of each Roll, batch, or node being rolled.

Ocean for AKS: Realize the power of the roll

Ocean AKS not only manages your cloud infrastructure but also takes the pain out of Day 2 operations.

Now that you have seen the power of the new roll feature for Ocean AKS, it is time to give it a roll! Reach out to your Spot contact person or start working on your AKS today in the Spot Console.

Request a demo with a solution architect.

In the next blog, we will deep dive into how Ocean can make AKS Kubernetes upgrades a fully automated, painless experience.