Kubernetes HPA: The Basics and a Quick Tutorial

What Is Kubernetes HPA (Horizontal Pod Autoscaler)? 

Kubernetes Horizontal Pod Autoscaler (HPA) is a feature in Kubernetes that automatically adjusts the number of pods in a deployment or replica set based on observed CPU utilization or other selected metrics. It is designed to handle the dynamic nature of user traffic by scaling applications up and down to meet demand without manual intervention. 

HPA is useful for applications that experience variations in usage, ensuring efficient use of resources while maintaining performance. HPA functionality is built into Kubernetes and leverages the metrics gathered by components like the metrics server to make decisions. 

By setting specific thresholds for resource use, HPA ensures that pods scale automatically when thresholds are exceeded. Its goal is to maintain stability and optimal performance, balancing cost-effectiveness (avoiding under-utilized, over-provisioned pods) against service quality (avoiding over-utilized pods and degraded performance).

This is part of a series of articles about Kubernetes architecture.

How Does HPA Work? 

Kubernetes Horizontal Pod Autoscaler (HPA) operates through a control loop mechanism that automatically scales the number of pods in a deployment based on specific metrics, primarily CPU utilization, although it can also use custom or other predefined metrics. 

Here’s a breakdown of the process:

  1. Metric collection: HPA periodically collects metrics from the Kubernetes metrics server or a custom metrics API. These metrics reflect each pod’s current resource usage compared to its specified target.
  2. Evaluation: The collected data is evaluated against the scaling policies defined by the user. These policies include thresholds for scaling up or down the number of pods. For example, you might set HPA to increase the number of pods when CPU usage exceeds 70% of the pod’s resource request.
  3. Scaling decision: Based on this evaluation, HPA calculates the required number of pods. If the current count does not match the calculation, HPA adjusts it so that the average load per pod falls within the target specified in the HPA configuration (see the worked example after this list).
  4. Actuation: If a scaling action is necessary, HPA modifies the deployment or replica set to reflect the new desired count of pods. This is done by sending requests to the Kubernetes API to scale the number of replicas.
  5. Stabilization: After scaling, there is a stabilization period during which HPA avoids further scaling actions. This window mitigates the frequent scale-in and scale-out that fluctuating metrics could otherwise cause, providing a more stable operational environment (see the configuration example after this section).
  6. Feedback loop: The process is cyclical; metrics continue to be monitored and evaluated at regular intervals defined in the HPA configuration (typically every 15 seconds), ensuring the system dynamically adapts to changing workloads.
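
For resource metrics, the core calculation documented by Kubernetes is:

desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)

For example, if 4 pods average 90% CPU utilization against a 60% target, the controller computes ceil(4 * 90 / 60) = 6 and scales the workload to 6 replicas; if those pods later average only 30%, it computes ceil(4 * 30 / 60) = 2 and scales down accordingly.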

This feedback loop enables Kubernetes HPA to dynamically and automatically adjust the number of pods, ensuring that applications maintain optimal performance and resource utilization according to real-time demands.
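
The stabilization window mentioned in step 5 is configurable on autoscaling/v2 HPA objects through the behavior field. Here is a minimal sketch of an HPA spec fragment; the values shown are illustrative choices, with 300 seconds matching the default scale-down window:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 50
      periodSeconds: 60

With this fragment, the controller bases scale-down decisions on the highest recommendation seen over the past five minutes and removes at most 50% of the current replicas per minute.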

Related content: Read our guide to Kubernetes pod

Kubernetes HPA Benefits 

HPA offers several important benefits:

  • Automatic scaling: HPA allocates resources dynamically based on actual usage rather than static, potentially inaccurate predictions, which is especially valuable for applications with fluctuating demand. This ensures that applications can handle unexpected spikes in traffic without downtime or performance hiccups.
  • Resource efficiency: By adjusting the number of pods based on current demands, HPA reduces waste associated with over-provisioning and the risks tied to under-provisioning. This translates into cost savings and promotes eco-friendly practices by reducing the environmental impact of running unnecessary computing resources.
  • Improved performance: HPA ensures enough instances are available to handle incoming requests efficiently. It prevents scenarios where insufficient resources cause slow responses or crashes due to overload. By automatically scaling resources, HPA maintains a consistent experience for end users.

Kubernetes HPA Limitations 

However, HPA also has some limitations that users should be aware of:

  • Metric latency: There’s often a delay between data collection and action, which can lead to over- or under-scaling. This delay occurs because metrics retrieval and processing are not instantaneous; they involve calculations and data transfers across different system components.
  • Lack of awareness of external factors: HPA operates based on the metrics it can monitor, typically CPU and memory usage. It lacks direct awareness of external factors, such as third-party APIs or concurrent tasks, that might impact application performance (external metrics adapters can partially address this; see the sketch after this list).
  • Cannot be used together with Vertical Pod Autoscaler (VPA): These tools address scaling in conflicting ways. While HPA changes the number of instances, VPA adjusts the resources available to individual pods. When both act on the same CPU or memory metrics, they can interfere with each other, resulting in unstable pod behavior and resource usage.
  • Not suitable for all applications: HPA isn’t useful for applications with constant usage or predictable demand. Applications that cannot easily be partitioned into multiple instances, due to statefulness or other architectural constraints, might not benefit from horizontal scaling.
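
On the external-factors point, one partial mitigation exists: if the cluster runs an external metrics adapter (for example, one exposing message-queue depth), an autoscaling/v2 HPA can target that signal. Below is a hypothetical metrics fragment, assuming an adapter publishes a metric named queue_messages_ready; the metric name and target value are illustrative assumptions, not part of Kubernetes itself:

metrics:
- type: External
  external:
    metric:
      name: queue_messages_ready
    target:
      type: AverageValue
      averageValue: "30"

With such a fragment, the autoscaler adds replicas until the queue depth per pod falls to roughly 30, subject to the configured minimum and maximum replica counts.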

Tutorial: Create HPA in Kubernetes 

This tutorial is adapted from the Kubernetes documentation.

Run and Expose the php-apache Server

To set up a demonstration of Kubernetes’ Horizontal Pod Autoscaler (HPA), deploy the php-apache example application from the Kubernetes documentation. You will use a Kubernetes manifest file to create a deployment that runs a container built from the hpa-example image, and a service to expose it. Here is the command to apply the Kubernetes manifest:

kubectl apply -f https://k8s.io/examples/application/php-apache.yaml

This command sets up a deployment named php-apache with label selectors and CPU resource requests and limits. The service component of the manifest makes the deployment reachable on port 80 within the cluster.
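
For reference, the manifest at that URL defines roughly the following Deployment and Service (abbreviated here; check the URL for the authoritative version). Note the 200m CPU request: the HPA’s utilization percentage is measured against this value.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
spec:
  selector:
    matchLabels:
      run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - name: php-apache
        image: registry.k8s.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 200m
---
apiVersion: v1
kind: Service
metadata:
  name: php-apache
  labels:
    run: php-apache
spec:
  ports:
  - port: 80
  selector:
    run: php-apache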

Create a Horizontal Pod Autoscaler

Once the application is up and running, you create a HorizontalPodAutoscaler to manage the scaling of this deployment. The HPA is designed to maintain between 4 and 12 pod replicas, aiming for an average CPU utilization of 60%. Use the following command to create the autoscaler:

kubectl autoscale deployment php-apache --cpu-percent=60 --min=4 --max=12
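
For teams that prefer declarative configuration, the same autoscaler can be expressed as a manifest and applied with kubectl apply, which is easier to keep in version control. A minimal sketch using the autoscaling/v2 API:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 4
  maxReplicas: 12
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60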

After creation, you can verify the status of the HPA with this command:

kubectl get hpa

This checks the current status of the autoscaler, displaying details like the target CPU utilization and the number of replicas currently running.
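
For more detail, including recent scaling events and the conditions the controller has evaluated, you can also describe the autoscaler:

kubectl describe hpa php-apache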

Increase the Autoscaler’s Load

To test the HPA’s response to increased load, start a separate pod that acts as a client sending requests to the php-apache service. This simulates a higher load by continually querying the service:

kubectl run -i --tty load-generator --rm --image=busybox:1.28 --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://php-apache; done"

Monitor the HPA’s reaction to the increased load by watching the number of replicas:

kubectl get hpa php-apache --watch

You should observe the CPU usage increase and, correspondingly, an increase in the number of replicas.

Stop Generating the Load

To conclude the demonstration, stop the load generation by exiting the load-generating command with Ctrl+C. Afterward, observe the autoscaler’s adjustment by continuing to watch the HPA:

kubectl get hpa php-apache --watch

Once the load ceases, CPU utilization should decrease, prompting the HPA to scale down the number of replicas back to the minimum specified. Verify the final state of the deployment to ensure it reflects the reduced demand:

kubectl get deployment php-apache

The deployment should show that the number of replicas has adjusted back to the baseline as the CPU usage normalized.

Automating Kubernetes Infrastructure with Spot by NetApp

Spot Ocean from Spot by NetApp frees DevOps teams from the tedious management of their cluster’s worker nodes while helping reduce cost by up to 90%. Spot Ocean’s automated optimization delivers the following benefits:

  • Container-driven autoscaling for the fastest matching of pods with appropriate nodes
  • Easy management of workloads with different resource requirements in a single cluster
  • Intelligent bin-packing for highly utilized nodes and greater cost-efficiency
  • Cost allocation by namespaces, resources, annotations and labels
  • Reliable usage of the optimal blend of spot, reserved and on-demand compute pricing models
  • Automated infrastructure headroom ensuring high availability
  • Right-sizing based on actual pod resource consumption  

Learn more about Spot Ocean today!