Kubernetes Autoscaling: A Deep Dive into HPA, VPA, and KEDA

Sarah Chen

Head of Engineering

September 5, 2024

Why Autoscaling Is Critical for Cloud-Native Applications

Autoscaling is one of the most powerful capabilities that Kubernetes provides, yet it is also one of the most commonly misconfigured. The promise of autoscaling is simple: automatically adjust the resources allocated to your applications based on actual demand, ensuring optimal performance during traffic spikes while minimizing costs during quiet periods. In practice, achieving this balance requires a deep understanding of the different autoscaling mechanisms available in Kubernetes, their strengths and limitations, and how to configure them correctly for your specific workloads.

At Primates, our Kubernetes clusters serve highly variable workloads. Our event ingestion endpoints handle baseline traffic of approximately fifty thousand requests per second during off-peak hours, which can spike to over three hundred thousand requests per second during peak periods triggered by customer batch processing jobs and marketing campaign launches. Our analytics query engine has different scaling characteristics—it needs to scale based on CPU and memory consumption rather than request rate. And our event processing pipeline, built on Flink, needs to scale based on Kafka consumer lag rather than traditional resource metrics. Each of these workloads requires a different autoscaling strategy, and getting the configuration right is essential for maintaining both performance and cost efficiency.

In this article, I will provide a comprehensive guide to the three primary autoscaling mechanisms in the Kubernetes ecosystem: the Horizontal Pod Autoscaler, the Vertical Pod Autoscaler, and KEDA for event-driven autoscaling. For each mechanism, I will explain how it works, when to use it, how to configure it correctly, and share real-world examples from our production environment. By the end of this article, you should have a clear understanding of which autoscaling approach is right for each of your workloads and how to implement it effectively.

Horizontal Pod Autoscaler (HPA)

The Horizontal Pod Autoscaler is the most commonly used autoscaling mechanism in Kubernetes. It works by automatically adjusting the number of pod replicas in a deployment, replica set, or stateful set based on observed metrics. When the target metric exceeds the configured threshold, HPA scales out by adding more pod replicas. When the metric drops below the threshold, HPA scales in by removing replicas. This horizontal scaling approach is well-suited for stateless workloads that can distribute load across multiple instances.

HPA supports scaling based on CPU utilization, memory utilization, and custom metrics exposed through the Kubernetes metrics API. The most straightforward configuration scales based on CPU utilization, which works well for compute-bound workloads. Here is an example HPA configuration that we use for our API gateway service:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 5
  maxReplicas: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 5
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 120
      selectPolicy: Min
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"

Several aspects of this configuration deserve attention. First, we set the target CPU utilization to sixty-five percent rather than a higher value like eighty or ninety percent. This headroom is intentional—it provides a buffer that allows the service to handle short-lived traffic bursts while HPA is still in the process of scaling out new replicas. If the target is set too high, there is not enough headroom to absorb traffic during the scaling delay, which can lead to degraded performance or errors during rapid traffic increases.
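To see why this headroom matters, it helps to recall the formula the HPA controller uses to compute the desired replica count (as documented in the Kubernetes autoscaling reference):

```latex
\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
```

For example, if ten replicas are averaging 91% CPU utilization against our 65% target, the controller computes ceil(10 × 91 / 65) = 14 replicas. Until those four new pods are started and ready, the existing ten must absorb the excess load, which is exactly what the 35% headroom is there to cover.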

Second, we configure asymmetric scaling behavior: fast scale-up and slow scale-down. The scale-up policy allows adding up to fifty percent more replicas per minute, enabling rapid response to traffic spikes. The scale-down policy is much more conservative, removing only ten percent of replicas per two-minute period with a five-minute stabilization window. This asymmetry prevents the "flapping" problem where HPA rapidly scales up and down in response to oscillating metrics, which wastes resources and can cause service instability.

Common HPA Pitfalls

Despite its apparent simplicity, HPA has several pitfalls that catch even experienced Kubernetes operators:

  • Missing resource requests: HPA requires that pods have resource requests defined in their container spec. Without resource requests, HPA cannot calculate utilization percentages and will not function. Always define both CPU and memory requests for pods that will be managed by HPA.
  • Metrics server delays: The Kubernetes metrics server collects resource metrics every fifteen seconds by default, and HPA evaluates metrics every fifteen seconds. This means there can be up to a thirty-second delay between a load change and the first scaling decision. For workloads with very spiky traffic, this delay can be significant.
  • Pod startup time: HPA adds replicas, but those replicas are not useful until they are fully started and ready to serve traffic. If your pods take sixty seconds to start, then the effective scaling delay is the HPA evaluation delay plus the pod startup time. Optimizing pod startup time is critical for effective autoscaling.
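The first pitfall is worth illustrating. A minimal container spec with the requests HPA needs is sketched below; the image name and values are illustrative, not our actual configuration:

```yaml
# Without these requests, HPA cannot compute Utilization targets
# and the autoscaler will report its metrics as unavailable.
spec:
  containers:
    - name: api-gateway
      image: example/api-gateway:1.0   # illustrative image
      resources:
        requests:
          cpu: "500m"      # baseline used for CPU utilization percentage
          memory: "512Mi"  # baseline used for memory utilization percentage
        limits:
          memory: "1Gi"
```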

"Autoscaling is not a set-and-forget configuration. It requires ongoing tuning based on actual traffic patterns, application performance characteristics, and cost optimization goals. Treat your autoscaling configuration as code that evolves with your application." — Kelsey Hightower

Vertical Pod Autoscaler (VPA)

While HPA adjusts the number of pod replicas, the Vertical Pod Autoscaler adjusts the CPU and memory requests and limits for individual pods. VPA monitors the actual resource usage of pods over time and automatically recommends or applies updated resource requests that better match the workload's actual needs. This is particularly valuable for workloads where the correct resource allocation is difficult to determine in advance, such as Java applications with variable heap requirements or data processing jobs with unpredictable memory consumption patterns.

VPA operates in three modes: "Off" mode where it only generates recommendations without applying them, "Initial" mode where it sets resource requests only when pods are created, and "Auto" mode where it can evict and recreate pods with updated resource requests. We use VPA in "Off" mode for production workloads that are also managed by HPA—since VPA and HPA can conflict when both try to manage CPU resources simultaneously—and in "Auto" mode for batch processing workloads and development environments where pod recreation is acceptable.

The combination of HPA and VPA requires careful consideration. As a general rule, use HPA for scaling based on CPU utilization and use VPA only for memory right-sizing in the same deployment. Alternatively, if using both, configure HPA to scale based on custom metrics rather than CPU utilization to avoid conflicts with VPA's CPU recommendations. The following table summarizes when to use each autoscaler:

Scenario                    | Recommended Autoscaler | Scaling Dimension    | Key Consideration
--------------------------- | ---------------------- | -------------------- | ------------------------------------
Stateless web services      | HPA                    | Replica count        | Scale on requests per second or CPU
Memory-intensive batch jobs | VPA                    | Resource requests    | Auto mode acceptable for batch
Stateful databases          | VPA (Off mode)         | Recommendations only | Manual review before applying
Event-driven processors     | KEDA                   | Replica count        | Scale on queue depth or lag
Mixed workloads             | HPA + VPA              | Both                 | Avoid CPU metric conflicts
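For the mixed-workloads row, the conflict avoidance can be expressed directly in the VPA object by restricting it to memory, leaving CPU entirely to HPA. A sketch, with illustrative names:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-gateway-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"]  # VPA right-sizes memory only; CPU stays with HPA
```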

Event-Driven Autoscaling with KEDA

KEDA, the Kubernetes Event-Driven Autoscaler, extends Kubernetes autoscaling to support scaling based on events from external sources such as message queues, databases, and custom metrics endpoints. While HPA can scale based on custom metrics through the metrics adapter API, KEDA provides a much simpler and more flexible way to define scaling triggers and supports a rich ecosystem of pre-built scalers for popular event sources including Kafka, RabbitMQ, Azure Service Bus, AWS SQS, Redis, PostgreSQL, and many more.

We use KEDA extensively for our event processing pipeline, where the appropriate number of processing instances depends on the volume of unprocessed events in our Kafka topics rather than traditional resource utilization metrics. KEDA monitors the consumer lag for our Kafka consumer groups and scales our Flink processing instances accordingly—adding instances when lag grows and removing them when processing catches up. This event-driven approach ensures that our processing capacity matches the actual event volume, rather than maintaining a fixed number of instances that may be over-provisioned during quiet periods or under-provisioned during peak periods.

KEDA can also scale deployments to zero replicas when there are no events to process, which is particularly valuable for cost optimization of workloads with intermittent traffic patterns. Scaling to zero is not possible with native HPA, which requires a minimum of one replica. For our internal batch processing jobs that run several times per day, KEDA's scale-to-zero capability reduces compute costs by approximately seventy percent compared to maintaining a minimum of one always-on replica.
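A KEDA ScaledObject for a Kafka-lag-driven workload of this kind is sketched below. The deployment name, broker address, topic, and threshold are illustrative; this assumes KEDA is installed in the cluster:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: event-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: event-processor        # the Deployment to scale
  minReplicaCount: 0             # scale to zero when no events are pending
  maxReplicaCount: 20
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.production.svc:9092
        consumerGroup: event-processor-group
        topic: events
        lagThreshold: "1000"     # target unprocessed messages per replica
```

KEDA translates this trigger into an HPA behind the scenes, but it also manages the zero-to-one and one-to-zero transitions that native HPA cannot perform.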

Best Practices

  1. Start with HPA for stateless services, using CPU utilization as the primary metric.
  2. Configure asymmetric scaling behavior: fast scale-up, slow scale-down.
  3. Set resource requests accurately—VPA in recommendation mode can help determine appropriate values.
  4. Use KEDA for event-driven workloads that scale based on queue depth or consumer lag.
  5. Monitor autoscaling behavior continuously and tune configurations based on observed patterns.
  6. Test autoscaling behavior under load before relying on it in production.

Performance Benchmarks

To help you set realistic expectations, here are performance benchmarks from our production autoscaling configurations. Our API gateway, managed by HPA with a target CPU utilization of sixty-five percent, scales from five to thirty-five replicas during peak traffic, with an average scaling response time of forty-five seconds from traffic increase to new pods serving requests. Our event processing pipeline, managed by KEDA with a Kafka lag trigger, scales from two to twenty replicas based on consumer lag, with an average scaling response time of sixty seconds. Our batch processing jobs scale from zero to ten replicas within ninety seconds of events appearing in their input queues.

These benchmarks highlight an important reality: autoscaling is not instantaneous. There is always a delay between a load increase and the availability of additional capacity, and this delay must be accounted for in your application architecture. Strategies for mitigating scaling delay include maintaining a modest minimum replica count that can handle baseline load, implementing request queuing or buffering at the ingestion layer, and using predictive scaling policies that anticipate traffic increases based on historical patterns. The right combination of these strategies depends on your application's latency requirements, cost constraints, and traffic patterns.

Kubernetes autoscaling is a powerful capability that, when configured correctly, can dramatically improve both the performance and cost efficiency of your cloud-native applications. The key is to understand the strengths and limitations of each autoscaling mechanism, choose the right approach for each workload, and invest in ongoing monitoring and tuning to ensure that your autoscaling configurations remain optimal as your applications and traffic patterns evolve.

About the Author

Sarah Chen

Head of Engineering

Sarah Chen is the Head of Engineering at Primates, where she leads the platform infrastructure and distributed systems teams. With over fifteen years of experience building large-scale systems at companies including Google and Stripe, Sarah specializes in designing fault-tolerant architectures that handle billions of requests daily. She holds a Ph.D. in Computer Science from MIT and is a frequent speaker at distributed systems conferences worldwide.
