
Kubernetes Autoscaling

Kubernetes autoscaling with HPA, VPA, Cluster Autoscaler, and event-driven scaling with KEDA

Quick Summary
You are an expert in Kubernetes autoscaling for containerized application development and deployment.

## Key Points

- **VPA update modes** — **Off**: only produces recommendations (safe for initial analysis); **Initial**: sets requests only at pod creation; **Auto**: evicts and recreates pods to apply new requests.
- Always set `minReplicas` to at least 2 in production to maintain availability during scaling events and node failures.
- Configure `scaleDown.stabilizationWindowSeconds` on HPA to prevent rapid scale-down oscillation (flapping) during variable traffic.
- Use KEDA for workloads driven by queue depth or external events rather than forcing CPU-based HPA onto inherently event-driven systems.
- Avoid running HPA without setting resource requests on pods; HPA cannot compute utilization percentages without a baseline request value.
- Avoid using VPA in Auto mode alongside HPA on CPU, which creates a feedback loop where VPA adjusts requests and HPA reacts to the changed utilization ratio.

## Quick Example

```bash
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify
kubectl top pods -n production
```

```bash
# Install KEDA
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace
```

Kubernetes Autoscaling — Containerization

You are an expert in Kubernetes autoscaling for containerized application development and deployment.

Overview

Kubernetes autoscaling adjusts compute resources to match demand. Horizontal Pod Autoscaler (HPA) changes replica count, Vertical Pod Autoscaler (VPA) adjusts CPU/memory requests, Cluster Autoscaler adds or removes nodes, and KEDA enables event-driven scaling from external sources like message queues and databases.

Core Concepts

Horizontal Pod Autoscaler (HPA)

HPA scales the number of pod replicas based on observed metrics:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
        - type: Pods
          value: 4
          periodSeconds: 60
      selectPolicy: Max
```
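Under the hood, the HPA controller applies a simple ratio formula (documented in the Kubernetes HPA algorithm details): desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A minimal Python sketch using the bounds from the manifest above (the function name is ours, for illustration only):

```python
import math

def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """HPA core formula: ceil(current * currentMetric / targetMetric),
    clamped to [minReplicas, maxReplicas]."""
    desired = math.ceil(current * current_metric / target_metric)
    return max(min_replicas, min(desired, max_replicas))

# 5 replicas averaging 91% CPU against a 70% target -> scale out to 7
print(desired_replicas(5, 91, 70))  # 7
```

Note how the clamp makes minReplicas and maxReplicas hard bounds: no metric value can push the result outside them.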

HPA requires the Metrics Server to be installed in the cluster:

```bash
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify
kubectl top pods -n production
```

Vertical Pod Autoscaler (VPA)

VPA recommends or automatically adjusts CPU and memory requests:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Auto"  # Off, Initial, Recreate, or Auto
  resourcePolicy:
    containerPolicies:
      - containerName: app
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "4"
          memory: 4Gi
        controlledResources:
          - cpu
          - memory
```

Modes:

  • Off: Only produces recommendations (safe for initial analysis).
  • Initial: Sets requests only at pod creation.
  • Recreate: Sets requests at creation and evicts pods when recommendations change.
  • Auto: Currently behaves like Recreate; evicts and recreates pods to apply new requests.
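Whatever mode is active, the recommender's output is clamped by the resourcePolicy bounds. A toy illustration of that clamping, using the 100m–4000m CPU range from the manifest above (values in millicores; the helper name is ours, not a VPA API):

```python
def clamp_cpu_recommendation(recommended_m: int, min_allowed_m: int = 100,
                             max_allowed_m: int = 4000) -> int:
    """Clamp a recommended CPU request to the VPA resourcePolicy bounds."""
    return max(min_allowed_m, min(recommended_m, max_allowed_m))

print(clamp_cpu_recommendation(50))     # below minAllowed -> 100
print(clamp_cpu_recommendation(6000))   # above maxAllowed -> 4000
```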

Cluster Autoscaler

Cluster Autoscaler adjusts the number of nodes when pods are unschedulable due to insufficient resources:

```yaml
# Example: AWS EKS managed node group with autoscaling
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-east-1
managedNodeGroups:
  - name: workers
    instanceType: m6i.xlarge
    minSize: 2
    maxSize: 10
    desiredCapacity: 3
    labels:
      role: worker
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/my-cluster: "owned"
```
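On the scale-down side, Cluster Autoscaler watches node utilization: a node whose summed pod requests stay below a threshold (the `--scale-down-utilization-threshold` flag, default 0.5) and whose pods can all be rescheduled elsewhere becomes a removal candidate. A simplified sketch of just the utilization check (the reschedulability simulation is omitted; the function name is ours):

```python
def below_utilization_threshold(requested_m: int, allocatable_m: int,
                                threshold: float = 0.5) -> bool:
    """True when sum of pod CPU requests / node allocatable CPU (millicores)
    is under the scale-down utilization threshold."""
    return requested_m / allocatable_m < threshold

# An m6i.xlarge has 4 vCPU (~4000m, ignoring system reservations)
print(below_utilization_threshold(1200, 4000))  # True: 30% requested
print(below_utilization_threshold(3000, 4000))  # False: 75% requested
```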

KEDA (Kubernetes Event-Driven Autoscaling)

KEDA scales workloads based on external event sources:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0    # Scale to zero when idle
  maxReplicaCount: 50
  pollingInterval: 15
  cooldownPeriod: 120
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
        queueLength: "5"    # Target messages per replica
        awsRegion: us-east-1
      authenticationRef:
        name: aws-credentials
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: aws-credentials
  namespace: production
spec:
  secretTargetRef:
    - parameter: awsAccessKeyID
      name: aws-secret
      key: access-key-id
    - parameter: awsSecretAccessKey
      name: aws-secret
      key: secret-access-key
```
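KEDA feeds the queue metric into the same HPA machinery, so for a queue trigger the sizing works out to roughly ceil(queueLength / target messages per replica), clamped between minReplicaCount and maxReplicaCount. A sketch with the values from the ScaledObject above (the function name is ours, for illustration):

```python
import math

def keda_queue_replicas(queue_length: int, target_per_replica: int = 5,
                        min_replicas: int = 0, max_replicas: int = 50) -> int:
    """Approximate KEDA sizing: one replica per target_per_replica messages."""
    desired = math.ceil(queue_length / target_per_replica)
    return max(min_replicas, min(desired, max_replicas))

print(keda_queue_replicas(42))   # 9 replicas for 42 queued messages
print(keda_queue_replicas(0))    # 0: scale to zero when the queue is empty
```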

KEDA with Prometheus metrics:

```yaml
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      metricName: http_requests_per_second
      query: sum(rate(http_requests_total{service="api"}[2m]))
      threshold: "100"
```

Implementation Patterns

Combining HPA and VPA

HPA and VPA can conflict when both try to adjust the same resource. The recommended approach:

```yaml
# VPA in recommendation-only mode
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Off"  # Only recommendations, no auto-update
```

Use VPA recommendations to periodically tune the resource requests in your deployment manifest, then let HPA handle scaling replica count.

Custom Metrics with HPA

Scale on application-specific metrics using the Prometheus adapter:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
```
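With a Pods metric and an AverageValue target, the controller sums the metric across pods and divides by the target, so the effective sizing is ceil(totalMetric / averageValue), clamped to the replica bounds. A quick check against the 100 req/s-per-pod target above (the function name is ours):

```python
import math

def replicas_for_average_value(total_metric: float, average_value: float = 100,
                               min_replicas: int = 2, max_replicas: int = 30) -> int:
    """Replicas needed so each pod averages at most average_value."""
    desired = math.ceil(total_metric / average_value)
    return max(min_replicas, min(desired, max_replicas))

print(replicas_for_average_value(2350))  # ceil(23.5) -> 24 replicas
```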

Scale-to-Zero with KEDA

KEDA can scale deployments to zero replicas during idle periods and quickly spin up when events arrive, which is ideal for batch processors and async workers:

```bash
# Install KEDA
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace
```

Best Practices

  • Always set minReplicas to at least 2 in production to maintain availability during scaling events and node failures.
  • Configure scaleDown.stabilizationWindowSeconds on HPA to prevent rapid scale-down oscillation (flapping) during variable traffic.
  • Use KEDA for workloads driven by queue depth or external events rather than forcing CPU-based HPA onto inherently event-driven systems.

Core Philosophy

Autoscaling is about matching capacity to demand automatically, but it is not about removing human judgment from capacity planning. Every autoscaler needs well-chosen parameters: minimum and maximum bounds, target thresholds, and stabilization windows. These parameters encode your understanding of the application's behavior, cost constraints, and availability requirements. Setting them requires profiling your application under realistic load, not guessing.

Horizontal scaling (adding more replicas) should be the default strategy for stateless workloads. It is safer than vertical scaling because adding a replica does not disrupt existing ones, while VPA's auto mode evicts and recreates pods to apply new resource requests. HPA scales in seconds, is well-understood, and works reliably with the built-in Metrics Server. Vertical scaling is most useful in recommendation-only mode, where it informs your static resource requests rather than continuously adjusting them.

Scale-down behavior is more important than scale-up behavior. Scaling up quickly is straightforward: detect demand, add capacity. Scaling down too aggressively, however, causes thrashing: pods are terminated, traffic spikes again, new pods are created, and the cycle repeats. The stabilizationWindowSeconds and scale-down policies exist specifically to prevent this oscillation. Set conservative scale-down windows (5-10 minutes for production) and aggressive scale-up policies to handle traffic bursts without whiplash.
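The stabilization mechanism itself is simple: for scale-down, the controller uses the highest desired-replica count computed during the window, so a brief dip in load cannot remove capacity. A toy model of that behavior (the class name is ours, not a Kubernetes API):

```python
from collections import deque

class ScaleDownStabilizer:
    """Track desired-replica recommendations and return the max seen
    inside the stabilization window (mirrors stabilizationWindowSeconds)."""
    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.history = deque()  # (timestamp, desired) pairs

    def stabilized(self, now: float, desired: int) -> int:
        self.history.append((now, desired))
        # Drop recommendations older than the window
        while self.history and self.history[0][0] < now - self.window:
            self.history.popleft()
        return max(d for _, d in self.history)

s = ScaleDownStabilizer(window_seconds=300)
print(s.stabilized(0, 10))    # 10
print(s.stabilized(60, 4))    # 10: the dip is still inside the window
print(s.stabilized(400, 4))   # 4: the earlier peak has aged out
```

This is why a longer window makes scale-down more conservative: an old peak keeps the floor high until it ages out.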

Anti-Patterns

  • Setting HPA without resource requests. HPA calculates utilization as a percentage of the requested resources. Without resource requests, there is no baseline, and HPA cannot compute a utilization ratio. It will either fail to scale or behave unpredictably.

  • Running VPA in Auto mode alongside HPA on the same metric. VPA auto mode adjusts resource requests, which changes the denominator in HPA's utilization calculation, which triggers HPA to scale, which changes load per pod, which triggers VPA again. This feedback loop causes constant pod churn. Use VPA in Off or recommendation mode alongside HPA.

  • Setting minReplicas: 1 in production. A single replica means any pod disruption (node failure, deployment rollout, eviction) causes a service outage. Always run at least 2 replicas in production for availability during scaling events and maintenance.

  • Using CPU-based HPA for queue-driven workloads. A worker that processes messages from a queue may have low CPU utilization even when the queue is deeply backed up. CPU-based HPA will not scale it. Use KEDA with queue-depth triggers for event-driven workloads.

  • Ignoring scale-down stabilization. Without a stabilization window, HPA reacts instantly to every dip in load, terminating pods that may be needed moments later when the next request burst arrives. The default 5-minute stabilization window exists for a reason; shortening it should be a deliberate, tested decision.

Common Pitfalls

  • Running HPA without setting resource requests on pods; HPA cannot compute utilization percentages without a baseline request value.
  • Using VPA in Auto mode alongside HPA on CPU, which creates a feedback loop where VPA adjusts requests and HPA reacts to the changed utilization ratio.
