# Kubernetes Autoscaling
Kubernetes autoscaling with HPA, VPA, Cluster Autoscaler, and event-driven scaling with KEDA
You are an expert in Kubernetes autoscaling for containerized application development and deployment.
## Overview
Kubernetes autoscaling adjusts compute resources to match demand. Horizontal Pod Autoscaler (HPA) changes replica count, Vertical Pod Autoscaler (VPA) adjusts CPU/memory requests, Cluster Autoscaler adds or removes nodes, and KEDA enables event-driven scaling from external sources like message queues and databases.
## Core Concepts

### Horizontal Pod Autoscaler (HPA)
HPA scales the number of pod replicas based on observed metrics:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 25
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
      - type: Pods
        value: 4
        periodSeconds: 60
      selectPolicy: Max
```
HPA requires the Metrics Server to be installed in the cluster:
```bash
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify
kubectl top pods -n production
```
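Once metrics are flowing, you can watch the HPA make decisions. A quick check against the `web-app-hpa` example above (substitute your own HPA name and namespace):

```shell
# Watch current vs. target metrics and the replica count as they change
kubectl get hpa web-app-hpa -n production --watch

# Inspect scaling events and any errors fetching metrics
kubectl describe hpa web-app-hpa -n production
```

If the TARGETS column shows `<unknown>`, the Metrics Server is not reporting for those pods or the pods lack resource requests.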
### Vertical Pod Autoscaler (VPA)
VPA recommends or automatically adjusts CPU and memory requests:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Auto"  # Off, Initial, Recreate, or Auto
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "4"
        memory: 4Gi
      controlledResources:
      - cpu
      - memory
```
Modes:
- `Off`: Only produces recommendations (safe for initial analysis).
- `Initial`: Sets requests only at pod creation.
- `Recreate`: Evicts and recreates pods whenever recommendations change.
- `Auto`: Evicts and recreates pods to apply new requests (currently equivalent to `Recreate`).
### Cluster Autoscaler
Cluster Autoscaler adjusts the number of nodes when pods are unschedulable due to insufficient resources:
```yaml
# Example: AWS EKS managed node group with autoscaling
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-east-1
managedNodeGroups:
- name: workers
  instanceType: m6i.xlarge
  minSize: 2
  maxSize: 10
  desiredCapacity: 3
  labels:
    role: worker
  tags:
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/my-cluster: "owned"
```
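To see why the autoscaler did or did not act, its status is available in-cluster. By default the Cluster Autoscaler writes a status ConfigMap in `kube-system` summarizing node-group health and recent scale activity:

```shell
# Recent scale-up/scale-down decisions and per-node-group status
kubectl -n kube-system describe configmap cluster-autoscaler-status

# Events explain why pending pods did not trigger a scale-up
kubectl get events -A --field-selector reason=NotTriggerScaleUp
```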
### KEDA (Kubernetes Event-Driven Autoscaling)
KEDA scales workloads based on external event sources:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0  # Scale to zero when idle
  maxReplicaCount: 50
  pollingInterval: 15
  cooldownPeriod: 120
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
      queueLength: "5"  # Target messages per replica
      awsRegion: us-east-1
    authenticationRef:
      name: aws-credentials
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: aws-credentials
  namespace: production
spec:
  secretTargetRef:
  - parameter: awsAccessKeyID
    name: aws-secret
    key: access-key-id
  - parameter: awsSecretAccessKey
    name: aws-secret
    key: secret-access-key
```
KEDA with Prometheus metrics:
```yaml
triggers:
- type: prometheus
  metadata:
    serverAddress: http://prometheus.monitoring:9090
    metricName: http_requests_per_second
    query: sum(rate(http_requests_total{service="api"}[2m]))
    threshold: "100"
```
## Implementation Patterns

### Combining HPA and VPA
HPA and VPA can conflict when both try to adjust the same resource. The recommended approach:
```yaml
# VPA in recommendation-only mode
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Off"  # Only recommendations, no auto-update
```
Use VPA recommendations to periodically tune the resource requests in your deployment manifest, then let HPA handle scaling replica count.
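The recommendations themselves are stored in the VPA object's status. To read them (using the `web-app-vpa` name from the example above):

```shell
# Target, lower-bound, and upper-bound request recommendations per container
kubectl describe vpa web-app-vpa
```

The "Target" values are a sensible starting point for the requests you commit to the deployment manifest; the lower/upper bounds indicate how much the observed usage varies.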
### Custom Metrics with HPA
Scale on application-specific metrics using the Prometheus adapter:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 30
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
```
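For this HPA to work, `http_requests_per_second` must be exposed through the custom metrics API. With the Prometheus adapter, a discovery rule along these lines maps a Prometheus counter onto that per-pod metric name (a sketch; the exact placement of `rules` in your Helm values depends on the adapter version you deploy):

```yaml
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```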
### Scale-to-Zero with KEDA
KEDA can scale deployments to zero replicas during idle periods and quickly spin up when events arrive, which is ideal for batch processors and async workers:
```bash
# Install KEDA
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace
```
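After installing, you can confirm the operator is healthy and observe scale-to-zero in action. KEDA creates and manages an HPA for each ScaledObject, so with `minReplicaCount: 0` and an empty queue the target deployment should sit at zero replicas (names below match the `order-processor` example):

```shell
# Verify the KEDA operator and metrics server pods are running
kubectl get pods -n keda

# KEDA manages an HPA per ScaledObject; check its readiness and activity
kubectl get scaledobject,hpa -n production

# With no pending messages, the worker should report 0/0 replicas
kubectl get deployment order-processor -n production
```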
## Best Practices

- Always set `minReplicas` to at least 2 in production to maintain availability during scaling events and node failures.
- Configure `scaleDown.stabilizationWindowSeconds` on HPA to prevent rapid scale-down oscillation (flapping) during variable traffic.
- Use KEDA for workloads driven by queue depth or external events rather than forcing CPU-based HPA onto inherently event-driven systems.
## Core Philosophy
Autoscaling is about matching capacity to demand automatically, but it is not about removing human judgment from capacity planning. Every autoscaler needs well-chosen parameters: minimum and maximum bounds, target thresholds, and stabilization windows. These parameters encode your understanding of the application's behavior, cost constraints, and availability requirements. Setting them requires profiling your application under realistic load, not guessing.
Horizontal scaling (adding more replicas) should be the default strategy for stateless workloads. It is safer than vertical scaling because adding a replica does not disrupt existing ones, while VPA's auto mode evicts and recreates pods to apply new resource requests. HPA scales in seconds, is well-understood, and works reliably with the built-in Metrics Server. Vertical scaling is most useful in recommendation-only mode, where it informs your static resource requests rather than continuously adjusting them.
Scale-down behavior is more important than scale-up behavior. Scaling up quickly is straightforward: detect demand, add capacity. Scaling down too aggressively, however, causes thrashing: pods are terminated, traffic spikes again, new pods are created, and the cycle repeats. The `stabilizationWindowSeconds` and scale-down policies exist specifically to prevent this oscillation. Set conservative scale-down windows (5-10 minutes for production) and aggressive scale-up policies to handle traffic bursts without whiplash.
## Anti-Patterns

- **Setting HPA without resource requests.** HPA calculates utilization as a percentage of the requested resources. Without resource requests, there is no baseline, and HPA cannot compute a utilization ratio. It will either fail to scale or behave unpredictably.
- **Running VPA in Auto mode alongside HPA on the same metric.** VPA auto mode adjusts resource requests, which changes the denominator in HPA's utilization calculation, which triggers HPA to scale, which changes load per pod, which triggers VPA again. This feedback loop causes constant pod churn. Use VPA in Off (recommendation-only) mode alongside HPA.
- **Setting `minReplicas: 1` in production.** A single replica means any pod disruption (node failure, deployment rollout, eviction) causes a service outage. Always run at least 2 replicas in production for availability during scaling events and maintenance.
- **Using CPU-based HPA for queue-driven workloads.** A worker that processes messages from a queue may have low CPU utilization even when the queue is deeply backed up, so CPU-based HPA will not scale it. Use KEDA with queue-depth triggers for event-driven workloads.
- **Ignoring scale-down stabilization.** Without a stabilization window, HPA reacts instantly to every dip in load, terminating pods that may be needed moments later when the next request burst arrives. The default 5-minute stabilization window exists for a reason; shortening it should be a deliberate, tested decision.
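The first anti-pattern is worth showing concretely. A minimal sketch of the container-level requests HPA needs as its utilization baseline, reusing the `web-app` name from the earlier examples (image tag and request values are placeholders to be replaced with profiled numbers):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: app
        image: registry.example.com/web-app:1.0  # placeholder
        resources:
          requests:        # the denominator in HPA's utilization math
            cpu: 250m
            memory: 256Mi
          limits:
            memory: 512Mi
```

With `cpu: 250m` requested and a 70% target, HPA scales up when average pod CPU usage exceeds 175m.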
## Common Pitfalls
- Running HPA without setting resource requests on pods; HPA cannot compute utilization percentages without a baseline request value.
- Using VPA in Auto mode alongside HPA on CPU, which creates a feedback loop where VPA adjusts requests and HPA reacts to the changed utilization ratio.
## Related Skills

- **Container Registries**: Container registry setup, authentication, and image management for ECR, GCR, GHCR, and Docker Hub
- **Container Security**: Container image scanning, runtime hardening, and security best practices for production workloads
- **Docker Compose**: Docker Compose configuration for multi-service development, testing, and local orchestration
- **Docker Networking**: Docker networking modes, custom networks, DNS resolution, and multi-host connectivity patterns
- **Dockerfile Optimization**: Multi-stage builds, layer caching, and image size optimization for production Docker images
- **Helm Charts**: Helm chart creation, templating, dependency management, and release lifecycle for Kubernetes