Kubernetes Autoscaling

Automatically scale your application workloads based on demand. Codiac integrates with the Kubernetes Horizontal Pod Autoscaler (HPA) to dynamically adjust replica counts based on CPU, memory, or custom metrics.

Web UI Configuration

Autoscaling is configured through the Codiac web UI at app.codiac.io. Navigate to your asset and configure scaling parameters in the visual interface.

What Is Autoscaling?

Autoscaling automatically adjusts the number of running pods (replicas) for an asset based on observed metrics. When load increases, Kubernetes adds more replicas. When load decreases, it scales back down.

Business Value:

  • Cost optimization: Scale down during low traffic, reduce cloud spending by 40-60%
  • Performance: Scale up automatically during demand spikes, maintain responsiveness
  • Zero manual intervention: No need to manually adjust replica counts
  • Resilience: Handle unexpected traffic bursts without capacity planning

How Autoscaling Works

Low Traffic (10% CPU)                 High Traffic (80% CPU)
┌───────┐                             ┌───────┐ ┌───────┐ ┌───────┐
│ Pod 1 │         Scales Up →         │ Pod 1 │ │ Pod 2 │ │ Pod 3 │
└───────┘                             └───────┘ └───────┘ └───────┘
1 replica                             3 replicas

     Load Decreases                        Load Increases Again
           ↓                                        ↑
     Scales Down  ←  [Cooldown Period]  →  Scales Up

Key Concepts (each maps to a field in the manifest sketch after this list):

  • Min Replicas: Minimum number of pods (never scale below this)
  • Max Replicas: Maximum number of pods (never scale above this)
  • Target Metric: CPU/memory utilization percentage that triggers scaling
  • Cooldown Period: Wait time between scaling actions to prevent thrashing
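
These four knobs correspond directly to fields on the Kubernetes HorizontalPodAutoscaler resource. As a minimal sketch (the asset name my-api and the exact values are hypothetical), the HPA behind an autoscaled asset looks roughly like this:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api                        # hypothetical asset name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 2                      # Min Replicas
  maxReplicas: 10                     # Max Replicas
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70      # Target Metric
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # Cooldown Period for scale-downs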

Supported Metrics

| Metric Type | What It Measures | When to Use |
|---|---|---|
| CPU | Average CPU utilization across all pods | Most common; works for CPU-bound apps (APIs, web servers) |
| Memory | Average memory usage across all pods | Memory-intensive apps (caching, data processing) |
| Custom Metrics | Application-specific metrics (requests/sec, queue depth) | Advanced scenarios requiring domain-specific scaling |
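
In HPA terms, CPU and memory are both "Resource" metrics, and a single autoscaler can target one or both. A sketch of the metrics stanza (thresholds are illustrative):

metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70     # scale when average CPU exceeds 70%
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80     # scale when average memory exceeds 80%

When multiple metrics are configured, HPA computes a desired replica count for each and uses the highest.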

Configuring Autoscaling in Web UI

Manage autoscaling through Codiac's visual interface with real-time scaling metrics.

Step 1: Navigate to Asset

  1. Open Codiac web UI at https://app.codiac.io
  2. Select your Enterprise
  3. Select the Environment
  4. Navigate to the Cabinet
  5. Click on your Asset

Step 2: Configure Autoscaler

  1. Click Scaling or Configure tab
  2. Toggle Enable Autoscaling to ON
  3. Configure scaling parameters:

Min Replicas:

  • Minimum number of pods to maintain
  • Recommended: At least 2 for high availability
  • Never scales below this number

Max Replicas:

  • Maximum number of pods allowed
  • Set based on expected peak load
  • Prevents runaway scaling costs

Target CPU Utilization (%):

  • CPU percentage that triggers scaling
  • Recommended: 60-80% for APIs, 80-90% for batch jobs
  • Leave blank to disable CPU-based scaling

Target Memory Utilization (%):

  • Memory percentage that triggers scaling
  • Recommended: 70-85%
  • Leave blank to disable memory-based scaling

Step 3: Save and Deploy

  1. Click Save to persist autoscaler configuration
  2. Click Deploy to apply changes
  3. Monitor deployment progress

Expected Outcome:

  • HPA created in Kubernetes
  • Asset scales based on configured metrics
  • Real-time scaling visible in UI dashboard
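
You can also verify the HPA from the cluster side with kubectl (asset and namespace names are placeholders):

# List autoscalers with current vs. target metrics and replica counts
kubectl get hpa -n <namespace>

# Inspect thresholds, current utilization, and recent scaling events
kubectl describe hpa <asset-name> -n <namespace>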

Example Configurations

High-Traffic API

| Setting | Value | Rationale |
|---|---|---|
| Min Replicas | 3 | High availability across 3 AZs |
| Max Replicas | 50 | Handle 10x normal traffic |
| CPU Target | 60% | Keep latency low |

Background Worker

| Setting | Value | Rationale |
|---|---|---|
| Min Replicas | 1 | Cost optimization |
| Max Replicas | 10 | Limit concurrent jobs |
| CPU Target | 85% | Maximize resource usage |

Caching Layer

| Setting | Value | Rationale |
|---|---|---|
| Min Replicas | 2 | High availability |
| Max Replicas | 10 | Memory-bound |
| Memory Target | 75% | Scale on memory pressure |
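
As a sketch, the caching-layer configuration above corresponds to a memory-based HPA like the following (the name cache is hypothetical):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cache                      # hypothetical asset name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cache
  minReplicas: 2                   # HA
  maxReplicas: 10                  # memory-bound ceiling
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75   # scale on memory pressure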

Monitoring Autoscaler Activity

Scaling Metrics Dashboard

  1. Navigate to asset in web UI
  2. View Scaling Metrics section
  3. See real-time data:
    • Current replica count
    • Current CPU/memory utilization
    • Target thresholds
    • Scaling history (last 24 hours)

Metrics Displayed:

  • Current Replicas: Number of running pods
  • CPU Usage: Average across all pods (e.g., "68% / 70% target")
  • Memory Usage: Average across all pods
  • Last Scaled: Timestamp and action (scaled up/down, from X to Y replicas)

Scaling Event History

View historical scaling events:

  1. Click History or Events tab
  2. Filter by "Scaling Events"
  3. See timeline of scale-up/scale-down actions

Example Events:

  • 14:32 - Scaled up from 3 to 5 replicas (CPU: 82%)
  • 15:45 - Scaled down from 5 to 4 replicas (CPU: 58%)
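
The same events are visible from the cluster side if you have kubectl access (names are placeholders):

# Scaling decisions appear in the HPA's event stream
kubectl describe hpa <asset-name> -n <namespace>

# Or query scaling events directly
kubectl get events -n <namespace> \
  --field-selector involvedObject.kind=HorizontalPodAutoscaler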

Editing Autoscaler Configuration

  1. Navigate to asset Scaling tab
  2. Modify min/max replicas or target percentages
  3. Click Save and Deploy
  4. Changes take effect within 1-2 minutes

Disabling Autoscaling

  1. Navigate to asset Scaling tab
  2. Toggle Enable Autoscaling to OFF
  3. Set Fixed Replicas count
  4. Click Save and Deploy

Result:

  • Autoscaler removed
  • Asset runs with fixed replica count
  • No automatic scaling
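
Under the hood this is roughly equivalent to deleting the HPA and pinning the workload's replica count (a sketch; names and count are placeholders):

# Remove the autoscaler
kubectl delete hpa <asset-name> -n <namespace>

# Pin a fixed replica count on the underlying Deployment
kubectl scale deployment <asset-name> --replicas=3 -n <namespace>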

Autoscaling Best Practices

1. Always Set Resource Requests

Autoscaling requires CPU/memory requests to be defined on containers:

resources:
  requests:
    cpu: 200m        # Required for CPU-based autoscaling
    memory: 256Mi    # Required for memory-based autoscaling
  limits:
    cpu: 500m
    memory: 512Mi

Why: HPA calculates utilization as (actual usage / requested resources). Without requests, HPA doesn't know what "70% CPU" means.
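
Concretely, HPA derives the desired replica count with the standard Kubernetes scaling formula:

desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization)

# Example: 3 replicas running at 90% CPU against a 70% target
# ceil(3 × 90 / 70) = ceil(3.86) = 4 replicas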

2. Start Conservative, Then Optimize

Initial Configuration:

  • Min: 2-3 replicas (for HA)
  • Max: 2-3x your average expected load
  • CPU Target: 70%

Monitor for 1-2 weeks, then adjust:

  • Too many scale-ups? Lower target CPU
  • Too many scale-downs? Raise target CPU
  • Hitting max frequently? Increase max replicas

3. Set Appropriate Min/Max Bounds

Min Replicas:

  • Production: Minimum 2 for zero-downtime deployments
  • Staging/Dev: Can be 1 to save costs
  • HA-Critical Services: Minimum 3 across availability zones

Max Replicas:

  • Calculate: (Peak Traffic / Per-Pod Capacity) × 1.5 to allow a 50% buffer (worked example below)
  • Consider cost limits (max replicas × cost per pod)
  • Set alerts when approaching max
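
For example, with hypothetical traffic numbers:

# Hypothetical: peak traffic 6,000 req/s, each pod sustains ~200 req/s
# (6000 / 200) × 1.5 = 45  →  set Max Replicas to 45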

4. Use CPU for Most Use Cases

CPU-based scaling works best for:

  • Web APIs and microservices
  • Request-driven workloads
  • Most containerized applications

Memory-based scaling better for:

  • Caching layers (Redis, Memcached)
  • In-memory databases
  • Data processing pipelines

5. Avoid Autoscaling Stateful Workloads

Don't autoscale:

  • Databases (use read replicas instead)
  • Stateful sets with persistent volumes
  • Singleton services (queue consumers with global state)

Use autoscaling for:

  • Stateless APIs
  • Background workers (if idempotent)
  • Read-only services

6. Test Scaling Behavior

Before production, verify autoscaling works:

Load Test:

# Start a throwaway pod with an interactive shell
kubectl run -i --tty load-generator --rm --image=busybox --restart=Never -- /bin/sh

# Inside the pod, send continuous requests to drive CPU load on the target service
while true; do wget -q -O- http://my-api-service; done

Watch Scaling:

# Watch replica counts and utilization update live
kubectl get hpa -w -n prod

Expected: Replicas increase within 1-3 minutes as CPU rises.

7. Monitor Costs

Autoscaling can increase cloud costs if not monitored:

  • Set max replica limits
  • Use Zombie Mode for non-prod environments
  • Track cost per replica and set budgets
  • Alert on sustained high replica counts

Common Autoscaling Patterns

Pattern 1: User-Facing Web API

Min: 3 replicas (HA across 3 AZs)
Max: 50 replicas (handles 10x normal traffic)
CPU Target: 60% (low latency)

Pattern 2: Background Job Processor

Min: 1 replica (cost optimization)
Max: 20 replicas (limit concurrent jobs)
CPU Target: 85% (maximize resource usage)

Pattern 3: Caching Layer

Min: 2 replicas (HA)
Max: 10 replicas (memory-bound)
Memory Target: 75%

Pattern 4: Event-Driven Microservice

Min: 2 replicas (HA)
Max: 100 replicas (burst capacity)
CPU Target: 70%
Custom Metric: Messages in queue

Advanced: Custom Metrics

For autoscaling based on application-specific metrics (e.g., queue depth, requests/second), Kubernetes supports custom metrics via adapters.

Common Custom Metrics:

  • RabbitMQ queue depth
  • HTTP requests per second
  • Active connections
  • Custom business metrics

Setup: Requires installing a metrics adapter (Prometheus Adapter, Datadog, etc.) and configuring the HPA with a custom metric source.
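
Once an adapter is in place, the HPA references the custom metric in its spec. A sketch assuming an adapter that exposes a queue-depth metric (the metric name messages_in_queue and the threshold are hypothetical):

metrics:
  - type: External
    external:
      metric:
        name: messages_in_queue    # hypothetical adapter-exposed metric
      target:
        type: AverageValue
        averageValue: "30"         # aim for ~30 queued messages per pod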

Documentation: Kubernetes HPA Walkthrough


FAQ

Q: How quickly does autoscaling react to load changes?

A: HPA checks metrics every 15 seconds by default. Scale-up typically happens within 1-3 minutes. Scale-down is more conservative (a 5-minute stabilization window by default) to avoid thrashing.

Q: Can I schedule scaling (e.g., scale up during business hours)?

A: HPA is reactive, not scheduled. For predictive scaling, use CronJobs to adjust min/max replicas based on time of day, or use Kubernetes Vertical Pod Autoscaler (VPA) for right-sizing.
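
A common pattern for the CronJob approach is patching the HPA's minReplicas on a schedule. A minimal sketch, assuming a service account hpa-patcher with RBAC permission to patch HPAs (all names and the schedule are hypothetical):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-business-hours
spec:
  schedule: "0 8 * * 1-5"                 # 08:00, Monday-Friday
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: hpa-patcher # needs RBAC to patch HPAs
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - kubectl
                - patch
                - hpa
                - my-api
                - -p
                - '{"spec":{"minReplicas":5}}'

A matching evening CronJob would patch minReplicas back down after business hours.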

Q: What's the difference between HPA and VPA?

A: HPA (Horizontal Pod Autoscaler) adds/removes pods. VPA (Vertical Pod Autoscaler) adjusts CPU/memory limits per pod. HPA is more common.

Q: Does autoscaling work during deployments?

A: Yes, HPA continues to operate during rolling updates. New pods are added to the replica count as they become ready.

Q: Can I autoscale to zero replicas?

A: No, min replicas must be at least 1. For zero-scaling, use Knative or KEDA (event-driven autoscaling).



Need help optimizing autoscaling? Contact Support or check our performance tuning guide.