# Scaling
## Goals
- Keep latency within SLO while maintaining cost efficiency.
- Default to horizontal scale (more replicas) and only scale vertically (bigger vCPU/RAM) when necessary.
- Use data (response times, CPU/memory, instance counts, and cost) to tune decisions.
## Default Policy
**Scalers:** two complementary ACA scalers run in parallel, HTTP concurrency and CPU utilization.
- HTTP concurrency target: `50` concurrent requests per replica.
- CPU utilization target: `80%`. This scaler fires earlier than HTTP concurrency for CPU-bound sync bursts, where per-replica CPU saturates before the concurrent request count climbs.
- Either scaler can independently trigger a scale-out; both must drop below their thresholds before scale-in begins.
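Rendered into a Container App resource, the two rules could look roughly like the sketch below. This is illustrative, not the actual module output: the rule names are made up, and the `cooldownPeriod` property only exists in recent ACA API versions.

```bicep
scale: {
  minReplicas: 1
  maxReplicas: 10
  cooldownPeriod: 600 // scale-down cooldown; requires a recent ACA API version
  rules: [
    {
      name: 'http-concurrency'
      http: {
        metadata: {
          concurrentRequests: '50'
        }
      }
    }
    {
      name: 'cpu-utilization'
      custom: {
        type: 'cpu'
        metadata: {
          type: 'Utilization'
          value: '80'
        }
      }
    }
  ]
}
```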
**Replica bounds:**

- `minReplicas: 1` for public APIs (avoid cold start).
- `maxReplicas`: set per observed peak traffic (public API: `10`).

**Scale-down cooldown:** `600` seconds for the public API. This prevents premature scale-in during intra-burst traffic dips (brief drops in the middle of an active sync workload).

**Preference:**
- Scale horizontally first for cost efficiency: adding replicas only incurs cost while they run under load.
- Scale vertically later if a single request needs more headroom; vertical sizing increases "fixed" cost because the larger container size is paid for throughout its active lifetime.
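The cost asymmetry can be made concrete with a back-of-envelope calculation. The price and the burst figures below are made up for illustration; plug in real ACA rates and your observed load.

```python
# Horizontal scale pays for extra replicas only while the burst runs;
# vertical scale pays the larger rate for the whole month.
PRICE_PER_VCPU_HOUR = 0.10  # hypothetical rate, not an actual Azure price

def monthly_cost(vcpu: float, baseline_replicas: int,
                 burst_replicas: int, burst_hours: float,
                 month_hours: float = 730) -> float:
    """Rough monthly vCPU cost: always-on baseline plus burst capacity."""
    baseline = vcpu * baseline_replicas * month_hours
    burst = vcpu * burst_replicas * burst_hours
    return (baseline + burst) * PRICE_PER_VCPU_HOUR

# Horizontal: 0.25 vCPU replicas, 1 always on, +3 replicas for 40 burst hours
horizontal = monthly_cost(0.25, baseline_replicas=1, burst_replicas=3, burst_hours=40)
# Vertical: one 1-vCPU replica always on, no extra replicas
vertical = monthly_cost(1.0, baseline_replicas=1, burst_replicas=0, burst_hours=0)
print(f"horizontal: ${horizontal:.2f}/month, vertical: ${vertical:.2f}/month")
```

Under these assumptions the horizontal setup costs a fraction of the vertical one, because the extra capacity is only billed during the 40 burst hours.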
## Configuration (ACA)
### Example (Bicep)
The `containerApp.bicep` module exposes the following scaling parameters:
```bicep
param minReplicas int = 0
param maxReplicas int = 1
param httpConcurrentRequests int = 0 // 0 = scaler disabled
param cpuUtilizationThreshold int = 0 // 0 = scaler disabled; percentage (e.g. 70)
param scaleDownCooldownSeconds int = 300
```

Public API configuration (from `main.bicep`):
```bicep
cpu: valueString(environment, { prd: '1', default: '0.25' })
memory: valueString(environment, { prd: '2Gi', default: '0.5Gi' })
minReplicas: valueInt(environment, { prd: 1, default: 0 })
maxReplicas: valueInt(environment, { prd: 10, acc: 2, default: 1 })
httpConcurrentRequests: 50
cpuUtilizationThreshold: valueInt(environment, { prd: 80, acc: 90, default: 0 })
scaleDownCooldownSeconds: 600
```

**Notes**
- Start with `cpu: 0.5, memory: 1Gi` for new services. Right-size after profiling (see "When to Scale Vertically").
- If a service is truly bursty and can tolerate cold starts, or is infrequently called, `minReplicas: 0` is acceptable.
- Omitting `httpConcurrentRequests` or `cpuUtilizationThreshold` (or passing `0`) disables that scaler. This is useful for non-HTTP workloads or services where CPU scaling is not meaningful.
- Raise `scaleDownCooldownSeconds` when traffic is bursty with brief dips mid-burst (see "Scale-Down Cooldown" below).
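Inside the module, the "`0` disables the scaler" convention could be implemented along these lines. This is a sketch of one possible internal, not the actual `containerApp.bicep` code; variable and rule names are illustrative.

```bicep
// Build each rule list conditionally: a threshold of 0 contributes no rule.
var httpRules = httpConcurrentRequests > 0 ? [
  {
    name: 'http-concurrency'
    http: {
      metadata: {
        concurrentRequests: string(httpConcurrentRequests)
      }
    }
  }
] : []

var cpuRules = cpuUtilizationThreshold > 0 ? [
  {
    name: 'cpu-utilization'
    custom: {
      type: 'cpu'
      metadata: {
        type: 'Utilization'
        value: string(cpuUtilizationThreshold)
      }
    }
  }
] : []

// ...later, in the container app resource:
// scale: {
//   minReplicas: minReplicas
//   maxReplicas: maxReplicas
//   rules: concat(httpRules, cpuRules)
// }
```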
## How We Tune `concurrentRequests` (20 → …)
1. Collect a baseline for 3–7 days under representative traffic.
2. If CPU < 50% and p95 is well below SLO, consider raising concurrency (e.g., 20 → 30).
3. If CPU ≥ 75% or p95 approaches SLO, lower concurrency (e.g., 20 → 15) or add replicas by raising `maxReplicas`.
4. Re-assess until we hit the "sweet spot": steady p95 and 60–75% CPU during normal load.
The public API was tuned from the initial default of `40` down to `20` based on observed data: sync workloads are CPU-heavy per request, meaning a single replica saturates CPU well before HTTP concurrency climbs to 40. The CPU scaler at 70% covers burst detection for those patterns.
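One iteration of the tuning loop above can be sketched as a heuristic. The CPU thresholds come from this section; the interpretation of "approaches SLO" as 90% of the SLO, "well below" as 50%, and the 0.75×/1.5× adjustment steps are illustrative assumptions, as are the function and parameter names.

```python
def tune_concurrency(current: int, cpu_pct: float,
                     p95_ms: float, slo_ms: float) -> int:
    """Suggest a new concurrency target from baseline metrics
    collected over 3-7 days of representative traffic."""
    if cpu_pct >= 75 or p95_ms >= 0.9 * slo_ms:
        # CPU hot or p95 approaching SLO: lower concurrency (e.g. 20 -> 15)
        return max(1, int(current * 0.75))
    if cpu_pct < 50 and p95_ms < 0.5 * slo_ms:
        # Plenty of headroom: raise concurrency (e.g. 20 -> 30)
        return int(current * 1.5)
    return current  # sweet spot: steady p95, 60-75% CPU

print(tune_concurrency(20, cpu_pct=45, p95_ms=120, slo_ms=500))  # 30
print(tune_concurrency(20, cpu_pct=80, p95_ms=460, slo_ms=500))  # 15
```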
## Scale-Down Cooldown
Scale-in is automatic; no rules required. KEDA polls all active scalers and begins counting down the cooldown period once all scalers drop below their thresholds.
| Parameter | Default | Public API |
|---|---|---|
| `scaleDownCooldownSeconds` | 300 | 600 |
The public API cooldown is set to 600 seconds because sync workloads produce intra-burst traffic dips (brief quiet windows in the middle of an active sync) that would otherwise trigger premature scale-in. With a 10-minute cooldown, replicas stay warm through the full burst and are released within ~10 minutes after the workload ends.
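The effect of the longer cooldown can be shown with a toy model. The ~400 s quiet gap below is hypothetical, picked only to show where the 300 s default and the 600 s setting diverge.

```python
def scales_in_during_gap(gap_seconds: int, cooldown_seconds: int) -> bool:
    """KEDA starts the cooldown timer once all scalers drop below their
    thresholds; scale-in fires only if the quiet gap outlasts the cooldown."""
    return gap_seconds > cooldown_seconds

# Hypothetical intra-burst dip of 400 s:
print(scales_in_during_gap(400, 300))  # True: premature scale-in mid-burst
print(scales_in_during_gap(400, 600))  # False: replicas stay warm
```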
## Decision Tree: Scale Out vs. Optimize vs. Scale Up
1. Is p95 latency above SLO?
   - No → Do nothing. Keep observing.
   - Yes → Go to 2.
2. Is per-replica CPU ≥ 75% or memory ≥ 80%?
   - Yes → Try horizontal scale first (increase `maxReplicas` or decrease target concurrency slightly).
   - No → Go to 3.
3. Where is time spent (from traces)?
   - Mostly external (DB, HTTP downstream) → Optimize queries, add caching, reduce payloads; scaling won't help much.
   - Mostly app CPU / GC → Consider vertical scale (more vCPU/RAM) or reduce allocations/compute; then re-test.
4. Are we frequently at `maxReplicas` with p95 ≈ SLO and costs rising?
   - Yes → Prioritize optimization (DB indexes, caching, batching) before adding more replicas.
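The tree can be encoded directly. Names are illustrative; "mostly external" is a judgment call read off traces, and the `maxReplicas` ceiling is checked before recommending more replicas (a linearization of steps 2 and 4 above).

```python
def next_action(p95_over_slo: bool, cpu_pct: float, mem_pct: float,
                mostly_external_time: bool, at_max_replicas: bool) -> str:
    """Walk the scale-out / optimize / scale-up decision tree."""
    if not p95_over_slo:
        return "observe"             # 1. latency within SLO: do nothing
    if at_max_replicas:
        return "optimize"            # 4. ceiling reached: fix hot paths first
    if cpu_pct >= 75 or mem_pct >= 80:
        return "scale horizontally"  # 2. saturation: add replicas first
    if mostly_external_time:
        return "optimize"            # 3a. DB/downstream dominates
    return "scale vertically"        # 3b. in-process CPU/GC dominates

print(next_action(True, 80, 40, False, False))  # scale horizontally
print(next_action(True, 60, 40, True, False))   # optimize
```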
## When We Scale Horizontally (Default)
- Workload is I/O-bound or embarrassingly parallel.
- Increasing replicas reduces queueing and improves p95.
- Cost is proportional to actual load (replicas scale down when idle).
**Actions**
- Increase `maxReplicas` with the 60–75% CPU target in mind.
- Keep `concurrentRequests` near the latency sweet spot.
- Watch DB/Redis/queue limits as you add replicas.
## When We Scale Vertically (Exception, Not Default)
Scale vertically only when many requests are CPU/memory intensive and a single replica is the bottleneck:
- p95/p99 dominated by CPU work inside the service even with low concurrency.
- High GC% time, LOH pressure, or frequent OOM.
- Thread pool starvation at modest concurrency.
**Actions**
- Move the app to a larger vCPU/RAM size (workload profile).
- Keep horizontal scaling enabled; vertical ≠ disable autoscale.
- Re-check cost: bigger instances increase the "fixed" cost floor because we pay the higher rate for the entire time the container runs.
## FAQ
**Why two scalers instead of one?** HTTP concurrency alone misses CPU-bound bursts. When sync workloads arrive (CPU-heavy per request), a replica can hit 100% CPU while the concurrent request count stays low. The CPU scaler at 70% catches this before requests start queuing.
**Why 50 concurrent requests?** 50 is the tuned value for the public API, chosen after observing that CPU saturates at modest concurrency for sync-heavy traffic. For I/O-bound services with fast response times and low CPU, a higher target may be appropriate.
**Why horizontal before vertical?** Horizontal scaling matches cost to demand; vertical sizing raises the baseline cost as long as the container is running.

**What's the signal to go vertical?** When p95/p99 are dominated by in-process CPU/memory work and a single request needs more headroom despite modest concurrency.

**Why 600s cooldown for the public API?** Sync workloads produce brief traffic dips mid-burst. A 300s default would start scaling in replicas that are needed again 60–90 seconds later. The 10-minute window covers the observed burst durations and avoids oscillation.