Description
Cluster Autoscaler fails to scale down underutilized nodes when pods have both topology spread constraints (maxSkew=2) and PodDisruptionBudgets, even when the PDB shows status.disruptionsAllowed > 0. This causes severe cluster fragmentation and wasted resources.
CAS logs show:
not enough pod disruption budget to move <pod-name>
However, kubectl get pdb shows the PDB has available disruptions:
status:
disruptionsAllowed: 2 # > 0, should allow pod movement
currentHealthy: 10
desiredHealthy: 8
Environment
- Kubernetes Version: 1.33
- Cluster Autoscaler Version: [from AKS, exact version TBD]
- Cloud Provider: Azure AKS
- Affected Cluster: Production cluster with significant fragmentation
- Impact: Severe resource waste with nodes not being scaled down despite being underutilized
Workload Configuration
Deployment with Topology Spread Constraints:
spec:
replicas: 20
template:
spec:
topologySpreadConstraints:
- maxSkew: 2
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: my-app
PodDisruptionBudget:
spec:
maxUnavailable: 20%
selector:
matchLabels:
app: my-app
Expected Behavior
CAS should scale down underutilized nodes when:
- PDB shows status.disruptionsAllowed > 0
- Pods can be rescheduled to other nodes while satisfying topology spread constraints
- The node removal would not violate the PDB
Actual Behavior
CAS refuses to scale down underutilized nodes, logging "not enough pod disruption budget" even though:
- PDB status shows disruptionsAllowed > 0
- All pods are healthy
- Nodes remain underutilized indefinitely
- This is a chronic issue that can persist for extended periods in production
Root Cause Analysis
Analysis of the CAS source code points to a timing and state mismatch between the PDB drainability check and the topology spread constraint evaluation.
Buggy Code Flow
1. PDB Drainability Check (Too Early)
File: cluster-autoscaler/simulator/drainability/rules/pdb/rule.go:44-45
func (Rule) Drainable(drainCtx *drainability.DrainContext, pod *apiv1.Pod, _ *framework.NodeInfo) drainability.Status {
for _, pdb := range drainCtx.RemainingPdbTracker.MatchingPdbs(pod) {
if pdb.Status.DisruptionsAllowed < 1 { // ← BUG: Uses PDB status from current cluster state
return drainability.NewBlockedStatus(drain.NotEnoughPdb, fmt.Errorf("not enough pod disruption budget to move %s/%s", pod.Namespace, pod.Name))
}
}
return drainability.NewUndefinedStatus()
}
2. Node Removal Simulation Flow
File: cluster-autoscaler/simulator/cluster.go:144-199
func (r *RemovalSimulator) SimulateNodeRemoval(...) {
// STEP 1: Check PDB drainability - node is STILL in cluster state
podsToRemove, daemonSetPods, blockingPod, err := GetPodsToMove(nodeInfo, ...)
if err != nil {
// ← Fails here because PDB.Status.DisruptionsAllowed reflects current topology spread
return nil, &UnremovableNode{Node: nodeInfo.Node(), Reason: BlockedByPod, BlockingPod: blockingPod}
}
// STEP 2: Remove node from snapshot and try rescheduling - NEVER REACHED
err = r.withForkedSnapshot(func() error {
return r.findPlaceFor(nodeName, podsToRemove, destinationMap, timestamp)
})
}
3. Where Node Removal Happens (Correctly, but Too Late)
File: cluster-autoscaler/simulator/cluster.go:191-199
func (r *RemovalSimulator) findPlaceFor(removedNode string, pods []*apiv1.Pod, ...) error {
// Unschedule pods
for _, pod := range pods {
r.clusterSnapshot.UnschedulePod(pod.Namespace, pod.Name, removedNode)
}
// Remove the node from the snapshot, so it doesn't interfere with topology spread
r.clusterSnapshot.RemoveNodeInfo(removedNode) // ← Correct, but PDB check already failed!
// Try to reschedule pods - this WOULD succeed, but we never get here
statuses, _, err := r.schedulingSimulator.TrySchedulePods(r.clusterSnapshot, newpods, ...)
}
The Problem
- The Kubernetes PDB controller calculates Status.DisruptionsAllowed based on the current cluster state, including the current topology spread distribution
- When pods have tight topology spread constraints (maxSkew=2), the PDB controller may set DisruptionsAllowed=0 because moving pods in the current state would violate topology spread
- CAS checks the PDB status BEFORE simulating node removal:
  - Sees DisruptionsAllowed < 1 from the Kubernetes API
  - Immediately fails the drainability check
  - Never reaches the code that removes the node from the snapshot
  - Never attempts pod rescheduling, which would succeed without the node
- The Paradox: the PDB effectively says "pods can't be moved because topology spread would be violated", but this is only true WITH the node present. If the node were removed, topology spread would be satisfied (see the worked example below).
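Worked example (illustrative, using the reproduction numbers below and the scheduler's usual PodTopologySpread accounting, in which empty but eligible nodes still count as domains with zero matching pods): with 20 pods spread one per node across 20 nodes, consolidating them onto 3 nodes would put roughly 7 pods on each of those nodes while 17 now-empty nodes still count as domains, giving a skew of 7 - 0 = 7 > maxSkew=2, so the moves look forbidden. If the empty nodes were actually removed from the evaluated state first, the only remaining domains would hold 7/7/6 pods, a skew of 1, which satisfies the constraint.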
Why This Differs from Related Issues
- Issue #6984 (Cluster Autoscaler Does Not Respect TopologySpread maxSkew=1 on Scale Down): focused on the skew calculation including the removed node
- Issue #8093 (Cluster Autoscaler does not scale down with Topology Spread despite max skew being satisfied when scaled down): fixed scale-down blocking, but didn't address the PDB interaction
- Issue #8161 (Cluster Autoscaler binpacking poorly when hard pod topology spread constraints are used): binpacking creates too many nodes, but doesn't explain why scale-down fails
- This Issue: the PDB drainability check uses stale PDB status from the Kubernetes API that reflects topology spread constraints in the current state, not the simulated state after node removal
Reproduction Steps
Test Environment Setup
- Cloud: Azure AKS (or any Kubernetes cluster)
- Node pool: 1 system nodepool, autoscaling min=3, max=100
- Node SKU: Any size (e.g., Standard_D32s_v3 with 32 cores)
Test Deployment
Step 1: Filler Pods to Scale Up Cluster
apiVersion: v1
kind: Namespace
metadata:
name: test-cas-bug
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: filler-pods
namespace: test-cas-bug
spec:
replicas: 20
selector:
matchLabels:
app: filler
template:
metadata:
labels:
app: filler
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- filler
topologyKey: kubernetes.io/hostname
containers:
- name: pause
image: registry.k8s.io/pause:3.9
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 100m
memory: 128Mi
Step 2: Actual Workload with Topology Spread + PDB
apiVersion: apps/v1
kind: Deployment
metadata:
name: test-topology-spread
namespace: test-cas-bug
spec:
replicas: 20
selector:
matchLabels:
app: test-topology-spread
template:
metadata:
labels:
app: test-topology-spread
spec:
topologySpreadConstraints:
- maxSkew: 2
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: test-topology-spread
containers:
- name: nginx
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 100m
memory: 128Mi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: test-topology-spread-pdb
namespace: test-cas-bug
spec:
maxUnavailable: 20%
selector:
matchLabels:
app: test-topology-spread
Steps to Reproduce
1. Deploy Filler Pods to Scale Up Node Pool
kubectl create namespace test-cas-bug
kubectl apply -f filler-deployment.yaml
Expected: Due to anti-affinity, CAS will create 20 nodes (one filler pod per node). Wait until all filler pods are running.
kubectl get pods -n test-cas-bug -l app=filler
kubectl get nodes # Should show 20 nodes
2. Deploy Workload with Topology Spread + PDB
kubectl apply -f topology-spread-deployment.yaml
Expected: The 20 replicas will be scheduled across the 20 nodes under the maxSkew=2 constraint, distributed relatively evenly (e.g., 1-2 pods per node). Wait until all pods are running.
kubectl get pods -n test-cas-bug -l app=test-topology-spread -o wide
kubectl get pdb -n test-cas-bug # Should show disruptionsAllowed > 0
3. Remove Filler Pods to Trigger Scale-Down
kubectl delete deployment filler-pods -n test-cas-bug
# Wait 10-15 minutes for CAS to attempt scale-down
kubectl get nodes
Expected: After removing the filler pods (20 × 100m CPU = 2 cores total), the nodes are now underutilized. The 20 test-topology-spread pods (also 20 × 100m CPU = 2 cores total) should consolidate onto far fewer nodes (e.g., 2-4 depending on node size). CAS should scale down the underutilized nodes within 10 minutes.
Actual (Bug): Most or all of the 20 nodes remain for 10+ minutes despite being severely underutilized (only 100m CPU per node from test-topology-spread pods).
4. Check CAS Logs
kubectl logs -n kube-system deployment/cluster-autoscaler | grep -A 10 "test-topology-spread" | grep -i "pdb\|disruption\|drainable"
Expected Log Output (Bug Manifestation):
not enough pod disruption budget to move test-cas-bug/test-topology-spread-xxx
5. Verify PDB Status
kubectl get pdb -n test-cas-bug test-topology-spread-pdb -o yaml
Expected Output:
status:
currentHealthy: 20
desiredHealthy: 16 # 20 replicas × 80% (maxUnavailable 20%)
disruptionsAllowed: 4 # > 0, SHOULD allow disruptions
expectedPods: 20
Bug Confirmation: status.disruptionsAllowed > 0 BUT CAS still blocks scale-down with "not enough pod disruption budget"
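For reference, a minimal sketch of how these expected numbers are derived (this is not the disruption-controller code; it assumes the usual round-up treatment of a percentage maxUnavailable):
package main

import (
	"fmt"
	"math"
)

// Sketch of the PDB arithmetic for the repro workload: 20 expected pods with
// maxUnavailable: 20%. desiredHealthy = expectedPods - ceil(expectedPods*0.20),
// disruptionsAllowed = currentHealthy - desiredHealthy (floored at zero).
func main() {
	expectedPods := 20
	currentHealthy := 20
	maxUnavailablePct := 0.20

	maxUnavailable := int(math.Ceil(float64(expectedPods) * maxUnavailablePct)) // 4
	desiredHealthy := expectedPods - maxUnavailable                             // 16
	disruptionsAllowed := currentHealthy - desiredHealthy                       // 4
	if disruptionsAllowed < 0 {
		disruptionsAllowed = 0
	}
	fmt.Printf("desiredHealthy=%d disruptionsAllowed=%d\n", desiredHealthy, disruptionsAllowed)
}
With all 20 pods healthy, the PDB alone should therefore permit 4 simultaneous disruptions, which makes the "not enough pod disruption budget" log all the more surprising.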
6. Validation Tests
Test A: Remove PDB (Proves PDB interaction)
kubectl delete pdb test-topology-spread-pdb -n test-cas-bug
# Wait 10 minutes
kubectl get nodes
Expected: CAS scales down empty nodes
Test B: Remove Topology Spread (Proves topology constraint interaction)
# Modify deployment to remove topologySpreadConstraints
kubectl apply -f test-deployment-no-topology.yaml
Expected: CAS scales down empty nodes
Test C: Increase maxUnavailable (May or may not help)
kubectl patch pdb test-topology-spread-pdb -n test-cas-bug \
--type=merge -p '{"spec":{"maxUnavailable":"40%"}}'
Expected: May or may not help depending on PDB status calculation
Success Criteria - Bug is Reproduced If:
- ✅ After step 1, CAS creates 20 nodes (one per filler pod due to anti-affinity)
- ✅ After step 2, topology-spread pods are scheduled across the 20 nodes
- ✅ After step 3 (removing filler pods), most nodes remain for 10+ minutes despite being severely underutilized (only 100m CPU per node)
- ✅ CAS logs show PDB blocking scale-down: "not enough pod disruption budget to move"
- ✅ kubectl get pdb shows status.disruptionsAllowed > 0 (should be 4 with 20 replicas and maxUnavailable 20%)
- ✅ Removing PDB OR topology spread constraints allows scale-down within 10 minutes
Production Impact
This bug has been observed in production environments causing:
- Severe cluster fragmentation with nodes running well below optimal capacity
- Significant cost waste due to unnecessary node provisioning
- Chronic issue persisting for extended periods (60+ days observed)
- Low resource efficiency with application workloads using only a small fraction of allocatable capacity
This is not a recent regression - affected clusters maintain excessive node counts for extended periods.
Proposed Fix
The PDB drainability check needs to be re-architected to either:
- Recalculate DisruptionsAllowed after simulating node removal from the cluster snapshot, OR
- Defer the PDB check until after attempting pod rescheduling in findPlaceFor(), OR
- Simulate the topology spread state without the target node when checking the PDB
The challenge is that pdb.Status.DisruptionsAllowed comes from the Kubernetes API (reflecting current cluster state), not from the simulated cluster snapshot (reflecting post-removal state).
Potential Fix Location
Modify cluster-autoscaler/simulator/cluster.go:SimulateNodeRemoval() to:
- Fork the cluster snapshot earlier
- Remove the node from the snapshot BEFORE the PDB drainability check
- Check if pods can be rescheduled
- THEN check PDB based on the simulated state
OR
Modify cluster-autoscaler/simulator/drainability/rules/pdb/rule.go to:
- Not fail immediately on DisruptionsAllowed < 1
- Return a "needs verification" status
- Allow the removal simulator to proceed with the rescheduling attempt
- Verify the PDB after a successful rescheduling simulation (see the sketch below)
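A minimal sketch of this second option, building on the rule.go excerpt quoted earlier. This is illustrative only: drainability.NewDeferredStatus and the deferred-verification flow do not exist in CAS today; they stand in for the proposed "needs verification" status.
func (Rule) Drainable(drainCtx *drainability.DrainContext, pod *apiv1.Pod, _ *framework.NodeInfo) drainability.Status {
	for _, pdb := range drainCtx.RemainingPdbTracker.MatchingPdbs(pod) {
		if pdb.Status.DisruptionsAllowed < 1 {
			// Hypothetical: instead of blocking here, flag the pod so the
			// removal simulator re-checks the PDB against the forked snapshot
			// (node removed, pods tentatively rescheduled) before deciding.
			return drainability.NewDeferredStatus(drain.NotEnoughPdb,
				fmt.Errorf("PDB for %s/%s must be re-verified after simulated node removal", pod.Namespace, pod.Name))
		}
	}
	return drainability.NewUndefinedStatus()
}
The removal simulator would then treat a deferred status like an undefined one until findPlaceFor() succeeds, and only afterwards confirm that the post-removal disruption budget is still non-negative.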
Related Issues
- #6984 - CAS Does Not Respect TopologySpread maxSkew=1 (Closed as "not planned")
- #8093 - CAS does not scale down with Topology Spread (fixed in PR #8164, but doesn't address the PDB interaction)
- #8161 - CAS binpacking poorly with hard topology spread constraints (Open, but focuses on scale-up, not scale-down)
This issue is distinct because it identifies the specific root cause: the PDB drainability check uses stale PDB status from the Kubernetes API that doesn't account for topology spread changes after node removal.
Additional Context
This affects all AKS customers using the combination of:
- PodDisruptionBudgets (common for production workloads)
- Topology spread constraints (increasingly common for HA)
- Cluster Autoscaler (standard for cost optimization)
The workarounds mentioned in related issues (increase maxSkew, remove PDB, manual drain) are not viable for production environments requiring high availability.
Environment Details to Collect:
- CAS version:
kubectl get deployment cluster-autoscaler -n kube-system -o yaml | grep image:
- Kubernetes version:
kubectl version --short - Full CAS logs during reproduction
- PDB status snapshots over time