
Cluster Autoscaler incorrectly blocks scale-down due to PDB+TopologySpread interaction #9111

@jinglinliang

Description


Cluster Autoscaler fails to scale down underutilized nodes when pods have both topology spread constraints (maxSkew=2) and PodDisruptionBudgets, even when the PDB shows status.disruptionsAllowed > 0. This causes severe cluster fragmentation and wasted resources.

CAS logs show:

not enough pod disruption budget to move <pod-name>

However, kubectl get pdb shows the PDB has available disruptions:

status:
  disruptionsAllowed: 2  # > 0, should allow pod movement
  currentHealthy: 10
  desiredHealthy: 8

Environment

  • Kubernetes Version: 1.33
  • Cluster Autoscaler Version: [from AKS, exact version TBD]
  • Cloud Provider: Azure AKS
  • Affected Cluster: Production cluster with significant fragmentation
  • Impact: Severe resource waste with nodes not being scaled down despite being underutilized

Workload Configuration

Deployment with Topology Spread Constraints:

spec:
  replicas: 20
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 2
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-app

PodDisruptionBudget:

spec:
  maxUnavailable: 20%
  selector:
    matchLabels:
      app: my-app

Expected Behavior

CAS should scale down underutilized nodes when:

  1. PDB shows status.disruptionsAllowed > 0
  2. Pods can be rescheduled to other nodes while satisfying topology spread constraints
  3. The node removal would not violate the PDB

Actual Behavior

CAS refuses to scale down underutilized nodes, logging "not enough pod disruption budget" even though:

  • PDB status shows disruptionsAllowed > 0
  • All pods are healthy
  • Nodes remain underutilized indefinitely
  • This is a chronic issue that can persist for extended periods in production

Root Cause Analysis

Analysis of the CAS source code points to a timing and state mismatch between the PDB drainability check and the evaluation of topology spread constraints.

Buggy Code Flow

1. PDB Drainability Check (Too Early)

File: cluster-autoscaler/simulator/drainability/rules/pdb/rule.go:44-45

func (Rule) Drainable(drainCtx *drainability.DrainContext, pod *apiv1.Pod, _ *framework.NodeInfo) drainability.Status {
    for _, pdb := range drainCtx.RemainingPdbTracker.MatchingPdbs(pod) {
        if pdb.Status.DisruptionsAllowed < 1 {  // ← BUG: Uses PDB status from current cluster state
            return drainability.NewBlockedStatus(drain.NotEnoughPdb, fmt.Errorf("not enough pod disruption budget to move %s/%s", pod.Namespace, pod.Name))
        }
    }
    return drainability.NewUndefinedStatus()
}

2. Node Removal Simulation Flow

File: cluster-autoscaler/simulator/cluster.go:144-199

func (r *RemovalSimulator) SimulateNodeRemoval(...) {
    // STEP 1: Check PDB drainability - node is STILL in cluster state
    podsToRemove, daemonSetPods, blockingPod, err := GetPodsToMove(nodeInfo, ...)
    if err != nil {
        // ← Fails here because PDB.Status.DisruptionsAllowed reflects current topology spread
        return nil, &UnremovableNode{Node: nodeInfo.Node(), Reason: BlockedByPod, BlockingPod: blockingPod}
    }

    // STEP 2: Remove node from snapshot and try rescheduling - NEVER REACHED
    err = r.withForkedSnapshot(func() error {
        return r.findPlaceFor(nodeName, podsToRemove, destinationMap, timestamp)
    })
}

3. Where Node Removal Happens (Correctly, but Too Late)

File: cluster-autoscaler/simulator/cluster.go:191-199

func (r *RemovalSimulator) findPlaceFor(removedNode string, pods []*apiv1.Pod, ...) error {
    // Unschedule pods
    for _, pod := range pods {
        r.clusterSnapshot.UnschedulePod(pod.Namespace, pod.Name, removedNode)
    }
    // Remove the node from the snapshot, so it doesn't interfere with topology spread
    r.clusterSnapshot.RemoveNodeInfo(removedNode)  // ← Correct, but PDB check already failed!

    // Try to reschedule pods - this WOULD succeed, but we never get here
    statuses, _, err := r.schedulingSimulator.TrySchedulePods(r.clusterSnapshot, newpods, ...)
}

The Problem

  1. Kubernetes PDB Controller calculates Status.DisruptionsAllowed based on the current cluster state, including current topology spread distribution
  2. When pods have tight topology spread constraints (maxSkew=2), the PDB controller may set DisruptionsAllowed=0 because moving pods in the current state would violate topology spread
  3. CAS checks PDB status BEFORE simulating node removal:
    • Sees DisruptionsAllowed < 1 from Kubernetes API
    • Immediately fails the drainability check
    • Never reaches the code that removes the node from the snapshot
    • Never attempts pod rescheduling, which would succeed without the node
  4. The Paradox: PDB says "can't move pods because topology spread would be violated", but this is only true WITH the node present. If the node were removed, topology spread would be satisfied.
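
To make the paradox concrete, here is a toy calculation (illustrative only, not scheduler or CAS code) of roughly the quantity the scheduler compares against maxSkew: a domain's matching-pod count minus the global minimum across eligible domains. While the drained node is still in the snapshot it counts as an eligible domain with zero matching pods, which pins the global minimum at 0; once the node is removed, the minimum is taken over the remaining nodes only and the computed skew drops.

// toy_skew.go: illustrative only, not scheduler or Cluster Autoscaler code.
// Shows how dropping an empty node from the set of topology domains changes
// the skew computed for the very same pod distribution.
package main

import "fmt"

// skew returns count(domain) - min(count over all domains) for each domain,
// i.e. the value compared against maxSkew.
func skew(podsPerDomain map[string]int) map[string]int {
    min := -1
    for _, c := range podsPerDomain {
        if min == -1 || c < min {
            min = c
        }
    }
    out := map[string]int{}
    for d, c := range podsPerDomain {
        out[d] = c - min
    }
    return out
}

func main() {
    // Drained node still in the snapshot: an eligible domain with 0 pods,
    // so the global minimum is 0 and the busiest nodes show a skew of 2.
    fmt.Println(skew(map[string]int{"node-a": 2, "node-b": 2, "node-c": 1, "node-drained": 0}))

    // Same pods after the node is removed from the snapshot: the global
    // minimum rises to 1 and the same nodes show a skew of 1.
    fmt.Println(skew(map[string]int{"node-a": 2, "node-b": 2, "node-c": 1}))
}

This is exactly why findPlaceFor() removes the node from the snapshot before attempting rescheduling; the PDB drainability rule, however, runs before that removal ever happens.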


Reproduction Steps

Test Environment Setup

  • Cloud: Azure AKS (or any Kubernetes cluster)
  • Node pool: 1 system nodepool, autoscaling min=3, max=100
  • Node SKU: Any size (e.g., Standard_D32s_v3 with 32 cores)

Test Deployment

Step 1: Filler Pods to Scale Up Cluster

apiVersion: v1
kind: Namespace
metadata:
  name: test-cas-bug
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: filler-pods
  namespace: test-cas-bug
spec:
  replicas: 20
  selector:
    matchLabels:
      app: filler
  template:
    metadata:
      labels:
        app: filler
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - filler
            topologyKey: kubernetes.io/hostname
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 100m
            memory: 128Mi

Step 2: Actual Workload with Topology Spread + PDB

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-topology-spread
  namespace: test-cas-bug
spec:
  replicas: 20
  selector:
    matchLabels:
      app: test-topology-spread
  template:
    metadata:
      labels:
        app: test-topology-spread
    spec:
      topologySpreadConstraints:
      - maxSkew: 2
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: test-topology-spread
      containers:
      - name: nginx
        image: nginx:alpine
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 100m
            memory: 128Mi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: test-topology-spread-pdb
  namespace: test-cas-bug
spec:
  maxUnavailable: 20%
  selector:
    matchLabels:
      app: test-topology-spread

Steps to Reproduce

1. Deploy Filler Pods to Scale Up Node Pool

kubectl create namespace test-cas-bug
kubectl apply -f filler-deployment.yaml

Expected: Due to the anti-affinity rule, CAS scales the node pool up to 20 nodes (one filler pod per node). Wait until all filler pods are running.

kubectl get pods -n test-cas-bug -l app=filler
kubectl get nodes  # Should show 20 nodes

2. Deploy Workload with Topology Spread + PDB

kubectl apply -f topology-spread-deployment.yaml

Expected: The 20 replicas are scheduled across the 20 nodes under the maxSkew=2 constraint, so the pods land relatively evenly (1-2 pods per node). Wait until all pods are running.

kubectl get pods -n test-cas-bug -l app=test-topology-spread -o wide
kubectl get pdb -n test-cas-bug  # Should show disruptionsAllowed > 0

3. Remove Filler Pods to Trigger Scale-Down

kubectl delete deployment filler-pods -n test-cas-bug
# Wait 10-15 minutes for CAS to attempt scale-down
kubectl get nodes

Expected: After removing filler pods (20 × 100m CPU = 2 cores total), the nodes are now underutilized. The 20 test-topology-spread pods (20 × 100m CPU = 2 cores total) should consolidate to far fewer nodes (e.g., 2-4 nodes depending on node size). CAS should scale down the underutilized nodes within 10 minutes.

Actual (Bug): Most or all of the 20 nodes remain for 10+ minutes despite being severely underutilized (only 100m CPU per node from test-topology-spread pods).

4. Check CAS Logs

kubectl logs -n kube-system deployment/cluster-autoscaler | grep -A 10 "test-topology-spread" | grep -i "pdb\|disruption\|drainable"

Expected Log Output (Bug Manifestation):

not enough pod disruption budget to move test-cas-bug/test-topology-spread-xxx

5. Verify PDB Status

kubectl get pdb -n test-cas-bug test-topology-spread-pdb -o yaml

Expected Output:

status:
  currentHealthy: 20
  desiredHealthy: 16  # 20 replicas × 80% (maxUnavailable 20%)
  disruptionsAllowed: 4  # > 0, SHOULD allow disruptions
  expectedPods: 20

Bug Confirmation: status.disruptionsAllowed > 0 BUT CAS still blocks scale-down with "not enough pod disruption budget"

6. Validation Tests

Test A: Remove PDB (Proves PDB interaction)

kubectl delete pdb test-topology-spread-pdb -n test-cas-bug
# Wait 10 minutes
kubectl get nodes

Expected: CAS scales down empty nodes

Test B: Remove Topology Spread (Proves topology constraint interaction)

# Modify deployment to remove topologySpreadConstraints
kubectl apply -f test-deployment-no-topology.yaml

Expected: CAS scales down empty nodes

Test C: Increase maxUnavailable (May or may not help)

kubectl patch pdb test-topology-spread-pdb -n test-cas-bug \
  --type=merge -p '{"spec":{"maxUnavailable":"40%"}}'

Expected: Raising maxUnavailable to 40% increases disruptionsAllowed (to 8 with 20 healthy replicas); this may or may not unblock scale-down, depending on how the drainability check evaluates the budget

Success Criteria - Bug is Reproduced If:

  1. ✅ After step 1, CAS creates 20 nodes (one per filler pod due to anti-affinity)
  2. ✅ After step 2, topology-spread pods are scheduled across the 20 nodes
  3. ✅ After step 3 (removing filler pods), most nodes remain for 10+ minutes despite being severely underutilized (only 100m CPU per node)
  4. ✅ CAS logs show PDB blocking scale-down: "not enough pod disruption budget to move"
  5. kubectl get pdb shows status.disruptionsAllowed > 0 (should be 4 with 20 replicas and maxUnavailable 20%)
  6. ✅ Removing PDB OR topology spread constraints allows scale-down within 10 minutes

Production Impact

This bug has been observed in production environments causing:

  • Severe cluster fragmentation with nodes running well below optimal capacity
  • Significant cost waste due to unnecessary node provisioning
  • Chronic issue persisting for extended periods (60+ days observed)
  • Low resource efficiency with application workloads using only a small fraction of allocatable capacity

This is not a recent regression; affected clusters have maintained excessive node counts for extended periods.

Proposed Fix

The PDB drainability check needs to be re-architected to do one of the following:

  1. Recalculate DisruptionsAllowed after simulating node removal from the cluster snapshot, OR
  2. Defer PDB check until after attempting pod rescheduling in findPlaceFor(), OR
  3. Simulate topology spread state without the target node when checking PDB

The challenge is that pdb.Status.DisruptionsAllowed comes from the Kubernetes API (reflecting the current cluster state), not from the simulated cluster snapshot (which would reflect the post-removal state).
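
For option 1 above, the budget would have to be recomputed from whatever pod set the forked snapshot contains rather than read from the API object. A minimal sketch of that arithmetic, assuming the caller can count healthy matching pods in the snapshot (CAS has no such helper today, so the healthy count here is just a plain parameter):

// sketch_pdb_budget.go: illustrative sketch, not CAS code.
// Mirrors the disruption controller's arithmetic (allowed = currentHealthy -
// desiredHealthy, floored at 0) but feeds it a healthy count taken from the
// simulated snapshot instead of pdb.Status.DisruptionsAllowed.
package main

import "fmt"

func simulatedDisruptionsAllowed(desiredHealthy, healthyInSnapshot int32) int32 {
    allowed := healthyInSnapshot - desiredHealthy
    if allowed < 0 {
        return 0
    }
    return allowed
}

func main() {
    // Numbers from the reproduction: 20 healthy replicas and maxUnavailable 20%,
    // so desiredHealthy is 16 and 4 disruptions are allowed.
    fmt.Println(simulatedDisruptionsAllowed(16, 20)) // 4
}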

Potential Fix Location

Modify cluster-autoscaler/simulator/cluster.go:SimulateNodeRemoval() to:

  1. Fork the cluster snapshot earlier
  2. Remove the node from the snapshot BEFORE the PDB drainability check
  3. Check if pods can be rescheduled
  4. THEN check PDB based on the simulated state

OR

Modify cluster-autoscaler/simulator/drainability/rules/pdb/rule.go to:

  1. Not fail immediately on DisruptionsAllowed < 1
  2. Return a "needs verification" status
  3. Allow the removal simulator to proceed with rescheduling attempt
  4. Verify PDB after successful rescheduling simulation
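
A rough sketch of that second variant is below. NewNeedsVerificationStatus and the follow-up verification step are hypothetical: neither exists in the drainability package today, and the sketch only illustrates the shape of the change, not a working patch.

// Hypothetical rewrite of rules/pdb/rule.go (sketch only; will not compile
// against current CAS because NewNeedsVerificationStatus does not exist).
func (Rule) Drainable(drainCtx *drainability.DrainContext, pod *apiv1.Pod, _ *framework.NodeInfo) drainability.Status {
    for _, pdb := range drainCtx.RemainingPdbTracker.MatchingPdbs(pod) {
        if pdb.Status.DisruptionsAllowed < 1 {
            // Instead of blocking immediately, mark the pod for later
            // verification so that SimulateNodeRemoval can still fork the
            // snapshot, remove the node, attempt rescheduling, and only then
            // decide whether the budget is really exceeded post-removal.
            return drainability.NewNeedsVerificationStatus(drain.NotEnoughPdb)
        }
    }
    return drainability.NewUndefinedStatus()
}

In either variant, the decisive change is the same: the budget decision moves from the pre-simulation cluster state to the post-removal simulated state.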

Related Issues

This issue is distinct because it identifies the specific root cause: the PDB drainability check relies on PDB status from the Kubernetes API, which does not account for how topology spread changes once the node is removed.

Additional Context

This affects all AKS customers using the combination of:

  • PodDisruptionBudgets (common for production workloads)
  • Topology spread constraints (increasingly common for HA)
  • Cluster Autoscaler (standard for cost optimization)

The workarounds mentioned in related issues (increase maxSkew, remove PDB, manual drain) are not viable for production environments requiring high availability.


Environment Details to Collect:

  • CAS version: kubectl get deployment cluster-autoscaler -n kube-system -o yaml | grep image:
  • Kubernetes version: kubectl version (the deprecated --short flag has been removed in recent kubectl releases)
  • Full CAS logs during reproduction
  • PDB status snapshots over time
