Description
Cluster Autoscaler fails to scale down underutilized nodes when pods have both topology spread constraints (maxSkew=2) and PodDisruptionBudgets, even when the PDB shows status.disruptionsAllowed > 0. This causes severe cluster fragmentation and wasted resources.
CAS logs show:
not enough pod disruption budget to move <pod-name>
However, kubectl get pdb shows the PDB has available disruptions:
status:
disruptionsAllowed: 2 # > 0, should allow pod movement
currentHealthy: 10
desiredHealthy: 8
Environment
- Kubernetes Version: 1.33
- Cluster Autoscaler Version: [from AKS, exact version TBD]
- Cloud Provider: Azure AKS
- Affected Cluster: Production cluster with significant fragmentation
- Impact: Severe resource waste with nodes not being scaled down despite being underutilized
Workload Configuration
Deployment with Topology Spread Constraints:
spec:
replicas: 20
template:
spec:
topologySpreadConstraints:
- maxSkew: 2
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: my-app
PodDisruptionBudget:
spec:
maxUnavailable: 20%
selector:
matchLabels:
app: my-app
Expected Behavior
CAS should scale down underutilized nodes when:
- PDB shows status.disruptionsAllowed > 0
- Pods can be rescheduled to other nodes while satisfying topology spread constraints
- The node removal would not violate the PDB
Actual Behavior
CAS refuses to scale down underutilized nodes, logging "not enough pod disruption budget" even though:
- PDB status shows disruptionsAllowed > 0
- All pods are healthy
- Nodes remain underutilized indefinitely
- This is a chronic issue that can persist for extended periods in production
Root Cause Analysis
Analysis of the CAS source code points to a timing and state mismatch between the PDB drainability check and the topology spread constraint evaluation.
Buggy Code Flow
1. PDB Drainability Check (Too Early)
File: cluster-autoscaler/simulator/drainability/rules/pdb/rule.go:44-45
func (Rule) Drainable(drainCtx *drainability.DrainContext, pod *apiv1.Pod, _ *framework.NodeInfo) drainability.Status {
for _, pdb := range drainCtx.RemainingPdbTracker.MatchingPdbs(pod) {
if pdb.Status.DisruptionsAllowed < 1 { // ← BUG: Uses PDB status from current cluster state
return drainability.NewBlockedStatus(drain.NotEnoughPdb, fmt.Errorf("not enough pod disruption budget to move %s/%s", pod.Namespace, pod.Name))
}
}
return drainability.NewUndefinedStatus()
}
2. Node Removal Simulation Flow
File: cluster-autoscaler/simulator/cluster.go:144-199
func (r *RemovalSimulator) SimulateNodeRemoval(...) {
// STEP 1: Check PDB drainability - node is STILL in cluster state
podsToRemove, daemonSetPods, blockingPod, err := GetPodsToMove(nodeInfo, ...)
if err != nil {
// ← Fails here because PDB.Status.DisruptionsAllowed reflects current topology spread
return nil, &UnremovableNode{Node: nodeInfo.Node(), Reason: BlockedByPod, BlockingPod: blockingPod}
}
// STEP 2: Remove node from snapshot and try rescheduling - NEVER REACHED
err = r.withForkedSnapshot(func() error {
return r.findPlaceFor(nodeName, podsToRemove, destinationMap, timestamp)
})
}
3. Where Node Removal Happens (Correctly, but Too Late)
File: cluster-autoscaler/simulator/cluster.go:191-199
func (r *RemovalSimulator) findPlaceFor(removedNode string, pods []*apiv1.Pod, ...) error {
// Unschedule pods
for _, pod := range pods {
r.clusterSnapshot.UnschedulePod(pod.Namespace, pod.Name, removedNode)
}
// Remove the node from the snapshot, so it doesn't interfere with topology spread
r.clusterSnapshot.RemoveNodeInfo(removedNode) // ← Correct, but PDB check already failed!
// Try to reschedule pods - this WOULD succeed, but we never get here
statuses, _, err := r.schedulingSimulator.TrySchedulePods(r.clusterSnapshot, newpods, ...)
}
The Problem
- The Kubernetes PDB controller calculates Status.DisruptionsAllowed based on the current cluster state, including the current topology spread distribution
- When pods have tight topology spread constraints (maxSkew=2), the PDB controller may set DisruptionsAllowed=0 because moving pods in the current state would violate topology spread
- CAS checks the PDB status BEFORE simulating node removal:
  - Sees DisruptionsAllowed < 1 from the Kubernetes API
  - Immediately fails the drainability check
  - Never reaches the code that removes the node from the snapshot
  - Never attempts pod rescheduling, which would succeed without the node
- The Paradox: the PDB effectively says "pods can't be moved because topology spread would be violated", but this is only true WITH the node present. If the node were removed, topology spread would be satisfied (see the worked example below).
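Worked example (illustrative, using the reproduction numbers below and the scheduler's usual PodTopologySpread accounting, in which empty but eligible nodes still count as domains with zero matching pods): with 20 pods spread one per node across 20 nodes, consolidating them onto 3 nodes would put roughly 7 pods on each of those nodes while 17 now-empty nodes still count as domains, giving a skew of 7 - 0 = 7 > maxSkew=2, so the moves look forbidden. If the empty nodes were actually removed from the evaluated state first, the only remaining domains would hold 7/7/6 pods, a skew of 1, which satisfies the constraint.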
Why This Differs from Related Issues
- Issue #6984 (Cluster Autoscaler Does Not Respect TopologySpread maxSkew=1 on Scale Down): focused on the skew calculation including the removed node
- Issue #8093 (Cluster Autoscaler does not scale down with Topology Spread despite max skew being satisfied when scaled down): fixed scale-down blocking, but didn't address the PDB interaction
- Issue #8161 (Cluster Autoscaler binpacking poorly when hard pod topology spread constraints are used): binpacking creates too many nodes, but doesn't explain why scale-down fails
- This Issue: the PDB drainability check uses stale PDB status from the Kubernetes API that reflects topology spread constraints in the current state, not the simulated state after node removal
Reproduction Steps
Test Environment Setup
- Cloud: Azure AKS (or any Kubernetes cluster)
- Node pool: 1 system nodepool, autoscaling min=3, max=100
- Node SKU: Any size (e.g., Standard_D32s_v3 with 32 cores)
Test Deployment
Step 1: Filler Pods to Scale Up Cluster
apiVersion: v1
kind: Namespace
metadata:
name: test-cas-bug
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: filler-pods
namespace: test-cas-bug
spec:
replicas: 20
selector:
matchLabels:
app: filler
template:
metadata:
labels:
app: filler
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- filler
topologyKey: kubernetes.io/hostname
containers:
- name: pause
image: registry.k8s.io/pause:3.9
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 100m
memory: 128Mi
Step 2: Actual Workload with Topology Spread + PDB
apiVersion: apps/v1
kind: Deployment
metadata:
name: test-topology-spread
namespace: test-cas-bug
spec:
replicas: 20
selector:
matchLabels:
app: test-topology-spread
template:
metadata:
labels:
app: test-topology-spread
spec:
topologySpreadConstraints:
- maxSkew: 2
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: test-topology-spread
containers:
- name: nginx
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 100m
memory: 128Mi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: test-topology-spread-pdb
namespace: test-cas-bug
spec:
maxUnavailable: 20%
selector:
matchLabels:
app: test-topology-spread
Steps to Reproduce
1. Deploy Filler Pods to Scale Up Node Pool
kubectl create namespace test-cas-bug
kubectl apply -f filler-deployment.yaml
Expected: Due to anti-affinity, CAS will create 20 nodes (one filler pod per node). Wait until all filler pods are running.
kubectl get pods -n test-cas-bug -l app=filler
kubectl get nodes # Should show 20 nodes
2. Deploy Workload with Topology Spread + PDB
kubectl apply -f topology-spread-deployment.yaml
Expected: The 20 replicas will be scheduled across the 20 nodes under the maxSkew=2 constraint, distributed relatively evenly (e.g., 1-2 pods per node). Wait until all pods are running.
kubectl get pods -n test-cas-bug -l app=test-topology-spread -o wide
kubectl get pdb -n test-cas-bug # Should show disruptionsAllowed > 0
3. Remove Filler Pods to Trigger Scale-Down
kubectl delete deployment filler-pods -n test-cas-bug
# Wait 10-15 minutes for CAS to attempt scale-down
kubectl get nodes
Expected: After removing the filler pods (20 × 100m CPU = 2 cores total), the nodes are now underutilized. The 20 test-topology-spread pods (also 20 × 100m CPU = 2 cores total) should consolidate onto far fewer nodes (e.g., 2-4 depending on node size). CAS should scale down the underutilized nodes within 10 minutes.
Actual (Bug): Most or all of the 20 nodes remain for 10+ minutes despite being severely underutilized (only 100m CPU per node from test-topology-spread pods).
4. Check CAS Logs
kubectl logs -n kube-system deployment/cluster-autoscaler | grep -A 10 "test-topology-spread" | grep -i "pdb\|disruption\|drainable"
Expected Log Output (Bug Manifestation):
not enough pod disruption budget to move test-cas-bug/test-topology-spread-xxx
5. Verify PDB Status
kubectl get pdb -n test-cas-bug test-topology-spread-pdb -o yaml
Expected Output:
status:
currentHealthy: 20
desiredHealthy: 16 # 20 replicas × 80% (maxUnavailable 20%)
disruptionsAllowed: 4 # > 0, SHOULD allow disruptions
expectedPods: 20
Bug Confirmation: status.disruptionsAllowed > 0 BUT CAS still blocks scale-down with "not enough pod disruption budget"
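For reference, a minimal sketch of how these expected numbers are derived (this is not the disruption-controller code; it assumes the usual round-up treatment of a percentage maxUnavailable):
package main

import (
	"fmt"
	"math"
)

// Sketch of the PDB arithmetic for the repro workload: 20 expected pods with
// maxUnavailable: 20%. desiredHealthy = expectedPods - ceil(expectedPods*0.20),
// disruptionsAllowed = currentHealthy - desiredHealthy (floored at zero).
func main() {
	expectedPods := 20
	currentHealthy := 20
	maxUnavailablePct := 0.20

	maxUnavailable := int(math.Ceil(float64(expectedPods) * maxUnavailablePct)) // 4
	desiredHealthy := expectedPods - maxUnavailable                             // 16
	disruptionsAllowed := currentHealthy - desiredHealthy                       // 4
	if disruptionsAllowed < 0 {
		disruptionsAllowed = 0
	}
	fmt.Printf("desiredHealthy=%d disruptionsAllowed=%d\n", desiredHealthy, disruptionsAllowed)
}
With all 20 pods healthy, the PDB alone should therefore permit 4 simultaneous disruptions, which makes the "not enough pod disruption budget" log all the more surprising.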
6. Validation Tests
Test A: Remove PDB (Proves PDB interaction)
kubectl delete pdb test-topology-spread-pdb -n test-cas-bug
# Wait 10 minutes
kubectl get nodes
Expected: CAS scales down empty nodes
Test B: Remove Topology Spread (Proves topology constraint interaction)
# Modify deployment to remove topologySpreadConstraints
kubectl apply -f test-deployment-no-topology.yaml
Expected: CAS scales down empty nodes
Test C: Increase maxUnavailable (May or may not help)
kubectl patch pdb test-topology-spread-pdb -n test-cas-bug \
--type=merge -p '{"spec":{"maxUnavailable":"40%"}}'
Expected: May or may not help depending on PDB status calculation
Success Criteria - Bug is Reproduced If:
- ✅ After step 1, CAS creates 20 nodes (one per filler pod due to anti-affinity)
- ✅ After step 2, topology-spread pods are scheduled across the 20 nodes
- ✅ After step 3 (removing filler pods), most nodes remain for 10+ minutes despite being severely underutilized (only 100m CPU per node)
- ✅ CAS logs show PDB blocking scale-down: "not enough pod disruption budget to move"
- ✅ kubectl get pdb shows status.disruptionsAllowed > 0 (should be 4 with 20 replicas and maxUnavailable 20%)
- ✅ Removing PDB OR topology spread constraints allows scale-down within 10 minutes
Production Impact
This bug has been observed in production environments causing:
- Severe cluster fragmentation with nodes running well below optimal capacity
- Significant cost waste due to unnecessary node provisioning
- Chronic issue persisting for extended periods (60+ days observed)
- Low resource efficiency with application workloads using only a small fraction of allocatable capacity
This is not a recent regression - affected clusters maintain excessive node counts for extended periods.
Proposed Fix
The PDB drainability check needs to be re-architected to either:
- Recalculate DisruptionsAllowed after simulating node removal from the cluster snapshot, OR
- Defer the PDB check until after attempting pod rescheduling in findPlaceFor(), OR
- Simulate the topology spread state without the target node when checking the PDB
The challenge is that pdb.Status.DisruptionsAllowed comes from the Kubernetes API (reflecting current cluster state), not from the simulated cluster snapshot (reflecting post-removal state).
Potential Fix Location
Modify cluster-autoscaler/simulator/cluster.go:SimulateNodeRemoval() to:
- Fork the cluster snapshot earlier
- Remove the node from the snapshot BEFORE the PDB drainability check
- Check if pods can be rescheduled
- THEN check PDB based on the simulated state
OR
Modify cluster-autoscaler/simulator/drainability/rules/pdb/rule.go to:
- Not fail immediately on DisruptionsAllowed < 1
- Return a "needs verification" status
- Allow the removal simulator to proceed with the rescheduling attempt
- Verify the PDB after a successful rescheduling simulation (see the sketch below)
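A minimal sketch of this second option, building on the rule.go excerpt quoted earlier. This is illustrative only: drainability.NewDeferredStatus and the deferred-verification flow do not exist in CAS today; they stand in for the proposed "needs verification" status.
func (Rule) Drainable(drainCtx *drainability.DrainContext, pod *apiv1.Pod, _ *framework.NodeInfo) drainability.Status {
	for _, pdb := range drainCtx.RemainingPdbTracker.MatchingPdbs(pod) {
		if pdb.Status.DisruptionsAllowed < 1 {
			// Hypothetical: instead of blocking here, flag the pod so the
			// removal simulator re-checks the PDB against the forked snapshot
			// (node removed, pods tentatively rescheduled) before deciding.
			return drainability.NewDeferredStatus(drain.NotEnoughPdb,
				fmt.Errorf("PDB for %s/%s must be re-verified after simulated node removal", pod.Namespace, pod.Name))
		}
	}
	return drainability.NewUndefinedStatus()
}
The removal simulator would then treat a deferred status like an undefined one until findPlaceFor() succeeds, and only afterwards confirm that the post-removal disruption budget is still non-negative.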
Related Issues
- #6984 - CAS Does Not Respect TopologySpread maxSkew=1 (Closed as "not planned")
- #8093 - CAS does not scale down with Topology Spread (fixed in PR #8164, but doesn't address the PDB interaction)
- #8161 - CAS binpacking poorly with hard topology spread constraints (Open, but focuses on scale-up, not scale-down)
This issue is distinct because it identifies the specific root cause: the PDB drainability check uses stale PDB status from the Kubernetes API that doesn't account for topology spread changes after node removal.
Additional Context
This affects all AKS customers using the combination of:
- PodDisruptionBudgets (common for production workloads)
- Topology spread constraints (increasingly common for HA)
- Cluster Autoscaler (standard for cost optimization)
The workarounds mentioned in related issues (increase maxSkew, remove PDB, manual drain) are not viable for production environments requiring high availability.
Environment Details to Collect:
- CAS version:
kubectl get deployment cluster-autoscaler -n kube-system -o yaml | grep image:
- Kubernetes version:
kubectl version --short - Full CAS logs during reproduction
- PDB status snapshots over time