Autoscaling with nifinodegroupautoscaler/keda leads to unstable nifi cluster #600

@haamonc

What steps will reproduce the bug?

A stable NiFi cluster of 2 nodes (NiFi pods) is working fine on the EKS cluster.
Leader election and state management happen via Kubernetes (the same issue occurred with NiFiKop version 1.10.0 using ZooKeeper for state management and leader election).
The nifi-node-group-autoscalers and scaled-object are defined as recommended in the official documentation, with the Prometheus query and threshold set so that the triggers evaluate to true and to false, to test scale up and scale down respectively.
Whenever a scale up happens, only the nodes (NiFi pods) that were scaled up go into a boot loop: each scaled-up NiFi pod stays stable for 7-10 minutes, then terminates itself, and a new NiFi pod with the same numeric ID comes up.
Example:

NAME                   READY   STATUS           RESTARTS   AGE
nifi-one-2-nodeshp5l   0/4     Terminated       0          12m
nifi-one-2-nodecdi6c   0/4     Container-init   0          2s

This makes the nifi cluster unstable.
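
To spot this replacement pattern across a longer pod listing, here is a minimal Python sketch. It assumes pod names follow the pattern `<cluster>-<node id>-node<random suffix>`, which is inferred only from the two names in the listing above and is not confirmed NiFiKop behaviour; the function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed naming pattern, inferred from the listing above:
# <cluster-name>-<node-id>-node<random-suffix>
POD_NAME = re.compile(r"^(?P<cluster>.+)-(?P<node_id>\d+)-node(?P<suffix>[a-z0-9]+)$")


def group_pods_by_node_id(pod_names):
    """Map each numeric node id to the pod names that carry it."""
    groups = defaultdict(list)
    for name in pod_names:
        match = POD_NAME.match(name)
        if match is None:
            continue  # not a NiFi node pod under the assumed pattern
        groups[int(match.group("node_id"))].append(name)
    return dict(groups)


def replaced_node_ids(pod_names):
    """Return node ids seen on more than one pod, i.e. pods that were recreated."""
    groups = group_pods_by_node_id(pod_names)
    return sorted(node_id for node_id, names in groups.items() if len(names) > 1)


pods = ["nifi-one-2-nodeshp5l", "nifi-one-2-nodecdi6c"]
print(group_pods_by_node_id(pods))
print(replaced_node_ids(pods))
```

With the two pods from the listing, both names map to node id 2, which is the signature of the boot loop described here.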
Scaling down is also difficult, as it leaves a few orphaned pods.
Scaling down was tried in two ways:

  • First, set the scaled object threshold so that the triggers become false
  • Second, delete/remove the scaled object as well as the nifi-node-group-autoscalers

Both methods lead to orphaned nodes in varying degrees. The second method is much worse than the first, but even the first method doesn't help.
Deleting the orphaned nodes becomes very difficult, and they always leave their statuses behind on the NifiCluster CRD object, which is managed by NiFiKop.
The NiFi cluster is stable only when the NiFi nodes are defined statically in the NifiCluster and no autoscaling is enabled.
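
To check for leftover statuses, one option is to compare the node ids under the cluster's status with those declared in its spec. This is only a sketch: it assumes the NifiCluster object exposes declared nodes under `spec.nodes[].id` and per-node state under `status.nodesState` keyed by node id, and those field names should be verified against the installed CRD. The sample object is hypothetical, not taken from the affected cluster.

```python
def residual_node_ids(cluster):
    """Return node ids that still have a status entry but are no longer in the spec.

    `cluster` is the NifiCluster object as a dict, e.g. parsed from
    `kubectl get nificluster <name> -n <namespace> -o json`.
    Field names (spec.nodes[].id, status.nodesState) are assumptions.
    """
    spec_ids = {str(node["id"]) for node in cluster.get("spec", {}).get("nodes", [])}
    state_ids = set(cluster.get("status", {}).get("nodesState", {}))
    return sorted(state_ids - spec_ids, key=int)


# Hypothetical object: nodes 0 and 1 are declared, node 2 only survives in status.
sample_cluster = {
    "spec": {"nodes": [{"id": 0}, {"id": 1}]},
    "status": {"nodesState": {"0": {}, "1": {}, "2": {}}},
}
print(residual_node_ids(sample_cluster))
```

Any id this returns after a scale down would correspond to the residual status described above.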

The error in the nifikop logs:

{"level":"error","time":"2025-07-31T06:56:19.482Z","logger":"nifi_client","caller":"nificlient/system.go:52","msg":"Error during preparing the request","error":"The target node id doesn't exist in the cluster","errorVerbose":"The target node id doesn't exist in the cluster\ngithub.com/konpyutaika/nifikop/pkg/nificlient.init\n\t/workspace/pkg/nificlient/common.go:14\nruntime.doInit1\n\t/usr/local/go/src/runtime/proc.go:7353\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:7320\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:254\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1700","stacktrace":"github.com/konpyutaika/nifikop/pkg/nificlient.(*nifiClient).GetClusterNode\n\t/workspace/pkg/nificlient/system.go:52\ngithub.com/konpyutaika/nifikop/pkg/clientwrappers/scale.CheckIfNCActionStepFinished\n\t/workspace/pkg/clientwrappers/scale/scale.go:151\ngithub.com/konpyutaika/nifikop/internal/controller.(*NifiClusterTaskReconciler).checkNCActionStep\n\t/workspace/internal/controller/nificlustertask_controller.go:370\ngithub.com/konpyutaika/nifikop/internal/controller.(*NifiClusterTaskReconciler).handlePodRunningTask\n\t/workspace/internal/controller/nificlustertask_controller.go:312\ngithub.com/konpyutaika/nifikop/internal/controller.(*NifiClusterTaskReconciler).Reconcile\n\t/workspace/internal/controller/nificlustertask_controller.go:93\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:116\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:303\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:224"}
{"level":"error","time":"2025-07-31T06:56:19.541Z","logger":"nifi_client","caller":"nificlient/system.go:126","msg":"Error during preparing the request","error":"The target node id doesn't exist in the cluster","errorVerbose":"The target node id doesn't exist in the cluster\ngithub.com/konpyutaika/nifikop/pkg/nificlient.init\n\t/workspace/pkg/nificlient/common.go:14\nruntime.doInit1\n\t/usr/local/go/src/runtime/proc.go:7353\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:7320\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:254\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1700","stacktrace":"github.com/konpyutaika/nifikop/pkg/nificlient.(*nifiClient).setClusterNodeStatus\n\t/workspace/pkg/nificlient/system.go:126\ngithub.com/konpyutaika/nifikop/pkg/nificlient.(*nifiClient).ConnectClusterNode\n\t/workspace/pkg/nificlient/system.go:75\ngithub.com/konpyutaika/nifikop/pkg/clientwrappers/scale.ConnectClusterNode\n\t/workspace/pkg/clientwrappers/scale/scale.go:93\ngithub.com/konpyutaika/nifikop/internal/controller.(*NifiClusterTaskReconciler).checkNCActionStep\n\t/workspace/internal/controller/nificlustertask_controller.go:415\ngithub.com/konpyutaika/nifikop/internal/controller.(*NifiClusterTaskReconciler).handlePodRunningTask\n\t/workspace/internal/controller/nificlustertask_controller.go:312\ngithub.com/konpyutaika/nifikop/internal/controller.(*NifiClusterTaskReconciler).Reconcile\n\t/workspace/internal/controller/nificlustertask_controller.go:93\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:116\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:303\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:224"}
{"level":"error","time":"2025-07-31T06:56:19.542Z","logger":"nifi_client","caller":"nificlient/system.go:160","msg":"Connect node gracefully failed since Nifi node returned non 200 error since Nifi node returned non 200","error":"The target node id doesn't exist in the cluster","errorVerbose":"The target node id doesn't exist in the cluster\ngithub.com/konpyutaika/nifikop/pkg/nificlient.init\n\t/workspace/pkg/nificlient/common.go:14\nruntime.doInit1\n\t/usr/local/go/src/runtime/proc.go:7353\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:7320\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:254\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1700","stacktrace":"github.com/konpyutaika/nifikop/pkg/nificlient.setClusterNodeStatusReturn\n\t/workspace/pkg/nificlient/system.go:160\ngithub.com/konpyutaika/nifikop/pkg/nificlient.(*nifiClient).ConnectClusterNode\n\t/workspace/pkg/nificlient/system.go:77\ngithub.com/konpyutaika/nifikop/pkg/clientwrappers/scale.ConnectClusterNode\n\t/workspace/pkg/clientwrappers/scale/scale.go:93\ngithub.com/konpyutaika/nifikop/internal/controller.(*NifiClusterTaskReconciler).checkNCActionStep\n\t/workspace/internal/controller/nificlustertask_controller.go:415\ngithub.com/konpyutaika/nifikop/internal/controller.(*NifiClusterTaskReconciler).handlePodRunningTask\n\t/workspace/internal/controller/nificlustertask_controller.go:312\ngithub.com/konpyutaika/nifikop/internal/controller.(*NifiClusterTaskReconciler).Reconcile\n\t/workspace/internal/controller/nificlustertask_controller.go:93\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:116\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:303\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:224"}
{"level":"error","time":"2025-07-31T06:56:19.542Z","logger":"scale-method","caller":"clientwrappers/common.go:17","msg":"could not communicate with nifi node","action":"Connect node gracefully","error":"The target node id doesn't exist in the cluster","errorVerbose":"The target node id doesn't exist in the cluster\ngithub.com/konpyutaika/nifikop/pkg/nificlient.init\n\t/workspace/pkg/nificlient/common.go:14\nruntime.doInit1\n\t/usr/local/go/src/runtime/proc.go:7353\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:7320\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:254\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1700","stacktrace":"github.com/konpyutaika/nifikop/pkg/clientwrappers.ErrorUpdateOperation\n\t/workspace/pkg/clientwrappers/common.go:17\ngithub.com/konpyutaika/nifikop/pkg/clientwrappers/scale.ConnectClusterNode\n\t/workspace/pkg/clientwrappers/scale/scale.go:94\ngithub.com/konpyutaika/nifikop/internal/controller.(*NifiClusterTaskReconciler).checkNCActionStep\n\t/workspace/internal/controller/nificlustertask_controller.go:415\ngithub.com/konpyutaika/nifikop/internal/controller.(*NifiClusterTaskReconciler).handlePodRunningTask\n\t/workspace/internal/controller/nificlustertask_controller.go:312\ngithub.com/konpyutaika/nifikop/internal/controller.(*NifiClusterTaskReconciler).Reconcile\n\t/workspace/internal/controller/nificlustertask_controller.go:93\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:116\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:303\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:224"}
{"level":"info","time":"2025-07-31T06:56:19.542Z","logger":"controller.NifiClusterTask","caller":"controller/controller_common.go:35","msg":"nifi cluster communication error for cluster nifi-one: The target node id doesn't exist in the cluster"}
{"level":"error","time":"2025-07-31T06:56:19.542Z","caller":"controller/controller.go:316","msg":"Reconciler error","controller":"nificluster","controllerGroup":"nifi.konpyutaika.com","controllerKind":"NifiCluster","nifiCluster":{"name":"nifi-one","namespace":"nifi"},"namespace":"nifi","name":"nifi-one","reconcileID":"63a0e21f-26e1-4066-a161-6885cbebd5bb","error":"nifi cluster communication error for cluster nifi-one: The target node id doesn't exist in the cluster","errorVerbose":"The target node id doesn't exist in the cluster\ngithub.com/konpyutaika/nifikop/pkg/nificlient.init\n\t/workspace/pkg/nificlient/common.go:14\nruntime.doInit1\n\t/usr/local/go/src/runtime/proc.go:7353\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:7320\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:254\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1700\nnifi cluster communication error for cluster nifi-one","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:224"}
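
Because these entries are long, a small summary makes the repetition easier to see. The Python sketch below counts error-level entries per `caller` and `error` value and pulls out the first stack frame. It assumes one JSON object per line, as in the operator output above. The sample entries are trimmed copies of the first error entry and the info entry from this log (stack trace cut to its first frame), rebuilt with `json.dumps` for readability; the function names are hypothetical.

```python
import json
from collections import Counter


def summarize_errors(lines):
    """Count error-level log entries by (caller, error message)."""
    counts = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip lines that were wrapped or truncated in transit
        if entry.get("level") != "error":
            continue
        key = (entry.get("caller", "?"), entry.get("error", entry.get("msg", "?")))
        counts[key] += 1
    return counts


def top_frame(entry):
    """Return the first function in the stack trace, or None if there is none."""
    stack = entry.get("stacktrace", "")
    return stack.split("\n", 1)[0] if stack else None


# Trimmed copies of two entries from the log above.
sample_entries = [
    {
        "level": "error",
        "time": "2025-07-31T06:56:19.482Z",
        "logger": "nifi_client",
        "caller": "nificlient/system.go:52",
        "msg": "Error during preparing the request",
        "error": "The target node id doesn't exist in the cluster",
        "stacktrace": "github.com/konpyutaika/nifikop/pkg/nificlient.(*nifiClient).GetClusterNode\n\t/workspace/pkg/nificlient/system.go:52",
    },
    {
        "level": "info",
        "time": "2025-07-31T06:56:19.542Z",
        "logger": "controller.NifiClusterTask",
        "caller": "controller/controller_common.go:35",
        "msg": "nifi cluster communication error for cluster nifi-one: The target node id doesn't exist in the cluster",
    },
]
sample_lines = [json.dumps(entry) for entry in sample_entries]

for (caller, error), count in summarize_errors(sample_lines).items():
    print(f"{count}x {caller}: {error}")
print(top_frame(sample_entries[0]))
```

Run against the full log, every error entry should collapse onto the same message, "The target node id doesn't exist in the cluster", reached from `GetClusterNode` and `ConnectClusterNode`.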

What is the expected behavior?

The NiFi nodes scale up and down smoothly according to the triggers, without a boot loop.

Scale down and eviction of autoscaled NiFi nodes should be smooth and easy, without any residual status in the NifiCluster CRD.

What do you see instead?

A boot loop of all the NiFi pods that were scaled up, which leads to an unstable NiFi cluster.

A bad scale down that leaves orphaned NiFi pods.
Residual, corrupted status of NiFi pods in the NifiCluster CRD object, even after force deletion of the orphaned NiFi pods.

Possible solution

No response

NiFiKop version

v1.14.1

Golang version

golang 1.24.4

Kubernetes version

Client Version: v1.32.2
Kustomize Version: v5.5.0
Server Version: v1.30.13-eks-5d4a308

NiFi version

2.0.0M4

Additional context

No response
