
Conversation

@sohankunkerkar (Member) commented Jan 20, 2026

When a non-TAS pod terminates or is deleted, capacity is freed on the node. This fix requeues inadmissible workloads to reconsider the freed capacity.

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #8653

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Requeue inadmissible workloads after non-TAS pod finishes

Copilot AI review requested due to automatic review settings January 20, 2026 21:43
@k8s-ci-robot added the release-note and kind/bug labels on Jan 20, 2026
netlify bot commented Jan 20, 2026

Deploy Preview for kubernetes-sigs-kueue canceled.

Latest commit: d7f11ef
Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/69718fae1dec9400080481d2

@k8s-ci-robot added the size/M and cncf-cla: yes labels on Jan 20, 2026
Copilot AI left a comment

Pull request overview

This PR fixes a bug where inadmissible workloads were not automatically requeued when non-TAS pods terminated or were deleted, potentially causing workload starvation. The fix adds queue manager access to the NonTasUsageReconciler and calls QueueInadmissibleWorkloads after the cache is updated, following the same pattern used in the TAS ResourceFlavor controller.
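A minimal sketch of this pattern, assuming a hypothetical queueManager interface as a stand-in for Kueue's queue manager (only the QueueInadmissibleWorkloads name is taken from the description above; the types and reconcile body are illustrative, not the PR's actual diff):

```go
package tas

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

// queueManager models only the one queue-manager capability this fix
// relies on; Kueue's real interface is larger and may differ in signature.
type queueManager interface {
	// QueueInadmissibleWorkloads moves inadmissible workloads back into
	// the active queues so the scheduler reconsiders them.
	QueueInadmissibleWorkloads(ctx context.Context)
}

// NonTasUsageReconciler tracks non-TAS pod usage per node (cache elided).
type NonTasUsageReconciler struct {
	queues queueManager
}

func (r *NonTasUsageReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... update the cached non-TAS usage for the pod's node here ...

	// The pod terminated or was deleted, so node capacity was freed;
	// requeue inadmissible workloads so they are re-evaluated against it.
	r.queues.QueueInadmissibleWorkloads(ctx)
	return ctrl.Result{}, nil
}
```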

Changes:

  • Added queue manager to NonTasUsageReconciler to enable requeuing of inadmissible workloads
  • Implemented automatic requeue when non-TAS pods terminate or are deleted, freeing capacity
  • Removed workaround code from integration tests that manually triggered requeuing

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

  • pkg/controller/tas/non_tas_usage_controller.go — Added queue manager parameter and requeueInadmissibleWorkloads method; triggers requeue when pods are deleted or terminated
  • pkg/controller/tas/controllers.go — Passes queue manager to NonTasUsageReconciler constructor
  • test/integration/singlecluster/tas/tas_test.go — Removed manual requeue workarounds now that automatic requeuing is fixed
  • test/integration/singlecluster/tas/suite_test.go — Cleaned up global qManager variable that was only needed for the workaround


@mimowo (Contributor) commented Jan 21, 2026

/assign @gabesaba
ptal

@gabesaba (Contributor) commented

This solution works, but has the same issues @mimowo raised in this thread: #8484 (comment)

We need some combination of the suggestions @mimowo made here: #8484 (comment)

  1. Filter for non-TAS pods (TAS pods are already handled by TAS Workloads).
  2. Batch the requests in time, like 1min or so.
  3. Probably only requeue the ClusterQueues affected by the change (using flavors matching the affected nodes).

Item 1 is already accomplished via event filters. I think item 3 is quite tricky.

My recommendation would be to go with option 2. You may be able to use client-go's workqueue library (https://pkg.go.dev/k8s.io/client-go/util/workqueue#pkg-overview): add a dummy item (since, without implementing 3, we are requeueing everything anyway), and then requeue everything in a batched manner when the dummy item is picked up for work; see the sketch below.
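A minimal sketch of that approach, assuming client-go's typed delaying workqueue (batchedRequeuer, the sentinel item, and the requeueAll callback are illustrative names, not Kueue code):

```go
package tas

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// A single dummy item; the queue's dedup coalesces repeated adds of it.
const sentinel = "requeue-inadmissible"

type batchedRequeuer struct {
	queue       workqueue.TypedDelayingInterface[string]
	batchPeriod time.Duration
	requeueAll  func() // e.g. wraps the queue manager's requeue-everything call
}

func newBatchedRequeuer(batchPeriod time.Duration, requeueAll func()) *batchedRequeuer {
	return &batchedRequeuer{
		queue:       workqueue.NewTypedDelayingQueue[string](),
		batchPeriod: batchPeriod,
		requeueAll:  requeueAll,
	}
}

// notify is called from the pod event handler. Re-adding the sentinel while
// it is already waiting is coalesced by the delaying queue, so roughly one
// requeue fires per batch period no matter how many pods finish.
func (b *batchedRequeuer) notify() {
	b.queue.AddAfter(sentinel, b.batchPeriod)
}

// run consumes the sentinel and requeues all inadmissible workloads.
func (b *batchedRequeuer) run() {
	for {
		item, shutdown := b.queue.Get()
		if shutdown {
			return
		}
		b.requeueAll()
		b.queue.Done(item)
	}
}
```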

@mimowo (Contributor) commented Jan 21, 2026

Yes, using a dedicated workqueue with a delay of 1min SGTM. We do something similar in core Kubernetes, in the Job controller for clearing orphaned Pods; PTAL: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/job/job_controller.go#L189

Of course, you could parametrize the workqueue (or the entire controller) by the batch time, and for testing use a smaller value, say 5s, while in production we would use something like 1min.
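Wiring the batch time through the constructor from the sketch above might then look like this (period values as suggested; names remain hypothetical):

```go
// Production wiring: coalesce pod-termination events over one minute.
requeuer := newBatchedRequeuer(time.Minute, func() {
	// e.g. call the queue manager's QueueInadmissibleWorkloads here
})
go requeuer.run()

// Integration tests would pass a shorter period instead, e.g.:
// newBatchedRequeuer(5*time.Second, requeueAll)
```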

@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sohankunkerkar
Once this PR has been reviewed and has the lgtm label, please ask for approval from gabesaba. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/L label and removed the size/M label on Jan 22, 2026
When a non-TAS pod terminates or is deleted, capacity is freed on
the node. This fix requeues inadmissible workloads to reconsider
the freed capacity.

Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
@sohankunkerkar (Member, Author) commented

@gabesaba @mimowo could you PTAL?


Labels

cncf-cla: yes, kind/bug, release-note, size/L


Development

Successfully merging this pull request may close these issues.

Requeue relevant inadmissible workloads after a non-TAS workload finishes

4 participants