[TAS] Requeue inadmissible workloads after non-TAS pod finishes #8709
base: main
Conversation
Pull request overview
This PR fixes a bug where inadmissible workloads were not automatically requeued when non-TAS pods terminated or were deleted, potentially causing workload starvation. The fix adds queue manager access to the NonTasUsageReconciler and calls QueueInadmissibleWorkloads after the cache is updated, following the same pattern used in the TAS ResourceFlavor controller.
Changes:
- Added queue manager to NonTasUsageReconciler to enable requeuing of inadmissible workloads
- Implemented automatic requeue when non-TAS pods terminate or are deleted, freeing capacity
- Removed workaround code from integration tests that manually triggered requeuing
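The fix described above can be modeled as a small, self-contained Go sketch. All names here (`QueueManager`, `nonTASUsageCache`, `onPodFinished`) are hypothetical stand-ins for kueue's queue manager, the non-TAS usage cache, and the reconciler hook — not the real API. The key ordering constraint it illustrates: the cache must be updated before `QueueInadmissibleWorkloads` runs, or the scheduler would still see the pod's old usage.

```go
package main

import "fmt"

// Workload is a minimal stand-in for a kueue workload.
type Workload struct {
	Name string
}

// QueueManager is a hypothetical stand-in for kueue's queue manager.
type QueueManager struct {
	inadmissible []Workload
	requeued     []Workload
}

// QueueInadmissibleWorkloads moves every inadmissible workload back into
// the active queue so the scheduler reconsiders the freed capacity, and
// returns how many workloads were requeued.
func (q *QueueManager) QueueInadmissibleWorkloads() int {
	n := len(q.inadmissible)
	q.requeued = append(q.requeued, q.inadmissible...)
	q.inadmissible = nil
	return n
}

// nonTASUsageCache models the cached non-TAS pod usage per node.
type nonTASUsageCache struct{ usage map[string]int }

// onPodFinished mirrors the fix: first release the finished pod's usage
// from the cache, then requeue inadmissible workloads. The cache update
// must come first so the requeued workloads see the freed capacity.
func onPodFinished(cache *nonTASUsageCache, q *QueueManager, node string, cpu int) int {
	cache.usage[node] -= cpu
	return q.QueueInadmissibleWorkloads()
}

func main() {
	cache := &nonTASUsageCache{usage: map[string]int{"node-a": 4}}
	q := &QueueManager{inadmissible: []Workload{{Name: "wl-1"}}}
	n := onPodFinished(cache, q, "node-a", 2)
	fmt.Println(n, cache.usage["node-a"])
}
```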
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pkg/controller/tas/non_tas_usage_controller.go | Added queue manager parameter and requeueInadmissibleWorkloads method; triggers requeue when pods are deleted or terminated |
| pkg/controller/tas/controllers.go | Passes queue manager to NonTasUsageReconciler constructor |
| test/integration/singlecluster/tas/tas_test.go | Removed manual requeue workarounds now that automatic requeuing is fixed |
| test/integration/singlecluster/tas/suite_test.go | Cleaned up global qManager variable that was only needed for the workaround |
/assign @gabesaba
This solution works, but has the same issues @mimowo raised in this thread: #8484 (comment). We need some combination of the suggestions @mimowo made here: #8484 (comment)
1 is already accomplished via event filters. I think 3 is quite tricky. My recommendation would be to go with option 2. You may be able to use libraries from client-go's workqueue (https://pkg.go.dev/k8s.io/client-go/util/workqueue#pkg-overview): add a dummy item (since, without implementing 1, we are requeuing everything anyway), and then requeue everything, in a batched manner, when the dummy item is picked up for work.
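The dummy-item idea above can be sketched with the standard library alone (a real controller would use `k8s.io/client-go/util/workqueue`, which deduplicates identical items the same way). A channel of capacity 1 holds at most one sentinel, so any burst of pod-finished notifications collapses into a single "requeue everything" pass; all names here are illustrative:

```go
package main

// batcher coalesces many "capacity freed" events into one pending sentinel,
// modeling a workqueue holding a single dummy item.
type batcher struct {
	pending  chan struct{} // capacity 1: at most one sentinel outstanding
	requeues int
}

func newBatcher() *batcher { return &batcher{pending: make(chan struct{}, 1)} }

// notify is called on every pod deletion/termination. The non-blocking send
// drops duplicate notifications while a sentinel is already queued, which is
// exactly the coalescing a workqueue gives you for identical items.
func (b *batcher) notify() {
	select {
	case b.pending <- struct{}{}:
	default: // sentinel already queued; coalesce
	}
}

// drain processes queued sentinels, doing one requeue-everything pass per
// sentinel, and returns the total number of passes so far.
func (b *batcher) drain() int {
	for {
		select {
		case <-b.pending:
			b.requeues++ // stand-in for QueueInadmissibleWorkloads()
		default:
			return b.requeues
		}
	}
}

func main() {
	b := newBatcher()
	for i := 0; i < 100; i++ {
		b.notify() // a burst of 100 pod-finished events...
	}
	println(b.drain()) // ...coalesces into a single requeue pass
}
```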
Yes, using a dedicated workqueue SGTM, with a delay of 1 min. We do something similar in core k8s, in the Job controller, for clearing orphaned Pods; PTAL: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/job/job_controller.go#L189 Of course, you could parametrize the workqueue (or the entire controller) by the batch time, and for testing use a smaller value, say 5s, while in prod we would use something like 1 min.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: sohankunkerkar. The full list of commands accepted by this bot can be found here. Details: needs approval from an approver in each of these files; approvers can indicate their approval by writing
When a non-TAS pod terminates or is deleted, capacity is freed on the node. This fix requeues inadmissible workloads to reconsider the freed capacity. Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
What type of PR is this?
/kind bug
What this PR does / why we need it:
When a non-TAS pod terminates or is deleted, capacity is freed on the node. This fix requeues inadmissible workloads to reconsider the freed capacity.
Which issue(s) this PR fixes:
Fixes #8653
Special notes for your reviewer:
Does this PR introduce a user-facing change?