-
Notifications
You must be signed in to change notification settings - Fork 12
Description
🤖 Kelos Strategist Agent @gjkim42
Problem
When a TaskSpawner creates a Task from a GitHub PR or issue, and the source item subsequently changes state (PR closed, PR force-pushed, issue resolved), the running Task continues executing on stale context until it either completes or hits activeDeadlineSeconds. This wastes compute, produces outdated results, and can create conflicting PRs.
Current behavior:
- Spawner discovers items and creates Tasks (lines 296-378 of
cmd/kelos-spawner/main.go) - Spawner counts active tasks for concurrency limits but never cancels running tasks
- The only way to stop a running task is to delete the Task resource, which triggers finalizer cleanup but leaves no audit trail
- Task phases are:
Pending→Running→Succeeded|Failed(orWaitingfor deps/branch lock) — there is noCancelledphase (api/v1alpha1/task_types.go:18-32) - The retrigger logic (lines 260-277) only handles completed tasks — it deletes finished tasks when a new trigger arrives, but does nothing for tasks that are still Running
Real-world impact:
- PR closed while agent is working — agent continues for minutes/hours, wastes tokens, may push to a branch that's already been abandoned
- New commit pushed to PR — agent works against stale code, produces a diff based on old state; especially relevant for review agents and workers that create PRs from PR context
- Issue label changed to exclude-label — agent continues working on an item that no longer matches discovery filters
- Cron task still running when next schedule fires — the retrigger logic only handles completed tasks, so overlapping runs can occur (mitigated by branch locking but not for all use cases)
Proposal
1. Add Cancelled phase to Task
const (
TaskPhasePending TaskPhase = "Pending"
TaskPhaseRunning TaskPhase = "Running"
TaskPhaseSucceeded TaskPhase = "Succeeded"
TaskPhaseFailed TaskPhase = "Failed"
TaskPhaseWaiting TaskPhase = "Waiting"
TaskPhaseCancelled TaskPhase = "Cancelled" // NEW
)The Cancelled phase is a terminal phase (like Succeeded/Failed). When a Task enters this phase:
- The underlying Job is deleted (same as deletion finalizer behavior)
- The branch lock is released
status.completionTimeis setstatus.messagerecords the cancellation reason- TTL cleanup applies normally
- A new
kelos_task_completed_total{phase="Cancelled"}metric is emitted
2. Add cancellationPolicy to TaskSpawner spec
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
name: pr-worker
spec:
when:
githubPullRequests:
labels: [kelos/work]
cancellationPolicy:
# Cancel running tasks when the source item no longer matches discovery filters
onSourceRemoved: Cancel # Cancel | Ignore (default: Ignore for backward compat)
# Cancel running tasks when a retrigger creates a replacement task
onRetrigger: Cancel # Cancel | Ignore (default: Ignore)
# Grace period before cancellation takes effect (allows agent to checkpoint)
gracePeriodSeconds: 30 # optional, default: 0 (immediate)
taskTemplate:
# ...onSourceRemoved: Cancel: During each discovery cycle, the spawner compares the current discovered items against active (non-terminal) Tasks. If a Task's source item is no longer in the discovered set (PR was closed, label was removed, issue was resolved), the spawner patches the Task's status to Cancelled.
onRetrigger: Cancel: When the existing retrigger logic (line 260-277 in spawner) detects a new trigger for an item that has an active (Running/Pending/Waiting) task, it cancels the existing task before creating the replacement. Today this only works for completed tasks.
3. Add cancelledBy to TaskStatus
type TaskStatus struct {
// ...existing fields...
// CancelledBy records what initiated cancellation (e.g., "spawner:pr-worker", "user").
// +optional
CancelledBy string `json:"cancelledBy,omitempty"`
}This provides an audit trail distinguishing user-initiated deletions from automated cancellations.
4. Spawner implementation sketch
In runCycleWithSource(), after discovering items and building the existing task map, add a cancellation pass:
// Cancel tasks whose source items are no longer discovered
if ts.Spec.CancellationPolicy != nil && ts.Spec.CancellationPolicy.OnSourceRemoved == "Cancel" {
discoveredIDs := make(map[string]bool)
for _, item := range items {
discoveredIDs[fmt.Sprintf("%s-%s", ts.Name, item.ID)] = true
}
for taskName, task := range existingTaskMap {
if !discoveredIDs[taskName] && !isTerminal(task.Status.Phase) {
// Patch task status to Cancelled
task.Status.Phase = kelosv1alpha1.TaskPhaseCancelled
task.Status.Message = "Source item no longer matches discovery filters"
task.Status.CancelledBy = fmt.Sprintf("spawner:%s", ts.Name)
cl.Status().Update(ctx, task)
}
}
}The Task controller would handle the Cancelled phase similarly to Failed — delete the Job, release locks, record metrics.
Relationship to existing issues
- Complements API: Add retriggerOnPush to githubPullRequests for automatic re-triggering when new commits land #752 (retriggerOnPush): When a new push triggers a replacement task,
onRetrigger: Cancelwould cleanly cancel the stale task working on old code — without this, retriggerOnPush would either waste compute (two agents running) or need to wait for the old task to complete - Complements API: Add maxCostUSD budget enforcement to TaskSpawner for spend-limited autonomous agents #624 (maxCostUSD): Cancellation prevents cost overruns from stale tasks, complementing budget enforcement
- Complements API: Add retryStrategy to TaskSpec for configurable retry with error-context enrichment and model fallback #730 (retryStrategy): Cancelled tasks should NOT be retried — the retry logic should distinguish Failed from Cancelled
Backward compatibility
- Default
cancellationPolicyis nil, meaningIgnorefor both fields — no behavior change for existing TaskSpawners - The
Cancelledphase is additive — existing code that checks for terminal phases (Succeeded/Failed) would need to also checkCancelled, but the TTL and cleanup logic in the task controller already gates on specific phases, making this safe to add incrementally - The
cancelledByfield is optional and purely informational
Evidence from codebase
api/v1alpha1/task_types.go:18-32: Only 5 phases defined, no cancellation statecmd/kelos-spawner/main.go:241-249: Active task counting only checks Succeeded/Failed as terminal — no cancellation pathcmd/kelos-spawner/main.go:260-277: Retrigger logic only handles completed tasks, not running onesinternal/controller/task_controller.go:473-627:updateStatus()transitions are Pending→Running→Succeeded/Failed onlyinternal/controller/task_controller.go:628-647: TTL expiry only checks Succeeded/Failed phasesinternal/controller/task_controller.go:579-590: Branch lock release only on Succeeded/Failed
/kind feature