Skip to content

API: Add Cancelled phase and obsolescencePolicy to TaskSpawner for automatic cancellation of stale running tasks #765

@kelos-bot

Description

@kelos-bot

🤖 Kelos Strategist Agent @gjkim42

Problem

When a TaskSpawner creates a Task from a GitHub PR or issue, and the source item subsequently changes state (PR closed, PR force-pushed, issue resolved), the running Task continues executing on stale context until it either completes or hits activeDeadlineSeconds. This wastes compute, produces outdated results, and can create conflicting PRs.

Current behavior:

  • Spawner discovers items and creates Tasks (lines 296-378 of cmd/kelos-spawner/main.go)
  • Spawner counts active tasks for concurrency limits but never cancels running tasks
  • The only way to stop a running task is to delete the Task resource, which triggers finalizer cleanup but leaves no audit trail
  • Task phases are: PendingRunningSucceeded|Failed (or Waiting for deps/branch lock) — there is no Cancelled phase (api/v1alpha1/task_types.go:18-32)
  • The retrigger logic (lines 260-277) only handles completed tasks — it deletes finished tasks when a new trigger arrives, but does nothing for tasks that are still Running

Real-world impact:

  1. PR closed while agent is working — agent continues for minutes/hours, wastes tokens, may push to a branch that's already been abandoned
  2. New commit pushed to PR — agent works against stale code, produces a diff based on old state; especially relevant for review agents and workers that create PRs from PR context
  3. Issue label changed to exclude-label — agent continues working on an item that no longer matches discovery filters
  4. Cron task still running when next schedule fires — the retrigger logic only handles completed tasks, so overlapping runs can occur (mitigated by branch locking but not for all use cases)

Proposal

1. Add Cancelled phase to Task

const (
    TaskPhasePending   TaskPhase = "Pending"
    TaskPhaseRunning   TaskPhase = "Running"
    TaskPhaseSucceeded TaskPhase = "Succeeded"
    TaskPhaseFailed    TaskPhase = "Failed"
    TaskPhaseWaiting   TaskPhase = "Waiting"
    TaskPhaseCancelled TaskPhase = "Cancelled"  // NEW
)

The Cancelled phase is a terminal phase (like Succeeded/Failed). When a Task enters this phase:

  • The underlying Job is deleted (same as deletion finalizer behavior)
  • The branch lock is released
  • status.completionTime is set
  • status.message records the cancellation reason
  • TTL cleanup applies normally
  • A new kelos_task_completed_total{phase="Cancelled"} metric is emitted

2. Add cancellationPolicy to TaskSpawner spec

apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: pr-worker
spec:
  when:
    githubPullRequests:
      labels: [kelos/work]
  cancellationPolicy:
    # Cancel running tasks when the source item no longer matches discovery filters
    onSourceRemoved: Cancel    # Cancel | Ignore (default: Ignore for backward compat)
    # Cancel running tasks when a retrigger creates a replacement task
    onRetrigger: Cancel        # Cancel | Ignore (default: Ignore)
    # Grace period before cancellation takes effect (allows agent to checkpoint)
    gracePeriodSeconds: 30     # optional, default: 0 (immediate)
  taskTemplate:
    # ...

onSourceRemoved: Cancel: During each discovery cycle, the spawner compares the current discovered items against active (non-terminal) Tasks. If a Task's source item is no longer in the discovered set (PR was closed, label was removed, issue was resolved), the spawner patches the Task's status to Cancelled.

onRetrigger: Cancel: When the existing retrigger logic (line 260-277 in spawner) detects a new trigger for an item that has an active (Running/Pending/Waiting) task, it cancels the existing task before creating the replacement. Today this only works for completed tasks.

3. Add cancelledBy to TaskStatus

type TaskStatus struct {
    // ...existing fields...
    
    // CancelledBy records what initiated cancellation (e.g., "spawner:pr-worker", "user").
    // +optional
    CancelledBy string `json:"cancelledBy,omitempty"`
}

This provides an audit trail distinguishing user-initiated deletions from automated cancellations.

4. Spawner implementation sketch

In runCycleWithSource(), after discovering items and building the existing task map, add a cancellation pass:

// Cancel tasks whose source items are no longer discovered
if ts.Spec.CancellationPolicy != nil && ts.Spec.CancellationPolicy.OnSourceRemoved == "Cancel" {
    discoveredIDs := make(map[string]bool)
    for _, item := range items {
        discoveredIDs[fmt.Sprintf("%s-%s", ts.Name, item.ID)] = true
    }
    for taskName, task := range existingTaskMap {
        if !discoveredIDs[taskName] && !isTerminal(task.Status.Phase) {
            // Patch task status to Cancelled
            task.Status.Phase = kelosv1alpha1.TaskPhaseCancelled
            task.Status.Message = "Source item no longer matches discovery filters"
            task.Status.CancelledBy = fmt.Sprintf("spawner:%s", ts.Name)
            cl.Status().Update(ctx, task)
        }
    }
}

The Task controller would handle the Cancelled phase similarly to Failed — delete the Job, release locks, record metrics.

Relationship to existing issues

Backward compatibility

  • Default cancellationPolicy is nil, meaning Ignore for both fields — no behavior change for existing TaskSpawners
  • The Cancelled phase is additive — existing code that checks for terminal phases (Succeeded/Failed) would need to also check Cancelled, but the TTL and cleanup logic in the task controller already gates on specific phases, making this safe to add incrementally
  • The cancelledBy field is optional and purely informational

Evidence from codebase

  • api/v1alpha1/task_types.go:18-32: Only 5 phases defined, no cancellation state
  • cmd/kelos-spawner/main.go:241-249: Active task counting only checks Succeeded/Failed as terminal — no cancellation path
  • cmd/kelos-spawner/main.go:260-277: Retrigger logic only handles completed tasks, not running ones
  • internal/controller/task_controller.go:473-627: updateStatus() transitions are Pending→Running→Succeeded/Failed only
  • internal/controller/task_controller.go:628-647: TTL expiry only checks Succeeded/Failed phases
  • internal/controller/task_controller.go:579-590: Branch lock release only on Succeeded/Failed

/kind feature

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions