
API: Add TaskRecord CRD for persistent task completion snapshots that survive TTL cleanup #771

@kelos-bot

Description


🤖 Kelos Strategist Agent @gjkim42

Area: New CRDs & API Extensions

Summary

When Tasks complete and TTL cleanup deletes them, all per-task data is permanently lost: the prompt used, outputs produced, results (branch, PR, cost), duration, work item context, and failure messages. Prometheus metrics (kelos_task_cost_usd_total, kelos_task_duration_seconds) capture aggregate counters and histograms but cannot answer per-task questions like "what did the agent produce for issue #42?" or "why did the last task for PR #17 fail?"

This proposal adds a lightweight TaskRecord CRD that the controller creates as an immutable completion snapshot just before TTL deletion, preserving queryable in-cluster task history.

Problem

1. TTL cleanup destroys all per-task data

The TTL deletion path (internal/controller/task_controller.go:148-159) unconditionally deletes the Task:

if expired, requeueAfter := r.ttlExpired(&task); expired {
    logger.Info("Deleting Task due to TTL expiration", "task", task.Name)
    if err := r.Delete(ctx, &task); err != nil {
        // ...
    }
    return ctrl.Result{}, nil
}

Once deleted, the following data is gone forever:

  • status.results — branch, commit, PR URL, cost-usd, input/output tokens
  • status.outputs — raw agent output lines (between KELOS_OUTPUTS markers)
  • status.message — failure reason for failed tasks
  • status.startTime / status.completionTime — execution duration
  • spec.prompt — what the agent was asked to do
  • Labels/annotations — spawner name, source metadata (if #649, "API: Propagate work item metadata as labels and annotations on spawned Tasks", is implemented)

2. No way to do post-mortem analysis on completed tasks

Common operational questions that are currently unanswerable after TTL cleanup:

  • "What was the success rate for spawner X over the past week?" — kelos_task_completed_total gives counts but no individual task details
  • "Which PR did the agent open for issue #42 (Install build-essential in claude-code container)?" — lost after TTL
  • "Why did the last 3 tasks from the cron spawner fail?" — failure messages lost
  • "What was the average cost per task for Opus vs Sonnet last month?" — aggregate cost exists in Prometheus, but per-task cost/model pairing is lost
  • "Show me all tasks that touched branch X" — impossible to reconstruct

3. The TTL dilemma forces a bad trade-off

Without TTL, tasks accumulate indefinitely — cluttering kubectl get tasks, consuming etcd storage, and blocking TaskSpawner from re-creating tasks for the same work item (the spawner deduplicates by checking for existing tasks with the same name).

With TTL, you get clean task recycling but lose all history. Teams must choose between operational hygiene and observability. This trade-off shouldn't be necessary.

4. Prerequisite for cost budget persistence

Issue #624 (maxCostUSD) proposes cumulative spend limits on TaskSpawner. The proposal acknowledges that TTL cleanup breaks cost tracking: "When ttlSecondsAfterFinished is set, completed Tasks are auto-deleted. This means the spawner loses visibility into historical costs." The proposed workaround is persisting totalCostUSD in spawner status. TaskRecord would provide a cleaner foundation — the spawner (or controller) can sum costs from TaskRecords instead of maintaining a running counter that's fragile to restarts and race conditions.

5. Prometheus metrics are necessary but insufficient

The existing metrics provide excellent aggregate observability:

  • kelos_task_cost_usd_total — cumulative cost by namespace/type/spawner/model
  • kelos_task_duration_seconds — duration histogram
  • kelos_task_completed_total — completion count by phase

But metrics are designed for dashboards and alerting, not for individual record lookup. You can't query "show me the details of the task that cost $12" from Prometheus. Metrics and records serve complementary purposes.

Proposed Design

New CRD: TaskRecord

// TaskRecordSpec captures a snapshot of a completed Task.
type TaskRecordSpec struct {
    // TaskName is the original Task resource name.
    TaskName string `json:"taskName"`

    // SpawnerName is the TaskSpawner that created this task (empty for ad-hoc tasks).
    // +optional
    SpawnerName string `json:"spawnerName,omitempty"`

    // AgentType is the agent type used (claude-code, codex, etc.).
    AgentType string `json:"agentType"`

    // Model is the model used.
    // +optional
    Model string `json:"model,omitempty"`

    // Phase is the terminal phase (Succeeded or Failed).
    Phase TaskPhase `json:"phase"`

    // Message is the status message at completion.
    // +optional
    Message string `json:"message,omitempty"`

    // StartTime is when the Task started running.
    // +optional
    StartTime *metav1.Time `json:"startTime,omitempty"`

    // CompletionTime is when the Task completed.
    // +optional
    CompletionTime *metav1.Time `json:"completionTime,omitempty"`

    // Outputs contains URLs and references produced by the agent.
    // +optional
    Outputs []string `json:"outputs,omitempty"`

    // Results contains structured key-value outputs (branch, commit, pr, cost-usd, etc.).
    // +optional
    Results map[string]string `json:"results,omitempty"`

    // SourceLabels captures the task's labels at completion time for queryability.
    // +optional
    SourceLabels map[string]string `json:"sourceLabels,omitempty"`
}

Note: The spec.prompt is intentionally excluded from TaskRecordSpec to keep records small. Prompts can be large (multi-KB), and the primary use case for TaskRecord is operational querying, not prompt replay. If prompt archival is needed, onCompletion hooks (#749) can push the full Task to an external system.

CRD definition

// +genclient
// +genclient:noStatus
// +kubebuilder:object:root=true
// +kubebuilder:printcolumn:name="Task",type=string,JSONPath=`.spec.taskName`
// +kubebuilder:printcolumn:name="Spawner",type=string,JSONPath=`.spec.spawnerName`
// +kubebuilder:printcolumn:name="Type",type=string,JSONPath=`.spec.agentType`
// +kubebuilder:printcolumn:name="Phase",type=string,JSONPath=`.spec.phase`
// +kubebuilder:printcolumn:name="Duration",type=string,JSONPath=`.metadata.annotations.kelos\.dev/duration`
// +kubebuilder:printcolumn:name="Cost",type=string,JSONPath=`.spec.results.cost-usd`,priority=1
// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`

Controller integration

The change is minimal — add record creation to the TTL deletion path in task_controller.go:

if expired, requeueAfter := r.ttlExpired(&task); expired {
    // Create a TaskRecord before deleting the Task
    if err := r.createTaskRecord(ctx, &task); err != nil {
        logger.Error(err, "Failed to create TaskRecord, proceeding with deletion")
        // Non-fatal: don't block TTL cleanup if record creation fails
    }
    logger.Info("Deleting Task due to TTL expiration", "task", task.Name)
    if err := r.Delete(ctx, &task); err != nil {
        // ...
    }
    return ctrl.Result{}, nil
}

func (r *TaskReconciler) createTaskRecord(ctx context.Context, task *kelosv1alpha1.Task) error {
    record := &kelosv1alpha1.TaskRecord{
        ObjectMeta: metav1.ObjectMeta{
            Name:      task.Name,
            Namespace: task.Namespace,
            Labels: map[string]string{
                "kelos.dev/taskspawner": task.Labels["kelos.dev/taskspawner"],
                "kelos.dev/agent-type":  task.Spec.Type,
                "kelos.dev/phase":       string(task.Status.Phase),
            },
        },
        Spec: kelosv1alpha1.TaskRecordSpec{
            TaskName:       task.Name,
            SpawnerName:    task.Labels["kelos.dev/taskspawner"],
            AgentType:      task.Spec.Type,
            Model:          task.Spec.Model,
            Phase:          task.Status.Phase,
            Message:        task.Status.Message,
            StartTime:      task.Status.StartTime,
            CompletionTime: task.Status.CompletionTime,
            Outputs:        task.Status.Outputs,
            Results:        task.Status.Results,
            SourceLabels:   task.Labels,
        },
    }
    return r.Create(ctx, record)
}

Retention policy

TaskRecords need their own lifecycle management to avoid unbounded growth. Two options:

Option A (recommended): TTL on TaskRecord via annotation + external cleanup
Set a kelos.dev/expires-at annotation on the TaskRecord at creation time. A separate CronJob or the controller itself periodically cleans up expired records. Default retention: 30 days.

Option B: Controller-managed retention
Add spec.taskRecordRetention to TaskSpawner:

spec:
  taskRecordRetention:
    maxAge: 720h    # 30 days
    maxCount: 1000  # Keep at most 1000 records per spawner

Opt-in behavior

To avoid surprise storage growth, TaskRecord creation should be opt-in initially:

type TaskSpawnerSpec struct {
    // ...existing fields...

    // RecordCompletedTasks creates a TaskRecord snapshot for each completed
    // Task before TTL cleanup. Defaults to false.
    // +optional
    // +kubebuilder:default=false
    RecordCompletedTasks *bool `json:"recordCompletedTasks,omitempty"`
}

For ad-hoc tasks (created via kelos run), a corresponding field on TaskSpec:

type TaskSpec struct {
    // ...existing fields...

    // RecordOnCompletion creates a TaskRecord snapshot when this Task
    // reaches a terminal phase. Defaults to false.
    // +optional
    RecordOnCompletion *bool `json:"recordOnCompletion,omitempty"`
}

Example usage

Querying task history

# All completed tasks from a spawner
kubectl get taskrecords -l kelos.dev/taskspawner=bug-fixer

# Failed tasks only
kubectl get taskrecords -l kelos.dev/phase=Failed

# Tasks by agent type
kubectl get taskrecords -l kelos.dev/agent-type=claude-code

# With cost details (priority=1 column)
kubectl get taskrecords -l kelos.dev/taskspawner=bug-fixer -o wide

Output:

NAME              TASK              SPAWNER      TYPE          PHASE      DURATION   AGE
bug-fixer-42      bug-fixer-42      bug-fixer    claude-code   Succeeded  4m32s      2d
bug-fixer-45      bug-fixer-45      bug-fixer    claude-code   Failed     1m15s      1d
bug-fixer-51      bug-fixer-51      bug-fixer    claude-code   Succeeded  6m08s      3h

Cost analysis across completed tasks

# Sum costs for a spawner (using kubectl + awk)
kubectl get taskrecords -l kelos.dev/taskspawner=bug-fixer \
  -o jsonpath='{range .items[*]}{.spec.results.cost-usd}{"\n"}{end}' | \
  awk '{sum += $1} END {printf "Total: $%.2f\n", sum}'

Integration with kelos CLI

kelos get taskrecords (or kelos history) could provide a user-friendly view:

$ kelos history --spawner bug-fixer --since 7d
TASK             PHASE      MODEL    COST     DURATION   PR                            AGE
bug-fixer-42     Succeeded  opus     $2.31    4m32s      github.com/org/repo/pull/87   2d
bug-fixer-45     Failed     opus     $0.85    1m15s      —                             1d
bug-fixer-51     Succeeded  sonnet   $0.42    6m08s      github.com/org/repo/pull/91   3h
                                     ------
                            Total:   $3.58    3 tasks (2 succeeded, 1 failed)

Interaction with existing and proposed features

Feature                     Interaction
TTL cleanup                 TaskRecord is created before TTL deletion — complementary by design
maxCostUSD (#624)           Spawner can sum cost-usd from TaskRecords instead of maintaining a fragile running counter in status
onCompletion hooks (#749)   Complementary — hooks push to external systems, TaskRecord persists in-cluster; both can coexist
Work item metadata (#649)   TaskRecord captures task labels, including source metadata labels, making history queryable by source
Prometheus metrics          Complementary — metrics for dashboards/alerting, TaskRecords for per-task drill-down
kelos get task --detail     TaskRecord preserves detail data that would otherwise be lost after TTL
v1alpha2 (#704)             TaskRecord is a new CRD, independent of the TaskSpawner v1alpha2 cleanup

Implementation scope

  1. CRD definition: Add TaskRecord type to api/v1alpha1/ (~60 lines)
  2. Controller change: Add createTaskRecord to TTL deletion path (~30 lines)
  3. Opt-in field: Add recordCompletedTasks to TaskSpawnerSpec (~5 lines)
  4. CLI: Add kelos get taskrecords or kelos history command
  5. Retention: Add cleanup logic for expired TaskRecords
  6. Run make update for generated code

Total estimated production code: ~150 lines + generated code + tests.

Why a new CRD (not an annotation, ConfigMap, or external store)

  • Queryable via kubectl label selectors — the primary use case is operational querying
  • Kubernetes-native RBAC — teams can grant read-only access to TaskRecords without Task access
  • Controller-runtime caching — efficient list/watch without custom client code
  • Consistent lifecycle — namespace-scoped, owner references, garbage collection all work naturally
  • etcd-friendly — TaskRecords are small (~500 bytes each) and immutable (no update churn)

A ConfigMap-based approach would work but loses label-based querying and type safety. An external store (database, S3) requires additional infrastructure. TaskRecord as a CRD keeps everything Kubernetes-native.

Backward compatibility

  • New CRD — no changes to existing resources
  • Opt-in via recordCompletedTasks — zero impact on existing deployments
  • Tasks without TTL are unaffected (they persist indefinitely anyway)
  • No changes to spawner dedup logic (spawner checks for Tasks, not TaskRecords)

/kind feature
