🤖 Kelos Strategist Agent @gjkim42
Area: New CRDs & API Extensions
Summary
When Tasks complete and TTL cleanup deletes them, all per-task data is permanently lost: the prompt used, outputs produced, results (branch, PR, cost), duration, work item context, and failure messages. Prometheus metrics (kelos_task_cost_usd_total, kelos_task_duration_seconds) capture aggregate counters and histograms but cannot answer per-task questions like "what did the agent produce for issue #42?" or "why did the last task for PR #17 fail?"
This proposal adds a lightweight TaskRecord CRD that the controller creates as an immutable completion snapshot just before TTL deletion, preserving queryable in-cluster task history.
Problem
1. TTL cleanup destroys all per-task data
The TTL deletion path (internal/controller/task_controller.go:148-159) unconditionally deletes the Task:
```go
if expired, requeueAfter := r.ttlExpired(&task); expired {
	logger.Info("Deleting Task due to TTL expiration", "task", task.Name)
	if err := r.Delete(ctx, &task); err != nil {
		// ...
	}
	return ctrl.Result{}, nil
}
```

Once deleted, the following data is gone forever:
- `status.results` — branch, commit, PR URL, cost-usd, input/output tokens
- `status.outputs` — raw agent output lines (between KELOS_OUTPUTS markers)
- `status.message` — failure reason for failed tasks
- `status.startTime` / `status.completionTime` — execution duration
- `spec.prompt` — what the agent was asked to do
- Labels/annotations — spawner name, source metadata (if #649, "Propagate work item metadata as labels and annotations on spawned Tasks", is implemented)
2. No way to do post-mortem analysis on completed tasks
Common operational questions that are currently unanswerable after TTL cleanup:
- "What was the success rate for spawner X over the past week?" — `kelos_task_completed_total` gives counts but no individual task details
- "Which PR did the agent open for issue #42 (Install build-essential in claude-code container)?" — lost after TTL
- "Why did the last 3 tasks from the cron spawner fail?" — failure messages lost
- "What was the average cost per task for Opus vs Sonnet last month?" — aggregate cost exists in Prometheus, but per-task cost/model pairing is lost
- "Show me all tasks that touched branch X" — impossible to reconstruct
3. The TTL dilemma forces a bad trade-off
Without TTL, tasks accumulate indefinitely — cluttering kubectl get tasks, consuming etcd storage, and blocking TaskSpawner from re-creating tasks for the same work item (the spawner deduplicates by checking for existing tasks with the same name).
With TTL, you get clean task recycling but lose all history. Teams must choose between operational hygiene and observability. This trade-off shouldn't be necessary.
4. Prerequisite for cost budget persistence
Issue #624 (maxCostUSD) proposes cumulative spend limits on TaskSpawner. The proposal acknowledges that TTL cleanup breaks cost tracking: "When ttlSecondsAfterFinished is set, completed Tasks are auto-deleted. This means the spawner loses visibility into historical costs." The proposed workaround is persisting totalCostUSD in spawner status. TaskRecord would provide a cleaner foundation — the spawner (or controller) can sum costs from TaskRecords instead of maintaining a running counter that's fragile to restarts and race conditions.
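As a sketch of how a spawner could recover cumulative spend from records: the helper below sums the `cost-usd` entries from a slice of `Results` maps, as they would appear in the proposed `TaskRecordSpec`. The function name `sumCostUSD` and the standalone shape are illustrative assumptions, not existing Kelos code.

```go
package main

import (
	"fmt"
	"strconv"
)

// sumCostUSD totals the "cost-usd" entries across TaskRecord Results maps.
// Records without a cost entry, or with an unparsable value, are skipped,
// so a single malformed record cannot poison the total.
func sumCostUSD(results []map[string]string) float64 {
	total := 0.0
	for _, r := range results {
		if v, ok := r["cost-usd"]; ok {
			if c, err := strconv.ParseFloat(v, 64); err == nil {
				total += c
			}
		}
	}
	return total
}

func main() {
	records := []map[string]string{
		{"cost-usd": "2.31", "pr": "github.com/org/repo/pull/87"},
		{"cost-usd": "0.85"},
		{"branch": "fix/42"}, // no cost entry: skipped
	}
	fmt.Printf("Total: $%.2f\n", sumCostUSD(records)) // Total: $3.16
}
```

Because each TaskRecord is immutable, re-running this sum after a controller restart yields the same answer, which is exactly the fragility the running-counter approach has to engineer around.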
5. Prometheus metrics are necessary but insufficient
The existing metrics provide excellent aggregate observability:
- `kelos_task_cost_usd_total` — cumulative cost by namespace/type/spawner/model
- `kelos_task_duration_seconds` — duration histogram
- `kelos_task_completed_total` — completion count by phase
But metrics are designed for dashboards and alerting, not for individual record lookup. You can't query "show me the details of the task that cost $12" from Prometheus. Metrics and records serve complementary purposes.
Proposed Design
New CRD: TaskRecord
```go
// TaskRecordSpec captures a snapshot of a completed Task.
type TaskRecordSpec struct {
	// TaskName is the original Task resource name.
	TaskName string `json:"taskName"`

	// SpawnerName is the TaskSpawner that created this task (empty for ad-hoc tasks).
	// +optional
	SpawnerName string `json:"spawnerName,omitempty"`

	// AgentType is the agent type used (claude-code, codex, etc.).
	AgentType string `json:"agentType"`

	// Model is the model used.
	// +optional
	Model string `json:"model,omitempty"`

	// Phase is the terminal phase (Succeeded or Failed).
	Phase TaskPhase `json:"phase"`

	// Message is the status message at completion.
	// +optional
	Message string `json:"message,omitempty"`

	// StartTime is when the Task started running.
	// +optional
	StartTime *metav1.Time `json:"startTime,omitempty"`

	// CompletionTime is when the Task completed.
	// +optional
	CompletionTime *metav1.Time `json:"completionTime,omitempty"`

	// Outputs contains URLs and references produced by the agent.
	// +optional
	Outputs []string `json:"outputs,omitempty"`

	// Results contains structured key-value outputs (branch, commit, pr, cost-usd, etc.).
	// +optional
	Results map[string]string `json:"results,omitempty"`

	// SourceLabels captures the task's labels at completion time for queryability.
	// +optional
	SourceLabels map[string]string `json:"sourceLabels,omitempty"`
}
```

Note: `spec.prompt` is intentionally excluded from TaskRecordSpec to keep records small. Prompts can be large (multi-KB), and the primary use case for TaskRecord is operational querying, not prompt replay. If prompt archival is needed, `onCompletion` hooks (#749) can push the full Task to an external system.
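For illustration, a TaskRecord produced under this spec might look like the following. All names and values are hypothetical, and the `kelos.dev/v1alpha1` group/version is assumed to match the existing API group:

```yaml
apiVersion: kelos.dev/v1alpha1   # assumed group/version
kind: TaskRecord
metadata:
  name: bug-fixer-42
  namespace: agents
  labels:
    kelos.dev/taskspawner: bug-fixer
    kelos.dev/agent-type: claude-code
    kelos.dev/phase: Succeeded
spec:
  taskName: bug-fixer-42
  spawnerName: bug-fixer
  agentType: claude-code
  model: opus
  phase: Succeeded
  startTime: "2025-06-01T10:00:00Z"
  completionTime: "2025-06-01T10:04:32Z"
  outputs:
    - https://github.com/org/repo/pull/87
  results:
    branch: fix/issue-42
    pr: https://github.com/org/repo/pull/87
    cost-usd: "2.31"
```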
CRD definition
```go
// +genclient
// +genclient:noStatus
// +kubebuilder:object:root=true
// +kubebuilder:printcolumn:name="Task",type=string,JSONPath=`.spec.taskName`
// +kubebuilder:printcolumn:name="Spawner",type=string,JSONPath=`.spec.spawnerName`
// +kubebuilder:printcolumn:name="Type",type=string,JSONPath=`.spec.agentType`
// +kubebuilder:printcolumn:name="Phase",type=string,JSONPath=`.spec.phase`
// +kubebuilder:printcolumn:name="Duration",type=string,JSONPath=`.metadata.annotations.kelos\.dev/duration`
// +kubebuilder:printcolumn:name="Cost",type=string,JSONPath=`.spec.results.cost-usd`,priority=1
// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`
```

Controller integration
The change is minimal — add record creation to the TTL deletion path in task_controller.go:
```go
if expired, requeueAfter := r.ttlExpired(&task); expired {
	// Create a TaskRecord before deleting the Task.
	if err := r.createTaskRecord(ctx, &task); err != nil {
		// Non-fatal: don't block TTL cleanup if record creation fails.
		logger.Error(err, "Failed to create TaskRecord, proceeding with deletion")
	}
	logger.Info("Deleting Task due to TTL expiration", "task", task.Name)
	if err := r.Delete(ctx, &task); err != nil {
		// ...
	}
}
```

```go
func (r *TaskReconciler) createTaskRecord(ctx context.Context, task *kelosv1alpha1.Task) error {
	// Store the wall-clock duration as an annotation so the Duration
	// printcolumn on the CRD has a value to display (requires the
	// standard library "time" package).
	annotations := map[string]string{}
	if task.Status.StartTime != nil && task.Status.CompletionTime != nil {
		d := task.Status.CompletionTime.Sub(task.Status.StartTime.Time).Round(time.Second)
		annotations["kelos.dev/duration"] = d.String()
	}
	record := &kelosv1alpha1.TaskRecord{
		ObjectMeta: metav1.ObjectMeta{
			Name:        task.Name,
			Namespace:   task.Namespace,
			Annotations: annotations,
			Labels: map[string]string{
				"kelos.dev/taskspawner": task.Labels["kelos.dev/taskspawner"],
				"kelos.dev/agent-type":  task.Spec.Type,
				"kelos.dev/phase":       string(task.Status.Phase),
			},
		},
		Spec: kelosv1alpha1.TaskRecordSpec{
			TaskName:       task.Name,
			SpawnerName:    task.Labels["kelos.dev/taskspawner"],
			AgentType:      task.Spec.Type,
			Model:          task.Spec.Model,
			Phase:          task.Status.Phase,
			Message:        task.Status.Message,
			StartTime:      task.Status.StartTime,
			CompletionTime: task.Status.CompletionTime,
			Outputs:        task.Status.Outputs,
			Results:        task.Status.Results,
			SourceLabels:   task.Labels,
		},
	}
	return r.Create(ctx, record)
}
```

Retention policy
TaskRecords need their own lifecycle management to avoid unbounded growth. Two options:
Option A (recommended): TTL on TaskRecord via annotation + external cleanup
Set a kelos.dev/expires-at annotation on the TaskRecord. A separate CronJob or the controller itself periodically cleans up expired records. Default retention: 30 days.
Option B: Controller-managed retention
Add spec.taskRecordRetention to TaskSpawner:
```yaml
spec:
  taskRecordRetention:
    maxAge: 720h    # 30 days
    maxCount: 1000  # Keep at most 1000 records per spawner
```

Opt-in behavior
To avoid surprise storage growth, TaskRecord creation should be opt-in initially:
```go
type TaskSpawnerSpec struct {
	// ...existing fields...

	// RecordCompletedTasks creates a TaskRecord snapshot for each completed
	// Task before TTL cleanup. Defaults to false.
	// +optional
	// +kubebuilder:default=false
	RecordCompletedTasks *bool `json:"recordCompletedTasks,omitempty"`
}
```

For ad-hoc tasks (created via `kelos run`), a corresponding field on TaskSpec:
```go
type TaskSpec struct {
	// ...existing fields...

	// RecordOnCompletion creates a TaskRecord snapshot when this Task
	// reaches a terminal phase. Defaults to false.
	// +optional
	RecordOnCompletion *bool `json:"recordOnCompletion,omitempty"`
}
```

Example usage
Querying task history
```shell
# All completed tasks from a spawner
kubectl get taskrecords -l kelos.dev/taskspawner=bug-fixer

# Failed tasks only
kubectl get taskrecords -l kelos.dev/phase=Failed

# Tasks by agent type
kubectl get taskrecords -l kelos.dev/agent-type=claude-code

# With cost details (priority=1 column)
kubectl get taskrecords -l kelos.dev/taskspawner=bug-fixer -o wide
```

Output:

```
NAME           TASK           SPAWNER    TYPE         PHASE      DURATION   AGE
bug-fixer-42   bug-fixer-42   bug-fixer  claude-code  Succeeded  4m32s      2d
bug-fixer-45   bug-fixer-45   bug-fixer  claude-code  Failed     1m15s      1d
bug-fixer-51   bug-fixer-51   bug-fixer  claude-code  Succeeded  6m08s      3h
```
Cost analysis across completed tasks
```shell
# Sum costs for a spawner (using kubectl + awk)
kubectl get taskrecords -l kelos.dev/taskspawner=bug-fixer \
  -o jsonpath='{range .items[*]}{.spec.results.cost-usd}{"\n"}{end}' | \
  awk '{sum += $1} END {printf "Total: $%.2f\n", sum}'
```

Integration with kelos CLI
kelos get taskrecords (or kelos history) could provide a user-friendly view:
```
$ kelos history --spawner bug-fixer --since 7d
TASK          PHASE      MODEL   COST    DURATION   PR                            AGE
bug-fixer-42  Succeeded  opus    $2.31   4m32s      github.com/org/repo/pull/87   2d
bug-fixer-45  Failed     opus    $0.85   1m15s      —                             1d
bug-fixer-51  Succeeded  sonnet  $0.42   6m08s      github.com/org/repo/pull/91   3h
------
Total: $3.58   3 tasks (2 succeeded, 1 failed)
```
Interaction with existing and proposed features
| Feature | Interaction |
|---|---|
| TTL cleanup | TaskRecord is created before TTL deletion — complementary by design |
| `maxCostUSD` (#624) | Spawner can sum `cost-usd` from TaskRecords instead of maintaining a fragile running counter in status |
| `onCompletion` hooks (#749) | Complementary — hooks push to external systems, TaskRecord persists in-cluster. Both can coexist |
| Work item metadata (#649) | TaskRecord captures task labels including source metadata labels, making history queryable by source |
| Prometheus metrics | Complementary — metrics for dashboards/alerting, TaskRecords for per-task drill-down |
| `kelos get task --detail` | TaskRecord preserves detail data that would otherwise be lost after TTL |
| v1alpha2 (#704) | TaskRecord is a new CRD, independent of the TaskSpawner v1alpha2 cleanup |
Implementation scope
- CRD definition: Add `TaskRecord` type to `api/v1alpha1/` (~60 lines)
- Controller change: Add `createTaskRecord` to the TTL deletion path (~30 lines)
- Opt-in field: Add `recordCompletedTasks` to TaskSpawnerSpec (~5 lines)
- CLI: Add `kelos get taskrecords` or `kelos history` command
- Retention: Add cleanup logic for expired TaskRecords
- Run `make update` for generated code
Total estimated production code: ~150 lines + generated code + tests.
Why a new CRD (not an annotation, ConfigMap, or external store)
- Queryable via kubectl label selectors — the primary use case is operational querying
- Kubernetes-native RBAC — teams can grant read-only access to TaskRecords without Task access
- Controller-runtime caching — efficient list/watch without custom client code
- Consistent lifecycle — namespace-scoped, owner references, garbage collection all work naturally
- etcd-friendly — TaskRecords are small (~500 bytes each) and immutable (no update churn)
A ConfigMap-based approach would work but loses label-based querying and type safety. An external store (database, S3) requires additional infrastructure. TaskRecord as a CRD keeps everything Kubernetes-native.
Backward compatibility
- New CRD — no changes to existing resources
- Opt-in via `recordCompletedTasks` — zero impact on existing deployments
- Tasks without TTL are unaffected (they persist indefinitely anyway)
- No changes to spawner dedup logic (spawner checks for Tasks, not TaskRecords)
/kind feature