🤖 Kelos Strategist Agent @gjkim42
Area: New CRDs & API Extensions
Summary
When Tasks complete and TTL cleanup deletes them, all per-task data is permanently lost: the prompt used, outputs produced, results (branch, PR, cost), duration, work item context, and failure messages. Prometheus metrics (kelos_task_cost_usd_total, kelos_task_duration_seconds) capture aggregate counters and histograms but cannot answer per-task questions like "what did the agent produce for issue #42?" or "why did the last task for PR #17 fail?"
This proposal adds a lightweight TaskRecord CRD that the controller creates as an immutable completion snapshot just before TTL deletion, preserving queryable in-cluster task history.
Problem
1. TTL cleanup destroys all per-task data
The TTL deletion path (internal/controller/task_controller.go:148-159) unconditionally deletes the Task:
```go
if expired, requeueAfter := r.ttlExpired(&task); expired {
	logger.Info("Deleting Task due to TTL expiration", "task", task.Name)
	if err := r.Delete(ctx, &task); err != nil {
		// ...
	}
	return ctrl.Result{}, nil
}
```

Once deleted, the following data is gone forever:
- `status.results` — branch, commit, PR URL, cost-usd, input/output tokens
- `status.outputs` — raw agent output lines (between KELOS_OUTPUTS markers)
- `status.message` — failure reason for failed tasks
- `status.startTime` / `status.completionTime` — execution duration
- `spec.prompt` — what the agent was asked to do
- Labels/annotations — spawner name, source metadata (if #649, "Propagate work item metadata as labels and annotations on spawned Tasks", is implemented)
2. No way to do post-mortem analysis on completed tasks
Common operational questions that are currently unanswerable after TTL cleanup:
- "What was the success rate for spawner X over the past week?" — `kelos_task_completed_total` gives counts but no individual task details
- "Which PR did the agent open for issue #42 (Install build-essential in claude-code container)?" — lost after TTL
- "Why did the last 3 tasks from the cron spawner fail?" — failure messages lost
- "What was the average cost per task for Opus vs Sonnet last month?" — aggregate cost exists in Prometheus, but per-task cost/model pairing is lost
- "Show me all tasks that touched branch X" — impossible to reconstruct
3. The TTL dilemma forces a bad trade-off
Without TTL, tasks accumulate indefinitely — cluttering kubectl get tasks, consuming etcd storage, and blocking TaskSpawner from re-creating tasks for the same work item (the spawner deduplicates by checking for existing tasks with the same name).
With TTL, you get clean task recycling but lose all history. Teams must choose between operational hygiene and observability. This trade-off shouldn't be necessary.
4. Prerequisite for cost budget persistence
Issue #624 (maxCostUSD) proposes cumulative spend limits on TaskSpawner. The proposal acknowledges that TTL cleanup breaks cost tracking: "When ttlSecondsAfterFinished is set, completed Tasks are auto-deleted. This means the spawner loses visibility into historical costs." The proposed workaround is persisting totalCostUSD in spawner status. TaskRecord would provide a cleaner foundation — the spawner (or controller) can sum costs from TaskRecords instead of maintaining a running counter that's fragile to restarts and race conditions.
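As a sketch of how a spawner could recover cumulative spend from records: the helper below sums the `cost-usd` entries from a slice of `Results` maps, as they would appear in the proposed `TaskRecordSpec`. The function name `sumCostUSD` and the standalone shape are illustrative assumptions, not existing Kelos code.

```go
package main

import (
	"fmt"
	"strconv"
)

// sumCostUSD totals the "cost-usd" entries across TaskRecord Results maps.
// Records without a cost entry, or with an unparsable value, are skipped,
// so a single malformed record cannot poison the total.
func sumCostUSD(results []map[string]string) float64 {
	total := 0.0
	for _, r := range results {
		if v, ok := r["cost-usd"]; ok {
			if c, err := strconv.ParseFloat(v, 64); err == nil {
				total += c
			}
		}
	}
	return total
}

func main() {
	records := []map[string]string{
		{"cost-usd": "2.31", "pr": "github.com/org/repo/pull/87"},
		{"cost-usd": "0.85"},
		{"branch": "fix/42"}, // no cost entry: skipped
	}
	fmt.Printf("Total: $%.2f\n", sumCostUSD(records)) // Total: $3.16
}
```

Because each TaskRecord is immutable, re-running this sum after a controller restart yields the same answer, which is exactly the fragility the running-counter approach has to engineer around.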
5. Prometheus metrics are necessary but insufficient
The existing metrics provide excellent aggregate observability:
- `kelos_task_cost_usd_total` — cumulative cost by namespace/type/spawner/model
- `kelos_task_duration_seconds` — duration histogram
- `kelos_task_completed_total` — completion count by phase
But metrics are designed for dashboards and alerting, not for individual record lookup. You can't query "show me the details of the task that cost $12" from Prometheus. Metrics and records serve complementary purposes.
Proposed Design
New CRD: TaskRecord
```go
// TaskRecordSpec captures a snapshot of a completed Task.
type TaskRecordSpec struct {
	// TaskName is the original Task resource name.
	TaskName string `json:"taskName"`

	// SpawnerName is the TaskSpawner that created this task (empty for ad-hoc tasks).
	// +optional
	SpawnerName string `json:"spawnerName,omitempty"`

	// AgentType is the agent type used (claude-code, codex, etc.).
	AgentType string `json:"agentType"`

	// Model is the model used.
	// +optional
	Model string `json:"model,omitempty"`

	// Phase is the terminal phase (Succeeded or Failed).
	Phase TaskPhase `json:"phase"`

	// Message is the status message at completion.
	// +optional
	Message string `json:"message,omitempty"`

	// StartTime is when the Task started running.
	// +optional
	StartTime *metav1.Time `json:"startTime,omitempty"`

	// CompletionTime is when the Task completed.
	// +optional
	CompletionTime *metav1.Time `json:"completionTime,omitempty"`

	// Outputs contains URLs and references produced by the agent.
	// +optional
	Outputs []string `json:"outputs,omitempty"`

	// Results contains structured key-value outputs (branch, commit, pr, cost-usd, etc.).
	// +optional
	Results map[string]string `json:"results,omitempty"`

	// SourceLabels captures the task's labels at completion time for queryability.
	// +optional
	SourceLabels map[string]string `json:"sourceLabels,omitempty"`
}
```

Note: `spec.prompt` is intentionally excluded from TaskRecordSpec to keep records small. Prompts can be large (multi-KB), and the primary use case for TaskRecord is operational querying, not prompt replay. If prompt archival is needed, `onCompletion` hooks (#749) can push the full Task to an external system.
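For illustration, a TaskRecord produced under this spec might look like the following. All names and values are hypothetical, and the `kelos.dev/v1alpha1` group/version is assumed to match the existing API group:

```yaml
apiVersion: kelos.dev/v1alpha1   # assumed group/version
kind: TaskRecord
metadata:
  name: bug-fixer-42
  namespace: agents
  labels:
    kelos.dev/taskspawner: bug-fixer
    kelos.dev/agent-type: claude-code
    kelos.dev/phase: Succeeded
spec:
  taskName: bug-fixer-42
  spawnerName: bug-fixer
  agentType: claude-code
  model: opus
  phase: Succeeded
  startTime: "2025-06-01T10:00:00Z"
  completionTime: "2025-06-01T10:04:32Z"
  outputs:
    - https://github.com/org/repo/pull/87
  results:
    branch: fix/issue-42
    pr: https://github.com/org/repo/pull/87
    cost-usd: "2.31"
```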
CRD definition
```go
// +genclient
// +genclient:noStatus
// +kubebuilder:object:root=true
// +kubebuilder:printcolumn:name="Task",type=string,JSONPath=`.spec.taskName`
// +kubebuilder:printcolumn:name="Spawner",type=string,JSONPath=`.spec.spawnerName`
// +kubebuilder:printcolumn:name="Type",type=string,JSONPath=`.spec.agentType`
// +kubebuilder:printcolumn:name="Phase",type=string,JSONPath=`.spec.phase`
// +kubebuilder:printcolumn:name="Duration",type=string,JSONPath=`.metadata.annotations.kelos\.dev/duration`
// +kubebuilder:printcolumn:name="Cost",type=string,JSONPath=`.spec.results.cost-usd`,priority=1
// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`
```

Controller integration
The change is minimal — add record creation to the TTL deletion path in task_controller.go:
```go
if expired, requeueAfter := r.ttlExpired(&task); expired {
	// Create a TaskRecord before deleting the Task.
	if err := r.createTaskRecord(ctx, &task); err != nil {
		// Non-fatal: don't block TTL cleanup if record creation fails.
		logger.Error(err, "Failed to create TaskRecord, proceeding with deletion")
	}
	logger.Info("Deleting Task due to TTL expiration", "task", task.Name)
	if err := r.Delete(ctx, &task); err != nil {
		// ...
	}
}
```

```go
func (r *TaskReconciler) createTaskRecord(ctx context.Context, task *kelosv1alpha1.Task) error {
	// Store the wall-clock duration as an annotation so the Duration
	// printcolumn on the CRD has a value to display (requires the
	// standard library "time" package).
	annotations := map[string]string{}
	if task.Status.StartTime != nil && task.Status.CompletionTime != nil {
		d := task.Status.CompletionTime.Sub(task.Status.StartTime.Time).Round(time.Second)
		annotations["kelos.dev/duration"] = d.String()
	}
	record := &kelosv1alpha1.TaskRecord{
		ObjectMeta: metav1.ObjectMeta{
			Name:        task.Name,
			Namespace:   task.Namespace,
			Annotations: annotations,
			Labels: map[string]string{
				"kelos.dev/taskspawner": task.Labels["kelos.dev/taskspawner"],
				"kelos.dev/agent-type":  task.Spec.Type,
				"kelos.dev/phase":       string(task.Status.Phase),
			},
		},
		Spec: kelosv1alpha1.TaskRecordSpec{
			TaskName:       task.Name,
			SpawnerName:    task.Labels["kelos.dev/taskspawner"],
			AgentType:      task.Spec.Type,
			Model:          task.Spec.Model,
			Phase:          task.Status.Phase,
			Message:        task.Status.Message,
			StartTime:      task.Status.StartTime,
			CompletionTime: task.Status.CompletionTime,
			Outputs:        task.Status.Outputs,
			Results:        task.Status.Results,
			SourceLabels:   task.Labels,
		},
	}
	return r.Create(ctx, record)
}
```

Retention policy
TaskRecords need their own lifecycle management to avoid unbounded growth. Two options:
Option A (recommended): TTL on TaskRecord via annotation + external cleanup
Set a kelos.dev/expires-at annotation on the TaskRecord. A separate CronJob or the controller itself periodically cleans up expired records. Default retention: 30 days.
Option B: Controller-managed retention
Add spec.taskRecordRetention to TaskSpawner:
```yaml
spec:
  taskRecordRetention:
    maxAge: 720h    # 30 days
    maxCount: 1000  # Keep at most 1000 records per spawner
```

Opt-in behavior
To avoid surprise storage growth, TaskRecord creation should be opt-in initially:
```go
type TaskSpawnerSpec struct {
	// ...existing fields...

	// RecordCompletedTasks creates a TaskRecord snapshot for each completed
	// Task before TTL cleanup. Defaults to false.
	// +optional
	// +kubebuilder:default=false
	RecordCompletedTasks *bool `json:"recordCompletedTasks,omitempty"`
}
```

For ad-hoc tasks (created via `kelos run`), a corresponding field on TaskSpec:
```go
type TaskSpec struct {
	// ...existing fields...

	// RecordOnCompletion creates a TaskRecord snapshot when this Task
	// reaches a terminal phase. Defaults to false.
	// +optional
	RecordOnCompletion *bool `json:"recordOnCompletion,omitempty"`
}
```

Example usage
Querying task history
```shell
# All completed tasks from a spawner
kubectl get taskrecords -l kelos.dev/taskspawner=bug-fixer

# Failed tasks only
kubectl get taskrecords -l kelos.dev/phase=Failed

# Tasks by agent type
kubectl get taskrecords -l kelos.dev/agent-type=claude-code

# With cost details (priority=1 column)
kubectl get taskrecords -l kelos.dev/taskspawner=bug-fixer -o wide
```

Output:

```
NAME           TASK           SPAWNER    TYPE         PHASE      DURATION   AGE
bug-fixer-42   bug-fixer-42   bug-fixer  claude-code  Succeeded  4m32s      2d
bug-fixer-45   bug-fixer-45   bug-fixer  claude-code  Failed     1m15s      1d
bug-fixer-51   bug-fixer-51   bug-fixer  claude-code  Succeeded  6m08s      3h
```
Cost analysis across completed tasks
```shell
# Sum costs for a spawner (using kubectl + awk)
kubectl get taskrecords -l kelos.dev/taskspawner=bug-fixer \
  -o jsonpath='{range .items[*]}{.spec.results.cost-usd}{"\n"}{end}' | \
  awk '{sum += $1} END {printf "Total: $%.2f\n", sum}'
```

Integration with kelos CLI
kelos get taskrecords (or kelos history) could provide a user-friendly view:
```
$ kelos history --spawner bug-fixer --since 7d
TASK          PHASE      MODEL   COST    DURATION   PR                            AGE
bug-fixer-42  Succeeded  opus    $2.31   4m32s      github.com/org/repo/pull/87   2d
bug-fixer-45  Failed     opus    $0.85   1m15s      —                             1d
bug-fixer-51  Succeeded  sonnet  $0.42   6m08s      github.com/org/repo/pull/91   3h
------
Total: $3.58   3 tasks (2 succeeded, 1 failed)
```
Interaction with existing and proposed features
| Feature | Interaction |
|---|---|
| TTL cleanup | TaskRecord is created before TTL deletion — complementary by design |
| `maxCostUSD` (#624) | Spawner can sum `cost-usd` from TaskRecords instead of maintaining a fragile running counter in status |
| `onCompletion` hooks (#749) | Complementary — hooks push to external systems, TaskRecord persists in-cluster. Both can coexist |
| Work item metadata (#649) | TaskRecord captures task labels including source metadata labels, making history queryable by source |
| Prometheus metrics | Complementary — metrics for dashboards/alerting, TaskRecords for per-task drill-down |
| `kelos get task --detail` | TaskRecord preserves detail data that would otherwise be lost after TTL |
| v1alpha2 (#704) | TaskRecord is a new CRD, independent of the TaskSpawner v1alpha2 cleanup |
Implementation scope
- CRD definition: Add `TaskRecord` type to `api/v1alpha1/` (~60 lines)
- Controller change: Add `createTaskRecord` to the TTL deletion path (~30 lines)
- Opt-in field: Add `recordCompletedTasks` to TaskSpawnerSpec (~5 lines)
- CLI: Add `kelos get taskrecords` or `kelos history` command
- Retention: Add cleanup logic for expired TaskRecords
- Run `make update` for generated code
Total estimated production code: ~150 lines + generated code + tests.
Why a new CRD (not an annotation, ConfigMap, or external store)
- Queryable via kubectl label selectors — the primary use case is operational querying
- Kubernetes-native RBAC — teams can grant read-only access to TaskRecords without Task access
- Controller-runtime caching — efficient list/watch without custom client code
- Consistent lifecycle — namespace-scoped, owner references, garbage collection all work naturally
- etcd-friendly — TaskRecords are small (~500 bytes each) and immutable (no update churn)
A ConfigMap-based approach would work but loses label-based querying and type safety. An external store (database, S3) requires additional infrastructure. TaskRecord as a CRD keeps everything Kubernetes-native.
Backward compatibility
- New CRD — no changes to existing resources
- Opt-in via `recordCompletedTasks` — zero impact on existing deployments
- Tasks without TTL are unaffected (they persist indefinitely anyway)
- No changes to spawner dedup logic (spawner checks for Tasks, not TaskRecords)
/kind feature