2 changes: 1 addition & 1 deletion docker/docker-compose.tif-phase1.yml
@@ -1,6 +1,6 @@
services:
dakera:
- image: ${DAKERA_IMAGE:-ghcr.io/dakera-ai/dakera:0.11.81}
+ image: ${DAKERA_IMAGE:-ghcr.io/dakera-ai/dakera:0.11.90}
ports:
- "127.0.0.1:3200:3000"
- "127.0.0.1:51051:50051"
102 changes: 102 additions & 0 deletions examples/tif-provenance/README.md
@@ -0,0 +1,102 @@
# T-I-F Feedback Provenance Phase 2

This example validates Phase 2 of the Dakera T-I-F decision provenance RFC:

https://github.com/Dakera-AI/dakera-deploy/issues/161

Phase 1 proved that `metadata.reliability` survives store and recall and can
change agent-side decisions. Phase 2 tests the next maintainer-requested
question: can T-I-F scores be derived from real agent interaction signals and
used in a session-scoped decision trace?

## What This Tests

The example uses Dakera's public REST API only:

- `POST /v1/memory/store`
- `POST /v1/memory/recall`
- `POST /v1/memories/{memory_id}/feedback`
- `GET /v1/memories/{memory_id}/feedback`
- `POST /v1/sessions/start`
- `GET /v1/sessions/{session_id}/memories`
- `POST /v1/memories/{memory_id}/links`

The validation remains agent-side. Dakera stores memories, feedback, sessions,
and links. The local script computes T-I-F from feedback and stores a decision
trace under `metadata.decision_provenance`.

Dakera `v0.11.90` requires `agent_id` when submitting feedback, reading
feedback history, and creating memory links. The validator keeps those
requirements explicit instead of hiding them behind an SDK.

## Feedback-Derived T-I-F Rules

```text
upvote: t + 0.10, i - 0.03, f - 0.05
downvote: t - 0.10, i + 0.05, f + 0.15
flag: t - 0.05, i + 0.20, f + 0.10
```

Scores are clamped to `[0.0, 1.0]`.
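
A minimal sketch of how the deltas and the clamp could be applied. The names `FEEDBACK_DELTAS`, `clamp`, and `apply_feedback` are hypothetical, and two choices here are assumptions rather than the validator's confirmed behavior: feedback is applied on top of the seeded `metadata.reliability` values, and the clamp runs after every signal instead of once at the end.

```python
# Per-signal deltas copied from the rules above, as (t, i, f) tuples.
FEEDBACK_DELTAS = {
    "upvote": (0.10, -0.03, -0.05),
    "downvote": (-0.10, 0.05, 0.15),
    "flag": (-0.05, 0.20, 0.10),
}


def clamp(value: float) -> float:
    """Keep a score inside [0.0, 1.0]."""
    return max(0.0, min(1.0, value))


def apply_feedback(seed: dict, signals: list) -> dict:
    """Apply feedback signals in order to a seed T-I-F triple (assumed flow)."""
    t, i, f = seed["t"], seed["i"], seed["f"]
    for signal in signals:
        if signal not in FEEDBACK_DELTAS:
            # Report unsupported signals explicitly instead of a raw KeyError.
            raise ValueError(f"unsupported feedback signal: {signal!r}")
        dt, di, df = FEEDBACK_DELTAS[signal]
        t, i, f = clamp(t + dt), clamp(i + di), clamp(f + df)
    return {"t": t, "i": i, "f": f}
```

Under these assumptions, a seed of `t=0.38, i=0.20, f=0.34` followed by two `downvote` signals ends near `t=0.18, i=0.30, f=0.64`.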

Decision priority:

```text
f >= 0.50 -> surface_contradiction
i >= 0.50 -> ask_clarification
t >= 0.70 and i <= 0.35 and f <= 0.35 -> reuse_confidently
otherwise -> reuse_with_caveat
```

These thresholds are validation rules only. They are not proposed as Dakera
engine behavior.
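
The priority order can be sketched as a short chain of checks, where the first matching rule wins. The function name `decide` is hypothetical; the thresholds and action names are taken from the table above.

```python
def decide(t: float, i: float, f: float) -> str:
    """Map a T-I-F triple to an action, checking rules in priority order."""
    if f >= 0.50:
        return "surface_contradiction"
    if i >= 0.50:
        return "ask_clarification"
    if t >= 0.70 and i <= 0.35 and f <= 0.35:
        return "reuse_confidently"
    return "reuse_with_caveat"
```

Because falsity is checked first, a memory with both `f >= 0.50` and `i >= 0.50` is surfaced as a contradiction rather than sent for clarification.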

## Scenarios

The fixture covers three developer-recognizable workflows:

| Scenario | Purpose |
|---|---|
| `coding-assistant` | feedback corrects an obsolete endpoint decision |
| `research-agent` | weak-source feedback raises indeterminacy |
| `customer-support` | outdated policy is surfaced as contradiction evidence |

Each scenario records:

- baseline importance-only decision;
- feedback-derived T-I-F decision;
- decision trace memory;
- session ID;
- linked evidence memory IDs;
- associated recall proof.

## Start Dakera

The shared T-I-F compose file defaults to Dakera `v0.11.90`, binds to
`127.0.0.1`, and disables auth only for local validation. Do not run it on a
shared or internet-facing host.

```bash
docker compose -f docker/docker-compose.tif-phase1.yml up -d
```

Stop:

```bash
docker compose -f docker/docker-compose.tif-phase1.yml down
```

## Run Self-Test

```bash
python examples/tif-provenance/validate_tif_provenance.py --self-test
```

## Run Runtime Validation

```bash
python examples/tif-provenance/validate_tif_provenance.py --api http://localhost:3200 --request-timeout 240
```

The script fails if feedback history, session trace storage, or associated
recall proof is missing.
152 changes: 152 additions & 0 deletions examples/tif-provenance/VALIDATION_RESULTS.md
@@ -0,0 +1,152 @@
# Phase 2 Validation Results

Date: 2026-06-12 17:07:47 -04:00

Status: passed local runtime validation.

## Target Runtime

```text
Dakera image: ghcr.io/dakera-ai/dakera:0.11.90
REST: http://127.0.0.1:3200
gRPC: 127.0.0.1:51051
Storage: in-memory
Auth: disabled for local validation only
```

The validation compose binds ports to localhost only.

## Commands

```powershell
python -m py_compile examples\tif-provenance\validate_tif_provenance.py
python examples\tif-provenance\validate_tif_provenance.py --self-test
docker compose -f docker\docker-compose.tif-phase1.yml down
docker compose -f docker\docker-compose.tif-phase1.yml up -d
python examples\tif-provenance\validate_tif_provenance.py --api http://localhost:3200 --request-timeout 240
docker compose -f docker\docker-compose.tif-phase1.yml down
```

## Acceptance Criteria

- all three scenarios pass;
- feedback endpoints accept `upvote`, `downvote`, and `flag`;
- feedback history is readable;
- feedback-derived T-I-F changes at least one decision per scenario;
- decision trace memory is stored with `metadata.decision_provenance`;
- session memories include the trace and evidence memories;
- associated recall returns linked evidence or contradiction memories;
- no engine code is modified;
- no first-class recall filters are added.

## Result Summary

All three scenarios passed against Dakera `0.11.90`.

Runtime health reported:

```json
{
"ready": true,
"version": "0.11.90",
"checks": {
"embedding_engine": "ok",
"storage": "ok",
"tiered_engine": "disabled"
}
}
```

Scenario outcomes:

| Scenario | Baseline action | Feedback-derived T-I-F action | Decision changed | Session proof | Associated recall proof |
| --- | --- | --- | --- | --- | --- |
| coding-assistant | `reuse_top_memory` | `surface_contradiction` | yes | yes | yes |
| research-agent | `reuse_top_memory` | `ask_clarification` | yes | yes | yes |
| customer-support | `reuse_top_memory` | `surface_contradiction` | yes | yes | yes |

The runtime accepted the feedback signals `upvote`, `downvote`, and `flag`, and feedback history was readable for every seeded memory. Each scenario stored a decision trace with `metadata.decision_provenance`, and the session memory listing included the trace and evidence memories. Recalling the decision trace with `include_associated=true` and `associated_memories_depth=1` returned the linked evidence memories.

Runtime contract notes observed on Dakera `0.11.90`:

- `POST /v1/sessions/start` returns the session id as `session.id`.
- `POST /v1/memories/{memory_id}/feedback` requires `agent_id`.
- `GET /v1/memories/{memory_id}/feedback` requires `agent_id` as a query parameter.
- `POST /v1/memories/{memory_id}/links` requires `agent_id`.
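
A hedged sketch of how a client might honor these contract notes with only the Python standard library. The helper names are hypothetical. The JSON field name `signal` and the placement of `agent_id` in the POST body are assumptions, since the notes above only state that `agent_id` is required. The functions build requests without sending them.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

API = "http://127.0.0.1:3200"  # local validation endpoint from the compose file


def feedback_post(memory_id: str, agent_id: str, signal: str) -> Request:
    """Build the feedback POST; body field names are assumed, not confirmed."""
    body = json.dumps({"agent_id": agent_id, "signal": signal}).encode("utf-8")
    return Request(
        f"{API}/v1/memories/{quote(memory_id, safe='')}/feedback",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def feedback_history_get(memory_id: str, agent_id: str) -> Request:
    """Build the history GET with agent_id passed as a query parameter."""
    query = urlencode({"agent_id": agent_id})
    return Request(
        f"{API}/v1/memories/{quote(memory_id, safe='')}/feedback?{query}",
        method="GET",
    )


def session_id_from(start_response: dict) -> str:
    """Read the session id from the nested `session.id` field."""
    return start_response["session"]["id"]
```

Passing a built request to `urllib.request.urlopen` would send it; that step is omitted here so the sketch stays independent of a running Dakera instance.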

No engine code was modified. No first-class recall filters were added.

## Review Correction Rerun

Date: 2026-06-12 17:20:58 -04:00

Corrections after fork review:

- healthcheck now requires `ready: true` before runtime validation proceeds;
- unsupported feedback signals now produce a clear validation error instead of a raw `KeyError`;
- Phase 1 recall normalization was reviewed and already handles list, dict, and nested `memory` response shapes.
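
For illustration, normalization over those three shapes might look like the sketch below. The function name `normalize_recall` and the wrapper keys `memories` and `results` are assumptions; only the list, dict, and nested `memory` shapes are mentioned in the notes above.

```python
def normalize_recall(payload) -> list:
    """Flatten a recall response into a plain list of memory dicts."""
    if isinstance(payload, list):
        items = payload
    elif isinstance(payload, dict):
        # Wrapper key names are guesses for this sketch.
        items = payload.get("memories") or payload.get("results") or []
    else:
        raise ValueError(f"unexpected recall response type: {type(payload).__name__}")

    normalized = []
    for item in items:
        if not isinstance(item, dict):
            continue  # skip entries that are not memory objects
        if isinstance(item.get("memory"), dict):
            # Nested shape: the memory sits under a "memory" key.
            normalized.append(dict(item["memory"]))
        else:
            normalized.append(item)
    return normalized
```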

Rerun commands:

```powershell
python -m py_compile examples\tif-provenance\validate_tif_provenance.py examples\tif-reliability\validate_tif_reliability.py
python examples\tif-provenance\validate_tif_provenance.py --self-test
python examples\tif-reliability\validate_tif_reliability.py --self-test
docker compose -f docker\docker-compose.tif-phase1.yml down
docker compose -f docker\docker-compose.tif-phase1.yml up -d
python examples\tif-provenance\validate_tif_provenance.py --api http://localhost:3200 --request-timeout 240
docker compose -f docker\docker-compose.tif-phase1.yml down
```

Result: passed.

## Codex Review Correction Rerun

Date: 2026-06-12 18:02:19 -04:00

Additional Codex review findings corrected:

- runtime decisions now use the normalized `/v1/memory/recall` response for each scenario query before choosing the baseline and feedback-aware memory;
- each scenario records `scenario_recall_proof` and the recalled fixture/runtime memory IDs;
- associated recall proof now verifies that every linked evidence memory appears in the full associated recall response and reports `associated_recall_missing_ids`;
- runtime validation was rerun with PowerShell preserving the validator exit code before Docker cleanup.
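
The stricter associated recall check amounts to a membership test over the linked IDs. The sketch below uses a hypothetical function name; the two report keys match the fields named in these notes. Treating an empty list of linked IDs as a failed proof is an assumption.

```python
def associated_recall_check(linked_ids: list, recalled_ids: list) -> dict:
    """Report which linked evidence memory IDs are absent from the recall."""
    recalled = set(recalled_ids)
    missing = [memory_id for memory_id in linked_ids if memory_id not in recalled]
    return {
        "associated_recall_missing_ids": missing,
        # Assumption: a scenario with no linked evidence cannot prove anything.
        "associated_recall_proof": bool(linked_ids) and not missing,
    }
```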

Rerun commands:

```powershell
python -m py_compile examples\tif-provenance\validate_tif_provenance.py examples\tif-reliability\validate_tif_reliability.py
python examples\tif-provenance\validate_tif_provenance.py --self-test
python examples\tif-reliability\validate_tif_reliability.py --self-test
docker compose -f docker\docker-compose.tif-phase1.yml down
docker compose -f docker\docker-compose.tif-phase1.yml up -d
python examples\tif-provenance\validate_tif_provenance.py --api http://localhost:3200 --request-timeout 240
$validationExit = $LASTEXITCODE
docker compose -f docker\docker-compose.tif-phase1.yml down
exit $validationExit
```

Result: passed. All three scenarios returned `scenario_recall_proof: true`, `associated_recall_missing_ids: []`, `associated_recall_proof: true`, and `passed: true`.

## Second Review Correction Rerun

Date: 2026-06-12 17:40:38 -04:00

Additional Qodo findings corrected:

- runtime `changed_decision` now mirrors the self-test logic and treats same-memory `reuse_confidently` as unchanged reuse;
- runtime memory metadata is deep-copied before adding derived reliability, and malformed or missing `metadata.reliability` now fails with a clear validation error;
- associated recall keeps a single read-only retry to tolerate cold reranker startup without retrying mutating endpoints.
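
A single retry for read-only calls could be sketched as follows. The function name, the delay, and the choice to catch `OSError` (the base class of `urllib.error.URLError` and `TimeoutError`) are assumptions. The wrapper is meant only for idempotent reads such as recall, never for store, feedback, or link requests.

```python
import time


def call_with_single_retry(read_only_call, delay_seconds: float = 2.0, sleep=time.sleep):
    """Run a read-only callable, retrying exactly once after a network error."""
    try:
        return read_only_call()
    except OSError:
        # One pause, then one more attempt; a second failure propagates.
        sleep(delay_seconds)
        return read_only_call()
```

Injecting `sleep` keeps the helper testable without real waiting.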

Rerun commands:

```powershell
python -m py_compile examples\tif-provenance\validate_tif_provenance.py examples\tif-reliability\validate_tif_reliability.py
python examples\tif-provenance\validate_tif_provenance.py --self-test
python examples\tif-reliability\validate_tif_reliability.py --self-test
docker compose -f docker\docker-compose.tif-phase1.yml down
docker compose -f docker\docker-compose.tif-phase1.yml up -d
python examples\tif-provenance\validate_tif_provenance.py --api http://localhost:3200 --request-timeout 240
docker compose -f docker\docker-compose.tif-phase1.yml down
```

Result: passed.
128 changes: 128 additions & 0 deletions examples/tif-provenance/phase2_scenarios.json
@@ -0,0 +1,128 @@
{
"agent_id": "dakera-tif-phase2",
"scenarios": [
{
"id": "coding-assistant",
"title": "Coding assistant review correction",
"query": "Which Dakera REST endpoint should the coding assistant use for storing memory with reliability metadata?",
"expected_action": "surface_contradiction",
"expected_changed_decision": true,
"expected_direct_memory": "coding-obsolete-endpoint",
"expected_safe_memory": "coding-current-endpoint",
"memories": [
{
"id": "coding-current-endpoint",
"content": "Dakera memory store examples should use POST /v1/memory/store for the current public REST API.",
"importance": 0.84,
"feedback": ["upvote"],
"metadata": {
"reliability": {
"t": 0.66,
"i": 0.14,
"f": 0.10,
"basis": "Phase 1 runtime validation and maintainer review",
"source": "phase2_seed"
}
}
},
{
"id": "coding-obsolete-endpoint",
"content": "Dakera examples should use POST /v1/memories when storing agent memories.",
"importance": 0.93,
"feedback": ["downvote", "downvote"],
"metadata": {
"reliability": {
"t": 0.38,
"i": 0.20,
"f": 0.34,
"basis": "obsolete quickstart assumption superseded by current API behavior",
"source": "phase2_seed"
}
}
}
]
},
{
"id": "research-agent",
"title": "Research agent source conflict",
"query": "Should the research agent cite an unsupported secondary note as confirmed evidence?",
"expected_action": "ask_clarification",
"expected_changed_decision": true,
"expected_direct_memory": "research-uncertain-source",
"expected_safe_memory": "research-source-backed",
"memories": [
{
"id": "research-source-backed",
"content": "A research agent should prefer source-backed claims and cite the primary evidence when summarizing technical decisions.",
"importance": 0.80,
"feedback": ["upvote"],
"metadata": {
"reliability": {
"t": 0.68,
"i": 0.16,
"f": 0.08,
"basis": "primary-source research discipline",
"source": "phase2_seed"
}
}
},
{
"id": "research-uncertain-source",
"content": "A research agent can treat an uncited secondary note as confirmed evidence when it sounds plausible.",
"importance": 0.92,
"feedback": ["flag", "flag"],
"metadata": {
"reliability": {
"t": 0.44,
"i": 0.18,
"f": 0.18,
"basis": "weak-source pattern flagged during review",
"source": "phase2_seed"
}
}
}
]
},
{
"id": "customer-support",
"title": "Customer support outdated policy",
"query": "Which customer support policy should the agent reuse when an old process conflicts with the current escalation rule?",
"expected_action": "surface_contradiction",
"expected_changed_decision": true,
"expected_direct_memory": "support-outdated-policy",
"expected_safe_memory": "support-current-policy",
"memories": [
{
"id": "support-current-policy",
"content": "Customer support agents should follow the current escalation policy and ask for verification when a prior policy conflicts.",
"importance": 0.83,
"feedback": ["upvote"],
"metadata": {
"reliability": {
"t": 0.67,
"i": 0.12,
"f": 0.07,
"basis": "current support process",
"source": "phase2_seed"
}
}
},
{
"id": "support-outdated-policy",
"content": "Customer support agents should always use the old refund process without checking for newer escalation rules.",
"importance": 0.91,
"feedback": ["downvote", "flag"],
"metadata": {
"reliability": {
"t": 0.42,
"i": 0.18,
"f": 0.30,
"basis": "outdated policy deliberately retained as contradiction evidence",
"source": "phase2_seed"
}
}
}
]
}
]
}