Skip to content

Latest commit

 

History

History
337 lines (271 loc) · 15.3 KB

File metadata and controls

337 lines (271 loc) · 15.3 KB

Trace — Groundedness & Phantom-Hallucination Scoring

Score LLM outputs for groundedness against retrieval context (RAG lane), phantom hallucinations in agentic code turns (code lane), or aggregate N per-turn outputs into a session scoreboard (rollup lane).

The three lanes share one RunPod pod; the gateway pins the lane server-side by URL so you cannot cross-wire a code payload into the RAG endpoint.

from latence import Latence
client = Latence(api_key="lat_xxx")

# RAG groundedness
r = client.experimental.trace.rag(
    response_text="Paris is the capital of France.",
    raw_context="France's capital city is Paris.",
    runtime_head_features={"v1_score": 0.99},  # optional TRACE v2 head features
)

decision = r.get("runtime_decision")
if decision:
    print(decision["action"], decision.get("head_id"), decision.get("head_score"))
print(r.score, r.band, r.context_coverage_ratio)

# Agentic-code phantom scoring
t = client.experimental.trace.code(
    response_text="def add(a, b): return a + b",
    raw_context="# utils.py\ndef sub(a, b): return a - b",
    response_language_hint="python",
)
print(t.band, t.code_lane, t.session_signals.recommendation)

# Stateless session rollup (CPU-only, sub-ms on the pod)
rollup = client.experimental.trace.rollup(turns=[t, t])
print(rollup.noise_pct, rollup.recommendations)

# Stateful session for long-running agents
session = client.experimental.trace.sessions.create(kind="code")
session.event(
    "Opened src/cache.py and found CacheClient.get_many.",
    event_type="file_read",
)
turn = session.code(
    response_text="I will update CacheClient.get_many in src/cache.py.",
    raw_context="class CacheClient:\n    def get_many(self, keys): ...",
    response_language_hint="python",
)
print(turn["trace_response"]["band"], turn["hot_context"][:200])

Note: Direct service APIs live under client.experimental.*. For production pipelines that orchestrate multiple services, use client.pipeline.

The basic rag(), code(), and rollup() APIs remain first-class for quick tests, notebooks, evals, and custom integrations. Stateful sessions are an additive layer for long-running agents that need continuous memory.

Which lane should I use?

  • rag() — you have a retrieval-augmented answer (response + the context you retrieved) and want to know whether the model actually used the context or hallucinated. Returns a single score, a risk band, and per-context-chunk coverage.
  • code() — your agent is generating or editing code across multiple turns. Detects phantom APIs (functions/classes that don't exist in the provided context) and drift over turns via the opaque next_session_state handoff.
  • rollup() — you already have N per-turn outputs from rag() or code() and want a single session scoreboard (noise_pct, retrieval_waste_pct, reason-code histogram, risk trail). Stateless, sub-ms, safe on every keystroke.
  • sessions.create() — you want TRACE and InfiniMem to act as a continuous stateful object across many agent steps. Each event updates memory immediately and each score call loads and saves session state.

Stateful Sessions

Use sessions for Cursor/Claude Code/Codex/OpenCode-style coding agents, LangGraph/LangChain agents, RAG agents, CrewAI/n8n workflows, or any runtime where context should stay bounded without waiting for an emergency summary.

session = client.experimental.trace.sessions.create(
    kind="rag",
    metadata={"app": "support-agent"},
)

session.event(
    "Retrieved policy doc section 14.2 and customer plan metadata.",
    event_type="retrieval_result",
    metadata={"source": "kb-search"},
)

result = session.rag(
    query_text="Can this customer export audit logs?",
    response_text="Yes, audit log export is available on Enterprise plans.",
    raw_context="Enterprise plans include audit log export in section 14.2.",
    force_original_on_trigger=True,
)

context = session.context()
repair = session.repair(
    missing_terms=["section 14.2"],
    reason="force-check original evidence before final response",
)
rollup = session.rollup()

Session helpers:

Method Description
session.event(...) Append user/tool/file/retrieval/error/decision events and update InfiniMem.
session.code(...) Score a coding-agent turn with server-managed TRACE state and memory.
session.rag(...) Score a RAG/general-agent turn with server-managed memory.
session.context() Fetch bounded hot context and memory counters.
session.source(source_id) Fetch a redacted immutable source-vault record by pointer.
session.repair(...) Build a bounded repair packet from original source history.
session.rollup() Roll up stored scored turns without resending all turns.
session.close() Close the session by retention policy.

Methods

client.experimental.trace.rag()

Score a response for groundedness against retrieval context.

At least one of raw_context, chunk_ids, or support_units must be supplied — the SDK enforces this client-side before the HTTP round-trip.

Parameter Type Default Description
response_text str required Generated response text to score.
query_text str | None None Optional query for query-conditioned diagnostics.
raw_context str | None None Raw context string to segment and encode on demand.
chunk_ids list[str | int] | None None External chunk ids whose stored support vectors to reuse (fast path).
support_units list[SupportUnitInput | dict] | None None Structured premise lane (mutually exclusive with chunk_ids / raw_context).
primary_metric "reverse_context" | "triangular" "reverse_context" (server default) Headline scalar metric.
evidence_limit int 8 (server default) Maximum top evidence links in the sparse response (1–128).
coverage_threshold float 0.5 (server default) Per-unit reverse-context threshold (0.0–1.0).
segmentation_mode "sentence" | "sentence_packed" | "paragraph" "sentence_packed" (server default) How raw_context is segmented.
attribution_mode "closed_book" | "open_domain" "closed_book" (server default) Evidence policy.
include_triangular_diagnostics bool True (server default) Include query-conditioned diagnostics.
heatmap_format "none" | "data" | "html" "data" Heatmap surface. html also returns a self-contained <div>.
verification_samples list[str] | None None Alternate responses for semantic-entropy fusion.
content_type str | None None Structured-source hint (e.g. application/json).
risk_band_stratum str | None None Failure-mode hint for the calibrated risk-band classifier.
model str | None None Optional encoder override.
session_id str | None None Opaque, hashed session identifier (do NOT put user text here).
request_id str | None None Optional tracking ID.
verbose bool False Return full diagnostics instead of the compact summary.
profile "standard" | "quality" | None None Hosted Trace profile. Omitted/standard bills $0.008/request; explicit quality bills $0.016/request.
return_job bool False Return JobSubmittedResponse for async polling.

Pricing: $0.008 per request for omitted/standard; $0.016 per request for explicit profile="quality".

Client-side validation (before HTTP):

  • ValueError("response_text must be a non-empty string.") — raised when response_text is empty or whitespace-only.
  • ValueError("Trace scoring requires at least one of: raw_context, chunk_ids, or support_units.") — raised when no premise lane is supplied.

client.experimental.trace.code()

Superset of rag(), plus three code-lane-only fields:

Parameter Type Default Description
response_language_hint "python" | "py" | "typescript" | "ts" | "tsx" | "javascript" | "js" | "jsx" | "go" | "golang" | "rust" | "rs" | None None Hint for the AST extractor.
emit_chunk_ownership bool False (server default) Return per-unit ownership table (adds a few KB over the wire).
session_state SessionState | dict | None None Echo the previous turn's next_session_state verbatim to chain turns.

Pricing: same hosted Trace profile pricing as rag(): $0.008 per request for omitted/standard; $0.016 per request for explicit profile="quality".

Client-side validation (before HTTP): same as rag() above — response_text must be non-empty and at least one premise lane (raw_context, chunk_ids, or support_units) must be supplied.

client.experimental.trace.rollup()

Aggregate N per-turn outputs into a session scoreboard. Stateless, CPU-only, sub-ms on the pod. Safe to call on every keystroke.

Parameter Type Default Description
turns list[TraceResponse | dict] required Ordered per-turn outputs (response objects from .rag() / .code() work directly).
session_id str | None None Echoed on the response.
heatmap_format "none" | "data" | "html" "data" (server default) Session-level heatmap surface.
request_id str | None None Optional tracking ID.

Pricing: $0.001 flat per request.

Client-side validation (before HTTP):

  • ValueError("turns must be a non-empty list.") — raised when turns is empty.

Responses

TraceRagResponse / TraceCodeResponse

Shared fields (both lanes):

Field Type Description
score float | None Top-level groundedness score (0–1).
primary_metric str | None Metric backing score. Typically "reverse_context" or "triangular", but the pod may echo a derived name (e.g. "groundedness_v2"); do not pin this to an enum on the response side.
band str | None Risk band (green / amber / red / unknown).
structured_score dict | None Per-component score breakdown.
nli_aggregate float | None Aggregate NLI entailment score.
context_coverage_ratio float | None Fraction of the response grounded in context.
context_usage_ratio float | None Fraction of context actually used.
context_unused_ratio float | None Fraction of context left unused.
context_uncertain_ratio float | None Fraction with uncertain grounding.
support_units_usage SupportUnitsUsageSummary | None Aggregate per-state counts.
support_units list[SupportUnitUsage] | None Per-unit verdicts.
latency_ms float | None Pod-side scoring latency.
scoring_mode "rag" | "code" | None Lane echoed back by the pod.
session_id str | None Echoed session id.
version str | None Pod handler version tag.
reason str | None Human-readable reason for the band.
warnings list[str] | None Non-fatal warnings.
file_attribution FileAttribution | None Per-file owner share + reason codes.
heatmap Heatmap | None Structured heatmap payload.
heatmap_html str | None Self-contained <div> (only when heatmap_format="html").

Code-lane-only additions on TraceCodeResponse:

Field Type Description
code_lane CodeLaneDiagnostics | None Composite / AST / NLI diagnostics for the turn.
next_session_state SessionState | None Opaque state to pass into the next code() call.
session_signals SessionSignals | None EMA groundedness, drift, phantom rate, recommendation.

TraceRollupResponse

Field Type Description
turns_processed int | None Number of per-turn outputs aggregated.
noise_pct float | None Fraction of turns flagged as noise.
model_drift_pct float | None Fraction of turns with model drift.
retrieval_waste_pct float | None Fraction of retrieved context left unused.
reason_code_histogram dict[str, int] | None Count of each reason code over the window.
recommendations list[str] | None Session-level recommendations.
risk_band_trail list[str] | None Risk band per turn, chronological.
drift_trend dict[str, Any] | None Drift trajectory summary ({"last", "max", "mean", "min", ...}).
top_dead_files list | None Files consistently marked dead-weight.
heatmap / heatmap_html Session-level heatmap payloads.
session_id str | None Echoed session id.

Multi-turn session chaining (code lane)

Pass the previous response's next_session_state into the next call via session_state=prev.next_session_state. The payload is fully opaque to the client — just round-trip it.

trace = client.experimental.trace

turn1 = trace.code(
    response_text="def add(a, b): return a + b",
    raw_context="# utils.py\ndef sub(a, b): return a - b",
    response_language_hint="python",
)

turn2 = trace.code(
    response_text="def mul(a, b): return a * b",
    raw_context="# utils.py\ndef sub(a, b): return a - b",
    response_language_hint="python",
    session_state=turn1.next_session_state,  # <- chain
)

print(turn2.session_signals.ema_groundedness)
print(turn2.session_signals.recommendation)  # "continue" / "re_anchor" / "fresh_chat"

Live session scoreboard (rollup)

turns = [turn1, turn2, turn3]  # mix TraceRagResponse / TraceCodeResponse / dicts
rollup = client.experimental.trace.rollup(turns=turns, session_id="sess_abc")

print(f"noise:     {rollup.noise_pct:.0%}")
print(f"drift:     {rollup.model_drift_pct:.0%}")
print(f"waste:     {rollup.retrieval_waste_pct:.0%}")
print(f"top reason codes: {rollup.reason_code_histogram}")
print(f"risk trail: {rollup.risk_band_trail}")
print(f"recommend:  {rollup.recommendations}")

Using SupportUnitInput

Structured premises with per-unit provenance (speaker, timestamp, source id) that the scorer propagates into support_units_usage on the response.

from latence import SupportUnitInput

units = [
    SupportUnitInput(text="Paris is the capital of France.", source_id="doc-42"),
    SupportUnitInput(text="It is located on the Seine.",    source_id="doc-42"),
    {"text": "Population: 2.1M.", "source_id": "wiki"},
]

r = client.experimental.trace.rag(
    response_text="Paris, France's capital, sits on the Seine.",
    support_units=units,
)

for unit in (r.support_units or []):
    print(unit.source_id, unit.usage_state, unit.coverage_score)

Async

from latence import AsyncLatence

async with AsyncLatence(api_key="lat_xxx") as client:
    r = await client.experimental.trace.rag(
        response_text="Paris is the capital of France.",
        raw_context="France's capital city is Paris.",
    )

Background jobs

job = client.experimental.trace.rag(
    response_text="Paris is the capital of France.",
    raw_context="France's capital city is Paris.",
    return_job=True,
)
print(job.job_id)
result = client.jobs.wait(job.job_id)