Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
242 changes: 242 additions & 0 deletions rfcs/0000-trace-views/0000-trace-views.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,242 @@
start_date: 2026-04-10
mlflow_issue: [22499](https://github.com/mlflow/mlflow/issues/22499)
rfc_pr: # leave this empty
author(s): Forrest Murray (forrest.murray@databricks.com)

# Summary

Trace views are named, reusable configurations that filter and label MLflow traces. Each view defines an ordered set of **ranges** — labeled segments of a trace, each matching one or more spans with optional JSONPath extraction for inputs and outputs. Views make traces readable for the people consuming them — SMEs labeling, developers debugging, judges scoring — without altering the underlying trace data.

The default creation path is AI: a built-in skill analyzes a trace and proposes a view. Developers refine in the in-UI editor, create from scratch by selecting spans directly in the timeline, or build views programmatically via the Python API and CLI.

# Motivation

MLflow traces capture everything an agent did: LLM calls, tool invocations, retriever lookups, internal orchestration, HTTP requests, embeddings. For the developer debugging an agent, that level of detail is essential. For everyone else — SMEs labeling traces, judges scoring trajectories, PMs reviewing behavior — most of it is noise that drowns out the few decisions and outputs that actually matter.

Today, teams work around this with bespoke pipelines:

- **Exporting traces to spreadsheets.** Teams export to Excel for clinicians who aren't comfortable in the MLflow UI.
- **Building custom annotation UIs.** Teams build Streamlit tools because they deem the MLflow UI insufficient for their product managers and designers. Some build their own annotation app.
- **Picking up competing tools.** Anecdotally tools like Braintrust make it easier for nontechnical folks to provide feedback.
- **Giving SMEs raw traces and hoping.** Raw traces can cause more annotator fatigue.

Every workaround is bespoke, breaks when the agent changes, and ships nothing reusable. Trace views give the trace UI a first-class way to focus on what matters for a given task — annotation, debugging, judging — without altering the trace itself, and a way to share that focus with others.

# Critical user journeys

Four journeys ground the design. Each describes a user, the pain they hit today, and how trace views change the experience. UI screenshots are referenced inline.

## CUJ 1: SME reviewing a trace for labeling

**Who:** A subject-matter expert (clinician, compliance reviewer, support manager) reviewing an agent trace from a labeling session.

**Today:** She opens a trace and sees 80+ spans in a deeply nested tree, JSON inputs and outputs at every node, technical span names ("ChatCompletion", "EmbeddingRequest", "VectorSearch"). She has no idea what to focus on. She labels noisily, or she gives up and the session goes back to the developer.

**With trace views:** When she opens the trace, the view selector is already set to a view named after the agent's task (an experiment-scoped template, or an AI-generated trace-scoped view). The left timeline still shows the full nested tree — orientation preserved — but matching spans are highlighted by range color and others are visually dimmed. The right pane shows a small number of **range cards**, one per decision the agent made, each with:

- A one-line label
- A description (with optional `[text](spans/{id})` deeplinks to specific spans)
- The extracted input (when `input_path` is set)
- The extracted output (when `output_path` is set)

![View active with range summary](images/03-view-active-range-summary.png)

She reads top to bottom and labels each range. Clicking a range card opens a **range detail** view that shows the full inputs and outputs of the matched span(s), in case she needs more context. The summary can link to both the labeled version of the trace detail, as well as to specific spans as "deeplinks.

This is the highest-stakes user surface. Every customer pain quoted in *Motivation* maps to this CUJ: nontechnical users opening complex traces, getting overwhelmed, and producing bad labels (or leaving the platform).

![Range detail](images/edit-mode-detail.png)

## CUJ 2: Developer preparing traces for SME review

**Who:** A developer setting up a labeling session for a batch of traces from a new agent build.

**Today:** She runs the session with raw traces. SMEs ask her in Slack what to focus on. She writes a doc; SMEs ignore it. She gives up and labels them herself, or the session ships poor labels and eval datasets degrade.

**With trace views:** She runs the built-in trace-view skill — `trace.summarize()` followed by `summary.create_view()` — over the batch. An LLM analyzes each trace, identifies the agent's milestones (e.g., "Plan → Search → Synthesize"), and persists a `TraceView` with one `SpanRange` per milestone. An alternative approach would be to expose skills only and let users drive them with their own agent or the assistant. MLflow doesn't have an OOB agent harness yet so this is likely to be less complex than exposing a summarize method on the trace itself. In testing I found that a single prompt could produce decent milestones for traces from various agents.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the alternative approach more. To give a context, since MLflow is self-hosted, running built-in harness within MLflow requires users to configure api key and model. This has been a major dropout point for similar AI-driven products we released e.g. issue detection.

With the support of custom view and good apis/skills, I think coding agents can do a decent job.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For development of my draft PR I used only MLflow assistant + Claude code and it did work well.

I think there are a couple issues with this purely:

  1. Discoverability: Relying on the assistant would mean it's only discoverable via docs or LSP. We could remedy this by adding a button to open assistant with a pre-filled summarize prompt, but that assumes the user has already set up assistant.
  2. DX: essentially there's very little control on how to apply this over an experiment-level batch. It's fairly reasonable to say "create milestone trace views for all traces in the experiment", but then claude will likely take a while to figure out how to do this and there's no observability over the process vs.
traces = search_traces()
for t in traces:
    try:
        summary = t.summarize()
        t.create_view(summary)
   except:
        log(f'summary for {t.trace_id} failed')
  1. Differentiation: Other platforms don't provide this programmatically via sdk, mostly they expose it in their assistants. This affords developers the ability to build it in to their applications which meets teams where they already are.


She opens one of the traces in the UI to review the AI's output. It's close, but she wants to rename "Step 2: Search" to "Knowledge lookup," add a second span the AI missed to the planning range, and pick a different JSONPath for the synthesis output.

![Edit mode with drag-to-select](images/06-drag-to-select.gif)

She clicks **Edit**, drag-selects the missing span to add it to the planning range, picks the right output field via the `JsonFieldSelector` checkbox tree, and saves.

![JSON field selector](images/edit-mode-detail.png)

![Output path selection](images/07-output-path-selection.gif)

It's possible that the traces will have a consistent structure which generalizes across all traces in the experiment. In which case it should be easy to apply to all traces.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is also possible the traces within experiment does not have very consistent structure. It is common that the same agent has different order and number of llm, tool, retriever calls. How does this RFC address it?


This is the developer-as-curator CUJ. It anchors the **"AI creates, developer edits"** creation model: AI does the first pass, the developer corrects rather than authoring from blank.

## CUJ 3: Developer debugging an agent failure

**Who:** The agent author, narrowing in on why a specific trace failed.

**Today:** She opens the trace, finds the error span six levels deep in the timeline, expands its parents to read inputs, copies the trace ID and a list of relevant span IDs into a Slack message, and pings a colleague. The colleague opens the link and re-navigates the same path.

**With trace views:** She opens the trace, clicks **+ Create view** in the selector, drag-selects the failure span plus the two parent spans that fed it, names the view "Tool X failure: bad query," and saves. She copies the trace URL — now deeplinked to her view — and shares it. The colleague opens the link, sees only the relevant three spans labeled and explained, and gives feedback in 30 seconds.

![Edit toolbar](images/edit-mode-detail.png)

This is the **direct-UI-creation** CUJ. No AI, no Python, no JSON editing. Drag, name, save, share.

## CUJ 4: Judge author scoring against a view

**Who:** An evals author building a judge that should only score one phase of agent behavior — for example, "did the planning step decompose the task correctly?" Alternatively when writing an evaluation of the agent's trajectory an example grading criteria would be: "did the trajectory follow or deviate from the initial plan?".

**Today:** She copies trace JSON into the judge prompt template. The prompt blows past the context budget, or it includes irrelevant tool-call noise that confuses the judge. Either way the judge is brittle: when the agent's span structure changes, the manual extraction breaks.

**With trace views:** She defines an experiment-scoped template that selects only the planning spans, with a JSONPath that extracts the relevant fields. Her judge references the view in the `{{trajectory}}` placeholder. At judge time, MLflow renders the view's extracted ranges instead of the raw trace — scoped automatically and consistently across traces.

```python
make_judge("""
evaluate the provided trajectory to see if the agent verifies it's work:
{{ trajectory }}
""")
```

This CUJ is API-shaped — no dedicated UI surface — but it justifies the view abstraction as a primitive beyond the trace explorer. The same view that helps an SME label also helps a judge score.

# UI design

The trace explorer gains two affordances when a view is in play: a **view selector** in the header and a **range-aware right pane**.

## Default state

Before any view is selected, the explorer behaves exactly as today: full nested timeline on the left, raw span details on the right.

## View selector

A dropdown in the header lists views available for the current trace, grouped by scope:

- **Trace views** — created for this specific trace (AI-generated, or manually saved)
- **Experiment templates** — defined at experiment scope, applied automatically to every trace in the experiment

"Raw trace" is always present as the no-view default. "+ Create view" opens the in-UI editor with an empty draft.

## View active — range summary

When a view is selected, the right pane switches from raw span details to a **range summary**: one card per range, in `position` order, with the label, description, and extracted I/O. Each card has a color that matches a highlight on the corresponding span(s) in the left timeline.

Non-matching spans in the timeline are visually dimmed but still visible. This preserves orientation — the user always knows where in the trace they are — and reduces the cognitive cost of toggling between view and raw.

## Edit mode

Clicking **Edit** on an active view, or **+ Create view** in the selector, enters edit mode. A toolbar appears at the top with the view's name field and Save/Cancel actions. The timeline becomes interactive: dragging across spans selects them as a range. On release, a configuration panel appears for the new range — label, description, and optional input/output JSONPath.

`JsonFieldSelector` (shown above in CUJ 2) is a checkbox tree over a span's input or output structure. Picking a leaf generates the corresponding JSONPath; power users can edit the JSONPath directly if they need a dialect feature the tree doesn't support.

Edit mode is non-destructive — Cancel discards the draft, Save persists.

# Creation model

Three creation paths, in priority order:

1. **AI generates, developer edits (default).** The built-in trace-view skill produces a `TraceView` from a trace. The developer reviews and corrects in the UI before saving or promoting to a template. This is the path most users take.

```python
trace = mlflow.get_trace("tr-abc")
summary = trace.summarize(model="openai:/gpt-4o")
view = summary.create_view()
```

2. **Direct UI creation.** From **+ Create view**, the developer drag-selects spans, defines ranges, and saves. No AI involved. This is CUJ 3.

3. **Python API and CLI.** For batch and programmatic use:

```python
trace.create_view(
name="Tool results",
ranges=[SpanRange(
from_selector=SpanSelector(span_type="TOOL"),
label="Tool calls",
output_path="$.result",
position=0,
)],
)
```

```
mlflow traces create-view --trace tr-abc --ranges-json '[...]'
```

The same `TraceView` entity backs all three paths. The AI path is a thin wrapper that calls the same `create_view` API a developer would.

# Data model and API

A `TraceView` contains an ordered list of `SpanRange`s. Each range identifies a segment of the trace and optionally extracts fields from it.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Braintrust supports custom view through React function. What is the strength of doing limiting the choice to a range of span over that? Is it a risk that we cannot support some type of views that competitors can do?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Listed that in the later section, personally I think that's quite a strong approach, and I seriously considered something like it initially. Here's how I see it:

Pros:

  1. Really flexible and can hook into the feedback APIs to provide custom human-feedback.
  2. Embeddable (I assume?) so it can be served directly in a custom app or replace a "dumb" thumbs-up/thumbs-down UX.

Cons:

  1. Locks consumers in to having a DOM-based rendering layer and the deps. to produce it. That actually disqualifies this from being usable as a judge template var {{ trajectory }} which is just as strong of a motivator as SME readability. Maybe there's a way to get really clever about SSR-ing views for judges.
  2. Unsafe/incorrect: Relies on a coding agent to produce error free JSX (TSX?), unclear how they test it, unclear if it's actually trustable code.
  3. Can't be as easily edited via the UI, e.g. how do I add a JSON field to one of the elements I want to render? I have to write some JSX in the text area?

Do you see any other pros that I'm missing? I think the solution is really pretty nice but technically more complex and the more I think about it, it's hard to come up with custom things you'd want the UI to actually do here.

Another approach is to have TraceViews support composable widgets that could be more easily created in the UI directly without custom code issues mentioned above.


```
TraceView
├── view_id : str (tv-<uuid>)
├── name : str
├── trace_id : str | None (trace-scoped)
├── experiment_id : str | None (experiment-scoped template)
├── created_by : str | None
└── ranges : list[SpanRange]
├── label : str
├── description : str (supports [text](spans/{id}) deeplinks)
├── from_selector : SpanSelector (required)
├── to_selector : SpanSelector | None (for multi-span ranges)
├── input_path : str | None (JSONPath)
├── output_path : str | None (JSONPath)
└── position : int
```

A `SpanSelector` matches one criterion at a time: `span_id`, `span_name`, `span_type`, or `attribute_key` / `attribute_value`. No boolean combinators — selectors are intentionally minimal, with multi-range views composing instead of one over-expressive selector per range.

**Scoping.** Trace-scoped views attach to one trace; experiment-scoped templates apply to every trace in their experiment. Both scopes appear in the selector for a given trace, grouped separately.

**REST API (sketch).**

```
POST /mlflow/traces/{trace_id}/views
GET /mlflow/traces/{trace_id}/views
GET /mlflow/traces/{trace_id}/views/{view_id}
PATCH /mlflow/traces/{trace_id}/views/{view_id}
DELETE /mlflow/traces/{trace_id}/views/{view_id}

POST /mlflow/experiments/{exp_id}/views
GET /mlflow/experiments/{exp_id}/views
PATCH /mlflow/experiments/{exp_id}/views/{view_id}
DELETE /mlflow/experiments/{exp_id}/views/{view_id}
```

**Schema.** A single `trace_views` table holds both scopes, with a `CHECK` constraint that exactly one of `trace_id` / `experiment_id` is set. Ranges are stored as a JSON column on the row to keep schema migrations minimal.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all views visible by default to everyone? In a large organization or teams, I think there should be some filtering/grouping otherwise it is hard to find a view from hundreds of them.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt that there would be that many, maybe it's worth getting rid of experiment scoping on traces altogether? Others have project views as well as trace views.


Full implementation: see the PR at [`forrestmurray-db:impl/trace-views-slim`](https://github.com/mlflow/mlflow/compare/master...forrestmurray-db:mlflow:impl/trace-views-slim).

# Alternatives

**1. Client-side only (no persistence).** Views as URL parameters or local storage. Rejected because sharing views across users (CUJ 3) and experiment templates (CUJ 2) require server storage.

**2. Span tags instead of views.** Tag individual spans, filter the UI by tag. Rejected because tagging mutates trace data and doesn't support multiple coexisting perspectives. A single trace can have an SME view, a debugging view, and a judge view simultaneously without conflict.

**3. Single span filter per view.** The first draft used one `SpanFilter` per view. Real traces have multiple phases users want to highlight together — planning, retrieval, generation — so the model evolved to multi-range with `SpanRange[]`. The single-filter version made the common case awkward.

**4. Custom React components (Braintrust Custom Views).** Braintrust lets developers write arbitrary React for trace rendering. Rejected because (a) the surface is too large to land in OSS MLflow with reasonable maintenance cost, (b) React components are not diffable or auditable in the way a declarative view config is, and (c) the upside — total flexibility — isn't worth it when AI-generated declarative views handle the common cases. Storing and executing arbitrary React in Managed MLflow would also require a sandboxing story this RFC doesn't want to design.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if these weakness is

the surface is too large to land in OSS MLflow with reasonable maintenance cost

The interface between the custom view definition and MLflow is actually narrow and soft in custom component than proposed option. For example, Braintrust is represented as a function that takes trace/spans/updates. As far as the input is typed as those objects, the platform does not need to meet other contract. On the other hand, the proposed framework defines other interface like span range, selector, which needs to be maintained in a backward compatible way.

React components are not diffable or auditable in the way a declarative view config is

React components are code, which is easy to manage.

the upside — total flexibility — isn't worth it when AI-generated declarative views handle the common cases.

I think the upside is big. A basic view like this in Braintrust example cannot be represented with the proposed model. Whether or not the desired view is archivable or not is a hard blocker so can be a deal breaker.

Storing and executing arbitrary React in Managed MLflow would also require a sandboxing story this RFC doesn't want to design.

Imo this is the only cons, but I think it is worth building a sandbox solution. It is not very new problem and there should be an existing solution.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't see this comment before responding here


**5. Conversation-mode rendering only (Braintrust Thread View, LangSmith Messages View).** Hard-filter to LLM and score spans, render as a chat thread. Works for simple chat agents; breaks for agents with significant tool use, retrieval, or orchestration where the non-LLM spans are exactly the decisions an SME or judge cares about. Trace views generalize this to "any range of spans, with extraction" — a Thread-view-style render is one possible view, not a baked-in mode.

# Adoption strategy

Trace views are additive. No existing behavior changes; users who never create a view see the trace explorer they have today.

- **Phase 1 — the slim PR.** Entity, REST, Python client, CLI, in-UI view selector, range rendering, in-UI editor, deeplinks. Direct UI creation (CUJ 3) works end-to-end. Experiment templates work. Python API works. AI generation lives behind the assistant integration and lands separately.
- **Phase 2 — AI generation and judge integration.** `trace.summarize()` / `summary.create_view()` and the assistant integration in the trace explorer (CUJ 2). `{{trajectory}}` rendering via views in the judge runtime (CUJ 4).
- **Phase 3 — scale.** Batch view creation across a dataset; cross-trace template versioning; selector variables in templates.

# Open questions

1. **Should the in-UI editor ship in Phase 1?** It increases the v1 surface area but materially shortens the path from "I see a useful trace shape" to "I have a view I can share" (CUJ 3 depends on it entirely). The slim PR includes it. The alternative is shipping Phase 1 with the API + selector only and deferring the editor to Phase 2 to reduce review scope.

2. **Where should display controls live (collapse defaults, attribute hiding, span renames)?** Not in the v1 schema. Options: (a) add a `display_config` JSON column on `trace_views`, (b) apply display at render-time only via the template, (c) defer entirely and ship v2 once we have data on what configurability users actually want.

3. **Selector variables in experiment templates.** A template that says "match the span named `{tool_name}`" where `tool_name` varies per trace. Currently templates use static selectors, which limits utility for experiments with heterogeneous trace shapes.

4. **JSONPath dialect.** Python (`jsonpath-ng`) and JavaScript (`jsonpath-plus`) implement subtly different JSONPath dialects. v1 restricts to the common subset, but a more expressive dialect — or a different extraction language — would let users do things like array slicing and conditional matches.

5. **AI-generated view quality.** The default creation path depends on the LLM correctly identifying the agent's phases and choosing reasonable selectors. Anecdotal results on a few agent shapes are promising; broader validation comes during Phase 2 dogfooding. The "developer edits AI output" flow is the safety net here — if it gets used heavily, the AI needs work; if it barely gets used, the AI is doing its job.

6. **Read-time consistency under schema change.** When a trace's span structure changes after a view was created (re-instrumentation, agent refactor), some `SpanSelector`s may no longer match. v1 silently renders empty ranges. Should we surface a warning? Auto-suggest fixes via the AI path? Out of scope for v1, worth surfacing.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.