Surface web search/fetch citations across all providers#318

Draft
cpsievert wants to merge 7 commits into main from feat/citation-content-model
cpsievert (Collaborator) commented Jun 3, 2026

Why this matters

When a model answers using web search or fetch tools, the most useful artifact is which sources back the answer. Previously chatlas dropped that information — OpenAI's citation annotations were discarded, Anthropic's citation deltas never reached normalized content, and Google's grounding metadata was ignored entirely.

This PR gives chatlas a normalized citation model so grounded answers carry their sources through to turns uniformly — both progressively during streaming and on the final turn — enabling downstream UIs (e.g. shinychat) to render footnote markers and source lists.

What's now possible

  • ContentCitation(url, title) appears in the turn's contents list after the text it grounds, in stream order. Its position relative to surrounding ContentText items is the placement signal — no offsets or span-matching needed.
  • During streaming (content="all"), ContentCitation objects arrive interleaved with text (OpenAI/Anthropic) or at stream end (Google), so UIs can render citations progressively.
  • On the final turn, ContentCitation items sit in the same contents list in the same order — streaming and replay have one shape, not two.
  • Web search results are now richer Source objects (url, title, domain), and web fetch results carry a status field.
  • ContentCitation and Source are exported from chatlas.types.
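To illustrate the "position is the placement signal" idea, here is a minimal sketch of how a downstream UI might turn a contents list into footnoted text. The `ContentText` and `ContentCitation` classes below are simplified stand-ins with assumed shapes, not the real chatlas classes, and `render_with_footnotes` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Union


# Stand-ins for the chatlas content types (assumed shapes, for illustration only).
@dataclass
class ContentText:
    text: str


@dataclass
class ContentCitation:
    url: str
    title: str


Content = Union[ContentText, ContentCitation]


def render_with_footnotes(contents: list[Content]) -> str:
    """Place a [n] marker wherever a citation follows text, then list the sources."""
    body: list[str] = []
    sources: list[ContentCitation] = []
    index_by_url: dict[str, int] = {}
    for item in contents:
        if isinstance(item, ContentText):
            body.append(item.text)
        elif isinstance(item, ContentCitation):
            # Reuse the footnote number when the same URL is cited again.
            if item.url not in index_by_url:
                sources.append(item)
                index_by_url[item.url] = len(sources)
            body.append(f"[{index_by_url[item.url]}]")
    notes = [f"[{i}] {s.title} <{s.url}>" for i, s in enumerate(sources, start=1)]
    return "".join(body) + "\n\n" + "\n".join(notes)
```

Because the marker is derived purely from list order, the same function would work on items collected during streaming and on the final turn's contents.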

Breaking change

ContentToolResponseSearch.urls (list[str]) is replaced by .sources (list[Source]). Migrate x.urls to [s.url for s in x.sources]. (These content types are recent, so blast radius is small.)
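A rough sketch of the migration for calling code that must run against both the old and new shapes. `Source` and `ContentToolResponseSearch` here are simplified stand-ins (only the fields needed for the example), and `urls_of` is a hypothetical compatibility helper, not part of chatlas.

```python
from dataclasses import dataclass


# Stand-ins with only the fields needed here (assumed shapes, not the real classes).
@dataclass
class Source:
    url: str
    title: str = ""


@dataclass
class ContentToolResponseSearch:
    sources: list[Source]


def urls_of(result) -> list[str]:
    """Return plain URLs from either the new `.sources` or the old `.urls` field."""
    sources = getattr(result, "sources", None)
    if sources is not None:
        # New shape: pull the URL out of each Source object.
        return [s.url for s in sources]
    # Old shape: `.urls` was already a list of strings.
    return list(getattr(result, "urls", []))
```

Code that only targets the new release can skip the helper and use the list comprehension directly.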

Notes for reviewers

  • Per-provider wiring lives in each _provider_*.py's stream_content() and _as_turn(). All three providers filter ContentCitation out of turn serialization (it's client-side metadata, not sent back to APIs).
  • The Google path is greenfield — grounding and url-context metadata were previously unsurfaced.
  • stream_content() now returns list[Content] instead of Optional[Content], allowing a single event to produce multiple content items (e.g. text + citations at block stop).
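For reviewers, the serialization filter mentioned above amounts to something like the following. This is a sketch with stand-in classes and a hypothetical function name (`contents_for_api`); the actual per-provider code may be structured differently.

```python
from dataclasses import dataclass


# Stand-ins for the chatlas content types (assumed shapes, for illustration only).
@dataclass
class ContentText:
    text: str


@dataclass
class ContentCitation:
    url: str
    title: str


def contents_for_api(contents: list) -> list:
    """Drop client-side citation items before building a provider request."""
    return [c for c in contents if not isinstance(c, ContentCitation)]
```

Filtering at serialization time keeps the stored turn intact, so the citations remain available for rendering even though they are never sent back to the API.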

Test plan

  • uv run pyright — 0 errors
  • Content model unit tests: types, round-trip, exports
  • Per-provider web search tests assert ContentCitation items in both streaming and final turns (all replaying VCR cassettes offline)
  • Turn accumulator tests verify non-mergeable content types append correctly
  • Reviewer: confirm full make check in CI

cpsievert marked this pull request as draft June 3, 2026 19:02
cpsievert force-pushed the feat/citation-content-model branch from 10560f5 to 9a944ca June 12, 2026 16:28
cpsievert changed the title from "Surface web search & fetch citations across providers" to "Surface web search/fetch citations across all providers" Jun 12, 2026
…entToolResponseSearch

- Add `Citation` (url, title, cited_text) and `Source` (url, title, fetch_status) types
- `ContentText.citations` holds a list of `Citation` objects; merges on `__add__`
- `ContentToolResponseSearch.sources` replaces the old `urls` field, adding `fetch_status`
- Export `Citation` and `Source` from the public `chatlas.types` namespace
…eam_other_contents

`stream_content()` is now the single hook for streaming — it returns a list of
`Content` objects emitted at each chunk. The old per-type hooks (`stream_text`,
`stream_other_contents`) are removed; providers and accumulators use the unified
list contract instead. `Chat` iterates the returned list to dispatch yields.
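The list contract can be sketched roughly as below. The chunk dictionaries, the content stand-ins, and `iter_contents` are all assumptions made for illustration; real provider chunks are SDK event objects, and the real `stream_content()` is a provider method.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Union


# Stand-ins for the chatlas content types (assumed shapes, for illustration only).
@dataclass
class ContentText:
    text: str


@dataclass
class ContentCitation:
    url: str
    title: str


Content = Union[ContentText, ContentCitation]


def stream_content(chunk: dict) -> list[Content]:
    """Hypothetical hook: map one raw chunk to zero or more content items."""
    out: list[Content] = []
    if chunk.get("text"):
        out.append(ContentText(chunk["text"]))
    for c in chunk.get("citations", []):
        out.append(ContentCitation(c["url"], c["title"]))
    return out


def iter_contents(chunks: Iterable[dict]) -> Iterator[Content]:
    """Dispatch every item a chunk produces, in order; empty lists yield nothing."""
    for chunk in chunks:
        yield from stream_content(chunk)
```

Returning a list means a chunk with nothing to emit is an empty list rather than `None`, and a chunk carrying both text and citations needs no special casing in the caller.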
…pic, Google)

Each provider now populates `ContentText.citations` from its web-search results
and emits `ContentCitation` items via `stream_content` during streaming:

- OpenAI: maps `url_citation` annotations to `Citation`; streams via `annotation.added`
- Anthropic: transfers web-search result citations to `ContentText`; streams
  citations interleaved at `content_block_stop`
- Google: surfaces grounding/url-context metadata as `Citation`/`Source`; emits
  citations at the final chunk via `stream_content`
- Document Citation, Source, and ContentCitation in the API reference
- Add sidebar/quarto entries for the new citation types
- Regenerate openai/_submit*.py type stubs to pick up the new moderation param
- CHANGELOG entry for the citation content model feature
…pe with flat url/title

Remove the dual citation representation (Citation class + ContentText.citations
vs ContentCitation) and unify on ContentCitation as the sole citation type in
both streaming and final turns. ContentCitation now carries url and title
directly (no wrapper); stream position is the placement signal.
cpsievert force-pushed the feat/citation-content-model branch from e1ef76b to 172147e June 12, 2026 20:51