diff --git a/docs/features/distributed-tracing.md b/docs/features/distributed-tracing.md new file mode 100644 index 0000000..2ca473c --- /dev/null +++ b/docs/features/distributed-tracing.md @@ -0,0 +1,308 @@ +# Distributed Tracing + +This document specifies how Braintrust SDKs propagate trace context across +service and process boundaries, so that spans produced by separate services +are linked into a single trace. + +## Scope + +This document only concerns **how spans are linked into a trace across a +boundary** — i.e. how a downstream service learns which span/trace/project a +unit of work belongs to. It does NOT cover: + +- How linked traces are rendered in the Braintrust UI. +- What happens when a service receives parent info referring to a trace or + project it does not have access to. (This is a backend/UI concern and is + out of scope here.) + +## Propagation mechanism + +Braintrust SDKs propagate trace context using the +[W3C Trace Context specification](https://www.w3.org/TR/trace-context/). W3C +carries trace identity (which trace and which parent span). Braintrust +additionally uses the [W3C Baggage](https://www.w3.org/TR/baggage/) header to +carry the **Braintrust parent** — the project or experiment the trace belongs +to. + +Example headers: + +``` +traceparent: 00-f53d4cd03acedba3ca85a4605ca4bdce-baeeec9367deae51-03 +baggage: braintrust.parent=project_id:12345 +``` + +- `traceparent` encodes the trace id and parent span id (plus the W3C version + and flags). +- `baggage` carries `braintrust.parent`, which tells the receiver which + Braintrust container (project, experiment, etc.) the trace belongs to. + +This is the propagation mechanism for **all** Braintrust SDKs, both the +OpenTelemetry-based SDKs (Java, Go, .NET, and the `braintrust[otel]` / +`@braintrust/otel` integrations) and the native SDKs (Python and TypeScript). + +## Wire format + +### Headers + +A conforming SDK MUST propagate trace context using the following headers. + +| Header | Spec | Required on send | Carries | +| ------------- | --------------------------------------------------------- | --------------------------------------- | ------------------------------------------------ | +| `traceparent` | [W3C Trace Context](https://www.w3.org/TR/trace-context/) | MUST | trace id, parent span id, version, flags | +| `baggage` | [W3C Baggage](https://www.w3.org/TR/baggage/) | MUST when a Braintrust parent is known | `braintrust.parent=` plus any user baggage | +| `tracestate` | [W3C Trace Context](https://www.w3.org/TR/trace-context/) | MAY | vendor extensions (passed through if present) | + +### `traceparent` + +The `traceparent` header MUST follow the W3C format +`version-trace_id-parent_id-flags`, e.g.: + +``` +traceparent: 00-f53d4cd03acedba3ca85a4605ca4bdce-baeeec9367deae51-03 +``` + +The 16-byte `trace_id` identifies the trace (it is the analogue of Braintrust's +`root_span_id`); the 8-byte `parent_id` identifies the specific parent span. + +### `baggage` and `braintrust.parent` + +W3C `traceparent` carries trace identity but **not** the Braintrust container +the trace belongs to. The SDK MUST therefore also propagate the Braintrust +parent in the `baggage` header under the key `braintrust.parent`: + +``` +baggage: braintrust.parent=project_id:12345 +``` + +The `braintrust.parent` value identifies a Braintrust container: +`project_id:`, `project_name:`, or `experiment_id:`. This is the +same value carried by the `braintrust.parent` span attribute documented in the +[instrumentation guide](../instrumentation-guide.md#routing-and-context). (The +exported span slug used by the deprecated slug-passing path is a separate +mechanism and is not a valid `braintrust.parent` baggage value.) + +The `braintrust.parent` baggage entry is REQUIRED whenever the sender knows its +Braintrust parent. Without it, a receiver can resolve trace identity from +`traceparent` but cannot determine which project/experiment to log under. + +Implementations MUST preserve any non-Braintrust baggage entries already +present (baggage is a shared, multi-tenant header). Only the +`braintrust.parent` key is owned by the SDK. + +## Send behavior + +When an SDK forwards a request as part of an active span, it MUST inject the +propagation headers into the outbound carrier (HTTP headers, message-queue +metadata, gRPC metadata, etc.): + +1. Inject `traceparent` (and `tracestate` if present) from the current span's + context, per W3C. +2. Inject `baggage` with a `braintrust.parent` entry set to the SDK's current + Braintrust parent, merged with any existing baggage. + +Injected header names MUST be lowercase (`traceparent`, `baggage`, +`tracestate`), per W3C (§3.2.1 / §3.3.1). If the outbound carrier already +contains a case-variant of one of these headers (e.g. a title-cased `Baggage` +added by a framework), the SDK MUST overwrite it in place of emitting a second, +conflicting case-variant, so the carrier ends up with a single lowercase key. + +Exception: native SDKs (Python and TypeScript) configured with the legacy UUID +ID format cannot produce W3C-shaped ids. For such spans, `inject` is a no-op — +neither `traceparent` nor `baggage` is written, and any pre-existing carrier +entries are left untouched. See +[Native SDK ID format and back-compat](#native-sdk-id-format-and-back-compat). + +## Receive behavior + +When an SDK begins a span in response to an inbound request, it MUST attempt to +resolve a parent from the incoming carrier: + +1. **W3C context** — extract `traceparent` (and `tracestate`) to establish + trace identity, and read `braintrust.parent` from `baggage` to establish the + Braintrust container. A span started under this context MUST share the + incoming trace id and be parented to the incoming span. If `traceparent` is + valid but no `braintrust.parent` baggage entry is present, the SDK MUST still + adopt the incoming trace identity and route under the currently-active + logger/experiment. +2. **No parent** — if no valid `traceparent` is present, start a fresh root + span. + +Header lookups MUST be case-insensitive. (Some HTTP frameworks normalize header +names to title case, e.g. `Traceparent`, while the W3C propagators look up +lowercase keys. Implementations that delegate to a strict propagator MUST +lowercase incoming header names first.) + +Native SDKs do not read inbound request objects themselves; the user supplies +the headers (see [Native SDK API](#native-sdk-api)). + +## Native SDK API + +The native SDKs (Python and TypeScript) expose a propagation API that mirrors +the OpenTelemetry vocabulary, so the concepts line up across paradigms: + +| Concept | OpenTelemetry | Braintrust native SDK | +|---------------------------------|--------------------|---------------------------------------------------| +| Write context into a carrier | `inject(carrier)` | `span.inject(carrier)` / `inject_trace_context()` | +| Read context from a carrier | `extract(carrier)` | `extract_trace_context(headers)` | +| The headers/metadata being read | carrier | carrier / headers | + +The terms **inject**, **extract**, and **carrier** are used exactly as in the +W3C/OTel propagation API. The native SDK never reads request objects directly: +the user pulls headers off their request and hands them to +`extract_trace_context`, just as they hand a carrier dict to `inject`. + +`extract_trace_context(headers)` returns an **opaque propagation context** — a +value the caller passes straight to `start_span(parent=...)`. (Concretely, the +native SDKs return a dict carrying the relevant W3C headers; callers MUST treat +it as opaque.) `start_span(parent=...)` accepts either this opaque context or a +previously exported slug string. + +``` +# Send: write the current span's context into an outbound carrier. +carrier = span.inject(headers) # or inject_trace_context(headers) + +# Receive: read an inbound carrier and start a span under it. +ctx = extract_trace_context(request.headers) +with start_span(name="handler", parent=ctx) as span: + ... +``` + +`span.inject` / `inject_trace_context` implement the +[Send behavior](#send-behavior); `extract_trace_context` + `start_span` +implement the [Receive behavior](#receive-behavior). Both interpret only the +W3C headers (`traceparent`, `baggage`, `tracestate`), case-insensitively. When +`extract_trace_context` finds no valid `traceparent` it returns the SDK's "no +parent" value, so `start_span` begins a fresh root. + +`tracestate` is forwarded transparently: `start_span` captures any inbound +`tracestate` when it resolves an extracted context, every span in the trace +inherits it, and a later `inject` anywhere in the trace re-emits it unchanged. +Braintrust does not author or interpret `tracestate` entries of its own. + +### Deprecated: passing the parent slug + +Prior to adopting W3C Trace Context, the native SDKs propagated trace context by +serializing a span into a single opaque slug via `span.export()` and shipping it +across the boundary (conventionally in an `x-bt-parent` header, though the SDK +reads and writes no particular header — the user moves the slug across the +boundary themselves). The slug encodes the parent span id, root span id, and the +Braintrust parent in one self-describing token, and `start_span(parent=)` +still accepts it. + +**This pattern is deprecated.** A single opaque slug cannot represent +multi-header W3C context and does not interoperate with OTel SDKs. New code +SHOULD use the `inject` / `extract` pattern above; `span.export()` / +`start_span(parent=)` remain only for backward compatibility. OTel SDKs +MUST NOT emit or honor the slug — linking a native service to an OTel service +requires the native SDK to speak W3C, not the slug. + +## Test cases + +SDK conformance tests SHOULD cover the following. Tests are organized by the +unit under test: header injection (send), header extraction (receive), and the +end-to-end round trip. + +### Send: header injection + +Given an active span with a known Braintrust parent, the SDK injects headers +into an outbound carrier. Assert: + +- `traceparent` is present and well-formed: matches + `^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$`, the trace id is non-zero, and + the parent id is non-zero. +- The injected trace id equals the active span's trace id / `root_span_id` + analogue, and the parent id equals the active span's span id. +- `baggage` is present and contains `braintrust.parent` set to the SDK's current + parent (e.g. `project_id:`). +- Pre-existing, non-Braintrust baggage entries on the outbound context are + preserved (inject does not clobber unrelated baggage). +- When no Braintrust parent is known (and the span uses W3C-shaped hex ids), + `traceparent` is still injected; `braintrust.parent` is absent from baggage + (rather than emitted empty). (Native SDKs using legacy UUID ids inject nothing; + see the ID-format cases below.) +- Injected header names are lowercase. If the outbound carrier already carries a + case-variant (e.g. a title-cased `Baggage`/`Traceparent`), the result has a + single lowercase key, not two conflicting case-variants. + +### Receive: header extraction + +Given an inbound carrier, the SDK starts a span and resolves its parent. +Assert one case per row: + +| Inbound headers | Expected resolution | +|--------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------| +| valid `traceparent` + `baggage: braintrust.parent=...` | span shares the inbound trace id, is parented to the inbound span, and is routed to the baggage parent | +| valid `traceparent`, no `braintrust.parent` baggage | span shares the inbound trace id and parent; routing falls back to the active logger/experiment | +| no propagation headers | span is a fresh root (new trace id, no parent) | +| malformed `traceparent` (bad version, wrong length, zero ids) | treated as absent → fresh root span | +| header names in non-lowercase form (e.g. `Traceparent`, `Baggage`) | extracted correctly (case-insensitive lookup) | +| `baggage` with both `braintrust.parent` and unrelated keys | `braintrust.parent` is consumed; unrelated keys are ignored, not errored | +| valid `traceparent` + `tracestate` | the inbound `tracestate` is captured and forwarded unchanged on any later `inject` within the trace | + +### Round trip + +Inject from a parent span, then extract on a fresh context using the produced +headers. Assert the extracted context's trace id and parent span id match the +originating span, and the resolved Braintrust parent matches. This catches +inject/extract asymmetries (e.g. one side using a different baggage key or +encoding). + +Also assert `tracestate` pass-through: given an inbound carrier with a +`tracestate` header, a span started from it (and its descendants) MUST forward +that same `tracestate` value when injecting onward; when no inbound `tracestate` +was present, none is emitted. + +### Negative / robustness + +- Injecting then exporting to Braintrust MUST NOT fail or drop the span if the + Braintrust parent is unknown — propagation is best-effort and MUST NOT break + span emission. +- An oversized or syntactically invalid `baggage` header MUST NOT throw; the SDK + falls back to trace identity from `traceparent` (or a fresh root). + +### Native SDK ID format and back-compat + +These cases apply only to the native SDKs (Python and TypeScript), which can +generate span ids in two formats: the default W3C/OTel-compatible **hex** ids +(16-byte trace id, distinct 8-byte span id) or legacy **UUID** ids (opt-in). The +ID format determines whether W3C propagation is possible and which span-export +serialization is used. See [Native SDK API](#native-sdk-api). + +**ID format defaults and selection** + +- By default (no configuration), spans use hex ids and `span.export()` produces + the hex-compatible serialization. +- The legacy opt-out (e.g. `BRAINTRUST_LEGACY_IDS`) switches generation to + UUID ids and the UUID serialization. The two move together: hex ids MUST + serialize as the hex format and UUID ids as the UUID format, so a span is + never exported in a serialization that cannot represent its ids. +- OTel-compat requires hex ids, so it takes precedence over the legacy UUID + opt-out. If both are enabled, the SDK uses hex ids and logs a warning (at most + once per process) noting that the legacy opt-out was ignored; it MUST NOT + silently produce a contradictory configuration. + +**Send (`inject`) by ID format** + +| Active ID format | Expected `inject` behavior | +| ---------------- | ------------------------------------------------------------------------------------- | +| hex (default) | injects a well-formed `traceparent` (and `braintrust.parent` baggage when known) | +| legacy UUID | no-op: ids are not W3C-shaped, so `traceparent`/`baggage` are NOT written; any pre-existing carrier entries are left untouched | + +**Receive: parent-slug format match** + +`start_span(parent=)` accepts a serialized span slug (the deprecated +slug-passing path). The slug's ids are honored only when they match the active +ID format; a format mismatch MUST NOT mix formats within a trace. + +| Active ID format | Parent slug ids | Expected resolution | +| ---------------- | --------------- | ----------------------------------------------------------------------------------- | +| hex (default) | hex | child links to the slug's span/root ids (shares trace, parented to the slug span) | +| hex (default) | legacy UUID | slug span/root ids are ignored → child is a fresh **hex** root; object routing (project/experiment) from the slug is still honored | +| legacy UUID | legacy UUID | child links to the slug's span/root ids | +| legacy UUID | hex | slug span/root ids are ignored → child is a fresh **UUID** root; object routing from the slug is still honored | + +**Back-compat round trip** + +- `span.export()` followed by `start_span(parent=)` MUST round-trip + correctly under the default hex ids (8-byte span id, 16-byte trace id): the + child shares the parent's trace id and is parented to the parent's span id.