Harden instrumentation-state (Stage 1): self-verifying telemetry completeness + order-independent tests by Pendu · Pull Request #151 · traceopt-ai/traceml

Pendu · 2026-06-10T12:35:02Z

Addresses #141 (Stage 1). Follows up the DataLoader-completeness discussion from #88 / #135 (now merged; this branch is rebased onto current main). Stage 2 (the negative-control conformance test, unblocked by #146 removing the legacy import-time auto-init) is a separate follow-up.

What changed

Additive instrumentation-state hardening. No change to the training path, public API, or wire format (+453 lines, 7 files, all additions):

DataLoader-fetch completeness guard (HF): test_hf_init_enables_dataloader_and_h2d_patches proves init() requests the DataLoader patch; this guard goes one step further and proves the stream actually emits during a real Trainer + TraceMLTrainerCallback run, with _traceml_internal:dataloader_next events landing in flushed StepTimeBatches (tests/integrations/test_hf_trainer.py). Checking a config flag alone can't catch a broken patch, a renamed event, or a buffer that never flushes.
Cross-integration conformance gate: each integration declares the telemetry streams it owes (StreamContract); a parametrized test runs a tiny end-to-end CPU run via the integration's documented init() entry point and asserts that every declared stream emitted (tests/integrations/test_telemetry_conformance.py). HF is wired now; the Lightning case is skipped until Lightning is a test dependency.
Fail-loud capability assertion: when the callback is active but the init config leaves owed streams dark, emit one best-effort warning naming them, once per train() call in on_train_begin; never raises, never touches the per-step hot path (src/traceml_ai/integrations/_capability.py, wired into TraceMLTrainerCallback).
Global-state reset between tests: public reset_optimizer_timing() plus an autouse conftest teardown that clears the global optimizer-hook handles/flag and restores default trace-recording state (.../hooks/optimizer_hooks.py, tests/conftest.py). Makes the suite order-independent.
Fix: test_hf_trainer_callback_grad_accum_folds_microbatches is red on current main when the file is run standalone (1 failed, 6 passed). The test counts forward/backward auto-timer events, but nothing installs the patches before it in file order; CI doesn't run tests/integrations/, so the failure is invisible there. One init() call makes the test self-initializing. Fittingly, the new capability warning named the cause in the failure log: ... these telemetry streams will NOT be captured: backward, dataloader_fetch, forward, h2d.
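
The completeness guard and the conformance gate share one idea: compare the streams an integration declares against the event names that actually appear in flushed batches. A minimal sketch of that comparison follows. The `StreamContract` fields, the helper names, and the `example:*` event names are assumptions made for illustration; only `_traceml_internal:dataloader_next` is taken from this PR, and the real definitions in tests/integrations/test_telemetry_conformance.py may look different.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class StreamContract:
    """Hypothetical shape: the streams an integration owes."""

    integration: str
    # stream name -> event name expected in flushed batches (illustrative)
    owed: Dict[str, str]


def emitted_event_names(batches: Iterable[List[dict]]) -> Set[str]:
    """Collect every event name seen across all flushed batches."""
    return {event["name"] for batch in batches for event in batch}


def dark_streams(contract: StreamContract, batches: Iterable[List[dict]]) -> List[str]:
    """Return owed streams that never emitted, sorted for stable messages."""
    seen = emitted_event_names(batches)
    return sorted(name for name, ev in contract.owed.items() if ev not in seen)


# Only the dataloader event name comes from the PR; the others are placeholders.
HF_CONTRACT = StreamContract(
    integration="hf",
    owed={
        "dataloader_fetch": "_traceml_internal:dataloader_next",
        "forward": "example:forward",
        "backward": "example:backward",
    },
)

# Stand-in for batches captured from a tiny CPU run; forward never emits here.
flushed = [
    [{"name": "_traceml_internal:dataloader_next"}, {"name": "example:backward"}],
    [{"name": "_traceml_internal:dataloader_next"}],
]

missing = dark_streams(HF_CONTRACT, flushed)
print(missing)  # the gate would assert that this list is empty
```

A parametrized test would run this check once per registered contract and fail with the list of dark streams, so a renamed event or an unflushed buffer surfaces as a named stream rather than a silent gap.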

Why

The callback-based HF integration (#88 / #135) originally emitted DataLoader-fetch timing only incidentally, and nothing verified completeness; the gap was caught by maintainer review, not by the system. Separately, a global optimizer-hook flag was never reset, so wrap_optimizer tests passed or failed depending on suite order, and the grad-accum test on main fails for the same class of reason today. One root cause: ungoverned global instrumentation state. This PR makes telemetry completeness a self-verifying property and the suite order-independent, without touching the training path.

How I tested

PYTHONPATH=src python3 -m pytest -q
# 359 passed, 2 skipped in 9.08s

# previously order-dependent pairs, now green in any order:
PYTHONPATH=src python3 -m pytest tests/integrations/test_hf_trainer.py \
  "tests/sdk/test_init_and_wrappers.py::test_wrap_optimizer_preserves_identity_and_times_step" -q
# 9 passed

PYTHONPATH=src python3 -m pytest \
  "tests/integrations/test_hf_trainer.py::test_hf_trainer_callback_grad_accum_folds_microbatches" -q
# 1 passed  (fails standalone on main)

Each new test was written fails-first / passes-after (TDD). black/ruff clean on touched files.

Runtime impact

No training-path runtime impact
May affect training-path overhead
May affect distributed launch, telemetry, or aggregator behavior
Not applicable

Notes: Additive. New tests, a test-only reset hook, and a best-effort warning that fires once per train() call in on_train_begin (outside the step loop), only when an integration is active but unconfigured.
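
The shape of that warning path can be sketched as below: one check in `on_train_begin`, wrapped so that a failure in diagnostics can never interrupt training, and nothing in the per-step method. `warn_if_streams_dark`, `SketchCallback`, and the logger name are hypothetical; the real logic lives in src/traceml_ai/integrations/_capability.py and may be structured differently.

```python
import logging
from typing import Iterable, List, Set

logger = logging.getLogger("capability_sketch")


def warn_if_streams_dark(owed: Iterable[str], enabled: Set[str]) -> List[str]:
    """Log one warning naming owed-but-disabled streams; never raise."""
    try:
        dark = sorted(set(owed) - set(enabled))
        if dark:
            logger.warning(
                "callback is active but these telemetry streams will NOT be "
                "captured: %s",
                ", ".join(dark),
            )
        return dark
    except Exception:
        # Best effort: a diagnostics failure must not interrupt training.
        return []


class SketchCallback:
    """Stand-in for a trainer callback; only on_train_begin runs the check."""

    def __init__(self, owed: Iterable[str], enabled: Set[str]) -> None:
        self.owed = owed
        self.enabled = enabled

    def on_train_begin(self) -> List[str]:
        # Runs once per train() call, before the step loop starts.
        return warn_if_streams_dark(self.owed, self.enabled)

    def on_step_end(self) -> None:
        # Hot path: deliberately no capability checks here.
        pass


cb = SketchCallback(
    owed=["forward", "backward", "dataloader_fetch", "h2d"], enabled=set()
)
dark = cb.on_train_begin()
print(dark)
```

Sorting the stream names keeps the message stable across runs, which makes it easy to match in a failure log.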

Documentation

Stage 1 is internal hardening; the public init() + callback flow is unchanged.

Risk checklist

Does not add unnecessary CUDA synchronizations
Does not add blocking I/O on the training path
Fails safely without crashing user training
Keeps new dependencies optional unless discussed
Redacts or avoids sensitive training-environment data in examples

Screenshots or output

$ PYTHONPATH=src python3 -m pytest -q
........................................................................ [100%]
359 passed, 2 skipped in 9.08s

The capability warning in action (from the pre-fix grad-accum failure log):

WARNING traceml_ai.integrations._capability: [TraceML] HuggingFace
TraceMLTrainerCallback is active but these telemetry streams will NOT be
captured: backward, dataloader_fetch, forward, h2d. Call
traceml_ai.init(mode='auto') before training to enable them.

Pendu added 6 commits June 10, 2026 14:19

test(hf): add DataLoader-fetch completeness guard (proves the traceopt-ai#88-class stream lands)
152913b

test(integrations): add StreamContract cross-integration conformance gate
95c8497

fix(tests): reset global optimizer-hook state between tests (fixes TRA-28 order-dependence)
a87c2a4

feat(integrations): fail-loud capability assertion — warn when owed telemetry streams will be dark
2cf6469

adapt hardening to upstream 95483c7: hf init() path, traceopt-ai#143 recording reset
27e026a

fix(tests): make HF grad-accum test self-initializing (order-independent)
43ea61e

Pendu mentioned this pull request Jun 10, 2026

Harden instrumentation-state: make integration telemetry completeness self-verifying #141

Open

4 tasks

style: match pinned black 24.10.0 line-wrap in test_capability.py

6cc8780

Pendu wants to merge 7 commits into traceopt-ai:main from Pendu:feat/instrumentation-state-hardening
