Skip to content

Harden instrumentation-state (Stage 1): self-verifying telemetry completeness + order-independent tests#151

Open
Pendu wants to merge 7 commits into
traceopt-ai:mainfrom
Pendu:feat/instrumentation-state-hardening
Open

Harden instrumentation-state (Stage 1): self-verifying telemetry completeness + order-independent tests#151
Pendu wants to merge 7 commits into
traceopt-ai:mainfrom
Pendu:feat/instrumentation-state-hardening

Conversation

@Pendu

@Pendu Pendu commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Addresses #141 (Stage 1). Follows up the DataLoader-completeness discussion from #88 / #135 (now merged; this branch is rebased onto current main). Stage 2 (the negative-control conformance test, unblocked by #146 removing the legacy import-time auto-init) is a separate follow-up.

What changed

Additive instrumentation-state hardening. No change to the training path, public API, or wire format (+453 lines, 7 files, all additions):

  • DataLoader-fetch completeness guard (HF): test_hf_init_enables_dataloader_and_h2d_patches proves init() requests the DataLoader patch; this guard goes one step further and proves the stream actually emits during a real Trainer + TraceMLTrainerCallback run, with _traceml_internal:dataloader_next events landing in flushed StepTimeBatches (tests/integrations/test_hf_trainer.py). A config flag can't catch a broken patch, a renamed event, or a buffer that never flushes.
  • Cross-integration conformance gate: each integration declares the telemetry streams it owes (StreamContract); a parametrized test runs a tiny end-to-end CPU run via the integration's documented init() entry point and asserts every declared stream emitted (tests/integrations/test_telemetry_conformance.py). Wires HF now; Lightning skips until it's a test dependency.
  • Fail-loud capability assertion: when the callback is active but the init config leaves owed streams dark, emit one best-effort warning naming them, once per train() call in on_train_begin; never raises, never touches the per-step hot path (src/traceml_ai/integrations/_capability.py, wired into TraceMLTrainerCallback).
  • Global-state reset between tests: public reset_optimizer_timing() plus an autouse conftest teardown that clears the global optimizer-hook handles/flag and restores default trace-recording state (.../hooks/optimizer_hooks.py, tests/conftest.py). Makes the suite order-independent.
  • Fix: test_hf_trainer_callback_grad_accum_folds_microbatches is red on current main (run the file standalone: 1 failed, 6 passed; it counts forward/backward auto-timer events but nothing installs the patches before it in file order, and CI doesn't run tests/integrations/, so it's invisible there). One init() call makes it self-initializing. Fittingly, the new capability warning named the cause in the failure log: ... these telemetry streams will NOT be captured: backward, dataloader_fetch, forward, h2d.

Why

The callback-based HF integration (#88 / #135) originally emitted DataLoader-fetch timing only incidentally, and nothing verified completeness; the gap was caught by maintainer review, not by the system. Separately, a global optimizer-hook flag was never reset, so wrap_optimizer tests passed or failed depending on suite order, and the grad-accum test on main fails for the same class of reason today. One root cause: ungoverned global instrumentation state. This PR makes telemetry completeness a self-verifying property and the suite order-independent, without touching the training path.

How I tested

PYTHONPATH=src python3 -m pytest -q
# 359 passed, 2 skipped in 9.08s

# previously order-dependent pairs, now green in any order:
PYTHONPATH=src python3 -m pytest tests/integrations/test_hf_trainer.py \
  "tests/sdk/test_init_and_wrappers.py::test_wrap_optimizer_preserves_identity_and_times_step" -q
# 9 passed

PYTHONPATH=src python3 -m pytest \
  "tests/integrations/test_hf_trainer.py::test_hf_trainer_callback_grad_accum_folds_microbatches" -q
# 1 passed  (fails standalone on main)

Each new test was written fails-first / passes-after (TDD). black/ruff clean on touched files.

  • Unit tests
  • Integration or smoke test
  • Example training script
  • Docs-only change
  • Not run; reason:

Runtime impact

  • No training-path runtime impact
  • May affect training-path overhead
  • May affect distributed launch, telemetry, or aggregator behavior
  • Not applicable

Notes: Additive. New tests, a test-only reset hook, and a best-effort warning that fires once per train() call in on_train_begin (outside the step loop), only when an integration is active but unconfigured.

Documentation

  • README updated
  • User docs updated
  • Examples updated
  • API docs updated
  • Not needed

Stage 1 is internal hardening; the public init() + callback flow is unchanged.

Risk checklist

  • Does not add unnecessary CUDA synchronizations
  • Does not add blocking I/O on the training path
  • Fails safely without crashing user training
  • Keeps new dependencies optional unless discussed
  • Redacts or avoids sensitive training-environment data in examples

Screenshots or output

$ PYTHONPATH=src python3 -m pytest -q
........................................................................ [100%]
359 passed, 2 skipped in 9.08s

The capability warning in action (from the pre-fix grad-accum failure log):

WARNING traceml_ai.integrations._capability: [TraceML] HuggingFace
TraceMLTrainerCallback is active but these telemetry streams will NOT be
captured: backward, dataloader_fetch, forward, h2d. Call
traceml_ai.init(mode='auto') before training to enable them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant