
feat(smart_grid): add 25 additional transformer scenarios #292

Open
eggrollofchaos wants to merge 6 commits into IBM:main from HPML6998-S26-Team13:feature/smart-grid-additional-scenarios

Conversation


@eggrollofchaos eggrollofchaos commented May 11, 2026

Summary

Adds the remaining SmartGridBench transformer-maintenance scenarios on top of #287, expanding the Smart Grid local corpus from 36 records to 61 records.

This PR adds:

  • SGT-036..SGT-050: domain-coverage gap-fill scenarios across FMSR, IoT, TSFM, work-order, and multi-tool workflows.
  • SGT-051..SGT-060: capability-targeted discrimination checks covering calibration/abstention, prompt-premise contradiction, cross-tool reconciliation, strict output formatting, and truncated-tool-result discipline.
  • Raw JSON tests that assert the full AOB-FMSR-001 + SGT-001..SGT-060 ID set and guard the new benchmark_design / ground_truth.must_NOT_include rubric fields.
  • README and provenance-doc updates so the corpus count and evaluator-facing metadata are documented.

Relationship to #287

This is a follow-on to #287 and is intentionally based on the #287 branch head. Until #287 lands, GitHub will show the domain/server port plus this additional-scenario commit together in this PR's diff. The new commit here is only:

  • 5b7d8b0 feat(smart_grid): add 25 additional transformer scenarios

If maintainers prefer, #287 can be reviewed first and this PR can then be rebased so the visible diff contains only the 25-scenario expansion.

Size and split offer

This follow-on is larger than IBM's preferred small-PR guideline because it updates one canonical scenario array, its count/metadata tests, and the matching provenance documentation as one invariant-preserving unit. The new PR-specific code/data delta is the single 25-scenario expansion commit on top of #287. If maintainers prefer a smaller review path after #287 lands, I can split this into separate gap-fill (SGT-036..SGT-050) and capability-targeted (SGT-051..SGT-060) scenario PRs.

Data policy

No raw or processed SmartGridBench CSV files are included. The added records use the same SG_DATA_DIR runtime data contract already documented in #287.

Validation

  • uv run pytest src/servers/smart_grid/tests/test_scenarios.py -q — 7 passed.
  • uv run pytest src/servers/smart_grid/ -q — 44 passed.
  • uv run ruff format --check src/servers/smart_grid/ — clean.
  • uv run ruff check src/servers/smart_grid/ — clean.
  • src/scenarios/local/smart_grid.json contains 61 unique records: AOB-FMSR-001 plus SGT-001..SGT-060.
  • Capability-rubric fields are present at the expected current-batch floors: 10 records with benchmark_design and 13 records with ground_truth.must_NOT_include. Tests assert these as minimum floors so future additions or harmless reorders do not require test rewrites.
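The ID-set and floor checks listed above can be sketched roughly as follows. This is a minimal illustration rather than the PR's actual test code: the helper names `expected_ids` and `check_corpus` are hypothetical, and the sketch assumes each record is a dict with an `id` key, an optional `benchmark_design` key, and an optional `ground_truth.must_NOT_include` list.

```python
def expected_ids() -> set[str]:
    """Full ID set named in the PR body: AOB-FMSR-001 plus SGT-001..SGT-060."""
    return {"AOB-FMSR-001"} | {f"SGT-{n:03d}" for n in range(1, 61)}


def check_corpus(
    records: list[dict], design_floor: int = 10, exclude_floor: int = 13
) -> None:
    """Assert the exact ID set and minimum rubric-field counts (hypothetical helper)."""
    ids = [r["id"] for r in records]
    assert len(ids) == len(set(ids)), "duplicate scenario IDs"
    assert set(ids) == expected_ids(), "ID set drifted"

    with_design = sum(1 for r in records if r.get("benchmark_design"))
    with_exclusions = sum(
        1 for r in records if r.get("ground_truth", {}).get("must_NOT_include")
    )
    # Floors, not exact counts: later additions should not force a test rewrite.
    assert with_design >= design_floor
    assert with_exclusions >= exclude_floor
```

Using `>=` for the rubric-field counts while keeping `==` for the ID set means a dropped or renamed scenario still fails loudly, but adding rubric fields to more records does not.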

Acknowledgments

Source-project authors (Columbia SmartGridBench, Spring 2026): Akshat Bhandari, Aaron Fan, Tanisha Rathod, Wei Alexander Xin.

References

…rio corpus

Adds the Smart Grid transformer-maintenance domain to AssetOpsBench as a
focused upstream cut from the SmartGridBench source project (Columbia
University, 2026). New surfaces:

- Smart Grid MCP servers under `src/servers/smart_grid/` for IoT, FMSR/DGA,
  TSFM/RUL, and work-order workflows. Nested under a domain-specific
  sub-namespace to coexist with the existing domain-general
  `src/servers/{iot,fmsr,tsfm,wo}` servers (different backends, asset
  types, and data assumptions; PR body documents the design rationale).
- A direct adapter exposing the Smart Grid tools as plain Python callables.
- 36 canonical Smart Grid scenarios + 5 negative-check fixtures in the AOB
  local scenario array convention; extended evaluator metadata documented
  in `docs/smart_grid_data_provenance.md`.
- `SG_DATA_DIR` runtime data-provenance contract and a no-CSV-port policy:
  no raw or processed source-project CSV datasets are shipped.
- Console-script entry points for the four Smart Grid MCP servers.
- Unit tests for the direct adapter, IEC 60599 DGA classification,
  JSON-safe divergent ratios, and scenario shape/uniqueness.

Validation: uv run pytest src/servers/smart_grid/ -- 25 passed.
Scenario JSON contains 36 unique canonical records and 5 unique
negative-check records.

Refs: HPML6998-S26-Team13/hpml-assetopsbench-smart-grid-mcp#46

Signed-off-by: Wei Alexander Xin <eggrollofchaos@gmail.com>
Normalize pandas Timestamp values from get_dga_record before returning them through MCP JSON-RPC. Without this, valid DGA lookups over parsed CSV fixtures fail strict JSON serialization even though the in-process Python call succeeds.

Adds a regression test that builds a temporary SG_DATA_DIR fixture, calls get_dga_record, and verifies json.dumps(..., allow_nan=False) succeeds.

This is a follow-up from PR IBM#287 self-review and should remain a separate review-iteration commit on the published branch.

Validation:
- SG_DATA_DIR=/Users/wax/coding/hpml-assetopsbench-smart-grid-mcp/data/processed uv run pytest src/servers/smart_grid/
- uv run ruff format --check src/servers/smart_grid/
- uv run ruff check src/servers/smart_grid/
- SG_DATA_DIR=/Users/wax/coding/hpml-assetopsbench-smart-grid-mcp/data/processed uv run python <19-tool JSON serialization smoke>

Signed-off-by: Wei Alexander Xin <eggrollofchaos@gmail.com>
Move the JSON-safe record normalizer from `fmsr/main.py` (where it was added in
`fix(smart_grid): serialize DGA sample dates`) up to `base.py` as the
public canonical helper `json_safe_record`. Replace the latent pre-fix
`_normalize_record` in `wo/main.py` (which only handled `pd.isna`, not
`pd.Timestamp`) with the canonical helper.

`wo._normalize_record` was correct in behavior at the time it ran because
`load_fault_records` does not currently pass `parse_dates`, so no
`pd.Timestamp` ever leaked through. Adding `parse_dates=["report_date"]` (or
similar) later would have silently broken JSON-RPC the same way the DGA path
broke before its fix. Centralizing the boundary normalizer prevents that
regression class.
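
A boundary normalizer of the kind described here might look like the sketch below. It is an assumption-laden illustration, not the code in `base.py`: only the name `json_safe_record` comes from this commit message, while the signature, the field names in the sample row, and the choice to map non-finite floats to `None` are hypothetical. It uses only the standard library and relies on `pd.Timestamp` being a subclass of `datetime.datetime`.

```python
import json
import math
from datetime import date, datetime


def json_safe_record(record: dict) -> dict:
    """Return a copy of `record` whose values survive strict JSON encoding."""
    safe = {}
    for key, value in record.items():
        if isinstance(value, (datetime, date)):
            # pd.Timestamp subclasses datetime, so it takes this branch too.
            safe[key] = value.isoformat()
        elif isinstance(value, float) and not math.isfinite(value):
            # NaN and +/-inf are rejected by json.dumps(..., allow_nan=False).
            safe[key] = None
        else:
            safe[key] = value
    return safe


# Hypothetical row; the column names are placeholders, not the real schema.
row = {"sample_date": datetime(2026, 5, 11), "h2_ppm": float("nan"), "id": "T-1"}
print(json.dumps(json_safe_record(row), allow_nan=False))
```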

Verification: `uv run pytest src/servers/smart_grid/` -- 42 passed.
`uv run ruff format --check src/servers/smart_grid/` clean.
`uv run ruff check src/servers/smart_grid/` clean.

Signed-off-by: Wei Alexander Xin <eggrollofchaos@gmail.com>
Add `tests/test_json_safety.py` that walks every `@mcp.tool()`-decorated
callable across `iot`, `fmsr`, `tsfm`, and `wo` and asserts
`json.dumps(result, allow_nan=False)` succeeds against a hermetic
`SG_DATA_DIR` fixture. Catches the boundary-contract bug class fixed in
`fmsr.get_dga_record` for any current or future Smart Grid tool, without
per-tool test boilerplate.

The fixture writes minimal CSVs for all six processed-data files, sets
`SG_DATA_DIR` to a `tmp_path`, and resets module-level dataframe caches
across all four servers so each test gets a clean read path.
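
The core of such a sweep can be sketched without any test framework. The helper below is hypothetical (`find_unsafe_tools` is not a name from this PR) and assumes the tools have already been collected into a mapping of name to zero-argument callable; the real test discovers `@mcp.tool()` callables and parametrizes over them.

```python
import json
from typing import Any, Callable


def find_unsafe_tools(tools: dict[str, Callable[[], Any]]) -> list[str]:
    """Return names of tools whose results fail strict JSON serialization."""
    unsafe = []
    for name, call in tools.items():
        try:
            json.dumps(call(), allow_nan=False)
        except (TypeError, ValueError):
            # TypeError: unserializable object; ValueError: NaN or infinity.
            unsafe.append(name)
    return unsafe
```

A test built on this would assert that the returned list is empty, so any newly added tool is covered without its own boilerplate.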

16 parametrized cases bring the suite to 42 passing tests (was 26).

Signed-off-by: Wei Alexander Xin <eggrollofchaos@gmail.com>
Expand the Smart Grid scenario corpus on top of IBM PR IBM#287 from 36 records to 61 records by adding SGT-036 through SGT-060 from the SmartGridBench source project. The added batch includes domain-coverage gap-fill scenarios plus capability-targeted discrimination checks with benchmark_design metadata and negative must_NOT_include rubric fields.

Update the Smart Grid provenance docs and README count so the corpus size and new evaluator-facing metadata are documented. Extend the scenario JSON tests to assert the full SGT-001..SGT-060 ID set and guard against silently dropping the capability-targeted rubric fields.

Validation:

- uv run pytest src/servers/smart_grid/tests/test_scenarios.py -q

- uv run pytest src/servers/smart_grid/ -q

- uv run ruff format --check src/servers/smart_grid/

- uv run ruff check src/servers/smart_grid/

Signed-off-by: Wei Alexander Xin <eggrollofchaos@gmail.com>
Author

@eggrollofchaos eggrollofchaos left a comment


Self-review

Upstream-context pass after opening. Stacked on #287; new commit specific to #292 is 5b7d8b0.

PR context checked: head 5b7d8b0fed1c1ee744a4e6753282e80c826cd779; top-level comments 0; review records 0; inline comments 0; review threads 0; DCO SUCCESS at 2026-05-11T03:38:24Z; all 5 commits signed off; mergeStateStatus: BLOCKED, reviewDecision: REVIEW_REQUIRED (IBM-maintainer gates, expected).

Critical: none.

High: none.

Medium: none.

Low: none.

Nit

  • N1 — PR size +754 lines exceeds IBM's <300 lines preference without acknowledgment in PR body. #287 had a ## Size and split offer section; #292 doesn't. Adding a short paragraph naming the natural split boundary (gap-fill SGT-036..SGT-050 vs capability-targeted SGT-051..SGT-060) defuses the size question pre-emptively. No code change.
  • N2 — src/servers/smart_grid/tests/test_scenarios.py:108-112: per-record assertions are truthy-only (assert design.get("target_capability"), raw["id"]). A future regression that stuffs target_capability: " " (whitespace) would still pass. Optional: wrap in isinstance(...) and value.strip() for a safer guard.
  • N3 — PR body validation section says "10 records with benchmark_design and 13 records with must_NOT_include" as exact numbers; tests assert >= 10 and >= 13 (forward-compatible floor). Intentional but a reader checking the body against the test might briefly stall. Optional clarification in body wording ("10 records (floor)" or similar).

Verified non-findings

  • README 36 → 61 records update accurate; new phrasing fairly describes corpus shape.
  • Provenance doc additions correctly distinguish SGT-036..SGT-050 (gap-fill) from SGT-051..SGT-060 (capability-targeted); two new evaluator-metadata table rows describe optional fields without overclaiming.
  • Acknowledgments paragraph credits source-project authors alphabetically (Akshat Bhandari, Aaron Fan, Tanisha Rathod, Wei Alexander Xin). Consistent with #287.
  • All expected_tools namespace-prefixed correctly across the 25 new scenarios.
  • No raw or processed CSV/data files included — SG_DATA_DIR runtime contract preserved.
  • Stacking is clean: 5b7d8b0 is the only PR292-specific commit; the four predecessors are #287 content visible because #287 hasn't landed yet.
  • DCO check green; all commits signed off with consistent identity.
  • Branch name feature/smart-grid-additional-scenarios matches <type>/<description> convention.

Verification

  • uv run pytest src/servers/smart_grid/tests/test_scenarios.py -q → 7 passed.
  • uv run pytest src/servers/smart_grid/ -q → 44 passed.
  • uv run ruff format --check src/servers/smart_grid/ → clean.
  • uv run ruff check src/servers/smart_grid/ → clean.
  • JSON corpus: 61 records, unique IDs, exact set {AOB-FMSR-001} ∪ {SGT-001..SGT-060}.
  • PR292 batch (SGT-036..SGT-060): 25 records, all schema-valid, benchmark_design=10/25, must_NOT_include=13/25.

Verdict

LGTM with 3 Nits. None blocks merge. IBM-maintainer review is the remaining external gate.

Address the PR IBM#292 v1 review nit by treating whitespace-only evaluator rubric strings as invalid. The capability-targeted test now shares a small non-empty-string predicate for benchmark_design fields and must_NOT_include entries, so future corpus edits cannot satisfy the preservation guard with blank strings.

Validation:
- uv run ruff format src/servers/smart_grid/tests/test_scenarios.py
- uv run ruff check src/servers/smart_grid/tests/test_scenarios.py
- uv run pytest src/servers/smart_grid/tests/test_scenarios.py -q
- uv run pytest src/servers/smart_grid/ -q

Signed-off-by: Wei Alexander Xin <eggrollofchaos@gmail.com>
Author

@eggrollofchaos eggrollofchaos left a comment


Self-review follow-up

PR context checked at head 1cffe54ff734ed9bb300957f5d06059268276c09; top-level comments 0; review records 1 prior (v1 at 5b7d8b0); inline comments 0; review threads 0; DCO SUCCESS at 2026-05-11T21:41:48Z on v3 head; mergeStateStatus: BLOCKED, reviewDecision: REVIEW_REQUIRED (IBM-maintainer gates, expected).

Body originally posted at v2 head 5b7d8b0 after the PR-body edit closing N1+N3; edited in place after v3 head 1cffe54 landed with the N2 fixup commit, to keep one consolidated follow-up record rather than two.

v1 Nit closure

  • N1 — closed in v2 PR-body edit. New ## Size and split offer section honestly acknowledges size, names natural split boundary, offers to split post-#287.
  • N3 — closed in v2 PR-body edit. Validation bullet rewritten to "current-batch floors" + forward-compat rationale.
  • N2 — closed at v3 commit 1cffe54 (test(smart_grid): tighten rubric string assertions). Added _non_empty_string(value) predicate (isinstance(value, str) and bool(value.strip())); replaced 3 truthy-only assertions uniformly for target_capability, discrimination_hypothesis, and must_NOT_include per-item checks. Surgical +7/-3-line diff. Helper extraction appropriate (used 3x). Whitespace-string regression now blocked.

Probed and ruled out

  • _non_empty_string edge cases: "" → False; " " → False; "\t\n" → False (handles whitespace beyond regular spaces); "a" → True; non-string values (None, int, dict) → False (defensive). Minimal but correctly typed.
  • All 3 call sites use the predicate uniformly; no inconsistency.
  • must_NOT_include outer assertion (isinstance(excluded, list) and excluded) retained; inner per-item check is now _non_empty_string(item). Two-layer correctness.
  • DCO re-ran on new commit (not carried): SUCCESS at v3 head.
  • PR body unchanged since v2 edit; ## Size and split offer + clarified ## Validation still present.
  • Stacked-on-#287 scope discipline preserved. New commit 1cffe54 is PR292-specific test-only change; no scope drift.
  • No new top-level comments, review threads, or inline comments since v1.

Verification at v3 head

  • uv run pytest src/servers/smart_grid/tests/test_scenarios.py -q → 7 passed.
  • uv run ruff format --check src/servers/smart_grid/tests/test_scenarios.py → already formatted.
  • uv run ruff check src/servers/smart_grid/tests/test_scenarios.py → all checks passed.
  • Commit chain verified: a5b35a9 → 3fb6943 → e8b3ab0 → c5067b9 → 5b7d8b0 → 1cffe54.

Summary counts

  • Critical: 0
  • High: 0
  • Medium: 0
  • Low: 0
  • Nit: 0 (N1+N3 closed in v2 body edit; N2 closed in v3 commit 1cffe54)

Verdict

LGTM — final-confirmation clean at v3 head 1cffe54. All three v1 Nits closed via the right vehicles: N1+N3 in PR-body edit (no commit cost), N2 in surgical fixup commit. Remaining gate is purely IBM-maintainer external review.

@DhavalRepo18 DhavalRepo18 self-requested a review May 11, 2026 22:21