feat(smart_grid): add 25 additional transformer scenarios by eggrollofchaos · Pull Request #292 · IBM/AssetOpsBench

eggrollofchaos · 2026-05-11T03:38:21Z

Summary

Adds the remaining SmartGridBench transformer-maintenance scenarios on top of #287, expanding the Smart Grid local corpus from 36 records to 61 records.

This PR adds:

SGT-036..SGT-050: domain-coverage gap-fill scenarios across FMSR, IoT, TSFM, work-order, and multi-tool workflows.
SGT-051..SGT-060: capability-targeted discrimination checks covering calibration/abstention, prompt-premise contradiction, cross-tool reconciliation, strict output formatting, and truncated-tool-result discipline.
Raw JSON tests that assert the full AOB-FMSR-001 + SGT-001..SGT-060 ID set and guard the new benchmark_design / ground_truth.must_NOT_include rubric fields.
README and provenance-doc updates so the corpus count and evaluator-facing metadata are documented.

Relationship to #287

This is a follow-on to #287 and is intentionally based on the #287 branch head. Until #287 lands, GitHub will show the domain/server port plus this additional-scenario commit together in this PR's diff. The new commit here is only:

5b7d8b0 feat(smart_grid): add 25 additional transformer scenarios

If maintainers prefer, #287 can be reviewed first and this PR can then be rebased so the visible diff contains only the 25-scenario expansion.

Size and split offer

This follow-on is larger than IBM's preferred small-PR guideline because it updates one canonical scenario array, its count/metadata tests, and the matching provenance documentation as one invariant-preserving unit. The new PR-specific code/data delta is the single 25-scenario expansion commit on top of #287. If maintainers prefer a smaller review path after #287 lands, I can split this into separate gap-fill (SGT-036..SGT-050) and capability-targeted (SGT-051..SGT-060) scenario PRs.

Data policy

No raw or processed SmartGridBench CSV files are included. The added records use the same SG_DATA_DIR runtime data contract already documented in #287.

Validation

uv run pytest src/servers/smart_grid/tests/test_scenarios.py -q — 7 passed.
uv run pytest src/servers/smart_grid/ -q — 44 passed.
uv run ruff format --check src/servers/smart_grid/ — clean.
uv run ruff check src/servers/smart_grid/ — clean.
src/scenarios/local/smart_grid.json contains 61 unique records: AOB-FMSR-001 plus SGT-001..SGT-060.
Capability-rubric fields are present at the expected current-batch floors: 10 records with benchmark_design and 13 records with ground_truth.must_NOT_include. Tests assert these as minimum floors so future additions or harmless reorders do not require test rewrites.
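As a rough illustration of the floor-style assertions described above, the sketch below counts records that carry each rubric field and compares the counts against minimums rather than exact values. The three-record corpus, the field values, and the helper names are hypothetical; the real records live in src/scenarios/local/smart_grid.json and the PR's actual floors are 10 and 13.

```python
# Hypothetical miniature corpus; field values are invented for illustration.
records = [
    {
        "id": "SGT-051",
        "benchmark_design": {"target_capability": "abstention"},
        "ground_truth": {"must_NOT_include": ["fabricated reading"]},
    },
    {"id": "SGT-052", "ground_truth": {"must_NOT_include": ["invented fault code"]}},
    {"id": "SGT-036", "ground_truth": {}},
]


def count_with_design(records):
    """Count records that carry a benchmark_design block."""
    return sum(1 for r in records if "benchmark_design" in r)


def count_with_exclusions(records):
    """Count records with a non-empty ground_truth.must_NOT_include list."""
    return sum(
        1 for r in records if r.get("ground_truth", {}).get("must_NOT_include")
    )


# Floors for this toy corpus only (assumed values, not the PR's 10 and 13).
MIN_DESIGN = 1
MIN_EXCLUSIONS = 2

# ">=" rather than "==" lets later additions raise the counts without a test edit.
assert count_with_design(records) >= MIN_DESIGN
assert count_with_exclusions(records) >= MIN_EXCLUSIONS
```

With a floor, adding a fourth record that also has benchmark_design keeps the check green, while deleting the field from SGT-051 would fail it.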

Acknowledgments

Source-project authors (Columbia SmartGridBench, Spring 2026): Akshat Bhandari, Aaron Fan, Tanisha Rathod, Wei Alexander Xin.

References

Depends on: feat(smart_grid): add Smart Grid transformer MCP servers and 36-scenario corpus #287
Source project: https://github.com/HPML6998-S26-Team13/hpml-assetopsbench-smart-grid-mcp

…rio corpus

Adds the Smart Grid transformer-maintenance domain to AssetOpsBench as a focused upstream cut from the SmartGridBench source project (Columbia University, 2026).

New surfaces:

- Smart Grid MCP servers under `src/servers/smart_grid/` for IoT, FMSR/DGA, TSFM/RUL, and work-order workflows. Nested under a domain-specific sub-namespace to coexist with the existing domain-general `src/servers/{iot,fmsr,tsfm,wo}` servers (different backends, asset types, and data assumptions; PR body documents the design rationale).
- A direct adapter exposing the Smart Grid tools as plain Python callables.
- 36 canonical Smart Grid scenarios + 5 negative-check fixtures in the AOB local scenario array convention; extended evaluator metadata documented in `docs/smart_grid_data_provenance.md`.
- `SG_DATA_DIR` runtime data-provenance contract and a no-CSV-port policy: no raw or processed source-project CSV datasets are shipped.
- Console-script entry points for the four Smart Grid MCP servers.
- Unit tests for the direct adapter, IEC 60599 DGA classification, JSON-safe divergent ratios, and scenario shape/uniqueness.

Validation: uv run pytest src/servers/smart_grid/ -- 25 passed. Scenario JSON contains 36 unique canonical records and 5 unique negative-check records.

Refs: HPML6998-S26-Team13/hpml-assetopsbench-smart-grid-mcp#46

Signed-off-by: Wei Alexander Xin <eggrollofchaos@gmail.com>

Normalize pandas Timestamp values from get_dga_record before returning them through MCP JSON-RPC. Without this, valid DGA lookups over parsed CSV fixtures fail strict JSON serialization even though the in-process Python call succeeds.

Adds a regression test that builds a temporary SG_DATA_DIR fixture, calls get_dga_record, and verifies json.dumps(..., allow_nan=False) succeeds.

This is a follow-up from PR IBM#287 self-review and should remain a separate review-iteration commit on the published branch.

Validation:

- SG_DATA_DIR=/Users/wax/coding/hpml-assetopsbench-smart-grid-mcp/data/processed uv run pytest src/servers/smart_grid/
- uv run ruff format --check src/servers/smart_grid/
- uv run ruff check src/servers/smart_grid/
- SG_DATA_DIR=/Users/wax/coding/hpml-assetopsbench-smart-grid-mcp/data/processed uv run python <19-tool JSON serialization smoke>

Signed-off-by: Wei Alexander Xin <eggrollofchaos@gmail.com>

Move the JSON-safe record normalizer from `fmsr/main.py` (where it was added in `fix(smart_grid): serialize DGA sample dates`) up to `base.py` as the public canonical helper `json_safe_record`. Replace the latent pre-fix `_normalize_record` in `wo/main.py` (which only handled `pd.isna`, not `pd.Timestamp`) with the canonical helper.

`wo._normalize_record` was correct in behavior at the time it ran because `load_fault_records` does not currently pass `parse_dates`, so no `pd.Timestamp` ever leaked through. Adding `parse_dates=["report_date"]` (or similar) later would have silently broken JSON-RPC the same way the DGA path broke before its fix. Centralizing the boundary normalizer prevents that regression class.

Verification: `uv run pytest src/servers/smart_grid/` -- 42 passed. `uv run ruff format --check src/servers/smart_grid/` clean. `uv run ruff check src/servers/smart_grid/` clean.

Signed-off-by: Wei Alexander Xin <eggrollofchaos@gmail.com>
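A minimal sketch of what a boundary normalizer of this kind might look like, using only the standard library. This is not the `json_safe_record` from `base.py`, whose body is not shown in this PR; the function below merely borrows the name for illustration, and the record keys are hypothetical. `pd.Timestamp` subclasses `datetime.datetime`, so the datetime branch would also cover it; other pandas or NumPy types are deliberately out of scope here.

```python
import json
import math
from datetime import date, datetime


def json_safe_record(record):
    """Illustrative stand-in: return a copy of `record` that strict JSON accepts."""
    safe = {}
    for key, value in record.items():
        if isinstance(value, (datetime, date)):
            # Datetime-like values (including datetime subclasses) become ISO strings.
            safe[key] = value.isoformat()
        elif isinstance(value, float) and math.isnan(value):
            # NaN is not valid JSON; map it to null.
            safe[key] = None
        else:
            safe[key] = value
    return safe


# Hypothetical DGA-style record; the keys are assumptions, not the real schema.
raw = {
    "transformer_id": "T-001",
    "sample_date": datetime(2026, 1, 15),
    "h2_ppm": float("nan"),
}

payload = json.dumps(json_safe_record(raw), allow_nan=False)
print(payload)
```

Calling `json.dumps(raw, allow_nan=False)` directly on the unnormalized record raises, which is the failure mode the commit describes at the JSON-RPC boundary.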

Add `tests/test_json_safety.py` that walks every `@mcp.tool()`-decorated callable across `iot`, `fmsr`, `tsfm`, and `wo` and asserts `json.dumps(result, allow_nan=False)` succeeds against a hermetic `SG_DATA_DIR` fixture. Catches the boundary-contract bug class fixed in `fmsr.get_dga_record` for any current or future Smart Grid tool, without per-tool test boilerplate.

The fixture writes minimal CSVs for all six processed-data files, sets `SG_DATA_DIR` to a `tmp_path`, and resets module-level dataframe caches across all four servers so each test gets a clean read path. The 16 parametrized cases bring the total to 42 passing tests (was 26).

Signed-off-by: Wei Alexander Xin <eggrollofchaos@gmail.com>
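The idea of one generic strict-JSON check over every tool can be sketched as below. The two tool functions and the `TOOLS` registry are hypothetical stand-ins for `@mcp.tool()`-decorated callables; the actual test is parametrized with pytest and discovers the real tools, which this sketch does not attempt to reproduce.

```python
import json


# Hypothetical tools; real Smart Grid tools read from SG_DATA_DIR instead.
def get_asset():
    return {"id": "T-001", "load_pct": 72.5}


def get_bad_ratio():
    # A divergent ratio left as infinity is rejected by strict JSON.
    return {"id": "T-001", "ratio": float("inf")}


TOOLS = {"get_asset": get_asset, "get_bad_ratio": get_bad_ratio}


def strict_json_failures(tools):
    """Call each tool and record which results fail strict JSON serialization."""
    failures = {}
    for name, tool in tools.items():
        try:
            json.dumps(tool(), allow_nan=False)
        except (TypeError, ValueError) as exc:
            failures[name] = type(exc).__name__
    return failures


failures = strict_json_failures(TOOLS)
print(failures)
```

Because the loop is driven by the registry, a newly added tool is covered without writing a dedicated test for it, which is the property the commit message emphasizes.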

Expand the Smart Grid scenario corpus on top of IBM PR IBM#287 from 36 records to 61 records by adding SGT-036 through SGT-060 from the SmartGridBench source project. The added batch includes domain-coverage gap-fill scenarios plus capability-targeted discrimination checks with benchmark_design metadata and negative must_NOT_include rubric fields.

Update the Smart Grid provenance docs and README count so the corpus size and new evaluator-facing metadata are documented. Extend the scenario JSON tests to assert the full SGT-001..SGT-060 ID set and guard against silently dropping the capability-targeted rubric fields.

Validation:

- uv run pytest src/servers/smart_grid/tests/test_scenarios.py -q
- uv run pytest src/servers/smart_grid/ -q
- uv run ruff format --check src/servers/smart_grid/
- uv run ruff check src/servers/smart_grid/

Signed-off-by: Wei Alexander Xin <eggrollofchaos@gmail.com>

eggrollofchaos

Self-review

Upstream-context pass after opening. Stacked on #287; new commit specific to #292 is 5b7d8b0.

PR context checked: head 5b7d8b0fed1c1ee744a4e6753282e80c826cd779; top-level comments 0; review records 0; inline comments 0; review threads 0; DCO SUCCESS at 2026-05-11T03:38:24Z; all 5 commits signed off; mergeStateStatus: BLOCKED, reviewDecision: REVIEW_REQUIRED (IBM-maintainer gates, expected).

Critical: none.

High: none.

Medium: none.

Low: none.

Nit

N1 — PR size +754 lines exceeds IBM's <300 lines preference without acknowledgment in PR body. #287 had a ## Size and split offer section; #292 doesn't. Adding a short paragraph naming the natural split boundary (gap-fill SGT-036..SGT-050 vs capability-targeted SGT-051..SGT-060) defuses the size question pre-emptively. No code change.
N2 — src/servers/smart_grid/tests/test_scenarios.py:108-112: per-record assertions are truthy-only (assert design.get("target_capability"), raw["id"]). A future regression that stuffs target_capability: " " (whitespace) would still pass. Optional: wrap in isinstance(...) and value.strip() for safer guard.
N3 — PR body validation section says "10 records with benchmark_design and 13 records with must_NOT_include" as exact numbers; tests assert >= 10 and >= 13 (forward-compatible floor). Intentional but a reader checking the body against the test might briefly stall. Optional clarification in body wording ("10 records (floor)" or similar).

Verified non-findings

README 36 → 61 records update accurate; new phrasing fairly describes corpus shape.
Provenance doc additions correctly distinguish SGT-036..SGT-050 (gap-fill) from SGT-051..SGT-060 (capability-targeted); two new evaluator-metadata table rows describe optional fields without overclaiming.
Acknowledgments paragraph credits source-project authors alphabetically (Akshat Bhandari, Aaron Fan, Tanisha Rathod, Wei Alexander Xin). Consistent with #287.
All expected_tools namespace-prefixed correctly across the 25 new scenarios.
No raw or processed CSV/data files included — SG_DATA_DIR runtime contract preserved.
Stacking is clean: 5b7d8b0 is the only PR292-specific commit; the four predecessors are #287 content visible because #287 hasn't landed yet.
DCO check green; all commits signed off with consistent identity.
Branch name feature/smart-grid-additional-scenarios matches <type>/<description> convention.

Verification

uv run pytest src/servers/smart_grid/tests/test_scenarios.py -q → 7 passed.
uv run pytest src/servers/smart_grid/ -q → 44 passed.
uv run ruff format --check src/servers/smart_grid/ → clean.
uv run ruff check src/servers/smart_grid/ → clean.
JSON corpus: 61 records, unique IDs, exact set {AOB-FMSR-001} ∪ {SGT-001..SGT-060}.
PR292 batch (SGT-036..SGT-060): 25 records, all schema-valid, benchmark_design=10/25, must_NOT_include=13/25.
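The exact-set check on IDs noted above can be sketched as follows. The zero-padded `SGT-NNN` pattern and the `AOB-FMSR-001` record come from the PR text; the `check_ids` helper and the synthetic records are assumptions for illustration, not the contents of test_scenarios.py.

```python
# Expected ID set: AOB-FMSR-001 plus SGT-001 through SGT-060 (61 IDs in total).
EXPECTED_IDS = {"AOB-FMSR-001"} | {f"SGT-{n:03d}" for n in range(1, 61)}


def check_ids(records):
    """Fail on duplicate IDs or on any ID missing from / extra to the expected set."""
    ids = [r["id"] for r in records]
    assert len(ids) == len(set(ids)), "duplicate scenario IDs"
    assert set(ids) == EXPECTED_IDS, "unexpected or missing scenario IDs"


# Synthetic records standing in for the parsed smart_grid.json array.
records = [{"id": scenario_id} for scenario_id in sorted(EXPECTED_IDS)]
check_ids(records)
```

Comparing sets rather than ordered lists is what lets harmless reorders of the scenario array pass without a test rewrite.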

Verdict

LGTM with 3 Nits. None blocks merge. IBM-maintainer review is the remaining external gate.

Address the PR IBM#292 v1 review nit by treating whitespace-only evaluator rubric strings as invalid. The capability-targeted test now shares a small non-empty-string predicate for benchmark_design fields and must_NOT_include entries, so future corpus edits cannot satisfy the preservation guard with blank strings.

Validation:

- uv run ruff format src/servers/smart_grid/tests/test_scenarios.py
- uv run ruff check src/servers/smart_grid/tests/test_scenarios.py
- uv run pytest src/servers/smart_grid/tests/test_scenarios.py -q
- uv run pytest src/servers/smart_grid/ -q

Signed-off-by: Wei Alexander Xin <eggrollofchaos@gmail.com>

eggrollofchaos

Self-review follow-up

PR context checked at head 1cffe54ff734ed9bb300957f5d06059268276c09; top-level comments 0; review records 1 prior (v1 at 5b7d8b0); inline comments 0; review threads 0; DCO SUCCESS at 2026-05-11T21:41:48Z on v3 head; mergeStateStatus: BLOCKED, reviewDecision: REVIEW_REQUIRED (IBM-maintainer gates, expected).

Body originally posted at v2 head 5b7d8b0 after the PR-body edit closing N1+N3; edited in place after v3 head 1cffe54 landed with the N2 fixup commit, to keep one consolidated follow-up record rather than two.

v1 Nit closure

N1 — closed in v2 PR-body edit. New ## Size and split offer section honestly acknowledges size, names natural split boundary, offers to split post-#287.
N3 — closed in v2 PR-body edit. Validation bullet rewritten to "current-batch floors" + forward-compat rationale.
N2 — closed at v3 commit 1cffe54 (test(smart_grid): tighten rubric string assertions). Added _non_empty_string(value) predicate (isinstance(value, str) and bool(value.strip())); replaced 3 truthy-only assertions uniformly for target_capability, discrimination_hypothesis, and must_NOT_include per-item checks. Surgical 7+/-3-line diff. Helper extraction appropriate (used 3x). Whitespace-string regression now blocked.

Probed and ruled out

_non_empty_string edge cases: "" → False; " " → False; "\t\n" → False (handles whitespace beyond regular spaces); "a" → True; non-string values (None, int, dict) → False (defensive). Minimal but correctly typed.
All 3 call sites use the predicate uniformly; no inconsistency.
must_NOT_include outer assertion (isinstance(excluded, list) and excluded) retained; inner per-item check is now _non_empty_string(item). Two-layer correctness.
DCO re-ran on new commit (not carried): SUCCESS at v3 head.
PR body unchanged since v2 edit; ## Size and split offer + clarified ## Validation still present.
Stacked-on-#287 scope discipline preserved. New commit 1cffe54 is PR292-specific test-only change; no scope drift.
No new top-level comments, review threads, or inline comments since v1.

Verification at v3 head

uv run pytest src/servers/smart_grid/tests/test_scenarios.py -q → 7 passed.
uv run ruff format --check src/servers/smart_grid/tests/test_scenarios.py → already formatted.
uv run ruff check src/servers/smart_grid/tests/test_scenarios.py → all checks passed.
Commit chain verified: a5b35a9 → 3fb6943 → e8b3ab0 → c5067b9 → 5b7d8b0 → 1cffe54.

Summary counts

Critical: 0
High: 0
Medium: 0
Low: 0
Nit: 0 (N1+N3 closed in v2 body edit; N2 closed in v3 commit 1cffe54)

Verdict

LGTM — final-confirmation clean at v3 head 1cffe54. All three v1 Nits closed via the right vehicles: N1+N3 in PR-body edit (no commit cost), N2 in surgical fixup commit. Remaining gate is purely IBM-maintainer external review.

eggrollofchaos added 5 commits May 9, 2026 21:55

DhavalRepo18 added the External contribution label May 11, 2026

jasdian mentioned this pull request May 11, 2026

Proposal: A judge-calibration subset for AssetOpsBench #296

Closed

eggrollofchaos commented May 11, 2026

DhavalRepo18 self-requested a review May 11, 2026 22:21

eggrollofchaos wants to merge 6 commits into IBM:main from HPML6998-S26-Team13:feature/smart-grid-additional-scenarios

2 participants
