HPML Final Project: Extending AssetOpsBench — Smart Grid Transformer Integration and Automated Scenario Generation
Course: High Performance Machine Learning | Semester: Spring 2026 | Instructor: Dr. Kaoutar El Maghraoui
- Team Number: 8
- Members:
- Rohith Kanathur (rk3443) — Transformer Asset Class Integration, Profiling, Optimization, Wandb Logging, Evaluation, Documentation and Report Writing.
- Sagar Chethan Kumar (sc5758) — Scenario generation pipeline implementation (research & digest agent, validate & repair agent, and domain generation)
- Ananya Kapoor (ak5447) — Scenario schema validation, eval unit tests, category alignment fix, --static-only mode, literature survey and report writing.
- Dev Bahl (db3791) — Eval tests and report rendering, baseline vs optimized comparison, CI pipeline, experiment reproduction.
- GitHub repository: https://github.com/Rohith-Kanathur/AssetOpsBench
- Final report: deliverables/Team8_HPML_Final_Report.pdf
- Final presentation: deliverables/Team8_HPML_Final_Presentation.pptx
- Experiment-tracking dashboard: https://wandb.ai/rk3443-columbia-university/assetopsbench
LLM-based AI agents are redefining Industry 4.0 asset operations, integrating perception, reasoning, and action across complex industrial systems. Yet evaluating these agents at scale requires large, high-quality benchmark scenarios that are expensive to produce manually. AssetOpsBench, the first unified framework for industrial asset agent evaluation, initially ships with only 141 scenarios, handcrafted by subject-matter experts (SMEs) and covering a narrow set of HVAC assets (chillers and AHUs), leaving critical classes such as high-voltage power transformers excluded. Beyond asset-class coverage, scenario creation itself does not scale: every new asset type demands physically plausible, causally consistent, tool-reachable, and standards-compliant scenarios authored by SMEs. This project addresses both gaps by extending AssetOpsBench with a new Smart Grid Transformer asset class and introducing a ScenarioGeneratorAgent that automatically generates, repairs, and validates scenarios through asset profiling, budget allocation, and constrained LLM generation stages. The optimizations target the inference stage by reducing latency and token generation. The primary performance bottlenecks in the unoptimized baseline are sequential and redundant LLM calls, time-consuming domain-literature retrieval, and blocking I/O, which together dominate end-to-end wall time. The optimizations applied to address this are: a two-level cache (in-memory LRU + disk JSON), a thread pool that offloads blocking I/O operations to run concurrently rather than sequentially, and parallelized scenario generation across focus groups. Together these reduce end-to-end pipeline time by up to 8× for 50 scenarios with no measurable degradation in scenario quality (mean quality score: 74.2 ± 1.9 for optimized vs. 73.8 ± 3.0 for the baseline).
- Model:
meta-llama/llama-3-3-70b-instruct, served via IBM WatsonX, used as the LLM backbone across all five pipeline stages: asset profiling, budget allocation, per-domain scenario generation, validation, and repair. A lightweight supervised regression model (trained on the Mendeley transformer health dataset) is used for the predict_health_index tool in the Smart Grid Transformer asset class.
- Framework: LiteLLM for unified LLM API access across providers; Model Context Protocol (MCP) for structured tool interfaces exposing the five industrial agent servers (IoT, FMSR, TSFM, WO, Vibration); PyTorch Profiler for per-phase instrumentation; Weights & Biases for experiment tracking.
- Dataset: Three data sources: (1) existing AssetOpsBench chiller/AHU scenarios as few-shot style examples, (2) academic literature retrieved at runtime from ArXiv and SemanticScholar for asset profile grounding, and (3) the Mendeley Transformer Health Dataset for training the health index predictor.
- Custom components: ScenarioGeneratorAgent pipeline with a validate-and-repair loop; four new MCP tools for Smart Grid Transformer diagnostics grounded in IEC 60599 and IEC 60076-7 standards; a two-level (L1 in-memory LRU + L2 disk) asset profile cache; an AsyncBatchSemaphore for rate-limited parallel focus-group execution; and a thread pool offloader for blocking LLM calls.
- Hardware: NVIDIA A100 and H100 GPUs (IBM WatsonX inference cluster)
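The two-level asset-profile cache can be sketched as below. This is a minimal illustration of the L1 in-memory LRU + L2 disk-JSON design described above; the class name `TwoLevelCache`, the key layout, and the one-JSON-file-per-key scheme are assumptions for illustration, not the exact implementation in `src/`.

```python
import json
import os
import tempfile
from collections import OrderedDict

class TwoLevelCache:
    """L1: in-memory LRU (OrderedDict); L2: one JSON file per key on disk."""

    def __init__(self, cache_dir, l1_size=128):
        self.cache_dir = cache_dir
        self.l1_size = l1_size
        self.l1 = OrderedDict()
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.cache_dir, f"{key}.json")

    def get(self, key):
        if key in self.l1:                       # L1 hit
            self.l1.move_to_end(key)
            return self.l1[key]
        path = self._path(key)
        if os.path.exists(path):                 # L2 hit: promote into L1
            with open(path) as f:
                value = json.load(f)
            self.put(key, value, write_disk=False)
            return value
        return None                              # cold miss: caller rebuilds profile

    def put(self, key, value, write_disk=True):
        self.l1[key] = value
        self.l1.move_to_end(key)
        if len(self.l1) > self.l1_size:
            self.l1.popitem(last=False)          # evict least recently used
        if write_disk:
            with open(self._path(key), "w") as f:
                json.dump(value, f)

cache_dir = tempfile.mkdtemp()
cache = TwoLevelCache(cache_dir, l1_size=2)
cache.put("transformer_profile", {"asset": "Transformer", "standards": ["IEC 60599"]})

# A fresh instance models a warm re-run: its L1 is empty, but the disk copy hits.
warm = TwoLevelCache(cache_dir, l1_size=2)
profile = warm.get("transformer_profile")
```

On a warm run the disk hit makes the expensive profile-building phase unnecessary, which is why Phase 1 drops to 0 s in the optimized tables below.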
All experiments use the Smart Grid Transformer asset in CouchDB-grounded mode. Times are wall-clock seconds measured via timed_section spans and validated against PyTorch Profiler Chrome traces.
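The timed_section spans mentioned above can be approximated with a small context manager; this is a hedged sketch of the measurement pattern (a global `TIMINGS` dict and `time.perf_counter`), not the project's actual instrumentation code.

```python
import time
from contextlib import contextmanager

TIMINGS = {}  # phase name -> accumulated wall-clock seconds

@contextmanager
def timed_section(name):
    """Accumulate wall-clock time for a named pipeline phase."""
    start = time.perf_counter()
    try:
        yield
    finally:
        TIMINGS[name] = TIMINGS.get(name, 0.0) + (time.perf_counter() - start)

with timed_section("Build Asset Profile"):
    time.sleep(0.05)  # stands in for retrieval + LLM synthesis work
```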
Baseline (unoptimized)
| Phase | N=10 | N=25 | N=50 |
|---|---|---|---|
| Get Server Descriptions | 2.40 s | 2.58 s | 2.20 s |
| Build Asset Profile | 448.28 s | 338.90 s | 325.70 s |
| Allocate Scenario Budget | 2.05 s | 3.37 s | 2.34 s |
| Generate & Validate Single-Agent | 11.40 s | 34.79 s | 38.83 s |
| Generate & Validate Multi-Agent | 4.05 s | 20.67 s | 39.32 s |
| Full Pipeline | 468.22 s | 400.35 s | 408.43 s |
Optimized (warm cache)
| Phase | N=10 | N=25 | N=50 |
|---|---|---|---|
| Get Server Descriptions | 1.56 s | 2.42 s | 2.16 s |
| Build Asset Profile | 0.00 s | 0.00 s | 0.00 s |
| Allocate Scenario Budget | 0.00 s | 0.00 s | 0.00 s |
| Generate & Validate Single-Agent | 8.23 s | 14.86 s | 15.95 s |
| Generate & Validate Multi-Agent | 10.73 s | 17.28 s | 18.11 s |
| Full Pipeline | 13.63 s | 19.88 s | 50.86 s |
Speedup: 34× at N=10, 20× at N=25, 8× at N=50. The two-level cache eliminates Phase 1 (Build Asset Profile) entirely on warm runs; Phase 1 is the dominant cost in the baseline (80–96% of total time). The decreasing speedup ratio at larger N reflects the growing share of Phase 3 scenario-generation time.
50 Smart Grid Transformer scenarios generated by each pipeline (CouchDB-grounded, N=50), evaluated across 3 independent runs using the three-stage evaluation scheme (Static /20 + LLM Judge /30 + Dry-Run /50 = /100 composite). A score ≥ 70 is considered high quality.
| Metric | Baseline | Optimized |
|---|---|---|
| Static Score (/20) | 18.4 ± 0.6 | 19.1 ± 0.3 |
| LLM Judge Score (/30) | 22.2 ± 3.6 | 22.4 ± 1.8 |
| Dry-Run Score (/50) | 33.1 ± 1.3 | 32.7 ± 2.5 |
| Quality Score (/100) | 73.8 ± 3.0 | 74.2 ± 1.9 |
The 0.4-point gap is well within run-to-run variance, confirming that the speed optimizations come at no cost to scenario quality. Both pipelines exceed the 70-point high-quality threshold on average.
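A quick sanity check on the "within run-to-run variance" claim, using only the summary statistics reported in the table (means, standard deviations, n=3 runs each) and a Welch-style t statistic:

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Two-sample t statistic from summary stats (unequal variances)."""
    return (m1 - m2) / math.sqrt(s1**2 / n1 + s2**2 / n2)

# Composite quality scores from the table: optimized 74.2 ± 1.9, baseline 73.8 ± 3.0
t = welch_t(74.2, 1.9, 3, 73.8, 3.0, 3)
# |t| ≈ 0.2, far below any conventional significance threshold
```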
Hardware: NVIDIA A100 / H100 GPUs (IBM WatsonX inference cluster)
Headline result: Two-level caching + parallel focus-group execution + thread-pool offloading reduces end-to-end pipeline time from 408 s to 51 s (8× speedup) for 50 scenarios, with mean scenario quality score almost unchanged at 74.2 ± 1.9 for optimized vs. 73.8 ± 3.0 for the baseline.
.
├── README.md
├── LICENSE
├── pyproject.toml
├── deliverables/ # Final report (PDF) and final presentation (PPT/PDF) — same files uploaded to CourseWorks
│ ├── Team8_HPML_Final_Report.pdf
│ └── Team8_HPML_Final_Presentation.pptx
└── src/
├── agent/ # LLM agent runners, CLI, and plan-execute orchestration
├── couchdb/ # CouchDB setup, Docker Compose, and asset data initialisation
├── evaluation/ # Evaluation utilities
├── llm/ # LiteLLM wrapper and base LLM abstractions
├── observability/ # Tracing, run spans, and file exporters
├── scenarios/ # Base scenario models, grounding, retrieval, and prompts
├── scenarios_evaluation/ # Three-stage scenario quality evaluator (static + LLM judge + dry-run)
├── scenarios_optimization/ # Optimized scenario generation
├── scenarios_profiling/ # PyTorch-profiler-instrumented scenario generation pipeline
├── scenarios_testing/ # Scenario generation smoke tests
├── scenarios_wandb/ # W&B-integrated optimized scenario generation
    ├── servers/                # MCP tool servers (IoT, FMSR, TSFM, WO, Vibration, Utilities)
    └── outputs/                # Sample profiling and wandb outputs. (Detailed outputs present in the PPT and report)
# Clone
git clone https://github.com/Rohith-Kanathur/AssetOpsBench.git
cd AssetOpsBench

# Install dependencies (run from the repo root)
uv sync

# Activate the virtual environment
source .venv/bin/activate   # macOS / Linux

# Copy .env.public to .env
cp .env.public .env
# Then edit .env and set WATSONX_APIKEY, WATSONX_PROJECT_ID, WATSONX_URL

# Start CouchDB
docker compose -f src/couchdb/docker-compose.yaml up -d

# Verify CouchDB is running
curl -X GET http://localhost:5984/

Weights & Biases is used for run-level tracking. Logging covers the scenario generation and evaluation stages. Metrics captured: phase-wise timing, LLM call metrics, cache hit/miss metrics, valid/invalid scenario counts, and scenario evaluation scores.
🔗 Dashboard: https://wandb.ai/rk3443-columbia-university/assetopsbench — Platform: Weights & Biases
The dataset used for the Health Index Prediction FMSR tool is available here: https://data.mendeley.com/datasets/rz75w3fkxy/1
To reproduce the baseline:
uv run python -m scenarios_profiling.generator Transformer --num-scenarios 50 --data-in-couchdb --profile --profile-dir profiling_output/exp1_latency_baseline

To reproduce the optimized run:

uv run python -m scenarios_optimization.generator Transformer --num-scenarios 50 --data-in-couchdb --profile --profile-dir profiling_output/exp1_latency_optimized

To reproduce the run with wandb logging, run:
uv run python -m scenarios_wandb.generator Transformer \
--data-in-couchdb \
--num-scenarios 50 \
--wandb \
--wandb-project assetopsbench \
  --wandb-run-name transformer-50-openform

Then run the evaluation with wandb logging:

uv run python src/scenarios_evaluation/eval_scenarios_wandb.py --wandb --wandb-project assetopsbench --wandb-run-name transformer-eval-50

Ensure that the scenarios.json file is present in the same location as eval_scenarios_wandb.py.
To visualize the profiler traces referenced in the report:
- Run the pipelines as suggested in Section D (Training).
- Profiling traces will be available in profiling_output/
- Open the Chrome trace JSON files at https://ui.perfetto.dev/
To reproduce the headline result in Section 3: Final Results Summary:
Run the baseline once:
uv run python -m scenarios_profiling.generator Transformer --num-scenarios 50 --data-in-couchdb --profile --profile-dir profiling_output/exp1_latency_baseline

Run the optimized variant twice (to see the effect of caching):

uv run python -m scenarios_optimization.generator Transformer --num-scenarios 50 --data-in-couchdb --profile --profile-dir profiling_output/exp1_latency_optimized

Open the Chrome traces in profiling_output/ at https://ui.perfetto.dev/.
- Exp 1 - Scalability (Baseline vs. Optimized, N=10/25/50): The unoptimized baseline spends 80–96% of its total wall time on Phase 1 (Build Asset Profile: 326–448 s), dominated by sequential academic retrieval, PDF fetching, and LLM synthesis. The optimized pipeline with a warm two-level cache reduces this phase to 0 s, yielding end-to-end speedups of 34× (N=10), 20× (N=25), and 8× (N=50). The decreasing ratio at larger N reflects the growing share of scenario-generation phases, which scale with budget but remain well below baseline even at N=50.
- Exp 2 - Parallelism (C=1→5 parallel focus groups, N=50): Increasing C from fully serial (C=1, 87.4 s) to C=5 (47.3 s) delivers a 1.85× speedup. Phase 3 (Generate & Validate Single-Agent) shows the sharpest gains: 43.9 s → 17.6 s. Returns diminish beyond C=3, suggesting that LLM API rate limits and network latency become the bottleneck at higher concurrency, not local CPU or I/O.
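The bounded focus-group parallelism measured in Exp 2 can be mimicked with a plain `asyncio.Semaphore`; this is an illustrative sketch, not the project's AsyncBatchSemaphore, and `generate_focus_group` is a stand-in for the real LLM-backed generation calls.

```python
import asyncio

async def generate_focus_group(name):
    """Stand-in for per-focus-group scenario generation (LLM calls + validation)."""
    await asyncio.sleep(0.01)  # simulated network-bound work
    return f"{name}: done"

async def run_all(groups, concurrency=5):
    """Run all focus groups concurrently, but at most `concurrency` at a time."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(group):
        async with sem:
            return await generate_focus_group(group)

    return await asyncio.gather(*(bounded(g) for g in groups))

results = asyncio.run(run_all([f"group-{i}" for i in range(10)], concurrency=3))
```

The semaphore caps in-flight requests, which matches the observed diminishing returns beyond C=3: once the provider's rate limit binds, extra local concurrency buys nothing.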
- Exp 3 - Caching (cold vs. warm cache, N=50): A single cold run primes the disk cache; all subsequent warm runs skip Phase 1 entirely (247.4 s → 0 s), reducing total pipeline time from 302.5 s to 65.1 s, a 4.6× speedup with no changes to generation logic. The minor increase in Phase 4 (35.3 s → 45.6 s warm) is within normal LLM API latency variance.
- Exp 4 - Combined optimizations under cold cache (Baseline vs. Optimized, both cold): Isolating the non-caching gains: with thread-pool I/O offloading and parallel focus-group execution, the optimized variant still achieves a 1.52× end-to-end speedup (408.4 s → 269.0 s). Phase 1 improves by 33% (325.7 s → 216.9 s) from concurrent PDF/DB/retrieval requests; Phase 3 improves by 2.65× (38.8 s → 14.7 s) from parallel focus-group generation.
- Quality (Baseline vs. Optimized, N=50, n=3 runs): Speed gains come at zero quality cost. Composite quality scores are statistically indistinguishable: 74.2 ± 1.9 (optimized) vs. 73.8 ± 3.0 (baseline). Both exceed the 70-point high-quality threshold.
Detailed results are available under deliverables/.
- All source code lives under src/. See Section 4 for a directory-level description of each package.
- Profiler Chrome traces from the reported experiments are created under profiling_output/. Load any chrome_trace.json at https://ui.perfetto.dev for a visual flame chart.
- All secrets (API keys, WatsonX credentials, W&B tokens) are loaded from environment variables. Copy .env.public to .env and fill in WATSONX_APIKEY, WATSONX_PROJECT_ID, and WATSONX_URL before running any pipeline command.
- The health index regression model for the Smart Grid Transformer FMSR tool is trained on the Mendeley Transformer Health Dataset (open access).
- LLM inference runs on remote IBM WatsonX infrastructure. GPU utilisation metrics captured by W&B therefore reflect only local host activity and do not measure true model-serving compute cost.
- The W&B project for this work is public: https://wandb.ai/rk3443-columbia-university/assetopsbench.
Did your team use any AI tool in completing this project?
- No, we did not use any AI tool.
- Yes, we used AI assistance as described below.
Tool(s) used: e.g., Claude, Cursor
Specific purpose: AI assistance was used for targeted debugging and refinement tasks only. All the core design decisions, experimental methodology, and reported results were produced by the team only.
Sections affected:
- W&B Logging: W&B metrics were silently dropped when wandb.log() was called from non-main asyncio threads during parallel focus-group generation. Used AI help to diagnose the root cause and restructure logging into a deferred batch that is flushed from the main thread after asyncio.run() completes.
- Thread Pool Offloading Optimization: Thread pool offloading introduced a subtle issue where run_in_executor was being called with a None loop reference, causing RuntimeError: cannot schedule new futures after shutdown. Used AI assistance to identify the lifecycle mismatch and add a guard that re-uses the running loop's default executor rather than creating a new one per call.
- src/scenarios_profiling/generator/prompts/ and eval_scenarios.py: Used AI assistance to improve the LLM judge prompt and the scenario-generation prompts for clarity and to reduce ambiguous instructions that were causing incorrect outputs.
- Report Writing: Light prose refinement in the abstract, introduction, and discussion sections to improve grammar, clarity, and conciseness. All core ideas, experimental methodology, analysis, and conclusions were executed and written entirely by the team members. AI assistance was limited to sentence-level rewording only and did not influence any reported results.
How we verified correctness: All W&B logging fixes were verified by running the generation pipeline multiple times and confirming every expected metric (phase timings, cache hit rates, batch results, LLM call records) appeared on the W&B dashboard. The thread pool fix was verified by running the optimized pipeline many consecutive times without a RuntimeError. Prompt improvements were validated by re-running the full three-stage evaluation suite and confirming quality scores remained stable. All reported experiment numbers were produced by running the pipelines ourselves end-to-end. No AI-generated numbers appear anywhere in the results. Report prose changes were reviewed line-by-line by team members to ensure no false information was reported.
By submitting this project, the team confirms that the analysis, interpretations, and conclusions are our own, and that any AI assistance is fully disclosed above.
Open source Pull Request Submitted to AssetOpsBench: IBM#288
The scenarios generated from our pipeline are available on Huggingface: https://huggingface.co/datasets/rohithkanathur/assetopsbench-transformer-scenarios-dataset
Released under the MIT License. See LICENSE.
This project is built on AssetOpsBench, under the guidance of Dr. Dhaval Patel.
@misc{patel2026assetopsbenchbenchmarkingaiagents,
title={AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance},
author={Dhaval Patel and Shuxin Lin and James Rayfield and Nianjun Zhou and Chathurangi Shyalika and Suryanarayana R Yarrabothula and Roman Vaculin and Natalia Martinez and Fearghal O'donncha and Jayant Kalagnanam},
year={2026},
eprint={2506.03828},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2506.03828},
}

Open a GitHub Issue or email rk3443@columbia.edu or sagar.chethankumar@columbia.edu.
HPML Spring 2026 — Dr. Kaoutar El Maghraoui — Columbia University