HPML Final Project: Extending AssetOpsBench — Smart Grid Transformer Integration and Automated Scenario Generation
Course: High Performance Machine Learning | Semester: Spring 2026 | Instructor: Dr. Kaoutar El Maghraoui
- Team Number: 8
- Members:
- Rohith Kanathur (rk3443) — Transformer Asset Class Integration, Profiling, Optimization, Wandb Logging, Evaluation, Documentation and Report Writing.
- Sagar Chethan Kumar (sc5758) — Scenario generation pipeline implementation (research & digest agent, validate & repair agent, and domain generation)
- Ananya Kapoor (ak5447) — Scenario schema validation, eval unit tests, category alignment fix, --static-only mode, literature survey and report writing.
- Dev Bahl (db3791) — Eval tests and report rendering, baseline vs optimized comparison, CI pipeline, experiment reproduction.
- GitHub repository: https://github.com/Rohith-Kanathur/AssetOpsBench
- Final report: deliverables/Team8_HPML_Final_Report.pdf
- Final presentation: deliverables/Team8_HPML_Final_Presentation.pptx
- Experiment-tracking dashboard: https://wandb.ai/rk3443-columbia-university/assetopsbench
LLM-based AI agents are redefining Industry 4.0 asset operations, integrating perception, reasoning, and action across complex industrial systems. Yet evaluating these agents at scale requires large, high-quality benchmark scenarios that are expensive to produce manually. AssetOpsBench, the first unified framework for industrial asset agent evaluation, initially ships with only 141 scenarios, handcrafted by subject-matter experts (SMEs) and covering a narrow set of HVAC assets (chillers and AHUs), leaving critical classes such as high-voltage power transformers excluded. Beyond asset-class coverage, scenario creation itself does not scale: every new asset type demands physically plausible, causally consistent, tool-reachable, and standards-compliant scenarios authored by SMEs. This project addresses both gaps by extending AssetOpsBench with a new Smart Grid Transformer asset class and introducing a ScenarioGeneratorAgent that automatically generates, repairs, and validates scenarios through asset profiling, budget allocation, and constrained LLM generation stages. The optimizations target the inference stage by reducing latency and token generation. The primary performance bottlenecks in the unoptimized baseline are sequential and redundant LLM calls, time-consuming domain-literature retrieval, and blocking I/O, which together dominate end-to-end wall time. The optimizations applied to address this are: a two-level cache (in-memory LRU + disk JSON), a thread pool that offloads blocking I/O operations to run concurrently rather than sequentially, and parallelized scenario generation across focus groups. Together these reduce end-to-end pipeline time by up to 8× for 50 scenarios with no measurable degradation in scenario quality (mean quality score: 74.2 ± 1.9 for optimized vs. 73.8 ± 3.0 for the baseline).
- Model:
meta-llama/llama-3-3-70b-instruct, served via IBM WatsonX, used as the LLM backbone across all five pipeline stages: asset profiling, budget allocation, per-domain scenario generation, validation, and repair. A lightweight supervised regression model (trained on the Mendeley transformer health dataset) is used for the predict_health_index tool in the Smart Grid Transformer asset class.
- Framework: LiteLLM for unified LLM API access across providers; Model Context Protocol (MCP) for structured tool interfaces exposing the five industrial agent servers (IoT, FMSR, TSFM, WO, Vibration); PyTorch Profiler for per-phase instrumentation; Weights & Biases for experiment tracking.
- Dataset: Three data sources: (1) existing AssetOpsBench chiller/AHU scenarios as few-shot style examples, (2) academic literature retrieved at runtime from ArXiv and SemanticScholar for asset profile grounding, and (3) the Mendeley Transformer Health Dataset for training the health index predictor.
- Custom components: ScenarioGeneratorAgent pipeline with a validate-and-repair loop; four new MCP tools for Smart Grid Transformer diagnostics grounded in IEC 60599 and IEC 60076-7 standards; a two-level (L1 in-memory LRU + L2 disk) asset profile cache; an AsyncBatchSemaphore for rate-limited parallel focus-group execution; and a thread pool offloader for blocking LLM calls.
- Hardware: NVIDIA A100 and H100 GPUs (IBM WatsonX inference cluster)
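The two-level asset-profile cache can be sketched as below. This is a minimal illustration of the L1 in-memory LRU + L2 disk-JSON design described above; the class name `TwoLevelCache`, the key layout, and the one-JSON-file-per-key scheme are assumptions for illustration, not the exact implementation in `src/`.

```python
import json
import os
import tempfile
from collections import OrderedDict

class TwoLevelCache:
    """L1: in-memory LRU (OrderedDict); L2: one JSON file per key on disk."""

    def __init__(self, cache_dir, l1_size=128):
        self.cache_dir = cache_dir
        self.l1_size = l1_size
        self.l1 = OrderedDict()
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.cache_dir, f"{key}.json")

    def get(self, key):
        if key in self.l1:                       # L1 hit
            self.l1.move_to_end(key)
            return self.l1[key]
        path = self._path(key)
        if os.path.exists(path):                 # L2 hit: promote into L1
            with open(path) as f:
                value = json.load(f)
            self.put(key, value, write_disk=False)
            return value
        return None                              # cold miss: caller rebuilds profile

    def put(self, key, value, write_disk=True):
        self.l1[key] = value
        self.l1.move_to_end(key)
        if len(self.l1) > self.l1_size:
            self.l1.popitem(last=False)          # evict least recently used
        if write_disk:
            with open(self._path(key), "w") as f:
                json.dump(value, f)

cache_dir = tempfile.mkdtemp()
cache = TwoLevelCache(cache_dir, l1_size=2)
cache.put("transformer_profile", {"asset": "Transformer", "standards": ["IEC 60599"]})

# A fresh instance models a warm re-run: its L1 is empty, but the disk copy hits.
warm = TwoLevelCache(cache_dir, l1_size=2)
profile = warm.get("transformer_profile")
```

On a warm run the disk hit makes the expensive profile-building phase unnecessary, which is why Phase 1 drops to 0 s in the optimized tables below.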
All experiments use the Smart Grid Transformer asset in CouchDB-grounded mode. Times are wall-clock seconds measured via timed_section spans and validated against PyTorch Profiler Chrome traces.
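The timed_section spans mentioned above can be approximated with a small context manager; this is a hedged sketch of the measurement pattern (a global `TIMINGS` dict and `time.perf_counter`), not the project's actual instrumentation code.

```python
import time
from contextlib import contextmanager

TIMINGS = {}  # phase name -> accumulated wall-clock seconds

@contextmanager
def timed_section(name):
    """Accumulate wall-clock time for a named pipeline phase."""
    start = time.perf_counter()
    try:
        yield
    finally:
        TIMINGS[name] = TIMINGS.get(name, 0.0) + (time.perf_counter() - start)

with timed_section("Build Asset Profile"):
    time.sleep(0.05)  # stands in for retrieval + LLM synthesis work
```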
Baseline (unoptimized)
| Phase | N=10 | N=25 | N=50 |
|---|---|---|---|
| Get Server Descriptions | 2.40 s | 2.58 s | 2.20 s |
| Build Asset Profile | 448.28 s | 338.90 s | 325.70 s |
| Allocate Scenario Budget | 2.05 s | 3.37 s | 2.34 s |
| Generate & Validate Single-Agent | 11.40 s | 34.79 s | 38.83 s |
| Generate & Validate Multi-Agent | 4.05 s | 20.67 s | 39.32 s |
| Full Pipeline | 468.22 s | 400.35 s | 408.43 s |
Optimized (warm cache)
| Phase | N=10 | N=25 | N=50 |
|---|---|---|---|
| Get Server Descriptions | 1.56 s | 2.42 s | 2.16 s |
| Build Asset Profile | 0.00 s | 0.00 s | 0.00 s |
| Allocate Scenario Budget | 0.00 s | 0.00 s | 0.00 s |
| Generate & Validate Single-Agent | 8.23 s | 14.86 s | 15.95 s |
| Generate & Validate Multi-Agent | 10.73 s | 17.28 s | 18.11 s |
| Full Pipeline | 13.63 s | 19.88 s | 50.86 s |
Speedup: 34× at N=10, 20× at N=25, 8× at N=50. The two-level cache eliminates Phase 1 (Build Asset Profile) entirely on warm runs; Phase 1 is the dominant cost in the baseline (80–96% of total time). The decreasing speedup ratio at larger N reflects the growing share of Phase 3 scenario-generation time.
50 Smart Grid Transformer scenarios generated by each pipeline (CouchDB-grounded, N=50), evaluated across 3 independent runs using the three-stage evaluation scheme (Static /20 + LLM Judge /30 + Dry-Run /50 = /100 composite). A score ≥ 70 is considered high quality.
| Metric | Baseline | Optimized |
|---|---|---|
| Static Score (/20) | 18.4 ± 0.6 | 19.1 ± 0.3 |
| LLM Judge Score (/30) | 22.2 ± 3.6 | 22.4 ± 1.8 |
| Dry-Run Score (/50) | 33.1 ± 1.3 | 32.7 ± 2.5 |
| Quality Score (/100) | 73.8 ± 3.0 | 74.2 ± 1.9 |
The 0.4-point gap is well within run-to-run variance, confirming that the speed optimizations come at no cost to scenario quality. Both pipelines exceed the 70-point high-quality threshold on average.
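A quick sanity check on the "within run-to-run variance" claim, using only the summary statistics reported in the table (means, standard deviations, n=3 runs each) and a Welch-style t statistic:

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Two-sample t statistic from summary stats (unequal variances)."""
    return (m1 - m2) / math.sqrt(s1**2 / n1 + s2**2 / n2)

# Composite quality scores from the table: optimized 74.2 ± 1.9, baseline 73.8 ± 3.0
t = welch_t(74.2, 1.9, 3, 73.8, 3.0, 3)
# |t| ≈ 0.2, far below any conventional significance threshold
```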
Hardware: NVIDIA A100 / H100 GPUs (IBM WatsonX inference cluster)
Headline result: Two-level caching + parallel focus-group execution + thread-pool offloading reduces end-to-end pipeline time from 408 s to 51 s (8× speedup) for 50 scenarios, with mean scenario quality score almost unchanged at 74.2 ± 1.9 for optimized vs. 73.8 ± 3.0 for the baseline.
.
├── README.md
├── LICENSE
├── pyproject.toml
├── deliverables/ # Final report (PDF) and final presentation (PPT/PDF) — same files uploaded to CourseWorks
│ ├── Team8_HPML_Final_Report.pdf
│ └── Team8_HPML_Final_Presentation.pptx
└── src/
├── agent/ # LLM agent runners, CLI, and plan-execute orchestration
├── couchdb/ # CouchDB setup, Docker Compose, and asset data initialisation
├── evaluation/ # Evaluation utilities
├── llm/ # LiteLLM wrapper and base LLM abstractions
├── observability/ # Tracing, run spans, and file exporters
├── scenarios/ # Base scenario models, grounding, retrieval, and prompts
├── scenarios_evaluation/ # Three-stage scenario quality evaluator (static + LLM judge + dry-run)
├── scenarios_optimization/ # Optimized scenario generation
├── scenarios_profiling/ # PyTorch-profiler-instrumented scenario generation pipeline
├── scenarios_testing/ # Scenario generation smoke tests
├── scenarios_wandb/ # W&B-integrated optimized scenario generation
    ├── servers/                # MCP tool servers (IoT, FMSR, TSFM, WO, Vibration, Utilities)
    └── outputs/                # Sample profiling and wandb outputs. (Detailed outputs present in the PPT and report)
# Clone
git clone https://github.com/Rohith-Kanathur/AssetOpsBench.git
cd AssetOpsBench

# Install dependencies (run from the repo root)
uv sync

# Activate the virtual environment
source .venv/bin/activate   # macOS / Linux

# Copy .env.public to .env
cp .env.public .env
# Then edit .env and set WATSONX_APIKEY, WATSONX_PROJECT_ID, WATSONX_URL

# Start CouchDB
docker compose -f src/couchdb/docker-compose.yaml up -d

# Verify CouchDB is running
curl -X GET http://localhost:5984/

Weights & Biases is used for run-level tracking. Logging covers the scenario generation and evaluation stages. Metrics captured: phase-wise timing, LLM call metrics, cache hit/miss metrics, valid/invalid scenario counts, and scenario evaluation scores.
🔗 Dashboard: https://wandb.ai/rk3443-columbia-university/assetopsbench — Platform: Weights & Biases
The dataset used for the Health Index Prediction FMSR tool is available here: https://data.mendeley.com/datasets/rz75w3fkxy/1
To reproduce the baseline:
uv run python -m scenarios_profiling.generator Transformer --num-scenarios 50 --data-in-couchdb --profile --profile-dir profiling_output/exp1_latency_baseline

To reproduce the optimized run:

uv run python -m scenarios_optimization.generator Transformer --num-scenarios 50 --data-in-couchdb --profile --profile-dir profiling_output/exp1_latency_optimized

To reproduce the run with wandb logging, run:
uv run python -m scenarios_wandb.generator Transformer \
--data-in-couchdb \
--num-scenarios 50 \
--wandb \
--wandb-project assetopsbench \
  --wandb-run-name transformer-50-openform

Then run the evaluation with wandb logging:

uv run python src/scenarios_evaluation/eval_scenarios_wandb.py --wandb --wandb-project assetopsbench --wandb-run-name transformer-eval-50

Ensure that the scenarios.json file is present in the same location as eval_scenarios_wandb.py.
To visualize the profiler traces referenced in the report:
- Run the pipelines as suggested in Section D (Training).
- Profiling traces will be available in profiling_output/
- Open the Chrome trace JSON files at https://ui.perfetto.dev/
To reproduce the headline result in Section 3: Final Results Summary:
Run the baseline once:
uv run python -m scenarios_profiling.generator Transformer --num-scenarios 50 --data-in-couchdb --profile --profile-dir profiling_output/exp1_latency_baseline

Run the optimized variant twice (to see the effect of caching):

uv run python -m scenarios_optimization.generator Transformer --num-scenarios 50 --data-in-couchdb --profile --profile-dir profiling_output/exp1_latency_optimized

Open the Chrome traces in profiling_output/ at https://ui.perfetto.dev/.
- Exp 1 - Scalability (Baseline vs. Optimized, N=10/25/50): The unoptimized baseline spends 80–96% of its total wall time on Phase 1 (Build Asset Profile: 326–448 s), dominated by sequential academic retrieval, PDF fetching, and LLM synthesis. The optimized pipeline with a warm two-level cache reduces this phase to 0 s, yielding end-to-end speedups of 34× (N=10), 20× (N=25), and 8× (N=50). The decreasing ratio at larger N reflects the growing share of scenario-generation phases, which scale with budget but remain well below baseline even at N=50.
- Exp 2 - Parallelism (C=1→5 parallel focus groups, N=50): Increasing C from fully serial (C=1, 87.4 s) to C=5 (47.3 s) delivers a 1.85× speedup. Phase 3 (Generate & Validate Single-Agent) shows the sharpest gains: 43.9 s → 17.6 s. Returns diminish beyond C=3, suggesting that LLM API rate limits and network latency become the bottleneck at higher concurrency, not local CPU or I/O.
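The bounded focus-group parallelism measured in Exp 2 can be mimicked with a plain `asyncio.Semaphore`; this is an illustrative sketch, not the project's AsyncBatchSemaphore, and `generate_focus_group` is a stand-in for the real LLM-backed generation calls.

```python
import asyncio

async def generate_focus_group(name):
    """Stand-in for per-focus-group scenario generation (LLM calls + validation)."""
    await asyncio.sleep(0.01)  # simulated network-bound work
    return f"{name}: done"

async def run_all(groups, concurrency=5):
    """Run all focus groups concurrently, but at most `concurrency` at a time."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(group):
        async with sem:
            return await generate_focus_group(group)

    return await asyncio.gather(*(bounded(g) for g in groups))

results = asyncio.run(run_all([f"group-{i}" for i in range(10)], concurrency=3))
```

The semaphore caps in-flight requests, which matches the observed diminishing returns beyond C=3: once the provider's rate limit binds, extra local concurrency buys nothing.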
- Exp 3 - Caching (cold vs. warm cache, N=50): A single cold run primes the disk cache; all subsequent warm runs skip Phase 1 entirely (247.4 s → 0 s), reducing total pipeline time from 302.5 s to 65.1 s, a 4.6× speedup with no changes to generation logic. The minor increase in Phase 4 (35.3 s → 45.6 s warm) is within normal LLM API latency variance.
- Exp 4 - Combined optimizations under cold cache (Baseline vs. Optimized, both cold): Isolating the non-caching gains: with thread-pool I/O offloading and parallel focus-group execution, the optimized variant still achieves a 1.52× end-to-end speedup (408.4 s → 269.0 s). Phase 1 improves by 33% (325.7 s → 216.9 s) from concurrent PDF/DB/retrieval requests; Phase 3 improves by 2.65× (38.8 s → 14.7 s) from parallel focus-group generation.
- Quality (Baseline vs. Optimized, N=50, n=3 runs): Speed gains come at zero quality cost. Composite quality scores are statistically indistinguishable: 74.2 ± 1.9 (optimized) vs. 73.8 ± 3.0 (baseline). Both exceed the 70-point high-quality threshold.
Detailed results are available under deliverables/.
- All source code lives under src/. See Section 4 for a directory-level description of each package.
- Profiler Chrome traces from the reported experiments are created under profiling_output/. Load any chrome_trace.json at https://ui.perfetto.dev for a visual flame chart.
- All secrets (API keys, WatsonX credentials, W&B tokens) are loaded from environment variables. Copy .env.public to .env and fill in WATSONX_APIKEY, WATSONX_PROJECT_ID, and WATSONX_URL before running any pipeline command.
- The health index regression model for the Smart Grid Transformer FMSR tool is trained on the Mendeley Transformer Health Dataset (open access).
- LLM inference runs on remote IBM WatsonX infrastructure. GPU utilisation metrics captured by W&B therefore reflect only local host activity and do not measure true model-serving compute cost.
- The W&B project for this work is public: https://wandb.ai/rk3443-columbia-university/assetopsbench.
Did your team use any AI tool in completing this project?
- No, we did not use any AI tool.
- Yes, we used AI assistance as described below.
Tool(s) used: e.g., Claude, Cursor
Specific purpose: AI assistance was used for targeted debugging and refinement tasks only. All the core design decisions, experimental methodology, and reported results were produced by the team only.
Sections affected:
- W&B Logging: W&B metrics were silently dropped when wandb.log() was called from non-main asyncio threads during parallel focus-group generation. Used AI help to diagnose the root cause and restructure logging into a deferred batch that is flushed from the main thread after asyncio.run() completes.
- Thread Pool Offloading Optimization: Thread pool offloading introduced a subtle issue where run_in_executor was being called with a None loop reference, causing RuntimeError: cannot schedule new futures after shutdown. Used AI assistance to identify the lifecycle mismatch and add a guard that re-uses the running loop's default executor rather than creating a new one per call.
- src/scenarios_profiling/generator/prompts/ and eval_scenarios.py: Used AI assistance to improve the LLM judge prompt and the scenario-generation prompts for clarity and to reduce ambiguous instructions that were causing incorrect outputs.
- Report Writing: Light prose refinement in the abstract, introduction, and discussion sections to improve grammar, clarity, and conciseness. All core ideas, experimental methodology, analysis, and conclusions were executed and written entirely by the team members. AI assistance was limited to sentence-level rewording only and did not influence any reported results.
How we verified correctness: All W&B logging fixes were verified by running the generation pipeline multiple times and confirming every expected metric (phase timings, cache hit rates, batch results, LLM call records) appeared on the W&B dashboard. The thread pool fix was verified by running the optimized pipeline many consecutive times without a RuntimeError. Prompt improvements were validated by re-running the full three-stage evaluation suite and confirming quality scores remained stable. All reported experiment numbers were produced by running the pipelines ourselves end-to-end. No AI-generated numbers appear anywhere in the results. Report prose changes were reviewed line-by-line by team members to ensure no false information was reported.
By submitting this project, the team confirms that the analysis, interpretations, and conclusions are our own, and that any AI assistance is fully disclosed above.
Open source Pull Request Submitted to AssetOpsBench: IBM#288
The scenarios generated from our pipeline are available on Huggingface: https://huggingface.co/datasets/rohithkanathur/assetopsbench-transformer-scenarios-dataset
Released under the MIT License. See LICENSE.
This project is built on AssetOpsBench, under the guidance of Dr. Dhaval Patel.
@misc{patel2026assetopsbenchbenchmarkingaiagents,
title={AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance},
author={Dhaval Patel and Shuxin Lin and James Rayfield and Nianjun Zhou and Chathurangi Shyalika and Suryanarayana R Yarrabothula and Roman Vaculin and Natalia Martinez and Fearghal O'donncha and Jayant Kalagnanam},
year={2026},
eprint={2506.03828},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2506.03828},
}

Open a GitHub Issue or email rk3443@columbia.edu or sagar.chethankumar@columbia.edu.
HPML Spring 2026 — Dr. Kaoutar El Maghraoui — Columbia University