Summary
Current AssetOpsBench scenarios are single-shot and stateless. Every scenario is an independent
(query, expected_answer) pair. The benchmark does not evaluate whether agents correctly maintain
and propagate context across the multiple reasoning steps they take to answer a question.
This is a significant gap for industrial AI: an agent that confabulates a sensor ID
mid-workflow — or attributes one asset's failure mode to a different asset — produces wrong
maintenance recommendations even if its final sentence sounds plausible.
What this contribution adds
20 new memory_chain scenarios across 5 categories:

| Category | What is tested |
| --- | --- |
| asset_identity_persistence | Asset name / sensor ID found by IoTAgent must be carried correctly into subsequent steps |
| failure_mode_propagation | Failure mode codes returned by FMSRAgent must flow accurately into WO and TSFM steps |
| temporal_context_retention | Timestamps, forecast horizons, and anomaly windows must be cited correctly |
| multi_entity_tracking | When analysing ≥ 2 assets, findings must not be cross-attributed |
| cross_agent_context_flow | Full IoT → FMSR → TSFM → WO chain must share consistent context end-to-end |
How grading works
Memory scenarios reuse the existing 6-dimension evaluation_agent() grader (no new
dependencies). The key difference is in the characteristic_form: instead of a fixed expected
answer, it describes qualitative criteria that the agent's final answer must satisfy —
specifically what context from prior steps must be present and consistent.
For example:
"The work order must reference the specific chiller asset name found during IoT discovery
and include a sensor ID that was listed for that exact asset. The failure mode cited must be
one returned by the FMSR query for that asset specifically."
This naturally engages the existing hallucinations, data_retrieval_accuracy, and
agent_sequence_correct dimensions against memory-specific failure modes.
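For illustration, here is a sketch of what a single record in memory_scenarios.jsonl might look like. Only characteristic_form is a field named in this description; the other field names (id, category, text) and the query wording are placeholders, and the real schema should match whatever the existing handlers (e.g. aob_iot.py) use.

```python
import json

# Hypothetical memory_chain record; field names other than "characteristic_form"
# and the query text are illustrative placeholders, not actual benchmark content.
example_record = {
    "id": "memory_chain_001",
    "category": "cross_agent_context_flow",
    "text": (
        "Find the chiller with the most anomalous readings last week, retrieve its "
        "likely failure modes, forecast the affected sensor, and create a work order."
    ),
    "characteristic_form": (
        "The work order must reference the specific chiller asset name found during "
        "IoT discovery and include a sensor ID that was listed for that exact asset. "
        "The failure mode cited must be one returned by the FMSR query for that "
        "asset specifically."
    ),
}

# Records are stored one JSON object per line (JSONL).
print(json.dumps(example_record))
```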
New files (self-contained, no new dependencies)
aobench/scenario-server/src/scenario_server/handlers/aob_memory/
__init__.py
aob_memory.py # new handler, follows exact pattern of aob_iot.py
memory_scenarios.jsonl # 20 memory-chain scenarios
One-line addition to app.py to register the handler.
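As a rough sketch of the bundled-file approach, the handler only needs to read the JSONL file that ships next to it. The function name and return shape below are illustrative; the real aob_memory.py should mirror the structure of aob_iot.py rather than this sketch.

```python
import json
from pathlib import Path

# Hypothetical loader for the bundled scenario file shipped inside aob_memory/.
def load_memory_scenarios() -> list[dict]:
    path = Path(__file__).parent / "memory_scenarios.jsonl"
    with path.open(encoding="utf-8") as f:
        # One scenario per non-empty line.
        return [json.loads(line) for line in f if line.strip()]
```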
Note: For long-term consistency with other handlers, memory_scenarios.jsonl should
eventually be uploaded to the ibm-research/AssetOpsBench HuggingFace dataset and loaded
via hf_hub_download. The bundled file approach used here keeps the PR self-contained for
initial review.
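A hedged sketch of what Hub-based loading could look like once the file is uploaded, using hf_hub_download from huggingface_hub; the repo_id and filename follow the note above, but the final location on the Hub is an assumption.

```python
import json
from huggingface_hub import hf_hub_download

# Assumed future location: the dataset repo and filename named in the note above.
def load_memory_scenarios_from_hub() -> list[dict]:
    path = hf_hub_download(
        repo_id="ibm-research/AssetOpsBench",
        filename="memory_scenarios.jsonl",
        repo_type="dataset",
    )
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```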
Why this matters
Industrial agents fail in practice not just because they pick the wrong tools, but because they
lose track of which specific asset they are working on mid-workflow. MemoryBench creates a
reproducible way to measure and compare that capability across agent architectures
(MetaAgent vs AgentHive) and across different LLMs.
Related
cc @dPatidar-ds