
feat(scenarios): MemoryBench — evaluating agent memory in industrial maintenance workflows #248

@Jkanishkha0305

Description


Summary

Current AssetOpsBench scenarios are single-shot and stateless. Every scenario is an independent
(query, expected_answer) pair. The benchmark does not evaluate whether agents correctly maintain
and propagate context across the multiple reasoning steps they take to answer a question.

This is a significant gap for industrial AI: an agent that confabulates a sensor ID
mid-workflow — or attributes one asset's failure mode to a different asset — produces wrong
maintenance recommendations even if its final sentence sounds plausible.

What this contribution adds

20 new memory_chain scenarios across 5 categories

| Category | What is tested |
| --- | --- |
| `asset_identity_persistence` | Asset name / sensor ID found by IoTAgent must be carried correctly into subsequent steps |
| `failure_mode_propagation` | Failure mode codes returned by FMSRAgent must flow accurately into WO and TSFM steps |
| `temporal_context_retention` | Timestamps, forecast horizons, and anomaly windows must be cited correctly |
| `multi_entity_tracking` | When analysing ≥ 2 assets, findings must not be cross-attributed |
| `cross_agent_context_flow` | Full IoT → FMSR → TSFM → WO chain must share consistent context end-to-end |
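To make the categories concrete, here is what one scenario record might look like as a JSONL line. The field names (`id`, `category`, `query`, `characteristic_form`) are assumptions based on this description, not the canonical schema from `memory_scenarios.jsonl`:

```python
import json

# Hypothetical example of one memory_chain scenario record.
# Field names are illustrative, not the canonical schema.
record = {
    "id": "mem_011",
    "category": "multi_entity_tracking",
    "query": (
        "Compare the anomaly history of Chiller 6 and Chiller 9 and "
        "recommend which one needs a work order first."
    ),
    # Qualitative criteria instead of a fixed expected answer.
    "characteristic_form": (
        "Findings for Chiller 6 and Chiller 9 must not be swapped or "
        "merged; the recommendation must cite the asset whose own "
        "anomaly data supports it."
    ),
}

line = json.dumps(record)   # serialized as one JSONL line
parsed = json.loads(line)
print(parsed["category"])   # multi_entity_tracking
```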

How grading works

Memory scenarios reuse the existing 6-dimension evaluation_agent() grader (no new
dependencies). The key difference is in the characteristic_form: instead of a fixed expected
answer, it describes qualitative criteria that the agent's final answer must satisfy —
specifically what context from prior steps must be present and consistent.

For example:

"The work order must reference the specific chiller asset name found during IoT discovery
and include a sensor ID that was listed for that exact asset. The failure mode cited must be
one returned by the FMSR query for that asset specifically."

This naturally engages the existing hallucinations, data_retrieval_accuracy, and
agent_sequence_correct dimensions against memory-specific failure modes.
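The failure mode these dimensions are asked to catch can be sketched mechanically. The following check is purely illustrative (the grader itself is the LLM-based `evaluation_agent()`, not string matching, and the `SENSOR-\d+` ID format is made up): it flags sensor IDs cited in a final answer that never appeared in any earlier step, i.e. confabulated identifiers.

```python
import re

def cited_but_never_seen(final_answer: str, prior_step_outputs: list) -> set:
    """Return sensor IDs cited in the final answer that were never
    surfaced by any earlier step -- the memory failure described above.

    SENSOR-\\d+ is an invented ID format, used only for illustration.
    """
    pattern = re.compile(r"SENSOR-\d+")
    seen = set()
    for output in prior_step_outputs:
        seen.update(pattern.findall(output))
    cited = set(pattern.findall(final_answer))
    return cited - seen

steps = ["IoTAgent: Chiller 6 exposes SENSOR-104 and SENSOR-107."]
answer = "Open a work order for Chiller 6 citing SENSOR-104 and SENSOR-212."
print(cited_but_never_seen(answer, steps))  # {'SENSOR-212'}
```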

New files (self-contained, no new dependencies)

aobench/scenario-server/src/scenario_server/handlers/aob_memory/
    __init__.py
    aob_memory.py            # new handler, follows exact pattern of aob_iot.py
    memory_scenarios.jsonl   # 20 memory-chain scenarios

One-line addition to app.py to register the handler.
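A minimal sketch of the bundled-file loading the new handler performs (the helper name is an assumption; everything else in `aob_memory.py` follows the `aob_iot.py` pattern as noted above):

```python
import json
from pathlib import Path

def load_memory_scenarios(path: Path) -> list:
    """Load memory-chain scenarios from a bundled JSONL file:
    one JSON object per non-blank line."""
    scenarios = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                scenarios.append(json.loads(line))
    return scenarios

# Demonstration with a throwaway file standing in for memory_scenarios.jsonl.
import tempfile
with tempfile.TemporaryDirectory() as tmp:
    f = Path(tmp) / "memory_scenarios.jsonl"
    f.write_text('{"id": "mem_001"}\n{"id": "mem_002"}\n', encoding="utf-8")
    scenarios = load_memory_scenarios(f)
print(len(scenarios))  # 2
```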

Note: For long-term consistency with other handlers, memory_scenarios.jsonl should
eventually be uploaded to the ibm-research/AssetOpsBench HuggingFace dataset and loaded
via hf_hub_download. The bundled file approach used here keeps the PR self-contained for
initial review.
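The eventual migration could keep the bundled file as a fallback. A sketch under stated assumptions: the injected `downloader` parameter is an illustration device standing in for `huggingface_hub.hf_hub_download`, so the fallback path stays testable offline:

```python
from pathlib import Path

# In the real handler this would sit next to aob_memory.py.
BUNDLED = Path("memory_scenarios.jsonl")

def resolve_scenario_file(downloader=None) -> Path:
    """Prefer the HuggingFace dataset copy when a downloader is supplied,
    otherwise fall back to the file bundled with the handler.

    `downloader` stands in for huggingface_hub.hf_hub_download, e.g.:
        from huggingface_hub import hf_hub_download
        resolve_scenario_file(hf_hub_download)
    """
    if downloader is not None:
        return Path(downloader(
            repo_id="ibm-research/AssetOpsBench",  # dataset named in the note above
            filename="memory_scenarios.jsonl",
            repo_type="dataset",
        ))
    return BUNDLED

# Offline demonstration with a fake downloader.
fake = lambda **kw: "/tmp/" + kw["filename"]
print(resolve_scenario_file(fake))  # /tmp/memory_scenarios.jsonl
print(resolve_scenario_file())      # memory_scenarios.jsonl
```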

Why this matters

Industrial agents fail in practice not just because they pick wrong tools, but because they
lose track of which specific asset they are working on mid-workflow. MemoryBench creates a
reproducible way to measure and compare that capability across agent architectures
(MetaAgent vs AgentHive) and across different LLMs.

Related

cc @dPatidar-ds
