Summary
Current AssetOpsBench scenarios are single-shot and stateless. Every scenario is an independent
(query, expected_answer) pair. The benchmark does not evaluate whether agents correctly maintain
and propagate context across the multiple reasoning steps they take to answer a question.
This is a significant gap for industrial AI: an agent that confabulates a sensor ID
mid-workflow — or attributes one asset's failure mode to a different asset — produces wrong
maintenance recommendations even if its final sentence sounds plausible.
What this contribution adds
20 new memory_chain scenarios across 5 categories:

| Category | What is tested |
| --- | --- |
| asset_identity_persistence | Asset name / sensor ID found by IoTAgent must be carried correctly into subsequent steps |
| failure_mode_propagation | Failure mode codes returned by FMSRAgent must flow accurately into WO and TSFM steps |
| temporal_context_retention | Timestamps, forecast horizons, and anomaly windows must be cited correctly |
| multi_entity_tracking | When analysing ≥ 2 assets, findings must not be cross-attributed |
| cross_agent_context_flow | Full IoT → FMSR → TSFM → WO chain must share consistent context end-to-end |
How grading works
Memory scenarios reuse the existing 6-dimension evaluation_agent() grader (no new
dependencies). The key difference is in the characteristic_form: instead of a fixed expected
answer, it describes qualitative criteria that the agent's final answer must satisfy —
specifically what context from prior steps must be present and consistent.
For example:
"The work order must reference the specific chiller asset name found during IoT discovery
and include a sensor ID that was listed for that exact asset. The failure mode cited must be
one returned by the FMSR query for that asset specifically."
This naturally engages the existing hallucinations, data_retrieval_accuracy, and
agent_sequence_correct dimensions against memory-specific failure modes.
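For illustration, here is a sketch of what a single record in memory_scenarios.jsonl might look like. Only characteristic_form is a field named in this description; the other field names (id, category, text) and the query wording are placeholders, and the real schema should match whatever the existing handlers (e.g. aob_iot.py) use.

```python
import json

# Hypothetical memory_chain record; field names other than "characteristic_form"
# and the query text are illustrative placeholders, not actual benchmark content.
example_record = {
    "id": "memory_chain_001",
    "category": "cross_agent_context_flow",
    "text": (
        "Find the chiller with the most anomalous readings last week, retrieve its "
        "likely failure modes, forecast the affected sensor, and create a work order."
    ),
    "characteristic_form": (
        "The work order must reference the specific chiller asset name found during "
        "IoT discovery and include a sensor ID that was listed for that exact asset. "
        "The failure mode cited must be one returned by the FMSR query for that "
        "asset specifically."
    ),
}

# Records are stored one JSON object per line (JSONL).
print(json.dumps(example_record))
```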
New files (self-contained, no new dependencies)
aobench/scenario-server/src/scenario_server/handlers/aob_memory/
__init__.py
aob_memory.py # new handler, follows exact pattern of aob_iot.py
memory_scenarios.jsonl # 20 memory-chain scenarios
One-line addition to app.py to register the handler.
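As a rough sketch of the bundled-file approach, the handler only needs to read the JSONL file that ships next to it. The function name and return shape below are illustrative; the real aob_memory.py should mirror the structure of aob_iot.py rather than this sketch.

```python
import json
from pathlib import Path

# Hypothetical loader for the bundled scenario file shipped inside aob_memory/.
def load_memory_scenarios() -> list[dict]:
    path = Path(__file__).parent / "memory_scenarios.jsonl"
    with path.open(encoding="utf-8") as f:
        # One scenario per non-empty line.
        return [json.loads(line) for line in f if line.strip()]
```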
Note: For long-term consistency with other handlers, memory_scenarios.jsonl should
eventually be uploaded to the ibm-research/AssetOpsBench HuggingFace dataset and loaded
via hf_hub_download. The bundled file approach used here keeps the PR self-contained for
initial review.
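A hedged sketch of what Hub-based loading could look like once the file is uploaded, using hf_hub_download from huggingface_hub; the repo_id and filename follow the note above, but the final location on the Hub is an assumption.

```python
import json
from huggingface_hub import hf_hub_download

# Assumed future location: the dataset repo and filename named in the note above.
def load_memory_scenarios_from_hub() -> list[dict]:
    path = hf_hub_download(
        repo_id="ibm-research/AssetOpsBench",
        filename="memory_scenarios.jsonl",
        repo_type="dataset",
    )
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```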
Why this matters
Industrial agents fail in practice not just because they pick the wrong tools, but because they
lose track of which specific asset they are working on mid-workflow. MemoryBench creates a
reproducible way to measure and compare that capability across agent architectures
(MetaAgent vs AgentHive) and across different LLMs.
Related
cc @dPatidar-ds