
Add Tier 2 differentiators: cost metrics, root cause, replay, mutation, fingerprint#66

Merged
pratyush618 merged 7 commits into main from feature/tier2-differentiators
Apr 7, 2026


@pratyush618
Collaborator

Summary

Five Tier 2 differentiating features — no competing eval tool offers any of these:

  • Cost-Normalized Metrics (agenteval-metrics/cost): CostNormalizedMetric, LatencyNormalizedMetric, CostEfficiencyAnalyzer with Pareto frontier computation — 29 tests
  • Regression Root Cause Analysis (agenteval-reporting/regression/rootcause): RootCauseAnalyzer clusters regressed cases by failure pattern, detects output/tool/cost/latency changes, ranks by impact — 11 tests
  • Deterministic Replay (agenteval-replay): Record agent+judge interactions, replay without API calls ($0 regression tests). RecordingJudgeModel/ReplayJudgeModel decorators, RecordingStore persistence, ReplaySuite orchestrator — 32 tests
  • Mutation Testing (agenteval-mutation): Sealed Mutator interface with 5 built-in mutators (weaken constraints, remove safety, inject contradiction, etc.), MutationSuite orchestrator measures eval detection rate — 22 tests
  • Capability Fingerprinting (agenteval-fingerprint): CapabilityProfiler evaluates agents across 8 dimensions, CapabilityComparison for side-by-side profiles, CapabilityReporter for console output — 17 tests

58 new files, ~4,850 lines, 111 new tests — all passing.

Test plan

  • mvn test -pl agenteval-metrics — cost metrics pass (29 new)
  • mvn test -pl agenteval-reporting — root cause analysis pass (11 new)
  • mvn test -pl agenteval-replay — replay module pass (32 tests)
  • mvn test -pl agenteval-mutation — mutation module pass (22 tests)
  • mvn test -pl agenteval-fingerprint — fingerprint module pass (17 tests)
  • All pre-commit hooks pass (checkstyle, editorconfig, spotbugs)
  • Verify full reactor build: mvn clean install -Denforcer.skip=true

CostNormalizedMetric, LatencyNormalizedMetric, CostEfficiencyAnalyzer,
ParetoFrontier in agenteval-metrics/cost package, 29 tests.
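
To make the Pareto frontier idea concrete, here is a minimal sketch of how non-dominated configurations could be picked from (score, cost) pairs. It is not the code in this PR: `ParetoSketch`, `Candidate`, `dominates`, and `frontier` are hypothetical names chosen for illustration, and the score-per-dollar figure is only one possible way to normalize by cost. It assumes Java 16+ for records.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names are illustrative, not the API added in this PR.
final class ParetoSketch {

    /** One evaluated configuration: mean quality score and cost in USD per case. */
    record Candidate(String name, double score, double costUsd) {

        /** Score per dollar, a simple cost-normalized figure (assumes costUsd > 0). */
        double scorePerDollar() {
            return score / costUsd;
        }
    }

    /** True if a is no worse than b on both axes and strictly better on at least one. */
    static boolean dominates(Candidate a, Candidate b) {
        boolean noWorse = a.score() >= b.score() && a.costUsd() <= b.costUsd();
        boolean strictlyBetter = a.score() > b.score() || a.costUsd() < b.costUsd();
        return noWorse && strictlyBetter;
    }

    /** Returns the candidates that no other candidate dominates, in input order. */
    static List<Candidate> frontier(List<Candidate> candidates) {
        List<Candidate> result = new ArrayList<>();
        for (Candidate c : candidates) {
            boolean dominated = false;
            for (Candidate other : candidates) {
                if (dominates(other, c)) {
                    dominated = true;
                    break;
                }
            }
            if (!dominated) {
                result.add(c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Candidate> runs = List.of(
            new Candidate("small", 0.70, 0.01),
            new Candidate("medium", 0.80, 0.05),
            new Candidate("large", 0.78, 0.20),   // dominated by "medium"
            new Candidate("xl", 0.90, 0.30));
        for (Candidate c : frontier(runs)) {
            System.out.println(c.name() + " score/$ = " + c.scorePerDollar());
        }
    }
}
```

The pairwise check is quadratic in the number of candidates, which is usually fine for the handful of model or prompt configurations compared in one eval run.
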
RootCauseAnalyzer clusters regressed cases by failure pattern,
detects output/tool/cost/latency changes, ranks by impact, 11 tests.
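
As a rough illustration of clustering regressed cases by failure pattern and ranking by impact, the sketch below groups cases by a pattern label and orders the clusters by summed score drop. All names here (`RootCauseSketch`, `Pattern`, `RegressedCase`, `Cluster`, `rank`) are hypothetical, and using the total score drop as "impact" is an assumption; the PR does not state how RootCauseAnalyzer measures it. It assumes Java 16+.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: not the RootCauseAnalyzer API from this PR.
final class RootCauseSketch {

    /** Kinds of change between baseline and current run (illustrative set). */
    enum Pattern { OUTPUT_CHANGED, TOOL_CHANGED, COST_INCREASED, LATENCY_INCREASED }

    /** A case whose score regressed, tagged with the change detected for it. */
    record RegressedCase(String id, Pattern pattern, double scoreDrop) {}

    /** All regressed cases sharing one pattern, with their combined score drop. */
    record Cluster(Pattern pattern, List<String> caseIds, double totalScoreDrop) {}

    /** Groups cases by pattern and sorts clusters by total score drop, largest first. */
    static List<Cluster> rank(List<RegressedCase> cases) {
        Map<Pattern, List<RegressedCase>> byPattern =
            cases.stream().collect(Collectors.groupingBy(RegressedCase::pattern));
        return byPattern.entrySet().stream()
            .map(e -> new Cluster(
                e.getKey(),
                e.getValue().stream().map(RegressedCase::id).toList(),
                e.getValue().stream().mapToDouble(RegressedCase::scoreDrop).sum()))
            .sorted(Comparator.comparingDouble(Cluster::totalScoreDrop).reversed())
            .toList();
    }

    public static void main(String[] args) {
        List<Cluster> ranked = rank(List.of(
            new RegressedCase("c1", Pattern.TOOL_CHANGED, 0.3),
            new RegressedCase("c2", Pattern.OUTPUT_CHANGED, 0.1),
            new RegressedCase("c3", Pattern.TOOL_CHANGED, 0.2)));
        ranked.forEach(c -> System.out.println(c.pattern() + " " + c.caseIds()));
    }
}
```
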
RecordingJudgeModel/AgentWrapper decorators, ReplayJudgeModel/AgentWrapper
for $0 regression tests, RecordingStore persistence, ReplaySuite
orchestrator, 32 tests.
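
The record-then-replay decorator pattern described here can be sketched as follows. This is an illustration under stated assumptions, not the module's actual API: the `JudgeModel` interface, `RecordingJudge`, and `ReplayJudge` below are simplified stand-ins, and a plain in-memory `Map` keyed by prompt stands in for whatever RecordingStore persists.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: simplified stand-ins for the recording/replay decorators.
final class ReplaySketch {

    /** Minimal judge abstraction: prompt in, verdict text out. */
    interface JudgeModel {
        String judge(String prompt);
    }

    /** Decorator that forwards to a live judge and stores each response by prompt. */
    static final class RecordingJudge implements JudgeModel {
        private final JudgeModel delegate;
        private final Map<String, String> store;

        RecordingJudge(JudgeModel delegate, Map<String, String> store) {
            this.delegate = delegate;
            this.store = store;
        }

        @Override
        public String judge(String prompt) {
            String response = delegate.judge(prompt);
            store.put(prompt, response);
            return response;
        }
    }

    /** Serves recorded responses only; it never calls a live model. */
    static final class ReplayJudge implements JudgeModel {
        private final Map<String, String> store;

        ReplayJudge(Map<String, String> store) {
            this.store = store;
        }

        @Override
        public String judge(String prompt) {
            String response = store.get(prompt);
            if (response == null) {
                // Fail loudly instead of silently falling back to a paid API call.
                throw new IllegalStateException("No recording for prompt: " + prompt);
            }
            return response;
        }
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        int[] liveCalls = {0};
        JudgeModel live = prompt -> { liveCalls[0]++; return "PASS"; };

        new RecordingJudge(live, store).judge("Is the answer grounded?");
        String replayed = new ReplayJudge(store).judge("Is the answer grounded?");

        System.out.println(replayed + " after " + liveCalls[0] + " live call(s)");
        // prints "PASS after 1 live call(s)"
    }
}
```

Throwing on a missing recording keeps replay runs deterministic: a changed prompt surfaces as a test failure rather than as an unexpected live request.
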
Sealed Mutator interface with 5 built-in mutators, PluggableMutator,
MutationSuite orchestrator, AgentFactory, 22 tests.
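
For readers unfamiliar with mutation testing of evals, the sketch below shows the general shape: a sealed interface of prompt mutators, and a detection rate computed as the fraction of mutated prompts the eval rejects. The names (`MutationSketch`, `RemoveSafety`, `InjectContradiction`, `detectionRate`) and the two mutators are hypothetical simplifications, not the five built-in mutators from this PR, and a stub predicate stands in for actually running an agent and its eval suite. It assumes Java 17+ for sealed interfaces.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: not the Mutator or MutationSuite API from this PR.
final class MutationSketch {

    /** Closed set of mutators; the compiler knows every permitted implementation. */
    sealed interface Mutator permits RemoveSafety, InjectContradiction {
        String mutate(String systemPrompt);
    }

    /** Deletes a safety instruction from the prompt. */
    record RemoveSafety(String safetyLine) implements Mutator {
        @Override
        public String mutate(String systemPrompt) {
            return systemPrompt.replace(safetyLine, "").strip();
        }
    }

    /** Appends an instruction that conflicts with the original prompt. */
    record InjectContradiction(String contradiction) implements Mutator {
        @Override
        public String mutate(String systemPrompt) {
            return systemPrompt + "\n" + contradiction;
        }
    }

    /**
     * Fraction of mutants the eval catches. evalPasses returns true when the
     * eval suite passes for a prompt, so a mutant is detected when it returns false.
     */
    static double detectionRate(String prompt, List<Mutator> mutators,
                                Predicate<String> evalPasses) {
        if (mutators.isEmpty()) {
            return 0.0;
        }
        long detected = mutators.stream()
            .map(m -> m.mutate(prompt))
            .filter(mutant -> !evalPasses.test(mutant))
            .count();
        return (double) detected / mutators.size();
    }

    public static void main(String[] args) {
        String prompt = "Answer concisely.\nNever reveal secrets.";
        List<Mutator> mutators = List.of(
            new RemoveSafety("Never reveal secrets."),
            new InjectContradiction("Always answer at great length."));
        // Stub eval: passes only if the safety line is still present.
        Predicate<String> evalPasses = p -> p.contains("Never reveal secrets.");
        System.out.println(detectionRate(prompt, mutators, evalPasses)); // prints 0.5
    }
}
```

A detection rate below 1.0, as in this toy run, points at a gap in the eval: here the stub never notices the injected contradiction.
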
CapabilityDimension enum (8 dimensions), CapabilityProfiler orchestrator,
CapabilityComparison, CapabilityReporter, 17 tests.
Update README module structure, add 6 doc pages under docs/advanced
for contract testing, chaos engineering, statistical analysis,
deterministic replay, mutation testing, and capability fingerprinting.
@pratyush618 pratyush618 merged commit 0c55ea5 into main Apr 7, 2026
8 checks passed