From e434a6d72438a050f205d2631f38ced7292a8f33 Mon Sep 17 00:00:00 2001
From: Pratyush Sharma <56130065+pratyush618@users.noreply.github.com>
Date: Sun, 29 Mar 2026 11:06:50 +0530
Subject: [PATCH 1/2] docs: use full logo with text in README

---
 README.md                     |   2 +-
 docs/static/img/logo-full.png | Bin 0 -> 3727 bytes
 2 files changed, 1 insertion(+), 1 deletion(-)
 create mode 100644 docs/static/img/logo-full.png

diff --git a/README.md b/README.md
index f4bfd50..19451e2 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@

- AgentEval + AgentEval

diff --git a/docs/static/img/logo-full.png b/docs/static/img/logo-full.png
new file mode 100644
index 0000000000000000000000000000000000000000..ca33178dc2b04b04433f0ecaf719ba2996cdc56d
GIT binary patch
literal 3727
[base85-encoded PNG data omitted]

literal 0
HcmV?d00001

From f3dab5e0892e059232c1c5bc76431ccc8aa8920f Mon Sep 17 00:00:00 2001
From: Pratyush Sharma <56130065+pratyush618@users.noreply.github.com>
Date: Sun, 29 Mar 2026 11:12:41 +0530
Subject: [PATCH 2/2] chore: update README.md

---
 README.md                  |   8 +-
 agent-eval-feature-spec.md | 761 -------------------------------------
 2 files changed, 5 insertions(+), 764 deletions(-)
 delete mode 100644 agent-eval-feature-spec.md

diff --git a/README.md b/README.md
index 19451e2..ee56684 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,14 @@

- Java AI Agent Evaluation & Testing Library — JUnit 5-native, local-first, framework-agnostic evaluation for AI agents. -

- [![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE) [![Java](https://img.shields.io/badge/java-21+-orange.svg)](https://openjdk.org/projects/jdk/21/) [![Build](https://img.shields.io/badge/build-Maven%20%7C%20Gradle-green.svg)](#build) +

+ +

+ Java AI Agent Evaluation & Testing Library — JUnit 5-native, local-first, framework-agnostic evaluation for AI agents. +

--- diff --git a/agent-eval-feature-spec.md b/agent-eval-feature-spec.md deleted file mode 100644 index e9dc2ca..0000000 --- a/agent-eval-feature-spec.md +++ /dev/null @@ -1,761 +0,0 @@ -# Java AI Agent Evaluation & Testing Library — Feature Specification - -> **Working Name:** TBD (candidates: `agenteval`, `agentest`, `jeval`, `evalkit`) -> **Language:** Java 21+ (LTS baseline) -> **License:** Apache 2.0 -> **Build:** Maven Central artifact, Gradle/Maven compatible -> **Philosophy:** Library, not framework. JUnit-native. Local-first. Framework-agnostic. - ---- - -## 1. Core Test Case Model - -The foundational data model that captures everything needed to evaluate an AI interaction. - -### 1.1 `AgentTestCase` — Single-Turn Evaluation - -| Field | Type | Required | Description | -|-------|------|----------|-------------| -| `input` | `String` | Yes | The user query or prompt sent to the agent | -| `actualOutput` | `String` | Yes | The agent's actual response | -| `expectedOutput` | `String` | No | Ground truth / ideal response for comparison | -| `retrievalContext` | `List` | No | Documents/chunks retrieved by RAG pipeline | -| `context` | `List` | No | Ground truth context (what should have been retrieved) | -| `toolCalls` | `List` | No | Tools invoked by the agent during execution | -| `expectedToolCalls` | `List` | No | Expected tool invocations for comparison | -| `reasoningTrace` | `List` | No | Agent's chain-of-thought / planning steps | -| `latencyMs` | `long` | No | End-to-end execution time | -| `tokenUsage` | `TokenUsage` | No | Input/output/total token counts | -| `cost` | `BigDecimal` | No | Estimated cost of the interaction | -| `metadata` | `Map` | No | Arbitrary key-value pairs for filtering/grouping | - -### 1.2 `ConversationTestCase` — Multi-Turn Evaluation - -| Field | Type | Required | Description | -|-------|------|----------|-------------| -| `turns` | `List` | Yes | Ordered list of conversation turns | -| `conversationId` | `String` | No | Identifier for the conversation session | -| `systemPrompt` | `String` | No | System prompt used across all turns | - -### 1.3 Supporting Types - -- **`ToolCall`** — `name: String`, `arguments: Map`, `result: String`, `durationMs: long` -- **`ReasoningStep`** — `type: enum(PLAN, THOUGHT, OBSERVATION, ACTION)`, `content: String`, `toolCall: ToolCall?` -- **`TokenUsage`** — `inputTokens: int`, `outputTokens: int`, `totalTokens: int` -- **`EvalScore`** — `value: double` (0.0–1.0), `threshold: double`, `passed: boolean`, `reason: String`, `metricName: String` - -### 1.4 Builder Pattern - -All test case types use immutable builders: - -```java -var testCase = AgentTestCase.builder() - .input("What is our refund policy?") - .actualOutput(agent.run("What is our refund policy?")) - .expectedOutput("Full refund within 30 days of purchase.") - .retrievalContext(List.of(doc1, doc2)) - .build(); -``` - ---- - -## 2. Metrics — Response Quality - -Metrics that evaluate the quality of the agent's text output. All metrics implement the `EvalMetric` interface and return an `EvalScore` (0.0–1.0). 
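Sections 1.3 and 2 describe the `EvalMetric` contract and the `EvalScore` shape but never spell them out as code. A minimal sketch that is consistent with those descriptions might look as follows; everything beyond the field names listed in 1.3 (the factory method, the pass rule) is an assumption, not a fixed API.

```java
// Illustrative only. Mirrors the EvalScore fields listed in section 1.3; the factory
// and the pass rule are assumptions for the sketch, not the library's API.
record EvalScore(double value, double threshold, boolean passed,
                 String reason, String metricName) {

    // Convenience factory matching the EvalScore.of(value, threshold, reason) usage in
    // section 6.2: a score passes when it meets or exceeds its threshold. The metric
    // name is left to the scoring engine in this sketch.
    static EvalScore of(double value, double threshold, String reason) {
        return new EvalScore(value, threshold, value >= threshold, reason, null);
    }
}

interface EvalMetric {
    // Every metric scores one test case (the model from section 1.1) on a 0.0-1.0 scale.
    EvalScore evaluate(AgentTestCase testCase);
}
```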
- -### 2.1 Answer Relevancy - -- Measures whether the output is relevant to the input question -- Uses LLM-as-judge to generate synthetic questions from the output, then measures similarity to the original input -- Penalizes off-topic content and irrelevant information -- **Config:** threshold (default 0.7), strictMode (boolean) - -### 2.2 Faithfulness - -- Measures whether claims in the output are supported by the retrieval context -- Extracts individual claims from the output, then verifies each against the provided context -- Core metric for RAG pipelines — catches hallucinated facts that aren't grounded in source documents -- **Config:** threshold (default 0.7) - -### 2.3 Hallucination Detection - -- Measures whether the output contains fabricated information relative to provided context -- Different from faithfulness: specifically targets invented entities, false statistics, non-existent citations -- Can operate with or without ground truth context -- **Config:** threshold (default 0.5), contextRequired (boolean) - -### 2.4 Correctness (G-Eval) - -- General-purpose metric using the G-Eval framework -- Takes custom evaluation criteria as natural language instructions -- Compares actual output against expected output using LLM-as-judge with chain-of-thought -- The most flexible metric — can be configured for any evaluation dimension -- **Config:** criteria (String), evaluationSteps (List), threshold (default 0.5) - -### 2.5 Semantic Similarity - -- Measures embedding-based cosine similarity between actual and expected output -- Does not require LLM-as-judge (uses embedding model only) -- Fast and deterministic — good for regression testing -- **Config:** threshold (default 0.7), embeddingModel (configurable) - -### 2.6 Toxicity - -- Detects harmful, offensive, or inappropriate content in the output -- Covers categories: hate speech, threats, sexual content, self-harm, profanity -- Uses LLM-as-judge with specialized safety rubric -- **Config:** threshold (default 0.5), categories (Set) - -### 2.7 Bias Detection - -- Detects biased content across dimensions: gender, race, religion, political, socioeconomic -- Evaluates whether the output unfairly favors or discriminates against any group -- **Config:** threshold (default 0.5), dimensions (Set) - -### 2.8 Conciseness - -- Measures whether the output is appropriately concise without losing essential information -- Penalizes verbosity, repetition, and filler content -- **Config:** threshold (default 0.5) - -### 2.9 Coherence - -- Measures logical flow, consistency, and readability of the output -- Checks for contradictions within the response -- **Config:** threshold (default 0.7) - ---- - -## 3. Metrics — RAG-Specific - -Metrics specifically designed for evaluating Retrieval-Augmented Generation pipelines. 
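Section 2.5 above describes semantic similarity as embedding cosine similarity, and 3.4 below offers a SEMANTIC match mode. For reference, the underlying comparison of two embedding vectors reduces to a few lines of plain Java; producing the vectors is the job of the configured embedding model and is out of scope here.

```java
// Reference implementation of the cosine similarity used conceptually by section 2.5
// and by the SEMANTIC match mode in 3.4. Independent of any embedding provider.
final class VectorMath {

    static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "Embedding dimensions differ: " + a.length + " vs " + b.length);
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```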
- -### 3.1 Contextual Precision - -- Measures whether the retrieved documents that are actually relevant are ranked higher than irrelevant ones -- Requires ground truth expected output to determine relevance -- Higher score = relevant documents appear earlier in the retrieval results -- **Config:** threshold (default 0.7) - -### 3.2 Contextual Recall - -- Measures whether all relevant information needed to produce the expected output was actually retrieved -- Aligns sentences in the expected output to sentences in the retrieval context -- Low score = the retrieval pipeline missed important source documents -- **Config:** threshold (default 0.7) - -### 3.3 Contextual Relevancy - -- Measures what proportion of the retrieved context is actually relevant to the input -- Penalizes retrieval of irrelevant/noisy documents that dilute useful context -- **Config:** threshold (default 0.7) - -### 3.4 Retrieval Completeness - -- Checks whether all ground truth context documents were retrieved -- Set-based comparison: were the right documents fetched? -- Supports both exact match and fuzzy/semantic matching modes -- **Config:** threshold (default 0.8), matchMode (EXACT | SEMANTIC) - ---- - -## 4. Metrics — Agent-Specific - -Metrics designed for evaluating autonomous agent behavior, including tool use and multi-step reasoning. - -### 4.1 Tool Selection Accuracy - -- Measures whether the agent selected the correct tools for the task -- Compares actual tool calls against expected tool calls (by name) -- Handles cases where tool order matters and where it doesn't -- **Config:** threshold (default 0.8), orderMatters (boolean, default false) - -### 4.2 Tool Argument Correctness - -- Measures whether the arguments passed to tools were correct -- Supports type-safe generic assertions: `assertToolArg(SearchTool.class, args -> args.query().contains("refund"))` -- Deep comparison of argument maps against expected values -- **Config:** threshold (default 0.8), strictMode (boolean — fail on extra args) - -### 4.3 Tool Result Utilization - -- Measures whether the agent actually used the results returned by tool calls in its final output -- Detects cases where the agent calls a tool but ignores its response -- **Config:** threshold (default 0.7) - -### 4.4 Plan Quality - -- Evaluates whether the agent's generated plan is logical, complete, and efficient -- Checks: does the plan address all aspects of the task? Are steps in a sensible order? Are there redundant steps? 
-- Requires reasoning trace capture from the agent -- **Config:** threshold (default 0.7) - -### 4.5 Plan Adherence - -- Evaluates whether the agent followed its own plan during execution -- Compares the planned steps against the actual execution trace -- Detects deviations, skipped steps, and unplanned actions -- **Config:** threshold (default 0.7) - -### 4.6 Task Completion - -- Binary + graded evaluation of whether the agent accomplished the stated goal -- LLM-as-judge determines if the final outcome satisfies the original task -- Can incorporate custom success criteria -- **Config:** threshold (default 0.5), successCriteria (String, optional natural language) - -### 4.7 Trajectory Optimality - -- Measures whether the agent took an efficient path to the solution -- Compares the number and type of steps taken against a reference optimal trajectory -- Penalizes unnecessary tool calls, redundant LLM invocations, and circular reasoning -- **Config:** threshold (default 0.5), maxSteps (int, optional) - -### 4.8 Step-Level Error Localization - -- Identifies which specific step in the agent's execution chain caused a failure -- Evaluates each reasoning step and tool call individually, flagging the first point of divergence -- Produces a diagnostic report pointing to the root cause -- **Config:** threshold (default 0.5) - ---- - -## 5. Metrics — Conversation-Specific - -Metrics for evaluating multi-turn agent interactions. - -### 5.1 Conversation Coherence - -- Measures whether responses maintain logical consistency across turns -- Detects self-contradictions between earlier and later responses -- **Config:** threshold (default 0.7) - -### 5.2 Context Retention - -- Measures whether the agent remembers and correctly uses information from earlier turns -- Tests: does the agent recall user preferences, prior answers, and established facts? -- **Config:** threshold (default 0.7) - -### 5.3 Topic Drift Detection - -- Measures whether the agent stays on topic across the conversation -- Detects when responses diverge from the user's original intent -- **Config:** threshold (default 0.5) - -### 5.4 Conversation Resolution - -- Evaluates whether the multi-turn conversation reached a satisfactory conclusion -- Determines if the user's original goal was ultimately accomplished -- **Config:** threshold (default 0.5), successCriteria (String) - ---- - -## 6. Custom Metrics - -Infrastructure for users to define their own evaluation metrics. 
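The conversation metrics in section 5 above operate on the multi-turn model from 1.2. The spec leaves the turn type open, so the sketch below assumes the builder convention of 1.4 and uses single-turn `AgentTestCase` instances as turns; the contents are invented for illustration.

```java
// Illustrative only. Assembles the multi-turn model from section 1.2 for the
// conversation metrics in section 5, assuming the builder style of section 1.4.
var conversation = ConversationTestCase.builder()
    .conversationId("support-session-42")
    .systemPrompt("You are a support agent for an online store.")
    .turns(List.of(
        AgentTestCase.builder()
            .input("I ordered a jacket last week and it doesn't fit.")
            .actualOutput("Sorry to hear that! You can return it within 30 days of delivery.")
            .build(),
        AgentTestCase.builder()
            .input("Great, how do I start the return?")
            .actualOutput("I've emailed you a prepaid return label for your jacket order.")
            .build()))
    .build();
```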
- -### 6.1 G-Eval Custom Metric Builder - -- Define arbitrary evaluation criteria in natural language -- The library generates chain-of-thought evaluation prompts automatically -- Supports specifying which test case fields to include in evaluation - -```java -var customMetric = GEval.builder() - .name("TechnicalAccuracy") - .criteria("Evaluate whether the response contains technically accurate Java code examples") - .evaluationSteps(List.of( - "Check if code snippets compile", - "Verify API usage is correct for the stated Java version", - "Check for deprecated method usage" - )) - .threshold(0.8) - .build(); -``` - -### 6.2 Deterministic Custom Metrics - -- Implement the `EvalMetric` interface for rule-based checks -- No LLM calls required — pure Java logic -- Useful for: regex matching, schema validation, JSON structure checks, keyword presence - -```java -public class ContainsDisclaimerMetric implements EvalMetric { - @Override - public EvalScore evaluate(AgentTestCase testCase) { - boolean hasDisclaimer = testCase.getActualOutput() - .contains("not financial advice"); - return EvalScore.of(hasDisclaimer ? 1.0 : 0.0, 0.5, "Disclaimer check"); - } -} -``` - -### 6.3 Composite Metrics - -- Combine multiple metrics with weighted averaging -- Define pass/fail logic: ALL must pass, ANY must pass, or WEIGHTED average - -```java -var composite = CompositeMetric.builder() - .name("OverallQuality") - .add(new AnswerRelevancy(), 0.4) - .add(new Faithfulness(), 0.4) - .add(new Conciseness(), 0.2) - .strategy(CompositeStrategy.WEIGHTED_AVERAGE) - .threshold(0.7) - .build(); -``` - ---- - -## 7. LLM-as-Judge Engine - -The evaluation backbone — manages LLM calls used by metrics to score test cases. - -### 7.1 Provider Support - -- **OpenAI** — GPT-4o, GPT-4o-mini, GPT-4.1, etc. -- **Anthropic** — Claude Sonnet, Claude Haiku -- **Google** — Gemini Flash, Gemini Pro -- **Ollama** — Any locally hosted model (llama3, mistral, etc.) -- **Azure OpenAI** — Enterprise endpoint support -- **Amazon Bedrock** — AWS-native model access -- **Custom** — Implement `JudgeModel` interface for any HTTP-compatible LLM - -### 7.2 Configuration - -```java -AgentEval.configure() - .judgeModel(JudgeModels.anthropic("claude-sonnet-4-20250514")) - .embeddingModel(EmbeddingModels.openai("text-embedding-3-small")) - .maxConcurrentJudgeCalls(4) - .retryOnRateLimit(true, 3) - .cacheJudgeResults(true) // avoid re-evaluating identical test cases - .build(); -``` - -### 7.3 Cost Management - -- Token usage tracking per metric evaluation -- Estimated cost reporting per test run -- Budget limits — abort eval run if cost exceeds threshold -- Judge result caching to avoid redundant LLM calls across runs - -### 7.4 Judge Prompt Templates - -- Research-backed prompt templates for each metric (G-Eval, etc.) -- Templates are open and customizable — override any metric's judge prompt -- Chain-of-thought prompting with structured output parsing -- All templates available as resources for inspection and modification - ---- - -## 8. JUnit 5 Integration - -First-class integration with JUnit 5, the standard Java testing framework. - -### 8.1 Extension - -- `@ExtendWith(AgentEvalExtension.class)` on test classes -- Automatically collects results, manages lifecycle, generates reports -- Integrates with JUnit's test lifecycle hooks (beforeAll, afterAll, etc.)
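To make 8.1 and the annotations of 8.2 below concrete, a hypothetical test class could be wired up as in the sketch below; `OrderSupportAgent` stands in for whatever agent is under test, and the annotation and metric names follow this spec rather than any existing library.

```java
// Hypothetical usage of the extension (8.1) and annotations (8.2). OrderSupportAgent is a
// stand-in for the system under test; the metric and annotation names follow this spec.
@ExtendWith(AgentEvalExtension.class)
@Tag("eval")
class OrderSupportAgentEvalTest {

    private final OrderSupportAgent agent = new OrderSupportAgent();

    @AgentTest
    @Metric(value = AnswerRelevancy.class, threshold = 0.7)
    void answersRefundQuestionsRelevantly() {
        var testCase = AgentTestCase.builder()
            .input("What is your refund policy?")
            .actualOutput(agent.run("What is your refund policy?"))
            .build();

        AgentAssertions.assertThat(testCase)
            .outputContains("refund");
    }
}
```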
- -### 8.2 Annotations - -| Annotation | Target | Description | -|-----------|--------|-------------| -| `@AgentTest` | Method | Marks a test method as an agent evaluation | -| `@Metric` | Method | Applies a metric with configurable threshold | -| `@Metrics` | Method | Container for multiple `@Metric` annotations | -| `@DatasetSource` | Method | Loads test cases from a file (JSON/CSV) | -| `@GoldenSet` | Parameter | Injects golden dataset into parameterized test | -| `@JudgeModel` | Class/Method | Overrides the judge LLM for specific tests | -| `@EvalTimeout` | Method | Sets max time for evaluation to complete | -| `@Tag("eval")` | Class/Method | Standard JUnit tag for selective execution | - -### 8.3 Assertion API - -Fluent assertion API compatible with AssertJ style: - -```java -AgentAssertions.assertThat(testCase) - .meetsMetric(new AnswerRelevancy(0.7)) - .meetsMetric(new Faithfulness(0.8)) - .hasToolCalls() - .toolCallCount(2) - .calledTool("SearchOrders") - .neverCalledTool("DeleteOrder") - .outputContains("refund") - .outputMatchesSchema(RefundResponse.class); -``` - -### 8.4 Parameterized Dataset Tests - -```java -@ParameterizedTest -@DatasetSource("src/test/resources/golden-set.json") -@Metric(value = AnswerRelevancy.class, threshold = 0.7) -@Metric(value = Faithfulness.class, threshold = 0.8) -void evaluateAcrossDataset(AgentTestCase testCase) { - var response = agent.run(testCase.getInput()); - testCase.setActualOutput(response); -} -``` - -### 8.5 Selective Execution - -- Run only eval tests: `mvn test -Dgroups=eval` -- Run only fast (deterministic) metrics: `mvn test -Dgroups=eval-fast` -- Skip eval tests in quick builds: `mvn test -DexcludeGroups=eval` - ---- - -## 9. Standalone Evaluation Runner - -For batch evaluation outside of JUnit (scripting, notebooks, CI scripts). - -### 9.1 Programmatic API - -```java -var results = AgentEval.evaluate( - dataset, - List.of( - new AnswerRelevancy(0.7), - new Faithfulness(0.8), - new ToolSelectionAccuracy(0.9) - ) -); - -results.summary(); // Console summary -results.averageScore(); // Overall average -results.passRate(); // Percentage of test cases that passed all metrics -results.failedCases(); // Stream of failed test cases with reasons -``` - -### 9.2 Parallel Execution - -- Evaluates multiple test cases concurrently using virtual threads -- Configurable concurrency limit (default: number of available processors) -- Thread-safe metric implementations - -### 9.3 Progress Reporting - -- Real-time console progress bar for long-running evaluations -- Callback interface for custom progress handling -- Estimated time remaining based on throughput - ---- - -## 10. Dataset Management - -Infrastructure for managing evaluation datasets. 
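The spec does not fix a JSON schema for datasets, but a golden-set file for the JSON format in 10.1 below that reuses the field names of 1.1 and the sample values of 10.3 could plausibly look like this:

```json
[
  {
    "input": "How do I get a refund?",
    "expectedOutput": "You can request a refund within 30 days of purchase.",
    "retrievalContext": [
      "Refund policy: purchases may be refunded within 30 days with proof of purchase."
    ],
    "metadata": { "category": "refund", "difficulty": "easy" }
  }
]
```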
- -### 10.1 Format Support - -| Format | Read | Write | Notes | -|--------|------|-------|-------| -| JSON | Yes | Yes | Primary format, supports full test case model | -| CSV | Yes | Yes | Flat structure, good for simple input/output pairs | -| JSONL | Yes | Yes | Streaming-friendly, one test case per line | -| YAML | Yes | No | Human-readable alternative to JSON | - -### 10.2 Golden Set Management - -- Load from files, classpath resources, or URLs -- Filter and slice datasets by metadata tags -- Split into train/test for prompt optimization workflows -- Version tracking — associate datasets with commit hashes or release tags - -### 10.3 Dataset Builder - -```java -var dataset = EvalDataset.builder() - .name("refund-queries-v2") - .addCase(AgentTestCase.builder() - .input("How do I get a refund?") - .expectedOutput("You can request a refund within 30 days...") - .metadata(Map.of("category", "refund", "difficulty", "easy")) - .build()) - .addCase(...) - .build(); - -dataset.save("src/test/resources/refund-queries-v2.json"); -``` - -### 10.4 Synthetic Dataset Generation (P2) - -- Generate test cases from existing documents using LLM -- Produce variations of existing golden set entries (paraphrasing, edge cases) -- Generate adversarial inputs designed to expose weaknesses - ---- - -## 11. Reporting & Output - -### 11.1 Console Report - -- Summary table: metric name, average score, pass rate, min/max -- Failed test case details with LLM-judge explanations -- Color-coded output (pass=green, fail=red, warning=yellow) -- Execution time and cost summary - -### 11.2 JUnit XML Report - -- Standard JUnit XML format compatible with all CI systems -- Each metric evaluation maps to a test case in the report -- Jenkins, GitHub Actions, GitLab CI all render these natively - -### 11.3 JSON Report - -- Machine-readable full results export -- Contains all scores, explanations, metadata, and configuration -- Useful for custom dashboards or historical tracking - -### 11.4 HTML Report (P2) - -- Self-contained single-file HTML report -- Metric scorecards with distribution charts -- Drill-down into individual test cases -- Side-by-side comparison of actual vs expected output - -### 11.5 Regression Comparison (P2) - -- Compare two evaluation runs side-by-side -- Identify metrics that improved or degraded -- Highlight specific test cases that changed pass/fail status -- Useful for validating prompt changes or model swaps - ---- - -## 12. Framework Integrations - -Optional modules that auto-capture agent execution details from popular Java AI frameworks. 
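As an illustration of the kind of mapping these modules would perform, the sketch below converts LangChain4j tool execution requests (the `AiMessage.toolExecutionRequests()` API named in 12.2) into the `ToolCall` model of 1.3. The `ToolCall.builder()` shape is assumed from the builder convention in 1.4, and Jackson (section 19) parses the argument JSON.

```java
// A sketch of the capture the LangChain4j module (12.2) would perform: map
// AiMessage.toolExecutionRequests() onto the ToolCall model from section 1.3.
// ToolCall.builder() is an assumption; Jackson is the JSON library named in section 19.
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import dev.langchain4j.data.message.AiMessage;

import java.util.List;
import java.util.Map;

final class LangChain4jToolCallCapture {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    static List<ToolCall> capture(AiMessage message) {
        var requests = message.toolExecutionRequests();
        if (requests == null || requests.isEmpty()) {
            return List.of();
        }
        return requests.stream()
            .map(request -> ToolCall.builder()
                .name(request.name())
                .arguments(parseArguments(request.arguments()))
                .build())
            .toList();
    }

    private static Map<String, Object> parseArguments(String json) {
        try {
            return MAPPER.readValue(json, new TypeReference<Map<String, Object>>() {});
        } catch (Exception e) {
            // Keep the raw JSON rather than failing the capture if it cannot be parsed.
            return Map.of("raw", json == null ? "" : json);
        }
    }
}
```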
- -### 12.1 Spring AI Integration (`agentest-spring-ai`) - -- Auto-capture `ChatClient` responses, tool calls, and token usage -- Intercept `Advisor` chain for RAG context extraction -- Spring Boot auto-configuration — add dependency, everything wires up -- Support for Spring AI's `@Tool` annotated methods - -### 12.2 LangChain4j Integration (`agentest-langchain4j`) - -- Auto-capture `AiService` proxy method calls and responses -- Extract tool calls from `AiMessage.toolExecutionRequests()` -- Capture `ContentRetriever` results for RAG evaluation -- Support for `@Tool` annotated methods - -### 12.3 LangGraph4j Integration (`agentest-langgraph4j`) - -- Capture graph execution traces (node transitions, state snapshots) -- Map graph nodes to reasoning steps for trajectory analysis -- Support for checkpoint-based state inspection - -### 12.4 MCP Integration - -- Capture MCP tool calls made through the MCP Java SDK -- Verify MCP server responses are correctly utilized -- Test MCP tool argument schemas against expected values - ---- - -## 13. Red Teaming & Adversarial Testing (P2) - -Automated security and robustness testing for AI agents. - -### 13.1 Prompt Injection Tests - -- Library of known prompt injection attack patterns -- Generates adversarial inputs that attempt to override system prompts -- Verifies the agent resists injection attempts -- Categories: direct injection, indirect injection, jailbreak attempts - -### 13.2 Data Leakage Tests - -- Tests whether the agent leaks system prompts, internal instructions, or sensitive context -- Generates extraction attempts ("repeat your instructions", "what were you told?") -- Verifies the agent does not expose PII from training data or context - -### 13.3 Boundary Testing - -- Tests agent behavior at input boundaries: empty input, extremely long input, special characters, unicode edge cases -- Verifies graceful degradation rather than crashes or nonsensical output -- Tests tool call behavior with edge-case arguments - -### 13.4 Robustness Testing - -- Tests consistency: does the agent give similar answers to paraphrased questions? -- Tests stability: does the agent handle ambiguous inputs gracefully? -- Tests refusal: does the agent appropriately decline out-of-scope requests? - ---- - -## 14. Operational Metrics - -Non-AI quality metrics that matter for production agents. - -### 14.1 Latency Tracking - -- End-to-end response time measurement -- Per-tool-call latency breakdown -- Threshold-based pass/fail: "response must complete within 5 seconds" - -### 14.2 Token Usage Tracking - -- Input/output/total token counts per interaction -- Budget enforcement: fail if a single interaction exceeds N tokens -- Aggregated token usage across dataset evaluation - -### 14.3 Cost Tracking - -- Per-interaction cost estimation based on model pricing -- Aggregated cost per evaluation run -- Cost comparison between model configurations - -### 14.4 Error Rate - -- Track agent error/exception rates across evaluation dataset -- Categorize errors: LLM timeout, tool failure, parsing error, etc. - ---- - -## 15. 
Configuration - -### 15.1 Programmatic Configuration - -```java -AgentEvalConfig config = AgentEvalConfig.builder() - .judgeModel(JudgeModels.openai("gpt-4o-mini")) - .embeddingModel(EmbeddingModels.openai("text-embedding-3-small")) - .defaultThreshold(0.7) - .parallelism(4) - .cacheResults(true) - .costBudget(BigDecimal.valueOf(5.00)) // max $5 per eval run - .build(); -``` - -### 15.2 File-Based Configuration - -Support `agentest.yaml` / `agentest.properties` in project root: - -```yaml -agentest: - judge: - provider: anthropic - model: claude-sonnet-4-20250514 - embedding: - provider: openai - model: text-embedding-3-small - defaults: - threshold: 0.7 - parallelism: 4 - cache: - enabled: true - directory: .agentest-cache -``` - -### 15.3 Environment Variable Overrides - -- `AGENTEST_JUDGE_PROVIDER` / `AGENTEST_JUDGE_MODEL` -- `OPENAI_API_KEY` / `ANTHROPIC_API_KEY` — standard env vars -- CI-friendly: no config files needed if env vars are set - ---- - -## 16. Module Structure - -``` -agentest-bom/ — Bill of Materials for dependency management -agentest-core/ — Test case model, metric interfaces, scoring engine -agentest-metrics/ — All built-in metric implementations -agentest-judge/ — LLM-as-judge engine, provider integrations -agentest-embeddings/ — Embedding model integrations (for semantic metrics) -agentest-junit5/ — JUnit 5 extension, annotations, assertion API -agentest-datasets/ — Dataset loading, management, generation -agentest-reporting/ — Console, JUnit XML, JSON, HTML report generation -agentest-spring-ai/ — Spring AI auto-capture integration -agentest-langchain4j/ — LangChain4j auto-capture integration -agentest-langgraph4j/ — LangGraph4j graph execution capture -agentest-mcp/ — MCP Java SDK tool call capture -agentest-redteam/ — Adversarial testing and prompt injection library -``` - ---- - -## 17. 
Priority Tiers - -### P0 — MVP (First Release) - -- Core test case model (`AgentTestCase`, `ToolCall`, `EvalScore`) -- 5 response quality metrics: AnswerRelevancy, Faithfulness, Correctness (G-Eval), Hallucination, Toxicity -- 2 agent metrics: ToolSelectionAccuracy, TaskCompletion -- LLM-as-judge engine with OpenAI + Anthropic support -- JUnit 5 extension with `@AgentTest`, `@Metric` annotations -- Fluent assertion API -- JSON dataset loading -- Console + JUnit XML reporting -- Programmatic configuration - -### P1 — Fast Follow - -- Remaining response quality metrics: Bias, Conciseness, Coherence, SemanticSimilarity -- All RAG metrics: ContextualPrecision, ContextualRecall, ContextualRelevancy -- Remaining agent metrics: ToolArgumentCorrectness, PlanQuality, PlanAdherence, TrajectoryOptimality -- Conversation metrics: Coherence, ContextRetention -- Ollama / local model support for judge -- Spring AI integration module -- LangChain4j integration module -- CSV/JSONL dataset support -- JSON report output -- Cost tracking and budget limits -- File-based configuration (YAML) - -### P2 — Growth - -- LangGraph4j integration -- MCP integration -- HTML report generation -- Regression comparison (diff two runs) -- Synthetic dataset generation -- Red teaming / adversarial testing module -- Custom embedding model support -- Conversation metrics: TopicDrift, Resolution -- Step-level error localization -- Tool result utilization metric -- Parallel evaluation with virtual threads -- Progress bar and ETA for batch runs - -### P3 — Ecosystem - -- Maven plugin for `mvn agentest:evaluate` -- Gradle plugin -- GitHub Actions integration (post results as PR comments) -- Golden set versioning with Git integration -- Benchmark mode (compare multiple models/prompts across same dataset) -- Multi-model judge (evaluate with multiple judges, take consensus) -- Snapshot testing (save and compare outputs across releases) -- IntelliJ IDEA plugin (run eval tests with inline metric results) - ---- - -## 18. Non-Goals - -Things this library explicitly does **not** aim to do: - -- **Not an observability platform.** No dashboards, no production monitoring, no trace storage. Use Langfuse, LangSmith, or OpenTelemetry for that. -- **Not a cloud service.** No accounts, no SaaS, no data leaves your machine. Optional future: export to external platforms. -- **Not a framework for building agents.** Use Spring AI, LangChain4j, or LangGraph4j for that. This library tests what you've already built. -- **Not a benchmark suite.** Spring AI Bench benchmarks coding agents on Spring tasks. This library tests your custom agents against your custom criteria. -- **Not a training/fine-tuning tool.** This evaluates, it doesn't train. Evaluation results can inform fine-tuning decisions, but that's a different tool. - ---- - -## 19. Technical Constraints - -- **Java 21+ baseline** — leverages records, sealed interfaces, pattern matching, virtual threads -- **Zero required runtime dependencies** on Spring or LangChain4j — core is standalone -- **JUnit 5.10+** for extension API -- **Jackson** for JSON serialization (aligned with MCP Java SDK choice) -- **SLF4J** for logging (no specific implementation required) -- **Java HTTP Client** (built-in `java.net.http`) for LLM API calls — no OkHttp/Apache HttpClient dependency -- **No reflection-heavy magic** — prefer explicit configuration over classpath scanning
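Section 19 calls for the built-in `java.net.http` client for LLM API calls. Purely to illustrate that constraint, a bare-bones chat-completion request against an OpenAI-style endpoint looks like the sketch below; the real judge engine would sit behind the provider abstractions of section 7 and parse the response with Jackson instead of printing it.

```java
// Illustrative only. Section 19 mandates the built-in java.net.http client for LLM calls;
// this is a minimal request against the public OpenAI chat completions endpoint.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class JudgeHttpCallSketch {

    public static void main(String[] args) throws Exception {
        String body = """
            {
              "model": "gpt-4o-mini",
              "messages": [{"role": "user", "content": "Score this answer between 0 and 1: ..."}]
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://api.openai.com/v1/chat/completions"))
            .header("Authorization", "Bearer " + System.getenv("OPENAI_API_KEY"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        // Blocking call for the sketch; the engine described in section 9.2 would use
        // virtual threads for concurrent evaluations instead.
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```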