docs: add multi-turn conversation evaluation support proposal (SUP-0001)#40
docs: add multi-turn conversation evaluation support proposal (SUP-0001)#40lbfsc wants to merge 1 commit into
Conversation
…posals - Introduced CONTRIBUTING.md to outline the proposal process and requirements. - Created README.md to provide an overview of enhancement proposals and link to the contributing guide. - Implemented init-proposal.sh script to bootstrap new proposals with metadata and a template. - Added proposal-template.md.template for consistent proposal structure and content.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a0bbaeeb3e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| exit 0 | ||
| ;; | ||
| -s|--status) | ||
| STATUS=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]') |
There was a problem hiding this comment.
Validate required values for option flags
The option parser reads $2 directly for --status, --author, and --output without first checking that a value exists, so invoking the script with a dangling flag (for example init-proposal.sh --status) crashes under set -u with an unbound variable error instead of returning a user-facing usage error. This makes the bootstrap tool brittle for normal CLI misuse and is easy to fix by guarding [[ $# -ge 2 ]] (or equivalent) before dereferencing $2 in each flag branch.
Useful? React with 👍 / 👎.
| @@ -0,0 +1,132 @@ | |||
| --- | |||
| title: {{title}} | |||
There was a problem hiding this comment.
Quote title in YAML front matter template
The template writes title as an unquoted YAML scalar, so common titles containing : (for example My: Proposal) produce invalid front matter (mapping values are not allowed when parsed). This makes the generated proposal file syntactically broken for valid human titles; quoting and escaping the title field is needed to keep generated markdown parseable.
Useful? React with 👍 / 👎.
Feedback: Add Scenario-Driven Multi-Turn EvaluationThe current proposal is valuable because it makes multi-turn evaluation real: each turn is executed separately, the Agent session is resumed between turns, and the evaluator can inspect intermediate responses through However, I think there is an important user-experience risk worth calling out: asking users to author static multi-turn cases can become difficult and fragile for many real-world Skills. Concern: Static
|
| Mode | Best For | User Responsibility |
|---|---|---|
input.turns |
Deterministic scripted multi-turn tests, protocol checks, guardrail verification, exact regressions | Author each turn and per-turn assertions |
input.conversation with chatter_agent |
Adaptive business workflows, clarification flows, goal-oriented task completion | Describe the objective, available knowledge, and judge criteria |
In other words, input.turns is a low-level deterministic mechanism, while chatter_agent is a higher-level scenario mechanism.
Additional Design Considerations
To make chatter_agent mode reliable enough for evaluation, the proposal may need to define several constraints:
max_turnsshould be required to control cost and prevent infinite conversations.- The ChatterAgent should have a clear stop condition, such as task completed, task impossible, or max turns reached.
- The transcript should mark which messages came from the ChatterAgent and which came from the Agent-With-Skill.
- The Judge should receive the full transcript exactly like static multi-turn mode.
- The final report should include the generated user turns so failures remain debuggable.
- The ChatterAgent should be instructed not to evaluate the Agent directly, but only to play the simulated user role.
This approach would reduce the burden of authoring multi-turn cases while still preserving the current proposal's core architecture: real turn-by-turn execution, session continuity, transcript collection, and reuse of existing judge types.
| - turn_response_contains: # Assert turn 2 response contains rejection keywords | ||
| turn: 2 | ||
| contains_any: ["need to complete first", "cannot skip", "execute in order"] | ||
| failure: |
There was a problem hiding this comment.
Why do we need failure criteria?
What happens if a response meets neither success nor failure criteria?
|
|
||
| | Proposal | Title | Status | Last Updated | | ||
| | :---------------------------------------------------: | :--------------------------------: | :----: | :----------: | | ||
| | [Proposal-0001](0001-multi-turn-conversation-eval.md) | Multi-Turn Conversation Evaluation | draft | 2026-05-21 | No newline at end of file |
There was a problem hiding this comment.
🟠 Major: README 索引行与提案 front matter 不一致(status / 日期 / 标题)
索引行写 status: draft / Last Updated: 2026-05-21 / 标题 Multi-Turn Conversation Evaluation;而提案 front matter(0001-...eval.md:2-7)是 status: provisional / 2026-05-19 / Multi-Turn Conversation Evaluation Support。PR 描述也自称 provisional,三处口径打架。索引是状态看板,错误状态会误导评审者判断生命周期阶段。
建议:以提案 front matter 为准,同步本行三列(provisional / 2026-05-19 / 带 “Support” 的标题)。
来源:Claude Code /code-review skill @ a0bbaee,已按 docs/review-heuristics.md 过滤
| status: provisional | ||
| --- | ||
|
|
||
| # SUP-0001: Multi-Turn Conversation Evaluation Support |
There was a problem hiding this comment.
🟠 Major: 命名方案分裂 — SUP-0001 vs 模板生成的 Proposal-NNNN
首个手写提案用 SUP-0001 前缀,但 init-proposal.sh 经 proposal-template.md.template:10(# Proposal-{{proposal_id}}:)生成的所有后续提案会是 Proposal-NNNN,README 链接文字又叫 Proposal-0001。第一篇就和工具产物分叉了。
建议:二选一定标准 —— 要么模板改成 SUP-{{proposal_id}}(推荐,“SUP” 是本仓库专有缩写,比通用 “Proposal” 更可检索),要么本提案标题改回 Proposal-0001。建议与维护者确认偏好。
来源:Claude Code /code-review skill @ a0bbaee,已按 docs/review-heuristics.md 过滤
| result.SessionResult.Turns = turnsExecuted | ||
|
|
||
| // Check if any turn failed or all were skipped | ||
| if hasFailedTurn(turnResults) { |
There was a problem hiding this comment.
🟠 Major: 硬执行错误(TurnError)被误报为 SKIP,与本提案 “no silent fallback” 原则矛盾
hasFailedTurn()(:907)只匹配 TurnFailed,不含 TurnError;allSkipped()(:917)仅判断“无 TurnCompleted”。当 turn 1 因鉴权失败 / 限流 / 非零退出而 TurnError 时 → hasFailedTurn=false 且 allSkipped=true → 结果落到 StatusSkip。真正的基础设施 ERROR 被当成 SKIP,掩盖失败并污染通过率统计。而 Risks 表(:309)明确写 “resume failure → Mark as ERROR … no silent fallback”,但代码路径里根本没有把 TurnError 映射到 ERROR 的分支。
建议:在 finalize 中显式优先判断 TurnError → StatusError。
来源:Claude Code /code-review skill @ a0bbaee,已按 docs/review-heuristics.md 过滤
| | Engine | Session ID Generation | Session ID Storage | | ||
| | ----------- | -------------------------------------- | ---------------------------------------------------------------- | | ||
| | claude_code | `uuid.New()` passed via `--session-id` | `claudePrintJSONResult.SessionID` field, parsed from JSON output | | ||
| | codex | Auto-generated by codex CLI | Most recent session filename under `~/.codex/sessions/` | |
There was a problem hiding this comment.
🟠 Major: codex 用 “~/.codex/sessions/ 最新文件” 做 session 检测,并发跑 case 会串号
Non-Goals 只排除了“并行 turn”,没排除“并行 case”,而 skill-up 是会并发跑 case 的。两个 codex case 近乎同时 Run,靠取全局最新 .jsonl 会拿到另一个 case 刚写的 session 文件 → A 去 resume 了 B 的会话,上下文交叉污染。机制上“刚启动的会话”与“挑到的文件”之间没有任何关联键。
建议:在设计中明确 codex session 与 case 的关联方式(启动前后 diff 目录、或推动 codex 暴露 session-id),或显式声明 codex 多轮仅支持串行执行。
来源:Claude Code /code-review skill @ a0bbaee,已按 docs/review-heuristics.md 过滤
| } | ||
|
|
||
| // 7. Execute capture | ||
| for _, cap := range turn.Capture { |
There was a problem hiding this comment.
🟠 Major: capture 失败静默返回 "",导致 {{变量}} 原样泄漏进下一轮 prompt
正则 / JSONPath 没匹配时 extractCapturedValue 返回 "",被这里的 if value != "" 守卫丢弃 → capturedVars 无此键 → renderTemplate(:777)找不到可替换项 → 下一轮 content 里残留字面量 {{table_name}} 直接发给 Agent。模型随机性(Caveats :299 自己强调过)下这很常见,且无 ERROR、无诊断,又一次违反 “no silent fallback”。
建议:capture 失败时记录诊断 / 标记 turn,或对“引用了未捕获变量”的模板做未替换检测并报错。
来源:Claude Code /code-review skill @ a0bbaee,已按 docs/review-heuristics.md 过滤
| // "response": "...", | ||
| // "transcript": { "tool_calls": [...], "tool_results": [...] } | ||
| // } | ||
| func extractByJSONPath(path, response string, sr *agent.SessionResult) string { |
There was a problem hiding this comment.
🟠 Major: JSONPath 示例路径与 extractByJSONPath 实际构造的根对象不一致
本函数把 tool 数据嵌在 transcript 下({response, transcript:{tool_calls, tool_results}}),所以正确路径应是 $.transcript.tool_results[0].id。而 CaptureRule 示例(:356)写的是 $.tool_results[0].id,照此必然解析失败 → 返回 "" → 叠加上一条的静默失败。文档内两处形态互相矛盾,照示例写的每条 JSONPath capture 都会落空。
建议:统一根对象 schema 与示例路径(二选一)。
来源:Claude Code /code-review skill @ a0bbaee,已按 docs/review-heuristics.md 过滤
| FilesNotExist []string `json:"files_not_exist,omitempty" yaml:"files_not_exist,omitempty"` | ||
|
|
||
| // New: per-turn assertions | ||
| TurnResponseContains *TurnResponseContainsRule `json:"turn_response_contains,omitempty" yaml:"turn_response_contains,omitempty"` |
There was a problem hiding this comment.
🟠 Major: 新增 Rule 字段只有 yaml tag,缺 json tag,破坏既有双 tag 约定与 judge debug JSON 路径
TurnResponseContains 有 json+yaml 双 tag,但下面三个(TurnResponseNotContains / ToolCalledInTurn / ToolNotCalledInTurn)只有 yaml。已核实仓库现状:internal/config/schema.go:132-152 所有 Rule/JudgeConfig 字段一律双 tag,且 internal/cli/judge_debug.go 走 JSON 反序列化。yaml-only 字段在 judge debug 路径下只能靠 Go 默认按字段名(TurnResponseNotContains)大小写不敏感匹配,而非预期的 snake_case 键 → 用户写 turn_response_not_contains 时字段静默为 nil,断言被悄悄丢弃造成误 PASS。不对称很隐蔽。
建议:按既有约定给三个新字段补 json:"...,omitempty" 双 tag。
来源:Claude Code /code-review skill @ a0bbaee,已按 docs/review-heuristics.md 过滤
zpzjzj
left a comment
There was a problem hiding this comment.
代码评审报告
| 项目 | 内容 |
|---|---|
| 评审来源 | Claude Code /code-review skill(effort=medium)+ 3 路 finder + 跨文件证据补强 @ a0bbaee |
| 分支 | docs/add-multi-turn-conversation-eval-support-proposal → main |
| 变更规模 | +2403 行 / 5 文件(1984 行设计文档 + 197 行 shell 脚本 + 3 支撑文档) |
性质说明:这是 proposal / 文档 PR。真正随本 PR 落地执行的代码只有
init-proposal.sh;文档里大段 Go 是设计示意代码,尚未实现。故 Go 部分按「实现前需在设计中澄清的设计缺陷」对待,不作为 Blocker。本 PR 无 Blocker。
问题汇总
- 🔴 Blocker: 0 个
- 🟠 Major: 7 个(2 个文档事实性问题 + 5 个设计层缺陷,均已 inline 标注)
- 🔵 Info: 4 个(见下)
整体评价
关键改进建议
- 文档对齐:README 索引行的 status/日期/标题与提案 front matter 同步(
provisional/2026-05-19/ 带 “Support”)。 - 命名定标准:统一
SUP-NNNN或Proposal-NNNN,让模板产物与首篇提案一致。 - 错误语义:设计中补
TurnError → ERROR分支,并让 capture 失败可观测,落实 Risks 表已声明的 “no silent fallback”。 - 并发安全:明确 codex session 与 case 的关联(避免“最新文件”串号)。
- schema 一致性:新增 Rule 字段补齐
json双 tag;统一 JSONPath 根对象 schema 与示例。
🔵 Info 级建议(init-proposal.sh,dev 辅助脚本,健壮性)
:112—-s/-a/-o作为末位参数缺值时,set -u下访问$2触发unbound variable崩溃,而非干净的 “option requires an argument” 提示。:173—--output foo.md(basename 无数字前缀)时PROPOSAL_ID取成foo.md,且不补零(42-x.md→42而非0042),渲染出脏标题。:157— 纯符号 / 纯非 ASCII 标题经slugify后 slug 为空,生成0001-.md这种悬挂分隔符文件名且静默创建。:195〔待核实〕—--output指向不存在目录时,重定向直接报 shell 错误并在set -e下退出,无友好提示(未mkdir -p父目录)。
本评审由 Claude Code 内建 /code-review skill 生成原始 finding,经 pr-review-curator skill 的 docs/review-heuristics.md 过滤:原始 14 条意见做了“采纳 9 / 降级 3 / 驳回 2”处理。驳回项含一条疑似 off-by-one(经核实 1-indexed 下标正确)与一条 finder 自评最弱的 uniqueness 循环(实际行为正确);与 PR 上已有 Codex 评论(静态 turns 难维护的高层 UX 论述)无重叠。
Summary
This PR introduces the proposal process and infrastructure for skill-up Enhancement Proposals (SUPs), along with the first proposal SUP-0001: Multi-Turn Conversation Evaluation Support.
Currently, skill-up's evaluator concatenates all turns into a single instruction and sends it to the Agent Engine in one shot — there is no actual turn-by-turn interaction, intermediate assertions, or conditional branching. This PR proposes and designs full multi-turn conversation evaluation capabilities.
Changes
proposals/CONTRIBUTING.md: Documents the proposal process, lifecycle statuses, and how to create and review proposals.proposals/README.md: Index of all skill-up Enhancement Proposals; currently lists SUP-0001.proposals/init-proposal.sh: Shell script to bootstrap new proposals from the template (sequential numbering, YAML front matter, slug generation).proposals/proposal-template.md.template: Reusable template with standard sections (Summary, Motivation, Goals, Design Details, Test Plan, etc.).proposals/0001-multi-turn-conversation-eval.md: SUP-0001 (status:provisional) — full design for multi-turn conversation evaluation, covering:SessionResumerinterfacepost_conditionchecks (must_contain_any / must_contain_all / must_not_contain) withskip_remainingandfailstrategiescapturerules (regex + JSONPath) with{{variable}}template substitution across turnsturn_response_contains,turn_response_not_contains,tool_called_in_turn,tool_not_called_in_turnclaude_code(--resume) andcodex(codex exec resume);qoderclifalls back to single-shot modeinput.promptsingle-turn cases are unaffectedRelated issues
Test Plan
Unit tests, integration tests, and E2E tests are described in detail in the proposal document (
proposals/0001-multi-turn-conversation-eval.md#test-plan). Implementation will follow in subsequent PRs across the phases defined in the proposal.