Skip to content

docs: add multi-turn conversation evaluation support proposal (SUP-0001)#40

Open
lbfsc wants to merge 1 commit into
mainfrom
docs/add-multi-turn-conversation-eval-support-proposal
Open

docs: add multi-turn conversation evaluation support proposal (SUP-0001)#40
lbfsc wants to merge 1 commit into
mainfrom
docs/add-multi-turn-conversation-eval-support-proposal

Conversation

@lbfsc
Copy link
Copy Markdown
Collaborator

@lbfsc lbfsc commented May 21, 2026

Summary

This PR introduces the proposal process and infrastructure for skill-up Enhancement Proposals (SUPs), along with the first proposal SUP-0001: Multi-Turn Conversation Evaluation Support.

Currently, skill-up's evaluator concatenates all turns into a single instruction and sends it to the Agent Engine in one shot — there is no actual turn-by-turn interaction, intermediate assertions, or conditional branching. This PR proposes and designs full multi-turn conversation evaluation capabilities.

Changes

  • proposals/CONTRIBUTING.md: Documents the proposal process, lifecycle statuses, and how to create and review proposals.
    • proposals/README.md: Index of all skill-up Enhancement Proposals; currently lists SUP-0001.
    • proposals/init-proposal.sh: Shell script to bootstrap new proposals from the template (sequential numbering, YAML front matter, slug generation).
    • proposals/proposal-template.md.template: Reusable template with standard sections (Summary, Motivation, Goals, Design Details, Test Plan, etc.).
    • proposals/0001-multi-turn-conversation-eval.md: SUP-0001 (status: provisional) — full design for multi-turn conversation evaluation, covering:
    • Turn-by-turn evaluator execution engine with session resume via SessionResumer interface
    • post_condition checks (must_contain_any / must_contain_all / must_not_contain) with skip_remaining and fail strategies
    • capture rules (regex + JSONPath) with {{variable}} template substitution across turns
    • Per-turn Judge assertions: turn_response_contains, turn_response_not_contains, tool_called_in_turn, tool_not_called_in_turn
    • Agent implementations: claude_code (--resume) and codex (codex exec resume); qodercli falls back to single-shot mode
    • Full backward compatibility: existing input.prompt single-turn cases are unaffected

Related issues

Test Plan

Unit tests, integration tests, and E2E tests are described in detail in the proposal document (proposals/0001-multi-turn-conversation-eval.md#test-plan). Implementation will follow in subsequent PRs across the phases defined in the proposal.

…posals

- Introduced CONTRIBUTING.md to outline the proposal process and requirements.
- Created README.md to provide an overview of enhancement proposals and link to the contributing guide.
- Implemented init-proposal.sh script to bootstrap new proposals with metadata and a template.
- Added proposal-template.md.template for consistent proposal structure and content.
@lbfsc lbfsc requested a review from hittyt as a code owner May 21, 2026 06:40
@lbfsc lbfsc requested review from jwx0925, lijunfeng722 and zpzjzj May 21, 2026 06:40
@lbfsc lbfsc self-assigned this May 21, 2026
@lbfsc lbfsc added the documentation Improvements or additions to documentation label May 21, 2026
@lbfsc lbfsc moved this from Backlog to In review in skill-up project May 21, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a0bbaeeb3e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

exit 0
;;
-s|--status)
STATUS=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate required values for option flags

The option parser reads $2 directly for --status, --author, and --output without first checking that a value exists, so invoking the script with a dangling flag (for example init-proposal.sh --status) crashes under set -u with an unbound variable error instead of returning a user-facing usage error. This makes the bootstrap tool brittle for normal CLI misuse and is easy to fix by guarding [[ $# -ge 2 ]] (or equivalent) before dereferencing $2 in each flag branch.

Useful? React with 👍 / 👎.

@@ -0,0 +1,132 @@
---
title: {{title}}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Quote title in YAML front matter template

The template writes title as an unquoted YAML scalar, so common titles containing : (for example My: Proposal) produce invalid front matter (mapping values are not allowed when parsed). This makes the generated proposal file syntactically broken for valid human titles; quoting and escaping the title field is needed to keep generated markdown parseable.

Useful? React with 👍 / 👎.

@lijunfeng722
Copy link
Copy Markdown
Collaborator

Feedback: Add Scenario-Driven Multi-Turn Evaluation

The current proposal is valuable because it makes multi-turn evaluation real: each turn is executed separately, the Agent session is resumed between turns, and the evaluator can inspect intermediate responses through post_condition, capture, and per-turn judge assertions.

However, I think there is an important user-experience risk worth calling out: asking users to author static multi-turn cases can become difficult and fragile for many real-world Skills.

Concern: Static turns Are Hard to Maintain

Multi-turn conversation is not just a fixed sequence of user messages. In many cases, the next user message depends on the Agent's previous response. Because LLM outputs are probabilistic, the expected response at each turn may vary across different Agents, different models, and even repeated runs with the same Agent and model.

Even if the framework provides loose assertion patterns such as contains_any, must_contain_any, regex capture, or template variables, users may still spend a lot of time tuning case YAML instead of describing the behavior they actually want to evaluate.

For example, suppose we have a Membership Management Skill that can create and update member profiles. A member may have fields such as name, gender, age, phone number, membership level, and tags.

The user's actual evaluation goal might be:

The Agent should create a new member named Alice. If required information is missing, it should ask for clarification. After receiving the missing information, it should create the member and then support a follow-up update to the member's membership level.

With the current static turns approach, the case author may try to write:

id: membership-create-and-upgrade
title: Create a member after clarification, then upgrade membership level

input:
  turns:
    - role: user
      content: "Create a member named Alice"
      post_condition:
        must_contain_any:
          - "age"
          - "gender"
          - "phone"
          - "missing information"
          - "please provide"
        on_fail: fail

    - role: user
      content: "Alice is 28, female, phone number is 13800000000"
      post_condition:
        must_contain_any:
          - "created"
          - "member ID"
          - "success"
        on_fail: fail

    - role: user
      content: "Upgrade her to Gold member"

judge:
  type: rule_based
  success:
    - tool_called:
        name: create_member
    - tool_called:
        name: update_member_level

But different Agents may reasonably behave in different ways:

  • One Agent may ask only for age and phone number.
  • Another Agent may ask for gender, age, phone number, and initial membership level.
  • Another Agent may create a draft member first and ask for confirmation.
  • Another Agent may create the member directly with defaults and then ask whether more profile fields should be completed.
  • Another Agent may call a tool successfully but summarize the result with wording that does not match the configured keywords.

The case author then needs to encode many acceptable branches using loose matching, optional post-conditions, capture rules, and retries. This can quickly turn multi-turn case writing into prompt/output chasing.

This does not mean input.turns is wrong. It is still very useful for deterministic protocol tests, such as phase gating, double confirmation, exact regression cases, and cases where each user turn must be precisely controlled. But for more open-ended business workflows, a higher-level mode may be easier for users to author and maintain.

Proposal: Add a ChatterAgent-Driven Conversation Mode

As a complementary mode to static input.turns, we could introduce a scenario-driven multi-turn evaluation mode powered by another Agent, tentatively called ChatterAgent.

In this mode, the case author does not need to predefine every user turn. Instead, each case describes:

  • the objective that the simulated user wants to complete;
  • the information available to the simulated user;
  • optional behavior constraints for how the simulated user should interact;
  • the maximum number of turns allowed.

Then the ChatterAgent talks with the Agent-With-Skill until the task appears complete, impossible, or the turn limit is reached. After that, the framework reuses the existing judge.type logic to evaluate the resulting transcript, tool calls, files, workspace state, or final answer.

Conceptually:

Case objective and chatter context
  -> ChatterAgent simulates the user
      -> talks with Agent-With-Skill for up to max_turns
          -> full transcript
              -> existing judge(type: rule_based / agent_judge / etc.)

The important point is that this should not introduce a separate evaluation framework. The ChatterAgent is responsible for generating adaptive user turns. The existing Judge remains responsible for deciding whether the Skill succeeded.

Possible Generic Schema

The schema should avoid fields that are too specific to a membership example, such as user_facts at the top level. A more generic design could place all simulated-user configuration under chatter, with knowledge as an arbitrary structured YAML object.

Also, task and chatter_instructions may overlap. A clearer split is:

  • objective: what the simulated user is trying to accomplish;
  • knowledge: information the simulated user knows and may provide during the conversation;
  • behavior: optional interaction constraints or style.

Example:

id: membership-create-and-upgrade
title: Create a member after clarification, then upgrade membership level

input:
  conversation:
    mode: chatter_agent
    max_turns: 6

    chatter:
      role: "business user"
      objective: >
        Create a new member named Alice, then upgrade her membership level to Gold.
        If the Agent asks for missing information, provide the required information.

      knowledge:
        member:
          name: Alice
          gender: female
          age: 28
          phone: "13800000000"
          initial_level: Silver
          target_level: Gold

      behavior:
        - "Act like a normal user, not an evaluator."
        - "Do not reveal all information at once unless the Agent asks for it."
        - "Do not invent information outside the provided knowledge."
        - "Stop once the task appears completed."

judge:
  type: rule_based
  success:
    - tool_called:
        name: create_member
        args:
          name: Alice
          age: 28
          gender: female
    - tool_called:
        name: update_member_level
        args:
          level: Gold

This keeps the case closer to the user's intent:

Can the Skill complete this task through conversation?

instead of forcing the user to answer:

Can I predict every intermediate Agent response and manually maintain the exact next user turn?

Suggested Positioning

The two modes can be complementary:

Mode Best For User Responsibility
input.turns Deterministic scripted multi-turn tests, protocol checks, guardrail verification, exact regressions Author each turn and per-turn assertions
input.conversation with chatter_agent Adaptive business workflows, clarification flows, goal-oriented task completion Describe the objective, available knowledge, and judge criteria

In other words, input.turns is a low-level deterministic mechanism, while chatter_agent is a higher-level scenario mechanism.

Additional Design Considerations

To make chatter_agent mode reliable enough for evaluation, the proposal may need to define several constraints:

  • max_turns should be required to control cost and prevent infinite conversations.
  • The ChatterAgent should have a clear stop condition, such as task completed, task impossible, or max turns reached.
  • The transcript should mark which messages came from the ChatterAgent and which came from the Agent-With-Skill.
  • The Judge should receive the full transcript exactly like static multi-turn mode.
  • The final report should include the generated user turns so failures remain debuggable.
  • The ChatterAgent should be instructed not to evaluate the Agent directly, but only to play the simulated user role.

This approach would reduce the burden of authoring multi-turn cases while still preserving the current proposal's core architecture: real turn-by-turn execution, session continuity, transcript collection, and reuse of existing judge types.

- turn_response_contains: # Assert turn 2 response contains rejection keywords
turn: 2
contains_any: ["need to complete first", "cannot skip", "execute in order"]
failure:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need failure criteria?
What happens if a response meets neither success nor failure criteria?

Comment thread proposals/README.md

| Proposal | Title | Status | Last Updated |
| :---------------------------------------------------: | :--------------------------------: | :----: | :----------: |
| [Proposal-0001](0001-multi-turn-conversation-eval.md) | Multi-Turn Conversation Evaluation | draft | 2026-05-21 | No newline at end of file
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Major: README 索引行与提案 front matter 不一致(status / 日期 / 标题)

索引行写 status: draft / Last Updated: 2026-05-21 / 标题 Multi-Turn Conversation Evaluation;而提案 front matter(0001-...eval.md:2-7)是 status: provisional / 2026-05-19 / Multi-Turn Conversation Evaluation Support。PR 描述也自称 provisional,三处口径打架。索引是状态看板,错误状态会误导评审者判断生命周期阶段。

建议:以提案 front matter 为准,同步本行三列(provisional / 2026-05-19 / 带 “Support” 的标题)。


来源:Claude Code /code-review skill @ a0bbaee,已按 docs/review-heuristics.md 过滤

status: provisional
---

# SUP-0001: Multi-Turn Conversation Evaluation Support
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Major: 命名方案分裂 — SUP-0001 vs 模板生成的 Proposal-NNNN

首个手写提案用 SUP-0001 前缀,但 init-proposal.shproposal-template.md.template:10# Proposal-{{proposal_id}}:)生成的所有后续提案会是 Proposal-NNNN,README 链接文字又叫 Proposal-0001。第一篇就和工具产物分叉了。

建议:二选一定标准 —— 要么模板改成 SUP-{{proposal_id}}(推荐,“SUP” 是本仓库专有缩写,比通用 “Proposal” 更可检索),要么本提案标题改回 Proposal-0001。建议与维护者确认偏好。


来源:Claude Code /code-review skill @ a0bbaee,已按 docs/review-heuristics.md 过滤

result.SessionResult.Turns = turnsExecuted

// Check if any turn failed or all were skipped
if hasFailedTurn(turnResults) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Major: 硬执行错误(TurnError)被误报为 SKIP,与本提案 “no silent fallback” 原则矛盾

hasFailedTurn():907)只匹配 TurnFailed,不含 TurnErrorallSkipped():917)仅判断“无 TurnCompleted”。当 turn 1 因鉴权失败 / 限流 / 非零退出而 TurnError 时 → hasFailedTurn=falseallSkipped=true → 结果落到 StatusSkip。真正的基础设施 ERROR 被当成 SKIP,掩盖失败并污染通过率统计。而 Risks 表(:309)明确写 “resume failure → Mark as ERROR … no silent fallback”,但代码路径里根本没有把 TurnError 映射到 ERROR 的分支。

建议:在 finalize 中显式优先判断 TurnError → StatusError


来源:Claude Code /code-review skill @ a0bbaee,已按 docs/review-heuristics.md 过滤

| Engine | Session ID Generation | Session ID Storage |
| ----------- | -------------------------------------- | ---------------------------------------------------------------- |
| claude_code | `uuid.New()` passed via `--session-id` | `claudePrintJSONResult.SessionID` field, parsed from JSON output |
| codex | Auto-generated by codex CLI | Most recent session filename under `~/.codex/sessions/` |
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Major: codex 用 “~/.codex/sessions/ 最新文件” 做 session 检测,并发跑 case 会串号

Non-Goals 只排除了“并行 turn”,没排除“并行 case”,而 skill-up 是会并发跑 case 的。两个 codex case 近乎同时 Run,靠取全局最新 .jsonl 会拿到另一个 case 刚写的 session 文件 → A 去 resume 了 B 的会话,上下文交叉污染。机制上“刚启动的会话”与“挑到的文件”之间没有任何关联键。

建议:在设计中明确 codex session 与 case 的关联方式(启动前后 diff 目录、或推动 codex 暴露 session-id),或显式声明 codex 多轮仅支持串行执行。


来源:Claude Code /code-review skill @ a0bbaee,已按 docs/review-heuristics.md 过滤

}

// 7. Execute capture
for _, cap := range turn.Capture {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Major: capture 失败静默返回 "",导致 {{变量}} 原样泄漏进下一轮 prompt

正则 / JSONPath 没匹配时 extractCapturedValue 返回 "",被这里的 if value != "" 守卫丢弃 → capturedVars 无此键 → renderTemplate:777)找不到可替换项 → 下一轮 content 里残留字面量 {{table_name}} 直接发给 Agent。模型随机性(Caveats :299 自己强调过)下这很常见,且无 ERROR、无诊断,又一次违反 “no silent fallback”。

建议:capture 失败时记录诊断 / 标记 turn,或对“引用了未捕获变量”的模板做未替换检测并报错。


来源:Claude Code /code-review skill @ a0bbaee,已按 docs/review-heuristics.md 过滤

// "response": "...",
// "transcript": { "tool_calls": [...], "tool_results": [...] }
// }
func extractByJSONPath(path, response string, sr *agent.SessionResult) string {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Major: JSONPath 示例路径与 extractByJSONPath 实际构造的根对象不一致

本函数把 tool 数据嵌在 transcript 下({response, transcript:{tool_calls, tool_results}}),所以正确路径应是 $.transcript.tool_results[0].id。而 CaptureRule 示例(:356)写的是 $.tool_results[0].id,照此必然解析失败 → 返回 "" → 叠加上一条的静默失败。文档内两处形态互相矛盾,照示例写的每条 JSONPath capture 都会落空。

建议:统一根对象 schema 与示例路径(二选一)。


来源:Claude Code /code-review skill @ a0bbaee,已按 docs/review-heuristics.md 过滤

FilesNotExist []string `json:"files_not_exist,omitempty" yaml:"files_not_exist,omitempty"`

// New: per-turn assertions
TurnResponseContains *TurnResponseContainsRule `json:"turn_response_contains,omitempty" yaml:"turn_response_contains,omitempty"`
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Major: 新增 Rule 字段只有 yaml tag,缺 json tag,破坏既有双 tag 约定与 judge debug JSON 路径

TurnResponseContainsjson+yaml 双 tag,但下面三个(TurnResponseNotContains / ToolCalledInTurn / ToolNotCalledInTurn)只有 yaml。已核实仓库现状:internal/config/schema.go:132-152 所有 Rule/JudgeConfig 字段一律双 tag,且 internal/cli/judge_debug.go 走 JSON 反序列化。yaml-only 字段在 judge debug 路径下只能靠 Go 默认按字段名(TurnResponseNotContains)大小写不敏感匹配,而非预期的 snake_case 键 → 用户写 turn_response_not_contains 时字段静默为 nil,断言被悄悄丢弃造成误 PASS。不对称很隐蔽。

建议:按既有约定给三个新字段补 json:"...,omitempty" 双 tag。


来源:Claude Code /code-review skill @ a0bbaee,已按 docs/review-heuristics.md 过滤

Copy link
Copy Markdown
Collaborator

@zpzjzj zpzjzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

代码评审报告

项目 内容
评审来源 Claude Code /code-review skill(effort=medium)+ 3 路 finder + 跨文件证据补强 @ a0bbaee
分支 docs/add-multi-turn-conversation-eval-support-proposalmain
变更规模 +2403 行 / 5 文件(1984 行设计文档 + 197 行 shell 脚本 + 3 支撑文档)

性质说明:这是 proposal / 文档 PR。真正随本 PR 落地执行的代码只有 init-proposal.sh;文档里大段 Go 是设计示意代码,尚未实现。故 Go 部分按「实现前需在设计中澄清的设计缺陷」对待,不作为 Blocker。本 PR 无 Blocker。

问题汇总

  • 🔴 Blocker: 0 个
  • 🟠 Major: 7 个(2 个文档事实性问题 + 5 个设计层缺陷,均已 inline 标注)
  • 🔵 Info: 4 个(见下)

整体评价

⚠️ 有条件合并 —— 概念方向清晰、向后兼容设计周全。建议合并前至少修掉两个本 PR 直接可改的文档问题:README 与 front matter 对齐命名方案(SUP- vs Proposal-)定标准。设计层 #3#7 可在文档补充说明或留待实现 PR,但 #3 / #5(静默吞错,违反提案自己声明的 “no silent fallback”)值得在合并前的设计讨论中明确。

关键改进建议

  1. 文档对齐:README 索引行的 status/日期/标题与提案 front matter 同步(provisional / 2026-05-19 / 带 “Support”)。
  2. 命名定标准:统一 SUP-NNNNProposal-NNNN,让模板产物与首篇提案一致。
  3. 错误语义:设计中补 TurnError → ERROR 分支,并让 capture 失败可观测,落实 Risks 表已声明的 “no silent fallback”。
  4. 并发安全:明确 codex session 与 case 的关联(避免“最新文件”串号)。
  5. schema 一致性:新增 Rule 字段补齐 json 双 tag;统一 JSONPath 根对象 schema 与示例。

🔵 Info 级建议(init-proposal.sh,dev 辅助脚本,健壮性)

  • :112-s/-a/-o 作为末位参数缺值时,set -u 下访问 $2 触发 unbound variable 崩溃,而非干净的 “option requires an argument” 提示。
  • :173--output foo.md(basename 无数字前缀)时 PROPOSAL_ID 取成 foo.md,且不补零(42-x.md42 而非 0042),渲染出脏标题。
  • :157 — 纯符号 / 纯非 ASCII 标题经 slugify 后 slug 为空,生成 0001-.md 这种悬挂分隔符文件名且静默创建。
  • :195〔待核实〕— --output 指向不存在目录时,重定向直接报 shell 错误并在 set -e 下退出,无友好提示(未 mkdir -p 父目录)。

本评审由 Claude Code 内建 /code-review skill 生成原始 finding,经 pr-review-curator skill 的 docs/review-heuristics.md 过滤:原始 14 条意见做了“采纳 9 / 降级 3 / 驳回 2”处理。驳回项含一条疑似 off-by-one(经核实 1-indexed 下标正确)与一条 finder 自评最弱的 uniqueness 循环(实际行为正确);与 PR 上已有 Codex 评论(静态 turns 难维护的高层 UX 论述)无重叠。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

4 participants