
update(tb-session): add workflow guidance to SKILL.md #16

Open
dafilipaj wants to merge 2 commits into main from update/tb-session-skill-workflow-guidance

@dafilipaj

Summary

  • Adds a Workflow section to tb-session SKILL.md with 6 explicit instructions for Claude Code agents
  • Addresses behavioral gaps identified through empirical testing: missing --json flag, inconsistent result presentation, failure to execute resume when asked, excessive tool calls

What changed

Added a ## Workflow section between ## Important and ## Getting started. It contains:

  1. Always use --json — agents were inconsistently using human-readable output
  2. Command selection matrix — maps user intent to the correct tb-session subcommand
  3. Structured field presentation — agents must show session ID, branch, summary, and timestamp for every result
  4. Concrete next-step offers — always offer show <id> or resume <id> after results
  5. MUST resume — when user says "resume", the agent must call tb-session resume <id>, not just describe the session
  6. Efficiency target — aim for 1–3 tool calls per request
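
Items 3 and 4 above can be sketched as a small formatter. This is only an illustration: the JSON field names (session_id, branch, summary, timestamp) and the sample record are assumptions made up for the sketch, not the documented tb-session --json schema.

```python
import json

# Hypothetical --json output. The field names and the record itself are
# invented for illustration; the real tb-session schema may differ.
SAMPLE_OUTPUT = """
[
  {"session_id": "abc123",
   "branch": "feature/auth-refactor",
   "summary": "Refactored auth middleware",
   "timestamp": "2026-04-02T10:15:00Z"}
]
"""


def present_sessions(raw_json: str) -> str:
    """Render each session with the four required fields plus a next step."""
    sessions = json.loads(raw_json)
    if not sessions:
        # No match: there is no <id> to offer, so suggest a new search.
        return "No matching sessions found. Try different keywords."
    lines = []
    for s in sessions:
        sid = s["session_id"]
        lines.append(f"- {sid} | {s['branch']} | {s['summary']} | {s['timestamp']}")
        # Concrete next-step offer, as required by item 4.
        lines.append(f"  Next: tb-session show {sid} or tb-session resume {sid}")
    return "\n".join(lines)


print(present_sessions(SAMPLE_OUTPUT))
```

The empty-result branch mirrors the behavior noted under Key findings, where agents suggest different keywords when nothing matches.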

Autoresearch results

Improvements were validated using the p-autoresearch framework with 4 eval scenarios × 3 runs each (12 trials per experiment), scored by independent judge agents on 5 metrics.

Experiment progression

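| Experiment | Composite | Weighted avg | Stddev | Delta (composite vs. baseline) |
| --- | --- | --- | --- | --- |
| exp000 (baseline) | 0.718 | 0.810 | 0.184 | |
| exp001 (+workflow guidance) | 0.882 | 0.930 | 0.097 | +0.164 |
| exp002 (+structured fields) | 0.932 | 0.948 | 0.032 | +0.215 |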

Per-metric improvement (baseline → exp002)

| Metric (weight) | Baseline | Exp002 | Change |
| --- | --- | --- | --- |
| command_accuracy (0.35) | 0.795 | 1.000 | +0.205 |
| result_interpretation (0.20) | 0.746 | 0.867 | +0.121 |
| answer_faithfulness (0.20) | 0.875 | 1.000 | +0.125 |
| completeness (0.15) | 0.688 | 0.854 | +0.166 |
| efficiency (0.10) | 0.458 | 1.000 | +0.542 |
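
As a quick sanity check, the Change column can be recomputed from the Baseline and Exp002 values. The sketch below uses only the numbers from the per-metric table; it also confirms that the five weights sum to 1. It does not attempt to reproduce the Composite or Weighted avg columns, since this description does not state how those are aggregated across trials.

```python
# (weight, baseline, exp002) copied from the per-metric table.
METRICS = {
    "command_accuracy": (0.35, 0.795, 1.000),
    "result_interpretation": (0.20, 0.746, 0.867),
    "answer_faithfulness": (0.20, 0.875, 1.000),
    "completeness": (0.15, 0.688, 0.854),
    "efficiency": (0.10, 0.458, 1.000),
}


def metric_changes(metrics):
    """Return exp002 minus baseline for each metric, rounded to 3 decimals."""
    return {
        name: round(exp002 - baseline, 3)
        for name, (_weight, baseline, exp002) in metrics.items()
    }


changes = metric_changes(METRICS)
total_weight = sum(weight for weight, _b, _e in METRICS.values())

for name, delta in changes.items():
    print(f"{name}: {delta:+.3f}")
print(f"total weight: {total_weight:.2f}")
```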

Eval scenarios

  1. Find by content — "Find the session where we worked on the auth middleware refactor"
  2. PR lookup — "What sessions are related to PR #125?"
  3. Resume request — "Resume the session where we were working on the help center skill" (critical — tests the known resume behavioral gap)
  4. Recent activity — "What have we been working on this past week?"

Key findings

  • --json compliance: in the baseline, 2/3 runs missed the flag → in exp002, 12/12 trials use it
  • Efficiency: Baseline ranged 1–8 tool calls → exp002 all within 1–3
  • Resume follow-through: All experiments consistently called tb-session resume (the baseline already did this with Sonnet — the real-world gap may be model-specific or context-dependent)
  • Remaining gaps: Scenario 001 completeness (0.5) — agents offer "try different keywords" instead of show <id> when no match exists. This is arguably correct behavior.
  • Consistency: Stddev dropped from 0.184 → 0.032 (roughly 6× lower, i.e. much more consistent scores across trials)

Token usage

| | Baseline avg | Exp002 avg | Change |
| --- | --- | --- | --- |
| Tokens/trial | 18,388 | 17,238 | -6.3% |

Despite adding 22 lines of instructions, token usage slightly decreased due to fewer retry searches and more efficient command selection.
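
For reference, the -6.3% figure follows directly from the two per-trial averages. A minimal sketch of the arithmetic (the helper name pct_change is just for illustration):

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


baseline_tokens = 18_388  # Baseline avg tokens per trial
exp002_tokens = 17_238    # Exp002 avg tokens per trial

change = pct_change(baseline_tokens, exp002_tokens)
print(f"{change:+.1f}%")  # -6.3%
```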

Test plan

  • Verify SKILL.md loads correctly via tb-session skill help
  • Test each scenario manually: content search, PR lookup, resume, recent activity
  • Confirm --json is used by default in agent responses
  • Confirm resume is actually called (not just described) when user says "resume"

🤖 Generated with Claude Code

dafilipaj and others added 2 commits April 9, 2026 13:35
…gent usage

Adds a structured Workflow section that instructs Claude Code how to use
tb-session effectively: always use --json, pick the right command per intent,
present results with structured fields, offer concrete next steps, and
actually execute resume when asked.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dafilipaj dafilipaj requested a review from wnbsmart April 9, 2026 11:41
@dafilipaj dafilipaj marked this pull request as ready for review April 9, 2026 11:41