update(tb-session): add workflow guidance to SKILL.md by dafilipaj · Pull Request #16 · productiveio/cli-toolbox

dafilipaj · 2026-04-09T11:35:51Z

Summary

Adds a Workflow section to tb-session SKILL.md with 6 explicit instructions for Claude Code agents
Addresses behavioral gaps identified through empirical testing: missing --json flag, inconsistent result presentation, failure to execute resume when asked, excessive tool calls

What changed

Added a ## Workflow section between ## Important and ## Getting started, containing:

Always use --json — agents were inconsistently using human-readable output
Command selection matrix — maps user intent to the correct tb-session subcommand
Structured field presentation — agents must show session ID, branch, summary, and timestamp for every result
Concrete next-step offers — always offer show <id> or resume <id> after results
MUST resume — when user says "resume", the agent must call tb-session resume <id>, not just describe the session
Efficiency target — aim for 1–3 tool calls per request
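
The structured-field and next-step rules above can be sketched roughly as follows. This is an illustration, not code from the PR: the JSON key names (`id`, `branch`, `summary`, `timestamp`) and the assumption that `--json` output is a list of session objects are hypothetical, since the actual schema is not shown here. Only the `show <id>` and `resume <id>` subcommands come from the description.

```python
import json

# Hypothetical field names: the real tb-session --json schema is not shown in this PR.
REQUIRED_FIELDS = ("id", "branch", "summary", "timestamp")


def present_results(raw_json: str) -> str:
    """Render every session with all four required fields plus next-step offers."""
    sessions = json.loads(raw_json)  # assumed to be a JSON list of objects
    if not sessions:
        # With no match there is no ID to offer, so suggest refining the search.
        return "No matching sessions. Try different keywords."

    lines = []
    for session in sessions:
        for field in REQUIRED_FIELDS:
            # Show a placeholder rather than silently dropping a missing field.
            lines.append(f"{field}: {session.get(field, '(missing)')}")
        sid = session.get("id", "<id>")
        lines.append(f"Next: tb-session show {sid} | tb-session resume {sid}")
        lines.append("")
    return "\n".join(lines).rstrip()


if __name__ == "__main__":
    sample = json.dumps([{
        "id": "abc123",  # placeholder values for illustration only
        "branch": "feature/example",
        "summary": "Example session",
        "timestamp": "2026-04-09T11:35:51Z",
    }])
    print(present_results(sample))
```

Rendering a `(missing)` placeholder keeps an incomplete record visible instead of hiding the gap from the user.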

Autoresearch results

Improvements were validated using the p-autoresearch framework with 4 eval scenarios × 3 runs each (12 trials per experiment), scored by independent judge agents on 5 metrics.

Experiment progression

Experiment	Composite	Weighted avg	Stddev	Composite delta vs. baseline
exp000 (baseline)	0.718	0.810	0.184	—
exp001 (+workflow guidance)	0.882	0.930	0.097	+0.164
exp002 (+structured fields)	0.932	0.948	0.032	+0.215

Per-metric improvement (baseline → exp002)

Metric (weight)	Baseline	Exp002	Change
command_accuracy (0.35)	0.795	1.000	+0.205
result_interpretation (0.20)	0.746	0.867	+0.121
answer_faithfulness (0.20)	0.875	1.000	+0.125
completeness (0.15)	0.688	0.854	+0.166
efficiency (0.10)	0.458	1.000	+0.542
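
As a quick sanity check on the table above, the Change column and the metric weights can be recomputed in a few lines. This is a sketch using only the numbers in the table. The PR does not describe how per-trial scores were aggregated into the Composite and Weighted avg columns, so `weighted_mean` below is just a weighted mean of the per-metric averages and is not expected to reproduce those columns.

```python
# Values copied from the per-metric table: (weight, baseline, exp002).
METRICS = {
    "command_accuracy": (0.35, 0.795, 1.000),
    "result_interpretation": (0.20, 0.746, 0.867),
    "answer_faithfulness": (0.20, 0.875, 1.000),
    "completeness": (0.15, 0.688, 0.854),
    "efficiency": (0.10, 0.458, 1.000),
}


def changes(metrics):
    """Per-metric change, exp002 minus baseline, rounded to 3 decimals."""
    return {name: round(exp - base, 3) for name, (_, base, exp) in metrics.items()}


def weighted_mean(metrics, column):
    """Weight-averaged score for one column (1 = baseline, 2 = exp002)."""
    total_weight = sum(row[0] for row in metrics.values())
    return sum(row[0] * row[column] for row in metrics.values()) / total_weight


if __name__ == "__main__":
    for name, delta in changes(METRICS).items():
        print(f"{name}: {delta:+.3f}")
    print(f"baseline weighted mean: {weighted_mean(METRICS, 1):.3f}")
    print(f"exp002 weighted mean:   {weighted_mean(METRICS, 2):.3f}")
```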

Eval scenarios

Find by content — "Find the session where we worked on the auth middleware refactor"
PR lookup — "What sessions are related to PR #125?"
Resume request — "Resume the session where we were working on the help center skill" (critical — tests the known resume behavioral gap)
Recent activity — "What have we been working on this past week?"
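
The trial matrix (4 scenarios × 3 runs = 12 trials per experiment) can be enumerated as below. This is only a sketch of the bookkeeping, not the p-autoresearch API. The scenario keys are hypothetical labels, while the prompts are the ones listed above.

```python
from itertools import product

# Hypothetical keys; prompts are quoted from the scenario list.
SCENARIOS = {
    "find_by_content": "Find the session where we worked on the auth middleware refactor",
    "pr_lookup": "What sessions are related to PR #125?",
    "resume_request": "Resume the session where we were working on the help center skill",
    "recent_activity": "What have we been working on this past week?",
}
RUNS_PER_SCENARIO = 3


def build_trials(scenarios, runs):
    """Return one (scenario_key, run_number, prompt) tuple per trial."""
    return [
        (key, run, scenarios[key])
        for key, run in product(scenarios, range(1, runs + 1))
    ]


if __name__ == "__main__":
    trials = build_trials(SCENARIOS, RUNS_PER_SCENARIO)
    print(len(trials))  # 12
```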

Key findings

--json compliance: Baseline 2/3 runs missed it → exp002 12/12 use it
Efficiency: Baseline ranged 1–8 tool calls → exp002 all within 1–3
Resume follow-through: Every experiment, including the baseline, consistently called tb-session resume. Because the baseline already did this with Sonnet, the gap seen in real-world use may be model-specific or context-dependent.
Remaining gaps: Scenario 001 completeness (0.5). Agents offer "try different keywords" instead of show <id> when no match exists. This is arguably correct behavior, since there is no session ID to offer when nothing matches.
Consistency: Stddev dropped from 0.184 to 0.032 (roughly 6× lower, i.e. more consistent across trials)

Token usage

	Baseline avg	Exp002 avg	Change
Tokens/trial	18,388	17,238	-6.3%

Despite adding 22 lines of instructions, token usage slightly decreased due to fewer retry searches and more efficient command selection.
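
The relative figures quoted in this PR (the -6.3% token change and the roughly 6× drop in stddev) follow from simple arithmetic on the reported averages, as this small sketch shows. The helper names are illustrative.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def reduction_factor(before: float, after: float) -> float:
    """How many times smaller `after` is than `before`."""
    return before / after


if __name__ == "__main__":
    # Tokens per trial: baseline avg vs. exp002 avg.
    print(round(percent_change(18_388, 17_238), 1))   # -6.3
    # Stddev: baseline vs. exp002.
    print(round(reduction_factor(0.184, 0.032), 2))   # 5.75
```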

Test plan

Verify SKILL.md loads correctly via tb-session skill help
Test each scenario manually: content search, PR lookup, resume, recent activity
Confirm --json is used by default in agent responses
Confirm resume is actually called (not just described) when user says "resume"
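
The last two checks could be partly automated over a recorded list of an agent's shell commands. The sketch below is hypothetical: it assumes each tool call is available as a command string, and that `--json` is expected on every tb-session call (the PR says "always use --json" but does not confirm it for each subcommand). The 1–3 call limit is the efficiency target stated above.

```python
import shlex


def check_commands(commands, expect_resume=False, max_calls=3):
    """Return a list of problems found in one request's tool calls."""
    problems = []
    if not 1 <= len(commands) <= max_calls:
        problems.append(f"expected 1-{max_calls} tool calls, got {len(commands)}")

    # Keep only tb-session invocations, split into argv lists.
    argvs = [shlex.split(cmd) for cmd in commands]
    tb_calls = [argv for argv in argvs if argv and argv[0] == "tb-session"]

    for argv in tb_calls:
        if "--json" not in argv:
            problems.append(f"missing --json: {' '.join(argv)}")

    # "Resume" must be executed, not just described.
    resumed = any(len(argv) >= 3 and argv[1] == "resume" for argv in tb_calls)
    if expect_resume and not resumed:
        problems.append("resume requested but `tb-session resume <id>` was never called")
    return problems


if __name__ == "__main__":
    # Placeholder session ID for illustration only.
    print(check_commands(["tb-session show abc123"], expect_resume=True))
```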

🤖 Generated with Claude Code

…gent usage

Adds a structured Workflow section that instructs Claude Code how to use tb-session effectively: always use --json, pick the right command per intent, present results with structured fields, offer concrete next steps, and actually execute resume when asked.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

dafilipaj and others added 2 commits April 9, 2026 13:35

tb-session: bump version to 0.2.0

e719bce

dafilipaj requested a review from wnbsmart April 9, 2026 11:41

dafilipaj marked this pull request as ready for review April 9, 2026 11:41

dafilipaj wants to merge 2 commits into main from update/tb-session-skill-workflow-guidance
