perf(discovery): Optimize token usage in Socratic question generation #259

@frankbria

Description

Summary

The Socratic discovery system includes the full conversation history in every AI prompt. With MAX_DISCOVERY_QUESTIONS=20, the history alone can grow to ~10,000 tokens, on top of the prompt overhead.

Current Behavior

Each call to _generate_next_discovery_question() builds a prompt containing:

  • Full conversation history (all Q&A pairs)
  • Structured answers by topic
  • Uncovered categories
  • Socratic guidelines

Worst-case scenario:

  • 20 turns × ~500 tokens/turn = ~10,000 tokens for history
  • Plus ~1,000 tokens for prompt template and metadata
  • Total: ~11,000 input tokens per question generation

Cost impact:

  • At $3/M input tokens (Claude 3.5 Sonnet): ~$0.033 per call
  • Full 20-question discovery: ~$0.66 in input tokens alone
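
Quick sanity check on that arithmetic:

TOKENS_PER_CALL = 11_000
PRICE_PER_M_INPUT = 3.00  # USD per million input tokens (Claude 3.5 Sonnet)
cost_per_call = TOKENS_PER_CALL * PRICE_PER_M_INPUT / 1_000_000  # ~= $0.033
cost_per_discovery = cost_per_call * 20                          # ~= $0.66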

Proposed Optimizations

Option 1: Conversation Summarization

After N turns (e.g., 10), fold the earlier turns into a condensed summary and keep only the recent turns verbatim:

if len(conversation_history) > 10:
    summary = self._summarize_turns(conversation_history[:10])
    recent_turns = conversation_history[10:]
    prompt += f"## Earlier Conversation Summary\n{summary}\n\n"
    prompt += "## Recent Conversation\n"
    for turn in recent_turns:
        # Q/A format assumed to match the prompt builder in Option 3.
        prompt += f"Q: {turn['question']}\nA: {turn['answer']}\n"
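
The issue doesn't define _summarize_turns; a minimal sketch, assuming a hypothetical self._call_llm(prompt) helper that wraps whatever client the discovery system already uses:

def _summarize_turns(self, turns):
    # Condense early Q&A turns into a short summary via one extra LLM call.
    # self._call_llm is a hypothetical stand-in for the existing client wrapper.
    transcript = "\n".join(
        f"Q: {t['question']}\nA: {t['answer']}" for t in turns
    )
    return self._call_llm(
        "Summarize the key facts and decisions from this discovery "
        "conversation in under 150 tokens:\n\n" + transcript
    )

Caching the summary and refreshing it only when new turns age out of the recent window keeps the extra call from eating into the savings.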

Option 2: Truncate to Recent K Turns

Keep only the most recent K turns (e.g., 5-7) plus category summaries:

MAX_HISTORY_TURNS = 7
recent_history = conversation_history[-MAX_HISTORY_TURNS:]
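
A sketch of how the truncated history and per-category rollups might be combined (the shape of the structured answers is assumed here, since the issue doesn't show it):

MAX_HISTORY_TURNS = 7

def build_history_section(conversation_history, answers_by_topic):
    # answers_by_topic: assumed dict of structured answers keyed by category.
    recent = conversation_history[-MAX_HISTORY_TURNS:]
    section = "## Covered Topics\n"
    for topic, answers in answers_by_topic.items():
        section += f"- {topic}: {len(answers)} answer(s) recorded\n"
    section += "\n## Recent Conversation\n"
    for turn in recent:
        section += f"Q: {turn['question']}\nA: {turn['answer']}\n"
    return section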

Option 3: Token Budget with tiktoken

Count tokens and truncate when approaching a limit:

import tiktoken

def _build_discovery_question_prompt(self, conversation_history):
    # tiktoken ships OpenAI encodings only, so encoding_for_model() has no
    # Claude entry; cl100k_base is a rough proxy for budgeting (exact counts
    # would need Anthropic's token-counting endpoint).
    enc = tiktoken.get_encoding("cl100k_base")
    MAX_PROMPT_TOKENS = 4000

    # Add turns newest-first until the budget is reached, then stop.
    included_turns = []
    current_tokens = 0
    for turn in reversed(conversation_history):
        turn_text = f"Q: {turn['question']}\nA: {turn['answer']}\n"
        turn_tokens = len(enc.encode(turn_text))
        if current_tokens + turn_tokens > MAX_PROMPT_TOKENS:
            break
        included_turns.insert(0, turn)
        current_tokens += turn_tokens

Option 4: Semantic Deduplication

Use embeddings to identify and remove redundant information across turns.
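
A minimal sketch of the idea, assuming an embed(text) callable (e.g., a sentence-transformers model or an embeddings API wrapper) that returns a 1-D numpy vector:

import numpy as np

def dedupe_turns(turns, embed, threshold=0.9):
    # Drop turns whose content is a near-duplicate of an earlier kept turn.
    kept, kept_vecs = [], []
    for turn in turns:
        vec = embed(f"Q: {turn['question']}\nA: {turn['answer']}")
        vec = vec / np.linalg.norm(vec)  # normalize for cosine similarity
        if any(float(vec @ v) >= threshold for v in kept_vecs):
            continue  # semantically redundant; skip it
        kept.append(turn)
        kept_vecs.append(vec)
    return kept

Note the trade-off: embeddings add their own per-turn latency and cost, which is likely why this option is listed last.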

Recommendation

Start with Option 2 (truncate to recent K turns) as it's simplest and provides immediate benefit. Consider Option 1 (summarization) for better context preservation if quality degrades.

Acceptance Criteria

  • Implement token optimization strategy
  • Add token counting/logging to monitor usage (see the sketch after this list)
  • Ensure question quality doesn't degrade with truncation
  • Update documentation with token budget guidance
  • Add tests for truncation/summarization logic
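
For the counting/logging criterion, a small sketch using the same cl100k_base approximation as Option 3:

import logging
import tiktoken

logger = logging.getLogger(__name__)
_enc = tiktoken.get_encoding("cl100k_base")  # rough proxy; see Option 3

def log_prompt_tokens(prompt: str) -> int:
    # Log the approximate token count of a prompt before it is sent.
    tokens = len(_enc.encode(prompt))
    logger.info("discovery prompt: ~%d tokens", tokens)
    return tokens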

Priority

P3 - Nice-to-have optimization; not blocking for MVP
