Coverage signal has a hidden English prior — rename honestly or replace #3

@tomas-samek

Summary

Scenario 4 reached 10/10 only after stopword filtering and trailing-s stemming were added to the coverage computation. Removing either one drops it to 7/10. Both rules are linguistic priors for English, not a language-agnostic threshold, but neither the code nor the docs currently say so.
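
For illustration only, here is a minimal sketch of what the two rules amount to. It is not the code in `src/mcp/responder.rs`: the stopword list and the names `STOPWORDS`, `normalize_english`, and `coverage` are hypothetical, chosen to show why the rules help English input and do nothing useful elsewhere.

```rust
use std::collections::HashSet;

// Hypothetical stopword list; the real list is not shown in this issue.
const STOPWORDS: &[&str] = &["the", "a", "an", "of", "is", "to", "in"];

/// Lowercase each token, drop English stopwords, strip one trailing 's'.
fn normalize_english(text: &str) -> HashSet<String> {
    text.split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .filter(|w| !w.is_empty() && !STOPWORDS.contains(&w.as_str()))
        .map(|w| {
            // Naive "stemming": only a trailing 's' is removed.
            let stem = w
                .strip_suffix('s')
                .filter(|s| s.len() >= 2)
                .map(|s| s.to_string());
            stem.unwrap_or(w)
        })
        .collect()
}

/// Fraction of normalized query terms that also appear in the answer.
fn coverage(query: &str, answer: &str) -> f64 {
    let q = normalize_english(query);
    if q.is_empty() {
        return 0.0;
    }
    let a = normalize_english(answer);
    let hits = q.iter().filter(|t| a.contains(*t)).count();
    hits as f64 / q.len() as f64
}
```

Under these assumptions, `coverage("the colors of the cats", "cat color")` is 1.0, while the German equivalent `coverage("die Farben der Katzen", "Katze Farbe")` is 0.0: the articles are not in the list and the plurals do not end in "s".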

What to do

  1. Minimum (honest): rename the function and document in-code that coverage is English-biased in its current form. Update the README claim about "no embeddings, no linguistic priors" to reflect the stopword and morphology normalization.
  2. Better: replace both normalizations with something language-agnostic (e.g., IDF-like weighting derived from the trie's own visit counts, or character-n-gram coverage). The replacement must hold or improve the Scenario 4 grade without a hardcoded English vocabulary.
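
A rough sketch of the two alternatives named in option 2, under stated assumptions. The `visits` map is a stand-in for the trie's visit counts (the real trie API is not shown here), and `char_ngrams`, `ngram_coverage`, `lower_terms`, and `idf_coverage` are hypothetical names. Whether either holds the Scenario 4 grade is untested.

```rust
use std::collections::{HashMap, HashSet};

/// Character n-grams of the lowercased text, whitespace runs collapsed to one space.
fn char_ngrams(text: &str, n: usize) -> HashSet<Vec<char>> {
    let joined = text
        .to_lowercase()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ");
    let chars: Vec<char> = joined.chars().collect();
    // `windows(0)` would panic, so guard against it.
    if n == 0 || chars.len() < n {
        return HashSet::new();
    }
    chars.windows(n).map(|w| w.to_vec()).collect()
}

/// Fraction of the query's distinct n-grams that also occur in the answer.
fn ngram_coverage(query: &str, answer: &str, n: usize) -> f64 {
    let q = char_ngrams(query, n);
    if q.is_empty() {
        return 0.0;
    }
    let a = char_ngrams(answer, n);
    q.intersection(&a).count() as f64 / q.len() as f64
}

/// Lowercased whitespace tokens, with no language-specific rules.
fn lower_terms(text: &str) -> HashSet<String> {
    text.split_whitespace().map(|w| w.to_lowercase()).collect()
}

/// Coverage where each query term is weighted by an IDF-like score
/// computed from visit counts (assumed to come from the trie).
fn idf_coverage(query: &str, answer: &str, visits: &HashMap<String, u64>) -> f64 {
    let total: u64 = visits.values().sum();
    let answer_terms = lower_terms(answer);
    let (mut hit, mut all) = (0.0_f64, 0.0_f64);
    for term in lower_terms(query) {
        let count = visits.get(&term).copied().unwrap_or(0);
        // Frequently visited terms get a small weight, rare terms a large one.
        let weight = (1.0 + (total as f64 + 1.0) / (count as f64 + 1.0)).ln();
        all += weight;
        if answer_terms.contains(&term) {
            hit += weight;
        }
    }
    if all > 0.0 { hit / all } else { 0.0 }
}
```

With trigrams, `ngram_coverage("cats", "cat", 3)` is 0.5 and `ngram_coverage("Katzen", "Katze", 3)` is 0.75, so plural forms overlap partially in both languages without a suffix rule. In the IDF variant, a heavily visited term such as an article contributes little to the score in any language, which is the effect the stopword list currently hardcodes for English.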

Acceptance

  • Either the docs match the code, or the code drops the English-specific rules without regressing Scenario 4.
  • Non-English paraphrase test (see M2) does not silently depend on an English preprocessor.
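
One possible way to check the second criterion is to score each non-English pair with and without the English rules and flag any pair whose score changes. The sketch below assumes a simplified token-overlap coverage and shows only the trailing-s rule. `terms`, `coverage`, and `prior_dependent_pairs` are hypothetical names, not functions from `tests/honest_agent_paraphrase.rs`.

```rust
use std::collections::HashSet;

/// Lowercased tokens; optionally apply the English trailing-s rule.
fn terms(text: &str, english_rules: bool) -> HashSet<String> {
    text.split_whitespace()
        .map(|w| w.to_lowercase())
        .map(|w| {
            if !english_rules {
                return w;
            }
            let stem = w
                .strip_suffix('s')
                .filter(|s| !s.is_empty())
                .map(|s| s.to_string());
            stem.unwrap_or(w)
        })
        .collect()
}

/// Fraction of query terms found in the answer.
fn coverage(query: &str, answer: &str, english_rules: bool) -> f64 {
    let q = terms(query, english_rules);
    if q.is_empty() {
        return 0.0;
    }
    let a = terms(answer, english_rules);
    q.intersection(&a).count() as f64 / q.len() as f64
}

/// Pairs whose score changes when the English rules are toggled.
fn prior_dependent_pairs<'a>(pairs: &[(&'a str, &'a str)]) -> Vec<(&'a str, &'a str)> {
    pairs
        .iter()
        .copied()
        .filter(|&(q, a)| (coverage(q, a, true) - coverage(q, a, false)).abs() > 1e-9)
        .collect()
}
```

In this sketch the Spanish pair `("los gatos", "gato")` is flagged, because the trailing-s rule happens to stem "gatos", while the German pair `("die Katze", "Katze")` is not. A non-English test whose pairs are all flagged this way would be passing because of the English preprocessor.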

Links

  • src/mcp/responder.rs — coverage + normalization.
  • docs/design/honest_agent/progress.md Phase B paraphrase entry.
  • tests/honest_agent_paraphrase.rs
