Task 02: validate 'universal tokenizer' claim with a second modality #4

@tomas-samek

Summary

Task 02 described the tokenizer as universal across modalities. Only text has been exercised. Until a second modality is tested, the claim is aspirational.

What to do

  • Pick one non-text modality (time-series numeric, audio frames, byte-level binary blobs, pixel deltas) that is small enough to run in a unit test.
  • Feed it through tokenize_with_silence or an equivalent adapter.
  • Verify that the byte-trie builds meaningful depth (repeated patterns should share prefixes instead of producing a flat, one-level fan-out) and that concept binding across modalities behaves plausibly.
  • Either:
    • Promote Task 02 to "universal (text + X)" with test evidence, or
    • Narrow the claim in README and docs to "text-only for PoC".
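
As a starting point for the steps above, here is a minimal sketch of what a time-series test could look like. Everything in it is hypothetical: `ByteTrie`, `samples_to_bytes`, and `trie_from_series` are stand-ins written for illustration, not the types or functions in src/trie/tokenizer.rs, and it does not call `tokenize_with_silence` because this issue does not show that function's signature. It assumes the adapter can be as simple as serializing `i16` samples to little-endian bytes and inserting fixed-size sliding windows.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the project's byte trie (illustration only).
#[derive(Default)]
struct ByteTrie {
    children: HashMap<u8, ByteTrie>,
}

impl ByteTrie {
    /// Insert one byte sequence, creating nodes as needed.
    fn insert(&mut self, bytes: &[u8]) {
        let mut node = self;
        for &b in bytes {
            node = node.children.entry(b).or_default();
        }
    }

    /// Length of the longest root-to-leaf path, in bytes.
    fn depth(&self) -> usize {
        self.children.values().map(|c| 1 + c.depth()).max().unwrap_or(0)
    }

    /// Total number of nodes below the root.
    fn node_count(&self) -> usize {
        self.children.values().map(|c| 1 + c.node_count()).sum()
    }
}

/// Hypothetical adapter: serialize i16 samples as little-endian bytes.
fn samples_to_bytes(samples: &[i16]) -> Vec<u8> {
    samples.iter().flat_map(|s| s.to_le_bytes()).collect()
}

/// Build a trie from every sliding window of `window` samples.
fn trie_from_series(samples: &[i16], window: usize) -> ByteTrie {
    let mut trie = ByteTrie::default();
    for w in samples.windows(window) {
        trie.insert(&samples_to_bytes(w));
    }
    trie
}

#[test]
fn periodic_series_shares_prefixes() {
    // A period-4 signal, 32 samples long, yields 29 windows of 4 samples.
    let series: Vec<i16> = [0i16, 100, 0, -100].iter().copied().cycle().take(32).collect();
    let trie = trie_from_series(&series, 4);

    // Each window is 4 samples * 2 bytes, so paths are 8 bytes deep.
    assert_eq!(trie.depth(), 8);
    // Repeated windows collapse onto shared paths: far fewer nodes than
    // the 29 * 8 bytes that were inserted.
    assert!(trie.node_count() < 29 * 8);
}
```

A periodic input is used so that the expected sharing is easy to reason about. A real test would swap the stand-in trie for the project's own and replace the node-count check with whatever "meaningful depth" is decided to mean for Task 02.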

Acceptance

  • Test file tests/universal_tokenizer_<modality>.rs exists, or README/docs are narrowed to reflect reality.

Links

  • docs/design/honest_agent/tasks/02_universal_tokenizer.md
  • src/trie/tokenizer.rs
