Skip to content

Port: port/pr-17-extend-scanner-lang-support#3

Merged
briansumma merged 4 commits into
developfrom
port/pr-17-extend-scanner-lang-support
May 14, 2026
Merged

Port: port/pr-17-extend-scanner-lang-support#3
briansumma merged 4 commits into
developfrom
port/pr-17-extend-scanner-lang-support

Conversation

@briansumma
Copy link
Copy Markdown
Collaborator

Ported from upstream cytostack/openwolf. Needs review before merge.

Adds Flutter-critical languages (Kotlin, Swift, Objective-C) plus
common gaps (C++ variants, C#, Ruby, PHP, Lua, Vue, Svelte, HTML,
Protobuf, GraphQL, Terraform, shell variants) to both extension sets.

Also brings src/tracker/token-estimator.ts CODE_EXTS back in sync with
src/scanner/anatomy-scanner.ts CODE_EXTENSIONS — the two sets had
drifted apart since only CODE_EXTENSIONS gets the .dart addition from
#10. Adds a one-line "Keep in sync with ..." comment above each so
future additions hit both places.

These sets control the chars-per-token ratio (3.5 for code vs 3.75
fallback) used by estimateTokens; the net effect is ~7% more accurate
token accounting in anatomy.md and detectContentType() consumers for
projects written in these languages.
@briansumma briansumma mentioned this pull request May 14, 2026
@briansumma
Copy link
Copy Markdown
Collaborator Author

Here's my review of PR #3:


PR #3 — Port: Extend Scanner Language Support

Status: DRAFT | Lines: +16 / -2
What it does: Ports upstream PR cytostack#17 from cytostack/openwolf. Expands both the CODE_EXTENSIONS set in anatomy-scanner.ts and the CODE_EXTS set in token-estimator.ts with ~18 new file extensions (Dart, Kotlin, Swift, C++ variants, C#, Ruby, PHP, Lua, Vue, Svelte, HTML, Protobuf, GraphQL, Terraform, and shell variants). Also adds "Keep in sync" comments linking the two parallel sets.


⚠️ Warnings

WR-01: Fragile duplicate Sets — "Keep in sync" comment doesn't prevent drift

Both CODE_EXTENSIONS (anatomy-scanner.ts:42) and CODE_EXTS (token-estimator.ts:4) are byte-for-byte identical sets maintained in two separate files. The sync comment is an improvement over the pre-existing state, but it's unenforceable — any future contributor who adds an extension to one file and misses the other will silently produce divergent behavior:

  • anatomy-scanner.ts would use ratio 3.75 (default) instead of 3.5 for the missed extension
  • token-estimator.ts would return "mixed" instead of "code" for detectContentType()

For a tool where token estimation accuracy is the core value proposition, this silent drift is a real risk. The correct fix is to extract the shared set into a single source of truth:

// src/utils/extensions.ts (new file)
export const CODE_EXTENSIONS = new Set([
  ".ts", ".js", ".tsx", ".jsx", ".py", ".rs", ".go", ".java",
  // ... full list ...
]);

export const PROSE_EXTENSIONS = new Set([".md", ".txt", ".rst", ".adoc"]);

Then import it in both files. This eliminates the sync problem entirely.


WR-02: .html/.htm classified as code applies too-low token ratio

anatomy-scanner.ts:49 / token-estimator.ts:11

HTML is classified as "code" type, applying a 3.5 chars/token ratio. HTML files are markup with prose content, attribute text, and only optionally embedded JS/CSS. The mixed ratio (3.75) or even prose ratio (4.0) would be more accurate.

With the code ratio, OpenWolf will consistently under-count tokens for HTML files, which means users could exceed their intended token budget. Given this is a token-budgeting tool, accuracy matters.

Suggested fix: Remove .html and .htm from CODE_EXTS/CODE_EXTENSIONS and either add a dedicated MARKUP_EXTENSIONS set with its own ratio, or leave HTML to fall through to the mixed default (3.75):

// In anatomy-scanner.ts estimateTokens():
if (CODE_EXTENSIONS.has(ext)) ratio = 3.5;
if (MARKUP_EXTENSIONS.has(ext)) ratio = 3.75; // Already the default, so just don't add to CODE_EXTENSIONS
if (PROSE_EXTENSIONS.has(ext)) ratio = 4.0;

ℹ️ Info

IN-01: No project-level tests

The project has no test suite (no *.test.ts or *.spec.ts files outside node_modules). The extension lists and the estimateTokens/detectContentType functions are untested. Not introduced by this PR, but worth noting before this pattern grows — a simple test asserting the two sets are identical would have caught any future WR-01 drift automatically.


✅ What's Good

  • The "Keep in sync" comments are a meaningful improvement over the prior state
  • Extension additions themselves are correct and well-chosen (Dart, Kotlin, Swift, Terraform, GraphQL, shell variants all reasonable as code type)
  • Both files updated atomically in the same PR — the sets are in sync right now
  • .hpp, .hh, .cc, .cxx filling out C++ variants was the right call

Verdict

Not ready to merge as-is. WR-01 is the primary concern — this PR is the perfect moment to fix the structural duplication rather than cement the comment-based sync pattern. WR-02 is lower priority but worth a conversation before merge. Neither is a showstopper if you want to accept the upstream port as-is, but they'd both be easier to fix now than after more files depend on the pattern.

…HTML ratio

Addresses two PR review findings:

WR-01 — Eliminate duplicate parallel Sets
  Both `anatomy-scanner.ts` and `token-estimator.ts` maintained byte-for-byte
  identical `CODE_EXTENSIONS`/`CODE_EXTS` sets linked only by a "Keep in sync"
  comment — an unenforceable convention that would silently produce divergent
  token-ratio behaviour the moment a future contributor updated one file and
  missed the other.  Extract to `src/utils/extensions.ts` as a single source
  of truth and import from both consumers.

WR-02 — Remove .html/.htm from CODE_EXTENSIONS
  HTML is markup with prose content and attribute text; classifying it as
  `code` applies a 3.5 chars/token ratio that consistently under-counts
  tokens, risking budget overruns.  Dropping .html/.htm from CODE_EXTENSIONS
  lets them fall through to the default `mixed` ratio (3.75), which is more
  accurate.  A comment in extensions.ts records the rationale.
@briansumma
Copy link
Copy Markdown
Collaborator Author

✅ Autofixer applied. Build and tests passed. Promoting to Ready for Review.

@briansumma briansumma marked this pull request as ready for review May 14, 2026 21:25
briansumma added a commit that referenced this pull request May 14, 2026
- isOpenWolfHook: check _managedBy as primary signal, path substring as backward-compat fallback for pre-tag installs
- Add comment documenting that _managedBy is empirically observed passthrough, not a guaranteed Claude Code field (Warning #2)
- Add comment in replaceOpenWolfHooks documenting co-location assumption: one inner hook per outer entry unsupported (Warning #1)
- Reformat HOOK_SETTINGS to multi-line expanded style, respects 80-char line length rule (Info #4)
- Code duplication between init.ts and update.ts already resolved by prior extraction to hook-settings.ts (Info #3)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@briansumma briansumma changed the base branch from main to develop May 14, 2026 23:46
@briansumma briansumma merged commit c080f05 into develop May 14, 2026
@briansumma briansumma deleted the port/pr-17-extend-scanner-lang-support branch May 14, 2026 23:49
briansumma added a commit that referenced this pull request May 14, 2026
* init/update: tag hook entries with _managedBy: "openwolf"

Adds `_managedBy: "openwolf"` to every hook object in `HOOK_SETTINGS`
so Claude Code's settings round-tripper recognizes them as third-party
managed entries and preserves them through `/effort`, `/config`, and
similar rewrites. Without the tag, entries get silently dropped: a
working OpenWolf install can be de-wired by typing `/effort medium`
once, since Claude Code's merge logic only preserves entries it
recognizes as owned (claude-hooks uses the same field, for example).

Also tightens `replaceOpenWolfHooks` to recognize the new tag in
addition to the legacy `.wolf/hooks/` substring — defensive against
future path schema changes, and keeps the dedupe correct for installs
upgrading from a pre-tag version.

The two changes are minimal and backward-compatible: untagged entries
from older installs still match the substring fallback, so upgrades
clean up cleanly. New installs get tagged from the start.

Fixes cytostack#31.

* fix(hook-settings): address PR #8 review comments

- isOpenWolfHook: check _managedBy as primary signal, path substring as backward-compat fallback for pre-tag installs
- Add comment documenting that _managedBy is empirically observed passthrough, not a guaranteed Claude Code field (Warning #2)
- Add comment in replaceOpenWolfHooks documenting co-location assumption: one inner hook per outer entry unsupported (Warning #1)
- Reformat HOOK_SETTINGS to multi-line expanded style, respects 80-char line length rule (Info #4)
- Code duplication between init.ts and update.ts already resolved by prior extraction to hook-settings.ts (Info #3)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: ManniX-ITA <20623405+mann1x@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants