Output safety gates: dedup, insertion safety, sentence boundary#485
Conversation
Move suggestion-quality controls into shared, testable helpers and suppress more bad completions before they reach ghost text: - TrailingDuplicationFilter: folded, multi-shape after-caret duplication check replacing the raw hasPrefix guard in SuggestionTextNormalizer. - InsertionSafetyGate: reject control chars, U+FFFD, and whitespace-only output. - SentenceBoundaryClassifier: disambiguate periods (decimals, list numbers, initials, abbreviations) so phrase acceptance does not stop early. Refactor <think>-block stripping into a helper to keep normalize() under the cyclomatic-complexity threshold. Typo suppression is intentionally left to the more complete PR #353 rather than duplicated here.
023d913 to
c42c482
Compare
Phrase-terminator detection now routes periods through SentenceBoundaryClassifier, which treats a run of single-letter initials like "U.S.A." as non-terminal. The test still asserted the pre-classifier "known limitation" (an early break at "U.S.A."), so it failed once the classifier was wired in. Update it to assert the corrected behavior: acceptance walks past the initials to the real sentence end, yielding "U.S.A. is great.". Matches SentenceBoundaryClassifierTests, which already covers the same rule.
| CotabbyInference: | ||
| url: https://github.com/FuJacob/cotabbyinference.git |
There was a problem hiding this comment.
Dependency pinned to a feature branch
CotabbyInference is now locked to feat/generation-quality-controls instead of main. Branch references are not content-addressed: if that branch is deleted, rebased, or diverges after this PR is merged, every subsequent build will silently fail to resolve the package. The same change appears in project.pbxproj, so both the YAML spec and the generated project file are affected. This should be resolved before merge — either by merging the inference-engine PR first and pointing back to main, or by pinning to a concrete commit SHA that survives branch lifecycle events.
| // common "model re-emits the next few words" case where the two diverge only after a while. | ||
| let overlap = commonPrefixLength(foldedCompletion, foldedTrailing) | ||
| return overlap >= max(minimumFoldedOverlap, foldedCompletion.count / 2) |
There was a problem hiding this comment.
Shape 3 threshold collapses to
minimumFoldedOverlap for short completions
foldedCompletion.count / 2 is integer division, so for completions of 4–6 folded characters the divisor rounds down to 2 or 3, making max(minimumFoldedOverlap, ...) always equal to minimumFoldedOverlap (3). A 4-character completion like "work" shares 3 folded characters (wor) with trailing text "world...", so 3 >= max(3, 4/2) → 3 >= 3 → suppressed, even though the two words are distinct. The intent of shape 3 is to catch re-emitted multi-word runs; that goal is better served by requiring the overlap to be strictly more than half the completion length, e.g. overlap > foldedCompletion.count / 2, or by raising minimumFoldedOverlap for this shape alone.
Summary
Moves last-mile suggestion-quality control into small, testable helpers and suppresses more classes of bad completion before they reach ghost text. The goal is the project principle that a suppressed completion beats a wrong one.
First of two stacked PRs. App-only, no engine change.
TrailingDuplicationFilter: replaces the rawhasPrefix(trailingText)guard inSuggestionTextNormalizerwith a folded (lowercased, alphanumeric-only) check covering three duplication shapes, so a stray leading glyph, a case difference, or a contained suffix no longer slips a duplicate through.InsertionSafetyGate: final gate inSuggestionTextNormalizerthat rejects control characters, U+FFFD replacement glyphs, and whitespace-only output. Does not judge punctuation, so a lone ")" or "." still passes.SentenceBoundaryClassifier: consulted bySuggestionSessionReconciler.endsInSentenceTerminatorso phrase acceptance no longer stops early on decimals ("1.2"), list numbers ("1."), single-letter initials, or common abbreviations ("e.g.", "U.S.").Typo suppression is intentionally not included here. It overlaps the more complete #353 (which adds a spell checker, correction mode, and settings), so this PR defers to that rather than shipping a weaker duplicate.
Validation
Local Team ID caveat: the default signed
testrun cannot load the xctest bundle locally, so the suites are run withCODE_SIGNING_ALLOWED=NO.build-for-testingpasses signed.Linked issues
None. Related: #353 (typo suppression, intentionally not duplicated here).
Risk / rollout notes
xcodegen generate.Greptile Summary
This PR introduces five new helper types that form a multi-layer output safety pipeline, tightening last-mile filtering before suggestions reach ghost text. All new types are pure functions with dedicated test suites (19 tests, 0 failures reported).
SuggestionTextNormalizer:TrailingDuplicationFilterreplaces the rawhasPrefixguard with a folded (lowercased, alphanumeric-only) check across three duplication shapes;InsertionSafetyGaterejects control characters, U+FFFD glyphs, and whitespace-only output as the final step before a suggestion is surfaced.SentenceBoundaryClassifieris integrated intoSuggestionSessionReconciler.endsInSentenceTerminatorso decimals, list numbers, single-letter initials, and common abbreviations no longer cause early phrase breaks.ConfidenceSuppressionPolicyandMidWordContinuationPolicyare wired intoLlamaRuntimeCore/LlamaSuggestionEnginewith all new options disabled by default; theCotabbyInferencepackage is switched to a feature branch to expose the required engine APIs.Confidence Score: 3/5
Not safe to merge as-is: the CotabbyInference dependency is pinned to a feature branch, making all builds after a potential branch deletion or rebase silently broken.
The new helper types are well-structured and the test suite is solid, but the dependency on feat/generation-quality-controls instead of a stable ref is a real build-stability problem. Additionally, TrailingDuplicationFilter Shape 3 uses integer division that collapses the effective threshold to the same 3-character floor for short completions, which can suppress valid 4–6 character suggestions sharing a common prefix with trailing text.
project.yml and Cotabby.xcodeproj/project.pbxproj both reference the feat/generation-quality-controls branch; TrailingDuplicationFilter.swift Shape 3 threshold deserves a second look for short completion inputs.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[LlamaSuggestionEngine] --> B[LlamaRuntimeCore generates tokens accumulates sumLogprob] B --> C{ConfidenceSuppressionPolicy} C -- suppressed --> Z[return empty] C -- passes --> D[SuggestionTextNormalizer] D --> D1[stripThinkBlocks] D1 --> D2[prompt-echo strip] D2 --> D3{TrailingDuplicationFilter shapes 1·2·3} D3 -- duplicate --> Z D3 -- unique --> D4[whitespace trim] D4 --> D5{InsertionSafetyGate} D5 -- unsafe --> Z D5 -- safe --> E[ghost text shown] F[SuggestionSessionReconciler] --> G{endsInSentenceTerminator} G -- exclamation or question --> H[stop phrase] G -- period --> I{SentenceBoundaryClassifier} I -- decimal/initial/abbrev --> J[continue phrase] I -- real sentence end --> HComments Outside Diff (1)
Cotabby/Support/SentenceBoundaryClassifier.swift, line 541-542 (link)The set covers basics but is missing commonly occurring terms:
"prof","jr","sr","corp","dept","ave","blvd","vol","ed"(editor),"est"(established), and"ca"(circa). When any of these appear at the end of a chunk in phrase-acceptance mode, the classifier will fall through toreturn trueand treat their trailing period as a sentence end. Because the classifier deliberately biases toward continuing on ambiguous periods (safe default), extending this list with a few more well-known cases costs nothing in correctness and reduces unnecessary early breaks.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Reviews (1): Last reviewed commit: "Generation-time quality controls: token ..." | Re-trigger Greptile