perf: optimize token estimator ~10x throughput #3421
mfzzf wants to merge 3 commits into QuantumNous:main
Key optimizations:
- Add ASCII fast path for letters, digits, spaces (avoids unicode.Is* calls)
- Remove unnecessary RWMutex on read-only multipliers map
- Replace linear string scan in isMathSymbol with map[rune]struct{} lookup
- Replace linear string scan in isURLDelim with [128]bool array index
- Add CJK basic block (0x4E00-0x9FFF) range check before unicode.Is(Han)
- Reorder isMathSymbol to check Unicode ranges before map fallback
Benchmark results (Apple M4, arm64):
| Benchmark | Before (ns/op) | After (ns/op) | Speedup |
|-------------------|-----------------|---------------|---------|
| OpenAI (mixed) | 130,000 | 11,900 | 10.9x |
| Claude (mixed) | 127,500 | 12,100 | 10.5x |
| Gemini (mixed) | 126,500 | 12,300 | 10.3x |
| Pure English | 85,000 | 10,900 | 7.8x |
| Pure Chinese | 67,500 | 15,400 | 4.4x |
Zero allocations in both versions. No behavioral changes.
Walkthrough: Removed lock synchronization and made multiplier access a read-only global; replaced the previous WordType state machine with an ASCII fast path and simplified state tracking.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pull request overview
This PR optimizes service/token_estimator.go by introducing an ASCII fast-path, removing unnecessary synchronization, and replacing linear scans with O(1) lookups to substantially improve throughput while aiming to preserve estimation results.
Changes:
- Refactors `EstimateToken` to use an ASCII-first state machine and faster symbol/URL delimiter classification.
- Removes the `sync.RWMutex` around a read-only multipliers map.
- Adds unit tests and benchmarks for token estimation.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| service/token_estimator.go | Performance-focused refactor of the token estimator (ASCII fast-path, precomputed lookup sets, lock removal). |
| service/token_estimator_test.go | Adds basic correctness smoke tests and benchmarks for tracking estimator performance regressions. |
service/token_estimator_test.go
Outdated
```go
// Test cases: ensure results are unchanged after optimization
func TestEstimateToken_Correctness(t *testing.T) {
	tests := []struct {
		name     string
		provider Provider
		text     string
	}{
		{"english", OpenAI, "Hello world, this is a test sentence with some numbers 12345."},
		{"chinese", OpenAI, "你好世界,这是一段测试文本。"},
		{"mixed", Claude, "Hello 你好 world 世界 123 test@email.com https://example.com/path?q=1&a=2"},
		{"math", Gemini, "∑∫∂√∞ x² + y³ = z⁴"},
		{"emoji", Claude, "Hello 😀🎉🚀 World"},
		{"empty", OpenAI, ""},
		{"spaces_newlines", OpenAI, "line1\nline2\tindented double"},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			result := EstimateToken(tt.provider, tt.text)
			if tt.text == "" && result != 0 {
				t.Errorf("expected 0 for empty text, got %d", result)
			}
			if tt.text != "" && result <= 0 {
				t.Errorf("expected positive token count, got %d", result)
			}
		})
	}
}
```
TestEstimateToken_Correctness claims to ensure the optimized estimator returns the same results, but it only asserts non-empty inputs produce a positive value. This won’t catch regressions from the refactor (e.g., whitespace classification changes, word/number boundary changes, symbol handling). Consider turning these into golden tests with explicit expected counts per provider (and add a case with '\r' / '\f' if those are meaningful), so the test actually protects the behavior you’re trying to preserve.
```go
} else if urlDelimSet[r] {
	count += m.URLDelim
} else if r == '\r' || r == '\f' || r == '\v' {
	count += m.Newline
} else {
	// plain spaces use the Space weight
	count += m.Space
	count += m.Symbol
```
The ASCII fast-path changes whitespace semantics: '\r', '\f', and '\v' are now counted using m.Newline, whereas previously they would have gone through unicode.IsSpace and (since they are not '\n' or '\t') been counted as m.Space. This will change token estimates for inputs containing these characters and contradicts the stated goal of keeping results identical. Consider restoring the prior behavior (count these as Space), or explicitly documenting the behavior change and adding a regression test that locks in the intended handling.
Actionable comments posted: 1
🧹 Nitpick comments (2)
service/token_estimator_test.go (2)
**24-34: Test assertions are too weak to catch regressions.** The tests only verify `result > 0` for non-empty text and `result == 0` for empty text. This won't detect if the optimization accidentally changed token counts. Consider adding expected value assertions or at minimum golden/snapshot values for known inputs.

Example with expected values:

```diff
 tests := []struct {
 	name     string
 	provider Provider
 	text     string
+	wantMin  int // minimum expected tokens (for sanity)
 }{
-	{"english", OpenAI, "Hello world, this is a test sentence with some numbers 12345."},
-	{"chinese", OpenAI, "你好世界,这是一段测试文本。"},
+	{"english", OpenAI, "Hello world, this is a test sentence with some numbers 12345.", 10},
+	{"chinese", OpenAI, "你好世界,这是一段测试文本。", 8},
 	// ... etc
 }

 for _, tt := range tests {
 	t.Run(tt.name, func(t *testing.T) {
 		result := EstimateToken(tt.provider, tt.text)
 		if tt.text == "" && result != 0 {
 			t.Errorf("expected 0 for empty text, got %d", result)
 		}
-		if tt.text != "" && result <= 0 {
-			t.Errorf("expected positive token count, got %d", result)
+		if tt.text != "" && result < tt.wantMin {
+			t.Errorf("expected at least %d tokens, got %d", tt.wantMin, result)
 		}
 	})
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@service/token_estimator_test.go` around lines 24 - 34, The tests for EstimateToken are too weak—update the test table used in the loop to include an expected token count for each case (e.g., add an Expected int field to the test struct), and replace the loose checks with strict assertions comparing result == tt.Expected (or use table-driven golden/snapshot values for known inputs) so changes to EstimateToken are caught; locate the test loop that calls EstimateToken and modify the test cases and assertions accordingly to validate exact token counts for known inputs as well as zero for empty text.
**37-51: Consider testing unknown/edge-case model names.** The tests cover known providers but don't verify the fallback behavior when model names don't match any provider (e.g., `"unknown-model"`). Since `EstimateTokenByModel` defaults to OpenAI for unmatched models, a test confirming this behavior would be valuable.

Add fallback test case:

```diff
+// Verify unknown models fall back to OpenAI behavior
+openAIResult := EstimateTokenByModel("gpt-4o", text)
+unknownResult := EstimateTokenByModel("unknown-model-xyz", text)
+if unknownResult != openAIResult {
+	t.Errorf("expected unknown model to use OpenAI fallback, got %d vs %d", unknownResult, openAIResult)
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@service/token_estimator_test.go` around lines 37 - 51, Add a test case in TestEstimateTokenByModel to verify fallback behavior for unknown model names: call EstimateTokenByModel with an unknown model like "unknown-model" and the sample text and assert it yields a positive token count (or equals the OpenAI-based estimate such as EstimateTokenByModel("gpt-4o", text)); also assert that an empty text with an unknown model returns 0. Target the existing TestEstimateTokenByModel function and the EstimateTokenByModel calls to validate the default-to-OpenAI fallback behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f968b7b0-ae02-49c8-b330-b54d9bae1d82
📒 Files selected for processing (2)
service/token_estimator.go, service/token_estimator_test.go
```go
// Non-ASCII letters/digits (e.g. accented Latin, Cyrillic)
if unicode.IsLetter(r) {
	if state != stLatin {
		count += m.Word
		state = stLatin
	}
}

// If we were not inside a word before, or the type changed (letter <-> digit), treat it as a new token
// Note: OpenAI usually splits "version 3.5", and sometimes splits "abc123xyz" too
// For simplicity, add weight whenever switching between letters and digits
if currentWordType == None || currentWordType != newType {
	if newType == Number {
		count += m.Number
	} else {
		count += m.Word
	}
	currentWordType = newType
	continue
}
if unicode.IsNumber(r) {
	if state != stNumber {
		count += m.Number
		state = stNumber
	}
	// Characters inside a word are not charged extra
	continue
}

// 5. Handle punctuation/special characters - different weight per type
currentWordType = None
// Whitespace (non-ASCII whitespace such as full-width spaces)
if unicode.IsSpace(r) {
	state = stNone
	count += m.Space
	continue
}

// Punctuation/special characters
state = stNone
if isMathSymbol(r) {
	count += m.MathSymbol
} else if r == '@' {
	count += m.AtSign
} else if isURLDelim(r) {
	count += m.URLDelim
} else {
	count += m.Symbol
}
```
Superscript/subscript digits are now counted as numbers, not math symbols.
The mathSymbolSet includes superscript/subscript digits (²³¹⁴⁵⁶⁷⁸⁹⁰, ₀₁₂₃₄₅₆₇₈₉), but unicode.IsNumber(r) at line 155 returns true for these characters before the isMathSymbol check at line 172 is reached. This means they're counted with m.Number instead of m.MathSymbol.
If the original behavior treated these as math symbols (which the inclusion in mathSymbolSet suggests), this is a silent behavioral regression. Consider reordering the checks:
Proposed fix to check math symbols before `unicode.IsNumber`:

```diff
 	continue
 }
+// Punctuation/special characters (check math symbols early to catch superscript/subscript digits)
+if isMathSymbol(r) {
+	state = stNone
+	count += m.MathSymbol
+	continue
+}
+
 // Non-ASCII letters/digits (e.g. accented Latin, Cyrillic)
 if unicode.IsLetter(r) {
 	if state != stLatin {
 		count += m.Word
 		state = stLatin
 	}
 	continue
 }
 if unicode.IsNumber(r) {
 	if state != stNumber {
 		count += m.Number
 		state = stNumber
 	}
 	continue
 }
 // Whitespace (non-ASCII whitespace such as full-width spaces)
 if unicode.IsSpace(r) {
 	state = stNone
 	count += m.Space
 	continue
 }
-// Punctuation/special characters
-state = stNone
-if isMathSymbol(r) {
-	count += m.MathSymbol
-} else {
+state = stNone
 	count += m.Symbol
-}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@service/token_estimator.go` around lines 147 - 176, The code currently calls
unicode.IsNumber(r) before checking isMathSymbol(r), causing
superscript/subscript digits in mathSymbolSet to be counted as m.Number; update
the logic in the token counting loop (the block using unicode.IsLetter,
unicode.IsNumber, unicode.IsSpace and isMathSymbol) to call isMathSymbol(r)
before unicode.IsNumber(r) and, when isMathSymbol returns true, increment
m.MathSymbol and set state appropriately (same state handling used for symbols)
so those characters are treated as math symbols rather than numbers.
- `\r`, `\f`, `\v` now correctly use m.Space (matching original unicode.IsSpace behavior)
- Replace weak positive-only assertions with golden tests that lock in exact token counts per provider across 9 input categories
- Add dedicated cr_ff_vt, url_heavy, at_sign test cases
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
```go
func BenchmarkEstimateToken_PureEnglish(b *testing.B) {
	text := strings.Repeat("The quick brown fox jumps over the lazy dog. ", 200)
	for b.Loop() {
		EstimateToken(OpenAI, text)
	}
}
```
strings.Repeat(...) is executed as part of the timed benchmark, which will skew ns/op and make the benchmark unsuitable for regression tracking. Consider moving text to a package-level var (like benchText) or calling b.StopTimer()/b.ResetTimer() after building text so only EstimateToken is measured.
```go
func BenchmarkEstimateToken_PureChinese(b *testing.B) {
	text := strings.Repeat("人工智能技术正在快速发展和广泛应用。", 200)
	for b.Loop() {
		EstimateToken(OpenAI, text)
	}
}
```
strings.Repeat(...) is executed as part of the timed benchmark, which will skew ns/op and make the benchmark unsuitable for regression tracking. Consider moving text to a package-level var or calling b.StopTimer()/b.ResetTimer() after building text so only EstimateToken is measured.
♻️ Duplicate comments (1)
service/token_estimator.go (1)
**148-160: ⚠️ Potential issue | 🟡 Minor — `unicode.IsLetter`/`unicode.IsNumber` still shadow `isMathSymbol`.**

Because the generic non-ASCII branches run first, `mathSymbolSet` entries such as ⁿ, ², and ₉, plus the mathematical alphanumeric block (0x1D400–0x1D7FF), never reach the math-symbol branch. They are currently charged as `m.Word`/`m.Number` instead of `m.MathSymbol`.

Suggested fix:

```diff
-// Non-ASCII letters/digits (e.g. accented Latin, Cyrillic)
-if unicode.IsLetter(r) {
+// Math symbols must be classified before generic letters/digits,
+// otherwise superscript/subscript digits and letters fall into Word/Number
+if isMathSymbol(r) {
+	state = stNone
+	count += m.MathSymbol
+	continue
+}
+
+// Non-ASCII letters/digits (e.g. accented Latin, Cyrillic)
+if unicode.IsLetter(r) {
 	if state != stLatin {
 		count += m.Word
 		state = stLatin
 	}
 	continue
@@
-// Punctuation/special characters
-state = stNone
-if isMathSymbol(r) {
-	count += m.MathSymbol
-} else {
-	count += m.Symbol
-}
+// Punctuation/special characters
+state = stNone
+count += m.Symbol
```

Also applies to: 172-173
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@service/token_estimator.go` around lines 148 - 160, The math-symbol characters are being classified as letters/numbers because unicode.IsLetter / unicode.IsNumber checks run before the math-symbol branch; update the branching in the token estimator (the code that sets state and increments count using m.Word, m.Number, m.MathSymbol) to test for math symbols first for non-ASCII runes—call isMathSymbol(r) or check mathSymbolSet before the unicode.IsLetter and unicode.IsNumber branches (the same change should be applied to both the letter/number blocks around the stLatin/stNumber state handling at the shown locations) so that math symbols are charged as m.MathSymbol and state is set appropriately instead of m.Word/m.Number.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bbe1a510-1faf-4dba-a517-eb55925e288a
📒 Files selected for processing (1)
service/token_estimator.go
📝 Description
`service/token_estimator.go` performed poorly: every character went through heavyweight calls such as `unicode.IsSpace`/`IsLetter`/`IsNumber`, `isMathSymbol` linearly scanned a 60+ character string on every call, `isURLDelim` did the same, and there was a completely unnecessary RWMutex. Optimizations:

- ASCII fast path for common characters, avoiding `unicode.Is*` calls
- Removed the unnecessary `sync.RWMutex`
- `isMathSymbol`: linear string scan → `map[rune]struct{}` O(1) lookup, with Unicode range checks moved up front
- `isURLDelim`: string iteration → direct `[128]bool` array indexing
- `isCJK`: range-check the basic block 0x4E00-0x9FFF first, avoiding the binary search in `unicode.Is(Han)`

🚀 Type of change
✅ Checklist
📸 Proof of Work
Benchmark comparison (Apple M4, arm64, `-count=3`): both versions are 0 alloc/op, and the estimation results are identical.