Skip to content

perf: optimize token estimator ~10x throughput#3421

Open
mfzzf wants to merge 3 commits intoQuantumNous:mainfrom
mfzzf:perf/token-estimator-optimize
Open

perf: optimize token estimator ~10x throughput#3421
mfzzf wants to merge 3 commits intoQuantumNous:mainfrom
mfzzf:perf/token-estimator-optimize

Conversation

@mfzzf
Copy link

@mfzzf mfzzf commented Mar 24, 2026

📝 变更描述 / Description

service/token_estimator.go 性能很差,每个字符都要走 unicode.IsSpace/IsLetter/IsNumber 等重量级调用,isMathSymbol 每次线性扫描60+字符的字符串,isURLDelim 同理,还有一把完全没必要的 RWMutex。

优化手段:

  1. ASCII 快速路径——字母、数字、空格直接范围比较,跳过 unicode.Is*
  2. 去掉只读 map 上的 sync.RWMutex
  3. isMathSymbol:字符串线性扫描 → map[rune]struct{} O(1) 查找,Unicode 范围检查前置
  4. isURLDelim:字符串遍历 → [128]bool 数组直接索引
  5. isCJK:基本区 0x4E00-0x9FFF 先做范围判断,避免 unicode.Is(Han) 二分查找
  6. 新增 benchmark 测试用于回归追踪

🚀 变更类型 / Type of change

  • ⚡ 性能优化 / 重构 (Refactor)

✅ 提交前检查项 / Checklist

  • 人工确认: 我已亲自撰写此描述,去除了 AI 原始输出的冗余。
  • 深度理解: 我已完全理解这些更改的工作原理及潜在影响。
  • 范围聚焦: 本 PR 未包含任何与当前任务无关的代码改动。
  • 本地验证: 已在本地运行并通过了测试或手动验证。
  • 安全合规: 代码中无敏感凭据,且符合项目代码规范。

📸 运行证明 / Proof of Work

Benchmark 对比 (Apple M4, arm64, -count=3):

Benchmark Before (ns/op) After (ns/op) Speedup
OpenAI (混合文本) 130,000 11,900 10.9x
Claude (混合文本) 127,500 12,100 10.5x
Gemini (混合文本) 126,500 12,300 10.3x
纯英文 85,000 10,900 7.8x
纯中文 67,500 15,400 4.4x

两个版本均 0 alloc/op,估算结果完全一致。

Summary by CodeRabbit

  • Refactor

    • Improved token estimation accuracy and performance by simplifying character classification and moving to constant-time character set checks; removed internal synchronization to reduce overhead.
  • Tests

    • Added comprehensive golden tests and benchmarks covering multiple languages, symbols, URLs, whitespace variants, edge cases, and several provider/model scenarios.

Key optimizations:
- Add ASCII fast path for letters, digits, spaces (avoids unicode.Is* calls)
- Remove unnecessary RWMutex on read-only multipliers map
- Replace linear string scan in isMathSymbol with map[rune]struct{} lookup
- Replace linear string scan in isURLDelim with [128]bool array index
- Add CJK basic block (0x4E00-0x9FFF) range check before unicode.Is(Han)
- Reorder isMathSymbol to check Unicode ranges before map fallback

Benchmark results (Apple M4, arm64):

| Benchmark         | Before (ns/op) | After (ns/op) | Speedup |
|-------------------|-----------------|---------------|---------|
| OpenAI (mixed)    | 130,000         | 11,900        | 10.9x   |
| Claude (mixed)    | 127,500         | 12,100        | 10.5x   |
| Gemini (mixed)    | 126,500         | 12,300        | 10.3x   |
| Pure English      | 85,000          | 10,900        | 7.8x    |
| Pure Chinese      | 67,500          | 15,400        | 4.4x    |

Zero allocations in both versions. No behavioral changes.
Copilot AI review requested due to automatic review settings March 24, 2026 08:27
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 24, 2026

Walkthrough

Removed lock synchronization and made multiplier access a read-only global; replaced the previous WordType state machine with an ASCII fast path and simplified state (stLatin/stNumber), switched rune classification to lookup sets and explicit CJK/maths checks, and added golden tests and benchmarks for providers and models.

Changes

Cohort / File(s) Summary
Token Estimator
service/token_estimator.go
Removed sync/multipliersLock; made multipliersMap a direct read-only global and perform uncached lookups. Rewrote per-rune token-weighting logic: replaced WordType state machine with simplified state (stLatin/stNumber) and an ASCII fast path. Replaced string scans with lookup sets/maps (mathSymbolSet, urlDelimSet, @). Adjusted non-ASCII handling: explicit CJK range checks before unicode.Han, isMathSymbol via set membership, and non-ASCII whitespace via unicode.IsSpace.
Tests & Benchmarks
service/token_estimator_test.go
Added TestEstimateToken_Golden (table-driven golden cases for OpenAI, Claude, Gemini across varied text scenarios), TestEstimateTokenByModel (model-specific golden checks), and multiple benchmarks covering providers and text types (English, Chinese, mixed) plus a model-based benchmark.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • feat: refactor token estimation logic #2355: Refactors service/token_estimator.go internals (removes locks, changes classification/state machine and lookup tables) and updates tests — directly overlaps these changes.

Poem

🐰
I hopped through text both wide and deep,
Dropped the locks, now counting’s cheap,
ASCII paths and CJK leaps,
Tests and benches, carrots keep,
Tokens tallied—time to sleep! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'perf: optimize token estimator ~10x throughput' directly and concisely describes the main change: a performance optimization to the token estimator achieving approximately 10x throughput improvement.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes service/token_estimator.go by introducing an ASCII fast-path, removing unnecessary synchronization, and replacing linear scans with O(1) lookups to substantially improve throughput while aiming to preserve estimation results.

Changes:

  • Refactors EstimateToken to use an ASCII-first state machine and faster symbol/URL delimiter classification.
  • Removes the sync.RWMutex around a read-only multipliers map.
  • Adds unit tests and benchmarks for token estimation.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
service/token_estimator.go Performance-focused refactor of the token estimator (ASCII fast-path, precomputed lookup sets, lock removal).
service/token_estimator_test.go Adds basic correctness smoke tests and benchmarks for tracking estimator performance regressions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +8 to +33
// 测试用例:确保优化后结果不变
func TestEstimateToken_Correctness(t *testing.T) {
tests := []struct {
name string
provider Provider
text string
}{
{"english", OpenAI, "Hello world, this is a test sentence with some numbers 12345."},
{"chinese", OpenAI, "你好世界,这是一段测试文本。"},
{"mixed", Claude, "Hello 你好 world 世界 123 test@email.com https://example.com/path?q=1&a=2"},
{"math", Gemini, "∑∫∂√∞ x² + y³ = z⁴"},
{"emoji", Claude, "Hello 😀🎉🚀 World"},
{"empty", OpenAI, ""},
{"spaces_newlines", OpenAI, "line1\nline2\tindented double"},
}

for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
result := EstimateToken(tt.provider, tt.text)
if tt.text == "" && result != 0 {
t.Errorf("expected 0 for empty text, got %d", result)
}
if tt.text != "" && result <= 0 {
t.Errorf("expected positive token count, got %d", result)
}
})
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestEstimateToken_Correctness claims to ensure the optimized estimator returns the same results, but it only asserts non-empty inputs produce a positive value. This won’t catch regressions from the refactor (e.g., whitespace classification changes, word/number boundary changes, symbol handling). Consider turning these into golden tests with explicit expected counts per provider (and add a case with '\r' / '\f' if those are meaningful), so the test actually protects the behavior you’re trying to preserve.

Copilot uses AI. Check for mistakes.
Comment on lines +121 to +126
} else if urlDelimSet[r] {
count += m.URLDelim
} else if r == '\r' || r == '\f' || r == '\v' {
count += m.Newline
} else {
// 普通空格使用Space权重
count += m.Space
count += m.Symbol
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ASCII fast-path changes whitespace semantics: '\r', '\f', and '\v' are now counted using m.Newline, whereas previously they would have gone through unicode.IsSpace and (since they are not '\n' or '\t') been counted as m.Space. This will change token estimates for inputs containing these characters and contradicts the stated goal of keeping results identical. Consider restoring the prior behavior (count these as Space), or explicitly documenting the behavior change and adding a regression test that locks in the intended handling.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
service/token_estimator_test.go (2)

24-34: Test assertions are too weak to catch regressions.

The tests only verify result > 0 for non-empty text and result == 0 for empty text. This won't detect if the optimization accidentally changed token counts. Consider adding expected value assertions or at minimum golden/snapshot values for known inputs.

Example with expected values
 	tests := []struct {
 		name     string
 		provider Provider
 		text     string
+		wantMin  int // minimum expected tokens (for sanity)
 	}{
-		{"english", OpenAI, "Hello world, this is a test sentence with some numbers 12345."},
-		{"chinese", OpenAI, "你好世界,这是一段测试文本。"},
+		{"english", OpenAI, "Hello world, this is a test sentence with some numbers 12345.", 10},
+		{"chinese", OpenAI, "你好世界,这是一段测试文本。", 8},
 		// ... etc
 	}

 	for _, tt := range tests {
 		t.Run(tt.name, func(t *testing.T) {
 			result := EstimateToken(tt.provider, tt.text)
 			if tt.text == "" && result != 0 {
 				t.Errorf("expected 0 for empty text, got %d", result)
 			}
-			if tt.text != "" && result <= 0 {
-				t.Errorf("expected positive token count, got %d", result)
+			if tt.text != "" && result < tt.wantMin {
+				t.Errorf("expected at least %d tokens, got %d", tt.wantMin, result)
 			}
 		})
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@service/token_estimator_test.go` around lines 24 - 34, The tests for
EstimateToken are too weak—update the test table used in the loop to include an
expected token count for each case (e.g., add an Expected int field to the test
struct), and replace the loose checks with strict assertions comparing result ==
tt.Expected (or use table-driven golden/snapshot values for known inputs) so
changes to EstimateToken are caught; locate the test loop that calls
EstimateToken and modify the test cases and assertions accordingly to validate
exact token counts for known inputs as well as zero for empty text.

37-51: Consider testing unknown/edge-case model names.

The tests cover known providers but don't verify the fallback behavior when model names don't match any provider (e.g., "unknown-model"). Since EstimateTokenByModel defaults to OpenAI for unmatched models, a test confirming this behavior would be valuable.

Add fallback test case
+	// Verify unknown models fall back to OpenAI behavior
+	openAIResult := EstimateTokenByModel("gpt-4o", text)
+	unknownResult := EstimateTokenByModel("unknown-model-xyz", text)
+	if unknownResult != openAIResult {
+		t.Errorf("expected unknown model to use OpenAI fallback, got %d vs %d", unknownResult, openAIResult)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@service/token_estimator_test.go` around lines 37 - 51, Add a test case in
TestEstimateTokenByModel to verify fallback behavior for unknown model names:
call EstimateTokenByModel with an unknown model like "unknown-model" and the
sample text and assert it yields a positive token count (or equals the
OpenAI-based estimate such as EstimateTokenByModel("gpt-4o", text)); also assert
that an empty text with an unknown model returns 0. Target the existing
TestEstimateTokenByModel function and the EstimateTokenByModel calls to validate
the default-to-OpenAI fallback behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@service/token_estimator.go`:
- Around line 147-176: The code currently calls unicode.IsNumber(r) before
checking isMathSymbol(r), causing superscript/subscript digits in mathSymbolSet
to be counted as m.Number; update the logic in the token counting loop (the
block using unicode.IsLetter, unicode.IsNumber, unicode.IsSpace and
isMathSymbol) to call isMathSymbol(r) before unicode.IsNumber(r) and, when
isMathSymbol returns true, increment m.MathSymbol and set state appropriately
(same state handling used for symbols) so those characters are treated as math
symbols rather than numbers.

---

Nitpick comments:
In `@service/token_estimator_test.go`:
- Around line 24-34: The tests for EstimateToken are too weak—update the test
table used in the loop to include an expected token count for each case (e.g.,
add an Expected int field to the test struct), and replace the loose checks with
strict assertions comparing result == tt.Expected (or use table-driven
golden/snapshot values for known inputs) so changes to EstimateToken are caught;
locate the test loop that calls EstimateToken and modify the test cases and
assertions accordingly to validate exact token counts for known inputs as well
as zero for empty text.
- Around line 37-51: Add a test case in TestEstimateTokenByModel to verify
fallback behavior for unknown model names: call EstimateTokenByModel with an
unknown model like "unknown-model" and the sample text and assert it yields a
positive token count (or equals the OpenAI-based estimate such as
EstimateTokenByModel("gpt-4o", text)); also assert that an empty text with an
unknown model returns 0. Target the existing TestEstimateTokenByModel function
and the EstimateTokenByModel calls to validate the default-to-OpenAI fallback
behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f968b7b0-ae02-49c8-b330-b54d9bae1d82

📥 Commits

Reviewing files that changed from the base of the PR and between 9ae9040 and e57e7de.

📒 Files selected for processing (2)
  • service/token_estimator.go
  • service/token_estimator_test.go

Comment on lines +147 to 176
// 非 ASCII 字母/数字 (如带重音的拉丁字母、西里尔字母等)
if unicode.IsLetter(r) {
if state != stLatin {
count += m.Word
state = stLatin
}

// 如果之前不在单词中,或者类型发生变化(字母<->数字),则视为新token
// 注意:对于OpenAI,通常"version 3.5"会切分,"abc123xyz"有时也会切分
// 这里简单起见,字母和数字切换时增加权重
if currentWordType == None || currentWordType != newType {
if newType == Number {
count += m.Number
} else {
count += m.Word
}
currentWordType = newType
continue
}
if unicode.IsNumber(r) {
if state != stNumber {
count += m.Number
state = stNumber
}
// 单词中间的字符不额外计费
continue
}

// 5. 处理标点符号/特殊字符 - 按类型使用不同权重
currentWordType = None
// 空白字符 (非 ASCII 空白,如全角空格)
if unicode.IsSpace(r) {
state = stNone
count += m.Space
continue
}

// 标点符号/特殊字符
state = stNone
if isMathSymbol(r) {
count += m.MathSymbol
} else if r == '@' {
count += m.AtSign
} else if isURLDelim(r) {
count += m.URLDelim
} else {
count += m.Symbol
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Superscript/subscript digits are now counted as numbers, not math symbols.

The mathSymbolSet includes superscript/subscript digits (²³¹⁴⁵⁶⁷⁸⁹⁰, ₀₁₂₃₄₅₆₇₈₉), but unicode.IsNumber(r) at line 155 returns true for these characters before the isMathSymbol check at line 172 is reached. This means they're counted with m.Number instead of m.MathSymbol.

If the original behavior treated these as math symbols (which the inclusion in mathSymbolSet suggests), this is a silent behavioral regression. Consider reordering the checks:

Proposed fix to check math symbols before unicode.IsNumber
 		continue
 	}

+	// 标点符号/特殊字符 (check math symbols early to catch superscript/subscript digits)
+	if isMathSymbol(r) {
+		state = stNone
+		count += m.MathSymbol
+		continue
+	}
+
 	// 非 ASCII 字母/数字 (如带重音的拉丁字母、西里尔字母等)
 	if unicode.IsLetter(r) {
 		if state != stLatin {
 			count += m.Word
 			state = stLatin
 		}
 		continue
 	}
 	if unicode.IsNumber(r) {
 		if state != stNumber {
 			count += m.Number
 			state = stNumber
 		}
 		continue
 	}

 	// 空白字符 (非 ASCII 空白,如全角空格)
 	if unicode.IsSpace(r) {
 		state = stNone
 		count += m.Space
 		continue
 	}

-	// 标点符号/特殊字符
-	state = stNone
-	if isMathSymbol(r) {
-		count += m.MathSymbol
-	} else {
+	state = stNone
 	count += m.Symbol
-	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@service/token_estimator.go` around lines 147 - 176, The code currently calls
unicode.IsNumber(r) before checking isMathSymbol(r), causing
superscript/subscript digits in mathSymbolSet to be counted as m.Number; update
the logic in the token counting loop (the block using unicode.IsLetter,
unicode.IsNumber, unicode.IsSpace and isMathSymbol) to call isMathSymbol(r)
before unicode.IsNumber(r) and, when isMathSymbol returns true, increment
m.MathSymbol and set state appropriately (same state handling used for symbols)
so those characters are treated as math symbols rather than numbers.

mfzzf added 2 commits March 24, 2026 16:35
- \r, \f, \v now correctly use m.Space (matching original unicode.IsSpace behavior)
- Replace weak positive-only assertions with golden tests that lock in
  exact token counts per provider across 9 input categories
- Add dedicated cr_ff_vt, url_heavy, at_sign test cases
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +113 to +116
func BenchmarkEstimateToken_PureEnglish(b *testing.B) {
text := strings.Repeat("The quick brown fox jumps over the lazy dog. ", 200)
for b.Loop() {
EstimateToken(OpenAI, text)
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strings.Repeat(...) is executed as part of the timed benchmark, which will skew ns/op and make the benchmark unsuitable for regression tracking. Consider moving text to a package-level var (like benchText) or calling b.StopTimer()/b.ResetTimer() after building text so only EstimateToken is measured.

Copilot uses AI. Check for mistakes.
Comment on lines +120 to +123
func BenchmarkEstimateToken_PureChinese(b *testing.B) {
text := strings.Repeat("人工智能技术正在快速发展和广泛应用。", 200)
for b.Loop() {
EstimateToken(OpenAI, text)
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strings.Repeat(...) is executed as part of the timed benchmark, which will skew ns/op and make the benchmark unsuitable for regression tracking. Consider moving text to a package-level var or calling b.StopTimer()/b.ResetTimer() after building text so only EstimateToken is measured.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
service/token_estimator.go (1)

148-160: ⚠️ Potential issue | 🟡 Minor

unicode.IsLetter / unicode.IsNumber still shadow isMathSymbol.

Because the generic non-ASCII branches run first, mathSymbolSet entries such as , ², and , plus the mathematical alphanumeric block (0x1D400–0x1D7FF), never reach the math-symbol branch. They are currently charged as m.Word / m.Number instead of m.MathSymbol.

Suggested fix
-		// 非 ASCII 字母/数字 (如带重音的拉丁字母、西里尔字母等)
-		if unicode.IsLetter(r) {
+		// 数学符号要先于通用字母/数字分类,否则上下标数字/字母会落到 Word/Number
+		if isMathSymbol(r) {
+			state = stNone
+			count += m.MathSymbol
+			continue
+		}
+
+		// 非 ASCII 字母/数字 (如带重音的拉丁字母、西里尔字母等)
+		if unicode.IsLetter(r) {
 			if state != stLatin {
 				count += m.Word
 				state = stLatin
 			}
 			continue
@@
-		// 标点符号/特殊字符
-		state = stNone
-		if isMathSymbol(r) {
-			count += m.MathSymbol
-		} else {
-			count += m.Symbol
-		}
+		// 标点符号/特殊字符
+		state = stNone
+		count += m.Symbol

Also applies to: 172-173

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@service/token_estimator.go` around lines 148 - 160, The math-symbol
characters are being classified as letters/numbers because unicode.IsLetter /
unicode.IsNumber checks run before the math-symbol branch; update the branching
in the token estimator (the code that sets state and increments count using
m.Word, m.Number, m.MathSymbol) to test for math symbols first for non-ASCII
runes—call isMathSymbol(r) or check mathSymbolSet before the unicode.IsLetter
and unicode.IsNumber branches (the same change should be applied to both the
letter/number blocks around the stLatin/stNumber state handling at the shown
locations) so that math symbols are charged as m.MathSymbol and state is set
appropriately instead of m.Word/m.Number.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@service/token_estimator.go`:
- Around line 148-160: The math-symbol characters are being classified as
letters/numbers because unicode.IsLetter / unicode.IsNumber checks run before
the math-symbol branch; update the branching in the token estimator (the code
that sets state and increments count using m.Word, m.Number, m.MathSymbol) to
test for math symbols first for non-ASCII runes—call isMathSymbol(r) or check
mathSymbolSet before the unicode.IsLetter and unicode.IsNumber branches (the
same change should be applied to both the letter/number blocks around the
stLatin/stNumber state handling at the shown locations) so that math symbols are
charged as m.MathSymbol and state is set appropriately instead of
m.Word/m.Number.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bbe1a510-1faf-4dba-a517-eb55925e288a

📥 Commits

Reviewing files that changed from the base of the PR and between d5b10df and e5d82bd.

📒 Files selected for processing (1)
  • service/token_estimator.go

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants