perf(index): use to_ascii_lowercase in tokenize hot paths #89
Conversation
`str::to_lowercase()` is Unicode-aware and runs every character through Unicode case-mapping tables, which is wasteful for ASCII log messages. Replace it with `to_ascii_lowercase()`, which is O(1) per byte with no table lookups. Also, in `extract_positions()`: lowercase the text once with `to_ascii_lowercase()` instead of calling `tokenize()` (which internally lowercases each token) and then searching in the original `field_text`. `search_from` now indexes into the pre-lowercased text. Fixes: #84
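As a rough illustration of both changes, a minimal sketch (function bodies are illustrative, not the crate's actual code; `tokenize` is assumed to split on whitespace, and the loop advances by `pos + token.len()`, the form the review below settles on):

```rust
// Sketch of the two changes, assuming tokenize() splits on whitespace.
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(str::to_ascii_lowercase) // was str::to_lowercase()
        .collect()
}

fn extract_positions(field_text: &str, tokens: &[String]) -> Vec<usize> {
    // Lowercase the whole field once, then search this copy instead of
    // the original case-preserved text.
    let lower_text = field_text.to_ascii_lowercase();
    let mut positions = Vec::new();
    for token in tokens {
        let mut search_from = 0;
        while let Some(pos) = lower_text[search_from..].find(token.as_str()) {
            positions.push(search_from + pos);
            // Advance past the match (see the review note on this below).
            search_from += pos + token.len();
        }
    }
    positions
}

fn main() {
    let tokens = tokenize("ERROR Timeout");
    assert_eq!(tokens, ["error", "timeout"]);
    assert_eq!(extract_positions("ERROR log ERROR", &tokens[..1]), [0, 10]);
}
```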
📝 Walkthrough: The PR optimizes token matching and position extraction by switching to ASCII lowercasing.
4 new tests in coverage.rs:

- `highlight_positions_case_insensitive`: the lowercase token "error" finds positions in "ERROR log ERROR" (proves `extract_positions` searches in the `to_ascii_lowercase`'d text)
- `highlight_positions_multiple_occurrences`: the same term appears twice; all positions are captured
- `highlight_positions_no_match`: no highlight when the term is not in the document
- `highlight_positions_empty_field`: no highlight for an empty text field

Tests use `MatchQuery` (not `TermQuery`) because `MatchQuery` tokenizes and matches case-insensitively, while `TermQuery` requires an exact value match.
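A standalone sketch of the property the first two tests exercise (`byte_offsets` is a hypothetical helper, not the crate's test code): searching a pre-lowercased copy of the field finds every occurrence of a lowercase token regardless of the document's casing.

```rust
// Hypothetical helper: collect the byte offsets of every non-overlapping
// occurrence of `token` in a pre-lowercased copy of the field.
fn byte_offsets(field_text: &str, token: &str) -> Vec<usize> {
    field_text
        .to_ascii_lowercase()
        .match_indices(token)
        .map(|(pos, _)| pos)
        .collect()
}

fn main() {
    assert_eq!(byte_offsets("ERROR log ERROR", "error"), [0, 10]); // case-insensitive, both hits
    assert!(byte_offsets("all good here", "error").is_empty());    // no match -> no highlight
    assert!(byte_offsets("", "error").is_empty());                 // empty field -> no highlight
}
```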
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
rust/crates/cloudsearch-index/src/lib.rs (1)
Lines 1003-1012: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
`search_from += pos + 1` panics on any non-ASCII token

When a token produced by `tokenize()` starts with a multi-byte UTF-8 character (e.g., 'ß', 'ü', 'é'), `pos + 1` advances by one byte into the continuation byte of that character. The very next iteration's `lower_text[search_from..]` will panic with "byte index N is not a char boundary".

Concrete trace: text = "ßuse", token "ß" (2 bytes). `pos = 0`, `search_from = 1`. `lower_text[1..]` starts on 0x9F, the continuation byte of 'ß', which is not a char boundary → panic.

Replacing with `pos + token.len()` is the correct advancement: it always lands on the first byte after the matched token (a valid UTF-8 boundary) and also eliminates spurious overlapping matches.

🐛 Proposed fix

```diff
- search_from += pos + 1;
+ search_from += pos + token.len();
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rust/crates/cloudsearch-index/src/lib.rs` around lines 1003-1012: the loop over tokens advances the byte index incorrectly with `search_from += pos + 1`, which can land inside a UTF-8 continuation byte and panic for multi-byte characters. Update the advancement to use the matched token's byte length so the next slice starts at a valid UTF-8 boundary (i.e., replace `pos + 1` with `pos + token.len()`), keeping the rest of the logic that records offsets into `seen_offsets` for the token.
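For concreteness, a self-contained reproduction of the boundary panic described above, assuming the search loop has the shape shown in the comment (names are illustrative, not the crate's code):

```rust
// Reproduction sketch: step_by_one = true models the buggy `pos + 1`
// advancement; false models the fixed `pos + token.len()`.
fn positions(text: &str, token: &str, step_by_one: bool) -> Vec<usize> {
    let lower = text.to_ascii_lowercase();
    let mut out = Vec::new();
    let mut search_from = 0;
    while let Some(pos) = lower[search_from..].find(token) {
        out.push(search_from + pos);
        search_from += if step_by_one { pos + 1 } else { pos + token.len() };
    }
    out
}

fn main() {
    // Fixed advancement: "ß" found at byte 0, loop terminates cleanly.
    assert_eq!(positions("ßuse", "ß", false), [0]);
    // Buggy advancement: search_from becomes 1, which is inside the
    // two-byte 'ß', so slicing lower[1..] panics on a non-char-boundary.
    assert!(std::panic::catch_unwind(|| positions("ßuse", "ß", true)).is_err());
}
```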
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@rust/crates/cloudsearch-index/tests/coverage.rs`:
- Around lines 302-319: The tests highlight_positions_no_match and
highlight_positions_empty_field call handle.flush() but never call
handle.refresh(), leaving the indexed document in pending_operations so search()
sees zero documents; add a call to handle.refresh().await.expect("refresh")
immediately before handle.flush().await.expect("flush") in both tests (same
place you see refresh used in
highlight_positions_case_insensitive/highlight_positions_multiple_occurrences)
so the document moves into searchable_documents and the search/highlight logic
is actually exercised.
📒 Files selected for processing (2)

- rust/crates/cloudsearch-index/src/lib.rs
- rust/crates/cloudsearch-index/tests/coverage.rs
In `extract_positions()`, replace `unwrap_or(0)` with a let-else that logs a warning and continues when `u32::try_from(search_from + pos)` fails. Previously, overflow silently returned 0, corrupting position data. Fixes a pre-existing issue identified in the PR #89 review.
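A sketch of the let-else pattern this commit describes (hypothetical helper; the real code presumably logs through the crate's logger rather than `eprintln!`):

```rust
// Hypothetical helper: convert byte offsets to u32 positions, warning and
// skipping on overflow instead of silently recording 0.
fn to_u32_positions(byte_offsets: &[usize]) -> Vec<u32> {
    let mut positions = Vec::new();
    for &off in byte_offsets {
        // Before: u32::try_from(off).unwrap_or(0) silently pushed 0.
        // After: warn and skip the offending position.
        let Ok(pos) = u32::try_from(off) else {
            eprintln!("warning: position {off} exceeds u32::MAX, skipping");
            continue;
        };
        positions.push(pos);
    }
    positions
}

fn main() {
    assert_eq!(to_u32_positions(&[3, 17]), [3, 17]);
    // On a 64-bit target, usize::MAX overflows u32 and is skipped.
    assert!(to_u32_positions(&[usize::MAX]).is_empty());
}
```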
…sh in tests

1. `extract_positions`: use `pos + token.len()` instead of `pos + 1` when advancing `search_from`. Using `pos + 1` could land inside a UTF-8 continuation byte for multi-byte characters, causing the next slice to start at an invalid boundary and panic.
2. Coverage tests: add `handle.refresh()` before flush in the `highlight_positions_no_match` and `highlight_positions_empty_field` tests so the indexed document moves into `searchable_documents` before the search is executed.
1. Add 4 `tokenize` unit tests in lib.rs (same file, `#[cfg(test)]` module), sketched below:
   - `tokenize_lowercase_converts_correctly`: "Hello World" → ["hello", "world"]
   - `tokenize_preserves_single_tokens`: "test" → ["test"]
   - `tokenize_multiple_whitespace_collapsed`: "a   b" → ["a", "b"]
   - `tokenize_empty_string`: "" → []
2. Strengthen the `highlight_positions_no_match` assertion: the doc is indexed and flushed, then searched with `MatchQuery` for a non-matching term. Asserts the hit has no highlight (not just that hits is empty).
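A sketch of what those tests might look like; the `tokenize` body here is a stand-in so the example compiles on its own, under the assumption that the crate's function splits on whitespace and ASCII-lowercases as described in this PR:

```rust
// Stand-in tokenize so the sketch is self-contained; the crate's real
// function is assumed to behave the same way on these inputs.
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(str::to_ascii_lowercase)
        .collect()
}

#[cfg(test)]
mod tests {
    use super::tokenize;

    #[test]
    fn tokenize_lowercase_converts_correctly() {
        assert_eq!(tokenize("Hello World"), ["hello", "world"]);
    }

    #[test]
    fn tokenize_preserves_single_tokens() {
        assert_eq!(tokenize("test"), ["test"]);
    }

    #[test]
    fn tokenize_multiple_whitespace_collapsed() {
        assert_eq!(tokenize("a   b"), ["a", "b"]);
    }

    #[test]
    fn tokenize_empty_string() {
        assert!(tokenize("").is_empty());
    }
}
```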
Summary

- `tokenize()`: replace `str::to_lowercase()` (Unicode-aware, table lookups, per-token allocation) with `str::to_ascii_lowercase()` (O(1) per byte, no tables)
- `extract_positions()`: pre-lowercase the text once with `to_ascii_lowercase()` instead of searching in the original case-preserved `field_text`, avoiding case-mismatch issues and reducing overhead

Test plan

- `cargo test --package cloudsearch-index`: all 84 tests pass
- `cargo clippy --workspace --all-targets -- -D warnings -D clippy::pedantic`: clean