Skip to content

perf(index): use to_ascii_lowercase in tokenize hot paths#89

Merged
poyrazK merged 10 commits intomainfrom
fix/tokenize-to-ascii-lowercase
May 2, 2026
Merged

perf(index): use to_ascii_lowercase in tokenize hot paths#89
poyrazK merged 10 commits intomainfrom
fix/tokenize-to-ascii-lowercase

Conversation

@poyrazK
Copy link
Copy Markdown
Owner

@poyrazK poyrazK commented May 1, 2026

Summary

  • tokenize(): replace str::to_lowercase() (Unicode-aware, table-lookup, per-token allocation) with str::to_ascii_lowercase() (O(1) per byte, no tables)
  • extract_positions(): pre-lowercase the text once with to_ascii_lowercase() instead of searching in the original case-preserved field_text — avoids case mismatch issues and reduces overhead
  • Infrastructure log messages are ASCII-heavy; this change eliminates ~3x overhead for common workloads
  • Fixes tokenize: allocation-heavy str::to_lowercase called per-token in hot paths #84

Test plan

  • cargo test --package cloudsearch-index — all 84 tests pass
  • cargo clippy --workspace --all-targets -- -D warnings -D clippy::pedantic — clean

Summary by CodeRabbit

  • Bug Fixes

    • Search now matches terms case-insensitively, improving search accuracy across varied text formats
    • Highlighting functionality correctly identifies matches regardless of text case
    • Highlights properly display for multiple term occurrences within a single field
  • Tests

    • Expanded test coverage for case-insensitive search and highlighting behavior

str::to_lowercase() is Unicode-aware and processes each byte through
Unicode tables — wasteful for ASCII log messages. Replace with
to_ascii_lowercase() which is O(1) per byte with no table lookups.

Also in extract_positions(): lower the text once with to_ascii_lowercase()
instead of calling tokenize() which internally lowercases each token,
then searching in the original field_text. Now search_from uses the
pre-lowercased text.

Fixes: #84
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 1, 2026

Warning

Rate limit exceeded

@poyrazK has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 6 minutes and 20 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b7878f64-b451-4c3d-8e15-94a3529a88a7

📥 Commits

Reviewing files that changed from the base of the PR and between b2f47a1 and c27d7b2.

📒 Files selected for processing (2)
  • rust/crates/cloudsearch-index/src/lib.rs
  • rust/crates/cloudsearch-index/tests/coverage.rs
📝 Walkthrough

Walkthrough

The PR optimizes token matching and position extraction by switching to ASCII lowercasing (to_ascii_lowercase) instead of Unicode-aware lowercasing, and pre-computing lowercased text before searching for token positions. Comprehensive test coverage for case-insensitive highlighting is added.

Changes

Case-Insensitive Token Matching and Position Extraction

Layer / File(s) Summary
Tokenization Optimization
rust/crates/cloudsearch-index/src/lib.rs
tokenize() switches from to_lowercase() to to_ascii_lowercase() for reduced allocation overhead; extract_positions() pre-computes lower_text via to_ascii_lowercase() and searches within the lowercased text to compute correct token byte offsets.
Highlight Behavior Coverage
rust/crates/cloudsearch-index/tests/coverage.rs
Five new #[tokio::test] cases validate case-insensitive highlighting: matching uppercase text with lowercase query, multiple occurrences in a single field, absence when term does not match, and absence when the field is empty. Import statements are reformatted.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A rabbit hops through lowercased hay,
No Unicode burden to slow the day—
ASCII letters, swift and lean,
Highlights dance where matches gleam.
Positions found, case-blind and bright! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main performance optimization: replacing per-token Unicode lowercasing with ASCII-only lowercasing in hot paths.
Linked Issues check ✅ Passed Changes directly address issue #84: replaced per-token str::to_lowercase() with to_ascii_lowercase() in tokenize and extract_positions, and optimized repeated find() calls via pre-lowercased text.
Out of Scope Changes check ✅ Passed Test extensions validate the core changes (case-insensitive matching, position extraction, edge cases) and are directly related to the lowercasing optimization—no out-of-scope modifications detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/tokenize-to-ascii-lowercase

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 6 minutes and 20 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

poyrazK added 4 commits May 1, 2026 22:07
4 new tests in coverage.rs:
- highlight_positions_case_insensitive: lowercase token "error" finds
  positions in "ERROR log ERROR" (proves extract_positions searches
  in to_ascii_lowercase'd text)
- highlight_positions_multiple_occurrences: same term appears twice,
  all positions captured
- highlight_positions_no_match: no highlight when term not in document
- highlight_positions_empty_field: no highlight for empty text field

Tests use MatchQuery (not TermQuery) because MatchQuery tokenizes and
matches case-insensitively, while TermQuery requires exact value match.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
rust/crates/cloudsearch-index/src/lib.rs (1)

1003-1012: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

search_from += pos + 1 panics on any non-ASCII token

When a token produced by tokenize() starts with a multi-byte UTF-8 character (e.g., 'ß', 'ü', 'é'), pos + 1 advances by one byte into the continuation byte of that character. The very next iteration's lower_text[search_from..] will panic with "byte index N is not a char boundary".

Concrete trace: text = "ßuse", token "ß" (2 bytes). pos = 0, search_from = 1. lower_text[1..]0xBC is not a char boundary → panic.

Replacing with pos + token.len() is the correct advancement: it always lands on the first byte after the matched token (a valid UTF-8 boundary) and also eliminates spurious overlapping matches.

🐛 Proposed fix
-                        search_from += pos + 1;
+                        search_from += pos + token.len();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rust/crates/cloudsearch-index/src/lib.rs` around lines 1003 - 1012, The loop
over tokens in lib.rs advances the byte index incorrectly with search_from +=
pos + 1 which can land inside a UTF-8 continuation byte and panic for multi-byte
characters; update the advancement to use the matched token's byte length so the
next slice starts at a valid UTF-8 boundary (i.e., replace the increment using
pos + 1 with pos + token.len()), keeping the rest of the logic that records
offsets into seen_offsets for the token.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@rust/crates/cloudsearch-index/tests/coverage.rs`:
- Around line 302-319: The tests highlight_positions_no_match and
highlight_positions_empty_field call handle.flush() but never call
handle.refresh(), leaving the indexed document in pending_operations so search()
sees zero documents; add a call to handle.refresh().await.expect("refresh")
immediately before handle.flush().await.expect("flush") in both tests (same
place you see refresh used in
highlight_positions_case_insensitive/highlight_positions_multiple_occurrences)
so the document moves into searchable_documents and the search/highlight logic
is actually exercised.

---

Outside diff comments:
In `@rust/crates/cloudsearch-index/src/lib.rs`:
- Around line 1003-1012: The loop over tokens in lib.rs advances the byte index
incorrectly with search_from += pos + 1 which can land inside a UTF-8
continuation byte and panic for multi-byte characters; update the advancement to
use the matched token's byte length so the next slice starts at a valid UTF-8
boundary (i.e., replace the increment using pos + 1 with pos + token.len()),
keeping the rest of the logic that records offsets into seen_offsets for the
token.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 404b621f-0aa9-4110-9f0b-c2f6b54d62ce

📥 Commits

Reviewing files that changed from the base of the PR and between d572f12 and b2f47a1.

📒 Files selected for processing (2)
  • rust/crates/cloudsearch-index/src/lib.rs
  • rust/crates/cloudsearch-index/tests/coverage.rs

Comment thread rust/crates/cloudsearch-index/tests/coverage.rs
poyrazK added 5 commits May 2, 2026 15:04
In extract_positions(), replace unwrap_or(0) with let-else that logs
a warning and continues when u32::try_from(search_from + pos) fails.
Previously, overflow silently returned 0, corrupting position data.

Fixes pre-existing issue identified in PR #89 review.
…sh in tests

1. extract_positions: use pos + token.len() instead of pos + 1 when
advancing search_from. Using pos + 1 could land inside a UTF-8
continuation byte for multi-byte characters, causing the next slice
to start at an invalid boundary and panic.

2. coverage tests: add handle.refresh() before flush in
highlight_positions_no_match and highlight_positions_empty_field
tests so the indexed document moves into searchable_documents
before the search is executed.
1. Add 4 tokenize unit tests in lib.rs (same file, #[cfg(test)] module):
   - tokenize_lowercase_converts_correctly: "Hello World" → ["hello", "world"]
   - tokenize_preserves_single_tokens: "test" → ["test"]
   - tokenize_multiple_whitespace_collapsed: "a   b" → ["a", "b"]
   - tokenize_empty_string: "" → []

2. Strengthen highlight_positions_no_match assertion — doc is indexed
   and flushed, then searched with MatchQuery for non-matching term.
   Asserts the hit has no highlight (not just that hits is empty).
Copy link
Copy Markdown
Owner Author

@poyrazK poyrazK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's okay to merge

@poyrazK poyrazK merged commit 9a843ed into main May 2, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tokenize: allocation-heavy str::to_lowercase called per-token in hot paths

1 participant