perf(index): use to_ascii_lowercase in tokenize hot paths by poyrazK · Pull Request #89 · poyrazK/cloudSearch

poyrazK · 2026-05-01T18:52:32Z

Summary

tokenize(): replace str::to_lowercase() (Unicode-aware, table-lookup, per-token allocation) with str::to_ascii_lowercase() (O(1) per byte, no tables)
extract_positions(): pre-lowercase the text once with to_ascii_lowercase() instead of searching in the original case-preserved field_text — avoids case mismatch issues and reduces overhead
Infrastructure log messages are ASCII-heavy; this change eliminates ~3x overhead for common workloads
Fixes tokenize: allocation-heavy str::to_lowercase called per-token in hot paths #84

Test plan

cargo test --package cloudsearch-index — all 84 tests pass
cargo clippy --workspace --all-targets -- -D warnings -D clippy::pedantic — clean

Summary by CodeRabbit

Bug Fixes
- Search now matches terms case-insensitively, improving search accuracy across varied text formats
- Highlighting functionality correctly identifies matches regardless of text case
- Highlights properly display for multiple term occurrences within a single field
Tests
- Expanded test coverage for case-insensitive search and highlighting behavior

str::to_lowercase() is Unicode-aware and processes each byte through Unicode tables — wasteful for ASCII log messages. Replace with to_ascii_lowercase() which is O(1) per byte with no table lookups. Also in extract_positions(): lower the text once with to_ascii_lowercase() instead of calling tokenize() which internally lowercases each token, then searching in the original field_text. Now search_from uses the pre-lowercased text. Fixes: #84

coderabbitai · 2026-05-01T18:52:39Z

Warning

Rate limit exceeded

@poyrazK has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 6 minutes and 20 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b7878f64-b451-4c3d-8e15-94a3529a88a7

📥 Commits

Reviewing files that changed from the base of the PR and between b2f47a1 and c27d7b2.

📒 Files selected for processing (2)

rust/crates/cloudsearch-index/src/lib.rs
rust/crates/cloudsearch-index/tests/coverage.rs

📝 Walkthrough

Walkthrough

The PR optimizes token matching and position extraction by switching to ASCII lowercasing (to_ascii_lowercase) instead of Unicode-aware lowercasing, and pre-computing lowercased text before searching for token positions. Comprehensive test coverage for case-insensitive highlighting is added.

Changes

Case-Insensitive Token Matching and Position Extraction

Layer / File(s)	Summary
Tokenization Optimization `rust/crates/cloudsearch-index/src/lib.rs`	`tokenize()` switches from `to_lowercase()` to `to_ascii_lowercase()` for reduced allocation overhead; `extract_positions()` pre-computes `lower_text` via `to_ascii_lowercase()` and searches within the lowercased text to compute correct token byte offsets.
Highlight Behavior Coverage `rust/crates/cloudsearch-index/tests/coverage.rs`	Five new `#[tokio::test]` cases validate case-insensitive highlighting: matching uppercase text with lowercase query, multiple occurrences in a single field, absence when term does not match, and absence when the field is empty. Import statements are reformatted.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A rabbit hops through lowercased hay,
No Unicode burden to slow the day—
ASCII letters, swift and lean,
Highlights dance where matches gleam.
Positions found, case-blind and bright! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main performance optimization: replacing per-token Unicode lowercasing with ASCII-only lowercasing in hot paths.
Linked Issues check	✅ Passed	Changes directly address issue `#84`: replaced per-token str::to_lowercase() with to_ascii_lowercase() in tokenize and extract_positions, and optimized repeated find() calls via pre-lowercased text.
Out of Scope Changes check	✅ Passed	Test extensions validate the core changes (case-insensitive matching, position extraction, edge cases) and are directly related to the lowercasing optimization—no out-of-scope modifications detected.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/tokenize-to-ascii-lowercase

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Review rate limit: 0/1 reviews remaining, refill in 6 minutes and 20 seconds.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

4 new tests in coverage.rs: - highlight_positions_case_insensitive: lowercase token "error" finds positions in "ERROR log ERROR" (proves extract_positions searches in to_ascii_lowercase'd text) - highlight_positions_multiple_occurrences: same term appears twice, all positions captured - highlight_positions_no_match: no highlight when term not in document - highlight_positions_empty_field: no highlight for empty text field Tests use MatchQuery (not TermQuery) because MatchQuery tokenizes and matches case-insensitively, while TermQuery requires exact value match.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

rust/crates/cloudsearch-index/src/lib.rs (1)
1003-1012: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

search_from += pos + 1 panics on any non-ASCII token

When a token produced by tokenize() starts with a multi-byte UTF-8 character (e.g., 'ß', 'ü', 'é'), pos + 1 advances by one byte into the continuation byte of that character. The very next iteration's lower_text[search_from..] will panic with "byte index N is not a char boundary".

Concrete trace: text = "ßuse", token "ß" (2 bytes). pos = 0, search_from = 1. lower_text[1..] → 0xBC is not a char boundary → panic.

Replacing with pos + token.len() is the correct advancement: it always lands on the first byte after the matched token (a valid UTF-8 boundary) and also eliminates spurious overlapping matches.
🐛 Proposed fix
-                        search_from += pos + 1;
+                        search_from += pos + token.len();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rust/crates/cloudsearch-index/src/lib.rs` around lines 1003 - 1012, The loop
over tokens in lib.rs advances the byte index incorrectly with search_from +=
pos + 1 which can land inside a UTF-8 continuation byte and panic for multi-byte
characters; update the advancement to use the matched token's byte length so the
next slice starts at a valid UTF-8 boundary (i.e., replace the increment using
pos + 1 with pos + token.len()), keeping the rest of the logic that records
offsets into seen_offsets for the token.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@rust/crates/cloudsearch-index/tests/coverage.rs`:
- Around line 302-319: The tests highlight_positions_no_match and
highlight_positions_empty_field call handle.flush() but never call
handle.refresh(), leaving the indexed document in pending_operations so search()
sees zero documents; add a call to handle.refresh().await.expect("refresh")
immediately before handle.flush().await.expect("flush") in both tests (same
place you see refresh used in
highlight_positions_case_insensitive/highlight_positions_multiple_occurrences)
so the document moves into searchable_documents and the search/highlight logic
is actually exercised.

---

Outside diff comments:
In `@rust/crates/cloudsearch-index/src/lib.rs`:
- Around line 1003-1012: The loop over tokens in lib.rs advances the byte index
incorrectly with search_from += pos + 1 which can land inside a UTF-8
continuation byte and panic for multi-byte characters; update the advancement to
use the matched token's byte length so the next slice starts at a valid UTF-8
boundary (i.e., replace the increment using pos + 1 with pos + token.len()),
keeping the rest of the logic that records offsets into seen_offsets for the
token.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 404b621f-0aa9-4110-9f0b-c2f6b54d62ce

📥 Commits

Reviewing files that changed from the base of the PR and between d572f12 and b2f47a1.

📒 Files selected for processing (2)

rust/crates/cloudsearch-index/src/lib.rs
rust/crates/cloudsearch-index/tests/coverage.rs

In extract_positions(), replace unwrap_or(0) with let-else that logs a warning and continues when u32::try_from(search_from + pos) fails. Previously, overflow silently returned 0, corrupting position data. Fixes pre-existing issue identified in PR #89 review.

…sh in tests 1. extract_positions: use pos + token.len() instead of pos + 1 when advancing search_from. Using pos + 1 could land inside a UTF-8 continuation byte for multi-byte characters, causing the next slice to start at an invalid boundary and panic. 2. coverage tests: add handle.refresh() before flush in highlight_positions_no_match and highlight_positions_empty_field tests so the indexed document moves into searchable_documents before the search is executed.

1. Add 4 tokenize unit tests in lib.rs (same file, #[cfg(test)] module): - tokenize_lowercase_converts_correctly: "Hello World" → ["hello", "world"] - tokenize_preserves_single_tokens: "test" → ["test"] - tokenize_multiple_whitespace_collapsed: "a b" → ["a", "b"] - tokenize_empty_string: "" → [] 2. Strengthen highlight_positions_no_match assertion — doc is indexed and flushed, then searched with MatchQuery for non-matching term. Asserts the hit has no highlight (not just that hits is empty).

poyrazK

It's okay to merge

poyrazK added 4 commits May 1, 2026 22:07

style: apply rustfmt

044557f

style: apply rustfmt

535bf61

style: apply rustfmt to coverage tests

b2f47a1

coderabbitai Bot reviewed May 2, 2026

View reviewed changes

Comment thread rust/crates/cloudsearch-index/tests/coverage.rs

poyrazK added 5 commits May 2, 2026 15:04

test(index): strengthen highlight_positions_no_match with MatchAll query

71b2e70

style: apply rustfmt to coverage.rs

c27d7b2

poyrazK commented May 2, 2026

View reviewed changes

poyrazK merged commit 9a843ed into main May 2, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(index): use to_ascii_lowercase in tokenize hot paths#89

perf(index): use to_ascii_lowercase in tokenize hot paths#89
poyrazK merged 10 commits intomainfrom
fix/tokenize-to-ascii-lowercase

poyrazK commented May 1, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 1, 2026 •

edited

Loading

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

poyrazK left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

poyrazK commented May 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

poyrazK left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

poyrazK commented May 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 1, 2026 •

edited

Loading