fix(memory): handle punctuation and whitespace in extractWords#624

Open
koriyoshi2041 wants to merge 2 commits into
google:mainfrom
koriyoshi2041:fix/extractwords-punctuation-handling
Conversation

@koriyoshi2041

Summary

Fixes #569.

extractWords in memory/inmemory.go currently splits text only on spaces (strings.SplitSeq(text, " ")), which means:

  • Words separated by tabs or newlines are not properly tokenized
  • Punctuation attached to words (e.g., "great!", "banana,") prevents keyword matching

Changes

  • Replace strings.SplitSeq(text, " ") with strings.Fields(text) to handle all Unicode whitespace (tabs, newlines, multiple spaces)
  • Add strings.TrimFunc to strip non-letter/non-number characters from word boundaries
  • Add 3 test cases covering punctuation, multi-line text, and comma-separated values
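A minimal sketch of the approach listed above, under stated assumptions: `extractWordsSketch` and `isNotLetterOrNumber` are hypothetical names, and the lowercasing and empty-token check are guesses at the surrounding logic rather than a copy of the code in `memory/inmemory.go`.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isNotLetterOrNumber reports whether r should be trimmed from a word boundary.
func isNotLetterOrNumber(r rune) bool {
	return !unicode.IsLetter(r) && !unicode.IsNumber(r)
}

// extractWordsSketch is an illustrative version of the change, not the real function.
func extractWordsSketch(text string) map[string]struct{} {
	res := make(map[string]struct{})
	// strings.Fields splits on any run of Unicode whitespace.
	for _, s := range strings.Fields(text) {
		// TrimFunc only strips leading and trailing runes, so "don't" keeps its apostrophe.
		s = strings.TrimFunc(s, isNotLetterOrNumber)
		if s == "" {
			continue // the token was punctuation only, e.g. "--"
		}
		res[strings.ToLower(s)] = struct{}{}
	}
	return res
}

func main() {
	words := extractWordsSketch("This is great!\tI like banana,\napple -- don't")

	_, hasGreat := words["great"]
	_, hasBanana := words["banana"]
	fmt.Println(hasGreat, hasBanana) // prints "true true"
}
```

Because trimming only touches word boundaries, interior punctuation such as hyphens or apostrophes stays inside the token.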

Test plan

  • go test ./memory/ -run Test_inMemoryService_SearchMemory — all 8 tests pass (5 existing + 3 new)
  • go vet ./memory/ — clean

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@google-cla

google-cla Bot commented Mar 6, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Comment thread memory/inmemory.go Outdated
  res := make(map[string]struct{})

- for s := range strings.SplitSeq(text, " ") {
+ for _, s := range strings.Fields(text) {
Contributor

Can you keep using strings.SplitSeq? That avoids unnecessary allocation.

Author

Good point! Switched to strings.FieldsSeq, which returns an iterator (no slice allocation) and still splits on all Unicode whitespace (tabs, newlines, etc.), so we get the best of both worlds.

extractWords previously used strings.SplitSeq with a space delimiter,
which missed tabs, newlines, and other whitespace. It also stored words
with surrounding punctuation (e.g. "great!" instead of "great"),
causing keyword search to miss relevant results.

Replace strings.SplitSeq with strings.Fields to split on all Unicode
whitespace, and add strings.TrimFunc to strip leading/trailing
non-letter, non-number characters from each word.

Add test cases for punctuation stripping, multi-line text with
tabs/newlines, and comma-separated values.

Fixes google#569

Switch from strings.Fields (allocates []string) to strings.FieldsSeq
(returns iterator) per reviewer feedback, while keeping the whitespace
and punctuation handling improvements.
@koriyoshi2041 koriyoshi2041 force-pushed the fix/extractwords-punctuation-handling branch from 092b1cc to 93ee368 Compare June 8, 2026 08:15
@koriyoshi2041
Author

Rebased this onto current main and cleaned up the gofmt alignment in the added memory tests. The change still uses strings.FieldsSeq, so it keeps the iterator/no-slice-allocation path from your review note while handling punctuation and other whitespace.

Checked locally:

go test ./memory
go test -race -mod=readonly ./memory
go vet ./memory
git diff --check

Development

Successfully merging this pull request may close these issues.

Bug: Memory search fails to match words with punctuation in extractWords

2 participants