fix(memory): handle punctuation and whitespace in extractWords #624
koriyoshi2041 wants to merge 2 commits into
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
      res := make(map[string]struct{})

    - for s := range strings.SplitSeq(text, " ") {
    + for _, s := range strings.Fields(text) {
Can you keep using strings.SplitSeq? That avoids unnecessary allocation.
Good point! Switched to strings.FieldsSeq which gives us the iterator (no slice allocation) while also splitting on all unicode whitespace (tabs, newlines, etc.) — best of both worlds.
extractWords previously used strings.SplitSeq with a space delimiter, which missed tabs, newlines, and other whitespace. It also stored words with surrounding punctuation (e.g. "great!" instead of "great"), causing keyword search to miss relevant results.

Replace strings.SplitSeq with strings.Fields to split on all Unicode whitespace, and add strings.TrimFunc to strip leading/trailing non-letter, non-number characters from each word.

Add test cases for punctuation stripping, multi-line text with tabs/newlines, and comma-separated values.

Fixes google#569
Switch from strings.Fields (allocates []string) to strings.FieldsSeq (returns iterator) per reviewer feedback, while keeping the whitespace and punctuation handling improvements.
092b1cc to 93ee368
Rebased this onto current Checked locally:
Summary
Fixes #569.
extractWords in memory/inmemory.go currently splits text only on spaces (strings.SplitSeq(text, " ")), which means:
- Tabs, newlines, and other whitespace are not treated as separators
- Punctuation attached to words ("great!", "banana,") prevents keyword matching

Changes
- Replace strings.SplitSeq(text, " ") with strings.Fields(text) to handle all Unicode whitespace (tabs, newlines, multiple spaces)
- Add strings.TrimFunc to strip non-letter/non-number characters from word boundaries

Test plan
- go test ./memory/ -run Test_inMemoryService_SearchMemory: all 8 tests pass (5 existing + 3 new)
- go vet ./memory/: clean
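To make the whitespace problem from the summary concrete, the sketch below compares splitting on a single space with strings.Fields on the same input. The function names bySpace and byFields and the sample text are hypothetical, chosen only for this illustration. strings.Split is used in place of strings.SplitSeq so the result can be printed as a slice; both split on the same separator.

```go
package main

import (
	"fmt"
	"strings"
)

// bySpace splits only on the ASCII space character, as the old
// delimiter-based approach did.
func bySpace(text string) []string {
	return strings.Split(text, " ")
}

// byFields splits on any run of Unicode whitespace.
func byFields(text string) []string {
	return strings.Fields(text)
}

func main() {
	text := "apple\tbanana\ncherry  date"
	// Tab and newline are not separators here, and the double space
	// produces an empty element.
	fmt.Printf("%q\n", bySpace(text))
	// Every whitespace run is a single separator here.
	fmt.Printf("%q\n", byFields(text))
}
```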