
fix: Limit relevant fields in default wildcard search #768

Draft
LarsV123 wants to merge 3 commits into main from larsv/2026/03/18-fix-clause-count-issue

LarsV123 commented Mar 18, 2026

Contributor

NOTE: This is a possible fix to address the issue described in https://sikt.atlassian.net/browse/NP-50823. Because it changes default behavior significantly, it will likely break something unexpected and should be tested carefully.

Problem

The CROSS_FIELDS multi-match query in searchAllWithBoostsQuery used "*" (wildcard) for the default field set, which OpenSearch expands to every field in the index (~433 text fields in prod). Combined with Operator.AND, this creates tokens × fields boolean clauses.

The .limit(7) caps the number of space-separated words, but the standard tokenizer also splits on hyphens, so a query like "Genome-wide association meta-analysis..." (7 words) produces 9+ tokens. With ~433 fields, 9 tokens give ~3,900 clauses and 10 tokens give ~4,330, which exceeds the maxClauseCount of 4096. Production has more dynamically-mapped fields than test, which is why the same queries work in test but fail in prod.
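
As a rough illustration of the arithmetic above, here is a minimal Java sketch. It is not code from this PR: the class name, the regex-based tokenizer (a crude stand-in for the standard tokenizer, which follows Unicode word-boundary rules and differs on edge cases), and the completed example query are assumptions made purely for illustration. The field counts (433 and 15) and the 4096 limit are the figures quoted in this description.

```java
/** Rough estimate of the clauses produced by a CROSS_FIELDS multi-match with Operator.AND. */
final class ClauseCountEstimate {

  // Limit quoted in the PR description.
  static final int MAX_CLAUSE_COUNT = 4096;

  // Crude approximation of the standard tokenizer: split on any run of
  // characters that are neither letters nor digits (so hyphens split words).
  static int countTokens(String query) {
    int count = 0;
    for (String token : query.split("[^\\p{L}\\p{N}]+")) {
      if (!token.isEmpty()) {
        count++;
      }
    }
    return count;
  }

  // One clause per (token, field) pair.
  static int estimateClauses(String query, int fieldCount) {
    return countTokens(query) * fieldCount;
  }

  public static void main(String[] args) {
    // Made-up 7-word query with two hyphenated words; not the real query from the issue.
    String query = "Genome-wide association meta-analysis identifies five new loci";
    System.out.println("tokens:   " + countTokens(query));
    System.out.println("wildcard: " + estimateClauses(query, 433));
    System.out.println("curated:  " + estimateClauses(query, 15));
    System.out.println("limit:    " + MAX_CLAUSE_COUNT);
  }
}
```

With this made-up query the sketch counts 9 tokens, so 9 × 433 = 3,897 clauses for the wildcard case and 9 × 15 = 135 for a 15-field list. One additional token pushes the wildcard case to 4,330, which is over the limit.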

Fix

Replaced the "*" wildcard with an explicit DEFAULT_SEARCH_ALL_FIELDS map containing 15 curated fields that are meaningful for free-text publication search:

  • Title (boosted to PI), abstract, tags
  • Identifiers (identifier, publisher ID, DOI)
  • Contributor names
  • Journal/publisher names
  • Affiliation labels (Bokmål, English, Nynorsk; text fields for proper analyzer support)
  • Funding identifiers

Clause count: 15 fields × ~10 tokens = ~150 clauses (vs 4000+ before), well within the 4096 limit — even with hyphenated queries.

When users explicitly specify the fields parameter (NODES_SEARCHED), the behavior is unchanged — only the default "search all" case is affected.
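
The default-versus-explicit behavior described above could be sketched roughly as follows. This is an assumption-laden illustration, not the code in this PR: the class and method names, the field paths, and the boost values are hypothetical, and how the real code detects the "search all" case is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: choose between caller-supplied fields and a curated default set. */
final class SearchFieldSelection {

  // Hypothetical field paths and boosts, standing in for DEFAULT_SEARCH_ALL_FIELDS.
  static final Map<String, Float> CURATED_DEFAULTS =
      Map.of(
          "entityDescription.mainTitle", 3F,
          "entityDescription.abstract", 1F,
          "entityDescription.tags", 1F);

  /** Returns the curated defaults unless the caller named fields explicitly. */
  static Map<String, Float> resolveFields(List<String> requestedFields) {
    if (requestedFields == null || requestedFields.isEmpty()) {
      return CURATED_DEFAULTS;
    }
    Map<String, Float> explicit = new LinkedHashMap<>();
    for (String field : requestedFields) {
      explicit.put(field, 1F); // neutral boost for caller-chosen fields
    }
    return explicit;
  }
}
```

The point of the sketch is only that the curated map replaces the wildcard in one branch, while requests that name their own fields pass through untouched.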

Alternative: copy_to field

If the curated field list turns out to be too restrictive, a more robust long-term option is to add a copy_to directive in the index mapping that copies all searchable fields into a single combined text field (e.g. _search_all). The multi-match query would then target just 1 field instead of N, eliminating the clause explosion entirely while still supporting true all-field search. The trade-off is that it requires a mapping change and full re-index.
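
For reference, a mapping using copy_to might look something like the sketch below, held as a Java text block (Java 15+). Everything specific in it is an assumption: the class name, the two source fields, and the combined field name `search_all` are placeholders rather than the actual index mapping.

```java
/** Sketch of an index mapping that funnels several fields into one combined text field. */
final class CombinedFieldMapping {

  // Placeholder name for the combined field.
  static final String COMBINED_FIELD = "search_all";

  // Each source field lists the combined field in copy_to; the combined field
  // itself is an ordinary text field that the query would then target.
  static String mappingJson() {
    return """
        {
          "properties": {
            "mainTitle": { "type": "text", "copy_to": "%1$s" },
            "abstract":  { "type": "text", "copy_to": "%1$s" },
            "%1$s":      { "type": "text" }
          }
        }
        """.formatted(COMBINED_FIELD);
  }
}
```

Because copy_to copies the raw field values at index time, existing documents only gain the combined field after being re-indexed, which is the trade-off noted above.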

github-actions Bot commented Mar 18, 2026

Test Results

   43 files  ±0     43 suites  ±0   2m 56s ⏱️ -6s
1 087 tests ±0  1 084 ✅ ±0  3 💤 ±0  0 ❌ ±0 
1 170 runs  ±0  1 167 ✅ ±0  3 💤 ±0  0 ❌ ±0 

Results for commit ec96af6. ± Comparison against base commit 4142cfe.

This pull request removes 5 and adds 4 tests. Note that renamed tests count towards both.

no.unit.nva.indexingclient.models.IndexDocumentTest ‑ should throw exception when validating and missing mandatory fields:IndexDocument[consumptionAttributes=EventConsumptionAttributes[index=3AMEVQYPqqsBTCgMiOR, documentIdentifier=null], resource={"Uc2RzeqZYExe3s":{"hT2bDo3A47":"bqAbES2tGd","bqmQ6VWdjv":"9Jt4tbtZr7OhcHgH","IHhaFEVuAX":"083lBNqbZDBSqNMnK","T0InOgLaCO7t":"ndReYXqhp5l","5MHTumVFubFIV0P":"QPT1UW62hQGsWZ24Dbn"},"ehnfnlYT58DxzABu":{"y86lf5zkFALcg":"HcuOzjrIZXPZZZng","620Y5OY8eNVfTtXo3sX":"k4lwg7zkzGIVb2","3qXBn8TBJYrkz":"mpZHTSxtcBBaYJR","mFU5TTiPLnGjwM":"okfboXzpaMv","On4Yia0kKuRF":"3hZwlbGbJ5OSn77Z"},"b0RFwE6l22O…
no.unit.nva.indexingclient.models.IndexDocumentTest ‑ should throw exception when validating and missing mandatory fields:IndexDocument[consumptionAttributes=EventConsumptionAttributes[index=null, documentIdentifier=019d23ca1dad-900d7350-a10e-40c4-848f-db2903889f49], resource={"xc7Usq9uqA2hq":{"blgv7do05TxZK0QmTCJ":"eDEfZd4WNPDF2","c0anbyQu5svXQNI6":"S2zDkWEjcfKZ","epXPefiaNSgom8ql":"tuluMnsgATgSl2OGXL9","IwIv0VCzet5i":"pzQEr16YyqbEB","YTiozHxd29Jw66":"BgVE7fR6dA"},"U7lpr5pYmq2l5p6u4":{"43mz7aCBoT":"rSuWAXPRRXDdaiHXUDg","1PBWhBLl0SHd26e":"HpVH55BN54LSlvHKC0u","hCmutR43sZbz2gy2PYB":"ZnKnRnhpX4dd","24xaHtNFGv7e":"oLuP35d8FdUhzD…
no.unit.nva.search.resource.ResourceClientAllScientificValuesTest ‑ [1] { "onlineIssn": "1903-6523" }
no.unit.nva.search.resource.ResourceClientAllScientificValuesTest ‑ [2] { "printIssn": "1903-6523" }
no.unit.nva.indexingclient.models.IndexDocumentTest ‑ should throw exception when validating and missing mandatory fields:IndexDocument[consumptionAttributes=EventConsumptionAttributes[index=QcAPENmUG3EO8, documentIdentifier=null], resource={"1nTL60Quy3u":{"YdWdpSNIpirpbe":"7gycSSe1aBTABKNpA","kZ2uR7rycJc":"WO2rU3Faj18R","EjtVUCRC4Llf7AwSb":"95QeE5yiBgqIsK","GCYe1rvdfc":"gTCSzG1bwB","78rc2F00FQb":"S16XRbB7uUoPn"},"InXhobIVXcMwsLLtMb":{"bv6jFPkBeZO0lEW":"xcQPymcehMxCC5b1","UW7bIIGxIpcXsCy":"qAn1OoWzyD1ccaEfL","oaeJ1zbeqx5A0HCPfJ":"cxDaEBDjYIu1onUje6r","6nStzs7sFG":"IaleNq7ceYr70Kq8o","MsoBC4yDy0P0W":"rBX7nmGIZqeatArfRJ3"},"cB4…
no.unit.nva.indexingclient.models.IndexDocumentTest ‑ should throw exception when validating and missing mandatory fields:IndexDocument[consumptionAttributes=EventConsumptionAttributes[index=null, documentIdentifier=019d2e550320-a50a78a9-0f8e-492d-88aa-5fc100aec732], resource={"oPK8ZZpWBtBAdS9apCx":{"S1pi7bMITrb":"1A73myvSNqj4M1RPlQ3","8VKJGcSyRws0PNK":"4rUDZFlCMEMAxn7b","hjQ07vCnWPBd7mQ":"0xgTDkSCPLwc","YrWaYk3soZW":"XF7XK7NbZJI6fCr","g8EYLzE1PswT":"7aeg7SOU8ce1vX2b9"},"WYXNgRJqkMoHxXs5e4":{"Do1aHRss49Md":"8QNAsUafxJxxvN1Cef","AyRbIZeM8Av":"YwPllPMdVXFcAA77YAd","I6FFdj1HdH4fvIEtw":"VAP3uD34EB","MqntekwHhYe9kV4":"oPhWD5QneZmJ…
no.unit.nva.search.resource.ResourceClientAllScientificValuesTest ‑ [1] { "onlineIssn": "1903-6523" }

no.unit.nva.search.resource.ResourceClientAllScientificValuesTest ‑ [2] { "printIssn": "1903-6523" }

♻️ This comment has been updated with latest results.

LarsV123 marked this pull request as ready for review March 19, 2026 09:46

Comment on lines +99 to +115
private static final Map<String, Float> DEFAULT_SEARCH_ALL_FIELDS =
    Map.ofEntries(
        Map.entry(ENTITY_DESCRIPTION_MAIN_TITLE, PI),
        Map.entry(ENTITY_ABSTRACT, 1F),
        Map.entry(ENTITY_TAGS, 1F),
        Map.entry(IDENTIFIER_KEYWORD, 1F),
        Map.entry(jsonPath(PUBLISHER, ID), 1F),
        Map.entry(jsonPath(ENTITY_DESCRIPTION, REFERENCE, DOI, KEYWORD), 1F),
        Map.entry(jsonPath(DOI, KEYWORD), 1F),
        Map.entry(jsonPath(ENTITY_CONTRIBUTORS, IDENTITY, NAME, KEYWORD), 1F),
        Map.entry(jsonPath(ENTITY_PUBLICATION_CONTEXT, NAME, KEYWORD), 1F),
        Map.entry(jsonPath(ENTITY_PUBLICATION_CONTEXT, PUBLISHER, NAME, KEYWORD), 1F),
        Map.entry(jsonPath(CONTRIBUTORS_AFFILIATION_LABELS, BOKMAAL_CODE), 1F),
        Map.entry(jsonPath(CONTRIBUTORS_AFFILIATION_LABELS, ENGLISH_CODE), 1F),
        Map.entry(jsonPath(CONTRIBUTORS_AFFILIATION_LABELS, NYNORSK_CODE), 1F),
        Map.entry(FUNDING_SOURCE_IDENTIFIER, 1F),
        Map.entry(FUNDINGS_IDENTIFIER, 1F));

Contributor

This should be discussed with someone I think. It excludes many fields that are supported by free text search today.

Contributor Author

Which fields?

Contributor

series, journals, projects and more.

joachimjorgensen Mar 27, 2026

Contributor

I'll set up a meeting with the POs so they can define which fields they want searchable in the query.

LarsV123 marked this pull request as draft April 9, 2026 06:45