
fix(search): column bulk operations search not returning results at scale #27216

Open
sonika-shah wants to merge 13 commits into main from fix/column-bulk-search-at-scale

Collaborator

@sonika-shah sonika-shah commented Apr 9, 2026

Fixes #27227

Summary

  • Bug: Searching for a column name (e.g., "MAT", "MATNR") in Column Bulk Operations returned 0 results when tables have 20000+ columns
  • Root cause: Composite aggregation returns ALL column names from matching documents, then Java post-filters. With page size 25, the first composite page rarely contained matching column names — they were hidden further in the alphabet
  • Fix: When columnNamePattern is set, switch from composite aggregation to terms aggregation with include regex — ES/OS filters at the aggregation level, so only matching column names produce buckets
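
For illustration only (this is not code from the PR): a toy Java model of the root cause above. The synthetic column names and the class and method names are assumptions made up for the sketch. It contrasts "take a page of all names, then filter" with "filter, then take a page".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Toy model only: contrasts "page, then filter" with "filter, then page". */
final class PostFilterDemo {

  /** Synthetic sorted names: COL_00000 .. COL_(n-1), plus MAT and MATNR. */
  static List<String> syntheticColumns(int n) {
    List<String> names = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      names.add(String.format(Locale.ROOT, "COL_%05d", i));
    }
    names.add("MAT");
    names.add("MATNR");
    names.sort(String::compareTo);
    return names;
  }

  /** Old shape: one page of ALL names is returned, then filtered in Java. */
  static List<String> pageThenFilter(List<String> sorted, int pageSize, String needle) {
    return sorted.stream()
        .limit(pageSize)
        .filter(name -> name.contains(needle))
        .collect(Collectors.toList());
  }

  /** New shape: only matching names exist before the page is cut. */
  static List<String> filterThenPage(List<String> sorted, int pageSize, String needle) {
    return sorted.stream()
        .filter(name -> name.contains(needle))
        .limit(pageSize)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> names = syntheticColumns(20_000);
    System.out.println(pageThenFilter(names, 25, "MAT")); // prints []
    System.out.println(filterThenPage(names, 25, "MAT")); // prints [MAT, MATNR]
  }
}
```

With 20,000 names sorting ahead of "MAT", the first page of 25 contains no match, which is the 0-results symptom described in the bug.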

How it works: Two-phase terms aggregation

  1. Phase 1 — Names query (lightweight, no sub-aggs): terms agg with include regex, size=10000, ordered by _key asc → returns all matching column names + total count in a single fast query
  2. Java pagination: Sort all matching names, slice the requested page (offset-based cursor)
  3. Phase 2 — Data query (targeted): terms agg with include = exact page names + top_hits → fetches full entity data for only the page-size names on the current page
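
A minimal sketch of the Java pagination step between the two phases, assuming the Phase 1 names are already in memory. The class and method names here are hypothetical, not the ones used in the aggregators.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: offset-based paging over the Phase 1 name list. */
final class NamePager {

  /** Sorts a copy of the matching names and returns the requested page. */
  static List<String> slice(List<String> matchingNames, int offset, int pageSize) {
    List<String> sorted = new ArrayList<>(matchingNames);
    Collections.sort(sorted);
    // Clamp both ends so an out-of-range offset yields an empty page.
    int from = Math.min(Math.max(offset, 0), sorted.size());
    int to = (int) Math.min((long) from + Math.max(pageSize, 0), sorted.size());
    return new ArrayList<>(sorted.subList(from, to));
  }

  /** Returns the next offset, or -1 when the current page is the last one. */
  static int nextOffset(int offset, int pageSize, int total) {
    long next = (long) Math.max(offset, 0) + Math.max(pageSize, 0);
    return next < total ? (int) next : -1;
  }

  public static void main(String[] args) {
    List<String> names = List.of("MATNR", "MAT", "FORMAT", "MATKL");
    System.out.println(slice(names, 0, 2)); // prints [FORMAT, MAT]
    System.out.println(slice(names, 2, 2)); // prints [MATKL, MATNR]
  }
}
```

Only the names in the returned slice would then be passed to the Phase 2 data query as exact include values.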

Why terms agg include(regex) works even with flat objects (columns are not nested):

  • Terms agg scans the global ordinals dictionary — a pre-built sorted list of every unique value in the field across the entire index
  • include(regex) tests each ordinal independently against the regex — it doesn't matter that multiple values came from the same document
  • Non-matching ordinals never allocate a bucket, never scan documents — zero cost
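
A rough sketch of how a case-insensitive include regex could be built for this purpose. This is an assumption about the shape of toCaseInsensitiveRegex(), not the PR's actual implementation. Lucene regular expressions match the whole term, so the pattern is wrapped in .* on both sides to get "contains" semantics. The reserved-character set below follows the Elasticsearch regexp syntax documentation.

```java
/** Sketch only: builds a Lucene-style, case-insensitive "contains" regex. */
final class LuceneRegexSketch {

  // Reserved characters in Lucene regexp syntax, including optional operators.
  private static final String RESERVED = ".?+*|{}[]()\"\\#@&<>~";

  static String toCaseInsensitiveRegex(String pattern) {
    if (pattern == null || pattern.isEmpty()) {
      return ".*"; // assumption: empty input matches every term
    }
    StringBuilder sb = new StringBuilder(".*");
    for (char c : pattern.toCharArray()) {
      char lower = Character.toLowerCase(c);
      char upper = Character.toUpperCase(c);
      if (Character.isLetter(c) && lower != upper) {
        // Letters become a two-character class, e.g. m -> [mM].
        sb.append('[').append(lower).append(upper).append(']');
      } else if (RESERVED.indexOf(c) >= 0) {
        // Reserved characters are escaped so they match literally.
        sb.append('\\').append(c);
      } else {
        sb.append(c);
      }
    }
    return sb.append(".*").toString();
  }

  public static void main(String[] args) {
    System.out.println(toCaseInsensitiveRegex("mat"));   // prints .*[mM][aA][tT].*
    System.out.println(toCaseInsensitiveRegex("COL_1")); // prints .*[cC][oO][lL]_1.*
  }
}
```

The resulting string would be passed as the include value of the terms aggregation. The sketch checks only the generated string, since the pattern is evaluated at runtime by Lucene rather than by java.util.regex.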

Non-search path (no columnNamePattern, no tag filter): Unchanged — still uses composite aggregation with engine-side after_key cursor pagination.

Tag/glossary filter path: Consolidated to a single _source fetch. Previously this was a two-query Phase-1/Phase-2 pattern (find entityFQN#columnName pairs, then re-fetch). Because the flat-object mapping requires reading _source to determine which specific column carries the tag, full column metadata is now extracted in the same pass, which eliminates the second query.

Approaches considered and rejected

1. Composite agg + Java post-filter (previous approach — the bug)

  • Composite returns ALL column names from matching documents, Java filters with String.contains() after
  • Why it fails: With 20,000+ columns, page size 25, matching names hidden deep in alphabet → first pages always return 0 matches

2. Composite agg with query-level regexp filter

  • Add regexp query on columns.name.keyword to pre-filter documents before aggregation
  • Rejected: Filters documents (tables), not column values. Since columns are flat objects (not nested), a table with 1 matching + 500 non-matching columns still returns all 501 column names in composite buckets
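
To make the flat-object problem concrete, here is a small in-memory Java model (hypothetical names, no search engine involved). A document-level filter keeps or drops whole tables, so every column name of a surviving table still becomes a bucket key.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of a query-level filter followed by bucketing on a flat array field. */
final class FlatObjectDemo {

  /** tables maps a table name to its column names (a flat, non-nested array). */
  static Set<String> bucketKeysAfterDocFilter(Map<String, List<String>> tables, String needle) {
    Set<String> keys = new TreeSet<>();
    for (List<String> columns : tables.values()) {
      // The filter decides per document: does ANY column match?
      boolean docMatches = columns.stream().anyMatch(c -> c.contains(needle));
      if (docMatches) {
        // Bucketing then sees ALL values of the surviving document.
        keys.addAll(columns);
      }
    }
    return keys;
  }

  public static void main(String[] args) {
    Map<String, List<String>> tables = new LinkedHashMap<>();
    tables.put("orders", List.of("MATNR", "AMOUNT", "ID"));
    tables.put("users", List.of("ID", "EMAIL"));
    System.out.println(bucketKeysAfterDocFilter(tables, "MAT")); // prints [AMOUNT, ID, MATNR]
  }
}
```

The users table is dropped, but AMOUNT and ID from orders still appear as keys even though neither matches the pattern.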

3. Composite agg + filter sub-agg + bucket_selector (elastic/elasticsearch#29079)

  • Use filter sub-aggregation + bucket_selector pipeline agg to drop non-matching buckets
  • Rejected:
    • bucket_selector is officially unsupported with composite (ES docs)
    • Columns are flat objects, not nested — filter sub-agg operates on documents, can't isolate which array value corresponds to which bucket key
    • Still creates ALL buckets first then prunes — pages can come back empty

4. Composite agg with runtime field + conditional emit()

  • Runtime field script filters values via conditional emit(), composite paginates with after_key
  • Rejected: Requires OpenSearch 2.14+ (we support 2.6+). Also significant performance penalty — disables global ordinals and early termination optimizations

5. Terms agg include(regex) + exclude(array) for pagination

  • First request gets 10,000 names with include(regex). Next request adds exclude([...previously seen names...]) to get the next batch
  • Rejected: Mixing regex-based include with array-based exclude is not supported on OpenSearch. Feature was added in ES 7.11 (elastic/elasticsearch#63325), but OpenSearch forked from ES 7.10.2 — before this was merged

6. Terms agg include(partition/num_partitions) + query-level regexp

  • Move regex to query level, use hash-based partitioning for pagination
  • Rejected: partition and regex share the same include parameter — mutually exclusive. And query-level regexp has the same flat-object problem as Approach 2

7. Composite agg with include/exclude on terms source

  • The ideal solution — composite's native pagination + regex filtering
  • Does not exist: Requested in elastic/elasticsearch#50368, closed Feb 2024 as "not planned". Elastic's focus shifted to ESQL

Why 10,000 cap on matching names

  • Terms agg requires an upfront size — there is no cursor/pagination mechanism
  • partition/num_partitions can't be combined with include(regex) (same field)
  • No cross-platform (ES + OpenSearch) way to paginate a terms agg with regex filtering
  • 10,000 unique column names matching a search pattern is an extreme edge case for a search feature
  • Phase 1 is lightweight (just string keys, no document data) so 10,000 buckets is cheap
  • Logged at WARN when the cap is hit so operators can detect it
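
A sketch of what the cap check might look like. The constant name comes from this PR, but the helper methods are hypothetical, and java.util.logging is used here only to keep the example self-contained (the service's own logger is not shown in this description).

```java
import java.util.logging.Logger;

/** Sketch only: detect and log when Phase 1 returns a full 10,000 buckets. */
final class NameCapSketch {

  private static final Logger LOG = Logger.getLogger(NameCapSketch.class.getName());

  static final int MAX_PATTERN_SEARCH_NAMES = 10_000;

  /** A full bucket list means more matching names may exist beyond the cap. */
  static boolean capReached(int returnedBuckets) {
    return returnedBuckets >= MAX_PATTERN_SEARCH_NAMES;
  }

  static void warnIfCapped(String pattern, int returnedBuckets) {
    if (capReached(returnedBuckets)) {
      LOG.warning(
          String.format(
              "Pattern '%s' returned %d column names; results may be truncated at %d",
              pattern, returnedBuckets, MAX_PATTERN_SEARCH_NAMES));
    }
  }

  public static void main(String[] args) {
    warnIfCapped("a", 10_000); // logs a warning
    warnIfCapped("MATNR", 3);  // logs nothing
  }
}
```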

Cursor encoding

  • Browse path: composite agg's native after_key (map of column-name → value), base64-encoded JSON — unchanged from main
  • Search / tag paths: offset-based pagination (Java-side over a sorted in-memory list), encoded as base64 JSON {"searchOffset": N}
  • Both encodings are opaque to the client; cross-format cursors (e.g. browse cursor sent to a search query) gracefully restart at offset 0
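
A minimal sketch of the search-offset cursor, under the assumption that the payload is exactly {"searchOffset": N}. The method names mirror those listed in this PR, but the bodies are illustrative, and a small regex stands in for a JSON library to keep the example dependency-free.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: base64 JSON cursor that falls back to offset 0 on any mismatch. */
final class SearchCursorSketch {

  // Accepts only {"searchOffset": <up to 9 digits>} so parseInt cannot overflow.
  private static final Pattern OFFSET =
      Pattern.compile("^\\{\\s*\"searchOffset\"\\s*:\\s*(\\d{1,9})\\s*\\}$");

  static String encodeSearchOffset(int offset) {
    String json = "{\"searchOffset\":" + offset + "}";
    return Base64.getEncoder().encodeToString(json.getBytes(StandardCharsets.UTF_8));
  }

  /** Returns the decoded offset, or 0 for null, malformed, or foreign cursors. */
  static int decodeSearchOffset(String cursor) {
    if (cursor == null || cursor.isEmpty()) {
      return 0;
    }
    try {
      byte[] raw = Base64.getDecoder().decode(cursor);
      Matcher m = OFFSET.matcher(new String(raw, StandardCharsets.UTF_8).trim());
      return m.matches() ? Integer.parseInt(m.group(1)) : 0;
    } catch (IllegalArgumentException e) {
      return 0; // not valid base64: restart from the first page
    }
  }

  public static void main(String[] args) {
    String cursor = encodeSearchOffset(50);
    System.out.println(decodeSearchOffset(cursor)); // prints 50
  }
}
```

A browse-style cursor (a base64-encoded after_key map) fails the pattern match and decodes to 0, which corresponds to the "restart at offset 0" behavior described above.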

Files changed

| File | Change |
| --- | --- |
| columnGridResponse.json | (no schema change in final form) |
| ColumnAggregator.java | Hoisted shared helpers: toCaseInsensitiveRegex(), encodeSearchOffset / decodeSearchOffset, toIntSaturating, NamesWithCount record, agg-name constants, MAX_PATTERN_SEARCH_NAMES, SAMPLE_DOCS_PER_COLUMN |
| ElasticSearchColumnAggregator.java | New aggregateColumnsWithPattern() (search) and aggregateColumnsWithKnownNames() (tag) paths. Tag path extracts columns from _source in a single pass and uses a case-insensitive TreeMap so taggedColumns.get(name) is O(log N) and preserves case-variant grouping. Shared parseBucketHits() between composite and terms parsing. WARN when tag-fetch hits the 10K result-window cap |
| OpenSearchColumnAggregator.java | Same two-phase terms agg approach for OpenSearch client; same tag-path consolidation; same WARN on 10K caps |
| ColumnAggregatorTest.java | 8 unit tests for toCaseInsensitiveRegex: case insensitivity, digits, underscore, special char escaping, single char, empty input |
| ColumnGridResourceIT.java | 9 new integration tests covering case-insensitive search, regex special chars, exclusion of non-matching columns, pattern + tag combo, pattern + glossary combo, tag-filter pagination consistency, multi-entity-type dedup, scale regression (alphabetically-late match in many-column table). Existing tests refactored to await().untilAsserted(...) to remove flake on slow CI |

Test plan

Automated

  • ColumnAggregatorTest — 8 unit tests for regex generation (all pass)
  • ColumnMetadataGrouperTest — 7 existing tests still pass (no regression)
  • mvn compile — clean build
  • mvn spotless:apply — clean formatting
  • mvn test-compile for integration-tests module — clean
  • New integration tests in ColumnGridResourceIT:
    • test_getColumnGrid_patternSearchIsCaseInsensitive
    • test_getColumnGrid_patternSearchExcludesNonMatching
    • test_getColumnGrid_patternSearchWithSpecialChars
    • test_getColumnGrid_patternPlusTagFilter
    • test_getColumnGrid_patternPlusGlossaryFilter
    • test_getColumnGrid_tagFilterPaginationConsistency
    • test_getColumnGrid_glossaryFilter_onlyReturnsGlossaryOccurrences
    • test_getColumnGrid_patternSearchAcrossEntityTypesDedupesNames
    • test_getColumnGrid_patternSearchFindsAlphabeticallyLateColumn (the actual bug regression — 200-column table, matching column near end of alphabet, single-page response must contain it)

Manual

  • Search "MAT" on a table with 20000+ columns → returns MAT, MATNR results
  • Pagination through search results works correctly
  • Search + tag filter combination works correctly
  • Non-search browsing (no pattern) still works as before

Post-review hardening

  • Case-insensitive tag map: taggedColumns keyed by TreeMap(CASE_INSENSITIVE_ORDER) so case-variant column names ("User" / "user") merge correctly; replaces the O(N×M) nested loop with direct taggedColumns.get(name) lookup
  • Saturating long → int cast: ColumnAggregator.toIntSaturating() guards against overflow when summing doc_count across groups for totalOccurrences
  • Tag-fetch cap visibility: _source fetch is still capped at index.max_result_window (10K), now logged at WARN when totalHits > 10000 so operators see the truncation
  • buildTagFilterQuery is scope-aware: applies service / database / schema / domain / entity-type / columnNamePattern / metadataStatus filters so the _source fetch is scoped to the same data set as the main query
  • decodeSearchOffset logs at DEBUG on parse failure (mirrors decodeCursor); cross-format cursor reuse degrades to "restart at page 1" rather than a silent loop
  • Lint cleanup: replaced Map.class + @SuppressWarnings("unchecked") with TypeReference<>; removed FQN usages of Locale.ROOT / NamedValue in favor of imports
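
For reference, a saturating long-to-int cast can be as small as the sketch below. The method name matches the helper listed above, but the body is an assumption about its behavior, not a copy of the PR's code.

```java
/** Sketch only: clamp a long into the int range instead of wrapping around. */
final class SaturatingCast {

  static int toIntSaturating(long value) {
    if (value > Integer.MAX_VALUE) {
      return Integer.MAX_VALUE;
    }
    if (value < Integer.MIN_VALUE) {
      return Integer.MIN_VALUE;
    }
    return (int) value;
  }

  public static void main(String[] args) {
    long summedDocCount = 3_000_000_000L;
    System.out.println((int) summedDocCount);            // prints -1294967296 (plain cast wraps)
    System.out.println(toIntSaturating(summedDocCount)); // prints 2147483647
  }
}
```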

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 9, 2026 19:46
…cale

When searching by column name pattern (e.g., "MAT") in column bulk
operations, the composite aggregation returned ALL column names from
matching documents, then post-filtered in Java. With 20000+ columns,
the first composite page of 25 names rarely contained matches, so
users saw 0 results.

Switch to terms aggregation with `include` regex when a search pattern
is set. This filters at the ES/OS aggregation level — only matching
column names produce buckets. Two-phase approach: (1) lightweight
names query to get all matching names + accurate total, (2) targeted
data query with top_hits for the current page only.
@sonika-shah sonika-shah force-pushed the fix/column-bulk-search-at-scale branch from a773a85 to 9f3b664 Compare April 9, 2026 19:47
@github-actions github-actions Bot added the backend and safe to test labels Apr 9, 2026
Contributor

Copilot AI left a comment

Pull request overview

Fixes column-name search in Column Bulk Operations for very wide schemas (20k+ columns) by switching the columnNamePattern path from composite aggregation + Java post-filtering to a two-phase terms aggregation that filters bucket keys server-side using an include regexp.

Changes:

  • Added ColumnAggregator.toCaseInsensitiveRegex() to generate a Lucene-compatible, case-insensitive regexp for terms.include.
  • Implemented a pattern-search branch in both Elasticsearch and OpenSearch column aggregators using a two-phase terms aggregation (names query + page data query).
  • Added unit tests for regex generation and edge cases.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| openmetadata-service/src/main/java/org/openmetadata/service/search/ColumnAggregator.java | Adds shared utility to build Lucene-compatible case-insensitive regex for terms include. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java | Adds pattern-search code path using two-phase terms aggregation and offset-based pagination cursor. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java | Mirrors the two-phase terms aggregation approach for OpenSearch and refactors bucket parsing. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/ColumnAggregatorTest.java | Adds unit tests validating regex generation behavior (case handling + escaping). |
Comments suppressed due to low confidence (2)

openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java:68

  • MAX_PATTERN_SEARCH_NAMES is hard-capped at 10,000 for the phase-1 terms aggregation. On large schemas (e.g., 20k+ columns) a broad pattern (like a single character) can easily match >10k distinct column names, which will silently truncate matchingNames, undercount totalUniqueColumns, and prevent users from paging to the missing matches. Consider paging the name collection (e.g., via composite agg with after_key, or partitioning the terms agg) or raising the limit to cover worst-case table sizes and explicitly detecting/tracking truncation when the limit is hit.
  /** Max column names to retrieve in the names-only query during pattern search. */
  private static final int MAX_PATTERN_SEARCH_NAMES = 10000;

  /** Index configuration with field mappings for each entity type. Uses aliases defined in indexMapping.json */
  private static final Map<String, IndexConfig> INDEX_CONFIGS =
      Map.of(
          "table",

openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java:70

  • The phase-1 pattern search uses a terms agg with size=MAX_PATTERN_SEARCH_NAMES (10,000). If the pattern matches more than 10k distinct column names (common on 20k+ column tables for broad patterns), the names list and totalUniqueColumns will be truncated and the remaining matches become unreachable via pagination. Consider implementing a paged name scan (e.g., composite agg with cursor) or otherwise guaranteeing retrieval of all matching names (and/or surfacing a truncation indicator).
  /** Max column names to retrieve in the names-only query during pattern search. */
  private static final int MAX_PATTERN_SEARCH_NAMES = 10000;

  /** Uses aliases defined in indexMapping.json */
  private static final List<String> DATA_ASSET_INDEXES =
      Arrays.asList("table", "dashboardDataModel", "topic", "searchIndex", "container");

@github-actions
Contributor

github-actions Bot commented Apr 9, 2026

🟡 Playwright Results — all passed (20 flaky)

✅ 3953 passed · ❌ 0 failed · 🟡 20 flaky · ⏭️ 86 skipped

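| Shard | Passed | Failed | Flaky | Skipped |
| --- | --- | --- | --- | --- |
| 🟡 Shard 1 | 297 | 0 | 2 | 4 |
| 🟡 Shard 2 | 750 | 0 | 9 | 8 |
| 🟡 Shard 3 | 730 | 0 | 2 | 7 |
| 🟡 Shard 4 | 756 | 0 | 3 | 18 |
| ✅ Shard 5 | 687 | 0 | 0 | 41 |
| 🟡 Shard 6 | 733 | 0 | 4 | 8 |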
🟡 20 flaky test(s) (passed on retry)
  • Pages/AuditLogs.spec.ts › should apply both User and EntityType filters simultaneously (shard 1, 1 retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event is created when description is updated (shard 2, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/ColumnBulkOperations.spec.ts › should update display name and propagate to all occurrences (shard 2, 2 retries)
  • Features/DataProductRenameConsolidation.spec.ts › Rename then change owner - assets should be preserved (shard 2, 1 retry)
  • Features/DomainFilterQueryFilter.spec.ts › Domain filter should persist across page navigation (shard 2, 1 retry)
  • Features/DomainFilterQueryFilter.spec.ts › Domain filter should use exact match and prefix with dot to prevent false positives (shard 2, 1 retry)
  • Features/Glossary/LargeGlossaryPerformance.spec.ts › should handle large number of glossary child term with pagination (shard 2, 1 retry)
  • Features/LandingPageWidgets/DomainWidgetFilter.spec.ts › Setup Domains widget on landing page (shard 2, 2 retries)
  • Features/LandingPageWidgets/DomainWidgetFilter.spec.ts › Domains widget should show only selected domain when domain filter is active (shard 2, 2 retries)
  • Features/Permissions/ServiceEntityPermissions.spec.ts › AutoPilot trigger button is hidden with view-only permission (shard 3, 1 retry)
  • Flow/ExploreDiscovery.spec.ts › Should not display soft deleted assets in search suggestions (shard 3, 1 retry)
  • Pages/DataContractInheritance.spec.ts › Delete Button Disabled - Fully inherited contracts cannot be deleted (shard 4, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Set (shard 4, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Not_Set (shard 4, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/ServiceEntity.spec.ts › Tier Add, Update and Remove (shard 6, 2 retries)
  • Pages/UserDetails.spec.ts › Create team with domain and verify visibility of inherited domain in user profile after team removal (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Copilot AI review requested due to automatic review settings April 13, 2026 06:30
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Copilot AI review requested due to automatic review settings April 26, 2026 13:47
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

gitar-bot Bot commented Apr 26, 2026

Code Review ✅ Approved 4 resolved / 4 findings

Refactored column bulk search to use Lucene-compatible regex and consistent case-insensitive deduplication, addressing discrepancies in unit testing and result lookups. No issues remain.

✅ 4 resolved
Bug: Unit tests validate Java regex, not Lucene/ES regex engine

📄 openmetadata-service/src/test/java/org/openmetadata/service/search/ColumnAggregatorTest.java:30-36
ColumnAggregatorTest uses java.util.regex.Pattern to validate the output of toCaseInsensitiveRegex, but at runtime the regex is executed by Elasticsearch/OpenSearch's Lucene-based regex engine, which has different syntax rules (e.g., no backreferences, different anchoring behavior). While the subset of features used here (character classes, .*, literal escaping) is compatible with both engines, the tests don't guarantee correctness against the actual runtime engine.

Consider adding an integration test (or noting this caveat in the test class) that validates the regex against a real ES/OS instance, especially for edge cases like Unicode characters or less common special characters.

Edge Case: TreeSet case-insensitive dedup vs HashMap case-sensitive lookup

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java:272-273 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java:290-291 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java:200-201 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java:218-219
In aggregateColumnsWithKnownNames (ES lines 272-291, OS lines 200-219), a TreeSet(String.CASE_INSENSITIVE_ORDER) is used to deduplicate column names, but taggedColumns is a regular HashMap with case-sensitive keys. If two documents contribute the same column name with different casing (e.g., "MyCol" vs "mycol"), the TreeSet will keep only one variant. When taggedColumns.get(name) is called with that variant, it will only find entries under the exact matching case key, silently dropping occurrences stored under the other case variant.

In practice this is unlikely (column names from the same logical column usually have consistent casing), but it could cause missing occurrences in edge cases.
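
The mismatch can be reproduced in isolation. The sketch below uses hypothetical names and plain collections (not the aggregator code) to group the same (column name, entity FQN) pairs into a HashMap and into a TreeMap ordered by String.CASE_INSENSITIVE_ORDER.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: case-sensitive vs case-insensitive grouping of column occurrences. */
final class CaseVariantLookup {

  /** Groups {columnName, entityFqn} pairs into the supplied map. */
  static Map<String, List<String>> group(Map<String, List<String>> target, String[][] pairs) {
    for (String[] pair : pairs) {
      target.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
    }
    return target;
  }

  public static void main(String[] args) {
    String[][] pairs = {{"MyCol", "svc.db.t1"}, {"mycol", "svc.db.t2"}};

    Map<String, List<String>> hash = group(new HashMap<>(), pairs);
    Map<String, List<String>> tree = group(new TreeMap<>(String.CASE_INSENSITIVE_ORDER), pairs);

    System.out.println(hash.get("MyCol")); // prints [svc.db.t1]
    System.out.println(tree.get("MyCol")); // prints [svc.db.t1, svc.db.t2]
  }
}
```

With the HashMap, a lookup by one case variant misses the occurrence stored under the other. With the case-insensitive TreeMap, both variants share one entry.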

Bug: totalOccurrences is per-page, not global, in pattern search path

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java:267-268 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java:187
In the new aggregateColumnsWithPattern method, totalOccurrences is computed from only the current page's grid items (e.g., 25 columns), while totalUniqueColumns is computed from phase 1 across ALL matching names. This means the API response has an accurate totalUniqueColumns but an inaccurate (page-local) totalOccurrences that changes as users paginate.

If the UI displays "X total occurrences" alongside the correct unique-columns count, the numbers will appear inconsistent. On page 1 you might see totalOccurrences=50, on page 2 it becomes totalOccurrences=42, etc.

If getting a global total is too expensive, consider documenting in the response or API contract that totalOccurrences is approximate/page-local when a search pattern is active, or omit it entirely for pattern searches.

Edge Case: Terms agg silently truncates at 10000 matching column names

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java:63 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java:664 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java:65 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java:575
The phase-1 names query uses size(MAX_PATTERN_SEARCH_NAMES = 10000) on the terms aggregation. If a broad pattern (e.g., single character "a") matches more than 10,000 distinct column names, the results are silently truncated: totalUniqueColumns will report 10,000, pagination will stop there, and the user won't know results are missing.

Consider adding a check: if the returned bucket count equals MAX_PATTERN_SEARCH_NAMES, either log a warning, return an indicator in the response (e.g., totalUniqueColumns set to -1 or an isTruncated flag), or require a minimum pattern length to avoid overly broad matches.


Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Labels

backend safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Column Bulk Operation Search does not return results from subsequent pages for large column counts

2 participants