fix(search): column bulk operations search not returning results at scale#27216
sonika-shah wants to merge 13 commits into main
…cale When searching by column name pattern (e.g., "MAT") in column bulk operations, the composite aggregation returned ALL column names from matching documents, then post-filtered in Java. With 20000+ columns, the first composite page of 25 names rarely contained matches, so users saw 0 results. Switch to terms aggregation with `include` regex when a search pattern is set. This filters at the ES/OS aggregation level — only matching column names produce buckets. Two-phase approach: (1) lightweight names query to get all matching names + accurate total, (2) targeted data query with top_hits for the current page only.
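The second phase above only needs the names for the current page, so the paging itself can happen over the sorted list returned by the names query. A minimal sketch of that slicing, assuming the matching names are already sorted and held in memory; `PatternPageSketch`, `Page`, and `pageOf` are hypothetical names for illustration, not the PR's code:

```java
import java.util.List;

/** Illustrative only: offset-based paging over the sorted names from a phase-1 names query. */
public final class PatternPageSketch {

  /** One page of names, the total match count, and the next offset (-1 on the last page). */
  public record Page(List<String> names, int total, int nextOffset) {}

  public static Page pageOf(List<String> sortedMatchingNames, int offset, int pageSize) {
    int total = sortedMatchingNames.size();
    // Clamp the offset so a stale or oversized cursor yields an empty page, not an exception.
    int from = Math.max(0, Math.min(offset, total));
    // Compute the end index in long arithmetic to avoid int overflow on huge page sizes.
    int to = (int) Math.min((long) total, (long) from + Math.max(1, pageSize));
    List<String> names = List.copyOf(sortedMatchingNames.subList(from, to));
    int nextOffset = to < total ? to : -1;
    return new Page(names, total, nextOffset);
  }
}
```

Only the names in `Page.names()` would then be passed to the data query, which keeps the `top_hits` work proportional to the page size rather than to the number of matches.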
a773a85 to 9f3b664
Pull request overview
Fixes column-name search in Column Bulk Operations for very wide schemas (20k+ columns) by switching the columnNamePattern path from composite aggregation + Java post-filtering to a two-phase terms aggregation that filters bucket keys server-side using an include regexp.
Changes:
- Added `ColumnAggregator.toCaseInsensitiveRegex()` to generate a Lucene-compatible, case-insensitive regexp for `terms.include`.
- Implemented a pattern-search branch in both Elasticsearch and OpenSearch column aggregators using a two-phase `terms` aggregation (names query + page data query).
- Added unit tests for regex generation and edge cases.
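To make the first bullet concrete, here is one way a case-insensitive, Lucene-style regexp could be built for `terms.include`. This is a sketch under stated assumptions, not the PR's `ColumnAggregator.toCaseInsensitiveRegex()`: it assumes substring matching (hence the `.*` on both sides, since `include` must match the whole term), assumes empty input means "match everything", and uses a hypothetical class name and reserved-character list.

```java
/** Illustrative only: build a case-insensitive regexp without relying on regex flags. */
public final class CaseInsensitiveRegexSketch {

  // Characters treated as reserved here (assumption: Lucene's standard and optional operators).
  private static final String RESERVED = ".?+*|{}[]()\"\\#@&<>~";

  public static String toCaseInsensitiveRegex(String pattern) {
    if (pattern == null || pattern.isEmpty()) {
      return ".*"; // assumption: an empty pattern matches every name
    }
    StringBuilder sb = new StringBuilder(".*");
    for (int i = 0; i < pattern.length(); i++) {
      char c = pattern.charAt(i);
      char lower = Character.toLowerCase(c);
      char upper = Character.toUpperCase(c);
      if (lower != upper) {
        // Cased letter: emit a two-character class such as [mM].
        sb.append('[').append(lower).append(upper).append(']');
      } else if (RESERVED.indexOf(c) >= 0) {
        // Reserved operator: escape it so it matches literally.
        sb.append('\\').append(c);
      } else {
        // Digits, underscore, and other uncased characters pass through unchanged.
        sb.append(c);
      }
    }
    return sb.append(".*").toString();
  }
}
```

Expanding each letter into a character class avoids depending on a case-insensitivity flag being available in every supported engine version.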
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| openmetadata-service/src/main/java/org/openmetadata/service/search/ColumnAggregator.java | Adds shared utility to build Lucene-compatible case-insensitive regex for terms include. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java | Adds pattern-search code path using two-phase terms aggregation and offset-based pagination cursor. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java | Mirrors the two-phase terms aggregation approach for OpenSearch and refactors bucket parsing. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/ColumnAggregatorTest.java | Adds unit tests validating regex generation behavior (case handling + escaping). |
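The offset-based pagination cursor mentioned for the pattern-search path carries a payload of the form `{"searchOffset": N}` according to the PR description. A minimal sketch of encoding and decoding such a cursor follows. It assumes the payload is base64-encoded JSON like the composite cursor and that an unreadable cursor falls back to offset 0 (page 1); the class and method names are hypothetical, and the hand-rolled parsing stands in for whatever JSON library the project uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: round-trip a {"searchOffset": N} cursor through base64. */
public final class SearchOffsetCursorSketch {

  private static final Pattern OFFSET =
      Pattern.compile("^\\{\"searchOffset\":\\s*(\\d{1,10})\\}$");

  public static String encode(int offset) {
    String json = "{\"searchOffset\":" + Math.max(0, offset) + "}";
    return Base64.getEncoder().encodeToString(json.getBytes(StandardCharsets.UTF_8));
  }

  /** Returns the decoded offset, or 0 when the cursor is missing or in another format. */
  public static int decode(String cursor) {
    if (cursor == null || cursor.isEmpty()) {
      return 0;
    }
    try {
      String json = new String(Base64.getDecoder().decode(cursor), StandardCharsets.UTF_8);
      Matcher m = OFFSET.matcher(json);
      return m.matches() ? Integer.parseInt(m.group(1)) : 0;
    } catch (IllegalArgumentException e) {
      // Covers invalid base64 and out-of-range numbers (NumberFormatException is a subclass).
      return 0;
    }
  }
}
```

Falling back to 0 means a composite-style cursor accidentally sent to the search path restarts at the first page instead of failing the request.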
Comments suppressed due to low confidence (2)
openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java:68
`MAX_PATTERN_SEARCH_NAMES` is hard-capped at 10,000 for the phase-1 `terms` aggregation. On large schemas (e.g., 20k+ columns) a broad pattern (like a single character) can easily match >10k distinct column names, which will silently truncate `matchingNames`, undercount `totalUniqueColumns`, and prevent users from paging to the missing matches. Consider paging the name collection (e.g., via composite agg with `after_key`, or partitioning the `terms` agg) or raising the limit to cover worst-case table sizes and explicitly detecting/tracking truncation when the limit is hit.
/** Max column names to retrieve in the names-only query during pattern search. */
private static final int MAX_PATTERN_SEARCH_NAMES = 10000;
/** Index configuration with field mappings for each entity type. Uses aliases defined in indexMapping.json */
private static final Map<String, IndexConfig> INDEX_CONFIGS =
Map.of(
"table",
openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java:70
- The phase-1 pattern search uses a `terms` agg with `size=MAX_PATTERN_SEARCH_NAMES` (10,000). If the pattern matches more than 10k distinct column names (common on 20k+ column tables for broad patterns), the names list and `totalUniqueColumns` will be truncated and the remaining matches become unreachable via pagination. Consider implementing a paged name scan (e.g., composite agg with cursor) or otherwise guaranteeing retrieval of all matching names (and/or surfacing a truncation indicator).
/** Max column names to retrieve in the names-only query during pattern search. */
private static final int MAX_PATTERN_SEARCH_NAMES = 10000;
/** Uses aliases defined in indexMapping.json */
private static final List<String> DATA_ASSET_INDEXES =
Arrays.asList("table", "dashboardDataModel", "topic", "searchIndex", "container");
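Both suppressed comments suggest surfacing truncation when the names query hits its cap. One simple way to do that is sketched below, as an assumption rather than anything the PR implements: treat a result whose bucket count reaches the limit as possibly truncated and carry that flag alongside the names. `TruncationCheckSketch`, `NamesResult`, and `fromBuckets` are hypothetical names.

```java
import java.util.List;

/** Illustrative only: flag a names result that may have been cut off at the size limit. */
public final class TruncationCheckSketch {

  /** Names returned by the phase-1 query plus a conservative truncation flag. */
  public record NamesResult(List<String> names, boolean possiblyTruncated) {}

  public static NamesResult fromBuckets(List<String> bucketKeys, int limit) {
    // Reaching the limit does not prove truncation (there may be exactly `limit` matches),
    // so the flag is deliberately named "possibly" and errs on the side of warning.
    boolean possiblyTruncated = bucketKeys.size() >= limit;
    return new NamesResult(List.copyOf(bucketKeys), possiblyTruncated);
  }
}
```

A caller could log the flag or return it in the response so that users know a narrower pattern may reveal more matches.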
🟡 Playwright Results: all passed (20 flaky)
✅ 3953 passed · ❌ 0 failed · 🟡 20 flaky · ⏭️ 86 skipped
🟡 20 flaky test(s) (passed on retry)
How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip # view trace
Code Review ✅ Approved: 4 resolved / 4 findings
Refactored column bulk search to use Lucene-compatible regex and consistent case-insensitive deduplication, addressing discrepancies in unit testing and result lookups. No issues remain.
✅ 4 resolved
✅ Bug: Unit tests validate Java regex, not Lucene/ES regex engine
✅ Edge Case: TreeSet case-insensitive dedup vs HashMap case-sensitive lookup
✅ Bug:
Fixes #27227
Summary
- When `columnNamePattern` is set, switch from composite aggregation to terms aggregation with `include` regex. ES/OS filters at the aggregation level, so only matching column names produce buckets.
How it works: Two-phase terms aggregation
1. `terms` agg with `include` regex, `size=10000`, ordered by `_key` asc → returns all matching column names + total count in a single fast query
2. `terms` agg with `include` = exact page names + `top_hits` → fetches full entity data for only the page-size names on the current page
Why terms agg `include(regex)` works even with flat objects (columns are not nested):
- `include(regex)` tests each ordinal independently against the regex, so it doesn't matter that multiple values came from the same document
Non-search path (no `columnNamePattern`, no tag filter): Unchanged. Still uses composite aggregation with engine-side `after_key` cursor pagination.
Tag/glossary filter path: Consolidated to a single `_source` fetch. Previously this was a two-query Phase-1/Phase-2 pattern (find `entityFQN#columnName` pairs, then re-fetch). Since flat-object mapping requires reading `_source` to determine which specific column has the tag, we extract full column metadata in the same pass, eliminating the second query.
Approaches considered and rejected
1. Composite agg + Java post-filter (previous approach, the bug)
- Composite aggregation returns all column names; Java filters with `String.contains()` after
2. Composite agg with query-level `regexp` filter
- `regexp` query on `columns.name.keyword` to pre-filter documents before aggregation
3. Composite agg + filter sub-agg + `bucket_selector` (elastic/elasticsearch#29079)
- `bucket_selector` pipeline agg to drop non-matching buckets
- `bucket_selector` is officially unsupported with composite (ES docs)
4. Composite agg with runtime field + conditional `emit()`
- `emit()`, composite paginates with `after_key`
5. Terms agg `include(regex)` + `exclude(array)` for pagination
- `include(regex)`. Next request adds `exclude([...previously seen names...])` to get the next batch
- `include` with array-based `exclude` is not supported on OpenSearch. Feature was added in ES 7.11 (elastic/elasticsearch#63325), but OpenSearch forked from ES 7.10.2, before this was merged
6. Terms agg `include(partition/num_partitions)` + query-level `regexp`
- `partition` and `regex` share the same `include` parameter, so they are mutually exclusive. And query-level regexp has the same flat-object problem as Approach 2
7. Composite agg with `include`/`exclude` on terms source
Why 10,000 cap on matching names
- Terms agg results are bounded by `size`; there is no cursor/pagination mechanism
- `partition`/`num_partitions` can't be combined with `include(regex)` (same field)
Cursor encoding
- Non-search path: `after_key` (map of column-name → value), base64-encoded JSON, unchanged from main
- Pattern-search path: `{"searchOffset": N}`
Files changed
| File | Changes |
|---|---|
| columnGridResponse.json | |
| ColumnAggregator.java | `toCaseInsensitiveRegex()`, `encodeSearchOffset`/`decodeSearchOffset`, `toIntSaturating`, `NamesWithCount` record, agg-name constants, `MAX_PATTERN_SEARCH_NAMES`, `SAMPLE_DOCS_PER_COLUMN` |
| ElasticSearchColumnAggregator.java | `aggregateColumnsWithPattern()` (search) and `aggregateColumnsWithKnownNames()` (tag) paths. Tag path extracts columns from `_source` in a single pass and uses a case-insensitive `TreeMap` so `taggedColumns.get(name)` is O(log N) and preserves case-variant grouping. Shared `parseBucketHits()` between composite and terms parsing. WARN when tag-fetch hits the 10K result-window cap |
| OpenSearchColumnAggregator.java | |
| ColumnAggregatorTest.java | `toCaseInsensitiveRegex`: case insensitivity, digits, underscore, special char escaping, single char, empty input |
| ColumnGridResourceIT.java | `await().untilAsserted(...)` to remove flake on slow CI |
Test plan
Automated
- `ColumnAggregatorTest`: 8 unit tests for regex generation (all pass)
- `ColumnMetadataGrouperTest`: 7 existing tests still pass (no regression)
- `mvn compile`: clean build
- `mvn spotless:apply`: clean formatting
- `mvn test-compile` for integration-tests module: clean
- `ColumnGridResourceIT`:
  - `test_getColumnGrid_patternSearchIsCaseInsensitive`
  - `test_getColumnGrid_patternSearchExcludesNonMatching`
  - `test_getColumnGrid_patternSearchWithSpecialChars`
  - `test_getColumnGrid_patternPlusTagFilter`
  - `test_getColumnGrid_patternPlusGlossaryFilter`
  - `test_getColumnGrid_tagFilterPaginationConsistency`
  - `test_getColumnGrid_glossaryFilter_onlyReturnsGlossaryOccurrences`
  - `test_getColumnGrid_patternSearchAcrossEntityTypesDedupesNames`
  - `test_getColumnGrid_patternSearchFindsAlphabeticallyLateColumn` (the actual bug regression: 200-column table, matching column near end of alphabet, single-page response must contain it)
Manual
Post-review hardening
- `taggedColumns` keyed by `TreeMap(CASE_INSENSITIVE_ORDER)` so case-variant column names ("User" / "user") merge correctly; replaces the O(N×M) nested loop with direct `taggedColumns.get(name)` lookup
- `ColumnAggregator.toIntSaturating()` guards against overflow when summing `doc_count` across groups for `totalOccurrences`
- `_source` fetch is still capped at `index.max_result_window` (10K), now logged at WARN when `totalHits > 10000` so operators see the truncation
- `buildTagFilterQuery` is scope-aware: applies service / database / schema / domain / entity-type / columnNamePattern / metadataStatus filters so the `_source` fetch is scoped to the same data set as the main query
- `decodeSearchOffset` logs at DEBUG on parse failure (mirrors `decodeCursor`); cross-format cursor reuse degrades to "restart at page 1" rather than silent loop
- Replaced `Map.class` + `@SuppressWarnings("unchecked")` with `TypeReference<>`; removed FQN usages of `Locale.ROOT` / `NamedValue` in favor of imports
🤖 Generated with Claude Code
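Two of the hardening items above, case-insensitive keying and saturating conversion of summed counts, can be illustrated together. This is a sketch under assumptions, not the PR's code: `HardeningSketch` and `totalOccurrences` are hypothetical, and the input is modelled as plain name/count pairs rather than aggregation buckets.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: merge case-variant names and clamp summed counts into int range. */
public final class HardeningSketch {

  /** Narrows a long to int, clamping instead of wrapping around on overflow. */
  public static int toIntSaturating(long value) {
    if (value > Integer.MAX_VALUE) {
      return Integer.MAX_VALUE;
    }
    if (value < Integer.MIN_VALUE) {
      return Integer.MIN_VALUE;
    }
    return (int) value;
  }

  /** Sums counts per column name, treating "User" and "user" as the same key. */
  public static Map<String, Integer> totalOccurrences(List<Map.Entry<String, Long>> groups) {
    Map<String, Long> sums = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    for (Map.Entry<String, Long> group : groups) {
      sums.merge(group.getKey(), group.getValue(), Long::sum);
    }
    Map<String, Integer> totals = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    sums.forEach((name, sum) -> totals.put(name, toIntSaturating(sum)));
    return totals;
  }
}
```

Because the map uses `String.CASE_INSENSITIVE_ORDER`, a lookup such as `totals.get("USER")` finds the merged entry regardless of which spelling was inserted first.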