feat: native Elasticsearch vector search support #27111

Open
joaopamaral wants to merge 18 commits into open-metadata:main from Automattic:feat/es-elasticsearch-vector-search-upstream

@joaopamaral joaopamaral commented Apr 7, 2026

Summary

Adds native Elasticsearch 8.x/9.x vector search support, mirroring the existing OpenSearch implementation. OpenMetadata deployments backed by Elasticsearch can now use the same semantic/vector search features as OpenSearch deployments.

Changes

  • ElasticSearchVectorService (new): ES implementation of VectorIndexService, using Rest5Client for generic HTTP requests. Mirrors OpenSearchVectorService structure.
  • vector_search_index_es_native.json (new, en/jp/ru/zh): ES-native index mappings using dense_vector / dims / cosine similarity (ES 8.x/9.x format, as opposed to OpenSearch's knn_vector / dimension / HNSW).
  • VectorSearchQueryBuilder.buildNativeESQuery(): emits the ES 8.x/9.x top-level knn query format (distinct from OpenSearch's nested query.knn). Reference: https://www.elastic.co/docs/solutions/search/vector/knn
  • SemanticSearchQueryBuilder for Elasticsearch package: mirrors the OpenSearch equivalent.
  • ElasticSearchIndexManager.extractMappingsJson(): extracts the mappings sub-object before calling putMapping — ES rejects full index JSON (with settings/aliases) at the mappings API.
  • reformatVectorIndexWithDimension(): handles both "dims" (ES native) and "dimension" (OpenSearch) keys so embedding dimension injection works for both backends.
  • SearchRepository / ElasticSearchBulkSink: wired to initialize and use ElasticSearchVectorService when ES backend is configured.
  • Tests: VectorSearchQueryBuilderTest, ElasticSearchIndexManagerTest, and new ElasticSearchVectorServiceTest.
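
The difference between the two query shapes named in the list above can be sketched with plain maps. This is only an illustration, not the output of buildNativeESQuery(): the class name KnnQueryShapes, the field name "embedding", and the k / num_candidates values are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class KnnQueryShapes {
  // Hypothetical field name; the PR's index templates may use a different one.
  static final String FIELD = "embedding";

  /** Elasticsearch 8.x/9.x: "knn" is a top-level key of the search body. */
  static Map<String, Object> elasticsearchNative(List<Float> vector, int k, int numCandidates) {
    Map<String, Object> knn = new LinkedHashMap<>();
    knn.put("field", FIELD);
    knn.put("query_vector", vector);
    knn.put("k", k);
    knn.put("num_candidates", numCandidates);
    Map<String, Object> body = new LinkedHashMap<>();
    body.put("knn", knn);
    return body;
  }

  /** OpenSearch: "knn" is nested under "query" and keyed by the field name. */
  static Map<String, Object> openSearch(List<Float> vector, int k) {
    Map<String, Object> params = new LinkedHashMap<>();
    params.put("vector", vector);
    params.put("k", k);
    Map<String, Object> body = new LinkedHashMap<>();
    body.put("query", Map.of("knn", Map.of(FIELD, params)));
    return body;
  }

  public static void main(String[] args) {
    List<Float> vector = List.of(0.1f, 0.2f);
    System.out.println(elasticsearchNative(vector, 10, 100));
    System.out.println(openSearch(vector, 10));
  }
}
```

Because the two bodies differ in structure and not only in key names, a separate builder method per backend is easier to follow than one method with conditionals.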

Compatibility

  • OpenSearch path is unchanged — OpenSearchBulkSink / OpenSearchVectorService untouched.
  • No schema changes required for existing deployments.

Test plan

  • mvn test -pl openmetadata-service -Dtest=VectorSearchQueryBuilderTest,ElasticSearchIndexManagerTest,ElasticSearchVectorServiceTest
  • Integration: configure embeddingProvider in elasticSearchConfiguration, run Search Index app against an ES 8.x/9.x cluster, verify vector index is created and knn search returns results
  • Confirm OpenSearch deployments unaffected

References

🤖 Generated with Claude Code


Summary by Gitar

  • Test coverage:
    • Added unit tests for EsUtils.enrichIndexMappingForElasticsearch covering null input, missing fingerprints, and successful vector dimension injection.
    • Added readIndexMappingReturnsMappingForKnownIndex to SearchRepositoryBehaviorTest to verify correct index mapping retrieval.


@github-actions
Contributor

github-actions Bot commented Apr 7, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@joaopamaral
Author

joaopamaral commented Apr 7, 2026

Initial results look good, but so far I have only tested with ES 9.x against version 1.12.4 (not main). I also need to double-check whether OpenSearch is affected by this change, and to review some AI-resolved merge conflicts between 1.12.4 and main.

@harshach
Collaborator

harshach commented Apr 7, 2026

Thanks @joaopamaral, this is great! Can you make it ready for review and also address the comments here: #27111 (comment)?

@joaopamaral
Author

Sure @harshach! I'll work on the bot review first before making it ready for review! 👍

@joaopamaral joaopamaral marked this pull request as ready for review April 10, 2026 19:41
Copilot AI review requested due to automatic review settings April 10, 2026 19:41
@joaopamaral
Author

Hi @harshach, I’ve addressed the bot review, but I still need to re-review the code after rebasing/merging with main and rerun the tests against a real server. So far, I’ve tested this PR with version 1.12.4 and ES 9.3.1. I still need to validate that everything continues to work correctly with OpenSearch and ES 8.x.

I won’t be able to run tests for the next couple of days, but feel free to proceed with any testing on your side in the meantime.

Contributor

Copilot AI left a comment

Pull request overview

This PR adds native Elasticsearch (8.x/9.x) vector search support to OpenMetadata, aiming to provide semantic/vector search capabilities on Elasticsearch deployments comparable to the existing OpenSearch implementation.

Changes:

  • Added a new ElasticSearchVectorService plus wiring in SearchRepository / ElasticSearchBulkSink to initialize and use it when Elasticsearch is the configured backend.
  • Introduced ES-native vector index mapping templates (vector_search_index_es_native.json) and extended query-building to emit Elasticsearch’s top-level knn query format.
  • Added/updated tests around the ES-native query format and Elasticsearch vector service behavior.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 10 comments.

File Description
openmetadata-spec/src/main/resources/json/schema/search/searchRequest.json Adds semanticSearch flag to the search request schema.
openmetadata-spec/src/main/resources/elasticsearch/en/vector_search_index_es_native.json New ES-native vector index template (en).
openmetadata-spec/src/main/resources/elasticsearch/jp/vector_search_index_es_native.json New ES-native vector index template (jp).
openmetadata-spec/src/main/resources/elasticsearch/ru/vector_search_index_es_native.json New ES-native vector index template (ru).
openmetadata-spec/src/main/resources/elasticsearch/zh/vector_search_index_es_native.json New ES-native vector index template (zh).
openmetadata-service/src/main/java/org/openmetadata/service/search/vector/VectorSearchQueryBuilder.java Adds buildNativeESQuery and refactors filter emission for vector search queries.
openmetadata-service/src/test/java/org/openmetadata/service/search/vector/VectorSearchQueryBuilderTest.java Adds coverage for ES-native top-level knn query structure and filter behavior.
openmetadata-service/src/main/java/org/openmetadata/service/search/vector/VectorIndexService.java Extends vector service interface and adds an alias helper.
openmetadata-service/src/main/java/org/openmetadata/service/search/vector/OpenSearchVectorService.java Adjusts to use the new interface default alias method and annotates overrides.
openmetadata-service/src/main/java/org/openmetadata/service/search/vector/ElasticSearchVectorService.java New Elasticsearch vector service implementation using Rest5Client for generic requests.
openmetadata-service/src/test/java/org/openmetadata/service/search/vector/ElasticSearchVectorServiceTest.java New tests for ES vector service result parsing, grouping, and dimension patching.
openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java Initializes ES vector service when Elasticsearch backend is configured; mapping selection tweaks for ES-native template.
openmetadata-service/src/main/java/org/openmetadata/service/search/RecreateWithEmbeddings.java Attempts to include a vector “entity” key in recreate flow when vector search is enabled.
openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/SemanticSearchQueryBuilder.java New builder for semantic/hybrid query composition on Elasticsearch.
openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchIndexManager.java Extracts mappings sub-object before calling putMapping.
openmetadata-service/src/test/java/org/openmetadata/service/search/elasticsearch/ElasticSearchIndexManagerTest.java Adds a test asserting updateIndex handles full index JSON by extracting mappings.
openmetadata-service/src/main/java/org/openmetadata/service/resources/search/VectorSearchResource.java Switches to repository-provided VectorIndexService and adds a fingerprint endpoint.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/ElasticSearchBulkSink.java Adds async vector-embedding task execution + migration path for ES indexing jobs.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/ElasticSearchBulkSinkSimpleTest.java Adds minimal coverage for vector-embedding helpers on the ES sink.
openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/SemanticSearchTool.java Uses repository VectorIndexService rather than OpenSearch-only implementation.

@joaopamaral
Author

I also need to review everything again after the refactor in #26000 😢

@harshach harshach added the safe to test Add this label to run secure Github workflows on PRs label Apr 22, 2026
@harshach
Collaborator

@joaopamaral thanks for your work on this. Can you please check the Copilot comments and address the merge conflict here?

@github-actions
Contributor

The Java checkstyle failed.

Please run mvn spotless:apply in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Java code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@github-actions
Contributor

⚠️ TypeScript Types Need Update

The generated TypeScript types are out of sync with the JSON schema changes.

Since this is a pull request from a forked repository, the types cannot be automatically committed.
Please generate and commit the types manually:

cd openmetadata-ui/src/main/resources/ui
./json2ts-generate-all.sh -l true
git add src/generated/
git commit -m "Update generated TypeScript types"
git push

After pushing the changes, this check will pass automatically.

The /vector/fingerprint diagnostic endpoint allowed any authenticated
user to enumerate vector fingerprints for arbitrary entity UUIDs.
Replace the subject-only extraction with authorizer.authorizeAdmin()
to restrict access to admins.

Add VectorSearchResourceTest covering: admin gate enforcement,
found/not-found fingerprints, bad UUID, missing parentId, and
service-unavailable when vector search is disabled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Elasticsearch returns 4xx/5xx as regular HTTP responses that the low-level
client does not throw on. Previously the response body was returned as-is,
causing downstream JSON parsing failures with no context about the real error.

Now checks response.getStatusCode() and throws IOException with the status
and body when >= 400, mirroring the same pattern in OpenSearchVectorService.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
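
The status check described in this commit message can be sketched independently of the client types. This is a minimal illustration under stated assumptions: the class ResponseChecks and the method requireSuccess are hypothetical names, and the status code and body are passed in as plain values instead of being read from the low-level client's response object.

```java
import java.io.IOException;

final class ResponseChecks {
  /**
   * Returns the body unchanged for successful responses and fails fast otherwise,
   * so callers never try to parse an error payload as a search result.
   */
  static String requireSuccess(int statusCode, String body) throws IOException {
    if (statusCode >= 400) {
      throw new IOException(
          "Elasticsearch request failed with status " + statusCode + ": " + body);
    }
    return body;
  }

  public static void main(String[] args) {
    try {
      System.out.println(requireSuccess(200, "{\"hits\":{}}"));
      requireSuccess(404, "{\"error\":\"index_not_found_exception\"}");
    } catch (IOException e) {
      System.out.println("Failed: " + e.getMessage());
    }
  }
}
```

Including the body in the exception message keeps the real Elasticsearch error visible at the point of failure.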
Copilot AI review requested due to automatic review settings April 25, 2026 17:34
…arch/vector/ElasticSearchVectorServiceTest.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 20 changed files in this pull request and generated 3 comments.

Comment on lines 603 to 617
@@ -601,11 +613,7 @@ private String getIndexMapping(IndexMapping indexMapping) {
   }

   public String readIndexMapping(IndexMapping indexMapping) {
-    String mapping = getIndexMapping(indexMapping);
-    if (isVectorEmbeddingEnabled() && embeddingClient != null && mapping != null) {
-      mapping = reformatVectorIndexWithDimension(mapping, embeddingClient.getDimension());
-    }
-    return mapping;
+    return getIndexMapping(indexMapping);
   }

Copilot AI Apr 25, 2026

readIndexMapping() no longer enriches mappings when vector embeddings are enabled. Index creation/update now enriches via EsUtils.enrichIndexMappingForElasticsearch, but index template creation (createOrUpdateIndexTemplate(s)) still uses readIndexMapping() directly, so templates for embedding-capable indices may miss the injected dense_vector embedding field and _meta when embeddings are enabled. Consider enriching the mapping content for the template path as well (e.g., apply the same EsUtils.enrichIndexMappingForElasticsearch before calling putIndexTemplate).

… coverage

The method was deleted in the inline-embedding refactor. Remove the test
that invoked it via reflection (which would throw NoSuchMethodException).

Replace with:
- SearchRepositoryBehaviorTest: readIndexMappingReturnsMappingForKnownIndex
  verifies readIndexMapping still loads the file-based mapping correctly
- EsUtilsTest: three tests for enrichIndexMappingForElasticsearch covering
  null/empty input, skip when fingerprint field absent, and dense_vector
  injection with _meta when fingerprint field present and vector enabled

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@gitar-bot

gitar-bot Bot commented Apr 25, 2026

Code Review 👍 Approved with suggestions 5 resolved / 6 findings

Adds native Elasticsearch vector search support with comprehensive test coverage and fixes to pagination, initialization ordering, and interface safety. Consider adding a type guard for the extractRestClient cast to Rest5ClientTransport to prevent runtime errors.

💡 Edge Case: extractRestClient cast to Rest5ClientTransport has no guard

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/vector/ElasticSearchVectorService.java:60-63

At line 61, extractRestClient() performs an unchecked cast (Rest5ClientTransport) client._transport(). While the current codebase always creates the ES client with Rest5ClientTransport, a future refactor or different client construction path would produce a ClassCastException with no helpful message. A defensive check or better error message would improve debuggability.

Suggested fix
private static Rest5Client extractRestClient(ElasticsearchClient client) {
  if (!(client._transport() instanceof Rest5ClientTransport rest5)) {
    throw new IllegalArgumentException(
        "ElasticSearchVectorService requires Rest5ClientTransport, got: "
            + client._transport().getClass().getName());
  }
  return rest5.restClient();
}
✅ 5 resolved
Bug: Test calls build() with 4 args but method requires 6 — won't compile

📄 openmetadata-service/src/test/java/org/openmetadata/service/search/vector/VectorSearchQueryBuilderTest.java:864 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/vector/VectorSearchQueryBuilder.java:19-25
The test testNativeESQueryAndOpenSearchQueryProduceSameFilters at line 864 calls VectorSearchQueryBuilder.build(vector, 10, 100, filters) with 4 arguments, but the build() method signature requires 6 parameters: (float[] vector, int size, int from, int k, Map<String, List<String>> filters, double threshold). This will fail to compile. The intent appears to be comparing OpenSearch and ES native queries with the same filters.

Edge Case: loadIndexMapping dimension replacement is brittle — exact string match

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/vector/ElasticSearchVectorService.java:523-527 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java:3060-3068
In ElasticSearchVectorService.loadIndexMapping(), the dimension placeholder is replaced via exact string matching: template.replace("\"dims\": 512", "\"dims\": " + dimension). This requires the JSON template to have exactly one space after the colon. If the template is reformatted (e.g., minified, or extra spaces), the replacement silently fails. The code does have a post-check that throws if no replacement happened, but the same brittleness exists in SearchRepository.reformatVectorIndexWithDimension() fallback (lines 3062-3068) which does multiple .replace() calls for both with-space and without-space variants — showing awareness of the problem but an incomplete fix.
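
One whitespace-tolerant alternative to the exact string match is a regular expression that accepts both key names. The sketch below is an assumption about how this could look, not the fix that landed in the PR; DimensionPatcher and withDimension are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DimensionPatcher {
  // Matches "dims" (Elasticsearch) or "dimension" (OpenSearch) followed by an integer,
  // tolerating any amount of whitespace around the colon.
  private static final Pattern DIMENSION =
      Pattern.compile("(?<key>\"(?:dims|dimension)\"\\s*:\\s*)\\d+");

  /** Replaces the numeric dimension value, or throws if the template has none. */
  static String withDimension(String mappingJson, int dimension) {
    Matcher matcher = DIMENSION.matcher(mappingJson);
    if (!matcher.find()) {
      throw new IllegalArgumentException("No dims/dimension field found in mapping");
    }
    // replaceAll resets the matcher, so the earlier find() does not skip a match.
    // The named group keeps the original key and spacing intact.
    return matcher.replaceAll("${key}" + dimension);
  }

  public static void main(String[] args) {
    System.out.println(withDimension("{\"type\":\"dense_vector\",\"dims\": 512}", 768));
    System.out.println(withDimension("{\"type\":\"knn_vector\",\"dimension\":512}", 768));
  }
}
```

A regex still treats the mapping as text, so it would also rewrite a "dims" key that appears somewhere unrelated. Parsing the template into a JSON tree and setting the field there would be more robust if a JSON library is already on the classpath.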

Edge Case: init() assigns instance before registerVectorEmbeddingHandler completes

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/vector/ElasticSearchVectorService.java:65-79
In ElasticSearchVectorService.init() (line 70), instance is assigned the new service object, and then instance.registerVectorEmbeddingHandler() is called on line 71. Since getInstance() is not synchronized, a concurrent caller could observe the instance in a partially-initialized state (before the handler is registered). The volatile keyword ensures the reference is visible but doesn't guarantee that registerVectorEmbeddingHandler() has completed.

This mirrors the existing OpenSearch pattern, but the ES version is new code where it can be fixed.
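
The ordering problem described here is usually avoided by finishing setup on a local reference and assigning the shared field last. The following is a minimal sketch of that idea with hypothetical names (VectorServiceHolder, registerHandler); it does not reproduce the actual ElasticSearchVectorService code.

```java
final class VectorServiceHolder {
  private static volatile VectorServiceHolder instance;

  private boolean handlerRegistered;

  private VectorServiceHolder() {}

  // Stand-in for registerVectorEmbeddingHandler() in the real service.
  private void registerHandler() {
    handlerRegistered = true;
  }

  boolean isHandlerRegistered() {
    return handlerRegistered;
  }

  static synchronized void init() {
    if (instance != null) {
      return; // already initialized
    }
    VectorServiceHolder candidate = new VectorServiceHolder();
    candidate.registerHandler(); // finish all setup on the local reference
    instance = candidate;        // publish only the fully initialized object
  }

  static VectorServiceHolder getInstance() {
    return instance;
  }

  public static void main(String[] args) {
    init();
    System.out.println(getInstance().isHandlerRegistered());
  }
}
```

Because the volatile write happens after registerHandler(), any thread that reads a non-null instance also sees the registered handler.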

Bug: ES search pagination is broken vs OpenSearch implementation

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/vector/ElasticSearchVectorService.java:95-109 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/vector/OpenSearchVectorService.java:196-210
The ElasticSearchVectorService.search() passes from directly to the ES kNN query, which skips raw hits (chunks). However, the API contract expects from to skip parent entities (grouped results). The OpenSearch implementation correctly handles this: it fetches from + size + 1 parents via a loop, then skips from parents in application code.

With the current ES implementation, requesting from=5, size=10 skips 5 raw chunks (not 5 parents), leading to incorrect pagination results. Additionally, the ES version uses the 2-arg VectorSearchResponse constructor, so totalHits and hasMore are always null — callers lose pagination metadata.
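
Parent-level pagination as described for the OpenSearch path can be sketched as: group the ranked chunk hits by parent, then apply from and size to the parents. The types and names below (ParentPagination, Hit, Page) are hypothetical, and the sketch assumes the caller has already fetched enough chunk hits to cover from + size + 1 parents.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ParentPagination {
  record Hit(String parentId, String chunk) {}

  record Page(List<String> parentIds, boolean hasMore) {}

  /** Applies from/size to distinct parents, preserving the order in which they first appear. */
  static Page paginate(List<Hit> rankedHits, int from, int size) {
    Map<String, List<Hit>> byParent = new LinkedHashMap<>();
    for (Hit hit : rankedHits) {
      byParent.computeIfAbsent(hit.parentId(), id -> new ArrayList<>()).add(hit);
    }
    List<String> parents = new ArrayList<>(byParent.keySet());
    int start = Math.min(from, parents.size());
    int end = Math.min(start + size, parents.size());
    return new Page(List.copyOf(parents.subList(start, end)), parents.size() > end);
  }

  public static void main(String[] args) {
    List<Hit> hits = List.of(
        new Hit("A", "a1"), new Hit("A", "a2"), new Hit("B", "b1"),
        new Hit("C", "c1"), new Hit("C", "c2"), new Hit("D", "d1"));
    System.out.println(paginate(hits, 1, 2));
  }
}
```

With these sample hits, from=1 skips parent A entirely. Skipping one raw chunk instead would leave A's second chunk at the top of the page.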

Quality: Unsafe downcast defeats purpose of VectorIndexService interface

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/ElasticSearchBulkSink.java:778
ElasticSearchBulkSink casts vectorService to (ElasticSearchVectorService) to call copyExistingVectorDocuments(). This couples the sink back to the concrete class, undermining the new interface abstraction. If vectorService is ever a different implementation, this throws ClassCastException at runtime.

@github-actions
Contributor

Jest test Coverage

UI tests summary

Lines: 61% (coverage badge)
Statements: 61.94% (61764/99706)
Branches: 42.06% (33022/78504)
Functions: 45.1% (9763/21647)

Labels

safe to test Add this label to run secure Github workflows on PRs

4 participants