Add semantic enrichment support for OpenSearch sink index creation#6771

Open
kkondaka wants to merge 7 commits into opensearch-project:main from kkondaka:ossink-ase

@kkondaka
Collaborator

Description

This PR adds support for creating OpenSearch indices with semantic enrichment via the Amazon OpenSearch Service
and Amazon OpenSearch Serverless (AOSS) control plane APIs. When configured, Data Prepper creates indices with
semantic enrichment-enabled field mappings before the normal index setup, allowing OpenSearch to automatically
generate vector embeddings for specified text fields.

Problem
Currently, Data Prepper's OpenSearch sink creates indices using the standard OpenSearch REST API. However,
enabling semantic enrichment (https://docs.aws.amazon.com/opensearch-service/latest/developerguide/semantic-search.html)
on index fields requires using the AWS control plane APIs (es:CreateIndex for managed domains,
OpenSearchServerless.CreateIndex for AOSS), which are not accessible through the standard OpenSearch client.
Users had to manually pre-create indices with semantic enrichment before running Data Prepper pipelines.

Solution
Introduced a new semantic_enrichment configuration block under aws settings that allows users to specify which
fields should have semantic enrichment enabled:

sink:
  - opensearch:
      aws:
        sts_role_arn: "arn:aws:iam::123456789012:role/MyRole"
        region: "us-west-2"
        semantic_enrichment:
          fields: ["title", "description"]
          language: "english"  # optional, defaults to "english"

How it works

  1. During sink initialization, if semantic_enrichment.fields is configured and AWS SigV4 auth is enabled, the sink creates a SemanticEnrichmentIndexCreator
  2. The creator auto-detects whether the target is a managed domain or serverless collection from the host URL
  3. It builds an index schema with semantic_enrichment: { status: ENABLED, language_options: } on each specified field
  4. The request is SigV4-signed and sent to the appropriate AWS control plane endpoint
  5. If the index already exists (ConflictException / ResourceAlreadyExistsException), creation is silently skipped
  6. Normal index setup via indexManager.setupIndex() proceeds afterward
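
Step 2 above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the class name EndpointTypeSketch, the EndpointType enum, and the detect method are hypothetical names chosen for this example. Only the two substring checks (".aoss." and ".es.amazonaws.com") come from the code excerpts quoted in the review below. Note that a custom endpoint matches neither check, which is the concern raised later in the conversation.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical sketch: classify a sink host URL as serverless or managed domain.
final class EndpointTypeSketch {

    enum EndpointType { SERVERLESS, MANAGED_DOMAIN }

    static EndpointType detect(final String hostUrl) {
        // URI.getHost() returns null when the string has no scheme/authority part.
        final String host = URI.create(hostUrl).getHost();
        if (host == null) {
            throw new IllegalArgumentException("Cannot parse a hostname from: " + hostUrl);
        }
        final String hostname = host.toLowerCase(Locale.ROOT);
        if (hostname.contains(".aoss.")) {
            return EndpointType.SERVERLESS;
        }
        if (hostname.contains(".es.amazonaws.com")) {
            return EndpointType.MANAGED_DOMAIN;
        }
        // Custom endpoints land here; the endpoint type cannot be inferred from the name.
        throw new IllegalArgumentException(
                "Cannot infer the endpoint type from hostname: " + hostname);
    }

    public static void main(final String[] args) {
        System.out.println(detect("https://abc123.us-west-2.aoss.amazonaws.com"));
        System.out.println(detect("https://search-my-domain-abc123.us-west-2.es.amazonaws.com"));
    }
}
```

Throwing on an unrecognized hostname is one possible design; an explicit configuration option would be needed to support custom endpoints.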

Issues Resolved

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • [X] New functionality includes testing.
  • [ ] New functionality has a documentation issue. Please link to it in this PR.
    • [ ] New functionality has javadoc added
  • [X] Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions

github-actions Bot commented Apr 20, 2026

✅ License Header Check Passed

All newly added files have proper license headers. Great work! 🎉

if (!hostname.contains(".aoss.")) {
    throw new IllegalArgumentException(
        "Host does not appear to be an AOSS endpoint: " + hostname +
        ". Semantic enrichment via AOSS control plane requires a serverless collection.");
Member

"control plane"

Also, "Serverless" should be capitalized.

// or vpc-{domain-name}-{random}.{region}.es.amazonaws.com
if (!hostname.contains(".es.amazonaws.com")) {
    throw new IllegalArgumentException(
        "Host does not appear to be a managed OpenSearch domain: " + hostname +
Member

"managed OpenSearch domain" -> "Amazon OpenSearch Service domain"

Collaborator Author

NA

static final String DEFAULT_LANGUAGE = "english";

@JsonProperty("fields")
private List<String> fields;

Each field can have its own language. At the moment, two options are available: multilingual and english.

public void createIndex(final String indexName, final SemanticEnrichmentConfig semanticConfig) throws IOException {
    if (serverless) {
        createServerlessIndex(indexName, semanticConfig);
    } else {

nit: else is redundant, can do an early return
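
The early-return shape suggested here could look something like the sketch below. It is an illustration only: the excerpt above is cut off at the else branch, so createManagedDomainIndex is a hypothetical name for whatever the non-serverless path calls, the SemanticEnrichmentConfig parameter is omitted for brevity, and the lastCall field exists purely so the example can show which branch ran.

```java
import java.io.IOException;

// Hypothetical sketch of the early-return refactor; not the PR's actual class.
final class EarlyReturnSketch {

    private final boolean serverless;
    String lastCall; // records which branch ran, for illustration only

    EarlyReturnSketch(final boolean serverless) {
        this.serverless = serverless;
    }

    void createIndex(final String indexName) throws IOException {
        if (serverless) {
            createServerlessIndex(indexName);
            return; // early return removes the need for an else block
        }
        createManagedDomainIndex(indexName);
    }

    // Stand-ins for the real AWS calls, which are not shown in the excerpt.
    private void createServerlessIndex(final String indexName) throws IOException {
        lastCall = "serverless:" + indexName;
    }

    private void createManagedDomainIndex(final String indexName) throws IOException {
        lastCall = "managed:" + indexName;
    }

    public static void main(final String[] args) throws IOException {
        final EarlyReturnSketch sketch = new EarlyReturnSketch(true);
        sketch.createIndex("my-index");
        System.out.println(sketch.lastCall);
    }
}
```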

@pyek-bot

As @dlvenable mentioned, custom endpoints could cause an issue. Should the user be able to specify whether it is an AOSS endpoint or an AOS endpoint? That would resolve the issue.

Additionally, I'm not sure that silently failing is the right approach. If the index fails to create, should the sink fail fast so the user can correct it? An incorrect mapping can lead to hard-to-diagnose issues down the line. If the index already exists, maybe the user can just provide it, or we can check whether the existing index's mapping matches the one requested in the YAML.
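
One way to picture the fail-fast behavior discussed in this comment is the sketch below. It is an assumption-laden illustration, not code from the PR: the class and method names are hypothetical, and error codes are modeled as plain strings. Only the two "already exists" error names (ConflictException and ResourceAlreadyExistsException) come from the PR description.

```java
import java.util.Set;

// Hypothetical sketch: skip only when the index already exists, fail fast otherwise.
final class IndexCreationFailureSketch {

    // Error names taken from the PR description's "already exists" case.
    private static final Set<String> ALREADY_EXISTS_ERRORS =
            Set.of("ConflictException", "ResourceAlreadyExistsException");

    static boolean isAlreadyExists(final String errorCode) {
        // Set.of(...) rejects null lookups, so guard against a missing error code.
        return errorCode != null && ALREADY_EXISTS_ERRORS.contains(errorCode);
    }

    static void handleFailure(final String indexName, final String errorCode) {
        if (isAlreadyExists(errorCode)) {
            // The index is already there; creation is skipped.
            return;
        }
        // Any other failure stops initialization so the user can correct the configuration.
        throw new IllegalStateException(
                "Failed to create index " + indexName + " with semantic enrichment: " + errorCode);
    }

    public static void main(final String[] args) {
        handleFailure("my-index", "ConflictException"); // skipped silently
        System.out.println(isAlreadyExists("AccessDeniedException"));
    }
}
```

This sketch does not cover the second suggestion in the comment, comparing the existing index's mapping against the requested one.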

Comment on lines +29 to +31
public String getLanguage() {
    return language;
}
Collaborator

From the ASE doc, language_options is set on each field in the create-index API, so it can differ between fields. But this config can only apply one language to all the fields. https://docs.aws.amazon.com/opensearch-service/latest/developerguide/opensearch-semantic-enrichment.html
So would it be better to define a LanguageConfig that includes both field and language, and use List<LanguageConfig> instead of List<String> fields?

Collaborator Author

@Zhangxunmt Maybe I should remove language option for now? What do you think?

Collaborator

@Zhangxunmt Zhangxunmt Apr 29, 2026

Do you mean only support English as default?
I think defining a List of Map should work easily as below.

@JsonProperty("semantic_fields")
private List<Map<String, String>> semanticFields;

In the config:

semantic_fields:
  - title_semantic: "english"
  - product_description: "multilingual"

    fieldMapping.put("semantic_enrichment", semanticEnrichment);
    properties.put(field, fieldMapping);
}
return Map.of("mappings", Map.of("properties", properties));
Collaborator

I think this cannot support the official API example, which allows one field to be multilingual and another to be english, as mentioned above.
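
A per-field variant of the schema builder might look like the sketch below. It is illustrative only and rests on several assumptions: the class and method names are hypothetical, the "type": "text" entry is added because the description speaks of text fields, and language_options is written as a plain string per field. The list-of-single-entry-maps input follows the semantic_fields shape proposed earlier in this thread, and the outer mappings/properties structure follows the excerpt above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: build an index schema with a separate language per field.
final class PerFieldLanguageSchemaSketch {

    // Flattens the proposed list of single-entry maps into field -> language.
    static Map<String, String> toLanguageByField(final List<Map<String, String>> semanticFields) {
        final Map<String, String> languageByField = new LinkedHashMap<>();
        for (final Map<String, String> entry : semanticFields) {
            languageByField.putAll(entry);
        }
        return languageByField;
    }

    static Map<String, Object> buildSchema(final Map<String, String> languageByField) {
        final Map<String, Object> properties = new LinkedHashMap<>();
        for (final Map.Entry<String, String> entry : languageByField.entrySet()) {
            final Map<String, Object> semanticEnrichment = new LinkedHashMap<>();
            semanticEnrichment.put("status", "ENABLED");
            // Assumption: language_options takes the language name as a string.
            semanticEnrichment.put("language_options", entry.getValue());

            final Map<String, Object> fieldMapping = new LinkedHashMap<>();
            fieldMapping.put("type", "text"); // assumption: enriched fields are text fields
            fieldMapping.put("semantic_enrichment", semanticEnrichment);
            properties.put(entry.getKey(), fieldMapping);
        }
        return Map.of("mappings", Map.of("properties", properties));
    }

    public static void main(final String[] args) {
        final List<Map<String, String>> semanticFields = List.of(
                Map.of("title_semantic", "english"),
                Map.of("product_description", "multilingual"));
        System.out.println(buildSchema(toLanguageByField(semanticFields)));
    }
}
```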

graytaylor0 previously approved these changes Apr 28, 2026
@JsonProperty("language")
private String language = DEFAULT_LANGUAGE;

@JsonProperty("collection_name")
Member

I think we should move these new configurations into the general aws: block. We may have other features in the future that rely directly on AOS/AOSS. They would hit the same potential issues of custom domain names. Those features would then have duplicate configurations.

Member

For clarity - just the domain_name and collection_name configurations would go into aws:. The others fit in this block.

Collaborator Author

@dlvenable The aws: block has a serverless: option, so it doesn't make sense to add both domain_name and collection_name: if serverless: is true, only collection_name makes sense; similarly, when serverless: is false, only domain_name makes sense. I think we need a generic name.

Collaborator Author

I propose a generic resource_name: in the aws: block.

Member

That name works for me.

Comment on lines +17 to +19
public enum SemanticEnrichmentLanguage {
    ENGLISH("english"),
    MULTILINGUAL("multilingual");

Thanks for making the change! This does raise a question of what happens if additional language modes are supported.

Does it default to english?

Collaborator

Yeah, additional languages would need to be added to the enum if new languages are supported in the future. It's not that flexible, but using an enum avoids typos in the config.
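
To illustrate the trade-off described here, the sketch below shows an enum lookup that rejects unknown values instead of falling back to a default. It is a hypothetical illustration: the name SemanticEnrichmentLanguageSketch and the fromValue method are not from the PR, and whether the real enum rejects or defaults on an unknown language is not stated in this conversation. The two constants and their string values come from the excerpt above.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch: an enum-backed language option with a strict lookup.
enum SemanticEnrichmentLanguageSketch {
    ENGLISH("english"),
    MULTILINGUAL("multilingual");

    // Lookup table from config value to constant, built once.
    private static final Map<String, SemanticEnrichmentLanguageSketch> BY_VALUE =
            Arrays.stream(values()).collect(Collectors.toMap(l -> l.value, Function.identity()));

    private final String value;

    SemanticEnrichmentLanguageSketch(final String value) {
        this.value = value;
    }

    String getValue() {
        return value;
    }

    // Assumption: unknown values are rejected rather than defaulted to english.
    static SemanticEnrichmentLanguageSketch fromValue(final String value) {
        final SemanticEnrichmentLanguageSketch language =
                value == null ? null : BY_VALUE.get(value.toLowerCase(Locale.ROOT));
        if (language == null) {
            throw new IllegalArgumentException("Unsupported semantic enrichment language: " + value);
        }
        return language;
    }
}
```

With this shape, a typo such as "englsh" fails at configuration time, and supporting a new language means adding one constant.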

kkondaka added 4 commits May 5, 2026 12:49
Signed-off-by: Kondaka <krishkdk@amazon.com>
Signed-off-by: Kondaka <krishkdk@amazon.com>
…dpoint for creating semantic enrichment enabled indexes

Signed-off-by: Kondaka <krishkdk@amazon.com>
Signed-off-by: Kondaka <krishkdk@amazon.com>
…eparate class

Signed-off-by: Kondaka <krishkdk@amazon.com>
Member

@dlvenable dlvenable left a comment

@kkondaka ,

Overall this is good, but please update the logging terms to avoid "control plane" and "AWS OpenSearch" per my previous review.

  • Use "Amazon OpenSearch Serverless APIs" instead of "serverless control plane".
  • Use "Amazon OpenSearch Service APIs" instead of "managed domain control plane".
  • Use "Amazon OpenSearch Service" instead of "AWS OpenSearch".

For some reason I can't post comments on lines right now, but there are three occurrences I found.

Also the license header checks are still failing. See the comment it auto-created.

kkondaka added 2 commits May 6, 2026 09:42
Signed-off-by: Kondaka <krishkdk@amazon.com>
Signed-off-by: Kondaka <krishkdk@amazon.com>

5 participants