Add semantic enrichment support for OpenSearch sink index creation #6771
kkondaka wants to merge 7 commits into opensearch-project:main
✅ License Header Check Passed
All newly added files have proper license headers. Great work! 🎉
| if (!hostname.contains(".aoss.")) { | ||
| throw new IllegalArgumentException( | ||
| "Host does not appear to be an AOSS endpoint: " + hostname + | ||
| ". Semantic enrichment via AOSS control plane requires a serverless collection."); |
"control plane"
Also, "Serverless" should be capitalized.
| // or vpc-{domain-name}-{random}.{region}.es.amazonaws.com | ||
| if (!hostname.contains(".es.amazonaws.com")) { | ||
| throw new IllegalArgumentException( | ||
| "Host does not appear to be a managed OpenSearch domain: " + hostname + |
"managed OpenSearch domain" -> "Amazon OpenSearch Service domain"
| static final String DEFAULT_LANGUAGE = "english"; | ||
|
|
||
| @JsonProperty("fields") | ||
| private List<String> fields; |
Each field can have its own language. At the moment, two options are available: multilingual and english.
| public void createIndex(final String indexName, final SemanticEnrichmentConfig semanticConfig) throws IOException { | ||
| if (serverless) { | ||
| createServerlessIndex(indexName, semanticConfig); | ||
| } else { |
nit: else is redundant, can do an early return
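A minimal sketch of the early-return shape suggested here. `IndexCreatorSketch` and `createDomainIndex` are hypothetical stand-ins rather than names from the PR, and the `semanticConfig` parameter is omitted to keep the example self-contained:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the class under review; only the branching matters here.
class IndexCreatorSketch {
    private final boolean serverless;
    // Records which path was taken so the behavior can be observed without a cluster.
    final List<String> calls = new ArrayList<>();

    IndexCreatorSketch(final boolean serverless) {
        this.serverless = serverless;
    }

    void createIndex(final String indexName) {
        if (serverless) {
            createServerlessIndex(indexName);
            return; // early return, so no else branch is needed
        }
        createDomainIndex(indexName);
    }

    private void createServerlessIndex(final String indexName) {
        calls.add("serverless:" + indexName);
    }

    // Hypothetical name for the non-serverless path.
    private void createDomainIndex(final String indexName) {
        calls.add("domain:" + indexName);
    }
}
```

Behavior is unchanged from the if/else version; the non-serverless path simply loses one level of nesting.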
As @dlvenable mentioned, custom endpoints could cause an issue. Should the user be able to specify whether it is an AOSS endpoint or an AOS endpoint? That would resolve the issue. Additionally, I'm not sure that failing silently is the right approach. If index creation fails, should it fail fast so the user can correct it? An incorrect mapping can lead to unknown issues down the line. Maybe the index already exists and the user can just provide it, or we can check whether the index exists and whether its mapping matches the one requested in the YAML.
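One way the explicit-override idea could look, as a rough sketch rather than the PR's implementation. `EndpointType` and `EndpointTypeResolver` are hypothetical names; the hostname checks mirror the `.aoss.` and `.es.amazonaws.com` tests quoted earlier in this review, and an unrecognized host fails fast instead of being skipped silently:

```java
// Hypothetical enum for a user-facing setting that names the endpoint type explicitly.
enum EndpointType {
    SERVICE_DOMAIN,
    SERVERLESS_COLLECTION
}

final class EndpointTypeResolver {
    private EndpointTypeResolver() {
    }

    /**
     * Prefers the explicitly configured type (useful for custom endpoints) and
     * falls back to hostname detection only when nothing was configured.
     */
    static EndpointType resolve(final String hostname, final EndpointType configured) {
        if (configured != null) {
            return configured;
        }
        if (hostname.contains(".aoss.")) {
            return EndpointType.SERVERLESS_COLLECTION;
        }
        if (hostname.contains(".es.amazonaws.com")) {
            return EndpointType.SERVICE_DOMAIN;
        }
        // Fail fast: a custom endpoint without an explicit type cannot be classified.
        throw new IllegalArgumentException(
                "Cannot determine the endpoint type from host: " + hostname
                        + ". Specify the endpoint type explicitly.");
    }
}
```

With this shape, a custom domain name only works when the user states the type, and a misconfiguration surfaces at startup rather than as a mapping problem later.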
| public String getLanguage() { | ||
| return language; | ||
| } |
According to the ASE doc, language_options is set per field in the create-index API, so it can differ between fields. This config, however, can apply only one language to all the fields. https://docs.aws.amazon.com/opensearch-service/latest/developerguide/opensearch-semantic-enrichment.html.
So would it be better to define a LanguageConfig that includes both field and language, and use List<LanguageConfig> instead of List<String> fields?
There was a problem hiding this comment.
@Zhangxunmt Maybe I should remove the language option for now? What do you think?
There was a problem hiding this comment.
Do you mean only support English as default?
I think defining a List of Maps should work, as shown below.
@JsonProperty("semantic_fields")
private List<Map<String, String>> semanticFields;
In the config:
semantic_fields:
- title_semantic: "english"
- product_description: "multilingual"
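As a rough illustration of how that list-of-maps shape could be consumed, the sketch below flattens it into one ordered field-to-language map. `SemanticFieldsSketch` and `toFieldLanguages` are hypothetical names, and the validation rules (exactly one key per entry, no duplicate fields) are assumptions rather than anything stated in this thread:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SemanticFieldsSketch {
    private SemanticFieldsSketch() {
    }

    /**
     * Flattens entries such as [{title_semantic: english}, {product_description: multilingual}]
     * into a single map that preserves the order given in the config.
     */
    static Map<String, String> toFieldLanguages(final List<Map<String, String>> semanticFields) {
        final Map<String, String> result = new LinkedHashMap<>();
        for (final Map<String, String> entry : semanticFields) {
            // Assumed rule: each list item names exactly one field.
            if (entry.size() != 1) {
                throw new IllegalArgumentException(
                        "Each semantic_fields entry must contain exactly one field: " + entry);
            }
            final Map.Entry<String, String> fieldAndLanguage = entry.entrySet().iterator().next();
            // Assumed rule: a field may appear only once.
            if (result.putIfAbsent(fieldAndLanguage.getKey(), fieldAndLanguage.getValue()) != null) {
                throw new IllegalArgumentException(
                        "Duplicate semantic field: " + fieldAndLanguage.getKey());
            }
        }
        return result;
    }
}
```

A dedicated config class with explicit field and language properties, as proposed earlier in the thread, would make these checks unnecessary at the cost of a slightly more verbose YAML.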
| fieldMapping.put("semantic_enrichment", semanticEnrichment); | ||
| properties.put(field, fieldMapping); | ||
| } | ||
| return Map.of("mappings", Map.of("properties", properties)); |
I don't think this can support the official API example, in which one field is multilingual and another is English, as mentioned above.
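A sketch of how the mapping builder could carry a language per field. It assumes the caller already has a field-to-language map; `SemanticMappingSketch` is a hypothetical name. The `language_options` key comes from the ASE doc discussion above, while the `"type": "text"` and `"status": "ENABLED"` entries are assumptions about the request body that should be checked against that documentation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SemanticMappingSketch {
    private SemanticMappingSketch() {
    }

    /** Builds a mappings body in which every field carries its own language option. */
    static Map<String, Object> buildMappings(final Map<String, String> fieldLanguages) {
        final Map<String, Object> properties = new LinkedHashMap<>();
        for (final Map.Entry<String, String> fieldAndLanguage : fieldLanguages.entrySet()) {
            final Map<String, Object> semanticEnrichment = new LinkedHashMap<>();
            semanticEnrichment.put("status", "ENABLED"); // assumed key and value
            semanticEnrichment.put("language_options", fieldAndLanguage.getValue());

            final Map<String, Object> fieldMapping = new LinkedHashMap<>();
            fieldMapping.put("type", "text"); // assumed field type
            fieldMapping.put("semantic_enrichment", semanticEnrichment);
            properties.put(fieldAndLanguage.getKey(), fieldMapping);
        }
        return Map.of("mappings", Map.of("properties", properties));
    }
}
```

The only structural change from a single shared language is that the language value is read inside the loop instead of once outside it.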
| @JsonProperty("language") | ||
| private String language = DEFAULT_LANGUAGE; | ||
|
|
||
| @JsonProperty("collection_name") |
I think we should move these new configurations into the general aws: block. We may have other features in the future that rely directly on AOS/AOSS. They would hit the same potential issues of custom domain names. Those features would then have duplicate configurations.
There was a problem hiding this comment.
For clarity - just the domain_name and collection_name configurations would go into aws:. The others fit in this block.
There was a problem hiding this comment.
@dlvenable The aws: block already has a serverless: option, so it doesn't make sense to add both domain_name and collection_name: when serverless: is true, only collection_name applies, and when serverless: is false, only domain_name applies. I think we need a generic name.
There was a problem hiding this comment.
I propose a generic resource_name: in the aws: block.
| public enum SemanticEnrichmentLanguage { | ||
| ENGLISH("english"), | ||
| MULTILINGUAL("multilingual"); |
Thanks for making the change! This does raise a question of what happens if additional language modes are supported.
Does it default to english?
There was a problem hiding this comment.
Yes, new languages would need to be added to the enum if they are supported in the future. It's not that flexible, but using an enum avoids typos in the config.
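To make the trade-off concrete, here is a small sketch of an enum lookup that rejects unknown values instead of silently defaulting to english. `SemanticEnrichmentLanguageSketch` and `fromValue` are hypothetical names, and exact-match comparison is an assumption:

```java
// Hypothetical variant of the enum under review, with an explicit lookup method.
enum SemanticEnrichmentLanguageSketch {
    ENGLISH("english"),
    MULTILINGUAL("multilingual");

    private final String value;

    SemanticEnrichmentLanguageSketch(final String value) {
        this.value = value;
    }

    String getValue() {
        return value;
    }

    /** Returns the matching constant, or fails with a message listing what is supported. */
    static SemanticEnrichmentLanguageSketch fromValue(final String value) {
        final StringBuilder supported = new StringBuilder();
        for (final SemanticEnrichmentLanguageSketch candidate : values()) {
            if (candidate.value.equals(value)) {
                return candidate;
            }
            if (supported.length() > 0) {
                supported.append(", ");
            }
            supported.append(candidate.value);
        }
        throw new IllegalArgumentException(
                "Unsupported semantic enrichment language: " + value
                        + ". Supported values: " + supported);
    }
}
```

Under this approach, a language mode added by the service later would be rejected until the enum is extended, which is the inflexibility acknowledged above.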
Signed-off-by: Kondaka <krishkdk@amazon.com>
Signed-off-by: Kondaka <krishkdk@amazon.com>
…dpoint for creating semantic enrichment enabled indexes Signed-off-by: Kondaka <krishkdk@amazon.com>
Signed-off-by: Kondaka <krishkdk@amazon.com>
…eparate class Signed-off-by: Kondaka <krishkdk@amazon.com>
dlvenable left a comment
Overall this is good, but please update the logging terms to avoid "control plane" and "AWS OpenSearch" per my previous review.
- Use "Amazon OpenSearch Serverless APIs" instead of "serverless control plane".
- Use "Amazon OpenSearch Service APIs" instead of "managed domain control plane".
- Use "Amazon OpenSearch Service" instead of "AWS OpenSearch".
For some reason I can't post comments on lines right now, but there are three occurrences I found.
Also the license header checks are still failing. See the comment it auto-created.
Signed-off-by: Kondaka <krishkdk@amazon.com>
Signed-off-by: Kondaka <krishkdk@amazon.com>
Description
This PR adds support for creating OpenSearch indices with semantic enrichment via the Amazon OpenSearch Service
and AOSS (OpenSearch Serverless) control plane APIs. When configured, Data Prepper creates indices with
semantic enrichment-enabled field mappings before the normal index setup, allowing OpenSearch to automatically
generate vector embeddings for the specified text fields.
Problem
Currently, Data Prepper's OpenSearch sink creates indices using the standard OpenSearch REST API. However,
enabling semantic enrichment (https://docs.aws.amazon.com/opensearch-service/latest/developerguide/semantic-search.html)
on index fields requires the AWS control plane APIs (es:CreateIndex for managed domains,
OpenSearchServerless.CreateIndex for AOSS), which are not accessible through the standard OpenSearch client.
Until now, users have had to manually pre-create indices with semantic enrichment before running Data Prepper pipelines.
Solution
Introduced a new semantic_enrichment configuration block under aws settings that allows users to specify which
fields should have semantic enrichment enabled:
How it works
Issues Resolved
Resolves #[Issue number to be closed when this PR is merged]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.