Skip to content

Add array field splitting support to split_event processor#6774

Open
srikanthpadakanti wants to merge 4 commits intoopensearch-project:mainfrom
srikanthpadakanti:split-event-array-support-5707-public
Open

Add array field splitting support to split_event processor#6774
srikanthpadakanti wants to merge 4 commits intoopensearch-project:mainfrom
srikanthpadakanti:split-event-array-support-5707-public

Conversation

@srikanthpadakanti
Copy link
Copy Markdown
Contributor

@srikanthpadakanti srikanthpadakanti commented Apr 22, 2026

Description

Extends the split_event processor to support splitting array fields into individual events. When no delimiter is configured, the processor treats the field as an array and creates a separate event for each element. When a delimiter is configured, arrays are still split automatically alongside string splitting.

  # Array split mode (new) - no delimiter needed
  processor:
    - split_event:
        field: "/logs"

  Input:
  {"host": "server1", "logs": ["error in module A", "warning in module B"]}

  Output (2 events):
  {"host": "server1", "logs": "error in module A"}
  {"host": "server1", "logs": "warning in module B"}

Works with primitive arrays, object arrays, nested arrays, and mixed types. Single-element arrays are unwrapped. Empty arrays pass through unchanged. Non-array, non-string fields pass through unchanged with a debug log.

Issues Resolved

Resolves #5707

Check List

  • [ X ] New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
  • [ X ] New functionality has javadoc added
  • [ X ] Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@srikanthpadakanti
Copy link
Copy Markdown
Contributor Author

Please review this when you get a chance. @dlvenable @kkondaka

Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
@srikanthpadakanti srikanthpadakanti force-pushed the split-event-array-support-5707-public branch from 6bab3fe to aa04a75 Compare April 22, 2026 20:13
Copy link
Copy Markdown
Member

@dlvenable dlvenable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @srikanthpadakanti for the contribution!

…upport-5707-public

Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
assertEquals(originalAcknowledgementSet, acknowledgementSet);
}
final Record<Event> lastRecord = editedRecords.get(editedRecords.size() - 1);
final DefaultEventHandle lastEventHandle = (DefaultEventHandle) lastRecord.getData().getEventHandle();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this changed to verify only on the last record?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original test had a bug. The for loop iterated editedRecords but read testRecord.getData() on every iteration instead of the loop variable record.getData(). It was checking the original event against itself every time, not the split records.

The fix verifies the last record because splitIntoRecords reuses the original record (with its original event handle and acknowledgement set) for the last split value. The earlier records get new handles added to the original via addToAcknowledgementSetFromOriginEvent, but those handles do not carry their own acknowledgement set independently.

@srikanthpadakanti srikanthpadakanti requested a review from divbok May 3, 2026 04:28
continue;
}

if (value instanceof List<?>) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it better to put this inside splitter == null check below?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kkondaka The List check is intentionally before the splitter null check. This was from @dlvenable review feedback, he asked "if I specified a delimiter and received an array couldn't I split on the array automatically?"

The answer is yes. Arrays split regardless of delimiter config. If a field is a List, it splits on the array elements. If a field is a String and a delimiter exists, it splits on the delimiter. Both work together.

Moving the List check inside splitter == null would break that behavior.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the List check inside splitter == null would mean arrays only split in no delimiter mode, which contradicts the approved design

Copy link
Copy Markdown
Collaborator

@kkondaka kkondaka May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srikanthpadakanti But it is also confusing if splitter is not null and array ignores it. Isn't it better to reject such config? Value may be dynamic and not an array all the time but even in that case, I think behavior becomes very unpredictable.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm suggesting that we can always split on an array - whether the delimiter is present or not.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the discussion. To summarize the agreed behavior based on @dlvenable's feedback:

  • If the field value is a List/array, it is always split into separate events regardless of whether a delimiter is configured
  • The delimiter only applies to String values
  • This is intentional: arrays are inherently splittable without needing a delimiter, so rejecting the config would unnecessarily restrict users who want to split both string fields (by delimiter) and array fields (by element) in the same processor

The List check is placed before the splitter null check so that arrays are split even when no delimiter is configured. This keeps the behavior simple and predictable: arrays always split, strings only split when a delimiter is provided.

@kkondaka lmk your thoughts.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srikanthpadakanti Thanks for summarizing the discussion. I have couple of questions -

  1. So, for the same field value could be array some times and not array some times?
  2. If it is array delimiter is not used, if it not array, delimiter is used, is that, right? What if the value is something that cannot be split (like a Map or something)? My question is if it would be confusing to allow any type or should we allow the field value to be certain type for a certain split config?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kkondaka To your questions:

▎ 1. Yes. in real pipelines the same field can be an array in some events and a string in others (e.g., a log field that's sometimes a single value, sometimes a list). The processor handles both per-event based on the actual runtime type.
▎ 2. For unsplittable types (Map, Integer, etc.) the processor passes the event through unchanged with a debug log. No failure, no data loss, no silent drop. This matches how other Data Prepper processors handle unexpected types (e.g., grok on a non-string field just skips it).

▎ Restricting to specific types would require users to guarantee field homogeneity across all events, which isn't realistic for heterogeneous log pipelines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support foreach (looping) over input arrays to split event

4 participants