Skip to content

NIFI-15209 JoltTransformRecord should not only take schema from first record#10545

Closed
sammu97 wants to merge 2 commits intoapache:mainfrom
sammu97:NIFI-15209
Closed

NIFI-15209 JoltTransformRecord should not only take schema from first record#10545
sammu97 wants to merge 2 commits intoapache:mainfrom
sammu97:NIFI-15209

Conversation

@sammu97
Copy link

@sammu97 sammu97 commented Nov 18, 2025

This PR aims to fix an unwanted behaviour of having fields omitted after a JoltTransformRecord on a batch of records within the same FlowFile, due to multiple outputs having more than 1 schema. The current implementation of the processor retrieves the schema of the FIRST transformed record, and abides by that schema throughout the rest of the transformations. A new property is introduced for the JoltTransformRecord, where users can decide to either keep the same behaviour, or utilize the new PARTITION_BY_SCHEMA strategy, which will split the transformations into separate FlowFIles, according to the number of schemas.

Summary

NIFI-15209

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, as such NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using ./mvnw clean install -P contrib-check
    • JDK 21
    • JDK 25

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

@sammu97 sammu97 marked this pull request as draft November 18, 2025 20:09
Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this issue @sammu97. As a general note, it seems like it would be cleaner to avoid moving all of the test schema and JSON files to a new directory, in order to focus on the actual changes proposed. Can that be adjusted?

@sammu97
Copy link
Author

sammu97 commented Nov 19, 2025

Yes sure @exceptionfactory , will handle this as soon as i can.

Also, I'm seeing that some checks are failing on code checkout, is this due to the Cloudflare outage?

@exceptionfactory
Copy link
Contributor

Yes sure @exceptionfactory , will handle this as soon as i can.

Also, I'm seeing that some checks are failing on code checkout, is this due to the Cloudflare outage?

Yes, they were due to the outage, I have restarted the checks.

@sammu97
Copy link
Author

sammu97 commented Nov 19, 2025

Looks like the build failed on some of the OSs, im suspecting a file ordering issue. Will investigate and update the PR accordingly

@sammu97
Copy link
Author

sammu97 commented Nov 23, 2025

@exceptionfactory Had to make some fixes for Windows as the checks are usually omitted. However, any idea about the error for the Mac tests?

The template is not valid. .github/workflows/ci-workflow.yml (Line: 224, Col: 16): hashFiles('**/package-lock.json') failed. Fail to hash files under directory '/Users/runner/work/nifi/nifi'

@sammu97 sammu97 marked this pull request as ready for review November 23, 2025 16:01
@ChrisSamo632
Copy link
Contributor

@exceptionfactory Had to make some fixes for Windows as the checks are usually omitted. However, any idea about the error for the Mac tests?

The template is not valid. .github/workflows/ci-workflow.yml (Line: 224, Col: 16): hashFiles('**/package-lock.json') failed. Fail to hash files under directory '/Users/runner/work/nifi/nifi'

@sammu97 the node cache issue in the build appears to have been an intermittent problem over the weekend. I spotted other PRs with similar errors, but then things seem to be working again this morning. I've restarted the failed job on your PR and so far things like happier 🤞

@sammu97
Copy link
Author

sammu97 commented Nov 24, 2025

@exceptionfactory Had to make some fixes for Windows as the checks are usually omitted. However, any idea about the error for the Mac tests?
The template is not valid. .github/workflows/ci-workflow.yml (Line: 224, Col: 16): hashFiles('**/package-lock.json') failed. Fail to hash files under directory '/Users/runner/work/nifi/nifi'

@sammu97 the node cache issue in the build appears to have been an intermittent problem over the weekend. I spotted other PRs with similar errors, but then things seem to be working again this morning. I've restarted the failed job on your PR and so far things like happier 🤞

@ChrisSamo632 Yep, seems like it's already past the step that was failing. Thanks!

@sammu97
Copy link
Author

sammu97 commented Nov 24, 2025

@exceptionfactory Just a small note too. I've also amended some logic for the testNoRecords() test, as I have put out a small change that if the Jolt has no records to transform, in my opinion there should be no resulting flowfile as there is nothing to write. Not sure what you think about this, should I be leaving the old logic?

@sammu97 sammu97 marked this pull request as draft January 9, 2026 15:18
@sammu97 sammu97 marked this pull request as ready for review January 9, 2026 18:10
@sammu97
Copy link
Author

sammu97 commented Jan 9, 2026

Hey @exceptionfactory, I've resolved some conflicts due to the PR being stale, and also made some fixes relating to issues where records with the same schema were still being partitioned due field ordering. Let me know what you think about the change please

Fix JoltTransformRecord partitioning and normalization
@sammu97
Copy link
Author

sammu97 commented Mar 9, 2026

Hi @exceptionfactory , was wondering if you could take a look at this PR when possible please? Would love to have this fixed.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sammu97 the current state of the pull request includes a number of unnecessary formatting changes, which need to be reverted.

On a general review, the set of changes introduce significant complexity, raising questions about the viability of the approach.

One general question before proceeding, is the intended use of this Processor with JSON, or some other format? If it is with JSON, then it may be better to consider a new Processor that deals directly with JSON, instead of the Record API.

@sammu97
Copy link
Author

sammu97 commented Mar 11, 2026

Hi @exceptionfactory, you're right about the formatting. Will revert any unnecessary ones, apologies.

Regarding it's intended use, we are specifically making use of this with the JoltTransformRecord to transform ND Json content. I agree the solution turned out a bit complex, though I am not sure on the alternatives so if you have any ideas please go ahead. My primary concern here is that this issue is not easily noticeable and can do a lot of damage before it is noticed when making use of this processor.

@exceptionfactory
Copy link
Contributor

Thanks for the reply @sammu97.

The source data being NDJSON is an important detail and may lead to a different solution.

Right now, the JoltTransformJSON Processor works with either the entire FlowFile, or with Attributes, neither of which align with NDJSON. However, if that Processor were changed to support handling NDJSON, applying the configured transform to each line of NDJSON, it sounds like that might be a better fit, avoiding any kind of schema inference issues. I might be able to put something together if that seems like a potential solution.

@sammu97
Copy link
Author

sammu97 commented Mar 11, 2026

@exceptionfactory That solution works for me to be honest. Just one thing if we go this route though, I do still believe that this behaviour with the JoltTransformRecord should be documented somewhere, so that users are aware of it. What do you think?

@dan-s1
Copy link
Contributor

dan-s1 commented Mar 12, 2026

Correct me if I am wrong but this or a similar conversation seems to have taken place on NIFI-14309

@exceptionfactory
Copy link
Contributor

Correct me if I am wrong but this or a similar conversation seems to have taken place on NIFI-14309

@dan-s1, yes, this relates to the discussion on NIFI-14309, although that issue refers to the JSLT Processor, not the Jolt Processor. This discussion on NDJSON handling highlights a use case gap with record-oriented processing, so it seems best to introduce something more specific in this case for Jolt.

@exceptionfactory
Copy link
Contributor

@exceptionfactory That solution works for me to be honest. Just one thing if we go this route though, I do still believe that this behaviour with the JoltTransformRecord should be documented somewhere, so that users are aware of it. What do you think?

Thanks for the confirmation, and for your work in this pull request @sammu97. Yes, I agree it would be helpful to document the current behavior of JoltTransformRecord as it stands. The documentation for the Record Reader property mentions something about the record schema, so updating that property documentation in a separate pull request seems reasonable.

To address the primary goal, I submitted pull request #11001 for NIFI-15712 adding support for JSON Lines/NDJSON to the JoltTransformJSON Processor. After some minor refactoring, it was a straightforward addition, which sounds like it should fit the use case of widely varying JSON elements, avoiding the record schema challenges.

@sammu97
Copy link
Author

sammu97 commented Mar 13, 2026

@exceptionfactory That solution works for me to be honest. Just one thing if we go this route though, I do still believe that this behaviour with the JoltTransformRecord should be documented somewhere, so that users are aware of it. What do you think?

Thanks for the confirmation, and for your work in this pull request @sammu97. Yes, I agree it would be helpful to document the current behavior of JoltTransformRecord as it stands. The documentation for the Record Reader property mentions something about the record schema, so updating that property documentation in a separate pull request seems reasonable.

To address the primary goal, I submitted pull request #11001 for NIFI-15712 adding support for JSON Lines/NDJSON to the JoltTransformJSON Processor. After some minor refactoring, it was a straightforward addition, which sounds like it should fit the use case of widely varying JSON elements, avoiding the record schema challenges.

Awesome, thanks @exceptionfactory for the solution! Looking forward to having this deployed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants