NIFI-15209 JoltTransformRecord should not only take schema from first record#10545
NIFI-15209 JoltTransformRecord should not only take schema from first record#10545sammu97 wants to merge 2 commits intoapache:mainfrom
Conversation
exceptionfactory
left a comment
There was a problem hiding this comment.
Thanks for working on this issue @sammu97. As a general note, it seems like it would be cleaner to avoid moving all of the test schema and JSON files to a new directory, in order to focus on the actual changes proposed. Can that be adjusted?
|
Yes sure @exceptionfactory , will handle this as soon as i can. Also, I'm seeing that some checks are failing on code checkout, is this due to the Cloudflare outage? |
Yes, they were due to the outage, I have restarted the checks. |
|
Looks like the build failed on some of the OSs, im suspecting a file ordering issue. Will investigate and update the PR accordingly |
|
@exceptionfactory Had to make some fixes for Windows as the checks are usually omitted. However, any idea about the error for the Mac tests?
|
@sammu97 the node cache issue in the build appears to have been an intermittent problem over the weekend. I spotted other PRs with similar errors, but then things seem to be working again this morning. I've restarted the failed job on your PR and so far things like happier 🤞 |
@ChrisSamo632 Yep, seems like it's already past the step that was failing. Thanks! |
|
@exceptionfactory Just a small note too. I've also amended some logic for the testNoRecords() test, as I have put out a small change that if the Jolt has no records to transform, in my opinion there should be no resulting flowfile as there is nothing to write. Not sure what you think about this, should I be leaving the old logic? |
|
Hey @exceptionfactory, I've resolved some conflicts due to the PR being stale, and also made some fixes relating to issues where records with the same schema were still being partitioned due field ordering. Let me know what you think about the change please |
Fix JoltTransformRecord partitioning and normalization
|
Hi @exceptionfactory , was wondering if you could take a look at this PR when possible please? Would love to have this fixed. |
exceptionfactory
left a comment
There was a problem hiding this comment.
@sammu97 the current state of the pull request includes a number of unnecessary formatting changes, which need to be reverted.
On a general review, the set of changes introduce significant complexity, raising questions about the viability of the approach.
One general question before proceeding, is the intended use of this Processor with JSON, or some other format? If it is with JSON, then it may be better to consider a new Processor that deals directly with JSON, instead of the Record API.
|
Hi @exceptionfactory, you're right about the formatting. Will revert any unnecessary ones, apologies. Regarding it's intended use, we are specifically making use of this with the JoltTransformRecord to transform ND Json content. I agree the solution turned out a bit complex, though I am not sure on the alternatives so if you have any ideas please go ahead. My primary concern here is that this issue is not easily noticeable and can do a lot of damage before it is noticed when making use of this processor. |
|
Thanks for the reply @sammu97. The source data being NDJSON is an important detail and may lead to a different solution. Right now, the JoltTransformJSON Processor works with either the entire FlowFile, or with Attributes, neither of which align with NDJSON. However, if that Processor were changed to support handling NDJSON, applying the configured transform to each line of NDJSON, it sounds like that might be a better fit, avoiding any kind of schema inference issues. I might be able to put something together if that seems like a potential solution. |
|
@exceptionfactory That solution works for me to be honest. Just one thing if we go this route though, I do still believe that this behaviour with the JoltTransformRecord should be documented somewhere, so that users are aware of it. What do you think? |
|
Correct me if I am wrong but this or a similar conversation seems to have taken place on NIFI-14309 |
@dan-s1, yes, this relates to the discussion on NIFI-14309, although that issue refers to the JSLT Processor, not the Jolt Processor. This discussion on NDJSON handling highlights a use case gap with record-oriented processing, so it seems best to introduce something more specific in this case for Jolt. |
Thanks for the confirmation, and for your work in this pull request @sammu97. Yes, I agree it would be helpful to document the current behavior of To address the primary goal, I submitted pull request #11001 for NIFI-15712 adding support for JSON Lines/NDJSON to the |
Awesome, thanks @exceptionfactory for the solution! Looking forward to having this deployed |
This PR aims to fix an unwanted behaviour of having fields omitted after a JoltTransformRecord on a batch of records within the same FlowFile, due to multiple outputs having more than 1 schema. The current implementation of the processor retrieves the schema of the FIRST transformed record, and abides by that schema throughout the rest of the transformations. A new property is introduced for the JoltTransformRecord, where users can decide to either keep the same behaviour, or utilize the new PARTITION_BY_SCHEMA strategy, which will split the transformations into separate FlowFIles, according to the number of schemas.
Summary
NIFI-15209
Tracking
Please complete the following tracking steps prior to pull request creation.
Issue Tracking
Pull Request Tracking
NIFI-00000NIFI-00000Pull Request Formatting
mainbranchVerification
Please indicate the verification steps performed prior to pull request creation.
Build
./mvnw clean install -P contrib-checkLicensing
LICENSEandNOTICEfilesDocumentation