NIFI-15669: Refactored ConsumeKinesis to remove dependency on KCL. Th… #10964
markap14 wants to merge 6 commits into apache:main from
Conversation
exceptionfactory
left a comment
Thanks for the extensive work on redesigning this Processor @markap14!
I plan to do a more thorough review, for now, highlighting an integration test failure that may point to some unstable expectations.
Error: Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 130.6 s <<< FAILURE! -- in org.apache.nifi.processors.aws.kinesis.ConsumeKinesisIT
Error: org.apache.nifi.processors.aws.kinesis.ConsumeKinesisIT.testKplMultipleAggregatedRecords -- Time elapsed: 4.039 s <<< FAILURE!
org.opentest4j.AssertionFailedError: 3 aggregated records x 5 sub-records each ==> expected: <15> but was: <0>
at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:158)
at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:139)
at org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:201)
at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:152)
at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:590)
at org.apache.nifi.processors.aws.kinesis.ConsumeKinesisIT.testKplMultipleAggregatedRecords(ConsumeKinesisIT.java:618)
62fb403 to a7366a0
awelless
left a comment
Reviewed source files only so far.
Stylistic comments are marked with "nit".
final TableSchema destinationSchema) {
    return switch (destinationSchema) {
        case NEW -> item;
        case LEGACY -> convertToLegacyItem(item);
What's the intention in supporting lease table conversion to the old KCL format?
Yeah good catch. After I did some refactoring in how the migration works, this is no longer actually necessary. Will remove.
}
final AttributeValue sequenceNumber = item.get("sequenceNumber");
final String leaseKey = streamName.s() + ":" + shardIdValue;
This depends on whether KCL is configured in a single- or multi-stream mode. If it's single stream, only shard id is a part of the lease key.
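To illustrate the distinction, a minimal sketch (the helper and its mode flag are hypothetical stand-ins, not actual KCL API): in multi-stream mode the lease key is qualified by the stream, while in single-stream mode it is the shard id alone.

```java
class LeaseKeySketch {
    // Hypothetical helper: the real KCL derives lease keys inside its lease
    // management classes; the mode flag here only illustrates the point above.
    static String leaseKey(final boolean multiStreamMode, final String streamName, final String shardId) {
        return multiStreamMode ? streamName + ":" + shardId : shardId;
    }

    public static void main(final String[] args) {
        System.out.println(leaseKey(true, "my-stream", "shardId-000000000001"));
        System.out.println(leaseKey(false, "my-stream", "shardId-000000000001"));
    }
}
```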
The maximum size of the buffer is controlled by the 'Max Bytes to Buffer' property.
In addition, the processor may cache some amount of data for each shard when the processor's buffer is full.""")
ConsumeKinesis buffers Kinesis Records in memory until they can be processed. \
The maximum size of the buffer is controlled by the 'Max Batch Size' property.""")
Nit: Max Batch Size determines how much data we write in a single task execution.
It doesn't configure the buffer caches, which are 500 GetRecords results for polling and {number of active shards} for EFO.
private static final long QUEUE_POLL_TIMEOUT_MILLIS = 100;
private static final Duration API_CALL_TIMEOUT = Duration.ofSeconds(30);
private static final Duration API_CALL_ATTEMPT_TIMEOUT = Duration.ofSeconds(10);
private static final byte[] NEWLINE_DELIMITER = new byte[] {'\n'};
Nit: shall we use System.lineSeparator() instead?
No. The separator should not depend on the OS of the host.
Using a larger value may increase the throughput, but will do so at the expense of using more memory.
""")
static final PropertyDescriptor MAX_RECORDS_PER_REQUEST = new PropertyDescriptor.Builder()
This property is used only when CONSUMER_TYPE is SHARED_THROUGHPUT. We shouldn't display it for EFO consumers.
private volatile String lastQueuedSequenceNumber;
private volatile String lastOnNextMaxSequence;
private final AtomicReference<String> lastAcknowledgedSequenceNumber = new AtomicReference<>();
lastQueuedSequenceNumber and lastAcknowledgedSequenceNumber are used only to calculate a sequence number to start reading data from during a subscription restart.
It seems that lastQueuedSequenceNumber can always be used, since we don't purge queues in KinesisConsumerClient.
Also lastQueuedSequenceNumber is the same as lastOnNextMaxSequence.
}
shardConsumers.clear();

deregisterEfoConsumer();
When NiFi scales down, the processor is stopped on the node being decommissioned, right?
Meaning this node will deregister the consumer, while the other active nodes are still subscribed to it.
The EFO consumer is created in initialize only, so after decommissioning a node the processors will be stuck until restarted.
If the above is correct, then we shouldn't deregister the consumer if there are other nodes using it.
Good catch! I hadn't considered that case. Will update it to only deregister in @OnRemoved so that if the processor is deleted from the canvas it will deregister it.
    }
}

if (totalQueuedResults() >= MAX_QUEUED_RESULTS) {
This is not fair. There is a risk that the consumer of a particular shard will always be sleeping.
Shall we either:
- Not track totalQueuedResults and use a batch-per-shard approach as done in the EFO consumer?
- Use a fair semaphore to handle the cache limits instead of a simple sleep?
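A minimal sketch of the fair-semaphore alternative (class and method names are illustrative, not from the PR): a fair `Semaphore` grants permits in FIFO order, so the shard consumer that has waited longest is admitted first instead of all consumers racing after a fixed sleep.

```java
import java.util.concurrent.Semaphore;

class FairQueueLimitSketch {
    // One permit per queued result; fairness = true gives FIFO hand-off,
    // so no single shard's fetch loop can be starved indefinitely.
    private final Semaphore queueSlots;

    FairQueueLimitSketch(final int maxQueuedResults) {
        this.queueSlots = new Semaphore(maxQueuedResults, true);
    }

    // Called by a shard fetch loop before enqueuing a GetRecords result;
    // blocks in arrival order when the cache is full.
    void acquireSlot() throws InterruptedException {
        queueSlots.acquire();
    }

    // Called when a result has been drained from the queue.
    void releaseSlot() {
        queueSlots.release();
    }

    int availableSlots() {
        return queueSlots.availablePermits();
    }
}
```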
return null;
} catch (final SdkClientException e) {
    if (!state.isStopped()) {
        logger.warn("GetRecords timed out for shard {}; will retry with existing iterator", shardId);
Nit: this isn't necessarily a timeout error, right?
for (final ShardFetchResult result : results) {
    final PollingShardState state = pollingShardStates.get(result.shardId());
    if (state != null) {
        state.requestReset();
Should we roll back to the latest sequence number instead? The records are still kept in the queue in KinesisConsumeClient.
This might cause out-of-order delivery - test with repro.
We should drain the queue while still holding the shard lease, i.e. here, in rollbackResults.
Since the reset doesn't happen immediately, there is a window when the lease can be acquired and subsequent records polled from the queue before the reset happens. Repro.
But we need to make sure that while draining the queue we don't fetch a new batch of records. Otherwise we'd have to drain it as well.
shardManager.writeCheckpoints(batch.checkpoints());
consumerClient.acknowledgeResults(accepted);
If writeCheckpoints fails, we don't acknowledgeResults in this callback. It seems the EFO consumer will be stuck in that situation, as we request the next records in the acknowledgement. Shall we swap these operations? Or use a try {} finally {} ladder.
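A sketch of the try/finally ladder suggested above (the interfaces are stand-ins mirroring the names in the snippet, not the actual NiFi types): acknowledging in a finally block guarantees the subscription's demand is replenished even when checkpointing throws.

```java
import java.util.List;

class AckOrderingSketch {
    // Stand-in interfaces for the components referenced in the diff.
    interface ShardManager { void writeCheckpoints(List<String> checkpoints); }
    interface ConsumerClient { void acknowledgeResults(List<String> accepted); }

    // The checkpoint failure still propagates to the caller, but the
    // acknowledgement is always delivered so the EFO subscription can
    // request its next batch of records.
    static void commit(final ShardManager shardManager, final ConsumerClient consumerClient,
                       final List<String> checkpoints, final List<String> accepted) {
        try {
            shardManager.writeCheckpoints(checkpoints);
        } finally {
            consumerClient.acknowledgeResults(accepted);
        }
    }
}
```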
…is provides much faster startup times and drastically reduces heap utilization when using Enhanced Fan-Out (EFO) mode.
Thanks for the thorough review @awelless! I did several refactorings of this PR before pushing it, and it looks like I did a pretty poor job of cleaning up a couple of the approaches that I'd taken. Should be in much better shape now! And you caught a few interesting points that I'd not considered, as well! I pushed a new commit that I think addresses everything. Added some additional tests. Pushed 30,085,000 records to a Kinesis Stream and then consumed all using both EFO and Shared Throughput mode to ensure that all data was consumed in exactly the correct order without any duplicates and to ensure that performance was as expected. All looks good!
…use we always include all sub-records within a single ProcessSession so we don't need to checkpoint partial sequences
The comment about race condition is the one requiring attention. The rest is optional.
@Override
public void migrateProperties(final PropertyConfiguration config) {
    ProxyServiceMigration.renameProxyConfigurationServiceProperty(config);
    config.renameProperty("Max Bytes to Buffer", "Max Batch Size");
Should we really move Max Bytes to Buffer to Max Batch Size? These are different properties.
The default value of 100 MB for buffer size might be too much for the batch size.
.build();

static final PropertyDescriptor PROXY_CONFIGURATION_SERVICE = ProxyConfiguration.createProxyConfigPropertyDescriptor(ProxySpec.HTTP, ProxySpec.HTTP_AUTH);
static final PropertyDescriptor ENDPOINT_OVERRIDE = new PropertyDescriptor.Builder()
I see. Then should we have separate endpoints for Kinesis and DynamoDB?
In Localstack that's the same endpoint for each service, but this might not be the case for production scenarios.
    return;
}

final Set<String> ownedShardIds = new HashSet<>();
Nit:
- final Set<String> ownedShardIds = new HashSet<>();
+ final Set<String> ownedShardIds = HashSet.newHashSet(ownedShards.size());
private void shutdownScheduler() {
if (kinesisScheduler.shutdownComplete()) {
@OnRemoved
public void onRemoved(final ProcessContext context) {
This approach to deregistering consumers looks good to me.
What should happen when the processor's CONSUMER_TYPE changes? Now we wait for the processor to be removed. Furthermore, if APPLICATION_NAME is changed, the old consumer will be orphaned.
Should we deregister the consumer immediately when CONSUMER_TYPE changes to SHARED_THROUGHPUT, or when CONSUMER_TYPE is EFO and APPLICATION_NAME changes?
It's possible, but I don't think it's worth the effort. If a user does that and wants to reclaim the slot, they can do so manually in the AWS console.
    resultsByShard.computeIfAbsent(result.shardId(), k -> new ArrayList<>()).add(result);
}
for (final List<ShardFetchResult> shardResults : resultsByShard.values()) {
    shardResults.sort(Comparator.comparing(ShardFetchResult::firstSequenceNumber));
I wonder if we really need to sort the results here.
Since we're consuming data from a queue in KinesisConsumerClient, we should have the data ordered by sequence numbers already, right? We operate on lists everywhere, so the order is preserved.
exceptionfactory
left a comment
Thanks for putting together this major refactor @markap14, the approach looks good in general. I'm planning on further review, but noted a handful of mostly minor recommendations thus far.
    Thread.sleep(TABLE_POLL_MILLIS);
} catch (final InterruptedException e) {
    Thread.currentThread().interrupt();
    throw new ProcessException("Interrupted while waiting for DynamoDB table to become ACTIVE", e);
- throw new ProcessException("Interrupted while waiting for DynamoDB table to become ACTIVE", e);
+ throw new ProcessException("Interrupted while waiting for DynamoDB table [%s] to become ACTIVE".formatted(tableName), e);
    Thread.sleep(TABLE_POLL_MILLIS);
} catch (final InterruptedException e) {
    Thread.currentThread().interrupt();
    throw new ProcessException("Interrupted while waiting for DynamoDB table deletion", e);
- throw new ProcessException("Interrupted while waiting for DynamoDB table deletion", e);
+ throw new ProcessException("Interrupted while waiting for DynamoDB table [%s] deletion".formatted(tableName), e);
if (keySchema.size() == 2
        && hasKey(keySchema, "streamName", KeyType.HASH)
        && hasKey(keySchema, "shardId", KeyType.RANGE)) {
    return TableSchema.NEW;
streamName and shardId appear to be used multiple times in multiple methods, which look like good candidates for private static final Strings.
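For instance, a minimal sketch of the suggested extraction (the class and constant names are illustrative, not from the PR):

```java
class LeaseTableAttributeNames {
    // Hypothetical constants replacing the repeated attribute-name literals
    // used in the key-schema checks and item lookups.
    static final String STREAM_NAME = "streamName";
    static final String SHARD_ID = "shardId";
}
```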
 * @param data the user payload bytes
 * @param approximateArrivalTimestamp approximate time the enclosing record arrived at Kinesis
 */
record DeaggregatedRecord(
Based on the description, what do you think about naming this UserRecord, DistinctRecord or DataRecord? The User prefix sounds a bit related to identity, but aligns with the description. Distinct or similar may be an option.
 * per shard via HTTP/2. Uses Reactive Streams demand-driven backpressure to control the
 * rate of event delivery.
 */
final class EfoKinesisClient extends KinesisConsumerClient {
What do you think about spelling out EnhancedFanOutKinesisClient for clarity, since Efo is not a common acronym?
    }
} else if (!existing.isExhausted() && !existing.isStopped() && !existing.isLoopRunning()
        && existing.tryStartLoop()) {
    logger.warn("Restarting dead fetch loop for shard {}", shardId);
It would be helpful to include the streamName.
    });
} catch (final RejectedExecutionException e) {
    state.markLoopStopped();
    logger.debug("Executor shut down; cannot start fetch loop for shard {}", shardId);
- logger.debug("Executor shut down; cannot start fetch loop for shard {}", shardId);
+ logger.debug("Executor shut down; cannot start fetch loop for stream [{}] shard [{}]", streamName, shardId);
    }
} catch (final Exception e) {
    if (!state.isStopped()) {
        logger.error("Unexpected error in fetch loop for shard {}; will retry", shardId, e);
Should this be logged as a warning if it is going to be retried?
"streamName", AttributeValue.builder().s("my-stream").build(),
"shardId", AttributeValue.builder().s("shardId-0001").build(),
"sequenceNumber", AttributeValue.builder().s("12345").build());
Recommend declaring static variables for the map keys and values that can be reused across methods.
<dependency>
    <groupId>software.amazon.awssdk</groupId>
    <artifactId>apache-client</artifactId>
This library brings in Apache HTTP Client 4, which has limited updates. The url-connection-client does not have all the flexibility, but what do you think about using it instead?
It looks like the url-connection-client does not support proxies directly. And it doesn't support connection pooling with a max, which we're depending on here. Fortunately, though, we can upgrade to Apache HTTP Client 5, which I think makes a lot of sense.
Thanks @exceptionfactory I think all of your feedback makes sense. I pushed a new commit that incorporates all of it and switches to Apache HTTP Client 5 instead of version 4.
<dependency>
    <groupId>software.amazon.awssdk</groupId>
-   <artifactId>apache-client</artifactId>
+   <artifactId>apache5-client</artifactId>
Currently nifi-aws-service-api-nar doesn't bring apache5-client as a dependency. We should either add it to that nar or remove it from this list.
Before adding it into nifi-aws-service-api I was getting ClassNotFoundException for Apache 5 http client.
D'oh! I added it to the api nar but it looks like i didn't include that in the commit 🤦 Will have that up shortly.
…is provides much faster startup times and drastically reduces heap utilization when using Enhanced Fan Out (EFO) mode.
Summary

NIFI-00000

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

NIFI-00000
NIFI-00000
Verified status

Pull Request Formatting

main branch

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

./mvnw clean install -P contrib-check

Licensing

LICENSE and NOTICE files

Documentation