chore: removal of nullable fields #96
Conversation
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in its review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Rewires DI to centralize external clients and lifecycle, removes LifecycleEnabled, converts many lifecycle services to DI-managed logic classes, adds an SSE registry/router and K8s worker logic, migrates workers to FastStream entrypoints, changes schema/Avro registry handling, and updates tests/fakes to the new DI surface.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant FastStream as FastStream App
    participant DI as Dishka DI Container
    participant Schema as SchemaRegistry
    participant Logic as K8sWorkerLogic
    participant Producer as Kafka Producer
    participant K8s as Kubernetes API
    FastStream->>DI: build container (BoundaryClientProvider, EventProvider, K8sWorkerProvider...)
    FastStream->>Schema: initialize schemas (via DI)
    FastStream->>Logic: resolve K8sWorkerLogic
    Note right of Logic: handlers registered for CreatePod / DeletePod events
    FastStream->>Producer: subscribe Kafka topic -> receive DomainEvent
    Producer->>Schema: decode Avro -> DomainEvent
    Schema->>FastStream: emit DomainEvent to handler
    FastStream->>Logic: call handle_create_pod_command(event)
    Logic->>K8s: create ConfigMap / Pod (k8s_v1 / k8s_apps_v1)
    K8s-->>Logic: response/status
    Logic->>Producer: publish PodCreated / ExecutionFailed events
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: 1 passed, 2 failed (2 warnings)
No issues found across 2 files
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/services/notification_service.py (1)
136-156: Add `_subscribe_to_events()` to the background tasks so execution-event notifications fire.

`_subscribe_to_events()` is never invoked, so the Kafka consumer for execution-event notifications (EXECUTION_COMPLETED, EXECUTION_FAILED, EXECUTION_TIMEOUT) never starts. Add it to the background task queue:

🔧 Proposed fix

```diff
 def _start_background_tasks(self) -> None:
     """Start background processing tasks."""
     tasks = [
+        asyncio.create_task(self._subscribe_to_events()),
         asyncio.create_task(self._process_pending_notifications()),
         asyncio.create_task(self._cleanup_old_notifications()),
     ]
```
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/services/notification_service.py (1)
118-155: `_subscribe_to_events()` is never invoked—execution-event notifications won't be consumed.

The async method `_subscribe_to_events()` is defined (line 207) but never called anywhere in the codebase. This means the service will not set up Kafka consumer subscriptions for execution events, breaking event-driven notifications entirely.

Additionally, `_start_background_tasks()` calls `asyncio.create_task()` in `__init__` without explicitly verifying a running loop. While lazy DI instantiation during async route handling may avoid an immediate `RuntimeError`, this pattern is fragile and doesn't fail fast.

Move event subscription setup into an async initialization point (e.g., a `start()` method called from the lifespan handler, or schedule `_subscribe_to_events()` as a background task once the loop is guaranteed to be running).
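A minimal sketch of that shape, assuming hypothetical `start()`/`stop()` methods (illustrative names, not the project's actual API):

```python
import asyncio


class NotificationService:
    def __init__(self) -> None:
        # No asyncio.create_task() here: __init__ may run before a loop exists.
        self._tasks: list[asyncio.Task] = []

    async def start(self) -> None:
        # Called from the lifespan handler, so a running loop is guaranteed.
        self._tasks = [
            asyncio.create_task(self._subscribe_to_events()),
            asyncio.create_task(self._process_pending_notifications()),
            asyncio.create_task(self._cleanup_old_notifications()),
        ]

    async def stop(self) -> None:
        for task in self._tasks:
            task.cancel()
        await asyncio.gather(*self._tasks, return_exceptions=True)

    async def _subscribe_to_events(self) -> None: ...
    async def _process_pending_notifications(self) -> None: ...
    async def _cleanup_old_notifications(self) -> None: ...
```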
sse shutdown manager: moved to DI
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/core/providers.py (1)
515-531: Remove task creation from `NotificationService.__init__` or make the provider async-aware.

`NotificationService.__init__` calls `asyncio.create_task()` on lines 199-200 to start background tasks. Since `get_notification_service` is a synchronous provider, this will raise `RuntimeError: no running event loop` if resolved in any context before an event loop is active. Either defer task startup to an explicit async method (e.g., `async def start()`) called after service resolution in an async context, or convert the provider to resolve the service in an async scope.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
backend/app/services/k8s_worker/worker.py (1)
169-175: Clear `idempotent_consumer` on shutdown to avoid stale "running" status.

`get_status` now reports `"running"` based on `idempotent_consumer is not None`, but `__aexit__` doesn't null it after stopping. This can report running even after shutdown.

🐛 Proposed fix

```diff
 if self.idempotent_consumer:
     await self.idempotent_consumer.stop()
+    self.idempotent_consumer = None
```

Also applies to: 430-434
backend/app/services/sse/kafka_redis_bridge.py (1)
16-60: Make `__aenter__` idempotent and exception-safe.

Tests call `__aenter__` twice; the current code overwrites `self.consumers` without stopping the already-started set, which can leak running consumers. Also, a partial start failure leaves running consumers behind. Guard re-entry and clean up on failure.

🔧 Suggested fix

```diff
 async def __aenter__(self) -> "SSEKafkaRedisBridge":
     """Start the SSE Kafka→Redis bridge."""
     self.logger.info(f"Starting SSE Kafka→Redis bridge with {self.num_consumers} consumers")
+    if self.consumers:
+        self.logger.debug("SSE Kafka→Redis bridge already started")
+        return self
+
     # Phase 1: Build all consumers and track them immediately (no I/O)
     self.consumers = [self._build_consumer(i) for i in range(self.num_consumers)]

     # Phase 2: Start all in parallel - already tracked in self.consumers for cleanup
     topics = list(CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.WEBSOCKET_GATEWAY])
-    await asyncio.gather(*[c.start(topics) for c in self.consumers])
+    try:
+        await asyncio.gather(*[c.start(topics) for c in self.consumers])
+    except Exception:
+        await asyncio.gather(*[c.stop() for c in self.consumers], return_exceptions=True)
+        self.consumers.clear()
+        raise
```
🤖 Fix all issues with AI agents
In `@backend/app/services/coordinator/coordinator.py`:
- Around line 112-117: The log message in ExecutionCoordinator.__aenter__ is
misleading because queue_manager is constructed in __init__; update __aenter__
to not claim initialization: remove or replace the "Queue manager initialized"
logger call in __aenter__ (referencing ExecutionCoordinator.__aenter__ and the
queue_manager created in ExecutionCoordinator.__init__) and, if you need to
indicate readiness here, change it to a truthful message such as
"ExecutionCoordinator service started" or "Queue manager ready" so the log
reflects actual work performed in __aenter__.
In `@backend/app/services/pod_monitor/monitor.py`:
- Around line 289-297: The current _backoff() raises RuntimeError when reconnect
attempts exceed config.max_reconnect_attempts which propagates out of
_watch_loop() and kills the _watch_task; instead, modify _backoff() to not raise
but mark the monitor as paused (e.g., set self._paused = True and log the
failure with details) and/or schedule a retry via a new helper (e.g.,
_ensure_watch_running) that runs in background to attempt restart, and update
_watch_loop() to check self._paused and exit gracefully without killing the
background manager; ensure references to _backoff, _watch_loop, _watch_task, and
config.max_reconnect_attempts are used so the new paused flag or background
restart task can detect and restart the watcher.
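A rough sketch of the paused-flag variant this prompt describes, assuming hypothetical `_paused` and `_watch_once()` members (not the repository's actual code):

```python
import asyncio


class PodMonitor:
    def __init__(self, max_reconnect_attempts: int = 5) -> None:
        self.max_reconnect_attempts = max_reconnect_attempts
        self._reconnect_attempts = 0
        self._paused = False

    async def _backoff(self) -> None:
        # Pause instead of raising, so the background watch task ends gracefully.
        self._reconnect_attempts += 1
        if self._reconnect_attempts > self.max_reconnect_attempts:
            self._paused = True
            return
        await asyncio.sleep(min(2 ** self._reconnect_attempts, 60))

    async def _watch_loop(self) -> None:
        while not self._paused:
            try:
                await self._watch_once()  # hypothetical single watch iteration
            except Exception:
                await self._backoff()

    async def _watch_once(self) -> None: ...
```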
🧹 Nitpick comments (6)
backend/app/services/coordinator/coordinator.py (1)
196-197: Redundant `hasattr` check.

`idempotency_manager` is always assigned in `__init__` (line 97), so `hasattr` is unnecessary. The None check alone is sufficient.

Suggested simplification

```diff
-    if hasattr(self, "idempotency_manager") and self.idempotency_manager:
+    if self.idempotency_manager:
         await self.idempotency_manager.close()
```

backend/workers/run_k8s_worker.py (1)
42-46: Consider exiting if the worker stops unexpectedly.

The loop now waits only on `shutdown_event`. If the worker/consumer dies, the process may keep running and just log status. If you want automatic recovery or a clean exit, consider breaking/setting the shutdown event when status indicates not running.

♻️ Possible adjustment

```diff
 while not shutdown_event.is_set():
     await asyncio.sleep(60)
     status = await worker.get_status()
     logger.info(f"Kubernetes worker status: {status}")
+    if not status.get("running", True):
+        logger.error("KubernetesWorker not running; initiating shutdown.")
+        shutdown_event.set()
```

backend/app/services/pod_monitor/monitor.py (1)
122-138: Consider logging exceptions from the watch task during shutdown.

If `_watch_task` terminates due to an exception other than `CancelledError` (e.g., `RuntimeError` from `_backoff`), it will be silently ignored during `__aexit__`. Consider retrieving and logging any exception.

🔧 Suggested improvement

```diff
 if self._watch_task:
     self._watch_task.cancel()
     try:
         await self._watch_task
     except asyncio.CancelledError:
         pass
+    except Exception as e:
+        self.logger.warning(f"Watch task terminated with error: {e}")
```

backend/tests/unit/services/saga/test_saga_orchestrator_unit.py (1)
118-125: Assertion may not validate event handling outcome.

The assertion `len(orch._sagas) > 0` verifies that `register_saga(_Saga)` populated `_sagas`, not that `_handle_event` processed the event. Based on the relevant code snippet showing `_handle_event` iterates over `self._sagas`, this assertion passes due to the `register_saga` call on line 121, not because of the event handling on line 123.

Consider asserting on observable side effects of event handling, such as checking the fake repository for saved sagas.

🔧 Suggested improvement

```diff
 @pytest.mark.asyncio
 async def test_min_success_flow(event_metrics: EventMetrics) -> None:
-    orch = _orch(event_metrics)
-    orch.register_saga(_Saga)
+    fake_repo = _FakeRepo()
+    orch = SagaOrchestrator(
+        config=SagaConfig(name="t", enable_compensation=True, store_events=True, publish_commands=False),
+        saga_repository=fake_repo,
+        producer=_FakeProd(),
+        schema_registry_manager=MagicMock(spec=SchemaRegistryManager),
+        settings=MagicMock(spec=Settings),
+        event_store=_FakeStore(),
+        idempotency_manager=_FakeIdem(),
+        resource_allocation_repository=_FakeAlloc(),
+        logger=_test_logger,
+        event_metrics=event_metrics,
+    )
+    orch.register_saga(_Saga)

     # Handle the event
     await orch._handle_event(make_execution_requested_event(execution_id="e"))

-    # basic sanity; deep behavior covered by integration
-    assert len(orch._sagas) > 0
+    # Verify saga was created and saved
+    assert len(fake_repo.saved) > 0
```

backend/workers/run_saga_orchestrator.py (1)
31-43: Consider liveness handling after dropping the `is_running` gate.

With only `shutdown_event`, the worker will keep sleeping even if the orchestrator stops unexpectedly; please confirm that's intended or wire a failure hook to trigger shutdown.

backend/app/services/saga/saga_orchestrator.py (1)
361-384: Consider re-raising `CancelledError` for cleaner task semantics.

While suppressing `CancelledError` works here since `asyncio.gather(..., return_exceptions=True)` is used in `__aexit__`, re-raising is the recommended pattern for asyncio tasks. This ensures the task is properly marked as cancelled rather than completing normally.

♻️ Optional: Re-raise CancelledError

```diff
 except asyncio.CancelledError:
     self.logger.info("Timeout checker cancelled")
+    raise
```
8 issues found across 32 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/workers/run_pod_monitor.py">
<violation number="1" location="backend/workers/run_pod_monitor.py:45">
P2: The loop now ignores the PodMonitor state, so if the monitor stops internally the worker will keep polling forever instead of exiting. Consider reintroducing a state check (or equivalent stop condition) to avoid hanging on a stopped monitor.</violation>
</file>
<file name="backend/app/services/saga/saga_orchestrator.py">
<violation number="1" location="backend/app/services/saga/saga_orchestrator.py:364">
P2: The timeout checker now stops permanently on any unexpected exception because the loop no longer catches and logs `Exception`. A transient repository error will kill the task and timeout checks will never resume. Wrap the body of the loop in a try/except to log and continue.</violation>
</file>
<file name="backend/app/services/k8s_worker/worker.py">
<violation number="1" location="backend/app/services/k8s_worker/worker.py:433">
P2: `running` is now derived from `idempotent_consumer` existence, but the shutdown path stops the consumer without clearing the reference. After `__aexit__`, `get_status()` will still return running=True even though the worker has stopped. Track an explicit running flag or clear `self.idempotent_consumer` on shutdown so status reflects the actual lifecycle state.</violation>
</file>
<file name="backend/tests/unit/services/saga/test_saga_orchestrator_unit.py">
<violation number="1" location="backend/tests/unit/services/saga/test_saga_orchestrator_unit.py:125">
P3: This assertion no longer validates that `_handle_event` did anything; `_sagas` is populated by `register_saga` before the handler runs, so the test will pass even if event handling fails. Assert on a side effect of `_handle_event` (e.g., running instances or saved saga) instead.</violation>
</file>
<file name="backend/workers/run_k8s_worker.py">
<violation number="1" location="backend/workers/run_k8s_worker.py:43">
P2: The run loop no longer exits when the KubernetesWorker stops. If the worker shuts down unexpectedly, this loop keeps running and prevents graceful cleanup until a signal arrives. Consider restoring the `worker.is_running` guard (or breaking when status indicates stopped) so the process terminates cleanly when the worker stops.</violation>
</file>
<file name="backend/app/events/event_store_consumer.py">
<violation number="1" location="backend/app/events/event_store_consumer.py:131">
P2: Unhandled exceptions in the batch processor will terminate the background task and stop future flushes; add error handling inside the loop to log and continue.</violation>
</file>
<file name="backend/workers/run_coordinator.py">
<violation number="1" location="backend/workers/run_coordinator.py:43">
P2: The loop no longer exits when the coordinator stops, so the worker can keep polling indefinitely after a coordinator failure instead of shutting down and cleaning up.</violation>
</file>
<file name="backend/app/services/pod_monitor/monitor.py">
<violation number="1" location="backend/app/services/pod_monitor/monitor.py:297">
P2: Raising RuntimeError in the watch backoff path will terminate the background watch task without a clean shutdown; since the task is created with `create_task` and `__aexit__` only suppresses `CancelledError`, this exception will bubble up later on shutdown or leave the monitor silently stopped. Consider exiting the watch loop via cancellation instead of throwing a RuntimeError.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@backend/tests/integration/conftest.py`:
- Around line 74-83: The SSERedisBus constructor is being called with arguments
in the wrong order in the sse_redis_bus fixture: SSERedisBus expects
(redis_client, logger, exec_prefix, notif_prefix) but the fixture passes
exec_prefix and notif_prefix positionally and logger last; fix by passing logger
as the second argument (either positionally after redis_client or as
logger=<logger>) and ensure exec_prefix and notif_prefix are provided afterwards
(or as keywords) so the call `SSERedisBus(redis_client, logger, exec_prefix=..., notif_prefix=...)` matches the constructor signature.
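A sketch of the corrected fixture under that reading; the prefix values here are placeholders, not the project's real ones:

```python
import pytest


@pytest.fixture
async def sse_redis_bus(redis_client, _test_logger):
    # logger is the second positional parameter; prefixes follow as keywords.
    return SSERedisBus(
        redis_client,
        _test_logger,
        exec_prefix="sse:exec:",
        notif_prefix="sse:notif:",
    )
```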
🧹 Nitpick comments (1)
backend/tests/integration/idempotency/test_consumer_idempotent.py (1)
9-10: Redundant import of `EventDispatcher`.

`EventDispatcher` is imported twice—once from `app.events.core` (line 9) and again aliased as `Disp` from `app.events.core.dispatcher` (line 10). Since `app.events.core` likely re-exports from the dispatcher module, these are the same class.

Suggested fix

```diff
-from app.events.core import ConsumerConfig, EventDispatcher, UnifiedConsumer, UnifiedProducer
-from app.events.core.dispatcher import EventDispatcher as Disp
+from app.events.core import ConsumerConfig, EventDispatcher, UnifiedConsumer, UnifiedProducer
```

Then update line 47 to use `EventDispatcher` directly:

```diff
-    disp: Disp = EventDispatcher(logger=_test_logger)
+    disp = EventDispatcher(logger=_test_logger)
```
13 issues found across 47 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/workers/run_k8s_worker.py">
<violation number="1" location="backend/workers/run_k8s_worker.py:53">
P1: Exceptions from completed tasks in `done` set are silently swallowed. If `consumer.run()` or `logic.ensure_daemonset_task()` raises an exception, the worker will silently proceed to shutdown without logging or propagating the error. Add exception handling for tasks in `done`.</violation>
</file>
<file name="backend/workers/run_result_processor.py">
<violation number="1" location="backend/workers/run_result_processor.py:42">
P2: Exceptions from completed tasks in `done` set are never retrieved. If `consumer.run()` crashes with an exception before the shutdown signal, the error is silently ignored. Check for exceptions in completed tasks to properly log failures.</violation>
</file>
<file name="backend/app/core/dishka_lifespan.py">
<violation number="1" location="backend/app/core/dishka_lifespan.py:116">
P3: Inconsistent logging style: use structured logging with `extra` parameter instead of f-string interpolation to match the rest of this file's logging pattern. This improves log aggregation and follows the established convention.</violation>
</file>
<file name="backend/app/services/coordinator/coordinator_logic.py">
<violation number="1" location="backend/app/services/coordinator/coordinator_logic.py:280">
P1: `event.priority` is an `int`, so calling `.name` on it will raise `AttributeError`. The priority value must be converted to `QueuePriority` enum first.</violation>
</file>
<file name="backend/app/services/result_processor/processor_logic.py">
<violation number="1" location="backend/app/services/result_processor/processor_logic.py:33">
P3: Misleading docstring: The class is no longer stateful after this refactoring. The removed `_state`, `_consumer`, and `_dispatcher` attributes were what made it stateful. Consider updating the docstring to reflect that this class is now a stateless service that processes events.</violation>
</file>
<file name="backend/app/services/sse/sse_connection_registry.py">
<violation number="1" location="backend/app/services/sse/sse_connection_registry.py:49">
P1: Metrics are decremented even when the connection wasn't actually registered, which can lead to incorrect (potentially negative) connection counts. The code should only decrement metrics when a connection was actually removed from the registry.</violation>
</file>
<file name="backend/app/events/core/consumer.py">
<violation number="1" location="backend/app/events/core/consumer.py:109">
P1: `event` may be undefined when the error callback is invoked. If `self._schema_registry.deserialize_event()` throws an exception, `event` was never assigned, causing a `NameError`. Initialize `event` before the try block or guard the callback invocation.</violation>
</file>
<file name="backend/workers/run_coordinator.py">
<violation number="1" location="backend/workers/run_coordinator.py:53">
P1: Exceptions from completed tasks are silently lost. If `run_task` finishes with an exception (e.g., consumer or scheduling loop crashes), the error is never retrieved or logged. You should check tasks in `done` for exceptions to avoid silent failures.</violation>
</file>
<file name="backend/app/services/sse/event_router.py">
<violation number="1" location="backend/app/services/sse/event_router.py:44">
P2: Inefficient use of `model_dump()` just to extract a single field. Since all SSE-relevant events have `execution_id` as a direct attribute, use `getattr(event, 'execution_id', None)` to avoid serializing the entire model.</violation>
</file>
<file name="backend/workers/run_pod_monitor.py">
<violation number="1" location="backend/workers/run_pod_monitor.py:45">
P1: Exceptions from completed tasks are silently swallowed. When using `asyncio.wait()`, you must check completed tasks for exceptions, otherwise failures in `monitor.run()` will go unnoticed. Add exception handling for tasks in `done`.</violation>
</file>
<file name="backend/workers/run_saga_orchestrator.py">
<violation number="1" location="backend/workers/run_saga_orchestrator.py:65">
P1: Exceptions from completed tasks are silently swallowed. If `run_task` fails with an exception, it ends up in `done` but the exception is never checked or re-raised. Add exception handling for completed tasks to avoid silent failures.</violation>
</file>
<file name="backend/tests/integration/result_processor/test_result_processor.py">
<violation number="1" location="backend/tests/integration/result_processor/test_result_processor.py:130">
P2: Unused parameter `consumer_config` after refactoring. The function now creates its own `processor_consumer_config` and `stored_consumer_config` locally instead of using the injected fixture. Remove the unused parameter from the function signature.</violation>
</file>
<file name="backend/app/services/pod_monitor/monitor.py">
<violation number="1" location="backend/app/services/pod_monitor/monitor.py:123">
P2: `asyncio.CancelledError` is caught but not re-raised, breaking Python's async cancellation contract. The task will appear to complete normally instead of being cancelled, which can cause issues with task lifecycle management (e.g., `task.cancelled()` returns False, callers won't see the exception).</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
```python
async def route_event(self, event: DomainEvent) -> None:
    """Route an event to Redis for SSE delivery."""
    data = event.model_dump()
```
P2: Inefficient use of model_dump() just to extract a single field. Since all SSE-relevant events have execution_id as a direct attribute, use getattr(event, 'execution_id', None) to avoid serializing the entire model.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/app/services/sse/event_router.py, line 44:
<comment>Inefficient use of `model_dump()` just to extract a single field. Since all SSE-relevant events have `execution_id` as a direct attribute, use `getattr(event, 'execution_id', None)` to avoid serializing the entire model.</comment>
<file context>
@@ -0,0 +1,63 @@
+
+ async def route_event(self, event: DomainEvent) -> None:
+ """Route an event to Redis for SSE delivery."""
+ data = event.model_dump()
+ execution_id = data.get("execution_id")
+
</file context>
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
backend/app/services/coordinator/coordinator_logic.py (1)
231-281: Cast priority to `QueuePriority` before using `.name`.

If `event.priority` is an int (as it appears elsewhere), `priority.name` will raise and the schedule path will fail.

🐛 Proposed fix: normalize to QueuePriority

```diff
-    priority = getattr(event, "priority", QueuePriority.NORMAL)
-    self.metrics.record_coordinator_queue_time(queue_time, priority.name)
+    priority = QueuePriority(getattr(event, "priority", QueuePriority.NORMAL))
+    self.metrics.record_coordinator_queue_time(queue_time, priority.name)
```

backend/app/services/result_processor/processor_logic.py (1)
88-93: Guard memory limit parsing to avoid handler failure.

Line 90 uses `rstrip("Mi")`, which will mis-handle `Gi` or other formats and can raise `ValueError`, aborting the handler. Consider a minimal parse guard to keep result processing resilient.

🛠️ Suggested fix

```diff
-    settings_limit = self._settings.K8S_POD_MEMORY_LIMIT
-    memory_limit_mib = int(settings_limit.rstrip("Mi"))  # TODO: Less brittle acquisition of limit
-    memory_percent = (memory_mib / memory_limit_mib) * 100
-    self._metrics.memory_utilization_percent.record(
-        memory_percent, attributes={"lang_and_version": lang_and_version}
-    )
+    settings_limit = self._settings.K8S_POD_MEMORY_LIMIT
+    try:
+        if settings_limit.endswith("Gi"):
+            memory_limit_mib = int(settings_limit[:-2]) * 1024
+        elif settings_limit.endswith("Mi"):
+            memory_limit_mib = int(settings_limit[:-2])
+        else:
+            raise ValueError(f"Unsupported unit: {settings_limit}")
+        memory_percent = (memory_mib / memory_limit_mib) * 100
+        self._metrics.memory_utilization_percent.record(
+            memory_percent, attributes={"lang_and_version": lang_and_version}
+        )
+    except ValueError:
+        self.logger.warning(
+            "Skipping memory utilization metric due to unparseable limit",
+            extra={"limit": settings_limit},
+        )
```
🤖 Fix all issues with AI agents
In `@backend/app/core/dishka_lifespan.py`:
- Around line 102-127: Background tasks created in the tasks list
(run_sse_consumers, event_store_consumer.run, notification_service.run) are
fire-and-forget and only awaited at shutdown, so runtime exceptions can be lost;
update the creation of each asyncio.create_task to attach an add_done_callback
that checks task.exception(), logs the error via logger, and on first non-None
exception cancels all other tasks in the tasks collection (safely skipping
already done/cancelled tasks) to implement immediate supervision and fail-fast
behavior, ensuring you still await asyncio.gather(*tasks,
return_exceptions=True) in the finally block to finish cleanup.
In `@backend/app/core/providers.py`:
- Around line 463-472: The provider get_sse_connection_registry currently
creates an SSEConnectionRegistry per request (provide(scope=Scope.REQUEST)),
which resets _active_connections and causes get_connection_count() to always
return 0; change the provider to use application scope
(provide(scope=Scope.APP)) so SSEConnectionRegistry is a singleton for the app,
preserving the asyncio.Lock() and shared _active_connections across requests and
allowing registrations from SSE streams to be visible to the health endpoint.
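Under that reading the fix is a one-line scope change; a sketch mirroring the provider quoted later in this thread (`SSEProvider` is an assumed class name, and imports for the project types `ConnectionMetrics`/`SSEConnectionRegistry` are omitted):

```python
import logging

from dishka import Provider, Scope, provide


class SSEProvider(Provider):
    @provide(scope=Scope.APP)  # was Scope.REQUEST: one fresh registry per request
    def get_sse_connection_registry(
        self,
        logger: logging.Logger,
        connection_metrics: ConnectionMetrics,
    ) -> SSEConnectionRegistry:
        return SSEConnectionRegistry(
            logger=logger,
            connection_metrics=connection_metrics,
        )
```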
In `@backend/app/events/core/consumer.py`:
- Around line 104-110: The variable event can be unbound if deserialize_event()
raises, causing a NameError when awaiting self._error_callback(e, event); fix by
initializing event = None before the try/except where deserialize_event() is
called (so the callback always receives a defined value), and keep the existing
error logging and metrics calls (e.g., the except block that calls
self.logger.error, self._event_metrics.record_kafka_consumption_error and awaits
self._error_callback).
In `@backend/app/services/idempotency/middleware.py`:
- Around line 123-144: register_idempotent_handler wraps a handler in
IdempotentEventHandler then passes it to EventDispatcher.register which expects
handler.__name__; set the wrapper's __name__ (or use
functools.update_wrapper/functools.wraps) so the wrapped instance exposes the
original handler's __name__ to avoid AttributeError. In practice, after creating
wrapped in register_idempotent_handler, assign wrapped.__name__ =
getattr(handler, "__name__", type(handler).__name__) (or call
update_wrapper(wrapped, handler)) before calling
self._dispatcher.register(event_type)(wrapped).
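A minimal sketch of the `update_wrapper` route, as a generic pattern rather than the repository's exact class:

```python
from functools import update_wrapper


class IdempotentEventHandler:
    def __init__(self, handler):
        self.handler = handler
        # Copy the wrapped handler's __name__/__doc__ onto this instance so
        # EventDispatcher.register, which reads handler.__name__, keeps working.
        update_wrapper(self, handler, assigned=("__name__", "__doc__"), updated=())

    async def __call__(self, event):
        return await self.handler(event)
```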
In `@backend/app/services/sse/sse_connection_registry.py`:
- Around line 30-49: The register_connection and unregister_connection currently
call metrics.increment_sse_connections / decrement_sse_connections
unconditionally, causing drift on duplicate adds/removes; modify
register_connection to only call metrics.increment_sse_connections("executions")
(and log) if connection_id was not already present in
self._active_connections[execution_id], and modify unregister_connection to only
call metrics.decrement_sse_connections("executions") (and log) if connection_id
was actually removed (i.e., it existed before discard and the set changed or
length decreased); keep the async with self._lock around these checks and
mutations and use the same symbols (_active_connections, register_connection,
unregister_connection, metrics.increment_sse_connections,
metrics.decrement_sse_connections) to locate the changes.
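A sketch of the guarded bookkeeping the prompt asks for (method and attribute names taken from the prompt; treat as illustrative):

```python
async def register_connection(self, execution_id: str, connection_id: str) -> None:
    async with self._lock:
        conns = self._active_connections.setdefault(execution_id, set())
        if connection_id not in conns:  # count only first-time registrations
            conns.add(connection_id)
            self.metrics.increment_sse_connections("executions")

async def unregister_connection(self, execution_id: str, connection_id: str) -> None:
    async with self._lock:
        conns = self._active_connections.get(execution_id)
        if conns and connection_id in conns:  # count only real removals
            conns.remove(connection_id)
            self.metrics.decrement_sse_connections("executions")
            if not conns:
                del self._active_connections[execution_id]
```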
In `@backend/tests/integration/events/test_consumer_lifecycle.py`:
- Around line 40-52: The test signals readiness too early: move the
consumer_started.set() so it runs after the internal consumer is initialized
inside run() (or after consumer.run() completes its startup logic) to avoid
calling consumer.seek_to_beginning()/consumer.seek_to_end() before
consumer._consumer is assigned; locate the run_with_signal coroutine and ensure
the readiness event is set only once consumer.run() has created/started the
AIOKafkaConsumer (or add an explicit check/wait for consumer._consumer to be
non-None before performing seeks).
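One way to sketch that ordering, assuming the test's `consumer_started` event and the consumer's internal `_consumer` attribute as named in the prompt:

```python
async def run_with_signal() -> None:
    run = asyncio.create_task(consumer.run())
    # Signal readiness only after the internal AIOKafkaConsumer exists, so
    # seek_to_beginning()/seek_to_end() never race consumer startup.
    while consumer._consumer is None:
        await asyncio.sleep(0.05)
    consumer_started.set()
    await run
```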
In `@backend/workers/run_coordinator.py`:
- Around line 42-64: The code awaits asyncio.wait([run_task, shutdown_task]) but
only inspects pending tasks, so exceptions from completed tasks (e.g., run_task
from run_coordinator_tasks()) are never re-raised; update the logic after
asyncio.wait to iterate over the done set and call task.result() (or check
task.exception()) for each done task and re-raise or surface any exceptions (at
least for run_task) before cancelling pending tasks; locate
run_coordinator_tasks, run_task, shutdown_task, and the asyncio.wait call to
implement this change so failures in run_coordinator_tasks propagate correctly.
In `@backend/workers/run_k8s_worker.py`:
- Around line 49-64: The code creates run_task =
asyncio.create_task(run_worker_tasks()) and shutdown_task =
asyncio.create_task(shutdown_event.wait()) then awaits asyncio.wait but never
retrieves results from completed tasks, so exceptions from run_worker_tasks()
are swallowed; after the asyncio.wait call inspect each task in the done set
(e.g., check task.done() and call task.result() or task.exception()) and
re-raise or propagate any exception found (for example by calling task.result()
which will raise) before cancelling pending tasks so that worker failures
surface instead of being masked.
In `@backend/workers/run_result_processor.py`:
- Around line 35-53: The current shutdown sequence swallows exceptions from
consumer.run() because run_task's exception is never retrieved; after the
asyncio.wait completes, inspect the completed tasks (e.g., check if run_task is
in the done set), call run_task.result() (or handle/run_task.exception()) to
surface any exception, log it with logger.error including context, and re-raise
or set an exit error before cancelling pending tasks (shutdown_task) so failures
in consumer.run() are not treated as normal shutdown; update the logic around
run_task/run_task.result(), shutdown_task, and shutdown_event accordingly.
In `@backend/workers/run_saga_orchestrator.py`:
- Around line 54-70: The current shutdown logic cancels pending tasks but can
silently swallow exceptions from completed tasks like run_task; after
asyncio.wait returns, inspect the done set (e.g., run_task in done) and
propagate any errors by awaiting the completed tasks or checking
task.exception() and re-raising it before returning/finishing; keep existing
cancellation/await handling for pending tasks (shutdown_task) but ensure
exceptions from run_orchestrator_tasks are not ignored so the orchestrator
failure surfaces to the caller.
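The last four prompts describe the same fix; a minimal sketch of the shared pattern, using the `run_task`/`shutdown_task` names from the prompts:

```python
done, pending = await asyncio.wait(
    [run_task, shutdown_task],
    return_when=asyncio.FIRST_COMPLETED,
)
for task in pending:
    task.cancel()
await asyncio.gather(*pending, return_exceptions=True)
for task in done:
    if task is not shutdown_task:
        task.result()  # re-raises any exception from the worker task
```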
♻️ Duplicate comments (4)
backend/app/events/event_store_consumer.py (1)
102-104: Add error handling for batch storage and commit failures.

If `_store_batch()` or `consumer.commit()` throws an exception (e.g., transient MongoDB or Kafka broker issue), the consumer will terminate entirely. Consider wrapping these operations in a try-except to log and continue processing subsequent batches.

🐛 Proposed fix

```diff
 if events:
-    await self._store_batch(events)
-    await consumer.commit()
+    try:
+        await self._store_batch(events)
+        await consumer.commit()
+    except Exception as e:
+        self.logger.error(f"Failed to store/commit batch: {e}", exc_info=True)
+        # Continue processing - Kafka will redeliver uncommitted messages
```

backend/app/services/pod_monitor/monitor.py (1)
277-285: Max reconnect RuntimeError terminates monitor ungracefully.

This issue was previously flagged. When `_backoff()` raises `RuntimeError` after max attempts, it propagates through `_watch_loop()` to `run()`, which only catches `CancelledError`. The monitor stops without cleanup or recovery path. Consider either catching this in `run()` for graceful degradation or using cancellation-based exit.

backend/app/services/saga/saga_logic.py (1)
299-321: Keep the timeout loop resilient to transient errors.

As written, any unexpected exception will terminate the loop permanently. This matches the earlier concern and still applies here.

🐛 Proposed fix: log and continue on unexpected errors

```diff
 async def check_timeouts_loop(self) -> None:
     """Check for saga timeouts (runs until cancelled)."""
     try:
         while True:
-            # Check every 30 seconds
-            await asyncio.sleep(30)
-
-            cutoff_time = datetime.now(UTC) - timedelta(seconds=self.config.timeout_seconds)
-
-            timed_out = await self._repo.find_timed_out_sagas(cutoff_time)
-
-            for instance in timed_out:
-                self.logger.warning(f"Saga {instance.saga_id} timed out")
-
-                instance.state = SagaState.TIMEOUT
-                instance.error_message = f"Saga timed out after {self.config.timeout_seconds} seconds"
-                instance.completed_at = datetime.now(UTC)
-
-                await self._save_saga(instance)
-                self._running_instances.pop(instance.saga_id, None)
+            try:
+                # Check every 30 seconds
+                await asyncio.sleep(30)
+
+                cutoff_time = datetime.now(UTC) - timedelta(seconds=self.config.timeout_seconds)
+
+                timed_out = await self._repo.find_timed_out_sagas(cutoff_time)
+
+                for instance in timed_out:
+                    self.logger.warning(f"Saga {instance.saga_id} timed out")
+
+                    instance.state = SagaState.TIMEOUT
+                    instance.error_message = f"Saga timed out after {self.config.timeout_seconds} seconds"
+                    instance.completed_at = datetime.now(UTC)
+
+                    await self._save_saga(instance)
+                    self._running_instances.pop(instance.saga_id, None)
+            except Exception as e:
+                self.logger.error(f"Error in timeout checker: {e}", exc_info=True)
+                await asyncio.sleep(5)
     except asyncio.CancelledError:
         self.logger.info("Timeout checker cancelled")
```

backend/tests/unit/services/saga/test_saga_orchestrator_unit.py (1)
93-99: Assertion doesn't validate `handle_event` side effects.

`register_saga` already populates `_sagas`, so the current assertion can pass even if `handle_event` does nothing. Consider asserting on a side effect (e.g., repo save or saga state change) after the handler runs.
🧹 Nitpick comments (8)
backend/app/services/k8s_worker/worker_logic.py (4)
56-56: Redundant fallback for non-optional parameter.

The `config` parameter is typed as `K8sWorkerConfig` (non-optional), yet line 56 has a fallback `config or K8sWorkerConfig()`. Either make the parameter `Optional[K8sWorkerConfig]` or remove the fallback to match the signature.

```diff
-    self.config = config or K8sWorkerConfig()
+    self.config = config
```
154-154: Fire-and-forget task lacks exception propagation.

The task created here is not stored, so any unhandled exception will be silently lost (logged only). While the current implementation handles exceptions internally in `_create_pod_for_execution`, consider storing task references for better observability and graceful shutdown handling.
305-308: Defensive null checks on required dependency.

`self.producer` is assigned from a required constructor parameter (non-optional `UnifiedProducer`). These null checks suggest the producer might be None, but that would indicate a constructor bug. Either remove the checks or type the field as `Optional[UnifiedProducer]` if nullability is intentional.

Also applies to: 319-322, 336-339
356-356: Undocumented hardcoded delay.

The 5-second sleep before DaemonSet operations appears arbitrary. Consider documenting the reason (e.g., waiting for API server readiness) or making it configurable via `K8sWorkerConfig`.

backend/tests/e2e/test_k8s_worker_create_pod.py (1)
83-91: Resource cleanup may be skipped on test failure.

If the assertions on lines 85-87 fail, the cleanup on lines 90-91 won't execute, potentially leaving orphaned Kubernetes resources. Consider wrapping the verification and cleanup in a try/finally block.

🧹 Suggested fix for reliable cleanup

```diff
 # Verify resources exist
-got_cm = logic.v1.read_namespaced_config_map(name=f"script-{exec_id}", namespace=ns)
-assert got_cm is not None
-got_pod = logic.v1.read_namespaced_pod(name=f"executor-{exec_id}", namespace=ns)
-assert got_pod is not None
-
-# Cleanup
-logic.v1.delete_namespaced_pod(name=f"executor-{exec_id}", namespace=ns)
-logic.v1.delete_namespaced_config_map(name=f"script-{exec_id}", namespace=ns)
+try:
+    got_cm = logic.v1.read_namespaced_config_map(name=f"script-{exec_id}", namespace=ns)
+    assert got_cm is not None
+    got_pod = logic.v1.read_namespaced_pod(name=f"executor-{exec_id}", namespace=ns)
+    assert got_pod is not None
+finally:
+    # Cleanup regardless of assertion outcome
+    logic.v1.delete_namespaced_pod(name=f"executor-{exec_id}", namespace=ns)
+    logic.v1.delete_namespaced_config_map(name=f"script-{exec_id}", namespace=ns)
```

backend/app/services/sse/event_router.py (1)
42-46: Avoid a full `model_dump()` just to read `execution_id`.

`model_dump()` allocates a full dict (potentially large payloads). A direct attribute lookup avoids the extra work.

Suggested refactor

```diff
-    data = event.model_dump()
-    execution_id = data.get("execution_id")
+    execution_id = getattr(event, "execution_id", None)
```

backend/tests/integration/services/sse/test_partitioned_event_router.py (1)
39-41: Consider closing the subscription after the test.

The test opens a Redis subscription but doesn't explicitly close it. While the fixture cleanup may handle the Redis client, explicitly closing the subscription would be cleaner.

♻️ Suggested cleanup

```diff
 # Await the subscription - verify event arrived in Redis
 msg = await asyncio.wait_for(subscription.get(RedisSSEMessage), timeout=2.0)
 assert msg is not None
 assert str(msg.event_type) == str(ev.event_type)
+
+await subscription.close()
```

backend/app/services/saga/saga_service.py (1)
162-178: Consider renaming method to reflect new implementation.

The method `get_saga_status_from_orchestrator` references "orchestrator", which no longer exists. Consider renaming to something like `get_saga_status_live` or `get_live_saga_status` to better reflect that it fetches from `SagaLogic`'s in-memory state first.
backend/app/core/dishka_lifespan.py
```python
# Phase 3: Start run() methods as background tasks
# Note: Services are already initialized by their DI providers (which handle __aenter__/__aexit__)

async def run_sse_consumers() -> None:
    """Run SSE consumers using TaskGroup."""
    async with asyncio.TaskGroup() as tg:
        for consumer in sse_consumers:
            tg.create_task(consumer.run())

tasks = [
    asyncio.create_task(run_sse_consumers(), name="sse_consumers"),
    asyncio.create_task(event_store_consumer.run(), name="event_store_consumer"),
    asyncio.create_task(notification_service.run(), name="notification_service"),
]
logger.info(f"Background services started ({len(sse_consumers)} SSE consumers)")

try:
    yield
finally:
    # Cancel all background tasks on shutdown
    logger.info("Shutting down background services...")
    for task in tasks:
        task.cancel()

    # Wait for tasks to finish cancellation
    await asyncio.gather(*tasks, return_exceptions=True)
```
🧩 Analysis chain
🏁 Script executed:

```shell
cat -n backend/app/core/dishka_lifespan.py | head -150
```

Repository: HardMax71/Integr8sCode
Length of output: 188
🌐 Web query:
asyncio.create_task() exception handling unwaited tasks Python
💡 Result:
Short answer: an exception in a Task created with asyncio.create_task() is saved on the Task and will be logged as "Task exception was never retrieved" if nobody ever awaits the Task or calls result()/exception(). Fixes: await the task, use a TaskGroup (Python 3.11+) or keep a strong reference and handle its result in a callback, or install a loop exception handler. Examples and links:
- Await or gather tasks (propagates/raises exceptions): `await task` or `await asyncio.gather(...)`
- Use asyncio.TaskGroup (Python 3.11+) to automatically wait and raise grouped exceptions:

  ```python
  async with asyncio.TaskGroup() as tg:
      tg.create_task(coro())  # all tasks awaited when context exits. [1]
  ```

- Keep references + inspect exceptions (for "fire-and-forget"):

  ```python
  background = set()
  task = asyncio.create_task(coro())
  background.add(task)
  task.add_done_callback(lambda t: background.discard(t) or log_exception(t))

  def log_exception(t):
      if t.cancelled():
          return
      exc = t.exception()
      if exc is not None:
          logger.error("task failed", exc_info=exc)  # or handle it. [2][3]
  ```

- Global handling: loop.set_exception_handler(...) can be used for loop-level logging/control.
Sources: asyncio docs on Task/TaskGroup and "detect never-retrieved exceptions". [1][2][3]
References
[1] Python docs — Coroutines and Tasks / TaskGroup (asyncio.TaskGroup). (docs.python.org)
[2] Python docs — Detect never-retrieved exceptions in asyncio (explains "Task exception was never retrieved"). (docs.python.org)
[3] Practical guidance: keeping strong references and add_done_callback examples. (runebook.dev)
If you want, tell me your Python version and I’ll give the exact snippet tailored to it.
Citations:
- 1: https://docs.python.org/3.11/library/asyncio-task.html?utm_source=openai
- 2: https://docs.python.org/3.9/library/asyncio-dev.html?utm_source=openai
- 3: https://runebook.dev/en/docs/python/library/asyncio-task/asyncio.Task?utm_source=openai
🌐 Web query:
Python asyncio TaskGroup vs create_task exception handling
💡 Result:
Short answer
- asyncio.TaskGroup.create_task(...) creates tasks that belong to the TaskGroup; if any member task raises (other than CancelledError) the group cancels the remaining member tasks, waits for them, and then raises an ExceptionGroup (or re-raises KeyboardInterrupt/SystemExit). [1]
- asyncio.create_task(...) (or loop.create_task) creates a standalone Task not owned by a TaskGroup; its exceptions do not affect a surrounding TaskGroup and it is not cancelled when the group is torn down. Use standalone tasks if you want a task to outlive or be isolated from the group. [1][2]
Key differences and practical notes
- Exception aggregation: TaskGroup collects non-cancellation exceptions from members and raises them as an ExceptionGroup when the group exits; plain create_task does not provide that aggregation. [1]
- Cancellation semantics: When a TaskGroup member fails, the group cancels the remaining member tasks (and may cancel the task running the async-with body). Standalone tasks created with asyncio.create_task are not cancelled by the TaskGroup. [1][2]
- Edge cases/bugs: There have been reported corner cases (e.g., tasks added after abort) and fixes/discussions in CPython issue tracker — be careful if you dynamically add tasks after a failure. [3]
If you want, I can show a minimal code example demonstrating both behaviors.
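A runnable toy (Python 3.11+) illustrating the TaskGroup semantics described above: one failing member cancels its sibling and the group raises an ExceptionGroup.

```python
import asyncio


async def boom() -> None:
    raise ValueError("fail")


async def main() -> None:
    try:
        async with asyncio.TaskGroup() as tg:
            tg.create_task(boom())
            tg.create_task(asyncio.sleep(10))  # cancelled when boom() fails
    except* ValueError as eg:
        print("caught:", eg.exceptions)


asyncio.run(main())
```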
🏁 Script executed:

```shell
# Try alternative repository access
git clone https://github.com/HardMax71/Integr8sCode.git /tmp/repo 2>&1 | head -20
```

Repository: HardMax71/Integr8sCode
Length of output: 155

🏁 Script executed:

```shell
cd /tmp/repo && cat -n backend/app/core/dishka_lifespan.py | sed -n '90,140p'
```

Repository: HardMax71/Integr8sCode
Length of output: 1066

🏁 Script executed:

```shell
cd /tmp/repo && cat -n backend/app/core/dishka_lifespan.py | sed -n '1,150p'
```

Repository: HardMax71/Integr8sCode
Length of output: 4966

🏁 Script executed:

```shell
cd /tmp/repo && cat -n backend/app/core/dishka_lifespan.py
```

Repository: HardMax71/Integr8sCode
Length of output: 4966

🏁 Script executed:

```shell
cd /tmp/repo && git log --oneline backend/app/core/dishka_lifespan.py | head -10
```

Repository: HardMax71/Integr8sCode
Length of output: 608

🏁 Script executed:

```shell
cd /tmp/repo && git show HEAD:backend/app/core/dishka_lifespan.py | wc -l
```

Repository: HardMax71/Integr8sCode
Length of output: 69

🏁 Script executed:

```shell
cd /tmp/repo && git show 577a184:backend/app/core/dishka_lifespan.py 2>/dev/null | tail -60
```

Repository: HardMax71/Integr8sCode
Length of output: 2666

🏁 Script executed:

```shell
cd /tmp/repo && git log --all --oneline --grep="background\|task\|supervision" | head -20
```

Repository: HardMax71/Integr8sCode
Length of output: 163

🏁 Script executed:

```shell
cd /tmp/repo && git diff 577a184^..577a184 -- backend/app/core/dishka_lifespan.py | head -150
```

Repository: HardMax71/Integr8sCode
Length of output: 2243

🏁 Script executed:

```shell
cd /tmp/repo && git show 0a65186 -- backend/app/core/dishka_lifespan.py | head -200
```

Repository: HardMax71/Integr8sCode
Length of output: 48

🏁 Script executed:

```shell
cd /tmp/repo && git log --oneline | head -20
```

Repository: HardMax71/Integr8sCode
Length of output: 921

🏁 Script executed:

```shell
cd /tmp/repo && git log --oneline --all | grep -i "asyncio\|task"
```

Repository: HardMax71/Integr8sCode
Length of output: 163

🏁 Script executed:

```shell
cd /tmp/repo && git show d6fe5ea -- backend/app/core/dishka_lifespan.py
```

Repository: HardMax71/Integr8sCode
Length of output: 48

🏁 Script executed:

```shell
cd /tmp/repo && git log --all --oneline | grep -i "task fix"
```

Repository: HardMax71/Integr8sCode
Length of output: 163

🏁 Script executed:

```shell
cd /tmp/repo && git show d6fe5ea:backend/app/core/dishka_lifespan.py 2>&1 | cat
```

Repository: HardMax71/Integr8sCode
Length of output: 152

🏁 Script executed:

```shell
cd /tmp/repo && git branch -a
```

Repository: HardMax71/Integr8sCode
Length of output: 2486

🏁 Script executed:

```shell
cd /tmp/repo && git log --all --oneline --decorate -- backend/app/core/dishka_lifespan.py | head -20
```

Repository: HardMax71/Integr8sCode
Length of output: 1088

🏁 Script executed:

```shell
cd /tmp/repo && git show ff45a52:backend/app/core/dishka_lifespan.py 2>&1 | cat
```

Repository: HardMax71/Integr8sCode
Length of output: 4984
Add runtime supervision for background tasks.
Tasks are created fire-and-forget and only awaited on shutdown with return_exceptions=True; a runtime exception in any consumer/service will go silent until exit. Use a done callback to log failures and cancel remaining tasks immediately.
Suggested supervision pattern
```diff
 tasks = [
     asyncio.create_task(run_sse_consumers(), name="sse_consumers"),
     asyncio.create_task(event_store_consumer.run(), name="event_store_consumer"),
     asyncio.create_task(notification_service.run(), name="notification_service"),
 ]
 logger.info(f"Background services started ({len(sse_consumers)} SSE consumers)")
+
+def _handle_task_done(task: asyncio.Task) -> None:
+    if task.cancelled():
+        return
+    exc = task.exception()
+    if exc:
+        logger.error("Background task failed", exc_info=exc)
+        for t in tasks:
+            if t is not task:
+                t.cancel()
+
+for task in tasks:
+    task.add_done_callback(_handle_task_done)
```

🤖 Prompt for AI Agents
+ task.add_done_callback(_handle_task_done)🤖 Prompt for AI Agents
In `@backend/app/core/dishka_lifespan.py` around lines 102 - 127, Background tasks
created in the tasks list (run_sse_consumers, event_store_consumer.run,
notification_service.run) are fire-and-forget and only awaited at shutdown, so
runtime exceptions can be lost; update the creation of each asyncio.create_task
to attach an add_done_callback that checks task.exception(), logs the error via
logger, and on first non-None exception cancels all other tasks in the tasks
collection (safely skipping already done/cancelled tasks) to implement immediate
supervision and fail-fast behavior, ensuring you still await
asyncio.gather(*tasks, return_exceptions=True) in the finally block to finish
cleanup.
```diff
 @provide(scope=Scope.REQUEST)
-def get_sse_shutdown_manager(
-    self, logger: logging.Logger, connection_metrics: ConnectionMetrics
-) -> SSEShutdownManager:
-    return create_sse_shutdown_manager(logger=logger, connection_metrics=connection_metrics)
+def get_sse_connection_registry(
+    self,
+    logger: logging.Logger,
+    connection_metrics: ConnectionMetrics,
+) -> SSEConnectionRegistry:
+    return SSEConnectionRegistry(
+        logger=logger,
+        connection_metrics=connection_metrics,
+    )
```
🧩 Analysis chain
🏁 Script executed:

```shell
# Find SSEConnectionRegistry class definition
fd -t f "\.py$" | xargs grep -l "class SSEConnectionRegistry"
```

Repository: HardMax71/Integr8sCode
Length of output: 188

🏁 Script executed:

```shell
# Search for get_connection_count usage
rg "get_connection_count" --type py -A 3 -B 3
```

Repository: HardMax71/Integr8sCode
Length of output: 6270

🏁 Script executed:

```shell
# Search for SSEConnectionRegistry usage
rg "SSEConnectionRegistry" --type py -A 2 -B 2
```

Repository: HardMax71/Integr8sCode
Length of output: 6034

🏁 Script executed:

```shell
# Check how SSEService is provided
rg "@provide.*SSEService" --type py -A 5 -B 2
```

Repository: HardMax71/Integr8sCode
Length of output: 48

🏁 Script executed:

```shell
# Check for Scope definitions and their meanings
rg "class Scope" --type py -A 10
```

Repository: HardMax71/Integr8sCode
Length of output: 48

🏁 Script executed:

```shell
# Look at sse_connection_registry.py to understand persistence expectations
cat -n backend/app/services/sse/sse_connection_registry.py | head -50
```

Repository: HardMax71/Integr8sCode
Length of output: 2296

🏁 Script executed:

```shell
# Search for SSEService provider definition more broadly
rg "get_sse_service|SSEService" backend/app/core/providers.py -A 5 -B 5
```

Repository: HardMax71/Integr8sCode
Length of output: 1278

🏁 Script executed:

```shell
# Look at the full providers.py around the SSEService definition
rg "def get_" backend/app/core/providers.py | grep -i sse
```

Repository: HardMax71/Integr8sCode
Length of output: 355

🏁 Script executed:

```shell
# Check SSEService class to understand if it needs app-wide state
cat -n backend/app/services/sse/sse_service.py | head -80
```

Repository: HardMax71/Integr8sCode
Length of output: 3831

🏁 Script executed:

```shell
# Check if there's any caching or special handling for REQUEST scope singletons
rg "Scope\." backend/app/core/providers.py -B 2 -A 1 | head -40
```

Repository: HardMax71/Integr8sCode
Length of output: 706

🏁 Script executed:

```shell
# Look for health status endpoint that uses get_connection_count
rg "get_health_status" --type py -A 10 -B 5
```

Repository: HardMax71/Integr8sCode
Length of output: 3730

🏁 Script executed:

```shell
# Search for any test that shows REQUEST vs APP scope behavior
rg "Scope\.(REQUEST|APP)" backend/tests --type py | head -20
```

Repository: HardMax71/Integr8sCode
Length of output: 48
SSEConnectionRegistry scope must be Scope.APP, not Scope.REQUEST.
With Scope.REQUEST, a new registry instance is created per HTTP request, resulting in an empty _active_connections dictionary on every health status check. This causes get_connection_count() to always return 0. The registry is designed to track active connections globally across the application—registrations from one request's SSE stream must be visible when the health endpoint is called from a different request. The internal asyncio.Lock() and the persistent state expectations confirm the registry is intended as application-wide infrastructure.
🤖 Prompt for AI Agents
In `@backend/app/core/providers.py` around lines 463 - 472, The provider
get_sse_connection_registry currently creates an SSEConnectionRegistry per
request (provide(scope=Scope.REQUEST)), which resets _active_connections and
causes get_connection_count() to always return 0; change the provider to use
application scope (provide(scope=Scope.APP)) so SSEConnectionRegistry is a
singleton for the app, preserving the asyncio.Lock() and shared
_active_connections across requests and allowing registrations from SSE streams
to be visible to the health endpoint.
backend/app/events/core/consumer.py
```python
except Exception as e:
    self.logger.error(f"Error processing message: {e}", exc_info=True)
    self._event_metrics.record_kafka_consumption_error(
        msg.topic, self._config.group_id, type(e).__name__
    )
    if self._error_callback:
        await self._error_callback(e, event)
```
Fix: `event` may be unbound when the error callback is invoked.

If `deserialize_event()` throws an exception (line 80), the `event` variable is never assigned. The error callback at line 110 will then raise a `NameError`.
🐛 Proposed fix
```diff
+event: DomainEvent | None = None
 try:
     event = await self._schema_registry.deserialize_event(msg.value, msg.topic)
     headers = {k: v.decode() if isinstance(v, bytes) else v for k, v in (msg.headers or [])}
     ctx = extract_trace_context(headers)
     with tracer.start_as_current_span(
         "kafka.consume",
         context=ctx,
         kind=SpanKind.CONSUMER,
         attributes={
             EventAttributes.KAFKA_TOPIC: msg.topic,
             EventAttributes.KAFKA_PARTITION: msg.partition,
             EventAttributes.KAFKA_OFFSET: msg.offset,
             EventAttributes.EVENT_TYPE: event.event_type,
             EventAttributes.EVENT_ID: event.event_id,
         },
     ):
         await self._dispatcher.dispatch(event)
         if not self._config.enable_auto_commit:
             await self._consumer.commit()
         self._event_metrics.record_kafka_message_consumed(msg.topic, self._config.group_id)
 except Exception as e:
     self.logger.error(f"Error processing message: {e}", exc_info=True)
     self._event_metrics.record_kafka_consumption_error(
         msg.topic, self._config.group_id, type(e).__name__
     )
-    if self._error_callback:
+    if self._error_callback and event is not None:
         await self._error_callback(e, event)
```
In `@backend/app/events/core/consumer.py` around lines 104 - 110, The variable
event can be unbound if deserialize_event() raises, causing a NameError when
awaiting self._error_callback(e, event); fix by initializing event = None before
the try/except where deserialize_event() is called (so the callback always
receives a defined value), and keep the existing error logging and metrics calls
(e.g., the except block that calls self.logger.error,
self._event_metrics.record_kafka_consumption_error and awaits
self._error_callback).
     def register_idempotent_handler(
         self,
-        event_type: str,
+        event_type: EventType,
         handler: Callable[[DomainEvent], Awaitable[None]],
         key_strategy: str | None = None,
         custom_key_func: Callable[[DomainEvent], str] | None = None,
-        fields: Set[str] | None = None,
+        fields: set[str] | None = None,
         ttl_seconds: int | None = None,
         cache_result: bool = True,
         on_duplicate: Callable[[DomainEvent, Any], Any] | None = None,
     ) -> None:
-        """Subscribe an idempotent handler for specific event type"""
-        # Create the idempotent handler wrapper
-        idempotent_wrapper = IdempotentEventHandler(
+        """Register an idempotent handler for an event type."""
+        wrapped = IdempotentEventHandler(
             handler=handler,
-            idempotency_manager=self.idempotency_manager,
-            logger=self.logger,
-            key_strategy=key_strategy or self.default_key_strategy,
+            idempotency_manager=self._idempotency_manager,
+            logger=self._logger,
+            key_strategy=key_strategy or self._default_key_strategy,
             custom_key_func=custom_key_func,
             fields=fields,
-            ttl_seconds=ttl_seconds or self.default_ttl_seconds,
-            cache_result=cache_result,
+            ttl_seconds=ttl_seconds or self._default_ttl_seconds,
             on_duplicate=on_duplicate,
         )
-
-        # Create an async handler that processes the message
-        async def async_handler(message: Any) -> Any:
-            self.logger.info(f"IDEMPOTENT HANDLER CALLED for {event_type}")
-
-            # Extract event from confluent-kafka Message
-            if not hasattr(message, "value"):
-                self.logger.error(f"Received non-Message object for {event_type}: {type(message)}")
-                return None
-
-            # Debug log to check message details
-            self.logger.info(
-                f"Handler for {event_type} - Message type: {type(message)}, "
-                f"has key: {hasattr(message, 'key')}, "
-                f"has topic: {hasattr(message, 'topic')}"
-            )
-
-            raw_value = message.value
-
-            # Debug the raw value
-            self.logger.info(f"Raw value extracted: {raw_value[:100] if raw_value else 'None or empty'}")
-
-            # Handle tombstone messages (null value for log compaction)
-            if raw_value is None:
-                self.logger.warning(f"Received empty message for {event_type} - tombstone or consumed value")
-                return None
-
-            # Handle empty messages
-            if not raw_value:
-                self.logger.warning(f"Received empty message for {event_type} - empty bytes")
-                return None
-
-            try:
-                # Deserialize using schema registry if available
-                event = await self.consumer._schema_registry.deserialize_event(raw_value, message.topic)
-                if not event:
-                    self.logger.error(f"Failed to deserialize event for {event_type}")
-                    return None
-
-                # Call the idempotent wrapper directly in async context
-                await idempotent_wrapper(event)
-
-                self.logger.debug(f"Successfully processed {event_type} event: {event.event_id}")
-                return None
-            except Exception as e:
-                self.logger.error(f"Failed to process message for {event_type}: {e}", exc_info=True)
-                raise
-
-        # Register with the dispatcher if available
-        if self.dispatcher:
-            # Create wrapper for EventDispatcher
-            async def dispatch_handler(event: DomainEvent) -> None:
-                await idempotent_wrapper(event)
-
-            self.dispatcher.register(EventType(event_type))(dispatch_handler)
-        else:
-            # Fallback to direct consumer registration if no dispatcher
-            self.logger.error(f"No EventDispatcher available for registering idempotent handler for {event_type}")
-
-    async def start(self, topics: list[KafkaTopic]) -> None:
-        """Start the consumer with idempotency"""
-        self.logger.info(f"IdempotentConsumerWrapper.start called with topics: {topics}")
-        # Make handlers idempotent before starting
-        self.make_handlers_idempotent()
-
-        # Start the consumer with required topics parameter
-        await self.consumer.start(topics)
-        self.logger.info("IdempotentConsumerWrapper started successfully")
-
-    async def stop(self) -> None:
-        """Stop the consumer"""
-        await self.consumer.stop()
-
-    # Delegate other methods to the wrapped consumer
-    def __getattr__(self, name: str) -> Any:
-        return getattr(self.consumer, name)
+        self._dispatcher.register(event_type)(wrapped)
Fix handler registration to avoid AttributeError on __name__.
EventDispatcher.register logs handler.__name__, but IdempotentEventHandler instances don’t define it, so register_idempotent_handler can crash at registration time.
🐛 Proposed fix: set a safe __name__ before registration
wrapped = IdempotentEventHandler(
handler=handler,
idempotency_manager=self._idempotency_manager,
logger=self._logger,
key_strategy=key_strategy or self._default_key_strategy,
custom_key_func=custom_key_func,
fields=fields,
ttl_seconds=ttl_seconds or self._default_ttl_seconds,
on_duplicate=on_duplicate,
)
+ wrapped.__name__ = getattr(handler, "__name__", wrapped.__class__.__name__)
self._dispatcher.register(event_type)(wrapped)🤖 Prompt for AI Agents
In `@backend/app/services/idempotency/middleware.py` around lines 123 - 144,
register_idempotent_handler wraps a handler in IdempotentEventHandler then
passes it to EventDispatcher.register which expects handler.__name__; set the
wrapper's __name__ (or use functools.update_wrapper/functools.wraps) so the
wrapped instance exposes the original handler's __name__ to avoid
AttributeError. In practice, after creating wrapped in
register_idempotent_handler, assign wrapped.__name__ = getattr(handler,
"__name__", type(handler).__name__) (or call update_wrapper(wrapped, handler))
before calling self._dispatcher.register(event_type)(wrapped).
    async def register_connection(self, execution_id: str, connection_id: str) -> None:
        """Register a new SSE connection."""
        async with self._lock:
            if execution_id not in self._active_connections:
                self._active_connections[execution_id] = set()

            self._active_connections[execution_id].add(connection_id)
            self.logger.debug("Registered SSE connection", extra={"connection_id": connection_id})
            self.metrics.increment_sse_connections("executions")

    async def unregister_connection(self, execution_id: str, connection_id: str) -> None:
        """Unregister an SSE connection."""
        async with self._lock:
            if execution_id in self._active_connections:
                self._active_connections[execution_id].discard(connection_id)
                if not self._active_connections[execution_id]:
                    del self._active_connections[execution_id]

            self.logger.debug("Unregistered SSE connection", extra={"connection_id": connection_id})
            self.metrics.decrement_sse_connections("executions")
Prevent metrics drift on duplicate register/unregister.
Line 36 and Line 49 update metrics even when the connection was already present/absent, which can over/under-count (and potentially go negative). Consider only updating metrics when the set actually changes.
🛠️ Suggested fix
async def register_connection(self, execution_id: str, connection_id: str) -> None:
"""Register a new SSE connection."""
async with self._lock:
- if execution_id not in self._active_connections:
- self._active_connections[execution_id] = set()
-
- self._active_connections[execution_id].add(connection_id)
- self.logger.debug("Registered SSE connection", extra={"connection_id": connection_id})
- self.metrics.increment_sse_connections("executions")
+ conns = self._active_connections.setdefault(execution_id, set())
+ if connection_id not in conns:
+ conns.add(connection_id)
+ self.metrics.increment_sse_connections("executions")
+ self.logger.debug("Registered SSE connection", extra={"connection_id": connection_id})
async def unregister_connection(self, execution_id: str, connection_id: str) -> None:
"""Unregister an SSE connection."""
async with self._lock:
- if execution_id in self._active_connections:
- self._active_connections[execution_id].discard(connection_id)
- if not self._active_connections[execution_id]:
- del self._active_connections[execution_id]
-
- self.logger.debug("Unregistered SSE connection", extra={"connection_id": connection_id})
- self.metrics.decrement_sse_connections("executions")
+ conns = self._active_connections.get(execution_id)
+ if conns and connection_id in conns:
+ conns.remove(connection_id)
+ if not conns:
+ del self._active_connections[execution_id]
+ self.metrics.decrement_sse_connections("executions")
+ self.logger.debug("Unregistered SSE connection", extra={"connection_id": connection_id})🤖 Prompt for AI Agents
In `@backend/app/services/sse/sse_connection_registry.py` around lines 30 - 49,
The register_connection and unregister_connection currently call
metrics.increment_sse_connections / decrement_sse_connections unconditionally,
causing drift on duplicate adds/removes; modify register_connection to only call
metrics.increment_sse_connections("executions") (and log) if connection_id was
not already present in self._active_connections[execution_id], and modify
unregister_connection to only call
metrics.decrement_sse_connections("executions") (and log) if connection_id was
actually removed (i.e., it existed before discard and the set changed or length
decreased); keep the async with self._lock around these checks and mutations and
use the same symbols (_active_connections, register_connection,
unregister_connection, metrics.increment_sse_connections,
metrics.decrement_sse_connections) to locate the changes.
     async def run_with_signal() -> None:
         consumer_started.set()
         await consumer.run()

     task = asyncio.create_task(run_with_signal())

     try:
-        st = c.get_status()
-        assert st.state == "running" and st.is_running is True
-        # Exercise seek functions; don't force specific partition offsets
-        await c.seek_to_beginning()
-        await c.seek_to_end()
-        # No need to sleep; just ensure we can call seek APIs while running
+        # Wait for consumer to start
+        await asyncio.wait_for(consumer_started.wait(), timeout=5.0)
+
+        # Exercise seek functions while consumer is running
+        await consumer.seek_to_beginning()
+        await consumer.seek_to_end()
Test may have a race: consumer_started signals before consumer is fully running.
consumer_started.set() fires before consumer.run() is called, so the seek operations at lines 51-52 may execute before the internal AIOKafkaConsumer is started. The seek_to_beginning() and seek_to_end() methods guard with if self._consumer, but self._consumer is only assigned inside run() after construction.
Consider moving the event signal after consumer initialization:
💡 Suggested improvement
async def run_with_signal() -> None:
- consumer_started.set()
- await consumer.run()
+ # Small delay to ensure consumer.run() has started and set self._consumer
+ await asyncio.sleep(0.1)
+ consumer_started.set()
+ await consumer.run()Alternatively, add a short delay before exercising seek methods, or check that consumer._consumer is not None before proceeding.
🤖 Prompt for AI Agents
In `@backend/tests/integration/events/test_consumer_lifecycle.py` around lines 40
- 52, The test signals readiness too early: move the consumer_started.set() so
it runs after the internal consumer is initialized inside run() (or after
consumer.run() completes its startup logic) to avoid calling
consumer.seek_to_beginning()/consumer.seek_to_end() before consumer._consumer is
assigned; locate the run_with_signal coroutine and ensure the readiness event is
set only once consumer.run() has created/started the AIOKafkaConsumer (or add an
explicit check/wait for consumer._consumer to be non-None before performing
seeks).
backend/workers/run_coordinator.py
Outdated
 async def run_coordinator_tasks() -> None:
     """Run consumer and scheduling loop using TaskGroup."""
     async with asyncio.TaskGroup() as tg:
         tg.create_task(consumer.run())
         tg.create_task(logic.scheduling_loop())

 try:
-    # Wait for shutdown signal or service to stop
-    while coordinator.is_running and not shutdown_event.is_set():
-        await asyncio.sleep(60)
-        status = await coordinator.get_status()
-        logger.info(f"Coordinator status: {status}")
+    # Run coordinator until shutdown signal
+    run_task = asyncio.create_task(run_coordinator_tasks())
+    shutdown_task = asyncio.create_task(shutdown_event.wait())
+
+    done, pending = await asyncio.wait(
+        [run_task, shutdown_task],
+        return_when=asyncio.FIRST_COMPLETED,
+    )
+
+    # Cancel remaining tasks
+    for task in pending:
+        task.cancel()
+        try:
+            await task
+        except asyncio.CancelledError:
+            pass
🧩 Analysis chain
🏁 Script executed:
cat -n backend/workers/run_coordinator.py | head -80
Length of output: 188
🌐 Web query:
asyncio.wait exception handling task failures not propagated
💡 Result:
Short answer
asyncio.wait does not re-raise exceptions from tasks. It just returns (done, pending); exceptions raised inside tasks are stored on the Task object and only propagated if you await the task or call task.result()/task.exception(). Use await on the tasks (or gather), or check task.exception() / use return_when=FIRST_EXCEPTION to detect failures. [1][2][3]
Why this happens (concise)
- create_task schedules a Task that runs independently; an unhandled exception becomes the Task’s exception state — it is not automatically re-raised in the creator. The loop will warn if the exception is never retrieved. [1][2]
- asyncio.wait only waits for completion (or first exception when you use FIRST_EXCEPTION) and returns sets; it does not itself raise the exception. You must inspect/await the returned Task(s) to get the exception. [1][2]
What to do (options)
- Await the tasks (propagates exception): results = await asyncio.gather(*tasks) (gather will re-raise the first exception by default). [1][4]
- Use await individually or call task.result()/task.exception() on tasks returned by asyncio.wait(done, ...). Example:
- done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_EXCEPTION)
- for t in done:
exc = t.exception()
if exc is not None: handle/raise(exc)
- Run a monitor loop that uses return_when=FIRST_EXCEPTION and reacts immediately (reschedule/cancel/raise). [5]
References
[1] Python docs — asyncio.wait / tasks & futures.
[2] StackOverflow — explanation: exceptions stay in Task until retrieved.
[3] Python issue tracker — unhandled-task exceptions are delivered to exception handler; tasks don’t implicitly propagate to parent.
[4] asyncio.gather behavior (docs / tutorials).
[5] StackOverflow example of an exception-aware scheduler using asyncio.wait(FIRST_EXCEPTION).
If you want, I can paste a minimal code example showing each approach.
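To make the behavior concrete, here is a minimal, self-contained example (not from the PR) showing that asyncio.wait returns normally and the exception stays parked on the task until it is retrieved:

import asyncio

async def boom() -> None:
    raise RuntimeError("task failed")

async def main() -> None:
    task = asyncio.create_task(boom())
    done, _pending = await asyncio.wait({task}, return_when=asyncio.FIRST_COMPLETED)
    # No exception has been raised yet; it is stored on the finished task.
    for t in done:
        if (exc := t.exception()) is not None:
            raise exc  # propagate explicitly, as the fixes below do

asyncio.run(main())  # raises RuntimeError: task failed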
Surface task failures after asyncio.wait.
If run_coordinator_tasks() fails, the exception is stored in the task but never retrieved because the code only processes pending tasks. The done set containing the failed task is ignored, so the worker exits cleanly despite a crash.
🐛 Proposed fix: propagate completed task exceptions
done, pending = await asyncio.wait(
[run_task, shutdown_task],
return_when=asyncio.FIRST_COMPLETED,
)
# Cancel remaining tasks
for task in pending:
task.cancel()
try:
await task
except asyncio.CancelledError:
pass
+
+ # Surface failures from completed tasks
+ for task in done:
+ if exc := task.exception():
+ logger.error("Coordinator task failed: %s", exc)
+ raise exc🤖 Prompt for AI Agents
In `@backend/workers/run_coordinator.py` around lines 42 - 64, The code awaits
asyncio.wait([run_task, shutdown_task]) but only inspects pending tasks, so
exceptions from completed tasks (e.g., run_task from run_coordinator_tasks())
are never re-raised; update the logic after asyncio.wait to iterate over the
done set and call task.result() (or check task.exception()) for each done task
and re-raise or surface any exceptions (at least for run_task) before cancelling
pending tasks; locate run_coordinator_tasks, run_task, shutdown_task, and the
asyncio.wait call to implement this change so failures in run_coordinator_tasks
propagate correctly.
| logger.info("ResultProcessor consumer initialized, starting run...") | ||
|
|
||
| logger.info("ResultProcessor started and running") | ||
| try: | ||
| # Run consumer until shutdown signal | ||
| run_task = asyncio.create_task(consumer.run()) | ||
| shutdown_task = asyncio.create_task(shutdown_event.wait()) | ||
|
|
||
| done, pending = await asyncio.wait( | ||
| [run_task, shutdown_task], | ||
| return_when=asyncio.FIRST_COMPLETED, | ||
| ) | ||
|
|
||
| # Wait for shutdown signal or service to stop | ||
| while processor._state == ProcessingState.PROCESSING and not shutdown_event.is_set(): | ||
| await asyncio.sleep(60) | ||
| status = await processor.get_status() | ||
| logger.info(f"ResultProcessor status: {status}") | ||
| # Cancel remaining tasks | ||
| for task in pending: | ||
| task.cancel() | ||
| try: | ||
| await task | ||
| except asyncio.CancelledError: | ||
| pass |
🧩 Analysis chain
🏁 Script executed:
cat -n backend/workers/run_result_processor.py | head -80
Length of output: 3310
🏁 Script executed:
cat -n backend/workers/run_result_processor.py | tail -30
Length of output: 1203
🏁 Script executed:
python3 << 'EOF'
import asyncio
# Test 1: What happens if a task in done raises an exception?
async def test_exception_handling():
async def failing_task():
raise ValueError("Task failed")
async def wait_task():
await asyncio.sleep(0.1)
task1 = asyncio.create_task(failing_task())
task2 = asyncio.create_task(wait_task())
done, pending = await asyncio.wait(
[task1, task2],
return_when=asyncio.FIRST_COMPLETED,
)
print("Done tasks:", done)
print("Pending tasks:", pending)
print("task1 in done:", task1 in done)
print("task1 cancelled:", task1.cancelled())
try:
exc = task1.exception()
print("Exception from task1:", exc)
except asyncio.CancelledError:
print("Task1 was cancelled")
except Exception as e:
print("Error getting exception:", e)
# Now cancel task2
for task in pending:
task.cancel()
try:
await task
except asyncio.CancelledError:
print("Pending task was cancelled")
asyncio.run(test_exception_handling())
EOF
Length of output: 152
Surface consumer.run() failures instead of swallowing them.
If run_task completes with an exception, it will be in the done set, but the exception is never retrieved. The worker exits normally without signaling failure. Check if run_task completed with an exception and either log it or re-raise before shutdown.
Suggested fix
done, pending = await asyncio.wait(
[run_task, shutdown_task],
return_when=asyncio.FIRST_COMPLETED,
)
+ if run_task in done and not run_task.cancelled():
+ exc = run_task.exception()
+ if exc:
+ logger.error("ResultProcessor consumer crashed", exc_info=exc)
+ raise exc
+
# Cancel remaining tasks
for task in pending:
task.cancel()
try:
await task
except asyncio.CancelledError:
pass📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| logger.info("ResultProcessor consumer initialized, starting run...") | |
| logger.info("ResultProcessor started and running") | |
| try: | |
| # Run consumer until shutdown signal | |
| run_task = asyncio.create_task(consumer.run()) | |
| shutdown_task = asyncio.create_task(shutdown_event.wait()) | |
| done, pending = await asyncio.wait( | |
| [run_task, shutdown_task], | |
| return_when=asyncio.FIRST_COMPLETED, | |
| ) | |
| # Wait for shutdown signal or service to stop | |
| while processor._state == ProcessingState.PROCESSING and not shutdown_event.is_set(): | |
| await asyncio.sleep(60) | |
| status = await processor.get_status() | |
| logger.info(f"ResultProcessor status: {status}") | |
| # Cancel remaining tasks | |
| for task in pending: | |
| task.cancel() | |
| try: | |
| await task | |
| except asyncio.CancelledError: | |
| pass | |
| logger.info("ResultProcessor consumer initialized, starting run...") | |
| try: | |
| # Run consumer until shutdown signal | |
| run_task = asyncio.create_task(consumer.run()) | |
| shutdown_task = asyncio.create_task(shutdown_event.wait()) | |
| done, pending = await asyncio.wait( | |
| [run_task, shutdown_task], | |
| return_when=asyncio.FIRST_COMPLETED, | |
| ) | |
| if run_task in done and not run_task.cancelled(): | |
| exc = run_task.exception() | |
| if exc: | |
| logger.error("ResultProcessor consumer crashed", exc_info=exc) | |
| raise exc | |
| # Cancel remaining tasks | |
| for task in pending: | |
| task.cancel() | |
| try: | |
| await task | |
| except asyncio.CancelledError: | |
| pass |
🤖 Prompt for AI Agents
In `@backend/workers/run_result_processor.py` around lines 35 - 53, The current
shutdown sequence swallows exceptions from consumer.run() because run_task's
exception is never retrieved; after the asyncio.wait completes, inspect the
completed tasks (e.g., check if run_task is in the done set), call
run_task.result() (or handle/run_task.exception()) to surface any exception, log
it with logger.error including context, and re-raise or set an exit error before
cancelling pending tasks (shutdown_task) so failures in consumer.run() are not
treated as normal shutdown; update the logic around run_task/run_task.result(),
shutdown_task, and shutdown_event accordingly.
 try:
-    # Wait for shutdown signal or service to stop
-    while orchestrator.is_running and not shutdown_event.is_set():
-        await asyncio.sleep(1)
+    # Run orchestrator until shutdown signal
+    run_task = asyncio.create_task(run_orchestrator_tasks())
+    shutdown_task = asyncio.create_task(shutdown_event.wait())
+
+    done, pending = await asyncio.wait(
+        [run_task, shutdown_task],
+        return_when=asyncio.FIRST_COMPLETED,
+    )
+
+    # Cancel remaining tasks
+    for task in pending:
+        task.cancel()
+        try:
+            await task
+        except asyncio.CancelledError:
+            pass
Exceptions from run_task may be silently swallowed.
If run_orchestrator_tasks() raises an exception (propagated from the TaskGroup), the current flow catches pending tasks but doesn't re-raise the exception from done tasks. This could hide failures.
🔧 Proposed fix to propagate exceptions
done, pending = await asyncio.wait(
[run_task, shutdown_task],
return_when=asyncio.FIRST_COMPLETED,
)
# Cancel remaining tasks
for task in pending:
task.cancel()
try:
await task
except asyncio.CancelledError:
pass
+ # Re-raise any exception from completed tasks (except shutdown)
+ for task in done:
+ if task is not shutdown_task:
+ task.result() # Raises if task had an exception📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| try: | |
| # Wait for shutdown signal or service to stop | |
| while orchestrator.is_running and not shutdown_event.is_set(): | |
| await asyncio.sleep(1) | |
| # Run orchestrator until shutdown signal | |
| run_task = asyncio.create_task(run_orchestrator_tasks()) | |
| shutdown_task = asyncio.create_task(shutdown_event.wait()) | |
| done, pending = await asyncio.wait( | |
| [run_task, shutdown_task], | |
| return_when=asyncio.FIRST_COMPLETED, | |
| ) | |
| # Cancel remaining tasks | |
| for task in pending: | |
| task.cancel() | |
| try: | |
| await task | |
| except asyncio.CancelledError: | |
| pass | |
| try: | |
| # Run orchestrator until shutdown signal | |
| run_task = asyncio.create_task(run_orchestrator_tasks()) | |
| shutdown_task = asyncio.create_task(shutdown_event.wait()) | |
| done, pending = await asyncio.wait( | |
| [run_task, shutdown_task], | |
| return_when=asyncio.FIRST_COMPLETED, | |
| ) | |
| # Cancel remaining tasks | |
| for task in pending: | |
| task.cancel() | |
| try: | |
| await task | |
| except asyncio.CancelledError: | |
| pass | |
| # Re-raise any exception from completed tasks (except shutdown) | |
| for task in done: | |
| if task is not shutdown_task: | |
| task.result() # Raises if task had an exception |
🤖 Prompt for AI Agents
In `@backend/workers/run_saga_orchestrator.py` around lines 54 - 70, The current
shutdown logic cancels pending tasks but can silently swallow exceptions from
completed tasks like run_task; after asyncio.wait returns, inspect the done set
(e.g., run_task in done) and propagate any errors by awaiting the completed
tasks or checking task.exception() and re-raising it before returning/finishing;
keep existing cancellation/await handling for pending tasks (shutdown_task) but
ensure exceptions from run_orchestrator_tasks are not ignored so the
orchestrator failure surfaces to the caller.
5 issues found across 38 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/app/services/idempotency/faststream_middleware.py">
<violation number="1" location="backend/app/services/idempotency/faststream_middleware.py:80">
P2: If `mark_failed` throws an exception (e.g., network error), the original handler exception will be masked. Wrap `mark_failed` in a try-except to preserve the original exception.</violation>
</file>
<file name="backend/pyproject.toml">
<violation number="1" location="backend/pyproject.toml:92">
P2: Pin the dataclasses-avroschema dependency to a specific version to keep dependency resolution reproducible and consistent with the rest of the project.</violation>
<violation number="2" location="backend/pyproject.toml:127">
P2: Pin faststream to a specific version (or add an upper bound) to avoid unexpected breaking changes during dependency resolution.</violation>
</file>
<file name="backend/workers/run_result_processor.py">
<violation number="1" location="backend/workers/run_result_processor.py:157">
P2: The catch-all handler silently discards unhandled events without logging. This could mask routing issues or missing handlers for new event types. Consider logging a warning when unknown events are received.</violation>
</file>
<file name="backend/workers/run_k8s_worker.py">
<violation number="1" location="backend/workers/run_k8s_worker.py:142">
P2: Silent catch-all handler will discard unmatched events without any logging. This can mask bugs in filter logic or cause silent data loss when new event types are added. Consider logging unmatched events at warning level to aid debugging.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
        await idempotency.mark_completed(event=event, key_strategy="event_based")
        return handler_result
    except Exception as e:
        await idempotency.mark_failed(event=event, error=str(e), key_strategy="event_based")
P2: If mark_failed throws an exception (e.g., network error), the original handler exception will be masked. Wrap mark_failed in a try-except to preserve the original exception.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/app/services/idempotency/faststream_middleware.py, line 80:
<comment>If `mark_failed` throws an exception (e.g., network error), the original handler exception will be masked. Wrap `mark_failed` in a try-except to preserve the original exception.</comment>
<file context>
@@ -0,0 +1,81 @@
+ await idempotency.mark_completed(event=event, key_strategy="event_based")
+ return handler_result
+ except Exception as e:
+ await idempotency.mark_failed(event=event, error=str(e), key_strategy="event_based")
+ raise
</file context>
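One way to apply this, sketched against the snippet above — the handler invocation and the logger name are assumptions, not the file's actual code:

try:
    handler_result = await call_next(msg)  # assumed handler invocation
    await idempotency.mark_completed(event=event, key_strategy="event_based")
    return handler_result
except Exception as e:
    try:
        await idempotency.mark_failed(event=event, error=str(e), key_strategy="event_based")
    except Exception:
        # A failing mark_failed (e.g., Redis outage) must not mask the
        # original handler exception re-raised below.
        logger.warning("mark_failed raised while recording failure", exc_info=True)
    raise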
| "yarl==1.20.1", | ||
| "zipp==3.20.2", | ||
| "monggregate==0.22.1", | ||
| "faststream[kafka]>=0.6.0", |
P2: Pin faststream to a specific version (or add an upper bound) to avoid unexpected breaking changes during dependency resolution.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/pyproject.toml, line 127:
<comment>Pin faststream to a specific version (or add an upper bound) to avoid unexpected breaking changes during dependency resolution.</comment>
<file context>
@@ -124,6 +124,7 @@ dependencies = [
"yarl==1.20.1",
"zipp==3.20.2",
"monggregate==0.22.1",
+ "faststream[kafka]>=0.6.0",
]
</file context>
| "faststream[kafka]>=0.6.0", | |
| "faststream[kafka]==0.6.0", |
| "pyasn1_modules==0.4.2", | ||
| "pydantic==2.9.2", | ||
| "pydantic-avro==0.9.1", | ||
| "dataclasses-avroschema[pydantic]>=0.65.0", |
P2: Pin the dataclasses-avroschema dependency to a specific version to keep dependency resolution reproducible and consistent with the rest of the project.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/pyproject.toml, line 92:
<comment>Pin the dataclasses-avroschema dependency to a specific version to keep dependency resolution reproducible and consistent with the rest of the project.</comment>
<file context>
@@ -89,7 +89,7 @@ dependencies = [
"pyasn1_modules==0.4.2",
"pydantic==2.9.2",
- "pydantic-avro==0.9.1",
+ "dataclasses-avroschema[pydantic]>=0.65.0",
"pydantic-settings==2.5.2",
"pydantic_core==2.23.4",
</file context>
| "dataclasses-avroschema[pydantic]>=0.65.0", | |
| "dataclasses-avroschema[pydantic]==0.65.0", |
    # Default handler for unmatched events (prevents message loss)
    @subscriber()
    async def handle_other(event: DomainEvent) -> None:
P2: The catch-all handler silently discards unhandled events without logging. This could mask routing issues or missing handlers for new event types. Consider logging a warning when unknown events are received.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/workers/run_result_processor.py, line 157:
<comment>The catch-all handler silently discards unhandled events without logging. This could mask routing issues or missing handlers for new event types. Consider logging a warning when unknown events are received.</comment>
<file context>
@@ -76,9 +75,106 @@ def main() -> None:
+
+ # Default handler for unmatched events (prevents message loss)
+ @subscriber()
+ async def handle_other(event: DomainEvent) -> None:
+ pass
+
</file context>
    # Default handler for unmatched events (prevents message loss)
    @subscriber()
    async def handle_other(event: DomainEvent) -> None:
P2: Silent catch-all handler will discard unmatched events without any logging. This can mask bugs in filter logic or cause silent data loss when new event types are added. Consider logging unmatched events at warning level to aid debugging.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/workers/run_k8s_worker.py, line 142:
<comment>Silent catch-all handler will discard unmatched events without any logging. This can mask bugs in filter logic or cause silent data loss when new event types are added. Consider logging unmatched events at warning level to aid debugging.</comment>
<file context>
@@ -88,9 +69,99 @@ def main() -> None:
+
+ # Default handler for unmatched events (prevents message loss)
+ @subscriber()
+ async def handle_other(event: DomainEvent) -> None:
+ pass
+
</file context>
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/services/pod_monitor/event_mapper.py (1)
11-21: Guard ResourceUsageAvro construction when resource_usage is null.
If executor output includes "resource_usage": null, the current call raises and drops the event. Use a safe fallback to {}.
🐛 Suggested fix
 return PodLogs(
     stdout=data.get("stdout", ""),
     stderr=data.get("stderr", ""),
     exit_code=data.get("exit_code", 0),
-    resource_usage=ResourceUsageAvro(**data.get("resource_usage", {})),
+    resource_usage=ResourceUsageAvro(**(data.get("resource_usage") or {})),
 )

Also applies to: 40-47, 491-502
🤖 Fix all issues with AI agents
In `@backend/tests/integration/conftest.py`:
- Around line 74-83: The sse_redis_bus fixture calls SSERedisBus with
redis_client as the only positional argument but SSERedisBus.__init__ expects
logger as the second positional parameter; fix by passing the test logger as the
second positional argument (e.g., SSERedisBus(redis_client, _test_logger,
exec_prefix=..., notif_prefix=...)) or convert to all keywords including
logger=_test_logger to ensure the logger is provided correctly.
In `@backend/tests/integration/result_processor/test_result_processor.py`:
- Around line 72-83: Change the consumer setup in the test to avoid replaying
historical Kafka events: when creating the test ConsumerConfig (the instance
assigned to processor_consumer_config) set auto_offset_reset="latest" instead of
"earliest" and avoid extremely small batch sizes (e.g., remove or increase
max_poll_records=1) so the consumer won't process long retention history slowly;
also ensure test code starts the consumer (the consumer created from
processor_consumer_config) before producing the test events so only new messages
are consumed.
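A sketch of the adjusted test config this prompt describes; the field names mirror the prompt and the real ConsumerConfig may differ:

processor_consumer_config = ConsumerConfig(
    bootstrap_servers=kafka_bootstrap_servers,  # assumed test fixture value
    group_id=f"test-result-processor-{uuid.uuid4().hex[:6]}",
    auto_offset_reset="latest",  # only consume events produced after startup
    # max_poll_records left at its default instead of 1, so a batch of test
    # events is drained in one poll rather than one message per round trip
)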
♻️ Duplicate comments (4)
backend/app/core/providers.py (1)
461-470: SSEConnectionRegistry must useScope.APP, notScope.REQUEST.With
Scope.REQUEST, each HTTP request creates a new registry instance with an empty_active_connectionsdictionary. This causesget_connection_count()to always return 0 from the health endpoint since registrations from SSE streams are invisible to other requests. The registry is designed as application-wide infrastructure to track connections globally.- `@provide`(scope=Scope.REQUEST) + `@provide`(scope=Scope.APP) def get_sse_connection_registry(backend/workers/run_pod_monitor.py (1)
74-88: Exceptions frommonitor.run()are silently swallowed.When
asyncio.wait()returns, ifmonitor_taskcompleted with an exception (e.g., K8s API failure), that exception is never retrieved from thedoneset. The worker exits normally without logging the actual failure cause.🐛 Proposed fix
done, pending = await asyncio.wait( [monitor_task, shutdown_task], return_when=asyncio.FIRST_COMPLETED, ) + # Check for exceptions in completed tasks + for task in done: + if task is not shutdown_task and not task.cancelled(): + exc = task.exception() + if exc: + logger.error("PodMonitor crashed", exc_info=exc) + raise exc + for task in pending: task.cancel() with suppress(asyncio.CancelledError): await taskbackend/app/core/dishka_lifespan.py (1)
105-119: Background task failures may go unnoticed until shutdown.Tasks created with
asyncio.create_task()that fail during normal operation will have their exceptions silently swallowed until shutdown whengather(*tasks, return_exceptions=True)is called. Ifrun_sse_consumers(),event_store_consumer.run(), ornotification_service.run()raises an exception, the application continues running in a degraded state without any indication.Consider adding done callbacks to log failures and optionally trigger shutdown on critical task failures.
🛠️ Suggested supervision pattern
tasks = [ asyncio.create_task(run_sse_consumers(), name="sse_consumers"), asyncio.create_task(event_store_consumer.run(), name="event_store_consumer"), asyncio.create_task(notification_service.run(), name="notification_service"), ] + + def _handle_task_done(task: asyncio.Task) -> None: + if task.cancelled(): + return + exc = task.exception() + if exc: + logger.error( + "Background task failed", + extra={"task_name": task.get_name()}, + exc_info=exc, + ) + + for task in tasks: + task.add_done_callback(_handle_task_done) + logger.info( "Background services started", extra={"sse_consumer_count": len(sse_consumers)}, )backend/app/services/saga/saga_logic.py (1)
315-325: Timeout loop should not die on transient exceptions.Any unexpected exception inside
check_timeouts_once()will terminate the loop and stop timeout enforcement. Wrap the loop body to log-and-continue on generic exceptions.🛠️ Suggested hardening
async def check_timeouts_loop(self) -> None: """Check for saga timeouts (runs until cancelled).""" try: while True: # Check every 30 seconds await asyncio.sleep(30) - await self.check_timeouts_once() + try: + await self.check_timeouts_once() + except Exception: + self.logger.error("Timeout checker failed; continuing loop", exc_info=True) except asyncio.CancelledError: self.logger.info("Timeout checker cancelled")
🧹 Nitpick comments (8)
backend/app/services/saga/execution_saga.py (1)
62-100: Useevent.execution_idas a fallback for robustness.
Ifexecution_idis ever missing in context (future refactors or alternate entry points), this step will fail before publishing. A simple fallback makes it resilient.♻️ Suggested tweak
- execution_id = context.get("execution_id") + execution_id = context.get("execution_id") or event.execution_idbackend/workers/run_result_processor.py (1)
134-158: Handlers call private methods (_handle_*) onProcessorLogic.The handlers directly invoke methods prefixed with
_(e.g.,_handle_completed,_handle_failed), which conventionally indicates internal/private methods not intended for external use. Consider either making these methods public (remove underscore prefix) or providing a public interface method.backend/app/services/idempotency/faststream_middleware.py (2)
44-50: Fail-open approach silently skips idempotency on DI failures.The broad
except Exceptioncatch ensures message processing continues even when dependencies are unavailable. Consider logging a warning here to aid debugging when idempotency is unexpectedly skipped due to misconfiguration.♻️ Suggested improvement
# Resolve dependencies from request-scoped container try: idempotency = await container.get(IdempotencyManager) schema_registry = await container.get(SchemaRegistryManager) - except Exception: + except Exception as e: # Dependencies not available - skip idempotency + # Log at debug level to help diagnose misconfiguration + import logging + logging.getLogger(__name__).debug( + "Skipping idempotency: failed to resolve dependencies: %s", e + ) return await call_next(msg)
58-62: Hardcoded topic string "idempotency" passed to deserialize_event.The
topicparameter indeserialize_eventappears to be used only for error context. If the actual topic is available frommsg, consider passing it for more accurate error messages during debugging.backend/workers/run_saga_orchestrator.py (1)
160-161: Consider structured logging for consistency.Using f-string interpolation for logging is inconsistent with the structured logging pattern (
extra={}) used elsewhere in the codebase.♻️ Suggested change
- app_logger.info(f"Subscribing to topics: {topics}") + app_logger.info("Subscribing to topics", extra={"topics": topics})backend/app/events/schema/schema_registry.py (1)
84-102: Minor redundancy: AvroSchema created twice.
_ensure_schema_registeredalready creates anAvroSchemaobject (line 61), butserialize_eventcreates it again (line 92). Consider caching theAvroSchemaobject alongside the schema ID to avoid redundant creation.backend/app/services/result_processor/processor_logic.py (2)
54-66: Consider explicit type check instead of assert for production safety.Using
assert isinstance(event, ...)for type narrowing works, but assertions are stripped when Python runs with-O(optimized mode). If this could happen in production, the type check would be silently skipped.♻️ Safer alternative
async def _handle_completed_wrapper(self, event: DomainEvent) -> None: - assert isinstance(event, ExecutionCompletedEvent) + if not isinstance(event, ExecutionCompletedEvent): + raise TypeError(f"Expected ExecutionCompletedEvent, got {type(event).__name__}") await self._handle_completed(event)Apply similarly to
_handle_failed_wrapperand_handle_timeout_wrapper.
177-193: Hardcodedservice_version="1.0.0"in event metadata.Consider reading the version from settings or a shared constant to avoid manual updates when the service version changes.
@pytest_asyncio.fixture
async def sse_redis_bus(redis_client: redis.Redis) -> SSERedisBus:
    """Provide SSERedisBus with unique prefixes for test isolation."""
    suffix = uuid.uuid4().hex[:6]
    return SSERedisBus(
        redis_client,
        exec_prefix=f"sse:exec:{suffix}:",
        notif_prefix=f"sse:notif:{suffix}:",
        logger=_test_logger,
    )
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find SSERedisBus class definition
fd -t f -e py . | xargs rg -l "class SSERedisBus"
Length of output: 103
🏁 Script executed:
#!/bin/bash
# Read the SSERedisBus class definition to check __init__ signature
cat -n backend/app/services/sse/redis_bus.py | head -80
Length of output: 3507
Fix argument order in SSERedisBus initialization.
The SSERedisBus.__init__ constructor requires logger as the second positional parameter (not optional). The fixture passes redis_client positionally but omits logger from the positional arguments, causing a runtime TypeError. Either pass logger as the second positional argument or use all keyword arguments:
SSERedisBus(
redis_client,
_test_logger, # Pass as 2nd positional arg
exec_prefix=f"sse:exec:{suffix}:",
notif_prefix=f"sse:notif:{suffix}:",
)🤖 Prompt for AI Agents
In `@backend/tests/integration/conftest.py` around lines 74 - 83, The
sse_redis_bus fixture calls SSERedisBus with redis_client as the only positional
argument but SSERedisBus.__init__ expects logger as the second positional
parameter; fix by passing the test logger as the second positional argument
(e.g., SSERedisBus(redis_client, _test_logger, exec_prefix=...,
notif_prefix=...)) or convert to all keywords including logger=_test_logger to
ensure the logger is provided correctly.
backend/tests/integration/result_processor/test_result_processor.py
Outdated
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/events/core/producer.py (1)
44-77: Potential UnboundLocalError if serialization fails before topic assignment.
If serialize_event raises an exception at line 49, topic is never assigned, but line 75 references it in the error handler.
Proposed fix
 async def produce(
     self, event_to_produce: DomainEvent, key: str | None = None, headers: dict[str, str] | None = None
 ) -> None:
     """Produce a message to Kafka."""
+    topic = f"{self._topic_prefix}{EVENT_TYPE_TO_TOPIC[event_to_produce.event_type]}"
     try:
         serialized_value = await self._schema_registry.serialize_event(event_to_produce)
-        topic = f"{self._topic_prefix}{EVENT_TYPE_TO_TOPIC[event_to_produce.event_type]}"

         # Convert headers to list of tuples format
         header_list = [(k, v.encode()) for k, v in headers.items()] if headers else None
🤖 Fix all issues with AI agents
In `@backend/app/events/event_store_consumer.py`:
- Around line 41-69: The process_batch method currently logs deserialization
errors from schema_registry_manager.deserialize_event and then commits offsets
via consumer.commit, which drops bad messages; change process_batch to collect
failures (e.g., a failed_messages list or counter) while iterating messages, for
each failed msg call a DLQ sender or increment event_metrics (e.g.,
event_metrics.record_deserialization_failure(topic=..., consumer_group=...)) and
only commit offsets for partitions/messages that succeeded (or skip committing
if any partition has failures), ensuring _store_batch is called only with
successfully deserialized events and consumer.commit is adjusted to avoid losing
failed messages; update logger.error to include a unique identifier for the
failed message to aid tracing.
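A rough sketch of the flow this prompt asks for; dlq_sender and the exact metrics call are assumptions, not existing APIs:

async def process_batch(self, batch: list[Any]) -> None:
    events, failed = [], []
    for msg in batch:
        try:
            events.append(await self.schema_registry_manager.deserialize_event(msg.value, msg.topic))
        except Exception as e:
            self.logger.error(f"Deserialization failed at {msg.topic}[{msg.partition}]@{msg.offset}: {e}")
            failed.append(msg)
    for msg in failed:
        await self.dlq_sender.send(msg)  # hypothetical DLQ helper
    await self._store_batch(events)
    if not failed:  # only advance offsets when nothing was dropped
        await self.consumer.commit()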
In `@backend/app/services/k8s_worker/worker_logic.py`:
- Around line 70-163: The duplicate-creation bug occurs because
handle_create_pod_command checks _active_creations but only schedules the task,
and the ID is added later inside _create_pod_for_execution, so two concurrent
calls can both schedule tasks; fix by reserving the execution_id immediately in
handle_create_pod_command (add execution_id to self._active_creations and call
self.metrics.update_k8s_active_creations) before asyncio.create_task, and remove
the duplicate add inside _create_pod_for_execution (leave the finally block that
discards the ID and updates metrics). Also guard the scheduling with a
try/except so if task creation fails you remove the reserved ID and update
metrics.
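A sketch of the reserve-before-schedule pattern described above, reusing the names from the prompt (the metrics call signature is an assumption):

async def handle_create_pod_command(self, event: CreatePodCommandEvent) -> None:
    # Reserve the id before scheduling so two concurrent commands for the
    # same execution cannot both pass the duplicate check.
    if event.execution_id in self._active_creations:
        return
    self._active_creations.add(event.execution_id)
    self.metrics.update_k8s_active_creations(len(self._active_creations))
    try:
        asyncio.create_task(self._create_pod_for_execution(event))
    except Exception:
        # Roll back the reservation if scheduling itself fails; the task's
        # own finally block handles the normal cleanup path.
        self._active_creations.discard(event.execution_id)
        self.metrics.update_k8s_active_creations(len(self._active_creations))
        raise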
In `@backend/app/services/pod_monitor/monitor.py`:
- Around line 160-182: The synchronous kubernetes iterator returned by
self._watch.stream in _run_watch is blocking the event loop; replace the direct
call with an awaited asyncio.to_thread call (await
asyncio.to_thread(self._watch.stream, self._v1.list_namespaced_pod, **kwargs))
so the blocking stream creation runs off the event loop, then keep iterating and
awaiting self._process_raw_event(event) as before and call
_update_resource_version(stream) in the finally block; target the _run_watch
method and the usages of self._watch.stream, _process_raw_event, and
_update_resource_version when making this change.
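A minimal sketch of that change; it goes slightly beyond the prompt by also pulling each item off the blocking generator in a worker thread, since next() on a kubernetes watch stream blocks too:

async def _run_watch(self) -> None:
    # Create the blocking kubernetes watch stream off the event loop.
    stream = await asyncio.to_thread(
        self._watch.stream, self._v1.list_namespaced_pod, **kwargs
    )
    it = iter(stream)
    sentinel = object()
    try:
        while (event := await asyncio.to_thread(next, it, sentinel)) is not sentinel:
            await self._process_raw_event(event)
    finally:
        self._update_resource_version(stream)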
In `@backend/tests/unit/services/saga/test_execution_saga_steps.py`:
- Line 204: The assertion in test_execution_saga_steps.py is a no-op because `or
True` makes it always pass; update the test to either remove the `or True` or
replace the assertion with a real type check such as asserting that
QueueExecutionStep().get_compensation() returns an instance/subclass of the
SagaCompensation base (e.g., use isinstance(result, SagaCompensation) or assert
issubclass(type(result), SagaCompensation) and reference
QueueExecutionStep.get_compensation and DeletePodCompensation to ensure the
intended compensation type is validated).
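A minimal replacement assertion in the spirit of the prompt, with SagaCompensation assumed to be the base class:

comp = QueueExecutionStep().get_compensation()
assert isinstance(comp, SagaCompensation)
assert isinstance(comp, DeletePodCompensation)  # the concrete type the step should return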
♻️ Duplicate comments (9)
backend/app/services/sse/event_router.py (1)
45-46: Inefficient use ofmodel_dump()to extract a single attribute.Since SSE-relevant events have
execution_idas a direct attribute, usegetattr(event, 'execution_id', None)to avoid serializing the entire model. Additionally,SSERedisBus.publish_eventcallsmodel_dump()internally, resulting in double serialization.♻️ Proposed fix
async def route_event(self, event: DomainEvent) -> None: """Route an event to Redis for SSE delivery.""" - data = event.model_dump() - execution_id = data.get("execution_id") + execution_id = getattr(event, "execution_id", None)backend/app/services/pod_monitor/monitor.py (2)
276-295:RuntimeErroron max reconnect attempts terminates monitor without recovery path.As noted in previous reviews, raising
RuntimeErrorhere terminates_watch_loopand therun()method without any recovery mechanism. The monitor silently stops functioning.Consider either:
- Set a paused/failed state flag and exit gracefully
- Add a supervisor/restart mechanism
- Propagate via a more specific exception that callers can handle
109-129:CancelledErrorshould be re-raised to maintain async cancellation contract.As noted in previous reviews, catching
CancelledErrorwithout re-raising breaks Python's async cancellation semantics. The task will appear to complete normally instead of being cancelled.Fix: Re-raise CancelledError
try: await self._watch_loop() except asyncio.CancelledError: self.logger.info("PodMonitor cancelled") + raise finally:backend/app/services/saga/saga_logic.py (1)
308-318: Keep the timeout checker alive after transient errors.
check_timeouts_looponly catchesCancelledError, so any repository error will terminate the loop permanently. Wrap each iteration in a catch-all and log.💡 Proposed fix
async def check_timeouts_loop(self) -> None: """Check for saga timeouts (runs until cancelled).""" try: while True: # Check every 30 seconds await asyncio.sleep(30) - await self.check_timeouts_once() + try: + await self.check_timeouts_once() + except Exception as e: + self.logger.error(f"Timeout checker error: {e}", exc_info=True) except asyncio.CancelledError: self.logger.info("Timeout checker cancelled")backend/tests/unit/services/saga/test_saga_orchestrator_unit.py (1)
67-69: Strengthen the success-flow assertion.
len(logic._sagas) > 0is already true afterregister_saga, so it doesn’t validatehandle_event. Assert on a side effect, e.g., saga persisted in the repo.💡 Proposed fix
- assert len(logic._sagas) > 0 # noqa: SLF001 + found = await saga_repo.get_saga_by_execution_and_name("e", "s") + assert found is not Nonebackend/app/core/providers.py (1)
567-576: SSEConnectionRegistry should be app-scoped.With
Scope.REQUEST, each request gets a fresh registry and_active_connectionsresets, so connection counts and cross-request visibility break. UseScope.APPto keep a singleton registry.💡 Proposed fix
- `@provide`(scope=Scope.REQUEST) + `@provide`(scope=Scope.APP) def get_sse_connection_registry( self, logger: logging.Logger, connection_metrics: ConnectionMetrics, ) -> SSEConnectionRegistry:backend/workers/run_k8s_worker.py (1)
147-150: Silent catch-all handler discards unmatched events without logging.This catch-all handler silently discards any events that don't match the specific filters. While the comment says "prevents message loss," it actually causes silent data loss. Consider logging unmatched events at warning level to aid debugging and detect routing issues or missing handlers for new event types.
💡 Suggested fix
# Default handler for unmatched events (prevents message loss) `@subscriber`() - async def handle_other(event: DomainEvent) -> None: - pass + async def handle_other(event: DomainEvent) -> None: + app_logger.warning( + "Unhandled event type received", + extra={"event_type": getattr(event, "event_type", "unknown")}, + )backend/workers/run_pod_monitor.py (1)
84-92: Exceptions from completed tasks are silently swallowed.When
monitor.run()fails with an exception, the task will be in thedoneset, but the exception is never retrieved or logged. This masks critical failures and makes debugging production issues difficult.🐛 Proposed fix
done, pending = await asyncio.wait( [monitor_task, shutdown_task], return_when=asyncio.FIRST_COMPLETED, ) + # Surface failures from completed tasks + for task in done: + if task is not shutdown_task and not task.cancelled(): + if exc := task.exception(): + logger.error("PodMonitor task failed", exc_info=exc) + raise exc + for task in pending: task.cancel() with suppress(asyncio.CancelledError): await taskbackend/workers/run_result_processor.py (1)
162-165: Silent catch-all handler discards unmatched events without logging.Same issue as in
run_k8s_worker.py. The catch-all handler silently discards events that don't match specific filters. Add logging to aid debugging.💡 Suggested fix
# Default handler for unmatched events (prevents message loss) `@subscriber`() - async def handle_other(event: DomainEvent) -> None: - pass + async def handle_other(event: DomainEvent) -> None: + app_logger.warning( + "Unhandled event type received", + extra={"event_type": getattr(event, "event_type", "unknown")}, + )
🧹 Nitpick comments (10)
backend/workers/run_saga_orchestrator.py (1)
161-165: Opportunistic timeout check may miss timeouts during idle periods.If no events arrive,
check_timeouts_once()won't be called. For sagas with strict timeout requirements, consider adding a dedicated background task or using FastStream's scheduling capabilities.💡 Alternative: Add a background timeout checker
# In lifespan, after handler registration: async def timeout_loop() -> None: while True: await asyncio.sleep(30) await logic.check_timeouts_once() timeout_task = asyncio.create_task(timeout_loop()) # Cancel in finally blockHowever, if event throughput is consistently high, the opportunistic approach may be sufficient.
backend/tests/helpers/fakes/kubernetes.py (1)
37-41:FakeK8sAppsV1Apihas no methods.If tests need to interact with AppsV1Api methods (e.g., deployments, statefulsets), you'll need to add stub methods. Currently, it only satisfies basic DI wiring.
backend/tests/helpers/fakes/kafka.py (1)
38-78: Consider guarding__anext__against consuming when not started.The async iterator doesn't check
_startedstate before yielding messages. While acceptable for test fakes, this could mask bugs where tests consume from an unstarted consumer.Optional: Add started guard
async def __anext__(self) -> Any: + if not self._started: + raise StopAsyncIteration if self._messages: return self._messages.pop(0) raise StopAsyncIterationbackend/tests/helpers/fakes/schema_registry.py (2)
17-22: Type introspection is fragile - consider adding defensive handling.The
_get_all_event_classesfunction assumesDomainEventisAnnotated[Union[...], Discriminator]. If the type structure changes, this will silently return incorrect results.Add defensive check
`@lru_cache`(maxsize=1) def _get_all_event_classes() -> list[type[DomainEvent]]: """Get all concrete event classes from DomainEvent union.""" - union_type = get_args(DomainEvent)[0] # Annotated[Union[...], Discriminator] -> Union - return list(get_args(union_type)) if get_origin(union_type) else [union_type] + args = get_args(DomainEvent) + if not args: + raise TypeError("DomainEvent must be a generic type with arguments") + union_type = args[0] # Annotated[Union[...], Discriminator] -> Union + inner_args = get_args(union_type) + if not inner_args: + return [union_type] if isinstance(union_type, type) else [] + return list(inner_args)
99-106: Consider caching the event class mapping. The mapping is rebuilt on every deserialize_json call. Since _get_all_event_classes() is already cached, the mapping could be cached too.
Cache the mapping
+    @lru_cache(maxsize=1)
+    def _get_event_class_mapping(self) -> dict[EventType, type[DomainEvent]]:
+        """Get cached mapping from event type to class."""
+        return {cls.model_fields["event_type"].default: cls for cls in _get_all_event_classes()}
+
     def deserialize_json(self, data: dict[str, Any]) -> DomainEvent:
         """Deserialize JSON data to DomainEvent using event_type field."""
         if not (event_type_str := data.get("event_type")):
             raise ValueError("Missing event_type in event data")
-        mapping = {cls.model_fields["event_type"].default: cls for cls in _get_all_event_classes()}
+        mapping = self._get_event_class_mapping()
         if not (event_class := mapping.get(EventType(event_type_str))):
             raise ValueError(f"No event class found for event type: {event_type_str}")
         return event_class.model_validate(data)
Note: Since self is involved, you'd need functools.cached_property or a class-level cache instead of lru_cache.
backend/workers/run_result_processor.py (1)
140-160: Consider exposing public handler methods instead of calling private methods. The handlers directly call underscore-prefixed methods (logic._handle_completed, logic._handle_failed, logic._handle_timeout), which are conventionally private. While this works, it couples the handlers to internal implementation details. Consider adding public wrapper methods in ProcessorLogic (e.g., handle_completed, handle_failed, handle_timeout) that delegate to the private implementations, or rename the existing methods to be public if they're intended to be called externally. A sketch of the wrapper option follows.
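A minimal sketch of that wrapper approach (the method names are taken from the comment above; the event type and private implementations are assumed to already exist on the class):

    class ProcessorLogic:
        # Public facade delegating to the existing private implementations,
        # so subscribers never reach into underscore-prefixed methods.
        async def handle_completed(self, event: object) -> None:
            await self._handle_completed(event)

        async def handle_failed(self, event: object) -> None:
            await self._handle_failed(event)

        async def handle_timeout(self, event: object) -> None:
            await self._handle_timeout(event)

backend/tests/e2e/test_k8s_worker_create_pod.py (1)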
85-92: Consider wrapping cleanup in try/except to handle partial creation failures. If the test fails between ConfigMap creation and Pod creation, the cleanup at lines 91-92 will attempt to delete a non-existent pod, potentially raising an exception that masks the original failure.
💡 Suggested improvement
 # Cleanup
-    logic.v1.delete_namespaced_pod(name=f"executor-{exec_id}", namespace=ns)
-    logic.v1.delete_namespaced_config_map(name=f"script-{exec_id}", namespace=ns)
+    try:
+        logic.v1.delete_namespaced_pod(name=f"executor-{exec_id}", namespace=ns)
+    except ApiException:
+        pass  # Pod may not exist if test failed early
+    try:
+        logic.v1.delete_namespaced_config_map(name=f"script-{exec_id}", namespace=ns)
+    except ApiException:
+        pass  # ConfigMap may not exist if test failed early
Alternatively, consider using a pytest fixture with cleanup as a finalizer.
backend/app/services/sse/sse_service.py (1)
236-254: Consider dynamic health status instead of hardcoded values.The health status returns hardcoded
status="healthy"andkafka_enabled=True, but:
- The service doesn't validate its actual health (e.g., Redis connectivity)
kafka_enabled=Truemay be misleading since Kafka is now handled by the separate SSE bridge workerIf this is intentional for the refactoring phase, consider adding a TODO or updating the field semantics in
SSEHealthDomain.backend/workers/run_sse_bridge.py (1)
92-129: Lifespan structure is correct but consider error handling for schema initialization.The lifespan correctly:
- Initializes schema registry before defining the subscriber
- Cleans up the container in the finally block
However, if
initialize_event_schemasfails, the exception will propagate and prevent startup, but no specific error logging occurs. Consider wrapping with explicit error handling for better diagnostics.💡 Optional: Add explicit error handling
# Initialize schema registry schema_registry = await container.get(SchemaRegistryManager) - await initialize_event_schemas(schema_registry) + try: + await initialize_event_schemas(schema_registry) + except Exception as e: + app_logger.error(f"Failed to initialize schemas: {e}") + raisebackend/app/services/event_bus.py (1)
111-123: Consider propagating publish errors to callers.The
publishmethod catches all exceptions and logs them but doesn't raise, which means callers won't know if their event failed to publish. For cross-instance cache invalidation, this might be acceptable, but for critical events, callers may need to know about failures.If silent failures are intentional for this use case, consider documenting this behavior.
+    async def process_batch(self) -> None:
+        """Process a single batch of messages from Kafka.
-        self.consumer = UnifiedConsumer(
-            config,
-            event_dispatcher=self.dispatcher,
-            schema_registry=self.schema_registry_manager,
-            settings=self.settings,
-            logger=self.logger,
-            event_metrics=self.event_metrics,
+        Called repeatedly by DI provider's background task.
+        """
+        batch_data: dict[TopicPartition, list[ConsumerRecord[bytes, bytes]]] = await self.consumer.getmany(
+            timeout_ms=self.batch_timeout_ms,
+            max_records=self.batch_size,
+        )

-        # Register handler for all event types - store everything
-        for event_type in EventType:
-            self.dispatcher.register(event_type)(self._handle_event)

-        # Register error callback - use DLQ if producer is available
-        if self.producer:
-            # Use DLQ handler with retry logic
-            dlq_handler = create_dlq_error_handler(
-                producer=self.producer,
-                original_topic="event-store",  # Generic topic name for event store
-                logger=self.logger,
-                max_retries=3,
-            )
-            self.consumer.register_error_callback(dlq_handler)
-        else:
-            # Fallback to simple logging
-            self.consumer.register_error_callback(self._handle_error_with_event)

-        await self.consumer.start(self.topics)

-        self._batch_task = asyncio.create_task(self._batch_processor())

-        self.logger.info(f"Event store consumer started for topics: {self.topics}")

-    async def _on_stop(self) -> None:
-        """Stop consumer."""
-        await self._flush_batch()

-        if self._batch_task:
-            self._batch_task.cancel()
-            try:
-                await self._batch_task
-            except asyncio.CancelledError:
-                pass

-        if self.consumer:
-            await self.consumer.stop()

-        self.logger.info("Event store consumer stopped")

-    async def _handle_event(self, event: DomainEvent) -> None:
-        """Handle incoming event from dispatcher."""
-        self.logger.info(f"Event store received event: {event.event_type} - {event.event_id}")

-        async with self._batch_lock:
-            self._batch_buffer.append(event)

-            if len(self._batch_buffer) >= self.batch_size:
-                await self._flush_batch()

-    async def _handle_error_with_event(self, error: Exception, event: DomainEvent) -> None:
-        """Handle processing errors with event context."""
-        self.logger.error(f"Error processing event {event.event_id} ({event.event_type}): {error}", exc_info=True)

-    async def _batch_processor(self) -> None:
-        """Periodically flush batches based on timeout."""
-        while self.is_running:
-            try:
-                await asyncio.sleep(1)

-                async with self._batch_lock:
-                    time_since_last_batch = asyncio.get_running_loop().time() - self._last_batch_time

-                    if self._batch_buffer and time_since_last_batch >= self.batch_timeout:
-                        await self._flush_batch()

-            except Exception as e:
-                self.logger.error(f"Error in batch processor: {e}")

-    async def _flush_batch(self) -> None:
-        if not self._batch_buffer:
+        if not batch_data:
             return

-        batch = self._batch_buffer.copy()
-        self._batch_buffer.clear()
-        self._last_batch_time = asyncio.get_running_loop().time()
+        events: list[DomainEvent] = []
+        for tp, messages in batch_data.items():
+            for msg in messages:
+                try:
+                    event = await self.schema_registry_manager.deserialize_event(msg.value, msg.topic)
+                    events.append(event)
+                    self.event_metrics.record_kafka_message_consumed(
+                        topic=msg.topic,
+                        consumer_group=str(self.group_id),
+                    )
+                except Exception as e:
+                    self.logger.error(f"Failed to deserialize message from {tp}: {e}", exc_info=True)

+        if events:
+            await self._store_batch(events)
+        await self.consumer.commit()
Messages that fail deserialization are silently dropped after commit.
When deserialization fails (line 64-65), the message is logged but the batch continues. After _store_batch, offsets are committed (line 69), permanently losing the failed messages. Consider either:
- Sending failed messages to DLQ before commit
- Tracking deserialization failures in metrics for alerting
Proposed fix: Track failed messages and optionally skip commit if all fail
events: list[DomainEvent] = []
+ failed_count = 0
for tp, messages in batch_data.items():
for msg in messages:
try:
event = await self.schema_registry_manager.deserialize_event(msg.value, msg.topic)
events.append(event)
self.event_metrics.record_kafka_message_consumed(
topic=msg.topic,
consumer_group=str(self.group_id),
)
except Exception as e:
self.logger.error(f"Failed to deserialize message from {tp}: {e}", exc_info=True)
+ failed_count += 1
+ # Consider: self.event_metrics.record_deserialization_error(...)
if events:
await self._store_batch(events)
- await self.consumer.commit()
+
+ # Commit even if some failed - they're lost but we progress
+ # Alternative: send failed to DLQ before committing
+ if events or failed_count > 0:
+ await self.consumer.commit()
+ if failed_count > 0:
+ self.logger.warning(f"Committed batch with {failed_count} deserialization failures")🤖 Prompt for AI Agents
In `@backend/app/events/event_store_consumer.py` around lines 41 - 69, The
process_batch method currently logs deserialization errors from
schema_registry_manager.deserialize_event and then commits offsets via
consumer.commit, which drops bad messages; change process_batch to collect
failures (e.g., a failed_messages list or counter) while iterating messages, for
each failed msg call a DLQ sender or increment event_metrics (e.g.,
event_metrics.record_deserialization_failure(topic=..., consumer_group=...)) and
only commit offsets for partitions/messages that succeeded (or skip committing
if any partition has failures), ensuring _store_batch is called only with
successfully deserialized events and consumer.commit is adjusted to avoid losing
failed messages; update logger.error to include a unique identifier for the
failed message to aid tracing.
    async def handle_create_pod_command(self, command: CreatePodCommandEvent) -> None:
        """Handle create pod command from saga orchestrator."""
        execution_id = command.execution_id

        # Check if already processing
        if execution_id in self._active_creations:
            self.logger.warning(f"Already creating pod for execution {execution_id}")
            return

        # Create pod asynchronously
        asyncio.create_task(self._create_pod_for_execution(command))

    async def handle_delete_pod_command(self, command: DeletePodCommandEvent) -> None:
        """Handle delete pod command from saga orchestrator (compensation)."""
        execution_id = command.execution_id
        self.logger.info(f"Deleting pod for execution {execution_id} due to: {command.reason}")

        try:
            # Delete the pod
            pod_name = f"executor-{execution_id}"
            await asyncio.to_thread(
                self.v1.delete_namespaced_pod,
                name=pod_name,
                namespace=self.config.namespace,
                grace_period_seconds=30,
            )
            self.logger.info(f"Successfully deleted pod {pod_name}")

            # Delete associated ConfigMap
            configmap_name = f"script-{execution_id}"
            await asyncio.to_thread(
                self.v1.delete_namespaced_config_map, name=configmap_name, namespace=self.config.namespace
            )
            self.logger.info(f"Successfully deleted ConfigMap {configmap_name}")

            # NetworkPolicy cleanup is managed via a static cluster policy; no per-execution NP deletion

        except ApiException as e:
            if e.status == 404:
                self.logger.warning(
                    f"Resources for execution {execution_id} not found (may have already been deleted)"
                )
            else:
                self.logger.error(f"Failed to delete resources for execution {execution_id}: {e}")

    async def _create_pod_for_execution(self, command: CreatePodCommandEvent) -> None:
        """Create pod for execution."""
        async with self._creation_semaphore:
            execution_id = command.execution_id
            self._active_creations.add(execution_id)
            self.metrics.update_k8s_active_creations(len(self._active_creations))

            start_time = time.time()

            try:
                # We now have the CreatePodCommandEvent directly from saga
                script_content = command.script
                entrypoint_content = await self._get_entrypoint_script()

                # Create ConfigMap
                config_map = self.pod_builder.build_config_map(
                    command=command, script_content=script_content, entrypoint_content=entrypoint_content
                )

                await self._create_config_map(config_map)

                pod = self.pod_builder.build_pod_manifest(command=command)
                await self._create_pod(pod)

                # Publish PodCreated event
                await self._publish_pod_created(command, pod)

                # Update metrics
                duration = time.time() - start_time
                self.metrics.record_k8s_pod_creation_duration(duration, command.language)
                self.metrics.record_k8s_pod_created("success", command.language)

                self.logger.info(
                    f"Successfully created pod {pod.metadata.name} for execution {execution_id}. "
                    f"Duration: {duration:.2f}s"
                )

            except Exception as e:
                self.logger.error(f"Failed to create pod for execution {execution_id}: {e}", exc_info=True)

                # Update metrics
                self.metrics.record_k8s_pod_created("failed", "unknown")

                # Publish failure event
                await self._publish_pod_creation_failed(command, str(e))

            finally:
                self._active_creations.discard(execution_id)
                self.metrics.update_k8s_active_creations(len(self._active_creations))
Prevent duplicate pod creation on concurrent commands.
handle_create_pod_command checks _active_creations before the task adds the ID, so two concurrent calls for the same execution can both enqueue tasks and emit duplicate events. Reserve the ID before scheduling the task and avoid double-add inside _create_pod_for_execution.
💡 Proposed fix
async def handle_create_pod_command(self, command: CreatePodCommandEvent) -> None:
"""Handle create pod command from saga orchestrator."""
execution_id = command.execution_id
# Check if already processing
if execution_id in self._active_creations:
self.logger.warning(f"Already creating pod for execution {execution_id}")
return
+ # Reserve execution_id before scheduling to prevent dupes
+ self._active_creations.add(execution_id)
+ self.metrics.update_k8s_active_creations(len(self._active_creations))
+
# Create pod asynchronously
asyncio.create_task(self._create_pod_for_execution(command))
async def _create_pod_for_execution(self, command: CreatePodCommandEvent) -> None:
"""Create pod for execution."""
async with self._creation_semaphore:
execution_id = command.execution_id
- self._active_creations.add(execution_id)
- self.metrics.update_k8s_active_creations(len(self._active_creations))🤖 Prompt for AI Agents
In `@backend/app/services/k8s_worker/worker_logic.py` around lines 70 - 163, The
duplicate-creation bug occurs because handle_create_pod_command checks
_active_creations but only schedules the task, and the ID is added later inside
_create_pod_for_execution, so two concurrent calls can both schedule tasks; fix
by reserving the execution_id immediately in handle_create_pod_command (add
execution_id to self._active_creations and call
self.metrics.update_k8s_active_creations) before asyncio.create_task, and remove
the duplicate add inside _create_pod_for_execution (leave the finally block that
discards the ID and updates metrics). Also guard the scheduling with a
try/except so if task creation fails you remove the reserved ID and update
metrics.
     async def _run_watch(self) -> None:
         """Run a single watch session."""
         self.logger.info(
             f"Starting pod watch: selector={self.config.label_selector}, namespace={self.config.namespace}"
         )

         # Create watch stream
-        kwargs = {
-            "namespace": context.namespace,
-            "label_selector": context.label_selector,
-            "timeout_seconds": context.timeout_seconds,
+        kwargs: dict[str, Any] = {
+            "namespace": self.config.namespace,
+            "label_selector": self.config.label_selector,
+            "timeout_seconds": self.config.watch_timeout_seconds,
         }
+        if self.config.field_selector:
+            kwargs["field_selector"] = self.config.field_selector
+        if self._last_resource_version:
+            kwargs["resource_version"] = self._last_resource_version

-        if context.field_selector:
-            kwargs["field_selector"] = context.field_selector
-
-        if context.resource_version:
-            kwargs["resource_version"] = context.resource_version

+        # Watch stream (clients guaranteed by __init__)
         stream = self._watch.stream(self._v1.list_namespaced_pod, **kwargs)

         try:
             for event in stream:
                 if self._state != MonitorState.RUNNING:
                     break

                 await self._process_raw_event(event)

         finally:
             # Store resource version for next watch
             self._update_resource_version(stream)
🧩 Analysis chain
🏁 Script executed:
# First, let's check the kubernetes watch.stream implementation and usage patterns
rg -n "watch.stream" --type=py -B 2 -A 5
# Check if there's async handling elsewhere
rg -n "to_thread.*stream" --type=pyRepository: HardMax71/Integr8sCode
Length of output: 1456
🏁 Script executed:
# Also check how self._watch is initialized
rg -n "self\._watch.*=" --type=py -B 1 -A 1Repository: HardMax71/Integr8sCode
Length of output: 265
🏁 Script executed:
# Check the full monitor.py file to understand the context
wc -l backend/app/services/pod_monitor/monitor.pyRepository: HardMax71/Integr8sCode
Length of output: 113
🏁 Script executed:
# Get the imports in monitor.py to understand what libraries are being used
head -50 backend/app/services/pod_monitor/monitor.pyRepository: HardMax71/Integr8sCode
Length of output: 1273
🏁 Script executed:
# Check for asyncio.to_thread or executor usage around this code
rg -n "to_thread|run_in_executor|ThreadPoolExecutor" --type=py backend/app/services/pod_monitor/Repository: HardMax71/Integr8sCode
Length of output: 265
🏁 Script executed:
# Get the full _run_watch method and surrounding context
sed -n '160,185p' backend/app/services/pod_monitor/monitor.pyRepository: HardMax71/Integr8sCode
Length of output: 1085
🏁 Script executed:
# Check for any middleware or context managers that might handle this
rg -n "contextmanager|generator|async for" --type=py backend/app/services/pod_monitor/monitor.pyRepository: HardMax71/Integr8sCode
Length of output: 48
🏁 Script executed:
# Check what _process_raw_event does
rg -n "def _process_raw_event" --type=py -A 15 backend/app/services/pod_monitor/monitor.pyRepository: HardMax71/Integr8sCode
Length of output: 762
🏁 Script executed:
# Look at the broader _run_watch context including the try-except
sed -n '155,195p' backend/app/services/pod_monitor/monitor.pyRepository: HardMax71/Integr8sCode
Length of output: 1706
🏁 Script executed:
# Check if there are any known async watch patterns in tests or elsewhere
rg -n "watch|stream" --type=py backend/tests/ | head -30Repository: HardMax71/Integr8sCode
Length of output: 3314
🏁 Script executed:
# Check the requirements to see which version of kubernetes library is used
cat backend/requirements.txt | grep -i kubernetesRepository: HardMax71/Integr8sCode
Length of output: 122
🏁 Script executed:
# Look for any existing issues or comments about watch blocking
rg -n "block|thread|concurrent" --type=py backend/app/services/pod_monitor/monitor.pyRepository: HardMax71/Integr8sCode
Length of output: 177
🏁 Script executed:
# Check if there's any configuration or documentation about this behavior
find backend -name "*.md" -o -name "*.rst" -o -name "*.txt" | xargs grep -l "watch\|async" 2>/dev/null | head -5Repository: HardMax71/Integr8sCode
Length of output: 48
🏁 Script executed:
# Search for requirements files in various locations
find backend -name "*requirements*" -o -name "pyproject.toml" -o -name "poetry.lock" | head -10Repository: HardMax71/Integr8sCode
Length of output: 88
🏁 Script executed:
# Check the FakeWatch implementation to understand the expected pattern
sed -n '150,180p' backend/tests/helpers/k8s_fakes.pyRepository: HardMax71/Integr8sCode
Length of output: 1005
🏁 Script executed:
# Check if there are any GitHub issues or comments in the codebase about async watch
rg -n "TODO|FIXME|XXX|HACK|BUG" --type=py backend/app/services/pod_monitor/monitor.pyRepository: HardMax71/Integr8sCode
Length of output: 48
🏁 Script executed:
# Check pyproject.toml for kubernetes dependency
cat backend/pyproject.toml | grep -A 10 -B 2 "kubernetes\|dependencies"Repository: HardMax71/Integr8sCode
Length of output: 804
Wrap the watch stream iteration in asyncio.to_thread() to avoid blocking the event loop.
The self._watch.stream() call returns a synchronous iterator from the kubernetes library. Using for event in stream: blocks the event loop while waiting for events, preventing other async tasks from running. The codebase already uses the asyncio.to_thread() pattern elsewhere (lines 117, 305) for blocking I/O. Apply the same pattern here:
    # Create the generator off-loop, then pull each event via a worker thread
    # so the blocking next() call never stalls the event loop.
    stream = await asyncio.to_thread(
        self._watch.stream, self._v1.list_namespaced_pod, **kwargs
    )
    try:
        while (event := await asyncio.to_thread(next, stream, None)) is not None:
            await self._process_raw_event(event)

Note that wrapping only the stream() call is not sufficient: stream() returns a generator immediately, and the blocking happens on each next(), so each pull must also run in a thread.
In `@backend/app/services/pod_monitor/monitor.py` around lines 160 - 182, The
synchronous kubernetes iterator returned by self._watch.stream in _run_watch is
blocking the event loop; replace the direct call with an awaited
asyncio.to_thread call (await asyncio.to_thread(self._watch.stream,
self._v1.list_namespaced_pod, **kwargs)) so the blocking stream creation runs
off the event loop, then keep iterating and awaiting
self._process_raw_event(event) as before and call
_update_resource_version(stream) in the finally block; target the _run_watch
method and the usages of self._watch.stream, _process_raw_event, and
_update_resource_version when making this change.
     assert ValidateExecutionStep().get_compensation() is None
-    assert isinstance(AllocateResourcesStep(_FakeAllocRepo()).get_compensation(), ReleaseResourcesCompensation)
+    assert isinstance(AllocateResourcesStep(alloc_repo).get_compensation(), ReleaseResourcesCompensation)
     assert isinstance(QueueExecutionStep().get_compensation(), type(DeletePodCompensation(None)).__bases__[0]) or True
Assertion is a no-op (or True makes it always pass).
The expression isinstance(...) or True will always evaluate to True, making this assertion meaningless. Either fix the assertion or remove it if the check isn't needed.
🐛 Proposed fix
- assert isinstance(QueueExecutionStep().get_compensation(), type(DeletePodCompensation(None)).__bases__[0]) or True
+ # QueueExecutionStep has no compensation
+ assert QueueExecutionStep().get_compensation() is NoneOr if the intent was to check that it returns a SagaCompensation subclass:
- assert isinstance(QueueExecutionStep().get_compensation(), type(DeletePodCompensation(None)).__bases__[0]) or True
+ from app.services.saga.saga_step import SagaCompensation
+ comp = QueueExecutionStep().get_compensation()
+ assert comp is None or isinstance(comp, SagaCompensation)🤖 Prompt for AI Agents
In `@backend/tests/unit/services/saga/test_execution_saga_steps.py` at line 204,
The assertion in test_execution_saga_steps.py is a no-op because `or True` makes
it always pass; update the test to either remove the `or True` or replace the
assertion with a real type check such as asserting that
QueueExecutionStep().get_compensation() returns an instance/subclass of the
SagaCompensation base (e.g., use isinstance(result, SagaCompensation) or assert
issubclass(type(result), SagaCompensation) and reference
QueueExecutionStep.get_compensation and DeletePodCompensation to ensure the
intended compensation type is validated).
1 issue found across 47 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/tests/unit/services/saga/test_execution_saga_steps.py">
<violation number="1" location="backend/tests/unit/services/saga/test_execution_saga_steps.py:60">
P2: This test inserts 100 active allocation documents into the session-scoped unit DB but never removes them, so data persists across tests and can cause order-dependent failures. Clean up the inserted allocations (e.g., delete them in a finally block) to keep unit tests isolated.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
    assert ctx.get("allocation_id") is not None

    # Test 2: Limit exceeded (insert 100 active allocations)
    for i in range(100):
P2: This test inserts 100 active allocation documents into the session-scoped unit DB but never removes them, so data persists across tests and can cause order-dependent failures. Clean up the inserted allocations (e.g., delete them in a finally block) to keep unit tests isolated.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/tests/unit/services/saga/test_execution_saga_steps.py, line 60:
<comment>This test inserts 100 active allocation documents into the session-scoped unit DB but never removes them, so data persists across tests and can cause order-dependent failures. Clean up the inserted allocations (e.g., delete them in a finally block) to keep unit tests isolated.</comment>
<file context>
@@ -42,50 +44,41 @@ async def test_validate_execution_step_success_and_failures() -> None:
+ assert ctx.get("allocation_id") is not None
+
+ # Test 2: Limit exceeded (insert 100 active allocations)
+ for i in range(100):
+ doc = ResourceAllocationDocument(
+ allocation_id=f"limit-test-alloc-{i}",
</file context>
3 issues found across 49 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/pyproject.toml">
<violation number="1" location="backend/pyproject.toml:140">
P2: New dev dependencies are added with >= ranges while the rest of the group is pinned to exact versions, which makes dev/test environments non-reproducible. Pin these to exact versions to align with the existing dependency policy.</violation>
</file>
<file name="backend/tests/unit/services/saga/test_execution_saga_steps.py">
<violation number="1" location="backend/tests/unit/services/saga/test_execution_saga_steps.py:71">
P2: This test inserts 100 active ResourceAllocationDocument entries but never cleans them up. Since unit_container is session-scoped and there’s no per-test DB cleanup, these documents can leak into other unit tests and make them flaky (e.g., causing active-allocation limits to trip). Add cleanup to delete the inserted documents or use a fixture that isolates the collection for this test.</violation>
</file>
<file name="backend/app/core/providers.py">
<violation number="1" location="backend/app/core/providers.py:192">
P2: MongoDB client is never closed after yielding the database, which can leak connections on shutdown. Wrap the yield in a try/finally and close the client in the finally block.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
[dependency-groups]
dev = [
    "coverage==7.13.0",
    "fakeredis>=2.33.0",
P2: New dev dependencies are added with >= ranges while the rest of the group is pinned to exact versions, which makes dev/test environments non-reproducible. Pin these to exact versions to align with the existing dependency policy.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/pyproject.toml, line 140:
<comment>New dev dependencies are added with >= ranges while the rest of the group is pinned to exact versions, which makes dev/test environments non-reproducible. Pin these to exact versions to align with the existing dependency policy.</comment>
<file context>
@@ -137,9 +137,11 @@ packages = ["app", "workers"]
[dependency-groups]
dev = [
"coverage==7.13.0",
+ "fakeredis>=2.33.0",
"hypothesis==6.103.4",
"iniconfig==2.0.0",
</file context>
| "fakeredis>=2.33.0", | |
| "fakeredis==2.33.0", |
        memory_limit="512Mi",
        status="active",
    )
    await doc.insert()
P2: This test inserts 100 active ResourceAllocationDocument entries but never cleans them up. Since unit_container is session-scoped and there’s no per-test DB cleanup, these documents can leak into other unit tests and make them flaky (e.g., causing active-allocation limits to trip). Add cleanup to delete the inserted documents or use a fixture that isolates the collection for this test.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/tests/unit/services/saga/test_execution_saga_steps.py, line 71:
<comment>This test inserts 100 active ResourceAllocationDocument entries but never cleans them up. Since unit_container is session-scoped and there’s no per-test DB cleanup, these documents can leak into other unit tests and make them flaky (e.g., causing active-allocation limits to trip). Add cleanup to delete the inserted documents or use a fixture that isolates the collection for this test.</comment>
<file context>
@@ -42,50 +44,41 @@ async def test_validate_execution_step_success_and_failures() -> None:
+ memory_limit="512Mi",
+ status="active",
+ )
+ await doc.insert()
- # Limit exceeded
</file context>
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@backend/app/core/dishka_lifespan.py`:
- Around line 7-11: The AsyncIterable providers are not closing their resources
on shutdown: update get_database to close the AsyncMongoClient after the yield
(await client.close()), update get_dlq_manager to call await consumer.stop()
(and any DLQ manager cleanup/close) after the yield instead of only cancelling
tasks, and add proper cleanup for get_schema_registry (e.g., close or shutdown
the schema registry client/resource after yield); follow the same pattern used
in get_event_bus and get_event_store_consumer where consumers are stopped with
await consumer.stop() after yielding.
In `@backend/workers/run_sse_bridge.py`:
- Around line 101-102: The decode_avro function currently calls
schema_registry.deserialize_event(body, "sse_bridge") with a hardcoded topic
string; update decode_avro to either pass the real subscribed topic from the
message context (e.g., use the topic variable available where messages are
received) into schema_registry.deserialize_event, or remove the topic argument
entirely if deserialize_event truly does not use it. Locate decode_avro and the
call sites where messages and topic are available, then replace the literal
"sse_bridge" with the dynamic topic variable or adjust deserialize_event usage
accordingly to avoid the misleading hardcoded value.
♻️ Duplicate comments (4)
backend/workers/run_result_processor.py (1)
156-159: Catch-all handler silently discards unmatched events.This handler prevents message loss but provides no visibility into unexpected events. Consider logging at debug/warning level to aid troubleshooting.
Suggested fix
# Default handler for unmatched events (prevents message loss) `@subscriber`() async def handle_other(event: DomainEvent) -> None: - pass + app_logger.debug( + "Received unhandled event type", + extra={"event_type": getattr(event, "event_type", "unknown")}, + )backend/app/core/providers.py (1)
577-586: SSEConnectionRegistry scope must beScope.APP, notScope.REQUEST.With
Scope.REQUEST, each HTTP request gets a new registry instance with an empty_active_connectionsdictionary. This breaks cross-request connection tracking and causesget_connection_count()to always return 0 in health checks.Suggested fix
- `@provide`(scope=Scope.REQUEST) + `@provide`(scope=Scope.APP) def get_sse_connection_registry( self, logger: logging.Logger, connection_metrics: ConnectionMetrics, ) -> SSEConnectionRegistry:backend/workers/run_pod_monitor.py (1)
75-88: Exceptions from completed tasks are still silently swallowed.The
asyncio.waitpattern returns completed tasks indone, but exceptions are never checked. Ifmonitor.run()raises an exception, it ends up indonebut the exception is silently ignored. This makes debugging production failures difficult.Suggested fix
done, pending = await asyncio.wait( [monitor_task, shutdown_task], return_when=asyncio.FIRST_COMPLETED, ) + # Check for exceptions in completed tasks + for task in done: + if task is not shutdown_task and not task.cancelled(): + exc = task.exception() + if exc: + logger.error("Monitor task failed", exc_info=exc) + raise exc + for task in pending: task.cancel() with suppress(asyncio.CancelledError): await taskbackend/workers/run_k8s_worker.py (1)
146-149: Don’t silently drop unmatched events.The no‑op catch‑all handler will ack and discard events that don’t match filters, masking new event types or header bugs. Consider logging or routing to a dead‑letter path.
🪵 Suggested minimal logging
`@subscriber`() async def handle_other(event: DomainEvent) -> None: - pass + app_logger.warning( + "Unhandled event received: %s", + getattr(event, "event_type", type(event).__name__), + )
🧹 Nitpick comments (5)
backend/app/events/schema/schema_registry.py (1)
82-102: Schema object is recreated unnecessarily in serialize_event.
_ensure_schema_registeredalready creates anAvroSchemaobject internally but doesn't return it. Thenserialize_eventcreates it again on line 92. Consider returning the schema object from_ensure_schema_registeredto avoid redundant schema parsing.Suggested optimization
- async def _ensure_schema_registered(self, event_class: type[DomainEvent]) -> int: + async def _ensure_schema_registered(self, event_class: type[DomainEvent]) -> tuple[int, schema.AvroSchema]: """Lazily register schema and return schema ID.""" if event_class in self._schema_id_cache: - return self._schema_id_cache[event_class] + avro_schema_dict = event_class.avro_schema_to_python() + return self._schema_id_cache[event_class], schema.AvroSchema(avro_schema_dict) subject = f"{self.subject_prefix}{event_class.__name__}-value" avro_schema_dict = event_class.avro_schema_to_python() avro_schema_obj = schema.AvroSchema(avro_schema_dict) schema_id: int = await self._client.register(subject, avro_schema_obj) self._schema_id_cache[event_class] = schema_id self._id_to_class_cache[schema_id] = event_class self.logger.debug(f"Registered schema {event_class.__name__}: ID {schema_id}") - return schema_id + return schema_id, avro_schema_obj async def serialize_event(self, event: DomainEvent) -> bytes: """Serialize event to Confluent wire format.""" event_class = event.__class__ subject = f"{self.subject_prefix}{event_class.__name__}-value" - await self._ensure_schema_registered(event_class) - - avro_schema_dict = event_class.avro_schema_to_python() - avro_schema_obj = schema.AvroSchema(avro_schema_dict) + _, avro_schema_obj = await self._ensure_schema_registered(event_class)backend/app/core/providers.py (1)
271-293: DLQManager background tasks lack graceful shutdown signaling.The
monitor_loopuseswhile Trueand relies on task cancellation. Consider adding an explicit stop flag to allow graceful completion of the current iteration before shutdown.backend/workers/run_saga_orchestrator.py (3)
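A sketch of the stop-flag idea (class and method names are illustrative, not taken from the PR):

    import asyncio

    class DLQMonitor:
        def __init__(self) -> None:
            self._stop = asyncio.Event()

        async def monitor_loop(self) -> None:
            # Finish the current iteration instead of being cancelled mid-write.
            while not self._stop.is_set():
                await self._process_once()
                await asyncio.sleep(1.0)

        async def _process_once(self) -> None:
            ...  # one unit of DLQ work

        def request_shutdown(self) -> None:
            self._stop.set()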
98-99: Closure-based state for timeout checking works but could be clearer.Using a mutable dict for state across handler invocations is a valid pattern. Consider using a simple class or dataclass for better readability.
161-162: Minor: f-string logging is used inconsistently.Line 161 uses f-string formatting while other parts of the codebase use structured logging with
extra. This is minor but affects log aggregation consistency.
142-159: Opportunistic timeout check may miss sagas during low traffic periods.Timeout checking only runs when events arrive via
check_timeouts_once(). During periods of no traffic, timed-out sagas won't be detected until the next event. Acheck_timeouts_loop()method exists inSagaLogicthat runs every 30 seconds but is not currently used. Consider whether to use the available background loop instead of the opportunistic approach, especially given the 5-minute maximum timeout constraint.
backend/app/core/dishka_lifespan.py
Outdated
 @asynccontextmanager
 async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
-    """
-    Application lifespan with dishka dependency injection.
-    This is much cleaner than the old lifespan.py:
-    - No dependency_overrides
-    - No manual service management
-    - Dishka handles all lifecycle automatically
-    """
-    # Get settings and logger from DI container (uses test settings in tests)
-    container: AsyncContainer = app.state.dishka_container
-    settings = await container.get(Settings)
-    logger = await container.get(logging.Logger)
-
-    logger.info(
-        "Starting application with dishka DI",
-        extra={
-            "project_name": settings.PROJECT_NAME,
-            "environment": "test" if settings.TESTING else "production",
-        },
-    )
-
-    # Metrics setup moved to app creation to allow middleware registration
-    logger.info("Lifespan start: tracing and services initialization")
-
-    # Initialize tracing only when enabled (avoid exporter retries in tests)
-    if settings.ENABLE_TRACING and not settings.TESTING:
-        instrumentation_report = init_tracing(
-            service_name=settings.TRACING_SERVICE_NAME,
-            settings=settings,
-            logger=logger,
-            service_version=settings.TRACING_SERVICE_VERSION,
-            sampling_rate=settings.TRACING_SAMPLING_RATE,
-            enable_console_exporter=settings.TESTING,
-            adaptive_sampling=settings.TRACING_ADAPTIVE_SAMPLING,
-        )
-
-        if instrumentation_report.has_failures():
-            logger.warning(
-                "Some instrumentation libraries failed to initialize",
-                extra={"instrumentation_summary": instrumentation_report.get_summary()},
-            )
-        else:
-            logger.info(
-                "Distributed tracing initialized successfully",
-                extra={"instrumentation_summary": instrumentation_report.get_summary()},
-            )
-    else:
-        logger.info(
-            "Distributed tracing disabled",
-            extra={"testing": settings.TESTING, "enable_tracing": settings.ENABLE_TRACING},
-        )
-
-    # Phase 1: Resolve all DI dependencies in parallel
-    (
-        schema_registry,
-        database,
-        redis_client,
-        rate_limit_metrics,
-        sse_bridge,
-        event_store_consumer,
-    ) = await asyncio.gather(
-        container.get(SchemaRegistryManager),
-        container.get(Database),
-        container.get(redis.Redis),
-        container.get(RateLimitMetrics),
-        container.get(SSEKafkaRedisBridge),
-        container.get(EventStoreConsumer),
-    )
-
-    # Phase 2: Initialize infrastructure in parallel (independent subsystems)
-    await asyncio.gather(
-        initialize_event_schemas(schema_registry),
-        init_beanie(database=database, document_models=ALL_DOCUMENTS),
-        initialize_rate_limits(redis_client, settings, logger, rate_limit_metrics),
-    )
-    logger.info("Infrastructure initialized (schemas, beanie, rate limits)")
-
-    # Phase 3: Start Kafka consumers in parallel
-    async with AsyncExitStack() as stack:
-        stack.push_async_callback(sse_bridge.aclose)
-        stack.push_async_callback(event_store_consumer.aclose)
-        await asyncio.gather(
-            sse_bridge.__aenter__(),
-            event_store_consumer.__aenter__(),
-        )
-        logger.info("SSE bridge and EventStoreConsumer started")
-        yield
+    """Minimal lifespan - container.close() triggers all provider cleanup."""
+    yield
+    await app.state.dishka_container.close()
🧩 Analysis chain
🏁 Script executed:
# Check if the files exist first
git ls-files backend/app/core/ | head -20Repository: HardMax71/Integr8sCode
Length of output: 783
🏁 Script executed:
# Run the verification command from the review comment
rg -n "async def get_.*AsyncIterable" --type py backend/app/core/ -A 20 | head -100Repository: HardMax71/Integr8sCode
Length of output: 1605
🏁 Script executed:
# Also check the actual lifespan.py file to confirm it matches the snippet
cat -n backend/app/core/dishka_lifespan.pyRepository: HardMax71/Integr8sCode
Length of output: 477
🏁 Script executed:
# Search for all AsyncIterable usage in the codebase
rg "AsyncIterable" --type py backend/app/core/ -B 2 -A 10Repository: HardMax71/Integr8sCode
Length of output: 7552
🏁 Script executed:
# Get the full get_database implementation with cleanup
sed -n '185,210p' backend/app/core/providers.pyRepository: HardMax71/Integr8sCode
Length of output: 963
🏁 Script executed:
# Get full implementations of all AsyncIterable providers
rg -n "async def get_" backend/app/core/providers.py --type py -A 30 | grep -A 25 "AsyncIterable" | head -150Repository: HardMax71/Integr8sCode
Length of output: 6544
🏁 Script executed:
# Search for finally blocks in providers.py to see cleanup patterns
rg "finally:" backend/app/core/providers.py -B 5 -A 5Repository: HardMax71/Integr8sCode
Length of output: 48
🏁 Script executed:
# Check for try/except patterns in async generators
rg "try:" backend/app/core/providers.py --type py -A 15 | head -100Repository: HardMax71/Integr8sCode
Length of output: 3479
🏁 Script executed:
# Get full DLQManager provider implementation including cleanup
sed -n '237,290p' backend/app/core/providers.pyRepository: HardMax71/Integr8sCode
Length of output: 2239
🏁 Script executed:
# Get full EventBus provider implementation including cleanup
sed -n '395,450p' backend/app/core/providers.pyRepository: HardMax71/Integr8sCode
Length of output: 1988
🏁 Script executed:
# Get full NotificationService provider implementation including cleanup
rg "async def get_notification_service" backend/app/core/providers.py -A 40Repository: HardMax71/Integr8sCode
Length of output: 1552
🏁 Script executed:
# Check if get_database has a finally block we might have missed
sed -n '185,200p' backend/app/core/providers.py | cat -ARepository: HardMax71/Integr8sCode
Length of output: 749
🏁 Script executed:
# Search for any cleanup of MongoDB client anywhere in the codebase
rg "client.close\|client\.close" --type py backend/app/Repository: HardMax71/Integr8sCode
Length of output: 48
🏁 Script executed:
# Check if there's any implicit cleanup of the MongoDB client when the database object is garbage collected
# Look for __del__ or context manager implementations in the Database or AsyncMongoClient types
rg "close\|__del__\|__aexit__" --type py backend/app/core/ | grep -i "mongo\|database"Repository: HardMax71/Integr8sCode
Length of output: 48
🏁 Script executed:
# See how AsyncMongoClient handles cleanup when the reference is released
# Check if Beanie or Motor (used by AsyncMongoClient) provides automatic cleanup
rg "AsyncMongoClient\|motor" --type py backend/app/core/ -B 2 -A 2Repository: HardMax71/Integr8sCode
Length of output: 48
🏁 Script executed:
# Check the full provider file to see if there's a try/finally pattern we might have missed in get_database
sed -n '183,195p' backend/app/core/providers.pyRepository: HardMax71/Integr8sCode
Length of output: 595
🏁 Script executed:
# Verify the complete structure - check what comes after the yield in get_database
python3 << 'EOF'
with open('backend/app/core/providers.py', 'r') as f:
lines = f.readlines()
# Look at lines 185-200 to see full get_database
for i in range(184, min(200, len(lines))):
print(f"{i+1}: {lines[i]}", end='')
EOFRepository: HardMax71/Integr8sCode
Length of output: 152
🏁 Script executed:
# Check if get_dlq_manager has consumer.stop() that we might have missed
sed -n '237,290p' backend/app/core/providers.py | tail -20Repository: HardMax71/Integr8sCode
Length of output: 808
🏁 Script executed:
# Check for any cleanup of MongoDB client or schema registry after yield
sed -n '185,220p' backend/app/core/providers.pyRepository: HardMax71/Integr8sCode
Length of output: 1373
🏁 Script executed:
# Get full DLQManager output to see what happens after yield
rg "get_dlq_manager" backend/app/core/providers.py -A 60 | tail -30Repository: HardMax71/Integr8sCode
Length of output: 1212
🏁 Script executed:
# Check get_schema_registry for cleanup
sed -n '323,335p' backend/app/core/providers.pyRepository: HardMax71/Integr8sCode
Length of output: 627
MongoDB client and DLQ Kafka consumer lack cleanup in their providers.
The minimal lifespan pattern is appropriate, but not all AsyncIterable providers properly implement cleanup:
- get_database: AsyncMongoClient is never closed after yield
- get_dlq_manager: Kafka consumer is never stopped after yield (tasks are cancelled, but await consumer.stop() is missing)
- get_schema_registry: No cleanup implemented
Providers like get_event_bus and get_event_store_consumer correctly call await consumer.stop() after yield; follow this pattern for all resource-managing providers.
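A minimal sketch of that provider shape (assuming dishka's async-generator providers; the names mirror the comment, not the exact code, and the URL and database name are placeholders):

    from collections.abc import AsyncIterable

    from dishka import Provider, Scope, provide
    from pymongo import AsyncMongoClient

    class MongoProvider(Provider):
        @provide(scope=Scope.APP)
        async def get_database(self) -> AsyncIterable[object]:
            client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient("mongodb://localhost:27017")
            try:
                yield client["appdb"]  # hypothetical database name
            finally:
                # Dishka runs this on container.close() - the same hook where
                # get_event_bus / get_event_store_consumer await consumer.stop().
                await client.close()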
🤖 Prompt for AI Agents
In `@backend/app/core/dishka_lifespan.py` around lines 7 - 11, The AsyncIterable
providers are not closing their resources on shutdown: update get_database to
close the AsyncMongoClient after the yield (await client.close()), update
get_dlq_manager to call await consumer.stop() (and any DLQ manager
cleanup/close) after the yield instead of only cancelling tasks, and add proper
cleanup for get_schema_registry (e.g., close or shutdown the schema registry
client/resource after yield); follow the same pattern used in get_event_bus and
get_event_store_consumer where consumers are stopped with await consumer.stop()
after yielding.
    async def decode_avro(body: bytes) -> DomainEvent:
        return await schema_registry.deserialize_event(body, "sse_bridge")
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check how deserialize_event uses the topic parameter
rg -nA 20 "async def deserialize_event" backend/app/events/schema/schema_registry.pyRepository: HardMax71/Integr8sCode
Length of output: 1185
🏁 Script executed:
#!/bin/bash
# Check the Kafka subscriber configuration in the same file
rg -nB 5 -A 10 "KafkaSubscriber\|@router\|broker\.subscribe" backend/workers/run_sse_bridge.pyRepository: HardMax71/Integr8sCode
Length of output: 48
🏁 Script executed:
#!/bin/bash
# Look for any topic configuration or constants used with sse_bridge
rg -n "sse_bridge\|SSE_RELEVANT_EVENTS" backend/workers/run_sse_bridge.pyRepository: HardMax71/Integr8sCode
Length of output: 48
🏁 Script executed:
#!/bin/bash
# Check the schema registry manager's deserialize_event implementation more broadly
rg -n "deserialize_event" backend/app/events/schema/ -A 3Repository: HardMax71/Integr8sCode
Length of output: 522
🏁 Script executed:
#!/bin/bash
# Search for the entire run_sse_bridge.py file to find subscriber configuration
cat -n backend/workers/run_sse_bridge.py | head -150Repository: HardMax71/Integr8sCode
Length of output: 5383
🏁 Script executed:
#!/bin/bash
# Look for topic configuration or constants
rg -n "topic\|Topic" backend/workers/run_sse_bridge.py -B 2 -A 2Repository: HardMax71/Integr8sCode
Length of output: 48
🏁 Script executed:
#!/bin/bash
# Check if the topic parameter is used anywhere in schema_registry or related files
rg -n "topic" backend/app/events/schema/schema_registry.pyRepository: HardMax71/Integr8sCode
Length of output: 148
Remove hardcoded topic string or pass the actual subscribed topic.
The decode_avro function passes the hardcoded string "sse_bridge" as the topic, but the subscriber is configured to handle multiple dynamically-constructed topics from configuration (lines 81-84). While the topic parameter is currently unused in deserialize_event, the mismatch is misleading and creates maintenance risk if the parameter ever becomes used. Either remove the parameter if genuinely unused, or pass the actual topic from the message context.
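For context on why the topic can be redundant here: the Confluent wire format already embeds the schema id in the payload, so a deserializer can pick the schema without knowing the topic. A small sketch of the standard framing (this is the general format, not this repo's manager):

    import struct

    def split_confluent_frame(payload: bytes) -> tuple[int, bytes]:
        """Return (schema_id, avro_body) from a Confluent-framed message."""
        if len(payload) < 5 or payload[0] != 0:
            raise ValueError("missing Confluent magic byte 0")
        (schema_id,) = struct.unpack(">I", payload[1:5])  # big-endian 4-byte id
        return schema_id, payload[5:]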
🤖 Prompt for AI Agents
In `@backend/workers/run_sse_bridge.py` around lines 101 - 102, The decode_avro
function currently calls schema_registry.deserialize_event(body, "sse_bridge")
with a hardcoded topic string; update decode_avro to either pass the real
subscribed topic from the message context (e.g., use the topic variable
available where messages are received) into schema_registry.deserialize_event,
or remove the topic argument entirely if deserialize_event truly does not use
it. Locate decode_avro and the call sites where messages and topic are
available, then replace the literal "sse_bridge" with the dynamic topic variable
or adjust deserialize_event usage accordingly to avoid the misleading hardcoded
value.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@backend/tests/unit/conftest.py`:
- Around line 58-90: The session-scoped unit_container fixture holds a single
AsyncMongoMockClient across the whole test session causing DB state leakage; fix
by making isolation explicit: either change unit_container to function scope or
create/attach a fresh AsyncMongoMockClient per test and re-run init_beanie for
that client (references: unit_container, AsyncMongoMockClient, init_beanie), or
add a teardown that drops all collections between tests (callable in unit_scope)
and ensure tests use the existing unit_scope fixture instead of unit_container
so each test gets a clean DB state.
In `@backend/workers/run_event_replay.py`:
- Around line 34-38: The AsyncMongoClient instance (mongo_client) is created
before calling init_beanie and may leak if init_beanie raises; wrap the
init_beanie call in a try/finally so that on any exception you call await
mongo_client.close() to ensure the client is closed, then re-raise the
exception; keep the AsyncExitStack usage for normal shutdown but ensure the
finally branch only closes mongo_client when initialization fails (i.e., before
the stack takes ownership).
In `@backend/workers/run_result_processor.py`:
- Around line 132-137: The subscriber is created with auto_commit=False which
prevents offsets from being committed and combined with FastStream's default
AckPolicy.ACK_FIRST causes messages to never be acknowledged; update the
broker.subscriber(...) invocation (the call that currently passes topics,
group_id, auto_commit=False, decoder=decode_avro) to either set
ack_policy=AckPolicy.ACK to commit offsets after successful handling or remove
auto_commit=False to enable auto-commit, and then modify the no-op handle_other
handler so it explicitly acknowledges or returns success for unknown events to
ensure offsets advance for those messages as well.
In `@backend/workers/run_saga_orchestrator.py`:
- Around line 142-147: The subscriber decorator currently uses the deprecated
auto_commit parameter; replace auto_commit=False in the broker.subscriber(...)
call with an explicit ack_policy by importing AckPolicy from faststream and
using ack_policy=AckPolicy.ACK (recommended) or ack_policy=AckPolicy.MANUAL if
you want manual control; if you choose MANUAL, update the corresponding message
handler function (the consumer callback tied to broker.subscriber with topics,
group_id and decoder=decode_avro) to call message.ack() after successful
processing, and remove the auto_commit argument; apply the same change to the
identical patterns in run_result_processor.py and run_k8s_worker.py.
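For reference, the ack_policy form suggested above looks roughly like this (FastStream 0.6 style; the topic and group names are placeholders):

    from faststream import AckPolicy, FastStream
    from faststream.kafka import KafkaBroker

    broker = KafkaBroker("localhost:9092")
    app = FastStream(broker)

    @broker.subscriber(
        "execution-events",            # placeholder topic
        group_id="result-processor",
        ack_policy=AckPolicy.ACK,      # commit only after the handler returns
    )
    async def handle(event: dict) -> None:
        ...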
♻️ Duplicate comments (4)
backend/app/core/providers.py (1)
559-568: SSEConnectionRegistry should be app-scoped (singleton).With
Scope.REQUEST, the registry is recreated per request and_active_connectionsis always empty across requests (e.g., health checks). Please switch this provider toScope.APP.🔧 Suggested fix
- `@provide`(scope=Scope.REQUEST) + `@provide`(scope=Scope.APP) def get_sse_connection_registry( self, logger: logging.Logger, connection_metrics: ConnectionMetrics, ) -> SSEConnectionRegistry:backend/workers/run_result_processor.py (1)
161-164: Don’t silently drop unmatched events.
handle_othercurrently no-ops, which can hide routing/config errors and new event types. Add at least a warning log.🔧 Suggested fix
`@subscriber`() async def handle_other(event: DomainEvent) -> None: - pass + logger.warning("Unhandled result-processor event_type: %s", event.event_type)backend/workers/run_k8s_worker.py (1)
142-145: Log unmatched events to aid debugging.The catch-all handler silently discards events that don't match the CREATE_POD_COMMAND or DELETE_POD_COMMAND filters. This can mask filter bugs or cause silent data loss when new event types are introduced.
♻️ Proposed fix: add warning log for unmatched events
# Default handler for unmatched events (prevents message loss) `@subscriber`() async def handle_other(event: DomainEvent) -> None: - pass + app_logger.warning( + "Unmatched event received", + extra={"event_type": getattr(event, "event_type", "unknown")}, + )backend/workers/run_pod_monitor.py (1)
85-93: Exceptions from completed tasks are still silently swallowed.When
monitor.run()raises an exception, that exception is stored inmonitor_taskbut never retrieved. The process will exit through thefinallyblock without logging or propagating the error, masking critical failures.🐛 Proposed fix: surface exceptions from completed tasks
done, pending = await asyncio.wait( [monitor_task, shutdown_task], return_when=asyncio.FIRST_COMPLETED, ) + # Surface failures from completed tasks + for task in done: + if task != shutdown_task: + try: + task.result() # Raises if task failed + except Exception: + logger.exception("Monitor task failed") + raise + for task in pending: task.cancel() with suppress(asyncio.CancelledError): await task
🧹 Nitpick comments (4)
backend/workers/dlq_processor.py (1)
87-90: MongoDB connection parameters are hardcoded.The connection settings (
tz_aware=True,serverSelectionTimeoutMS=5000) are duplicated here and indishka_lifespan.py. Consider extracting these toSettingsor a shared factory if consistency matters across workers and the main app.backend/app/core/dishka_lifespan.py (1)
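One way to de-duplicate, sketched as a shared factory (the function name is an assumption; the options are the ones quoted above):

    from pymongo import AsyncMongoClient

    def make_mongo_client(url: str) -> AsyncMongoClient[dict[str, object]]:
        """Single place for connection options shared by workers and the app."""
        return AsyncMongoClient(url, tz_aware=True, serverSelectionTimeoutMS=5000)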
17-25: Consider try/finally to ensure MongoDB client cleanup on initialization failure.If
init_beanie()raises an exception, theAsyncMongoClientwon't be closed since the code won't reach line 24. While unlikely in practice, wrapping the yield block ensures proper cleanup.Suggested fix
client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient( settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000 ) - await init_beanie(database=client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS) - - yield - - await client.close() - await container.close() + try: + await init_beanie(database=client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS) + yield + finally: + await client.close() + await container.close()backend/app/core/providers.py (1)
253-265: Guard long‑running background loops from silent task death.These
while Truetasks (DLQ processing, event store batch loop, event bus listener, notification loops) will stop permanently if an exception escapes. Consider adding a try/except with logging and a short backoff so failures surface without killing the loop.Example pattern
```diff
 async def batch_loop() -> None:
     while True:
-        await event_store_consumer.process_batch()
+        try:
+            await event_store_consumer.process_batch()
+        except Exception:
+            logger.exception("Event store consumer loop crashed; continuing")
+            await asyncio.sleep(1)
```
Also applies to: 359-371, 413-428, 693-712
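The same pattern can be factored into a helper so every loop gets identical supervision; `run_forever` is a made-up name, not anything in this PR:

```python
import asyncio
import logging
from collections.abc import Awaitable, Callable

async def run_forever(step: Callable[[], Awaitable[None]], logger: logging.Logger, backoff: float = 1.0) -> None:
    """Run one loop iteration forever, logging failures instead of dying."""
    while True:
        try:
            await step()
        except asyncio.CancelledError:
            raise  # let shutdown cancellation propagate
        except Exception:
            logger.exception("Background loop crashed; continuing")
            await asyncio.sleep(backoff)
```

Each `while True` task above could then be created as, e.g., `asyncio.create_task(run_forever(event_store_consumer.process_batch, logger))`.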
backend/workers/run_pod_monitor.py (1)
57-61: Consider moving MongoDB client lifecycle into DI for consistency. The MongoDB client is created manually outside the DI container while other boundary clients (Kafka, Redis) are DI-managed. This works correctly since it's properly closed in the `finally` block, but moving it into a provider would improve consistency with other workers; a sketch of such a provider follows.
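A hedged sketch of what that provider could look like using dishka's generator-style finalization; the class name and wiring are assumptions, not code from this PR:

```python
from collections.abc import AsyncIterable

from dishka import Provider, Scope, provide
from pymongo import AsyncMongoClient

class MongoClientProvider(Provider):
    @provide(scope=Scope.APP)
    async def get_mongo_client(self, settings) -> AsyncIterable[AsyncMongoClient[dict[str, object]]]:
        # dishka runs the code after `yield` when the container closes,
        # so the client is released on shutdown like the Kafka producer.
        # `settings` is the app's Settings object from SettingsProvider.
        client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
            settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
        )
        yield client
        await client.close()
```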
```python
@pytest_asyncio.fixture(scope="session")
async def unit_container(test_settings: Settings) -> AsyncGenerator[AsyncContainer, None]:
    """DI container for unit tests with fake boundary clients.

    Provides:
    - Fake Redis, Kafka, K8s (boundary clients)
    - Real metrics, repositories, services (internal)
    - Real MongoDB via init_beanie
    """
    container = make_async_container(
        SettingsProvider(),
        LoggingProvider(),
        FakeBoundaryClientProvider(),
        RedisServicesProvider(),
        MetricsProvider(),
        EventProvider(),
        FakeSchemaRegistryProvider(),
        MessagingProvider(),
        CoreServicesProvider(),
        KafkaServicesProvider(),
        RepositoryProvider(),
        context={Settings: test_settings},
    )

    mongo_client: AsyncMongoMockClient[dict[str, object]] = AsyncMongoMockClient()
    await init_beanie(
        database=mongo_client[test_settings.DATABASE_NAME],  # type: ignore[arg-type]
        document_models=ALL_DOCUMENTS,
    )

    yield container

    await container.close()
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Look for test cleanup patterns mentioned in the concern
rg -n "clean|reset|clear|delete_many|drop" backend/tests/unit/ --type py -C2 | head -100
```
Repository: HardMax71/Integr8sCode
Length of output: 7819

🏁 Script executed:
```bash
#!/bin/bash
# Check how the unit_container fixture is actually used in tests
rg -n "def test_.*unit_container" backend/tests/unit/ --type py -C1
```
Repository: HardMax71/Integr8sCode
Length of output: 1803

🏁 Script executed:
```bash
#!/bin/bash
# Look for fixture scope declarations and test isolation patterns
rg -n "@pytest\.fixture|@pytest_asyncio\.fixture" backend/tests/unit/conftest.py --type py -C1
```
Repository: HardMax71/Integr8sCode
Length of output: 3240

🏁 Script executed:
```bash
#!/bin/bash
# Check if mongomock client is being reset or if there's any state cleanup
rg -n "AsyncMongoMockClient|mongo_client" backend/tests/unit/conftest.py --type py -C3
```
Repository: HardMax71/Integr8sCode
Length of output: 721

🏁 Script executed:
```bash
#!/bin/bash
# Examine the full conftest.py structure to understand fixture relationships
head -300 backend/tests/unit/conftest.py | tail -250
```
Repository: HardMax71/Integr8sCode
Length of output: 7553
Confirm: session-scoped container creates genuine state leakage risk.
The unit_container fixture is session-scoped with a single AsyncMongoMockClient instance that persists across all unit tests. No database cleanup is performed between tests—the mongomock instance accumulates data from all writes. Tests using repositories that modify database state will pollute subsequent tests in the session.
A function-scoped unit_scope fixture exists (line 93) but tests use unit_container directly, bypassing isolation. Either switch test usage to unit_scope, add cleanup hooks between tests, or accept and document the leakage risk.
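If the cleanup-hook route is taken, a minimal sketch could look like this; it assumes `ALL_DOCUMENTS` is the same Beanie document list the fixture passes to `init_beanie`, and the fixture name is illustrative:

```python
import pytest_asyncio

@pytest_asyncio.fixture(autouse=True)
async def _clean_mongo_between_tests():
    yield  # run the test first
    # Beanie's Document.delete_all() wipes the backing collection, so each
    # test starts from an empty mongomock database despite the session scope.
    for doc_model in ALL_DOCUMENTS:
        await doc_model.delete_all()
```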
🤖 Prompt for AI Agents
In `@backend/tests/unit/conftest.py` around lines 58 - 90, The session-scoped
unit_container fixture holds a single AsyncMongoMockClient across the whole test
session causing DB state leakage; fix by making isolation explicit: either
change unit_container to function scope or create/attach a fresh
AsyncMongoMockClient per test and re-run init_beanie for that client
(references: unit_container, AsyncMongoMockClient, init_beanie), or add a
teardown that drops all collections between tests (callable in unit_scope) and
ensure tests use the existing unit_scope fixture instead of unit_container so
each test gets a clean DB state.
```python
mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
    settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
)
await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
```
Ensure Mongo client is closed if initialization fails.
If init_beanie raises, the mongo_client isn’t closed because the AsyncExitStack is entered later. Consider a small try/finally to avoid leaking the client on startup failure.
🔧 Suggested fix
```diff
-mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
-    settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
-)
-await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
+    settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
+)
+try:
+    await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+except Exception:
+    await mongo_client.close()
+    raise
```
🤖 Prompt for AI Agents
In `@backend/workers/run_event_replay.py` around lines 34 - 38, The
AsyncMongoClient instance (mongo_client) is created before calling init_beanie
and may leak if init_beanie raises; wrap the init_beanie call in a try/finally
so that on any exception you call await mongo_client.close() to ensure the
client is closed, then re-raise the exception; keep the AsyncExitStack usage for
normal shutdown but ensure the finally branch only closes mongo_client when
initialization fails (i.e., before the stack takes ownership).
```python
subscriber = broker.subscriber(
    *topics,
    group_id=group_id,
    auto_commit=False,
    decoder=decode_avro,
)
```
🧩 Analysis chain
🏁 Script executed:
```bash
cd backend/workers && wc -l run_result_processor.py
```
Repository: HardMax71/Integr8sCode
Length of output: 93

🏁 Script executed:
```bash
cat -n backend/workers/run_result_processor.py | head -200
```
Repository: HardMax71/Integr8sCode
Length of output: 7837

🏁 Script executed:
```bash
cat -n backend/workers/run_result_processor.py | sed -n '120,160p'
```
Repository: HardMax71/Integr8sCode
Length of output: 2074
🌐 Web query:
FastStream 0.6 Kafka subscriber auto_commit=False offset commit behavior handler success
💡 Result:
Short answer: With FastStream 0.6 you must use the ACK-style acknowledgement (not auto-commit) to have offsets committed only after handler success — AckPolicy.ACK commits the offset after the handler returns (ACK_FIRST uses client auto-commit; MANUAL does nothing until you call msg.ack(); NACK_ON_ERROR will avoid committing on errors). See FastStream docs. [1]
Sources
- FastStream 0.6 acknowledgement docs (AckPolicy behavior and examples). [1]
- FastStream 0.5 docs showing auto_commit=False example (for comparison). [2]
🏁 Script executed:
```bash
grep -rn "AckPolicy\|\.ack()" backend/workers/
```
Repository: HardMax71/Integr8sCode
Length of output: 48

🏁 Script executed:
```bash
grep -rn "from faststream" backend/workers/run_result_processor.py
```
Repository: HardMax71/Integr8sCode
Length of output: 146
🌐 Web query:
FastStream 0.6 KafkaBroker subscriber default AckPolicy when auto_commit=False
💡 Result:
Default AckPolicy is AckPolicy.ACK_FIRST. FastStream’s docs state ACK_FIRST (which relies on Kafka’s enable.auto.commit) is the default; to get at-least-once you must set ack_policy=AckPolicy.ACK (or use manual ack). [1][2]
Sources:
- FastStream Acknowledgement docs (0.6). [1]
- KafkaBroker/subscriber API showing subscriber options (auto_commit / ack config). [2]
Add explicit offset commit strategy or enable auto-commit.
With auto_commit=False (line 135), offsets are not committed automatically. FastStream's default AckPolicy.ACK_FIRST relies on Kafka's enable.auto.commit, which is disabled here. This results in messages never being marked as processed—they will be redelivered on consumer restart.
Set `ack_policy=AckPolicy.ACK` in the `broker.subscriber()` call to commit offsets after handler success, or remove `auto_commit=False` to use the default auto-commit. Additionally, the no-op `handle_other` handler silently drops unknown events without acknowledging them, preventing offset advancement for those messages. A sketch of the first option follows.
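For illustration, an at-least-once subscriber under that recommendation might look like this; it is a sketch against FastStream 0.6's documented `AckPolicy` enum, and the topic and group names are placeholders, not the worker's real configuration:

```python
from faststream import AckPolicy
from faststream.kafka import KafkaBroker

broker = KafkaBroker("localhost:9092")

# Offsets are committed only after the handler returns without raising,
# restoring at-least-once delivery for the subscribed topics.
@broker.subscriber(
    "execution-results",          # placeholder topic
    group_id="result-processor",  # placeholder consumer group
    ack_policy=AckPolicy.ACK,
)
async def handle_result(event: dict) -> None:
    ...  # process the decoded event
```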
🤖 Prompt for AI Agents
In `@backend/workers/run_result_processor.py` around lines 132 - 137, The
subscriber is created with auto_commit=False which prevents offsets from being
committed and combined with FastStream's default AckPolicy.ACK_FIRST causes
messages to never be acknowledged; update the broker.subscriber(...) invocation
(the call that currently passes topics, group_id, auto_commit=False,
decoder=decode_avro) to either set ack_policy=AckPolicy.ACK to commit offsets
after successful handling or remove auto_commit=False to enable auto-commit, and
then modify the no-op handle_other handler so it explicitly acknowledges or
returns success for unknown events to ensure offsets advance for those messages
as well.
```python
@broker.subscriber(
    *topics,
    group_id=group_id,
    auto_commit=False,
    decoder=decode_avro,
)
```
🧩 Analysis chain
🏁 Script executed:
```bash
# First, let's look at the full handler implementation around lines 142-147
sed -n '130,180p' backend/workers/run_saga_orchestrator.py
```
Repository: HardMax71/Integr8sCode
Length of output: 2040

🏁 Script executed:
```bash
# Check if there are any explicit ack/commit calls in the handler
rg -A 20 '@broker\.subscriber' backend/workers/run_saga_orchestrator.py | head -50
```
Repository: HardMax71/Integr8sCode
Length of output: 798

🏁 Script executed:
```bash
# Search for any usage patterns of ack or commit in this file
rg -i '(\.ack\(|\.commit\(|message\.ack|offset)' backend/workers/run_saga_orchestrator.py
```
Repository: HardMax71/Integr8sCode
Length of output: 48
🌐 Web query:
FastStream 0.6 Kafka subscriber auto_commit=False offset commit behavior documentation
💡 Result:
Short answer: In FastStream 0.6 the library no longer relies on an async subscriber arg named auto_commit to control commit semantics; instead it uses AckPolicy. The default is AckPolicy.ACK_FIRST (which uses the consumer's enable.auto.commit behavior). To get at‑least‑once/manual control you should use:
- AckPolicy.ACK — FastStream commits the offset after the handler returns (ack after processing). (faststream.ag2.ai)
- AckPolicy.NACK_ON_ERROR — on error FastStream will not commit and will seek to re-read the message. (faststream.ag2.ai)
- AckPolicy.MANUAL — FastStream does not commit offsets; you must call msg.ack() / msg.nack() yourself. The docs show how to access KafkaMessage and call ack()/nack(). (faststream.ag2.ai)
Implementation note: the underlying async confluent consumer exposes enable_auto_commit (enable.auto.commit) in its config, but FastStream’s AckPolicy determines whether FastStream lets the client auto-commit (ACK_FIRST) or performs/omits commits itself (ACK, NACK_ON_ERROR, MANUAL). (faststream.ag2.ai)
If you want, I can paste the exact example snippets from the 0.6 docs.
Citations:
- 1: https://faststream.ag2.ai/0.6/confluent/ack/?utm_source=openai
- 2: https://faststream.ag2.ai/0.6/confluent/ack/?utm_source=openai
- 3: https://faststream.ag2.ai/0.6/confluent/ack/?utm_source=openai
- 4: https://faststream.ag2.ai/latest/api/faststream/confluent/client/AsyncConfluentConsumer/?utm_source=openai
🏁 Script executed:
```bash
# Let's also check if there are any other handlers in the codebase using auto_commit=False
# to see how they handle offsets
rg -B 5 'auto_commit\s*=\s*False' --type py
```
Repository: HardMax71/Integr8sCode
Length of output: 2865

🏁 Script executed:
```bash
# Check FastStream package imports and usage in this project
rg 'from faststream|import faststream' --type py -A 2 | head -30
```
Repository: HardMax71/Integr8sCode
Length of output: 1469

🏁 Script executed:
```bash
# Find FastStream version in requirements or pyproject.toml
fd -e txt -e toml -e cfg | xargs grep -l 'faststream' 2>/dev/null
```
Repository: HardMax71/Integr8sCode
Length of output: 88

🏁 Script executed:
```bash
# Check specific version specifications
rg 'faststream[=<>]' --type toml --type txt
```
Repository: HardMax71/Integr8sCode
Length of output: 48

🏁 Script executed:
```bash
# Read the pyproject.toml to find FastStream version
cat backend/pyproject.toml | grep -A 2 -B 2 faststream
```
Repository: HardMax71/Integr8sCode
Length of output: 147
Replace deprecated auto_commit parameter with explicit offset management.
FastStream 0.6 has deprecated the auto_commit parameter in favor of the AckPolicy enum. The current code uses auto_commit=False, which is ignored in FastStream 0.6, leaving offset commits uncontrolled. Since the handler has no explicit ack/commit calls, offsets will not advance and the consumer will stall.
Replace this with one of:
- `ack_policy=AckPolicy.ACK` — FastStream commits the offset after handler success (recommended for at-least-once semantics)
- `ack_policy=AckPolicy.MANUAL` — call `message.ack()` explicitly in the handler for manual control
Also check run_result_processor.py and run_k8s_worker.py, which have the same pattern; a sketch of the MANUAL variant follows.
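A sketch of the MANUAL option under the same FastStream 0.6 docs; the topic and group names are placeholders, and this is not the orchestrator's actual wiring:

```python
from faststream import AckPolicy
from faststream.kafka import KafkaBroker, KafkaMessage

broker = KafkaBroker("localhost:9092")

@broker.subscriber(
    "saga-commands",               # placeholder topic
    group_id="saga-orchestrator",  # placeholder consumer group
    ack_policy=AckPolicy.MANUAL,
)
async def handle_command(body: dict, msg: KafkaMessage) -> None:
    try:
        ...  # run the saga step
        await msg.ack()   # commit only after the step succeeds
    except Exception:
        await msg.nack()  # leave the offset uncommitted for redelivery
        raise
```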
🤖 Prompt for AI Agents
In `@backend/workers/run_saga_orchestrator.py` around lines 142 - 147, The
subscriber decorator currently uses the deprecated auto_commit parameter;
replace auto_commit=False in the broker.subscriber(...) call with an explicit
ack_policy by importing AckPolicy from faststream and using
ack_policy=AckPolicy.ACK (recommended) or ack_policy=AckPolicy.MANUAL if you
want manual control; if you choose MANUAL, update the corresponding message
handler function (the consumer callback tied to broker.subscriber with topics,
group_id and decoder=decode_avro) to call message.ack() after successful
processing, and remove the auto_commit argument; apply the same change to the
identical patterns in run_result_processor.py and run_k8s_worker.py.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@backend/app/services/notification_service.py`:
- Around line 85-95: The docstring incorrectly mentions a run() method; update
it to remove the run() reference and instead state that background work is
started by the application's DI/lifespan scheduler which calls this service's
batch and cleanup methods (for example
NotificationService.batch_process_pending_notifications and
NotificationService.cleanup_notifications); briefly describe that these methods
handle pending notification processing and periodic cleanup and that they should
be wired into app lifespan/DI.
In `@backend/app/services/sse/redis_bus.py`:
- Around line 156-174: The start() method can be called multiple times which
will spawn duplicate listener tasks; add a guard in start() that checks
self._task (and whether it's not done) and either returns early or raises if a
listener is already running, or cancel/wait for the existing task before
creating a new one; update the async def start(self) to inspect self._task and
only call asyncio.create_task(listener()) when there is no active task (use
self._task.done() to determine activity) and ensure subscription/confirmation
logic is not re-run for an already-started listener (references: start, _task,
listener, _pubsub).
In `@backend/tests/integration/db/repositories/test_admin_settings_repository.py`:
- Around line 29-36: The test test_update_and_reset_settings currently asserts
AuditLogDocument.count() >= N which is brittle; capture a baseline count before
calling repo.update_system_settings and repo.reset_system_settings (use
AuditLogDocument.count()), then after update_system_settings assert the count
increased by one and after reset_system_settings assert the count increased by
two total (or increased by one relative to the post-update count). Reference the
test function test_update_and_reset_settings, the repository methods
update_system_settings and reset_system_settings, and AuditLogDocument.count()
when making these delta-based assertions.
♻️ Duplicate comments (5)
backend/app/core/dishka_lifespan.py (1)
51-64: Task supervision during runtime is still missing. If either `pending_notification_task` or `cleanup_notification_task` fails with an unexpected exception during normal operation (between `yield` and shutdown), the failure goes unnoticed until shutdown. The exception handling inside the loops catches and logs errors, but a complete task crash (e.g., from an unhandled exception type) would be silent. Consider adding a done callback to log failures immediately:
♻️ Suggested supervision pattern
```diff
 pending_task = asyncio.create_task(pending_notification_task())
 cleanup_task = asyncio.create_task(cleanup_notification_task())
+
+def _on_task_done(task: asyncio.Task[None]) -> None:
+    if task.cancelled():
+        return
+    if exc := task.exception():
+        logger.error("Background task failed unexpectedly", exc_info=exc)
+
+pending_task.add_done_callback(_on_task_done)
+cleanup_task.add_done_callback(_on_task_done)
 logger.info("NotificationService background tasks started")
```
backend/app/services/pod_monitor/monitor.py (2)
120-129: Re-raise `asyncio.CancelledError` to preserve cancellation semantics. Catching `CancelledError` without re-raising breaks Python's async cancellation contract. The task will appear to complete normally instead of being cancelled, causing `task.cancelled()` to return `False`.
🛠️ Suggested fix
```diff
 try:
     await self._watch_loop()
 except asyncio.CancelledError:
     self.logger.info("PodMonitor cancelled")
+    raise
 finally:
```
176-182: The synchronous `watch.stream()` iteration blocks the event loop. The `for event in stream:` loop uses a synchronous iterator from the kubernetes library, which blocks the event loop while waiting for events. This prevents other async tasks from running. The codebase already uses `asyncio.to_thread()` elsewhere (lines 117, 295). Consider wrapping the blocking iteration or processing events in a thread:
🛠️ Suggested approach
One option is to process each event fetch in a thread:
```python
async def _run_watch(self) -> None:
    # ... setup code ...
    stream = self._watch.stream(self._v1.list_namespaced_pod, **kwargs)
    try:
        while True:
            event = await asyncio.to_thread(next, iter(stream), None)
            if event is None:
                break
            await self._process_raw_event(event)
    finally:
        self._update_resource_version(stream)
```
Alternatively, consider using an async kubernetes client if available.
backend/app/services/sse/sse_connection_registry.py (1)
30-38: Metrics can still drift on duplicate `register_connection` calls. The `unregister_connection` method now correctly guards against decrementing when the connection doesn't exist. However, `register_connection` still unconditionally increments metrics even if the `connection_id` is already in the set (since `set.add()` is idempotent but the metrics call isn't).
🛠️ Suggested fix
```diff
 async def register_connection(self, execution_id: str, connection_id: str) -> None:
     """Register a new SSE connection."""
     async with self._lock:
-        if execution_id not in self._active_connections:
-            self._active_connections[execution_id] = set()
-
-        self._active_connections[execution_id].add(connection_id)
-        self.logger.debug("Registered SSE connection", extra={"connection_id": connection_id})
-        self.metrics.increment_sse_connections("executions")
+        conns = self._active_connections.setdefault(execution_id, set())
+        if connection_id not in conns:
+            conns.add(connection_id)
+            self.metrics.increment_sse_connections("executions")
+        self.logger.debug("Registered SSE connection", extra={"connection_id": connection_id})
```
backend/app/core/providers.py (1)
409-446: SSEConnectionRegistry should be app‑scoped. Request scope recreates the registry per request, which breaks shared connection tracking across requests. This should be app‑wide.
🔧 Proposed fix
```diff
-    @provide(scope=Scope.REQUEST)
+    @provide(scope=Scope.APP)
     def get_sse_connection_registry(
```
🧹 Nitpick comments (2)
backend/app/events/event_store_consumer.py (1)
33-46: Trace span closes before metrics recording. The `trace_span` context manager exits after line 38, so the metrics recording (line 41) and logging happen outside the span. This is minor, but for observability consistency, consider including the post-storage operations within the span.
♻️ Optional: Include metrics/logging in span scope
```diff
 async def store_event(self, event: DomainEvent, topic: str, consumer_group: str) -> bool:
     """Store a single event. Called by FastStream handler.

     Returns True if stored, False if duplicate/failed.
     """
     with trace_span(
         name="event_store_service.store_event",
         kind=SpanKind.CONSUMER,
         attributes={"event.type": event.event_type, "event.id": event.event_id},
     ):
         stored = await self.event_store.store_event(event)
-    if stored:
-        self.event_metrics.record_kafka_message_consumed(topic=topic, consumer_group=consumer_group)
-        self.logger.debug(f"Stored event {event.event_id}")
-    else:
-        self.logger.debug(f"Duplicate event {event.event_id}, skipped")
+        if stored:
+            self.event_metrics.record_kafka_message_consumed(topic=topic, consumer_group=consumer_group)
+            self.logger.debug(f"Stored event {event.event_id}")
+        else:
+            self.logger.debug(f"Duplicate event {event.event_id}, skipped")
     return stored
```
backend/app/services/notification_service.py (1)
126-149: Prefer `logger.exception` for batch/cleanup failures. This preserves stack traces and is more actionable in ops.
♻️ Suggested change
```diff
-except Exception as e:
-    self.logger.error(f"Error processing pending notifications: {e}")
+except Exception:
+    self.logger.exception("Error processing pending notifications")
 ...
-except Exception as e:
-    self.logger.error(f"Error cleaning up old notifications: {e}")
+except Exception:
+    self.logger.exception("Error cleaning up old notifications")
```
| """Service for creating and managing notifications. | ||
| This service handles: | ||
| - Creating notifications (user and system) | ||
| - Delivering notifications via channels (in-app, webhook, slack) | ||
| - Managing notification subscriptions | ||
| - Rate limiting via throttle cache | ||
| Background tasks (pending notification processing, cleanup) are started | ||
| via the run() method, which should be called from app lifespan. | ||
| """ |
Docstring references a non‑existent run() method.
NotificationService no longer defines run(), so this is misleading. Please update the docstring to reflect the DI/lifespan scheduler that calls the new batch/cleanup methods.
📝 Proposed fix
```diff
-Background tasks (pending notification processing, cleanup) are started
-via the run() method, which should be called from app lifespan.
+Background tasks (pending notification processing, cleanup) are started
+by the DI/lifespan scheduler that calls process_pending_batch() and cleanup_old().
```
In `@backend/app/services/notification_service.py` around lines 85 - 95, The
docstring incorrectly mentions a run() method; update it to remove the run()
reference and instead state that background work is started by the application's
DI/lifespan scheduler which calls this service's batch and cleanup methods (for
example NotificationService.batch_process_pending_notifications and
NotificationService.cleanup_notifications); briefly describe that these methods
handle pending notification processing and periodic cleanup and that they should
be wired into app lifespan/DI.
```python
async def start(self) -> None:
    """Start listening for messages."""
    await self._pubsub.subscribe(self._channel)
    await self._pubsub.get_message(timeout=1.0)  # Consume subscribe confirmation

    async def listener() -> None:
        while True:
            try:
                msg = await self._pubsub.get_message(ignore_subscribe_messages=True, timeout=0.5)
                if msg and msg.get("type") == "message":
                    data = json.loads(msg["data"])
                    await self._handler(data)
            except asyncio.CancelledError:
                break
            except Exception as e:
                self._logger.error(f"Error processing internal message on {self._channel}: {e}")

    self._task = asyncio.create_task(listener())
    self._logger.debug(f"Started internal subscription on {self._channel}")
```
Guard against double start() calls.
If start() is called twice without an intervening close(), a new listener task is created while the old one continues running, causing duplicate message handling and a leaked task.
🛠️ Suggested fix
```diff
 async def start(self) -> None:
     """Start listening for messages."""
+    if self._task is not None:
+        return  # Already started
+
     await self._pubsub.subscribe(self._channel)
     await self._pubsub.get_message(timeout=1.0)  # Consume subscribe confirmation
```
🤖 Prompt for AI Agents
In `@backend/app/services/sse/redis_bus.py` around lines 156 - 174, The start()
method can be called multiple times which will spawn duplicate listener tasks;
add a guard in start() that checks self._task (and whether it's not done) and
either returns early or raises if a listener is already running, or cancel/wait
for the existing task before creating a new one; update the async def
start(self) to inspect self._task and only call asyncio.create_task(listener())
when there is no active task (use self._task.done() to determine activity) and
ensure subscription/confirmation logic is not re-run for an already-started
listener (references: start, _task, listener, _pubsub).
```diff
 async def test_update_and_reset_settings(repo: AdminSettingsRepository) -> None:
     s = SystemSettings()
     updated = await repo.update_system_settings(s, updated_by="admin", user_id="u1")
     assert isinstance(updated, SystemSettings)
     # verify audit log entry exists
-    assert await db.get_collection("audit_log").count_documents({}) >= 1
+    assert await AuditLogDocument.count() >= 1
     reset = await repo.reset_system_settings("admin", "u1")
     assert isinstance(reset, SystemSettings)
-    assert await db.get_collection("audit_log").count_documents({}) >= 2
+    assert await AuditLogDocument.count() >= 2
```
🧩 Analysis chain
🏁 Script executed:
```bash
find . -type f -name "test_admin_settings_repository.py" 2>/dev/null | head -20
```
Repository: HardMax71/Integr8sCode
Length of output: 188

🏁 Script executed:
```bash
git ls-files | grep -E "(test_admin|admin_settings)" | head -20
```
Repository: HardMax71/Integr8sCode
Length of output: 457

🏁 Script executed:
```bash
cat -n backend/tests/integration/db/repositories/test_admin_settings_repository.py | head -50
```
Repository: HardMax71/Integr8sCode
Length of output: 1653

🏁 Script executed:
```bash
rg "def db" backend/tests --type py -B 2 -A 5 | head -100
```
Repository: HardMax71/Integr8sCode
Length of output: 485

🏁 Script executed:
```bash
rg "fixture|scope|autouse" backend/tests -B 2 -A 2 --type py | grep -E "(db|audit)" -A 3 -B 1 | head -80
```
Repository: HardMax71/Integr8sCode
Length of output: 2217

🏁 Script executed:
```bash
find . -path "*/tests/integration/conftest.py" -type f
```
Repository: HardMax71/Integr8sCode
Length of output: 105

🏁 Script executed:
```bash
cat -n backend/tests/integration/conftest.py 2>/dev/null | head -100
```
Repository: HardMax71/Integr8sCode
Length of output: 4140

🏁 Script executed:
```bash
cat -n backend/app/db/docs/admin_settings.py
```
Repository: HardMax71/Integr8sCode
Length of output: 2515

🏁 Script executed:
```bash
rg "AuditLogDocument" backend/tests --type py -B 2 -A 2 | head -80
```
Repository: HardMax71/Integr8sCode
Length of output: 1361

🏁 Script executed:
```bash
find . -path "*/tests/helpers/cleanup.py" -type f
```
Repository: HardMax71/Integr8sCode
Length of output: 100

🏁 Script executed:
```bash
cat -n tests/helpers/cleanup.py
```
Repository: HardMax71/Integr8sCode
Length of output: 122

🏁 Script executed:
```bash
rg "scope.*AsyncContainer|@pytest.fixture.*scope" backend/tests --type py -B 2 -A 3 | head -80
```
Repository: HardMax71/Integr8sCode
Length of output: 5389

🏁 Script executed:
```bash
cat -n backend/tests/helpers/cleanup.py
```
Repository: HardMax71/Integr8sCode
Length of output: 911

🏁 Script executed:
```bash
rg "def db\(\)" backend/tests --type py -B 3 -A 10
```
Repository: HardMax71/Integr8sCode
Length of output: 795

🏁 Script executed:
```bash
head -100 backend/tests/conftest.py
```
Repository: HardMax71/Integr8sCode
Length of output: 3987
Make audit-log assertions delta-based to avoid false positives.
Global AuditLogDocument.count() assertions are fragile; they pass whenever any audit logs exist, not necessarily ones from this test. Use a baseline capture and assert the increment to ensure the test actually validates its own operations.
🔧 Suggested adjustment (baseline + delta)
```diff
 async def test_update_and_reset_settings(repo: AdminSettingsRepository) -> None:
+    baseline = await AuditLogDocument.count()
     s = SystemSettings()
     updated = await repo.update_system_settings(s, updated_by="admin", user_id="u1")
     assert isinstance(updated, SystemSettings)
-    assert await AuditLogDocument.count() >= 1
+    assert await AuditLogDocument.count() >= baseline + 1
     reset = await repo.reset_system_settings("admin", "u1")
     assert isinstance(reset, SystemSettings)
-    assert await AuditLogDocument.count() >= 2
+    assert await AuditLogDocument.count() >= baseline + 2
```
🤖 Prompt for AI Agents
In `@backend/tests/integration/db/repositories/test_admin_settings_repository.py`
around lines 29 - 36, The test test_update_and_reset_settings currently asserts
AuditLogDocument.count() >= N which is brittle; capture a baseline count before
calling repo.update_system_settings and repo.reset_system_settings (use
AuditLogDocument.count()), then after update_system_settings assert the count
increased by one and after reset_system_settings assert the count increased by
two total (or increased by one relative to the post-update count). Reference the
test function test_update_and_reset_settings, the repository methods
update_system_settings and reset_system_settings, and AuditLogDocument.count()
when making these delta-based assertions.
5 issues found across 33 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/workers/dlq_processor.py">
<violation number="1" location="backend/workers/dlq_processor.py:185">
P2: auto_commit=False disables auto-acknowledgment, but the handler never explicitly acknowledges messages. This leaves Kafka offsets uncommitted, so DLQ messages will be reprocessed on restart or rebalance. Either enable auto-commit or accept a message object and call ack/nack after processing.</violation>
</file>
<file name="backend/tests/unit/conftest.py">
<violation number="1" location="backend/tests/unit/conftest.py:65">
P3: The fixture uses `AsyncMongoMockClient`, so the docstring claiming a "Real MongoDB" is misleading. Update it to reflect that this is an in-memory/mock MongoDB instance.</violation>
</file>
<file name="backend/app/dlq/manager.py">
<violation number="1" location="backend/app/dlq/manager.py:80">
P2: Wrap the processing block in try/except/finally so failures are logged and duration metrics are recorded even when _process_dlq_message raises.</violation>
</file>
<file name="backend/workers/run_saga_orchestrator.py">
<violation number="1" location="backend/workers/run_saga_orchestrator.py:109">
P2: Mongo client is created before the `try/finally`, so failures during `init_beanie` or schema registry initialization skip cleanup. Wrap the initialization in the `try/finally` (or use an async context manager) so the client is always closed on startup errors.</violation>
</file>
<file name="backend/workers/run_result_processor.py">
<violation number="1" location="backend/workers/run_result_processor.py:115">
P2: Mongo client initialization is outside the `try/finally`, so a failure during `init_beanie` or other setup will skip cleanup and leave the client unclosed. Wrap initialization in the same `try/finally` (or guard `close()` when init fails) to avoid resource leaks on startup errors.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| @broker.subscriber( | ||
| *topics, | ||
| group_id=group_id, | ||
| auto_commit=False, |
P2: auto_commit=False disables auto-acknowledgment, but the handler never explicitly acknowledges messages. This leaves Kafka offsets uncommitted, so DLQ messages will be reprocessed on restart or rebalance. Either enable auto-commit or accept a message object and call ack/nack after processing.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/workers/dlq_processor.py, line 185:
<comment>auto_commit=False disables auto-acknowledgment, but the handler never explicitly acknowledges messages. This leaves Kafka offsets uncommitted, so DLQ messages will be reprocessed on restart or rebalance. Either enable auto-commit or accept a message object and call ack/nack after processing.</comment>
<file context>
@@ -73,39 +103,127 @@ def filter_old_messages(message: DLQMessage) -> bool:
+ @broker.subscriber(
+ *topics,
+ group_id=group_id,
+ auto_commit=False,
+ decoder=decode_dlq_json,
+ )
</file context>
```python
start = asyncio.get_running_loop().time()

# Record metrics
self.metrics.record_dlq_message_received(message.original_topic, message.event.event_type)
self.metrics.record_dlq_message_age((datetime.now(timezone.utc) - message.failed_at).total_seconds())

# Process with tracing
ctx = extract_trace_context(message.headers)
with get_tracer().start_as_current_span(
    name="dlq.consume",
    context=ctx,
    kind=SpanKind.CONSUMER,
    attributes={
        EventAttributes.KAFKA_TOPIC: self.dlq_topic,
        EventAttributes.EVENT_TYPE: message.event.event_type,
        EventAttributes.EVENT_ID: message.event.event_id,
    },
):
    await self._process_dlq_message(message)

self.metrics.record_dlq_processing_duration(asyncio.get_running_loop().time() - start, "process")
```
P2: Wrap the processing block in try/except/finally so failures are logged and duration metrics are recorded even when _process_dlq_message raises.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/app/dlq/manager.py, line 80:
<comment>Wrap the processing block in try/except/finally so failures are logged and duration metrics are recorded even when _process_dlq_message raises.</comment>
<file context>
@@ -70,42 +69,35 @@ def __init__(
+ Args:
+ message: Typed DLQMessage (deserialized by FastStream/Avro)
+ """
+ start = asyncio.get_running_loop().time()
+
+ # Record metrics
</file context>
Suggested version:

```python
start = asyncio.get_running_loop().time()
try:
    # Record metrics
    self.metrics.record_dlq_message_received(message.original_topic, message.event.event_type)
    self.metrics.record_dlq_message_age((datetime.now(timezone.utc) - message.failed_at).total_seconds())

    # Process with tracing
    ctx = extract_trace_context(message.headers)
    with get_tracer().start_as_current_span(
        name="dlq.consume",
        context=ctx,
        kind=SpanKind.CONSUMER,
        attributes={
            EventAttributes.KAFKA_TOPIC: self.dlq_topic,
            EventAttributes.EVENT_TYPE: message.event.event_type,
            EventAttributes.EVENT_ID: message.event.event_id,
        },
    ):
        await self._process_dlq_message(message)
except Exception as e:
    self.logger.error(f"Error processing DLQ message: {e}")
    raise
finally:
    self.metrics.record_dlq_processing_duration(asyncio.get_running_loop().time() - start, "process")
```
| app_logger.info("SagaOrchestrator starting...") | ||
|
|
||
| # Initialize MongoDB + Beanie | ||
| mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient( |
P2: Mongo client is created before the `try/finally`, so failures during `init_beanie` or schema registry initialization skip cleanup. Wrap the initialization in the `try/finally` (or use an async context manager; a sketch follows the agent prompt below) so the client is always closed on startup errors.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/workers/run_saga_orchestrator.py, line 109:
<comment>Mongo client is created before the `try/finally`, so failures during `init_beanie` or schema registry initialization skip cleanup. Wrap the initialization in the `try/finally` (or use an async context manager) so the client is always closed on startup errors.</comment>
<file context>
@@ -105,9 +105,14 @@ async def lifespan(app: FastStream) -> AsyncIterator[None]:
- await container.get(Database) # Triggers init_beanie via DatabaseProvider
- schema_registry = await container.get(SchemaRegistryManager) # Triggers initialize_schemas
+ # Initialize MongoDB + Beanie
+ mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
+ settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
+ )
</file context>
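The context-manager option mentioned above could look like this sketch; the `beanie_client` helper is hypothetical, and `ALL_DOCUMENTS` is the document list the workers already pass to `init_beanie`:

```python
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager

from beanie import init_beanie
from pymongo import AsyncMongoClient

@asynccontextmanager
async def beanie_client(settings) -> AsyncIterator[AsyncMongoClient[dict[str, object]]]:
    # Guarantees close() on any startup failure, including init_beanie raising
    # before the worker loop starts. `settings` is the app's Settings object.
    client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
        settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
    )
    try:
        await init_beanie(database=client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
        yield client
    finally:
        await client.close()
```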
| app_logger.info("ResultProcessor starting...") | ||
|
|
||
| # Initialize MongoDB + Beanie | ||
| mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient( |
P2: Mongo client initialization is outside the try/finally, so a failure during init_beanie or other setup will skip cleanup and leave the client unclosed. Wrap initialization in the same try/finally (or guard close() when init fails) to avoid resource leaks on startup errors.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/workers/run_result_processor.py, line 115:
<comment>Mongo client initialization is outside the `try/finally`, so a failure during `init_beanie` or other setup will skip cleanup and leave the client unclosed. Wrap initialization in the same `try/finally` (or guard `close()` when init fails) to avoid resource leaks on startup errors.</comment>
<file context>
@@ -111,9 +111,14 @@ async def lifespan(app: FastStream) -> AsyncIterator[None]:
- await container.get(Database) # Triggers init_beanie via DatabaseProvider
- schema_registry = await container.get(SchemaRegistryManager) # Triggers initialize_schemas
+ # Initialize MongoDB + Beanie
+ mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
+ settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
+ )
</file context>
```
Provides:
- Fake Redis, Kafka, K8s (boundary clients)
- Real metrics, repositories, services (internal)
- Real MongoDB via init_beanie
```
P3: The fixture uses AsyncMongoMockClient, so the docstring claiming a "Real MongoDB" is misleading. Update it to reflect that this is an in-memory/mock MongoDB instance.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/tests/unit/conftest.py, line 65:
<comment>The fixture uses `AsyncMongoMockClient`, so the docstring claiming a "Real MongoDB" is misleading. Update it to reflect that this is an in-memory/mock MongoDB instance.</comment>
<file context>
@@ -60,29 +60,33 @@ async def unit_container(test_settings: Settings) -> AsyncGenerator[AsyncContain
- - Fake Redis, Kafka, K8s, MongoDB (boundary clients)
+ - Fake Redis, Kafka, K8s (boundary clients)
- Real metrics, repositories, services (internal)
+ - Real MongoDB via init_beanie
"""
container = make_async_container(
</file context>
```diff
-    - Real MongoDB via init_beanie
+    - MongoDB mock via init_beanie
```
|
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
backend/app/core/providers.py (1)
103-116: Redis client lacks lifecycle management. The Kafka producer uses `AsyncIterable` with proper `start()`/`stop()` lifecycle, but the Redis client is returned directly without cleanup. Redis connections may leak on application shutdown.
🔧 Proposed fix with proper lifecycle
AsyncIterablewith properstart()/stop()lifecycle, but the Redis client is returned directly without cleanup. Redis connections may leak on application shutdown.🔧 Proposed fix with proper lifecycle
- `@provide` - def get_redis_client(self, settings: Settings, logger: logging.Logger) -> redis.Redis: - logger.info(f"Redis configured: {settings.REDIS_HOST}:{settings.REDIS_PORT}/{settings.REDIS_DB}") - return redis.Redis( + `@provide` + async def get_redis_client(self, settings: Settings, logger: logging.Logger) -> AsyncIterable[redis.Redis]: + logger.info(f"Redis configured: {settings.REDIS_HOST}:{settings.REDIS_PORT}/{settings.REDIS_DB}") + client = redis.Redis( host=settings.REDIS_HOST, port=settings.REDIS_PORT, db=settings.REDIS_DB, password=settings.REDIS_PASSWORD, ssl=settings.REDIS_SSL, max_connections=settings.REDIS_MAX_CONNECTIONS, decode_responses=settings.REDIS_DECODE_RESPONSES, socket_connect_timeout=5, socket_timeout=5, ) + yield client + await client.aclose() + logger.info("Redis client closed")backend/tests/integration/dlq/test_dlq_manager.py (1)
55-85: Wait for partition assignment before processing to avoid test flakiness.With
auto_offset_reset="latest"and no committed offsets, the consumer's initial offset is set to the partition's log end when assignment completes. Ifprocess_message()produces the event before partition assignment finishes, the consumer may start fetching from a position after the message, causing it to be skipped and the test to flake.Add a check to ensure partition assignment is complete before creating the consume task:
🔧 Suggested stabilization
await consumer.start() +while not consumer.assignment(): + await asyncio.sleep(0.05) consume_task = asyncio.create_task(consume_dlq_events()) # Process message directly via DLQManager (simulating FastStream handler) await dlq_manager.process_message(dlq_message)
🤖 Fix all issues with AI agents
In `@backend/app/services/user_settings_service.py`:
- Around line 216-221: The invalidate_cache method currently calls
self._redis.delete(cache_key) without handling Redis errors; update
invalidate_cache to catch Redis-related exceptions (matching the pattern used in
your cache read/write methods), log an error via self.logger.error including the
cache_key and exception details, and swallow the exception so callers like
reset_user_settings won't fail due to cache layer issues while preserving the
existing debug log when deletion succeeds (references: invalidate_cache,
self._redis.delete, self.logger.debug, reset_user_settings).
- Around line 223-227: The _set_cache method currently lets Redis errors
propagate; change it to swallow cache write failures and only log them so
database operations like update_user_settings and restore_settings_to_point
don't fail when caching fails. Wrap the await self._redis.setex(...) call in a
try/except catching broad Redis/connection exceptions (or Exception if specific
client exceptions aren't imported), log the error with self.logger.warning or
self.logger.debug including the cache_key (from _cache_key) and
_CACHE_TTL_SECONDS, and return None silently on failure; leave successful
behavior unchanged (still log debug on success). Ensure no exception is
re-raised so callers of _set_cache (e.g., update_user_settings,
restore_settings_to_point) continue normally when caching fails.
- Around line 55-63: Wrap the Redis read in get_user_settings with a try/except
to catch connection/timeouts from self._redis.get(cache_key) (and optionally
DomainUserSettings.model_validate_json parsing errors); on exception log a
non-fatal warning via self.logger.warning including the exception and cache_key,
then fall back to calling and returning get_user_settings_fresh(user_id). Ensure
you still log the cache hit when cached succeeds and only use the cached data
when model_validate_json succeeds; do not let Redis exceptions propagate and
crash the request.
♻️ Duplicate comments (1)
backend/app/core/providers.py (1)
421-430: SSEConnectionRegistry scope issue persists from previous review. As flagged in a prior review, `Scope.REQUEST` creates a new registry per HTTP request, causing `get_connection_count()` to always return 0. The registry should use `Scope.APP` to track connections globally across requests.
🔧 Fix: Change scope to APP
```diff
-    @provide(scope=Scope.REQUEST)
+    @provide(scope=Scope.APP)
     def get_sse_connection_registry(
         self,
         logger: logging.Logger,
         connection_metrics: ConnectionMetrics,
     ) -> SSEConnectionRegistry:
```
🧹 Nitpick comments (2)
backend/app/services/user_settings_service.py (1)
40-45: Type annotation uses sync `redis.Redis` but operations are async. The `redis_client` parameter is typed as `redis.Redis`, but the code uses `await` with Redis operations (e.g., `await self._redis.get(cache_key)`). The `redis.asyncio` module provides `redis.asyncio.Redis` as the async client type. While the current annotation works at runtime due to Python's duck typing, it's technically incorrect for type checkers.
🔧 Suggested fix

```diff
 def __init__(
     self,
     repository: UserSettingsRepository,
     event_service: KafkaEventService,
     logger: logging.Logger,
-    redis_client: redis.Redis,
+    redis_client: redis.Redis,  # type: ignore[type-arg]  # async Redis client
 ) -> None:
```

Alternatively, use the more explicit async type if available in your redis version:

```python
redis_client: "redis.asyncio.Redis[str]",
```
backend/tests/integration/services/user_settings/test_user_settings_service.py (1)
39-43: Consider asserting the results of update wrappers. The update wrapper calls at the end of the test don't verify their return values. While this exercises the code paths, adding assertions would catch regressions more effectively.
💡 Example assertions
```diff
 # Update wrappers
-await svc.update_theme(user_id, Theme.DARK)
-await svc.update_notification_settings(user_id, DomainNotificationSettings())
-await svc.update_editor_settings(user_id, DomainEditorSettings(tab_size=2))
-await svc.update_custom_setting(user_id, "k", "v")
+theme_result = await svc.update_theme(user_id, Theme.DARK)
+assert theme_result.theme == Theme.DARK
+
+notif_result = await svc.update_notification_settings(user_id, DomainNotificationSettings())
+assert notif_result.notifications is not None
+
+editor_result = await svc.update_editor_settings(user_id, DomainEditorSettings(tab_size=2))
+assert editor_result.editor.tab_size == 2
+
+custom_result = await svc.update_custom_setting(user_id, "k", "v")
+assert custom_result.custom_settings.get("k") == "v"
```
```diff
 async def get_user_settings(self, user_id: str) -> DomainUserSettings:
-    """Get settings with cache; rebuild and cache on miss."""
-    if user_id in self._cache:
-        cached = self._cache[user_id]
-        self.logger.debug(f"Settings cache hit for user {user_id}", extra={"cache_size": len(self._cache)})
-        return cached
+    """Get settings with Redis cache; rebuild and cache on miss."""
+    cache_key = self._cache_key(user_id)
+    cached = await self._redis.get(cache_key)
+    if cached:
+        self.logger.debug(f"Settings cache hit for user {user_id}")
+        return DomainUserSettings.model_validate_json(cached)

     return await self.get_user_settings_fresh(user_id)
```
Redis cache operations lack error handling.
If Redis becomes unavailable or times out, the await self._redis.get(cache_key) call will raise an exception, causing the entire request to fail. For a cache layer, Redis failures should be non-fatal—falling back to fresh data retrieval from the database.
🛡️ Proposed fix with graceful degradation
```diff
 async def get_user_settings(self, user_id: str) -> DomainUserSettings:
     """Get settings with Redis cache; rebuild and cache on miss."""
     cache_key = self._cache_key(user_id)
-    cached = await self._redis.get(cache_key)
-    if cached:
-        self.logger.debug(f"Settings cache hit for user {user_id}")
-        return DomainUserSettings.model_validate_json(cached)
+    try:
+        cached = await self._redis.get(cache_key)
+        if cached:
+            self.logger.debug(f"Settings cache hit for user {user_id}")
+            return DomainUserSettings.model_validate_json(cached)
+    except Exception as e:
+        self.logger.warning(f"Redis cache read failed for user {user_id}: {e}")
     return await self.get_user_settings_fresh(user_id)
```
🤖 Prompt for AI Agents
In `@backend/app/services/user_settings_service.py` around lines 55 - 63, Wrap the
Redis read in get_user_settings with a try/except to catch connection/timeouts
from self._redis.get(cache_key) (and optionally
DomainUserSettings.model_validate_json parsing errors); on exception log a
non-fatal warning via self.logger.warning including the exception and cache_key,
then fall back to calling and returning get_user_settings_fresh(user_id). Ensure
you still log the cache hit when cached succeeds and only use the cached data
when model_validate_json succeeds; do not let Redis exceptions propagate and
crash the request.
```diff
 async def invalidate_cache(self, user_id: str) -> None:
     """Invalidate cached settings for a user."""
-    if self._cache.pop(user_id, None) is not None:
-        self.logger.debug(f"Invalidated cache for user {user_id}", extra={"cache_size": len(self._cache)})
-
-def _add_to_cache(self, user_id: str, settings: DomainUserSettings) -> None:
-    """Add settings to TTL+LRU cache."""
-    self._cache[user_id] = settings
-    self.logger.debug(f"Cached settings for user {user_id}", extra={"cache_size": len(self._cache)})
-
-def get_cache_stats(self) -> dict[str, Any]:
-    """Get cache statistics for monitoring."""
-    return {
-        "cache_size": len(self._cache),
-        "max_cache_size": self._max_cache_size,
-        "expired_entries": 0,
-        "cache_ttl_seconds": self._cache_ttl.total_seconds(),
-    }
+    cache_key = self._cache_key(user_id)
+    deleted = await self._redis.delete(cache_key)
+    if deleted:
+        self.logger.debug(f"Invalidated cache for user {user_id}")
```
Cache invalidation should also handle Redis failures.
For consistency with the cache read/write error handling strategy, invalidate_cache should catch Redis exceptions to ensure reset_user_settings doesn't fail due to cache layer issues.
🛡️ Proposed fix
```diff
 async def invalidate_cache(self, user_id: str) -> None:
     """Invalidate cached settings for a user."""
     cache_key = self._cache_key(user_id)
-    deleted = await self._redis.delete(cache_key)
-    if deleted:
-        self.logger.debug(f"Invalidated cache for user {user_id}")
+    try:
+        deleted = await self._redis.delete(cache_key)
+        if deleted:
+            self.logger.debug(f"Invalidated cache for user {user_id}")
+    except Exception as e:
+        self.logger.warning(f"Failed to invalidate cache for user {user_id}: {e}")
```
🤖 Prompt for AI Agents
In `@backend/app/services/user_settings_service.py` around lines 216-221: invalidate_cache currently calls self._redis.delete(cache_key) without handling Redis errors. Update it to catch Redis-related exceptions (matching the pattern used in the cache read/write methods), log a non-fatal warning via self.logger.warning that includes the cache_key and exception details, and swallow the exception so callers such as reset_user_settings won't fail due to cache-layer issues, while preserving the existing debug log when deletion succeeds.
    async def _set_cache(self, user_id: str, settings: DomainUserSettings) -> None:
        """Set settings in Redis cache with TTL."""
        cache_key = self._cache_key(user_id)
        await self._redis.setex(cache_key, self._CACHE_TTL_SECONDS, settings.model_dump_json())
        self.logger.debug(f"Cached settings for user {user_id}")
Cache write failures should not propagate.
Similar to the read path, _set_cache should handle Redis failures gracefully. A failed cache write shouldn't cause update_user_settings or restore_settings_to_point to fail when the primary database operation succeeded.
🛡️ Proposed fix
     async def _set_cache(self, user_id: str, settings: DomainUserSettings) -> None:
         """Set settings in Redis cache with TTL."""
         cache_key = self._cache_key(user_id)
-        await self._redis.setex(cache_key, self._CACHE_TTL_SECONDS, settings.model_dump_json())
-        self.logger.debug(f"Cached settings for user {user_id}")
+        try:
+            await self._redis.setex(cache_key, self._CACHE_TTL_SECONDS, settings.model_dump_json())
+            self.logger.debug(f"Cached settings for user {user_id}")
+        except Exception as e:
+            self.logger.warning(f"Failed to cache settings for user {user_id}: {e}")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    async def _set_cache(self, user_id: str, settings: DomainUserSettings) -> None:
        """Set settings in Redis cache with TTL."""
        cache_key = self._cache_key(user_id)
        try:
            await self._redis.setex(cache_key, self._CACHE_TTL_SECONDS, settings.model_dump_json())
            self.logger.debug(f"Cached settings for user {user_id}")
        except Exception as e:
            self.logger.warning(f"Failed to cache settings for user {user_id}: {e}")
🤖 Prompt for AI Agents
In `@backend/app/services/user_settings_service.py` around lines 223-227: _set_cache currently lets Redis errors propagate; change it to swallow cache-write failures and only log them, so database operations such as update_user_settings and restore_settings_to_point don't fail when caching does. Wrap the await self._redis.setex(...) call in a try/except that catches Redis/connection exceptions (or Exception if specific client exception classes aren't imported), log the failure with self.logger.warning including the cache_key (from _cache_key), and return silently on failure; leave the success path unchanged (still log debug on success). Ensure no exception is re-raised, so callers of _set_cache continue normally when caching fails.
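Taken together, the three comments above describe a best-effort read-through cache. A standalone sketch of that pattern, with illustrative names rather than this service's API, assuming a redis-py client created with decode_responses=True:

    import logging
    from collections.abc import Awaitable, Callable

    from redis.asyncio import Redis

    logger = logging.getLogger(__name__)

    async def read_through(redis: Redis, key: str, ttl: int,
                           load_fresh: Callable[[], Awaitable[str]]) -> str:
        try:
            cached = await redis.get(key)
            if cached:
                return cached  # cache hit
        except Exception as e:
            logger.warning("cache read failed for %s: %s", key, e)  # best-effort read
        value = await load_fresh()  # the database stays the source of truth
        try:
            await redis.setex(key, ttl, value)  # best-effort write
        except Exception as e:
            logger.warning("cache write failed for %s: %s", key, e)
        return value

Every Redis call is wrapped, so the database path always works even when the cache layer is down.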
Summary
Removing nullable fields where possible - a recurring source of errors. A hypothetical before/after is sketched below.
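For illustration only - a made-up Pydantic model showing the shape of change the title describes, not a model from this repo:

    from pydantic import BaseModel

    class SettingsBefore(BaseModel):
        theme: str | None = None  # every consumer must branch on None

    class SettingsAfter(BaseModel):
        theme: str = "light"  # always present; the None branches disappear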
Type of Change
Test Plan

- Backend tests (uv run pytest)
- E2E tests (npx playwright test)
- Manual test steps:

Checklist

- pre-commit run --all-files has been run and all checks pass

Summary by cubic
Replaced in-process lifecycles with DI and moved message handling to FastStream. Removed UnifiedConsumer/EventDispatcher, the coordinator, and EventBus; NotificationService now self-starts; SSE now uses a stateless event router over a connection registry (sketched below).
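A minimal sketch of what a stateless router over a connection registry can look like; the class and method names are assumptions, not this repo's API:

    import asyncio
    from collections import defaultdict

    class SSEConnectionRegistry:
        """Maps user_id to the queues of that user's open SSE connections."""

        def __init__(self) -> None:
            self._connections: dict[str, set[asyncio.Queue]] = defaultdict(set)

        def register(self, user_id: str) -> asyncio.Queue:
            queue: asyncio.Queue = asyncio.Queue()
            self._connections[user_id].add(queue)
            return queue

        def unregister(self, user_id: str, queue: asyncio.Queue) -> None:
            self._connections[user_id].discard(queue)

        async def route(self, user_id: str, event: dict) -> None:
            # Stateless routing: fan out to whichever connections exist right now.
            for queue in self._connections.get(user_id, set()):
                await queue.put(event)

The router holds no per-event state; each SSE endpoint registers a queue on connect and unregisters it on disconnect.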
Refactors
Migration
Written for commit 62c375c. Summary will update on new commits.
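For context on the FastStream migration, a minimal subscriber sketch; the broker URL and topic name are placeholders, not values from this repo:

    from faststream import FastStream
    from faststream.kafka import KafkaBroker

    broker = KafkaBroker("localhost:9092")  # placeholder bootstrap server
    app = FastStream(broker)

    @broker.subscriber("execution-events")  # placeholder topic name
    async def handle_execution_event(event: dict) -> None:
        # FastStream owns the consumer lifecycle and invokes this per message,
        # replacing the removed UnifiedConsumer/EventDispatcher loop.
        ...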
Summary by CodeRabbit
New Features
Refactor
Tests
Chores
✏️ Tip: You can customize this high-level summary in your review settings.