Rolled-back df.start() leaves failed duroxide.executions rows that inflate df.metrics() #213

Description

@oborchers

Summary

df.metrics() appears to count every failed row in duroxide.executions, including orphan executions left behind when a transaction that called df.start() is rolled back. df.instances and df.list_instances('failed'), by contrast, show only persisted workflow instances.

That makes the failed instance count disagree across public APIs after rollback scenarios.

Observed

Tested against current main at 11ac64e3adb64c14386be5c737b3a3806d873fc4.

After rollback-oriented tests, the counts diverged:

source               total  completed  failed  running
df.metrics()         399    392        7       0
df.instances         396    392        4       0
duroxide.executions  399    392        7       0

The extra failed rows were in duroxide.executions with no matching row in df.instances:

SELECT
  e.instance_id,
  e.execution_id,
  e.status,
  left(e.output, 180) AS output_prefix,
  i.id AS df_instance_id
FROM duroxide.executions e
LEFT JOIN df.instances i ON i.id = e.instance_id
WHERE e.status = 'Failed'
  AND i.id IS NULL
ORDER BY e.instance_id;

Example output prefix:

Instance <id> not found after 5s (transaction may have been rolled back)

So df.metrics() reports these as failed instances even though df.instances and df.list_instances('failed') do not expose them as failed workflow instances.

Repro Shape

One way to trigger this is to start a workflow inside a transaction that later rolls back, wait for the worker to observe the missing instance, then compare the metrics API with df.instances.

BEGIN;
SELECT df.start('SELECT 1', 'rollback-metrics-probe');
ROLLBACK;

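-- Wait long enough for the worker to record the missing-instance failure.
-- The failure message mentions a 5s timeout; 10s is an assumed safety margin.
SELECT pg_sleep(10);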

SELECT * FROM df.metrics();

SELECT status, count(*)
FROM df.instances
GROUP BY status;

SELECT e.instance_id, e.status, e.output, i.id AS df_instance_id
FROM duroxide.executions e
LEFT JOIN df.instances i ON i.id = e.instance_id
WHERE e.status = 'Failed'
  AND i.id IS NULL;

Expected

Either:

  • df.metrics() should count the same persisted workflow instances that df.instances / df.list_instances() expose, or
  • the docs should clearly state that df.metrics().failed_instances includes lower-level failed duroxide.executions, including orphan executions created by rolled-back starts.

For dashboards and alerting, the current behavior makes rollback probes look like durable workflow failures.
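
Until this is resolved, one possible dashboard-side workaround is to split failed executions by whether a matching df.instances row exists. This is only a sketch: it reuses the tables and columns from the queries above, the aliases failed_persisted and failed_orphaned are illustrative names, and counting distinct instance_id values is an assumption about how an instance with several failed executions should be counted.

```sql
-- Split failed executions into those backed by a persisted df.instances row
-- and orphans with no matching instance (e.g. from rolled-back df.start() calls).
SELECT
  count(DISTINCT e.instance_id) FILTER (WHERE i.id IS NOT NULL) AS failed_persisted,
  count(DISTINCT e.instance_id) FILTER (WHERE i.id IS NULL)     AS failed_orphaned
FROM duroxide.executions e
LEFT JOIN df.instances i ON i.id = e.instance_id
WHERE e.status = 'Failed';
```

If each failed instance has a single failed execution, the two columns should sum to the failed count that df.metrics() reports, and failed_persisted is the figure an alert on durable workflow failures would likely want.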

Notes

From a quick source read, df.metrics() appears to come from the generated get_system_metrics() path and counts failed rows in duroxide.executions. That explains why it can diverge from df.instances after rollback.
