Last updated: 2026-05-05

Debug & Workflow Tooling Reference

Tooling reference for inspecting workflows, validating shipped JSONs, and correlating runtime artifacts. Symptom-first quality troubleshooting (why does the video look wrong) lives in docs/guides/debugging_guide.md.


Inspection scripts

Read-only against any workflow JSON; none mutate state.

| Script | Purpose | Trigger |
| --- | --- | --- |
| scripts/audit_workflows.py [--verbose] | Health audit across all example_workflows/: sage, batch-encode, sigma chain, resolution, (L-1)%8, preprocess, decoder, F2/F3 symmetry, plus 4 generic audit invariants (cycle / widget shape / link integrity / layout orphans) plus 1 AST test (cond-metadata types). Exits 1 on ERR. | After bulk edits, before commit, in CI. |
| scripts/analyze_workflow_dag.py <wf> --format <ascii\|mermaid\|dot\|json> [--save-run] | Topo-sorted execution order + graph rendering. --save-run lands the artifact under data/runs/${RUN_ID}/dag_<slug>.<ext> (correlates with sibling run logs) or data/runs/dag/dag_<slug>_<ts>.<ext> if RUN_ID unset. | Diagnosing execution order, post-validate cycle suspicion, comparing iteration-state across two workflows. |
| scripts/trace_node_source.py <wf> <id> --include-inputs | Resolve any node to AST-extracted source + wiring. Flags object_patches, captured tensors, bypasses, widget overrides. | Before trusting any widget annotation, before assuming bypass is inert. |
| scripts/sage_telemetry_summary.py --sage-log <path> [--exec-log <path>] | Outside-ComfyUI aggregator. Per-(kernel, mask) median/p90/count + Phase 0 gate verdict. Reads only; does not write. | After a traced render, when comparing kernel routing. |
| scripts/verify_sage_iteration_trace.sh | Diff per-iter sage kernel counts. | Suspecting per-iter kernel-routing drift. |
| scripts/diagnose_overlap_seams.py --latent <path> --iteration-count N --window-latents W --overlap-latents O | Per-frame ghost-residual scan \|f[t] - (f[t-1] + f[t+1]) / 2\| (inverted, normalized) on an assembled loop output latent. Reports top-K ghost-scoring frames, per-seam-band scores at each iteration-boundary, and a noise-floor baseline. CPU-only; reads saved latent tensors or video files. | Investigating iteration-boundary artifacts (seam ghosting, blend-flicker). Gating evidence before building a corrective seam-zone pass. |
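
The ghost-residual formula used by diagnose_overlap_seams.py can be illustrated with a minimal pure-Python sketch. It is not the script's implementation: the input shape (one flattened list of floats per frame), the function names, and the "1 - residual / peak" inversion are all assumptions chosen for illustration.

```python
from statistics import mean

def ghost_scores(frames):
    """Map interior frame index -> inverted, normalized ghost score.

    frames: list of equal-length lists of floats (assumed layout, one
    flattened frame per entry). A frame close to the average of its two
    neighbours scores near 1.0.
    """
    residuals = []
    for t in range(1, len(frames) - 1):
        prev, cur, nxt = frames[t - 1], frames[t], frames[t + 1]
        # Mean of |f[t] - (f[t-1] + f[t+1]) / 2| over all elements.
        residuals.append(
            mean(abs(c - (p + n) / 2) for p, c, n in zip(prev, cur, nxt))
        )
    peak = max(residuals, default=0.0)
    if peak == 0.0:
        return {i + 1: 1.0 for i in range(len(residuals))}
    # Assumed inversion/normalization: low residual -> high ghost score.
    return {i + 1: 1.0 - r / peak for i, r in enumerate(residuals)}

def top_k(scores, k=3):
    """Frame indices with the highest ghost scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

frames = [[0.0, 0.0], [1.0, 3.0], [1.0, 2.0], [1.0, 1.0], [4.0, 0.0]]
scores = ghost_scores(frames)
worst = top_k(scores, k=1)
```

Frame 2 in this toy input is exactly the average of frames 1 and 3, so it receives the highest score.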

audit_workflows.py is intentionally WorkflowEditor-independent — raw orjson.loads + inline link scans. Debug tools must stay usable when the editor they audit has a bug; don't DRY against WorkflowEditor.
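
As an illustration of what an editor-independent inline link scan might look like, the sketch below parses raw JSON and cross-checks link records against node-level references. It uses the standard-library json module in place of orjson, assumes the classic ComfyUI link record layout [link_id, source_id, source_slot, target_id, target_slot, type], and the function name scan_links is hypothetical.

```python
import json

def scan_links(wf):
    """Return error strings for link records that disagree with the nodes."""
    nodes = {n["id"]: n for n in wf.get("nodes", [])}
    errors = []
    for link_id, src_id, src_slot, dst_id, dst_slot, *_ in wf.get("links", []):
        src, dst = nodes.get(src_id), nodes.get(dst_id)
        if src is None or dst is None:
            errors.append(f"link {link_id}: endpoint node missing")
            continue
        outputs = src.get("outputs") or []
        inputs = dst.get("inputs") or []
        if src_slot >= len(outputs) or dst_slot >= len(inputs):
            errors.append(f"link {link_id}: slot out of range")
            continue
        # The source output must list the link id.
        if link_id not in (outputs[src_slot].get("links") or []):
            errors.append(f"link {link_id}: not listed on source output")
        # The target input must point back at the same link id.
        if inputs[dst_slot].get("link") != link_id:
            errors.append(f"link {link_id}: target input points elsewhere")
    return errors

RAW = """
{"nodes": [
   {"id": 1, "outputs": [{"links": [7]}]},
   {"id": 2, "inputs": [{"link": 8}]}],
 "links": [[7, 1, 0, 2, 0, "MODEL"]]}
"""
errors = scan_links(json.loads(RAW))
```

Keeping the scan to plain dict and list access means it still runs when a higher-level editor class is broken.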


Workflow validation

audit_workflows.py runs in CI on every push. New invariants land here paired with their apply scripts. Two flavors:

Named pattern checks (one per known fix):

| Check | Pairs with apply script | What it ERRs on |
| --- | --- | --- |
| sage / sage_mode / sage_active | scripts/archive/apply_sage_mode.py | Missing or non-auto_mask_aware AudioLoopHelperSageAttention |
| iteration_stamp | scripts/archive/apply_iteration_stamp.py | Missing LoopIterationStamp |
| preprocess_symmetry (F2) | apply_loop_guide_preprocess_symmetry.py | Loop guide branch skips LTXVPreprocess |
| loop_cropguides_symmetry (F3) | apply_loop_cropguides_symmetry.py | Loop CFGGuider not via LTXVCropGuides |
| alc_seed_legacy_name (F4) | apply_alc_seed_rename.py | AudioLoopController has legacy seed/noise_seed input |
| iterations_autowired (F5) | apply_iterations_autowire.py | TensorLoopOpen.iterations_in not from AudioLoopPlanner.total_iterations |
| alc_widget_drift (F6) | apply_strip_alc_control_after_generate.py | AudioLoopController widgets_values has stale 6th 'randomize' entry |
| planner_no_stride_input (F7) | apply_planner_break_stride_cycle.py | AudioLoopPlanner has legacy stride_seconds input (closes a cycle) |
| dead_lora_loader_scaffolding_absent (F11) | apply_strip_dead_lora_loaders.py | Bypassed #1625/#1626/#1627 LoRA scaffolding nodes (inert UI clutter) still in canonical |
| iclora_video_reference_guide_in_loop_with_cropguides (F12a) | apply_iclora_video_reference.py | In-loop LTXAddVideoICLoRAGuide CONDITIONING outputs feed CFGGuider directly (must pass through LTXVCropGuides[NoLatent]) |
| iclora_loader_present_when_guide_present (F12b) | apply_iclora_video_reference.py | Subgraph has IC-LoRA guide but top-level has no LTXICLoRALoaderModelOnly |
| iclora_ref_video_preprocess_symmetry (F12c) | apply_iclora_video_reference.py | IC-LoRA guide present but no LTXVPreprocess(val=18) on the ref-video chain |
| model_sampling_shift (F13) | apply_strip_sd3_shift_node.py | ModelSamplingSD3 present and active on a distilled workflow (Lightricks's distilled inference applies no shift; the SD3 node distorts the sigma-to-timestep mapping). WARN-level. |
| trim_video_latent_to_audio_present + trim_image_batch_to_audio_present (F14, layered) | apply_trim_video_latent_to_audio.py + apply_trim_image_batch_to_audio.py | Loop workflows must have BOTH trims wired. Latent trim (pre-VAE-decode) snaps UP to the smallest LTX-valid count where decoded pixels ≥ int(audio*fps), which saves decode VRAM/time on overshoot frames. Image trim (post-decode) clips the 0-7 pixel-frame residue from snap-UP to exact audio length. Without the layered pair, ffmpeg -shortest either clips audio (when video < audio) or leaves silence at end (when video > audio and -c:v copy defeats -shortest). Reverted to layered architecture on 2026-05-10 after Option A (latent-only) caused user-reported audio clipping. Postmortem: internal/analysis/loop_audio_overshoot_analysis.md (private clone only). |
| run_id_layout_present (F15) | apply_run_id_layout.py | Loop workflow's VHS_VideoCombine.filename_prefix not fed by RunIdPrefix.video_prefix. WARN-level: without it, every render's mp4 + workflow-snapshot + audio-mux outputs spray flat with a global counter instead of clustering under <output>/<workflow_name>/<timestamp>/. Same apply script also adds a bypassed SaveLatent toggle wired from LatentConcat #1605 for the LoadLatent upscale path. User guide: guides/upscale_guide.md. |
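
The F14 layered-trim arithmetic can be worked through with a short sketch. It assumes that an "LTX-valid" pixel-frame count is one satisfying (L-1) % 8 == 0, matching the (L-1)%8 audit, and that one latent frame beyond the first covers 8 pixel frames; both the mapping and the function names are assumptions, not the apply scripts' code.

```python
def snap_up_pixel_frames(audio_seconds, fps):
    """Smallest pixel-frame count of the form 8k+1 that is >= int(audio*fps)."""
    target = int(audio_seconds * fps)
    k = max(0, -(-(target - 1) // 8))  # ceil((target - 1) / 8)
    return 8 * k + 1, target

def layered_trim(audio_seconds, fps):
    decoded, target = snap_up_pixel_frames(audio_seconds, fps)
    return {
        # Latent trim: keep this many latent frames before VAE decode
        # (assumed 8x temporal compression).
        "latent_frames": (decoded - 1) // 8 + 1,
        "decoded": decoded,
        "target": target,
        # Image trim: pixel frames clipped after decode (always 0..7).
        "residue": decoded - target,
    }

plan = layered_trim(10.0, 25)
```

For 10 s of audio at 25 fps the target is 250 frames; the latent trim keeps 33 latent frames (257 decoded pixel frames) and the image trim removes the remaining 7.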

Generic structural invariants (catch CLASSES of drift without per-bug rules):

| Check | Catches |
| --- | --- |
| graph_acyclic | Top-level dependency cycles. ComfyUI rejects with "Dependency cycle detected" before any node executes. |
| widget_shape | Stray randomize/fixed/increment/decrement strings in widgets_values of nodes that don't legitimately have a control_after_generate dropdown. Catches partial schema migrations. |
| link_integrity | Top-level link record vs node-level link references desync (slot out of range, source's outputs[].links doesn't list the link id, target's inputs[].link != id). Plus subgraph linkIds references to non-existent links. |
| layout_no_orphans | Non-Note node at pos=[0, 0]. Catches the silent failure mode where an apply script inserts a node and never runs a layout pass; the node lands at canvas origin and is hard to spot in a busy workflow. Allowlisted types: Note only. |
| (no audit; AST test) tests/test_node_schemas.py::test_keyframe_idxs_cleared_to_none_not_empty_list | conditioning_set_values({"keyframe_idxs": []}) literal-list assignments. KJNodes' OuterSampleCallbackWrapper crashes on empty-list keyframe_idxs. |
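
A graph_acyclic-style check can be sketched with Kahn's algorithm over the top-level link records. This is an illustrative version, not the audit's code: it assumes the six-field link layout [link_id, source_id, source_slot, target_id, target_slot, type], ignores subgraphs, and the name topo_order is hypothetical.

```python
from collections import defaultdict, deque

def topo_order(wf):
    """Return node ids in a valid execution order; raise ValueError on a cycle."""
    node_ids = [n["id"] for n in wf["nodes"]]
    indegree = {nid: 0 for nid in node_ids}
    downstream = defaultdict(list)
    for _link_id, src, _src_slot, dst, *_ in wf.get("links", []):
        downstream[src].append(dst)
        indegree[dst] += 1
    ready = deque(nid for nid in node_ids if indegree[nid] == 0)
    order = []
    while ready:
        nid = ready.popleft()
        order.append(nid)
        for nxt in downstream[nid]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    # Nodes that never reach indegree 0 sit on, or downstream of, a cycle.
    if len(order) != len(node_ids):
        stuck = sorted(set(node_ids) - set(order))
        raise ValueError(f"dependency cycle involving nodes {stuck}")
    return order

acyclic = {
    "nodes": [{"id": 1}, {"id": 2}, {"id": 3}],
    "links": [[10, 1, 0, 2, 0, "MODEL"], [11, 2, 0, 3, 0, "MODEL"]],
}
cyclic = {
    "nodes": acyclic["nodes"],
    "links": acyclic["links"] + [[12, 3, 0, 2, 1, "MODEL"]],
}
order = topo_order(acyclic)
```

The same traversal yields the topo-sorted execution order, so one pass can serve both cycle detection and ordering.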

Bake new topology constraints into audit_workflows.py. Every fix that ships an apply script should ship a matching audit check (an ERR carrying a "Run scripts/apply_X.py" remediation pointer). This prevents silent regression of fixes that a sibling branch might revert.
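
A paired audit check might look like the sketch below, modelled on the F4 row (AudioLoopController with a legacy seed/noise_seed input). The (level, message) tuple shape, the function names, and the exact message wording are assumptions made for illustration; only the node type, input names, and apply-script name come from the table above.

```python
def check_alc_seed_legacy_name(wf):
    """Return (level, message) findings for legacy seed inputs (hypothetical shape)."""
    findings = []
    for node in wf.get("nodes", []):
        if node.get("type") != "AudioLoopController":
            continue
        legacy = [
            i.get("name")
            for i in node.get("inputs") or []
            if i.get("name") in ("seed", "noise_seed")
        ]
        if legacy:
            findings.append((
                "ERR",
                f"node {node['id']}: legacy input {legacy[0]!r}. "
                "Run scripts/apply_alc_seed_rename.py",
            ))
    return findings

def exit_code(findings):
    """Exit 1 when any finding is an ERR, mirroring 'Exits 1 on ERR'."""
    return 1 if any(level == "ERR" for level, _ in findings) else 0

stale = {"nodes": [{"id": 4, "type": "AudioLoopController",
                    "inputs": [{"name": "noise_seed"}]}]}
clean = {"nodes": [{"id": 4, "type": "AudioLoopController",
                    "inputs": [{"name": "loop_seed"}]}]}  # input name is illustrative
stale_findings = check_alc_seed_legacy_name(stale)
```

Putting the remediation command in the message keeps the fix one copy-paste away from the failure.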


Apply scripts

Workflow migrations live in scripts/apply_*.py. Each script:

  • Default: mutates example_workflows/audio-loop-music-video_latent.json in place (accepts an optional path arg).
  • Idempotent: md5sum before + after re-run must match. Guard with if _is_already_built(wf): return to avoid burning last_node_id on strip-then-readd.
  • Has --revert that restores the pre-fix shape.
  • Has --dry-run that reports what WOULD change without writing. Pair with audit_workflows.py to verify a hypothetical state (HyDE).

Three-tier staging:

  1. internal/scratch/ — exploratory, gitignored.
  2. example_workflows/experimental/ — cross-machine reviewable; opt-in to audit via EXPERIMENTAL_AUDITED_FILES allowlist in audit_workflows.py.
  3. example_workflows/ — production, "ships AND stabilizes" per internal/PLAN.md (private clone only).

POCs that intentionally break a production invariant (e.g. F3 asymmetry) ship a paired audit check that dispatches on a node-title prefix and ERRs only if the rewire is damaged. Canonical TTC1 pair: apply_ttc_init_guide_amplification_poc.py + ttc1_init_guide_amplification.

Scratch-build apply scripts use WorkflowEditor.from_scratch(output_path) + add_top_level_node + add_link. from_scratch returns an empty-skeleton editor with a fresh uuid and reset last_node_id / last_link_id. Canonical: scripts/apply_spectrogram_iclora_minimal.py.

Shared apply-script helpers live in scripts/_helpers/_apply_helpers.py (add_link, find_node, remove_node_and_links, find_link_to_slot, next_id). Import with aliases to preserve call-site names; don't re-define inline.

Sweep orphan virtual GetNodes after fork-and-strip. A GetNode whose widgets_values[0] matches no live SetNode is orphaned; ComfyUI tolerates it at runtime but it clutters the graph. Add the ID to STRIP_IDS. A quick proxy is to list GetNodes whose first output has no outgoing links (wf is the parsed workflow JSON):

[n["id"] for n in wf["nodes"] if n["type"] == "GetNode"
 and not ((n.get("outputs") or [{}])[0].get("links") or [])]
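
The name-based definition (a GetNode whose widgets_values[0] matches no live SetNode) can also be checked directly. The sketch below is a minimal version that assumes both node types store the variable name in widgets_values[0], scans top-level nodes only, and treats every SetNode present as live; the function name is hypothetical.

```python
def orphan_get_nodes(wf):
    """Ids of GetNodes whose variable name has no matching SetNode."""
    def var_name(node):
        values = node.get("widgets_values") or [None]
        return values[0]

    live = {var_name(n) for n in wf["nodes"] if n["type"] == "SetNode"}
    return [
        n["id"]
        for n in wf["nodes"]
        if n["type"] == "GetNode" and var_name(n) not in live
    ]

wf = {"nodes": [
    {"id": 1, "type": "SetNode", "widgets_values": ["model"]},
    {"id": 2, "type": "GetNode", "widgets_values": ["model"]},
    {"id": 3, "type": "GetNode", "widgets_values": ["vae"]},  # no SetNode "vae"
]}
orphans = orphan_get_nodes(wf)
```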

Templates: scripts/templates/apply_script_all_workflows.py (in-place edits) and scripts/templates/apply_script_staged_variant.py (experimental staging). Both include the canonical --revert, --dry-run, idempotence, and require_nodes guards.

Selected staged-variant apply scripts (stage drafts under internal/workflows/; promote to example_workflows/experimental/ after A/B validation):

| Script | Stages to | What it does |
| --- | --- | --- |
| apply_lanczos_init_preprocess.py | loop_with_lanczos_preprocess.draft.json | Inserts a supersample-then-decimate ImageResizeKJv2 pair in front of the init-image resize. Targets residual aliasing on faces / fine textures when the source image is much larger than the schedule target dims. Idempotent, --revert, --dry-run. |
| apply_p3_retake_edit_lora.py | retake_edit.draft.json | Wires the section-targeted retake-edit pattern into a copy of the canonical retake workflow: LTXICLoRALoaderModelOnly (edit-anything LoRA: ADD / REMOVE / REPLACE / RESTYLE) into the MODEL chain, plus LTXVAddGuideMulti (strength=1, frame_idx=0) between LatentTemporalMask and SamplerCustomAdvanced. Existing positive CLIPTextEncode becomes the edit instruction. Idempotent, --revert, --dry-run. |

Workflow build scripts

Scratch-build new workflows from constants — distinct from apply scripts (which mutate or stage variants of an existing canonical workflow).

| Script | Builds | Topology |
| --- | --- | --- |
| scripts/build_keyframe_workflow.py | (per script) | Keyframe-schedule baseline. |
| scripts/build_upscale_workflow.py | internal/workflows/upscale_loop_output.draft.json | Post-loop spatial upscale: LoadLatent → LTXVLatentUpsampler (2×) → LTXVConcatAVLatent → SamplerCustomAdvanced (3-step σ-tail [0.85, 0.7250, 0.4219, 0.0], euler, CFG=1) → LTXVSeparateAVLatent → LTXVCropGuides → LTXVTiledVAEDecode → VHS_VideoCombine. Audio from LoadAudio of source mp3 (no mp4 needed). Empty audio latent sized via LatentFrameCount.pixel_frames. Model chain mirrors the canonical loop's perf/VRAM patches (UNETLoader → AudioLoopHelperSageAttention → LTXVChunkFeedForward → LTX2AttentionTunerPatch → CFGGuider). 27 nodes, 32 links; constants centralized at the top of the script. --dry-run, --revert. Pre-step: run loop with scripts/apply_save_assembled_latent.py applied so the assembled .latent file exists; move into ComfyUI's input dir. Chain apply_trim_video_latent_to_audio.py after re-building so the F14 latent-trim gets re-spliced. |

Build scripts share apply-script conventions for --dry-run / --revert / idempotence, but produce a deterministic file from constants rather than editing an existing one. Re-running overwrites with byte-identical output (constant node ids, deterministic link order).
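
The byte-identical property follows from building everything out of constants and serializing deterministically. The sketch below illustrates the idea with a two-node toy graph; the slot indices, the LATENT link type, and the serialization settings are assumptions, not the build scripts' actual constants.

```python
import hashlib
import json

# Constant node ids and link order (toy values for illustration).
NODE_SPECS = [(1, "LoadLatent"), (2, "LTXVLatentUpsampler")]
LINK_SPECS = [(1, 1, 0, 2, 0, "LATENT")]

def build():
    """Assemble the workflow dict purely from module-level constants."""
    return {
        "nodes": [{"id": nid, "type": node_type} for nid, node_type in NODE_SPECS],
        "links": [list(link) for link in LINK_SPECS],
        "last_node_id": max(nid for nid, _ in NODE_SPECS),
        "last_link_id": max(link[0] for link in LINK_SPECS),
    }

def serialize(wf):
    """Deterministic bytes: sorted keys, fixed indent, trailing newline."""
    return (json.dumps(wf, indent=2, sort_keys=True) + "\n").encode("utf-8")

first = hashlib.md5(serialize(build())).hexdigest()
second = hashlib.md5(serialize(build())).hexdigest()
```

Anything time- or random-derived (timestamps, fresh uuids, counters read from an existing file) would break this property, which is why the inputs are limited to constants here.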


Runtime telemetry

| Source | Default path | RUN_ID path | What it captures |
| --- | --- | --- | --- |
| Exec log (COMFYUI_EXEC_LOG=... env var enables) | internal/analysis/runs/exec_log/exec_<ts>.jsonl | data/runs/${RUN_ID}/exec.jsonl | Per-node start/end with class_type, inputs, duration |
| Sage trace (AUDIOLOOPHELPER_SAGE_TRACE=auto) | internal/analysis/runs/sage/sage_<ts>.jsonl | data/runs/${RUN_ID}/sage.jsonl | Per-(kernel, mask) timing for sage attention |
| Profiler | internal/analysis/runs/profiler/<ts>/ | data/runs/${RUN_ID}/profiler/ | torch profiler trace.json + summary.txt + memory_timeline.html |
| ComfyUI stdout | <comfyui>/user/comfyui_8188.log (and .prev.log, .prev2.log) | (same) | Validation errors, prompt accepted/rejected, exception tracebacks |
| Output mp4 | (raw output dir) | data/runs/${RUN_ID}/output.mp4 (symlink) | The rendered video |

AUDIOLOOPHELPER_SAGE_TRACE=auto is the default in start_experiment.sh at this repo's root. Plain <comfyui>/start.sh does NOT export it, so launch through start_experiment.sh when a trace is needed.
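
Once a sage trace exists, a per-(kernel, mask) median/p90/count summary can be computed from the JSONL lines. The sketch below is not sage_telemetry_summary.py: the record field names (kernel, mask, ms), the kernel and mask labels in the sample, and the nearest-rank percentile are all assumptions for illustration.

```python
import json
from collections import defaultdict
from statistics import median

def p90(values):
    """Nearest-rank 90th percentile."""
    ordered = sorted(values)
    rank = max(1, -(-9 * len(ordered) // 10))  # ceil(0.9 * n)
    return ordered[rank - 1]

def summarize(lines):
    """Group timings by (kernel, mask) and report median, p90, and count."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        # Field names below are hypothetical.
        groups[(rec["kernel"], rec["mask"])].append(rec["ms"])
    return {
        key: {"median": median(v), "p90": p90(v), "count": len(v)}
        for key, v in groups.items()
    }

sample = [
    '{"kernel": "kernel_a", "mask": "none", "ms": 1.0}',
    '{"kernel": "kernel_a", "mask": "none", "ms": 2.0}',
    '{"kernel": "kernel_a", "mask": "none", "ms": 9.0}',
    '{"kernel": "kernel_b", "mask": "bool", "ms": 4.0}',
]
summary = summarize(sample)
```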

docs/reference/telemetry_and_tracing.md covers what each tracer captures (and doesn't), retention, on/off semantics, and why prompt text can leak via the exec logger but not via the sage tracer.

docs/reference/environment.md holds the env-var registry and the single-helper-call-site DRY rule.


Artifact paths

Conventions (post-2026-04-26 RUN_ID propagation):

  • New per-render artifacts: data/runs/${RUN_ID}/<category>.<ext> via scripts/workflow_utils.py::run_artifact_path(category, ext).
  • Multi-file artifacts: data/runs/${RUN_ID}/<subdir>/ via run_artifact_dir(subdir).
  • Without RUN_ID: legacy fallback to internal/analysis/runs/ for most loggers. analyze_workflow_dag.py --save-run is the exception — always writes under data/ (data/runs/${RUN_ID}/dag_<slug>.<ext> with RUN_ID, data/runs/dag/dag_<slug>_<ts>.<ext> without).

The single env-var read site for RUN_ID is scripts/workflow_utils.py::_current_run_id. Route all reads through it.

Rendered mp4 lands at data/runs/${RUN_ID}/output.mp4 (symlink) once internal/autoresearch/harness.py::_locate_and_link_output_mp4 runs after poll_until_done. Source dir comes from COMFYUI_OUTPUT_DIR env var (no hardcoded paths). Filename constant lives at internal/autoresearch/metrics/__init__.py::OUTPUT_MP4_FILENAME — every video-content metric (subject_consistency, av_consistency, future style/aesthetic/palette) imports it; never inline "output.mp4" in a new metric module.


Canonical first-pass

When a workflow fails to run, work in this order before going inline:

  1. Tail the ComfyUI log for the most recent prompt:

    tail -200 <comfyui>/user/comfyui_8188.log
    

    Validation errors ("Dependency cycle detected", "Failed to convert ... to a INT value") fail before any node executes, and the log shows them. Exception tracebacks fire mid-run and point to the offending node.

  2. Run the audit:

    uv run --group dev python scripts/audit_workflows.py
    

    ERRs map 1:1 to remediation apply scripts. The generic checks (graph_acyclic, widget_shape, link_integrity, layout_no_orphans), plus the cond_metadata_types AST test, catch classes of drift.

  3. Inspect execution order if a topology change is suspected:

    uv run --group dev python scripts/analyze_workflow_dag.py \
      example_workflows/audio-loop-music-video_latent.json \
      --format ascii --save-run
    

    Diff the output against the same rendering of a known-good baseline.

  4. Trace the suspect node for source + wiring:

    uv run --group dev python scripts/trace_node_source.py <wf> <node-id> \
      --include-inputs
    
  5. Cross-reference the exec log for what actually ran most recently:

    ls -lt internal/analysis/runs/exec_log/ | head
    

    (Or data/runs/${RUN_ID}/exec.jsonl if the run had RUN_ID.)

Skipping step 1 wastes the most time — ComfyUI's log usually identifies the failure class within the first matching line.
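
For step 5, a few lines of Python can turn an exec log into "what finished, what did not". The sketch below assumes one JSON object per line with event, node_id, class_type, and duration fields; the table above only states that the log holds per-node start/end with class_type, inputs, and duration, so the exact field names are hypothetical.

```python
import json

def summarize_exec(lines):
    """Return nodes that started but never ended, plus durations, slowest first."""
    started = {}
    durations = []
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        if rec.get("event") == "start":
            started[rec["node_id"]] = rec["class_type"]
        elif rec.get("event") == "end":
            started.pop(rec["node_id"], None)
            durations.append((rec["duration"], rec["class_type"]))
    return {
        "unfinished": sorted(started.values()),
        "slowest": sorted(durations, reverse=True),
    }

log = [
    '{"event": "start", "node_id": 1, "class_type": "UNETLoader"}',
    '{"event": "end", "node_id": 1, "class_type": "UNETLoader", "duration": 2.5}',
    '{"event": "start", "node_id": 2, "class_type": "SamplerCustomAdvanced"}',
]
report = summarize_exec(log)
```

A node listed under "unfinished" is a natural place to start when the ComfyUI log shows a mid-run traceback.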

For iter-over-iter drift specifically, trace the CONDITIONING paths in parallel (initial vs loop). Asymmetries (missing LTXVConditioning, frame_rate mismatch, CLIP in subgraph) are load-bearing bugs.