Recording pipeline robustness#1600
Closed
richiemcilroy wants to merge 28 commits into
Closed
Conversation
…-pause frame drops Previously the M4S encoder thread was started lazily on the first video frame. During multi-pause scenarios, each resume creates a new pipeline and the first frame would trigger encoder initialization (thread spawn, FFmpeg init, etc.), causing ~200-500ms latency and massive frame drops (up to 41% in benchmarks). Changes: - Start encoder eagerly in setup() for both screen and camera M4S muxers - Remove lazy start check from send_video_frame() - Increase default M4S buffer size from 60 to 120 frames (4s at 30fps) to better absorb frame bursts during segment transitions Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
…messages - Change pause_flag from Relaxed to Acquire ordering for proper visibility guarantees across threads - Use blocking send() for Pause/Resume control messages instead of try_send() to prevent them from being silently dropped when the channel is full - Increase studio mode MP4 buffer from 60 to 120 frames (4s at 30fps) to reduce frame drops under encoder pressure - Keep try_send for actual video frames (dropping a frame is acceptable, dropping a pause/resume message is not) Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Video encoder: - Increase MAX_RETRIES from 100 to 150 - Use exponential backoff: starts at 200µs, doubles every 20 retries, caps at 3ms (was fixed 500µs) - Total max wait time ~100ms vs old 50ms, better adapts to encoder load Audio encoder: - Increase MAX_RETRIES from 50 to 200 (audio should never be dropped) - Use exponential backoff: starts at 100µs, doubles every 25 retries, caps at 2ms - Add warning log when audio frames are dropped (was silent) - Total max wait ~120ms, prioritizing audio integrity over latency Camera encoder: - Same improvements as video encoder (150 retries, exponential backoff) Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
- Add 8-second timeout to Pipeline::stop() to prevent indefinite hangs when an encoder thread is stuck - Handle camera and microphone stop errors gracefully (log warning and continue) instead of failing the entire recording - Screen pipeline stop failure still propagates as a fatal error since the screen is the primary track - System audio errors already handled gracefully (unchanged) - This ensures recordings are saved even when one non-essential track has issues during finalization Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Timestamp monotonicity: - Add enforce_monotonicity() to TimestampAnomalyTracker that guarantees output timestamps are strictly non-decreasing - If a timestamp would go backward after all corrections, clamp it to last_valid + 1µs - This prevents corrupt muxer output from backward-jumping timestamps Audio gap tracker bounds: - Add MAX_TOTAL_SILENCE (30s) budget to prevent runaway silence insertion during long recordings with persistent audio issues - Stop inserting silence once budget is exhausted and log error - Use rate-limited logging: every 5s for first 100 insertions, then every 30s to avoid log spam in long problematic recordings - Remaining silence budget is tracked so partial insertions work correctly Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
- Check available disk space before creating each new segment - Refuse to create a new segment if disk space < 200MB (critical) - Log warning when disk space < 500MB - Cross-platform implementation: macOS (statvfs), Windows (GetDiskFreeSpaceExW) - Prevents recordings from filling disk completely and producing corrupt output files Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
- Refactor spawn_watcher to use cancel_all_others closure that cancels all pipeline tracks when the screen pipeline completes or fails - Screen failure cancels mic, camera, and system audio - Screen completion (normal finish) also cancels other tracks for cleanup - Error propagation to completion_tx remains unchanged (first error wins) - Simpler and more maintainable than previous approach Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
…ted segments In multi-pause scenarios, rapid pause/resume cycles would create segments with only a few frames because the pipeline barely had time to start producing data before being torn down. This resulted in: - Segment 2 having only 5 frames at 15fps (expected 60 frames) - Audio timing drifting by 500-1700ms cumulatively - Total recording duration of 3.78s instead of 6.0s The fix ensures each segment records for at least 500ms before a pause can take effect, giving all sources (screen, mic, camera) time to initialize and produce meaningful output. This matches the existing 1-second minimum on Stop. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Previously QueueFrameError::Failed immediately terminated the encoder thread and set a fatal error, losing the entire recording. This was too aggressive for transient failures. Now the encoder tolerates up to 10 QueueFrameError::Failed before declaring a fatal error. Each transient failure: - Skips the current frame (acceptable trade-off) - Logs a warning with the error count - Only terminates if errors exceed threshold WriterFailed and mutex poison remain immediately fatal since they indicate unrecoverable encoder state. Applied to both screen and camera MP4 encoders. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
The mic audio source typically takes 70-136ms to start producing frames after pipeline creation, while video starts immediately. This caused the audio track to be shorter than the video track by the same amount. Fix: start the AudioGapTracker's wall-clock measurement at task creation instead of at first-frame arrival. This allows the gap detector to notice the mic startup delay and insert silence to pad the beginning of the audio track, keeping audio and video durations aligned. Previous behavior: gap_tracker.mark_started() called on first frame New behavior: gap_tracker.mark_started() called at task creation This should reduce MP4 mic timing from 70-136ms to near 0ms, matching the fragmented mode's ~1ms performance. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
…ess fixes Document all 13 fixes implemented in this session: - Multi-pause frame drop storm (eager encoder start) - Multi-pause audio drift (minimum segment duration) - MP4 mic timing (startup silence insertion) - Buffer sizes (120 studio, 240 instant) - Encoder retry (exponential backoff, higher limits) - Pause/resume race (Acquire ordering, blocking sends) - Pipeline stop timeout (8s) - Encoder error tolerance (10 transient failures) - Disk space monitoring (studio mode) - Timestamp monotonicity - Audio silence budget (30s max) - Pipeline watcher improvements - Graceful track failure handling Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Required for GetDiskFreeSpaceExW used in Studio mode disk space checks on Windows. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
|
Cursor Agent can help with this pull request. Just |
Add 'pause-resume' subcommand to synthetic-test-runner that tests OutputPipeline's pause/resume mechanism with three scenarios: - Single pause: 3s record + 2s pause + 3s record - Triple pause: 4x 2s record with 3x 1s pauses - Rapid pause: 3x 1s record with 2x 500ms pauses Each test creates an MP4 pipeline with synthetic video+audio, exercises pause/resume, then validates that the output duration matches expected recording time (within 2s tolerance). Usage: cargo run -p cap-recording --example synthetic-test-runner -- pause-resume --keep-outputs Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
When an instant recording crashes, the MP4 file may be left without a valid moov atom, making it unplayable. RecoveryManager::try_recover_instant() now attempts to: 1. Detect failed/in-progress instant recordings 2. Check if the output.mp4 has decodable frames 3. If not directly decodable, attempt repair via ffmpeg remux 4. If repair produces decodable frames, replace the original 5. Update metadata to Complete with detected fps This allows partial instant recordings to be recovered after app crashes, preserving whatever video data was written before the crash. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
When the app starts and finds instant recordings in InProgress state (indicating a crash), attempt recovery using RecoveryManager before marking them as Failed. If the MP4 file has decodable frames, it will be repaired and marked as Complete, preserving the user's recording. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
…ry and tests Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
…er bounds New tests for enforce_monotonicity: - Clamps backward timestamps to last_valid + epsilon - Allows forward timestamps through unchanged - Works correctly with no previous timestamp - Equal timestamps pass through unchanged New tests for AudioGapTracker: - Silence budget exhaustion stops all insertions - Remaining budget limits individual insertion size - No gap detected when sample time exceeds wall clock - Gap below threshold (wired/wireless) not detected - Wireless has higher detection threshold than wired - Individual gaps capped to MAX_SILENCE_INSERTION Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
…tudio modes Previously OutputPipeline::stop() waited indefinitely for done_fut, which could hang if an encoder thread was stuck. This affected both instant and studio recording modes. Changes: - Add 10-second timeout to done_fut await during stop - Add 2-second timeout to first_timestamp_rx await - On timeout, proceed with best-effort finalization and log warning - Fallback first_timestamp if receiver times out - This protects instant mode (which lacked any stop timeout) and provides a lower-level safety net for studio mode in addition to the Pipeline-level 8s timeout already in place Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
… muxer Apply the same robustness improvements to the FFmpeg SegmentedVideoMuxer (used in synthetic tests and as a cross-platform fallback): - Start encoder eagerly in setup() instead of lazily on first frame - Remove lazy start check from send_video_frame() - Increase default buffer from 30 to 120 frames for consistency Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Tests for RecoveryManager::try_recover_instant(): - No output.mp4 file → returns false (not recoverable) - Output.mp4 too small (<1KB) → returns false - Complete recording → returns false (skips recovery) - Corrupt data → returns false (not decodable) These tests verify the recovery logic without requiring real video data or hardware encoders. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
The SegmentedVideoEncoder had a disk_space_callback field that was never actually invoked. Now it checks disk space at every segment boundary (every 3 seconds by default) with a 10-second rate limit. Changes to crates/enc-ffmpeg/src/mux/segmented_stream.rs: - Add check_disk_space() called from on_segment_boundary() - Add get_available_disk_space_mb() for Unix platforms - Log error when disk < 200MB (critical), warn when < 500MB - Invoke the disk_space_callback if set, allowing callers to take action (e.g., stop recording gracefully) - Add libc dependency for Unix statvfs call This ensures continuous fragmented recordings (no pauses) get periodic disk space monitoring, not just at segment creation. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
New tests for RecoveryManager: - find_incomplete with InProgress status and display.mp4 - find_incomplete with M4S manifest + init segment + fragments - Incomplete fragments in manifest are skipped (only complete counted) - File size mismatch in manifest causes fragment to be rejected - NeedsRemux status detected as incomplete for recovery These tests verify the recovery system correctly analyzes fragmented studio recordings with manifests, init segments, and various completeness states. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
When OutputPipeline::stop() times out waiting for first_timestamp_rx (meaning no frames were ever produced), the fallback timestamp was Instant::now() — the stop time. This would cause signed_duration_since to compute start_time as the full recording duration, producing incorrect metadata. Fix: store creation_instant at pipeline build time and use it as the fallback. This makes the fallback start_time compute to ~0s, which is correct for a track that started at pipeline creation time. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Changed 'attempting repair' to 'marking as recovered' when the MP4 file is already playable and doesn't need repair. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Improve recording pipeline robustness and stability, especially for Studio Mode multi-pause scenarios, by addressing encoder startup, buffering, error handling, and timing issues.
The "Multiple Pause Catastrophe" was a critical issue where Studio Mode recordings suffered severe frame drops (up to 41%) and audio drift (up to 1700ms) after multiple pauses. This was primarily due to encoders starting lazily on resume, causing significant latency and truncation of short segments, combined with insufficient buffering and race conditions. This PR addresses these root causes by eagerly initializing encoders, enforcing minimum segment durations on pause, increasing buffer sizes, and improving error handling.