Conversation


@mpragnay mpragnay commented Jan 21, 2026

Async Render Support, to prevent busy-waiting on a render interval for too long (sometimes the 10-minute timeout is hit, which is a problem):

  • Async rendering mode, where there is no busy wait at a render step
  • The number of concurrent render processes is capped at num_workers
  • Especially helpful: average GPU_util stays good and SPS across intervals shows no real change, yet jobs complete faster. Example:

render_async=false, interval=50. https://wandb.ai/emerge_/async_render_support/runs/1sqs53pe/
render_async=true, interval=50. https://wandb.ai/emerge_/async_render_support/runs/pg03q1jo/

Both jobs run for 100M steps with the same configs; the second completes in 23 minutes, whereas the first completes in 41 minutes (roughly a 1.8x wall-clock speedup).

Conclusion for this mode:

  • Only use this mode if you have a small render interval and/or render many CARLA maps each time

@mpragnay mpragnay marked this pull request as ready for review January 28, 2026 22:16
Copilot AI review requested due to automatic review settings January 28, 2026 22:16

greptile-apps bot commented Jan 28, 2026

Greptile Overview

Greptile Summary

This PR adds async rendering support to prevent busy-waiting at render intervals, allowing render jobs to run in background processes limited by num_workers. The implementation creates a multiprocessing queue for async communication and spawns render processes that put results back into the queue for later logging.
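As a rough illustration of the pattern just described, here is a minimal sketch (not the PR's actual code; names such as _render_worker, spawn_render, and render_fn are hypothetical stand-ins):

```python
import multiprocessing as mp

def _render_worker(render_fn, bin_path_epoch, render_queue, global_step):
    # Hypothetical worker: render_fn stands in for the repo's actual render call.
    videos = render_fn(bin_path_epoch)
    render_queue.put({"videos": videos, "step": global_step})

def spawn_render(render_fn, render_processes, render_queue, config,
                 bin_path_epoch, global_step):
    # Drop handles of workers that have already finished.
    render_processes[:] = [p for p in render_processes if p.is_alive()]

    # Cap concurrent render processes at num_workers (a default guards a missing key).
    max_processes = config.get("num_workers", 1)
    while len(render_processes) >= max_processes:
        render_processes.pop(0).join()  # wait on the oldest worker instead of busy-waiting

    p = mp.Process(target=_render_worker,
                   args=(render_fn, bin_path_epoch, render_queue, global_step))
    p.start()
    render_processes.append(p)
```

Because the worker runs in a separate process, render_fn must be picklable (a top-level function), and its results come back through the queue rather than shared state.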

Key changes:

  • Added render_async config flag and multiprocessing infrastructure in pufferl.py
  • Modified render_videos() signature to support both sync and async modes with queue communication
  • Process management with proper cleanup in close() method
  • Queue checking in mean_and_log() to log completed async renders

Critical issues found:

  • Syntax error on line 660 - malformed comment causes entire code block to be commented out, breaking render process cleanup
  • File path bug - a double .mp4 extension is created for single-map renders (output_agent_{epoch}.mp4.mp4); see the sketch after this list
  • KeyError risk - config["render_async"] accessed without checking if key exists (though it's now in config file)
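The double-extension bug noted above typically comes from unconditionally appending ".mp4" to a name that may already carry the suffix; a small illustrative helper (hypothetical, not the repo's code) guards against it:

```python
import os

def with_mp4_suffix(path):
    # Append ".mp4" only when the path does not already end with it, so a name like
    # "output_agent_7.mp4" does not become "output_agent_7.mp4.mp4".
    _, ext = os.path.splitext(path)
    return path if ext == ".mp4" else path + ".mp4"

# with_mp4_suffix("output_agent_7.mp4") -> "output_agent_7.mp4"
# with_mp4_suffix("output_agent_7")     -> "output_agent_7.mp4"
```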

Additional concerns:

  • 10x geometry simplification threshold increase (25→250) should be verified for quality impact
  • Previous thread concerns about bin_path_epoch deletion timing remain valid - async processes may still need the file when it is deleted in the finally block (see the sketch below)
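One way to address the deletion-timing concern, sketched here as a hypothetical helper: in async mode, let each worker delete its own per-epoch weights copy once rendering finishes (as the sequence diagram below depicts), and only delete from the trainer when rendering synchronously:

```python
import os

def cleanup_epoch_weights(bin_path_epoch, render_async):
    # In async mode a background worker may still be reading this per-epoch copy,
    # so it should remove the file itself after rendering; only delete here
    # (e.g. in the trainer's finally block) when rendering synchronously.
    if not render_async and os.path.exists(bin_path_epoch):
        os.remove(bin_path_epoch)
```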

Confidence Score: 1/5

  • Critical syntax error will break render cleanup; should not merge until fixed
  • Line 660 has a malformed comment that turns the entire render process cleanup block into a comment, meaning async render processes will never be terminated properly. This is a critical bug that will cause resource leaks.
  • pufferlib/pufferl.py (line 660 syntax error), pufferlib/utils.py (double .mp4 extension bug)

Important Files Changed

| Filename | Overview |
| --- | --- |
| pufferlib/utils.py | Async rendering support added, with a file path bug causing a double .mp4 extension |
| pufferlib/pufferl.py | Async render process management added with queue communication |
| pufferlib/ocean/drive/drive.py | Fixed variable references and increased geometry simplification threshold |

Sequence Diagram

sequenceDiagram
    participant Main as Main Process
    participant PRL as PuffeRL
    participant RProc as Render Process
    participant Queue as Render Queue
    participant WandB as WandB Logger

    Main->>PRL: Initialize with config
    alt render_async enabled
        PRL->>Queue: Create multiprocessing.Queue()
        PRL->>PRL: Initialize render_processes list
    end

    loop Training Loop
        PRL->>PRL: Train and checkpoint
        alt At render_interval
            PRL->>PRL: Export model to bin_path
            PRL->>PRL: Copy bin_path → bin_path_epoch
            
            alt render_async mode
                PRL->>PRL: Clean up finished processes
                PRL->>PRL: Wait if >= num_workers processes
                PRL->>RProc: Start Process(render_videos, bin_path_epoch)
                RProc->>RProc: Generate videos (async)
                RProc->>Queue: Put {videos, step}
                RProc->>RProc: Delete bin_path_epoch
            else synchronous mode
                PRL->>PRL: Call render_videos(bin_path_epoch)
                PRL->>WandB: Log videos immediately
                PRL->>PRL: Delete bin_path_epoch
            end
        end
        
        PRL->>PRL: check_render_queue()
        alt render_async and queue not empty
            PRL->>Queue: Get result (non-blocking)
            PRL->>WandB: Log videos at render_step
        end
        
        PRL->>PRL: Continue training
    end

    Main->>PRL: close()
    alt render_async enabled
        PRL->>RProc: Terminate all processes
        PRL->>Queue: Close and join_thread()
    end
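As a rough sketch of the non-blocking queue check and the shutdown path shown in the diagram (hypothetical helper names; the PR's methods are check_render_queue() and close() according to the summary above):

```python
import queue  # queue.Empty is raised by Queue.get_nowait()

def drain_render_queue(render_queue, wandb_run):
    # Drain finished renders without blocking the training loop.
    while True:
        try:
            result = render_queue.get_nowait()
        except queue.Empty:
            break
        # Assumes result["videos"] is a dict of wandb.Video objects keyed by name.
        wandb_run.log(result["videos"], step=result["step"])

def close_renderers(render_processes, render_queue):
    # Stop any workers still running, then release queue resources.
    for p in render_processes:
        if p.is_alive():
            p.terminate()
        p.join()
    render_queue.close()
    render_queue.join_thread()
```

Note that wandb generally expects step values to be non-decreasing, so a late-arriving render logged at an older step may be warned about or dropped.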


@greptile-apps greptile-apps bot left a comment


3 files reviewed, 5 comments



Copilot AI left a comment


Pull request overview

This pull request adds asynchronous rendering support to prevent busy waiting during video generation. The main changes involve spawning render operations in separate processes using Python's multiprocessing module, allowing training to continue while videos are being rendered.

Changes:

  • Added async rendering mode with process-based parallelism to avoid blocking training during video generation
  • Refactored render_videos function signature to decouple from logger/vecenv dependencies
  • Fixed bug in drive.py where num_maps variable was used before being defined
  • Removed debug print statements from CARLA data generation utilities

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 15 comments.

Summary per file:

| File | Description |
| --- | --- |
| pufferlib/utils.py | Refactored render_videos function to support async mode with queue-based communication and updated video file handling |
| pufferlib/pufferl.py | Added multiprocessing infrastructure for async rendering, including queue management, process spawning, and result collection |
| pufferlib/ocean/drive/drive.py | Fixed bug using self.num_maps instead of the undefined num_maps variable; changed simplify_polyline tolerance |
| pufferlib/config/ocean/drive.ini | Updated configuration with async rendering settings and modified training parameters |
| data_utils/carla/generate_carla_agents.py | Removed debug print statements |
Comments suppressed due to low confidence (1)

pufferlib/utils.py:186

  • The docstring is outdated and still describes the old function signature with vecenv and logger parameters. It should be updated to reflect the new parameters: env_cfg, run_id, wandb_log, render_async, render_queue, and wandb_run. The current documentation will mislead users about how to call this function.
    """
    Generate and log training videos using C-based rendering.

    Args:
        config: Configuration dictionary containing data_dir, env, and render settings
        vecenv: Vectorized environment with driver_env attribute
        logger: Logger object with run_id and optional wandb attribute
        epoch: Current training epoch
        global_step: Current global training step
        bin_path: Path to the exported .bin model weights file

    Returns:
        None. Prints error messages if rendering fails.
    """


self.render_processes = [p for p in self.render_processes if p.is_alive()]

# Cap the number of processes to num_workers
max_processes = self.config.get("num_workers")

Copilot AI Jan 28, 2026


If num_workers is not set in the config, config.get("num_workers") will return None, and the comparison len(self.render_processes) >= max_processes will fail with a TypeError. Consider providing a default value such as config.get("num_workers", 1) or validating that num_workers is set when render_async is enabled.

Suggested change
max_processes = self.config.get("num_workers")
max_processes = self.config.get("num_workers", 1)

@mpragnay mpragnay marked this pull request as draft January 28, 2026 22:50
@mpragnay mpragnay marked this pull request as ready for review January 29, 2026 16:35

@greptile-apps greptile-apps bot left a comment


3 files reviewed, 5 comments


@mpragnay mpragnay force-pushed the Async_Rendering_Support branch from c16efbd to 710601a on January 29, 2026 18:20
@mpragnay mpragnay force-pushed the Async_Rendering_Support branch from 710601a to dd1abac on January 29, 2026 19:51