Async rendering support #269
base: 3.0
Conversation
Greptile Overview

Greptile Summary

This PR adds async rendering support to prevent busy-waiting at render intervals, allowing render jobs to run in background processes limited by num_workers.

Key changes:
Critical issues found:
Additional concerns:
Confidence Score: 1/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
participant Main as Main Process
participant PRL as PuffeRL
participant RProc as Render Process
participant Queue as Render Queue
participant WandB as WandB Logger
Main->>PRL: Initialize with config
alt render_async enabled
PRL->>Queue: Create multiprocessing.Queue()
PRL->>PRL: Initialize render_processes list
end
loop Training Loop
PRL->>PRL: Train and checkpoint
alt At render_interval
PRL->>PRL: Export model to bin_path
PRL->>PRL: Copy bin_path → bin_path_epoch
alt render_async mode
PRL->>PRL: Clean up finished processes
PRL->>PRL: Wait if >= num_workers processes
PRL->>RProc: Start Process(render_videos, bin_path_epoch)
RProc->>RProc: Generate videos (async)
RProc->>Queue: Put {videos, step}
RProc->>RProc: Delete bin_path_epoch
else synchronous mode
PRL->>PRL: Call render_videos(bin_path_epoch)
PRL->>WandB: Log videos immediately
PRL->>PRL: Delete bin_path_epoch
end
end
PRL->>PRL: check_render_queue()
alt render_async and queue not empty
PRL->>Queue: Get result (non-blocking)
PRL->>WandB: Log videos at render_step
end
PRL->>PRL: Continue training
end
Main->>PRL: close()
alt render_async enabled
PRL->>RProc: Terminate all processes
PRL->>Queue: Close and join_thread()
end
```
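For readers unfamiliar with the flow above, here is a minimal trainer-side sketch of what the diagram describes. The attribute names (self.render_processes, self.render_queue, self.bin_path, self.wandb_run) and the exact render_videos call signature are assumptions for illustration; the actual code in pufferlib/pufferl.py may differ.

```python
import multiprocessing
import os
import queue
import shutil


def maybe_render(self, epoch, global_step):
    """At a render interval: export weights, then render synchronously or in a worker."""
    bin_path_epoch = f"{self.bin_path}.{epoch}"  # per-epoch copy so training can keep overwriting bin_path
    shutil.copy(self.bin_path, bin_path_epoch)

    if self.config["render_async"]:
        # Drop handles to finished workers, then cap concurrency at num_workers
        self.render_processes = [p for p in self.render_processes if p.is_alive()]
        max_processes = self.config.get("num_workers", 1)
        while len(self.render_processes) >= max_processes:
            self.render_processes[0].join()  # wait on the oldest worker before spawning another
            self.render_processes = [p for p in self.render_processes if p.is_alive()]

        proc = multiprocessing.Process(
            target=render_videos,  # refactored to not need logger/vecenv
            args=(self.env_cfg, self.run_id, bin_path_epoch, global_step),
            kwargs=dict(render_async=True, render_queue=self.render_queue),
        )
        proc.start()
        self.render_processes.append(proc)
    else:
        # Synchronous mode: render, log, and clean up inline (this blocks training)
        render_videos(self.env_cfg, self.run_id, bin_path_epoch, global_step,
                      wandb_log=True, wandb_run=self.wandb_run)
        os.remove(bin_path_epoch)


def check_render_queue(self):
    """Drain any finished async renders without blocking the training loop."""
    while True:
        try:
            result = self.render_queue.get_nowait()
        except queue.Empty:
            return
        self.wandb_run.log(result["videos"], step=result["step"])


def close(self):
    """Tear down async rendering state at the end of training."""
    if self.config["render_async"]:
        for proc in self.render_processes:
            proc.terminate()
        self.render_queue.close()
        self.render_queue.join_thread()
```

The cap on live processes mirrors the diagram's "Wait if >= num_workers" step, bounding how many render workers can run alongside training at any time.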
3 files reviewed, 5 comments
Pull request overview
This pull request adds asynchronous rendering support to prevent busy waiting during video generation. The main changes involve spawning render operations in separate processes using Python's multiprocessing module, allowing training to continue while videos are being rendered.
Changes:
- Added async rendering mode with process-based parallelism to avoid blocking training during video generation
- Refactored the render_videos function signature to decouple it from logger/vecenv dependencies
- Fixed a bug in drive.py where the num_maps variable was used before being defined
- Removed debug print statements from the CARLA data generation utilities
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| pufferlib/utils.py | Refactored render_videos function to support async mode with queue-based communication and updated video file handling |
| pufferlib/pufferl.py | Added multiprocessing infrastructure for async rendering including queue management, process spawning, and result collection |
| pufferlib/ocean/drive/drive.py | Fixed bug using self.num_maps instead of undefined num_maps variable; changed simplify_polyline tolerance |
| pufferlib/config/ocean/drive.ini | Updated configuration with async rendering settings and modified training parameters |
| data_utils/carla/generate_carla_agents.py | Removed debug print statements |
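As a rough sketch of the queue-based communication the table mentions for pufferlib/utils.py: the worker renders from the per-epoch weight copy and either pushes the result through the queue (async) or logs it directly (sync). The helper generate_videos_from_weights and the exact parameter set are assumptions, not the PR's actual code.

```python
import os


def render_videos(env_cfg, run_id, bin_path_epoch, global_step,
                  wandb_log=True, render_async=False, render_queue=None, wandb_run=None):
    """Render videos from exported .bin weights and hand the results back to the trainer."""
    # run_id would typically be used to name and group the output videos
    videos = generate_videos_from_weights(env_cfg, bin_path_epoch)  # hypothetical C-rendering call

    if render_async:
        # Background process: return results via the queue; the trainer logs them later
        render_queue.put({"videos": videos, "step": global_step})
        os.remove(bin_path_epoch)  # the worker owns cleanup of the per-epoch copy in async mode
    elif wandb_log and wandb_run is not None:
        # Synchronous path: log immediately on the caller's WandB run
        wandb_run.log(videos, step=global_step)
```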
Comments suppressed due to low confidence (1)
pufferlib/utils.py:186
- The docstring is outdated and still describes the old function signature with vecenv and logger parameters. It should be updated to reflect the new parameters: env_cfg, run_id, wandb_log, render_async, render_queue, and wandb_run. The current documentation will mislead users about how to call this function.
"""
Generate and log training videos using C-based rendering.
Args:
config: Configuration dictionary containing data_dir, env, and render settings
vecenv: Vectorized environment with driver_env attribute
logger: Logger object with run_id and optional wandb attribute
epoch: Current training epoch
global_step: Current global training step
bin_path: Path to the exported .bin model weights file
Returns:
None. Prints error messages if rendering fails.
"""
pufferlib/pufferl.py (Outdated)
```python
self.render_processes = [p for p in self.render_processes if p.is_alive()]

# Cap the number of processes to num_workers
max_processes = self.config.get("num_workers")
```
Copilot AI · Jan 28, 2026
If num_workers is not set in the config, config.get("num_workers") will return None, and the comparison len(self.render_processes) >= max_processes will fail with a TypeError. Consider providing a default value such as config.get("num_workers", 1) or validating that num_workers is set when render_async is enabled.
Suggested change:

```diff
- max_processes = self.config.get("num_workers")
+ max_processes = self.config.get("num_workers", 1)
```
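If validating up front is preferred over a silent default, a sketch along the lines of the reviewer's second suggestion might look like this (the config keys and attribute access are assumptions):

```python
# Fail fast at startup if async rendering is requested without a worker cap
if self.config.get("render_async"):
    max_processes = self.config.get("num_workers")
    if max_processes is None:
        raise ValueError("num_workers must be set when render_async is enabled")
```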
Force-pushed from c16efbd to 710601a
…k. Added render_queue for async rendering and logging videos
Force-pushed from 710601a to dd1abac
Async render support prevents busy-waiting on a render interval for too long (sometimes the 10-minute timeout is hit, which is bad):
render_async=false, interval=50. https://wandb.ai/emerge_/async_render_support/runs/1sqs53pe/
render_async=true, interval=50. https://wandb.ai/emerge_/async_render_support/runs/pg03q1jo/
Both jobs run for 100M steps with the same configs; the second completes in 23 minutes, whereas the first completes in 41 minutes.
Conclusion for this mode: