[Backend] Ensure subtitle file cleanup on failure#664
Conversation
📝 WalkthroughWalkthroughA GSoC 2026 backend hardening proposal document was added outlining security and reliability improvements including file handling hardening, async task migration via Celery, and model consolidation. Additionally, the Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Warning |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/server.py`:
- Around line 560-570: The cleanup in the finally block (references:
latest_subtitle, glob.glob, os.remove, app.logger) must be made best-effort so
any exception during os.remove does not propagate and turn a successful response
into a 500; wrap each file deletion (both the single latest_subtitle branch and
the stray-files loop) in a try/except that catches Exception, logs the failure
(use app.logger.error or app.logger.exception with the file path and error), and
continues without re-raising so cleanup failures are non-fatal to the endpoint
response.
- Around line 542-570: Validate and sanitize video_id before using it in file
paths: ensure video_id matches an allowed pattern (e.g., only alphanumeric,
hyphen, underscore or the expected YouTube ID length) or produce a safe_filename
via re.sub to strip path separators and dangerous chars; then use that sanitized
id (e.g., safe_id) to construct exact paths instead of globbing all subtitles
(replace glob.glob("subtitles/*.vtt") with a direct check of
f"subtitles/{safe_id}.vtt" or create a per-request temp dir for subtitles named
with safe_id); update the subprocess.run call and the cleanup logic to reference
safe_id/latest_subtitle only, and when you must glob for stray files limit the
pattern to the sanitized prefix (e.g., f"subtitles/{safe_id}*.vtt") so no
unintended files can be read or deleted (apply changes around the subprocess.run
block, the latest_subtitle assignment, clean_transcript call, and the finally
cleanup).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5f29e64e-49f1-4554-894e-31f017d4fe94
📒 Files selected for processing (2)
GSOC_PROPOSAL_DRAFT.mdbackend/server.py
| transcript_text = None | ||
| latest_subtitle = None | ||
|
|
||
| # Find the latest .vtt file in the "subtitles" folder | ||
| subtitle_files = glob.glob("subtitles/*.vtt") | ||
| if not subtitle_files: | ||
| return jsonify({"error": "No subtitles found"}), 404 | ||
|
|
||
| latest_subtitle = max(subtitle_files, key=os.path.getctime) | ||
| transcript_text = clean_transcript(latest_subtitle) | ||
|
|
||
| # Optional: Clean up the file after reading | ||
| os.remove(latest_subtitle) | ||
|
|
||
| return jsonify({"transcript": transcript_text}) | ||
| try: | ||
| subprocess.run(["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download", | ||
| "--sub-format", "vtt", "-o", f"subtitles/{video_id}.vtt", f"https://www.youtube.com/watch?v={video_id}"], | ||
| check=True, capture_output=True, text=True) | ||
|
|
||
| # Find the latest .vtt file in the "subtitles" folder | ||
| subtitle_files = glob.glob("subtitles/*.vtt") | ||
| if not subtitle_files: | ||
| return jsonify({"error": "No subtitles found"}), 404 | ||
|
|
||
| latest_subtitle = max(subtitle_files, key=os.path.getctime) | ||
| transcript_text = clean_transcript(latest_subtitle) | ||
|
|
||
| return jsonify({"transcript": transcript_text}) | ||
|
|
||
| finally: | ||
| # Guarantee cleanup runs even if subprocess or extraction fails | ||
| if latest_subtitle and os.path.exists(latest_subtitle): | ||
| app.logger.info(f"Deleting temporary subtitle file: {latest_subtitle}") | ||
| os.remove(latest_subtitle) | ||
| elif not latest_subtitle: | ||
| # Fallback cleanup: If yt-dlp crashed halfway but still emitted partial files | ||
| for stray in glob.glob(f"subtitles/{video_id}*.vtt"): | ||
| if os.path.exists(stray): | ||
| app.logger.info(f"Deleting stray temporary subtitle file: {stray}") | ||
| os.remove(stray) |
There was a problem hiding this comment.
Validate and scope video_id before using it in file paths/globs.
Line 547 and Line 567 use raw video_id in filesystem patterns. This allows wildcard/path manipulation and can delete or read unintended files. It also contributes to cross-request mixups because Line 551 scans all subtitle files globally.
🔒 Proposed fix
`@app.route`('/getTranscript', methods=['GET'])
def get_transcript():
video_id = request.args.get('videoId')
if not video_id:
return jsonify({"error": "No video ID provided"}), 400
+ if not re.fullmatch(r"[A-Za-z0-9_-]{11}", video_id):
+ return jsonify({"error": "Invalid video ID"}), 400
transcript_text = None
latest_subtitle = None
try:
subprocess.run(["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download",
"--sub-format", "vtt", "-o", f"subtitles/{video_id}.vtt", f"https://www.youtube.com/watch?v={video_id}"],
check=True, capture_output=True, text=True)
- # Find the latest .vtt file in the "subtitles" folder
- subtitle_files = glob.glob("subtitles/*.vtt")
+ # Only consider files generated for this request/video
+ subtitle_files = glob.glob(f"subtitles/{video_id}*.vtt")
if not subtitle_files:
return jsonify({"error": "No subtitles found"}), 404🧰 Tools
🪛 Ruff (0.15.9)
[error] 546-546: subprocess call: check for execution of untrusted input
(S603)
[error] 546-547: Starting a process with a partial executable path
(S607)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/server.py` around lines 542 - 570, Validate and sanitize video_id
before using it in file paths: ensure video_id matches an allowed pattern (e.g.,
only alphanumeric, hyphen, underscore or the expected YouTube ID length) or
produce a safe_filename via re.sub to strip path separators and dangerous chars;
then use that sanitized id (e.g., safe_id) to construct exact paths instead of
globbing all subtitles (replace glob.glob("subtitles/*.vtt") with a direct check
of f"subtitles/{safe_id}.vtt" or create a per-request temp dir for subtitles
named with safe_id); update the subprocess.run call and the cleanup logic to
reference safe_id/latest_subtitle only, and when you must glob for stray files
limit the pattern to the sanitized prefix (e.g., f"subtitles/{safe_id}*.vtt") so
no unintended files can be read or deleted (apply changes around the
subprocess.run block, the latest_subtitle assignment, clean_transcript call, and
the finally cleanup).
| finally: | ||
| # Guarantee cleanup runs even if subprocess or extraction fails | ||
| if latest_subtitle and os.path.exists(latest_subtitle): | ||
| app.logger.info(f"Deleting temporary subtitle file: {latest_subtitle}") | ||
| os.remove(latest_subtitle) | ||
| elif not latest_subtitle: | ||
| # Fallback cleanup: If yt-dlp crashed halfway but still emitted partial files | ||
| for stray in glob.glob(f"subtitles/{video_id}*.vtt"): | ||
| if os.path.exists(stray): | ||
| app.logger.info(f"Deleting stray temporary subtitle file: {stray}") | ||
| os.remove(stray) |
There was a problem hiding this comment.
Do not let cleanup failures break the endpoint response.
If os.remove() raises in finally (permission/race/missing file), it can override normal responses with a 500. Cleanup should be best-effort and logged.
🛠️ Proposed fix
finally:
# Guarantee cleanup runs even if subprocess or extraction fails
if latest_subtitle and os.path.exists(latest_subtitle):
- app.logger.info(f"Deleting temporary subtitle file: {latest_subtitle}")
- os.remove(latest_subtitle)
+ try:
+ app.logger.info("Deleting temporary subtitle file: %s", os.path.basename(latest_subtitle))
+ os.remove(latest_subtitle)
+ except OSError:
+ app.logger.exception("Failed to delete temporary subtitle file")
elif not latest_subtitle:
# Fallback cleanup: If yt-dlp crashed halfway but still emitted partial files
for stray in glob.glob(f"subtitles/{video_id}*.vtt"):
if os.path.exists(stray):
- app.logger.info(f"Deleting stray temporary subtitle file: {stray}")
- os.remove(stray)
+ try:
+ app.logger.info("Deleting stray temporary subtitle file: %s", os.path.basename(stray))
+ os.remove(stray)
+ except OSError:
+ app.logger.exception("Failed to delete stray temporary subtitle file")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| finally: | |
| # Guarantee cleanup runs even if subprocess or extraction fails | |
| if latest_subtitle and os.path.exists(latest_subtitle): | |
| app.logger.info(f"Deleting temporary subtitle file: {latest_subtitle}") | |
| os.remove(latest_subtitle) | |
| elif not latest_subtitle: | |
| # Fallback cleanup: If yt-dlp crashed halfway but still emitted partial files | |
| for stray in glob.glob(f"subtitles/{video_id}*.vtt"): | |
| if os.path.exists(stray): | |
| app.logger.info(f"Deleting stray temporary subtitle file: {stray}") | |
| os.remove(stray) | |
| finally: | |
| # Guarantee cleanup runs even if subprocess or extraction fails | |
| if latest_subtitle and os.path.exists(latest_subtitle): | |
| try: | |
| app.logger.info("Deleting temporary subtitle file: %s", os.path.basename(latest_subtitle)) | |
| os.remove(latest_subtitle) | |
| except OSError: | |
| app.logger.exception("Failed to delete temporary subtitle file") | |
| elif not latest_subtitle: | |
| # Fallback cleanup: If yt-dlp crashed halfway but still emitted partial files | |
| for stray in glob.glob(f"subtitles/{video_id}*.vtt"): | |
| if os.path.exists(stray): | |
| try: | |
| app.logger.info("Deleting stray temporary subtitle file: %s", os.path.basename(stray)) | |
| os.remove(stray) | |
| except OSError: | |
| app.logger.exception("Failed to delete stray temporary subtitle file") |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/server.py` around lines 560 - 570, The cleanup in the finally block
(references: latest_subtitle, glob.glob, os.remove, app.logger) must be made
best-effort so any exception during os.remove does not propagate and turn a
successful response into a 500; wrap each file deletion (both the single
latest_subtitle branch and the stray-files loop) in a try/except that catches
Exception, logs the failure (use app.logger.error or app.logger.exception with
the file path and error), and continues without re-raising so cleanup failures
are non-fatal to the endpoint response.
Addressed Issues:
Fixes file lifecycle issue in subtitle processing
Screenshots/Recordings:
Not applicable (backend reliability fix)
Additional Notes:
Summary
This PR ensures that temporary subtitle files are always cleaned up, even if processing fails.
Previously, subtitle files were deleted only on successful execution. If an error occurred during transcript extraction or subprocess execution, temporary files could remain on disk.
Changes
Impact
This prevents:
AI Usage Disclosure:
I identified the issue and designed the solution through manual analysis. AI tools were used minimally to refine the implementation and review edge cases.
Checklist
Summary by CodeRabbit
Release Notes
Documentation
Bug Fixes