Skip to content

[Backend] Ensure subtitle file cleanup on failure#664

Open
SAI-HARISH2007 wants to merge 1 commit into
AOSSIE-Org:mainfrom
SAI-HARISH2007:fix-subtitle-cleanup
Open

[Backend] Ensure subtitle file cleanup on failure#664
SAI-HARISH2007 wants to merge 1 commit into
AOSSIE-Org:mainfrom
SAI-HARISH2007:fix-subtitle-cleanup

Conversation

@SAI-HARISH2007
Copy link
Copy Markdown

@SAI-HARISH2007 SAI-HARISH2007 commented Apr 13, 2026

Addressed Issues:

Fixes file lifecycle issue in subtitle processing


Screenshots/Recordings:

Not applicable (backend reliability fix)


Additional Notes:

Summary

This PR ensures that temporary subtitle files are always cleaned up, even if processing fails.

Previously, subtitle files were deleted only on successful execution. If an error occurred during transcript extraction or subprocess execution, temporary files could remain on disk.

Changes

  • Wrapped subprocess execution and transcript extraction in a try/finally block
  • Ensured cleanup runs regardless of success or failure
  • Added safety handling for missing subtitle files
  • Added fallback cleanup for partially created files
  • Added logging for cleanup operations

Impact

This prevents:

  • Storage leaks due to leftover subtitle files
  • Accumulation of temporary files over time
  • Potential exposure of intermediate data

AI Usage Disclosure:

  • This PR contains AI-assisted code. I have read the AI Usage Policy and this PR complies with it. I have tested the changes locally and take full responsibility.

I identified the issue and designed the solution through manual analysis. AI tools were used minimally to refine the implementation and review edge cases.


Checklist

  • My PR addresses a single issue
  • My code follows project conventions
  • Documentation updated (not applicable)
  • Tests added (can be added if required)
  • No new warnings/errors introduced
  • Will share PR on Discord
  • Read contribution guidelines
  • Will address CodeRabbit comments
  • Template filled correctly

Summary by CodeRabbit

Release Notes

  • Documentation

    • Added GSoC 2026 backend hardening and reliability proposal document.
  • Bug Fixes

    • Enhanced subtitle file handling and cleanup in the transcript extraction endpoint for improved reliability.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 13, 2026

📝 Walkthrough

Walkthrough

A GSoC 2026 backend hardening proposal document was added outlining security and reliability improvements including file handling hardening, async task migration via Celery, and model consolidation. Additionally, the /getTranscript endpoint was updated to implement guaranteed cleanup of temporary subtitle files using try/finally blocks with improved logging.

Changes

Cohort / File(s) Summary
GSoC Proposal Documentation
GSOC_PROPOSAL_DRAFT.md
New proposal document detailing a comprehensive backend hardening roadmap covering security hardening (path traversal, input validation, shell injection prevention), reliability improvements (temp-file cleanup, per-request directory isolation, exception logging), async migration with Celery/Redis, rate limiting, and centralized model management.
Subtitle Cleanup Enhancement
backend/server.py
Modified /getTranscript endpoint to wrap yt-dlp subtitle extraction and cleanup in a try/finally block, ensuring guaranteed deletion of temporary subtitle files. Added fallback cleanup logic for stray files and logging of deleted filenames via app.logger.info.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 The files now cleanup with care,
No stray subtitles left there!
With try and finally bound tight,
Security shines ever bright,
Backend hardening takes its flight! 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main change: ensuring subtitle file cleanup in the backend endpoint, particularly addressing cleanup on failure via try/finally logic.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Warning

⚠️ This pull request might be slop. It has been flagged by CodeRabbit slop detection and should be reviewed carefully.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/server.py`:
- Around line 560-570: The cleanup in the finally block (references:
latest_subtitle, glob.glob, os.remove, app.logger) must be made best-effort so
any exception during os.remove does not propagate and turn a successful response
into a 500; wrap each file deletion (both the single latest_subtitle branch and
the stray-files loop) in a try/except that catches Exception, logs the failure
(use app.logger.error or app.logger.exception with the file path and error), and
continues without re-raising so cleanup failures are non-fatal to the endpoint
response.
- Around line 542-570: Validate and sanitize video_id before using it in file
paths: ensure video_id matches an allowed pattern (e.g., only alphanumeric,
hyphen, underscore or the expected YouTube ID length) or produce a safe_filename
via re.sub to strip path separators and dangerous chars; then use that sanitized
id (e.g., safe_id) to construct exact paths instead of globbing all subtitles
(replace glob.glob("subtitles/*.vtt") with a direct check of
f"subtitles/{safe_id}.vtt" or create a per-request temp dir for subtitles named
with safe_id); update the subprocess.run call and the cleanup logic to reference
safe_id/latest_subtitle only, and when you must glob for stray files limit the
pattern to the sanitized prefix (e.g., f"subtitles/{safe_id}*.vtt") so no
unintended files can be read or deleted (apply changes around the subprocess.run
block, the latest_subtitle assignment, clean_transcript call, and the finally
cleanup).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5f29e64e-49f1-4554-894e-31f017d4fe94

📥 Commits

Reviewing files that changed from the base of the PR and between 2038116 and e589f97.

📒 Files selected for processing (2)
  • GSOC_PROPOSAL_DRAFT.md
  • backend/server.py

Comment thread backend/server.py
Comment on lines +542 to +570
transcript_text = None
latest_subtitle = None

# Find the latest .vtt file in the "subtitles" folder
subtitle_files = glob.glob("subtitles/*.vtt")
if not subtitle_files:
return jsonify({"error": "No subtitles found"}), 404

latest_subtitle = max(subtitle_files, key=os.path.getctime)
transcript_text = clean_transcript(latest_subtitle)

# Optional: Clean up the file after reading
os.remove(latest_subtitle)

return jsonify({"transcript": transcript_text})
try:
subprocess.run(["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download",
"--sub-format", "vtt", "-o", f"subtitles/{video_id}.vtt", f"https://www.youtube.com/watch?v={video_id}"],
check=True, capture_output=True, text=True)

# Find the latest .vtt file in the "subtitles" folder
subtitle_files = glob.glob("subtitles/*.vtt")
if not subtitle_files:
return jsonify({"error": "No subtitles found"}), 404

latest_subtitle = max(subtitle_files, key=os.path.getctime)
transcript_text = clean_transcript(latest_subtitle)

return jsonify({"transcript": transcript_text})

finally:
# Guarantee cleanup runs even if subprocess or extraction fails
if latest_subtitle and os.path.exists(latest_subtitle):
app.logger.info(f"Deleting temporary subtitle file: {latest_subtitle}")
os.remove(latest_subtitle)
elif not latest_subtitle:
# Fallback cleanup: If yt-dlp crashed halfway but still emitted partial files
for stray in glob.glob(f"subtitles/{video_id}*.vtt"):
if os.path.exists(stray):
app.logger.info(f"Deleting stray temporary subtitle file: {stray}")
os.remove(stray)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Validate and scope video_id before using it in file paths/globs.

Line 547 and Line 567 use raw video_id in filesystem patterns. This allows wildcard/path manipulation and can delete or read unintended files. It also contributes to cross-request mixups because Line 551 scans all subtitle files globally.

🔒 Proposed fix
 `@app.route`('/getTranscript', methods=['GET'])
 def get_transcript():
     video_id = request.args.get('videoId')
     if not video_id:
         return jsonify({"error": "No video ID provided"}), 400
+    if not re.fullmatch(r"[A-Za-z0-9_-]{11}", video_id):
+        return jsonify({"error": "Invalid video ID"}), 400
 
     transcript_text = None
     latest_subtitle = None
 
     try:
         subprocess.run(["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download",
                     "--sub-format", "vtt", "-o", f"subtitles/{video_id}.vtt", f"https://www.youtube.com/watch?v={video_id}"],
                    check=True, capture_output=True, text=True)
 
-        # Find the latest .vtt file in the "subtitles" folder
-        subtitle_files = glob.glob("subtitles/*.vtt")
+        # Only consider files generated for this request/video
+        subtitle_files = glob.glob(f"subtitles/{video_id}*.vtt")
         if not subtitle_files:
             return jsonify({"error": "No subtitles found"}), 404
🧰 Tools
🪛 Ruff (0.15.9)

[error] 546-546: subprocess call: check for execution of untrusted input

(S603)


[error] 546-547: Starting a process with a partial executable path

(S607)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 542 - 570, Validate and sanitize video_id
before using it in file paths: ensure video_id matches an allowed pattern (e.g.,
only alphanumeric, hyphen, underscore or the expected YouTube ID length) or
produce a safe_filename via re.sub to strip path separators and dangerous chars;
then use that sanitized id (e.g., safe_id) to construct exact paths instead of
globbing all subtitles (replace glob.glob("subtitles/*.vtt") with a direct check
of f"subtitles/{safe_id}.vtt" or create a per-request temp dir for subtitles
named with safe_id); update the subprocess.run call and the cleanup logic to
reference safe_id/latest_subtitle only, and when you must glob for stray files
limit the pattern to the sanitized prefix (e.g., f"subtitles/{safe_id}*.vtt") so
no unintended files can be read or deleted (apply changes around the
subprocess.run block, the latest_subtitle assignment, clean_transcript call, and
the finally cleanup).

Comment thread backend/server.py
Comment on lines +560 to +570
finally:
# Guarantee cleanup runs even if subprocess or extraction fails
if latest_subtitle and os.path.exists(latest_subtitle):
app.logger.info(f"Deleting temporary subtitle file: {latest_subtitle}")
os.remove(latest_subtitle)
elif not latest_subtitle:
# Fallback cleanup: If yt-dlp crashed halfway but still emitted partial files
for stray in glob.glob(f"subtitles/{video_id}*.vtt"):
if os.path.exists(stray):
app.logger.info(f"Deleting stray temporary subtitle file: {stray}")
os.remove(stray)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Do not let cleanup failures break the endpoint response.

If os.remove() raises in finally (permission/race/missing file), it can override normal responses with a 500. Cleanup should be best-effort and logged.

🛠️ Proposed fix
     finally:
         # Guarantee cleanup runs even if subprocess or extraction fails
         if latest_subtitle and os.path.exists(latest_subtitle):
-            app.logger.info(f"Deleting temporary subtitle file: {latest_subtitle}")
-            os.remove(latest_subtitle)
+            try:
+                app.logger.info("Deleting temporary subtitle file: %s", os.path.basename(latest_subtitle))
+                os.remove(latest_subtitle)
+            except OSError:
+                app.logger.exception("Failed to delete temporary subtitle file")
         elif not latest_subtitle:
             # Fallback cleanup: If yt-dlp crashed halfway but still emitted partial files
             for stray in glob.glob(f"subtitles/{video_id}*.vtt"):
                 if os.path.exists(stray):
-                    app.logger.info(f"Deleting stray temporary subtitle file: {stray}")
-                    os.remove(stray)
+                    try:
+                        app.logger.info("Deleting stray temporary subtitle file: %s", os.path.basename(stray))
+                        os.remove(stray)
+                    except OSError:
+                        app.logger.exception("Failed to delete stray temporary subtitle file")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
finally:
# Guarantee cleanup runs even if subprocess or extraction fails
if latest_subtitle and os.path.exists(latest_subtitle):
app.logger.info(f"Deleting temporary subtitle file: {latest_subtitle}")
os.remove(latest_subtitle)
elif not latest_subtitle:
# Fallback cleanup: If yt-dlp crashed halfway but still emitted partial files
for stray in glob.glob(f"subtitles/{video_id}*.vtt"):
if os.path.exists(stray):
app.logger.info(f"Deleting stray temporary subtitle file: {stray}")
os.remove(stray)
finally:
# Guarantee cleanup runs even if subprocess or extraction fails
if latest_subtitle and os.path.exists(latest_subtitle):
try:
app.logger.info("Deleting temporary subtitle file: %s", os.path.basename(latest_subtitle))
os.remove(latest_subtitle)
except OSError:
app.logger.exception("Failed to delete temporary subtitle file")
elif not latest_subtitle:
# Fallback cleanup: If yt-dlp crashed halfway but still emitted partial files
for stray in glob.glob(f"subtitles/{video_id}*.vtt"):
if os.path.exists(stray):
try:
app.logger.info("Deleting stray temporary subtitle file: %s", os.path.basename(stray))
os.remove(stray)
except OSError:
app.logger.exception("Failed to delete stray temporary subtitle file")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 560 - 570, The cleanup in the finally block
(references: latest_subtitle, glob.glob, os.remove, app.logger) must be made
best-effort so any exception during os.remove does not propagate and turn a
successful response into a 500; wrap each file deletion (both the single
latest_subtitle branch and the stray-files loop) in a try/except that catches
Exception, logs the failure (use app.logger.error or app.logger.exception with
the file path and error), and continues without re-raising so cleanup failures
are non-fatal to the endpoint response.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant