[Backend] Ensure subtitle file cleanup on failure by SAI-HARISH2007 · Pull Request #664 · AOSSIE-Org/EduAid

SAI-HARISH2007 · 2026-04-13T17:49:34Z

Addressed Issues:

Fixes file lifecycle issue in subtitle processing

Screenshots/Recordings:

Not applicable (backend reliability fix)

Additional Notes:

Summary

This PR ensures that temporary subtitle files are always cleaned up, even if processing fails.

Previously, subtitle files were deleted only on successful execution. If an error occurred during transcript extraction or subprocess execution, temporary files could remain on disk.

Changes

Wrapped subprocess execution and transcript extraction in a try/finally block
Ensured cleanup runs regardless of success or failure
Added safety handling for missing subtitle files
Added fallback cleanup for partially created files
Added logging for cleanup operations

Impact

This prevents:

Storage leaks due to leftover subtitle files
Accumulation of temporary files over time
Potential exposure of intermediate data

AI Usage Disclosure:

This PR contains AI-assisted code. I have read the AI Usage Policy and this PR complies with it. I have tested the changes locally and take full responsibility.

I identified the issue and designed the solution through manual analysis. AI tools were used minimally to refine the implementation and review edge cases.

Checklist

My PR addresses a single issue
My code follows project conventions
Documentation updated (not applicable)
Tests added (can be added if required)
No new warnings/errors introduced
Will share PR on Discord
Read contribution guidelines
Will address CodeRabbit comments
Template filled correctly

Summary by CodeRabbit

Release Notes

Documentation
- Added GSoC 2026 backend hardening and reliability proposal document.
Bug Fixes
- Enhanced subtitle file handling and cleanup in the transcript extraction endpoint for improved reliability.

coderabbitai · 2026-04-13T17:49:51Z

📝 Walkthrough

Walkthrough

A GSoC 2026 backend hardening proposal document was added outlining security and reliability improvements including file handling hardening, async task migration via Celery, and model consolidation. Additionally, the /getTranscript endpoint was updated to implement guaranteed cleanup of temporary subtitle files using try/finally blocks with improved logging.

Changes

Cohort / File(s)	Summary
GSoC Proposal Documentation `GSOC_PROPOSAL_DRAFT.md`	New proposal document detailing a comprehensive backend hardening roadmap covering security hardening (path traversal, input validation, shell injection prevention), reliability improvements (temp-file cleanup, per-request directory isolation, exception logging), async migration with Celery/Redis, rate limiting, and centralized model management.
Subtitle Cleanup Enhancement `backend/server.py`	Modified `/getTranscript` endpoint to wrap yt-dlp subtitle extraction and cleanup in a `try`/`finally` block, ensuring guaranteed deletion of temporary subtitle files. Added fallback cleanup logic for stray files and logging of deleted filenames via `app.logger.info`.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 The files now cleanup with care,
No stray subtitles left there!
With try and finally bound tight,
Security shines ever bright,
Backend hardening takes its flight! 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately reflects the main change: ensuring subtitle file cleanup in the backend endpoint, particularly addressing cleanup on failure via try/finally logic.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Warning

⚠️ This pull request might be slop. It has been flagged by CodeRabbit slop detection and should be reviewed carefully.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/server.py`:
- Around line 560-570: The cleanup in the finally block (references:
latest_subtitle, glob.glob, os.remove, app.logger) must be made best-effort so
any exception during os.remove does not propagate and turn a successful response
into a 500; wrap each file deletion (both the single latest_subtitle branch and
the stray-files loop) in a try/except that catches Exception, logs the failure
(use app.logger.error or app.logger.exception with the file path and error), and
continues without re-raising so cleanup failures are non-fatal to the endpoint
response.
- Around line 542-570: Validate and sanitize video_id before using it in file
paths: ensure video_id matches an allowed pattern (e.g., only alphanumeric,
hyphen, underscore or the expected YouTube ID length) or produce a safe_filename
via re.sub to strip path separators and dangerous chars; then use that sanitized
id (e.g., safe_id) to construct exact paths instead of globbing all subtitles
(replace glob.glob("subtitles/*.vtt") with a direct check of
f"subtitles/{safe_id}.vtt" or create a per-request temp dir for subtitles named
with safe_id); update the subprocess.run call and the cleanup logic to reference
safe_id/latest_subtitle only, and when you must glob for stray files limit the
pattern to the sanitized prefix (e.g., f"subtitles/{safe_id}*.vtt") so no
unintended files can be read or deleted (apply changes around the subprocess.run
block, the latest_subtitle assignment, clean_transcript call, and the finally
cleanup).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5f29e64e-49f1-4554-894e-31f017d4fe94

📥 Commits

Reviewing files that changed from the base of the PR and between 2038116 and e589f97.

📒 Files selected for processing (2)

GSOC_PROPOSAL_DRAFT.md
backend/server.py

coderabbitai · 2026-04-13T17:53:21Z

+    transcript_text = None
+    latest_subtitle = None

-    # Find the latest .vtt file in the "subtitles" folder
-    subtitle_files = glob.glob("subtitles/*.vtt")
-    if not subtitle_files:
-        return jsonify({"error": "No subtitles found"}), 404
-
-    latest_subtitle = max(subtitle_files, key=os.path.getctime)
-    transcript_text = clean_transcript(latest_subtitle)
-
-    # Optional: Clean up the file after reading
-    os.remove(latest_subtitle)
-
-    return jsonify({"transcript": transcript_text})
+    try:
+        subprocess.run(["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download",
+                    "--sub-format", "vtt", "-o", f"subtitles/{video_id}.vtt", f"https://www.youtube.com/watch?v={video_id}"],
+                   check=True, capture_output=True, text=True)
+
+        # Find the latest .vtt file in the "subtitles" folder
+        subtitle_files = glob.glob("subtitles/*.vtt")
+        if not subtitle_files:
+            return jsonify({"error": "No subtitles found"}), 404
+
+        latest_subtitle = max(subtitle_files, key=os.path.getctime)
+        transcript_text = clean_transcript(latest_subtitle)
+
+        return jsonify({"transcript": transcript_text})
+
+    finally:
+        # Guarantee cleanup runs even if subprocess or extraction fails
+        if latest_subtitle and os.path.exists(latest_subtitle):
+            app.logger.info(f"Deleting temporary subtitle file: {latest_subtitle}")
+            os.remove(latest_subtitle)
+        elif not latest_subtitle:
+            # Fallback cleanup: If yt-dlp crashed halfway but still emitted partial files
+            for stray in glob.glob(f"subtitles/{video_id}*.vtt"):
+                if os.path.exists(stray):
+                    app.logger.info(f"Deleting stray temporary subtitle file: {stray}")
+                    os.remove(stray)


⚠️ Potential issue | 🔴 Critical

Validate and scope video_id before using it in file paths/globs.

Line 547 and Line 567 use raw video_id in filesystem patterns. This allows wildcard/path manipulation and can delete or read unintended files. It also contributes to cross-request mixups because Line 551 scans all subtitle files globally.

🔒 Proposed fix

`@app.route`('/getTranscript', methods=['GET']) def get_transcript(): video_id = request.args.get('videoId') if not video_id: return jsonify({"error": "No video ID provided"}), 400 + if not re.fullmatch(r"[A-Za-z0-9_-]{11}", video_id): + return jsonify({"error": "Invalid video ID"}), 400 transcript_text = None latest_subtitle = None try: subprocess.run(["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download", "--sub-format", "vtt", "-o", f"subtitles/{video_id}.vtt", f"https://www.youtube.com/watch?v={video_id}"], check=True, capture_output=True, text=True) - # Find the latest .vtt file in the "subtitles" folder - subtitle_files = glob.glob("subtitles/*.vtt") + # Only consider files generated for this request/video + subtitle_files = glob.glob(f"subtitles/{video_id}*.vtt") if not subtitle_files: return jsonify({"error": "No subtitles found"}), 404

🧰 Tools

🪛 Ruff (0.15.9)

[error] 546-546: subprocess call: check for execution of untrusted input

(S603)

[error] 546-547: Starting a process with a partial executable path

(S607)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 542 - 570, Validate and sanitize video_id before using it in file paths: ensure video_id matches an allowed pattern (e.g., only alphanumeric, hyphen, underscore or the expected YouTube ID length) or produce a safe_filename via re.sub to strip path separators and dangerous chars; then use that sanitized id (e.g., safe_id) to construct exact paths instead of globbing all subtitles (replace glob.glob("subtitles/*.vtt") with a direct check of f"subtitles/{safe_id}.vtt" or create a per-request temp dir for subtitles named with safe_id); update the subprocess.run call and the cleanup logic to reference safe_id/latest_subtitle only, and when you must glob for stray files limit the pattern to the sanitized prefix (e.g., f"subtitles/{safe_id}*.vtt") so no unintended files can be read or deleted (apply changes around the subprocess.run block, the latest_subtitle assignment, clean_transcript call, and the finally cleanup).

coderabbitai · 2026-04-13T17:53:21Z

+    finally:
+        # Guarantee cleanup runs even if subprocess or extraction fails
+        if latest_subtitle and os.path.exists(latest_subtitle):
+            app.logger.info(f"Deleting temporary subtitle file: {latest_subtitle}")
+            os.remove(latest_subtitle)
+        elif not latest_subtitle:
+            # Fallback cleanup: If yt-dlp crashed halfway but still emitted partial files
+            for stray in glob.glob(f"subtitles/{video_id}*.vtt"):
+                if os.path.exists(stray):
+                    app.logger.info(f"Deleting stray temporary subtitle file: {stray}")
+                    os.remove(stray)


⚠️ Potential issue | 🟠 Major

Do not let cleanup failures break the endpoint response.

If os.remove() raises in finally (permission/race/missing file), it can override normal responses with a 500. Cleanup should be best-effort and logged.

🛠️ Proposed fix

finally: # Guarantee cleanup runs even if subprocess or extraction fails if latest_subtitle and os.path.exists(latest_subtitle): - app.logger.info(f"Deleting temporary subtitle file: {latest_subtitle}") - os.remove(latest_subtitle) + try: + app.logger.info("Deleting temporary subtitle file: %s", os.path.basename(latest_subtitle)) + os.remove(latest_subtitle) + except OSError: + app.logger.exception("Failed to delete temporary subtitle file") elif not latest_subtitle: # Fallback cleanup: If yt-dlp crashed halfway but still emitted partial files for stray in glob.glob(f"subtitles/{video_id}*.vtt"): if os.path.exists(stray): - app.logger.info(f"Deleting stray temporary subtitle file: {stray}") - os.remove(stray) + try: + app.logger.info("Deleting stray temporary subtitle file: %s", os.path.basename(stray)) + os.remove(stray) + except OSError: + app.logger.exception("Failed to delete stray temporary subtitle file")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

finally:

# Guarantee cleanup runs even if subprocess or extraction fails

if latest_subtitle and os.path.exists(latest_subtitle):

app.logger.info(f"Deleting temporary subtitle file: {latest_subtitle}")

os.remove(latest_subtitle)

elif not latest_subtitle:

# Fallback cleanup: If yt-dlp crashed halfway but still emitted partial files

for stray in glob.glob(f"subtitles/{video_id}*.vtt"):

if os.path.exists(stray):

app.logger.info(f"Deleting stray temporary subtitle file: {stray}")

os.remove(stray)

finally:

# Guarantee cleanup runs even if subprocess or extraction fails

if latest_subtitle and os.path.exists(latest_subtitle):

try:

app.logger.info("Deleting temporary subtitle file: %s", os.path.basename(latest_subtitle))

os.remove(latest_subtitle)

except OSError:

app.logger.exception("Failed to delete temporary subtitle file")

elif not latest_subtitle:

# Fallback cleanup: If yt-dlp crashed halfway but still emitted partial files

for stray in glob.glob(f"subtitles/{video_id}*.vtt"):

if os.path.exists(stray):

try:

app.logger.info("Deleting stray temporary subtitle file: %s", os.path.basename(stray))

os.remove(stray)

except OSError:

app.logger.exception("Failed to delete stray temporary subtitle file")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 560 - 570, The cleanup in the finally block (references: latest_subtitle, glob.glob, os.remove, app.logger) must be made best-effort so any exception during os.remove does not propagate and turn a successful response into a 500; wrap each file deletion (both the single latest_subtitle branch and the stray-files loop) in a try/except that catches Exception, logs the failure (use app.logger.error or app.logger.exception with the file path and error), and continues without re-raising so cleanup failures are non-fatal to the endpoint response.

fix: ensure subtitle file cleanup on failure

e589f97

coderabbitai Bot reviewed Apr 13, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Backend] Ensure subtitle file cleanup on failure#664

[Backend] Ensure subtitle file cleanup on failure#664
SAI-HARISH2007 wants to merge 1 commit into
AOSSIE-Org:mainfrom
SAI-HARISH2007:fix-subtitle-cleanup

SAI-HARISH2007 commented Apr 13, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 13, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 13, 2026

Uh oh!

coderabbitai Bot Apr 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

SAI-HARISH2007 commented Apr 13, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Addressed Issues:

Screenshots/Recordings:

Additional Notes:

Summary

Changes

Impact

AI Usage Disclosure:

Checklist

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

SAI-HARISH2007 commented Apr 13, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 13, 2026 •

edited

Loading